Question about the Fulltext search!


Truong An Nguyen

25 Nov 2011

5:30 a.m.

Hi,

I've two questions about the fulltext search.

1) I've created a full-text index for my [...]. BaseX doesn't understand ftcontains. When I try a CONTAINS query in a 1-gigabyte database, using the text index is slower than without the text index. How can I use the full-text index correctly?

2) Does BaseX support regular expressions in the full-text extension? If yes, could you please give me an example?

Thanks

Cheers, An


Dimitar Popov

25 Nov

9:31 a.m.

Hi An,

On Friday, 25 November 2011, 11:30:00, Truong An Nguyen wrote:

...

Hi,

I've two questions about the fulltext search.

I've created a full-text index for my [...]. BaseX doesn't understand ftcontains.

When I try a CONTAINS query in a 1-gigabyte database, using the text index is slower than without the text index. How can I use the full-text index correctly?

You can do several things:

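1. Make sure that the full-text index is used, i.e. check the "query info" in the GUI; it should contain "FTIndexAccess", similar to:

    <FTIndexAccess data="factbook">
      <FTWords>
        <Item value="norway" type="xs:string"/>
      </FTWords>
    </FTIndexAccess>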

2a. If the full-text index is used, please send more details about your query and data (e.g. what full-text options are used); it would be interesting to see why the index query is slower.

2b. If the full-text index is NOT used, please check that the full-text options you use in your query correspond to the options with which the full-text index was created. For more information, check our wiki page [1].

...

Does BaseX support regular expressions in the full-text extension? If yes, could you please give me an example?

No, full-blown regular expressions are not supported by XQuery Full-Text. However, wildcards are supported. For the corresponding syntax and examples, you can check the XQuery Full-Text specification [2].

...

Thanks

Cheers, An

Greetings, Dimitar

[1] http://docs.basex.org/wiki/Full-Text

[2] http://www.w3.org/TR/xpath-full-text-10/#ftwildcardoption
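As a rough illustration of the wildcard support mentioned above, the following sketch uses the "using wildcards" match option from the XQuery Full-Text specification. The sample strings are made up for this example; the first two comparisons should match, while the third shows the same search term without the option.

```xquery
(: Wildcards have to be switched on explicitly with "using wildcards". :)
(
  (: ".*" stands for zero or more characters :)
  "full-text database" contains text "data.*" using wildcards,
  (: "." stands for exactly one character :)
  "full-text database" contains text "databas." using wildcards,
  (: without the option, "." and "*" carry no wildcard meaning :)
  "full-text database" contains text "data.*"
)
```

The other wildcard forms listed in the specification (".+", ".?", ".{n,m}") follow the same pattern.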

Truong An Nguyen

10:45 a.m.

Hi Dimitar,

thank you for your answer. I will check the performance of the fulltext search again and give you the test data if it is really slower.

I still have one question about the full-text query. Can I find a whole word (the result of the regular expression \bWord\b) with a full-text query?

Many thanks.


Christian Grün

1:06 p.m.

Hi An,

...

I've created a full-text index for my [...]. BaseX doesn't understand ftcontains.

Just to avoid any confusion: the W3C XML Query WG has changed the "ftcontains" keyword to "contains text" due to conflicts with the existing specification of element constructors [1]. This is also the reason why two keywords are required for all updating expressions, e.g. "insert node", "delete node", etc.

...

I still have one question about the fulltext query. Can I find a whole word (the result of the regular expression: \bWord\b) with fulltext query?

This is actually the default behavior of XQuery Full Text, as shown in the following examples:

"a b" contains text "a" → true

"ab c" contains text "a" → false

On behalf of Dimitri, Christian

[1] http://www.w3.org/Bugs/Public/show_bug.cgi?id=7247


BaseX-Talk mailing list BaseX-Talk@mailman.uni-konstanz.de https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk

Pascal Heus

1:31 p.m.

Christian and all: I have a similar question. When I execute the following query:

    declare namespace ddi = "http://www.icpsr.umich.edu/DDI";
    let $allvars := /ddi:codeBook/ddi:dataDscr/ddi:var
    let $vars := $allvars[ddi:labl contains text 'child']
    return $vars/ddi:labl

my query plan does not seem to take advantage of the full-text index (see below). I enabled full-text search only after loading the XML into the DB, but I then performed an optimize and can see FT active in the properties dialog. Does this make a difference?

best *P

Query: declare namespace ddi = "http://www.icpsr.umich.edu/DDI";

Timing:
- Parsing: 0.14 ms
- Compiling: 0.15 ms
- Evaluating: 1450.27 ms
- Printing: 24.39 ms
- Total Time: 1474.97 ms

Result:
- Results: 3114 Items
- Updated: 0 Items
- Printed: 308 KB

Query plan:

    <FLWR>
      <Let var="$allvars">
        <IterPath>
          <Root/>
          <IterStep axis="child" test="ddi:codeBook"/>
          <IterStep axis="child" test="ddi:dataDscr"/>
          <IterStep axis="child" test="ddi:var"/>
        </IterPath>
      </Let>
      <Let var="$vars">
        <IterFilter>
          <VarRef name="$allvars"/>
          <FTContains>
            <AxisPath>
              <IterStep axis="child" test="ddi:labl"/>
            </AxisPath>
            <FTWords>
              <Item value="child" type="xs:string"/>
            </FTWords>
          </FTContains>
        </IterFilter>
      </Let>
      <Return>
        <AxisPath>
          <VarRef name="$vars"/>
          <IterStep axis="child" test="ddi:labl"/>
        </AxisPath>
      </Return>
    </FLWR>


Christian Grün

1:53 p.m.

Hi Pascal,

it often helps to simplify the query and re-check the info view. For example, you could check whether one of the following queries benefits from the index:

[1] let $vars := /ddi:codeBook/ddi:dataDscr/ddi:var[ddi:labl contains text 'child'] return $vars/ddi:labl

[2] let $vars := /*:codeBook/*:dataDscr/*:var[*:labl contains text 'child'] return $vars/*:labl

[3] /*:codeBook/*:dataDscr/*:var[*:labl contains text 'child']/*:labl

Hope this helps, Christian


Pascal Heus

2:08 p.m.

Christian: Thanks. Yes, this triggers full-text search. I often break my queries into multiple variables though, particularly complex ones. What would be a general design rule for leveraging FT indexes? *P


Christian Grün

6:08 p.m.

...

Thanks. Yes, this triggers full-text search. I often break my queries into multiple variables though, particularly complex ones. What would be a general design rule for leveraging FT indexes?

It's difficult to give general advice here, as the language is just too complex; I'm frequently surprised how many alternatives exist to answer a single question with XQuery. However, if the query compiler doesn't manage to optimize your query for index access, you can always directly access the index structures by using our built-in XQuery functions db:fulltext(), db:attribute(), db:text(), ft:search(), etc. [1,2].

Christian

[1] http://docs.basex.org/wiki/Full-Text_Module

[2] http://docs.basex.org/wiki/Database_Module#db:fulltext
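A minimal sketch of such a direct index access, applied to the ddi:labl query from earlier in the thread. It assumes the two-argument form ft:search(database name, search terms) found in recent BaseX versions; the signature has changed between releases, so check the Full-Text Module page [1] for the version in use. The database name 'ddi-db' is hypothetical.

```xquery
declare namespace ddi = "http://www.icpsr.umich.edu/DDI";

(: ft:search returns the text nodes in which the full-text index finds the term.
   'ddi-db' is a placeholder for the real database name. :)
for $text in ft:search('ddi-db', 'child')
(: step up from the matching text node to its ddi:labl element,
   keeping only labels that sit directly under a ddi:var :)
return $text/parent::ddi:labl[parent::ddi:var]
```

Because the index is addressed explicitly, the result does not depend on how the surrounding query is split into variables. Note that the function matches text nodes, which may differ slightly from applying "contains text" to the string value of an element with mixed content.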


Pascal Heus

26 Nov

10:30 p.m.

Christian: Thanks for this. I have a question related to OR queries.

The following leverages full-text indexes:

    declare namespace ddi = "http://www.icpsr.umich.edu/DDI";
    let $text := 'education'
    let $vars := /ddi:codeBook/ddi:dataDscr/ddi:var[ddi:labl contains text {$text} and ddi:qstn contains text {$text}]
    return $vars/ddi:labl

But if I change the condition to an OR instead of AND:

    declare namespace ddi = "http://www.icpsr.umich.edu/DDI";
    let $text := 'education'
    let $vars := /ddi:codeBook/ddi:dataDscr/ddi:var[ddi:labl contains text {$text} or ddi:qstn contains text {$text}]
    return $vars/ddi:labl

the query no longer benefits from FT indexes.

I tried to resolve this by running it as:

    let $varsWithLabel := /ddi:codeBook/ddi:dataDscr/ddi:var[ddi:labl contains text {$text}]
    let $varsWithQuestion := /ddi:codeBook/ddi:dataDscr/ddi:var[ddi:qstn contains text {$text}]
    let $vars := $varsWithLabel intersect $varsWithQuestion
    return $vars/ddi:labl

but the intersect method seems to be quite time-consuming.

Any suggestions? Or could the query optimizer resolve this one to take advantage of FT? This is a fairly common use case for us (I actually need multiple OR conditions).

thanks *P

PS: btw, let me know if you want a test database for these queries. The one I typically use is around 500Gb so would be great if you have an FTP server I could upload to.
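For reference, a sketch of how the OR condition can be written as a set operation. In XQuery, "or" across two predicates selects the union of the two node sets, whereas "intersect" corresponds to "and". Whether the optimizer rewrites each branch for index access is an assumption here and should be verified in the query info; the paths are the ones used in the queries above.

```xquery
declare namespace ddi = "http://www.icpsr.umich.edu/DDI";

let $text := 'education'
(: each branch is a complete path with a single full-text predicate;
   "union" merges both results in document order without duplicates :)
let $vars :=
  /ddi:codeBook/ddi:dataDscr/ddi:var[ddi:labl contains text { $text }]
  union
  /ddi:codeBook/ddi:dataDscr/ddi:var[ddi:qstn contains text { $text }]
return $vars/ddi:labl
```

Further OR conditions can be appended as additional "union" branches.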


Christian Grün

10:54 p.m.

Hi Pascal,

...

The following leverages full text indexes: [...] But if I change the condition to an OR instead of AND: [...] the query no longer benefits from FT indexes

This is indeed surprising, as AND/OR shouldn't make a difference. It might be, however, that the AND expression is split into several predicates, and one of the resulting predicates is then optimized.

You could try to add an explicit .../text() location step to the paths:

...[ddi:labl/text() contains text {$text} or ddi:qstn/text() contains text {$text}]

If that doesn't help either, feel free to provide us with some small sample data.

...

PS: btw, let me know if you want a test database for these queries. The one I typically use is around 500Gb so would be great if you have an FTP server I could upload to.

Sounds interesting; I may get back to you soon. Christian

Pascal Heus

11:11 p.m.

Christian: Oops, you're right. Some of the qstn elements have a qstnLit child. Using ddi:qstn/text() solved the problem, as did using ddi:qstn/ddi:qstnLit.

Now an interesting one is that if I use a different database that contains ddi:var/ddi:labl elements but does not contain any ddi:qstn/ddi:qstnLit elements, then FT is not used for the OR query. It is if I use AND (and of course returns an empty list as there is no match).

Many thanks as always for your assistance.

best *P


Truong An Nguyen

27 Nov

11:33 a.m.

Hi Dimitar,

I have compared the function "contains" (without the full-text index) and "contains text" using an 800 MB XML database. You can download the test data here: http://dl.dropbox.com/u/22427941/RootSearch.rar .

1) Using a simple query:

+ declare default element namespace "http://iso.org/OTX";
  for $pro in collection()//specification[text() contains text "Specification"]
  return $pro

+ declare default element namespace "http://iso.org/OTX";
  for $pro in collection()//specification[contains(text(), "Specification")]
  return $pro

The query with "contains text" ran a little bit faster than the query without the full-text index.

2) Using a complex query:

    declare default element namespace "http://iso.org/OTX";

    for $pro in collection()/otx/procedures/procedure
    return
      for $hd in $pro/realisation/flow//handler
      where exists($hd/@*[contains(data(.),"Variable1")])
         or exists($hd/realisation/catch/exception//@*[contains(data(.),"Variable1")])
         or $hd/specification contains text "Specification"
         (: or exists ($hd/specification[contains(data(.),"Specification")] ):)
      return concat(data($pro/../../@package),":",data($pro/../../@name),":",data($pro/@name),":","handler",":",$hd/@id)

The variant with "contains text" ran much slower than the variant with "contains".

The indexes are used: path, text index, attribute index, full-text index (without any options)

Thanks for helping me.

Cheers, An


Dimitar Popov

2:49 p.m.

Hi An,

thank you for the provided data and sample query. Please see my comments below.

On Sunday, 27 November 2011, 17:33:00, Truong An Nguyen wrote:

...


The variant with "contains text" ran much slower than the variant with "contains".

Hm, on my computer the difference is not huge (1307.42 ms for fn:contains() vs. 1446.64 ms for "contains text"), but, yes, "slow" is a relative term :)

Anyway, the difference is due to the fact that, while fn:contains() does a simple substring search, "contains text" offers more advanced options such as case insensitivity, stemming, stop words, etc. Thus, when the full-text index is not used, there is some more processing of both the query string and the matched string, which results in slower performance.

...

The indexes are used: path, text index, attribute index, full-text index (without any options)

With the provided query, the full-text index is not used. The reason for this is that BaseX does not index the string values of attributes, i.e. only text nodes are indexed.

I don't know what the query should do, but please note the different behavior of fn:contains() and contains text. Just a quick example:

fn:contains('GlobalDocumentVariable1_String', 'Variable1') -> true

'GlobalDocumentVariable1_String' contains text 'Variable1' -> false

Further, one small optimization would be to remove the data() function call in the predicates, i.e.

$hd/realisation/catch/exception//@*[contains(.,"Variable1")]

is enough.

I hope this helps.

Greetings, Dimitar
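The difference between the two operators can be explored with a small sketch like the one below. The first two expressions are the ones from the example above. The ft:tokenize() call is taken from the BaseX Full-Text Module and is assumed to be available in the version in use; it lists the tokens that the full-text processor derives from the value. The wildcard pattern in the last expression is only an illustration of how a partial term might be matched, and its result depends on how the value is tokenized.

```xquery
let $value := 'GlobalDocumentVariable1_String'
return (
  (: substring test: looks for the character sequence anywhere in the string :)
  contains($value, 'Variable1'),
  (: token test: the search term has to match a complete token :)
  $value contains text 'Variable1',
  (: shows the tokens that "contains text" compares against (BaseX-specific) :)
  ft:tokenize($value),
  (: wildcards allow a token to be matched by a partial pattern :)
  $value contains text '.*variable1.*' using wildcards
)
```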

Truong An Nguyen

3:57 p.m.

Hi Dimitar,

handler/specification is a text node, not an attribute. That is why I used full-text search only for $hd/specification.

I don't understand why the full-text index is not used here.

Greetings, An


Dimitar Popov

4:30 p.m.

On Sunday, 27 November 2011, 21:57:41, Truong An Nguyen wrote:

...

Hi Dimitar,

handler/specification is a text node, not an attribute. That is why I used full-text search only for $hd/specification.

I don't understand why the full-text index is not used here.

I'm not sure whether it is possible to rewrite the query to use index access, because of the nested for-loop and the "or" conditions. Someone more knowledgeable about the compiler optimizations than me could probably give you more details.

Greetings, Dimitar

Christian Grün

5:26 p.m.

Dear Truong,

in your scenario, the contains text expression is not much faster than the fn:contains() function because the returned result set is very large. As an example, please compare the following two queries:

//*:specification[contains(text(), 'Variable4')]

//*:specification[text() contains text 'variable4']

In other words: accessing the index and checking the nodes' parents takes pretty much the same time as sequentially scanning all texts of the <specification/> elements. Next, please note that, in many cases, "contains text" and "contains()" are not equivalent and will return different results (but you may be aware of this anyway).

In the last query you discussed, I assume that the query optimizer cannot benefit from the full text index due to the OR operator. As an alternative, you can directly access the index as follows:

db:fulltext('truong', 'Specification')/parent::*:specification

Hope this helps, Christian

BaseX-Talk mailing list BaseX-Talk@mailman.uni-konstanz.de https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk

Pascal Heus

10:51 p.m.

Dimitar: I noticed you mentioned that BaseX does not full-text index attributes. Is this something that could be added as an indexing option? The two core metadata standards I work with store names and identification information in element attributes, and I was hoping to leverage FT search for quick lookups.

Otherwise, are attribute values always indexed? For example, if I need to look for a unique key like <element urn='some-unique-urn-string-value'>, would I get an instant match? What about composite keys like <element id='id1234' version='1.0.0' agency='myagency'>?

best
*P


Dimitar Popov

28 Nov

9:26 a.m.

Hi Pascal,

I meant that attribute values are not indexed by the full-text index. However, there is a separate (non-full-text) index which contains only the attribute values and can be used to speed up attribute value queries with the equality operator ("=", not "eq"!). Thus, the index will be used when searching for unique ids.

When searching on several attributes, I guess you'll use something like @id1 = 'x' and @id2 = 'y'. In this case, BaseX will use the index for only one of the attributes, and the other predicate will be evaluated iteratively. This is the common approach in most database systems.

If you use "or", e.g. @id1 = 'x' or @id2 = 'y' then both predicates will be evaluated using the index.
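As a rough sketch, the lookups from the question could be written as follows. The attribute names and values are taken from Pascal's examples; the wrapper element <root> and the second id1234 entry are made up. The fragment is built in memory only to keep the query self-contained, so no index is involved here; against a database with an attribute index, the same predicates are the kind that may be rewritten to index access.

```xquery
(: Sample data: attribute names and values come from the question; :)
(: the <root> wrapper and the second id1234 entry are hypothetical. :)
let $doc :=
  <root>
    <element urn="some-unique-urn-string-value"/>
    <element id="id1234" version="1.0.0" agency="myagency"/>
    <element id="id1234" version="1.1.0" agency="myagency"/>
  </root>
return (
  (: unique key: general comparison "=", the form mentioned above :)
  $doc//element[@urn = 'some-unique-urn-string-value'],
  (: composite key: predicates combined with "and" :)
  $doc//element[@id = 'id1234' and @version = '1.0.0' and @agency = 'myagency']
)
```

In a real setup, $doc would be replaced by the database root, and the "query info" output can be used to check whether an index access was actually chosen.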

For more details about different index types, please, check our wiki page [1].

Regards, Dimitar

[1] http://docs.basex.org/wiki/Indexes


Pascal Heus

10:33 a.m.

Dimitar: Thanks for the clarification. Exact matches would solve some use cases, but I would welcome full-text search on attributes, as I often need to perform partial matches. I assume this is a fairly common need. A few examples are illustrated below. Is this something that could potentially be supported (as an FT indexing option)?

best
Pascal

<Foo id="" version="1.0">
<Foo id="" version="1.1">
<Foo id="" version="2.0">
--> Search for all Foo under version 1.*

<Book author="John Doe">
<Book author="Jane Doe">
--> Search by author name

<variable name="xyz_1">
<variable name="xyz_2">
--> Search all variables that start with "xyz"

<a href="http://www.example.org/home">
<a href="http://www.basex.org">
<a href="http://www.example.org/acbout">
--> Find all links pointing to example.org


Dimitar Popov

11:47 a.m.

Hi Pascal,

...

Dimitar: Thanks for the clarification. Exact matches would solve some use cases but I would however welcome full text search on attributes as I often need to perform partial matches. I assume this is a fairly common need.

The XQuery Full-Text specification [1] has been designed for searching keywords and phrases in large text corpora, not for substring or pattern matching. A central concept of full-text search is tokenization, i.e. splitting both the searched and the matched text into tokens. This is why, although full-text search can be used to a certain extent for inexact string matching, the results may not be what one expects.

I know that inexact matching is relatively common, but I'm afraid I'm not aware of a DBMS with a general-purpose index structure that can speed up pattern matching, apart from the classical case of prefix matching (e.g. SQL queries with LIKE 'abc%' conditions).

Regarding your concrete examples:

...

<Foo id="" version="1.0">
<Foo id="" version="1.1">
<Foo id="" version="2.0">
--> Search for all Foo under version 1.*

This will not work because of tokenization: consider the case of a version which looks like this "2.1.2" - it will be matched by the full-text search, although it's not what you want.

...

<Book author="John Doe">
<Book author="Jane Doe">
--> Search by author name

It's safe to use full-text search in this case.

...

<variable name="xyz_1">
<variable name="xyz_2">
--> Search all variables that start with "xyz"

Same as with the version example: e.g. "1_xyz_2" will be matched as well, and the specification offers no way to denote the beginning of the string (i.e. full-text search != regex matching).

...

<a href="http://www.example.org/home">
<a href="http://www.basex.org">
<a href="http://www.example.org/acbout">
--> Find all links pointing to example.org

You can't use full-text search in this case, because "example.org" will also match, for example, "example/org". Of course, if you are willing to take the risk of false matches, you can.
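For the prefix and substring cases, the standard string functions express the intended conditions directly. The following is only a sketch: the sample values come from the examples in this thread plus the two counter-examples mentioned above ("2.1.2" and "1_xyz_2"), the <data> wrapper is made up, and nothing here implies that these predicates are accelerated by an index.

```xquery
(: Sample values from the thread; the <data> wrapper is hypothetical. :)
let $data :=
  <data>
    <Foo id="" version="1.0"/>
    <Foo id="" version="1.1"/>
    <Foo id="" version="2.0"/>
    <Foo id="" version="2.1.2"/>
    <variable name="xyz_1"/>
    <variable name="xyz_2"/>
    <variable name="1_xyz_2"/>
    <a href="http://www.example.org/home"/>
    <a href="http://www.basex.org"/>
  </data>
return (
  (: versions 1.*: a prefix test, so "2.1.2" is not selected :)
  $data/Foo[starts-with(@version, '1.')],
  (: names that start with "xyz": "1_xyz_2" is not selected :)
  $data/variable[starts-with(@name, 'xyz')],
  (: links containing the literal string "example.org" :)
  $data/a[contains(@href, 'example.org')]
)
```

These functions compare characters rather than tokens, which avoids the false matches described above, at the price of scanning the candidate attribute values.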

I hope my comments will be useful and that I've convinced you that ft search is not what you need :)

Regards, Dimitar

[1] http://www.w3.org/TR/xpath-full-text-10/

Pascal Heus

6:49 p.m.

Dimitar: Thanks for this extensive explanation, most interesting. Very much appreciated. best Pascal

