full-text search with wildcard
Hello all,

We are using BaseX 7.0.2. When using wildcards in full-text search, we ran into some problems that seem to be related to tokenization. Our database contains these entries:

bb
(aa)bb
bb(cc)
(aa)bb(cc)

We ran the following tests, with the results shown for each case:

1- .//value[text() contains text {'.*(bb)'} using wildcards]
   returned (aa)bb and (aa)bb(cc)

2- .//value[text() contains text {'.(bb).*'} using wildcards]
   returned bb(cc) and (aa)bb(cc)

3- .//value[text() contains text {'(bb)'} using wildcards]
   returned (aa)bb and (aa)bb(cc) and bb(cc) and bb

So far so good, but the following case is the weird one:

4- .//value[text() contains text {'.*(bb).*'} using wildcards]
   returned only (aa)bb(cc)

Can anyone explain why the behavior of the last case is different? It should be the most general case, yet it turns out to be the most exclusive one. Are we missing something, or is it a bug?
Dear Shakila, thanks for your mail and all details.
4- .//value[text() contains text {'.*(bb).*'} using wildcards] returning only (aa)bb(cc)
This is indeed the correct answer, and it can be explained by the general process of how full-text expressions are evaluated: both the input and the query terms are fully "tokenized", i.e., split into several tokens. All non-token characters (in this case the parentheses) are interpreted as "separators", which means that your query is equivalent to:

.//value[text() contains text { '.* bb .*' } using wildcards]

As a result, we have three tokens, ".*", "bb" and ".*", which require at least three words in the input text to yield a result. For instance, the following query returns "false" and "true":

'X bb' contains text '.* bb .*' using wildcards,
'X bb X' contains text '.* bb .*' using wildcards

If you need to search for special characters such as parentheses, you'll probably have to resort to the XQuery functions fn:contains() or fn:matches(). What you can do as well: you may first want to use "contains text" to speed up your query and then refine the results, as shown here:

for $v in .//value[text() contains text { '.* bb .*' } using wildcards]
return $v[matches(text(), "(bb)")]

Note, however, that full-text queries that start with a wildcard will not be evaluated by the index, which means that a single fn:matches() call may be faster anyway.

Hope this helps,
Christian
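To make the token-counting argument concrete, here is a rough sketch that counts the tokens in each of the four entries from the original question. The function local:tokens is hypothetical and only approximates tokenization by splitting on every run of characters that are neither letters nor digits; it is an assumption for illustration and not BaseX's actual tokenizer.

```xquery
(: Rough approximation of full-text tokenization: split on every run of
   characters that are neither letters nor digits. This is NOT BaseX's
   real tokenizer; it only illustrates the counting argument. :)
declare function local:tokens($s as xs:string) as xs:string* {
  (: tokenize() yields empty strings when the input starts or ends
     with a separator, so those are filtered out :)
  tokenize($s, '[^\p{L}\p{N}]+')[. ne '']
};

(: The four entries from the original question :)
for $v in ('bb', '(aa)bb', 'bb(cc)', '(aa)bb(cc)')
let $t := local:tokens($v)
return concat($v, ' -> ', count($t), ' token(s): ', string-join($t, ' | '))
```

Under this approximation, only (aa)bb(cc) splits into three tokens (aa, bb, cc), which is consistent with it being the only entry that can satisfy the three query tokens ".*", "bb" and ".*".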
participants (2)
- Christian Grün
- Shakila Shayan