Christian, I will second your description of this logic as “nonintuitive”. It seems to have been driven (on the W3C's part) more by efficiency concerns than by usability. Would it be possible to create a custom index structure in BaseX that would get around this limitation? If yes, as you seem to suggest below, can this be done dynamically? I had difficulty following the example in [2].
Thanks, Ron
On February 2, 2016 at 2:34:35 PM, Christian Grün (christian.gruen@gmail.com) wrote:
Any idea why?
Yes – See one of my previous replies ;) In a nutshell: In the first query, stopwords will be dropped. In the second one, they will only be ignored (“Tokens matched by stop words retain their position numbers […]” [1]):
"A B C" contains text "A C" using stop words ("B") → false
"A B C" contains text "A B C" using stop words ("B") → true
It may not be the most intuitive decision that has been taken back then by the designers of the spec, but… Les jeux sont faits.
In some projects, we’ve decided to work with custom index structures [2]. It’s some more work, but it will give you complete freedom on what tokens you want to store.
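As a rough illustration of the idea (not the approach documented in [2]), a custom index can be as simple as an XML structure whose keys are built from exactly the tokens you want to keep. The sketch below assumes BaseX (for ft:tokenize); the stop word list, the element and attribute names, and the lookup key are all made up for this example.

```xquery
(: Hypothetical stop word list; in practice it could be read from a file. :)
let $stop := ('and', 'inc')
(: Sample input from this thread; &amp; is the escaped ampersand. :)
let $names := (
  'Frontier Science &amp; Technology Research Foundation, Inc.',
  'Frontier Science and Technology Research Foundation, Inc.'
)
(: Build one entry per name; the key holds only the tokens that are kept. :)
let $index :=
  <index>{
    for $name at $pos in $names
    let $key := string-join(ft:tokenize($name)[not(. = $stop)], ' ')
    return <entry id="{ $pos }" key="{ $key }">{ $name }</entry>
  }</index>
(: Look up all names that share the same normalized key as the first name. :)
let $lookup := string-join(ft:tokenize($names[1])[not(. = $stop)], ' ')
return $index/entry[@key = $lookup]
```

Since the keys are computed by the query itself, the structure can be rebuilt whenever the data or the stop word list changes. Storing it in a database of its own should allow the regular attribute index to speed up the lookups.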
Hope this helps, Christian
[1] https://www.w3.org/TR/xpath-full-text-10/#ftstopwordoption
[2] http://docs.basex.org/wiki/Indexes#Custom_Index_Structures
On Tue, Feb 2, 2016 at 6:56 PM, Ron Katriel rkatriel@mdsol.com wrote:
Thanks, Christian. You are right about the tokenization of ampersands. However, I still see unexpected behavior with the built-in stop words.
- This works (using your clever stop word workaround, slightly modified with string-join):
let $sw := map:merge(
  for $sw in file:read-text-lines('stopwords.txt')
  return map { $sw : true() }
)
let $t1 := 'Frontier Science &amp; Technology Research Foundation, Inc.'
let $t2 := 'Frontier Science and Technology Research Foundation, Inc.'
let $q1 := string-join(ft:tokenize($t1)[not($sw(.))], ' ')
let $q2 := string-join(ft:tokenize($t2)[not($sw(.))], ' ')
where $q1 contains text { $q2 }
return <r> { <q1> { $q1 } </q1>, <q2> { $q2 } </q2> } </r>
- This fails:
let $t1 := 'Frontier Science &amp; Technology Research Foundation, Inc.'
let $t2 := 'Frontier Science and Technology Research Foundation, Inc.'
where $t1 contains text { $t2 } using stop words at 'stopwords.txt'
   or $t2 contains text { $t1 } using stop words at 'stopwords.txt'
return <r> { <q1> { $t1 } </q1>, <q2> { $t2 } </q2> } </r>
Any idea why?
Thanks, Ron
On February 2, 2016 at 12:13:14 PM, Christian Grün (christian.gruen@gmail.com) wrote:
Hi Ron,
I’m pretty sure that the default tokenizer discards the ampersand and doesn’t pass it on as token at all.
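A quick way to check this is to look at the tokens directly. The following is a minimal sketch, assuming BaseX and its ft:tokenize function with default options; the two strings are the ones from this thread, and the <tokens> wrapper is only there for readability.

```xquery
(: The two strings are taken from the thread; &amp; is the escaped ampersand. :)
for $s in (
  'Frontier Science &amp; Technology Research Foundation, Inc.',
  'Frontier Science and Technology Research Foundation, Inc.'
)
(: Join the tokens with a visible separator so each one can be inspected. :)
return <tokens>{ string-join(ft:tokenize($s), ' | ') }</tokens>
```

If the tokenizer behaves as described, the ampersand should not show up among the tokens of the first string, whereas "and" should be present in the second.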
Hope this helps (…at least for understanding the query result), Christian
On Tue, Feb 2, 2016 at 6:10 PM, Ron Katriel rkatriel@mdsol.com wrote:
Hi,
Given this thesaurus entry
<thesaurus xmlns="http://www.w3.org/2007/xqftts/thesaurus">
  <entry>
    <term>&amp;</term>
    <synonym>
      <term>and</term>
      <relationship>USE</relationship>
    </synonym>
  </entry>
</thesaurus>
I was expecting the following query to return true (file path omitted for clarity)
'Frontier Science and Technology Research Foundation, Inc.' contains text 'Frontier Science &amp; Technology Research Foundation, Inc.' using thesaurus at "thesaurus.xml"
but it returns false. Switching the order of the term and synonym makes no difference.
I tried getting around this using a stop word file (which includes ‘and’, ‘&’, and ‘&amp;’, just in case) but it does not work either.
Am I missing something?
Thanks, Ron