Re: [basex-talk] Thesaurus question

4 Feb 2016

      Christian, I will second your description of this logic as “nonintuitive”. It seems to be driven more by efficiency concerns than usability (on the part of the W3C). Would it be possible to create a custom index structure in BaseX that would get around this limitation? If yes, as you seem to suggest below, can this be done dynamically? I had difficulty following the example in [2].

Thanks,
Ron

On February 2, 2016 at 2:34:35 PM, Christian Grün (christian.gruen@gmail.com) wrote:
...
Any idea why?
Yes – See one of my previous replies ;) In a nutshell: In the first  
query, stopwords will be dropped. In the second one, they will only be  
ignored (“Tokens matched by stop words retain their position numbers  
[…]” [1]):  

"A B C" contains text "A C" using stop words ("B")  
→ false  
"A B C" contains text "A B C" using stop words ("B")  
→ true  

It may not be the most intuitive decision that has been taken back  
then by the designers of the spec, but… Les jeux sont faits.  

In some projects, we’ve decided to work with custom index structures  
[2]. It’s some more work, but it will give you complete freedom on  
what tokens you want to store.  
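
As a minimal sketch of that idea (not the example from [2] verbatim): the tokens are chosen in XQuery and stored as lookup keys, so stop words and the ampersand never reach the index. The stop word map, the $names sequence, and the index/entry/key names are hypothetical and only serve as illustration; ft:tokenize is assumed to run with the default full-text options, as in the workaround quoted further down.

```xquery
(: Hypothetical stop word map; the thread builds it from 'stopwords.txt'. :)
let $sw := map { 'and': true(), 'the': true(), 'of': true() }

(: Normalize a string: tokenize, drop stop words, re-join with spaces. :)
let $norm := function($s as xs:string) as xs:string {
  string-join(ft:tokenize($s)[not($sw(.))], ' ')
}

(: Hypothetical source data; in practice this would come from a database. :)
let $names := (
  'Frontier Science &amp; Technology Research Foundation, Inc.',
  'Frontier Science and Technology Research Foundation, Inc.'
)

(: The "index": one entry per name, keyed by its normalized form. :)
let $index := <index>{
  for $name in $names
  return <entry key="{ $norm($name) }">{ $name }</entry>
}</index>

(: Look up all names that share the normalized form of the query string. :)
let $query := 'Frontier Science and Technology Research Foundation, Inc.'
return $index/entry[@key = $norm($query)]
```

Both names should normalize to the same key here, so the lookup is expected to return both entries. To make this persistent, the index element could be stored in a database of its own and rebuilt whenever the source data changes.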

Hope this helps,  
Christian  

[1] https://www.w3.org/TR/xpath-full-text-10/#ftstopwordoption  
[2] http://docs.basex.org/wiki/Indexes#Custom_Index_Structures  

On Tue, Feb 2, 2016 at 6:56 PM, Ron Katriel <rkatriel@mdsol.com> wrote:
...
Thanks, Christian. You are right about the tokenization of ampersands.  
However, I still see unexpected behavior with the built-in stop words.
1. This works (using your clever stop word workaround, slightly modified  
with string-join):
let $sw := map:merge(  
for $sw in file:read-text-lines('stopwords.txt')  
return map { $sw : true() }  
)
let $t1 := 'Frontier Science &amp; Technology Research Foundation, Inc.'  
let $t2 := 'Frontier Science and Technology Research Foundation, Inc.'  
let $q1 := string-join(ft:tokenize($t1)[not($sw(.))], ' ')  
let $q2 := string-join(ft:tokenize($t2)[not($sw(.))], ' ')  
where $q1 contains text { $q2 }  
return <r> { <q1> { $q1 } </q1>, <q2> { $q2 } </q2> } </r>
2. This fails:
let $t1 := 'Frontier Science &amp; Technology Research Foundation, Inc.'  
let $t2 := 'Frontier Science and Technology Research Foundation, Inc.'  
where $t1 contains text { $t2 } using stop words at 'stopwords.txt' or  
$t2 contains text { $t1 } using stop words at 'stopwords.txt'  
return <r> { <q1> { $t1 } </q1>, <q2> { $t2 } </q2> } </r>
Any idea why?
Thanks,  
Ron
On February 2, 2016 at 12:13:14 PM, Christian Grün  
(christian.gruen@gmail.com) wrote:
Hi Ron,
I’m pretty sure that the default tokenizer discards the ampersand and  
doesn’t pass it on as token at all.
Hope this helps (…at least for understanding the query result),  
Christian
On Tue, Feb 2, 2016 at 6:10 PM, Ron Katriel <rkatriel@mdsol.com> wrote:
...
Hi,
Given this thesaurus entry
<thesaurus xmlns="http://www.w3.org/2007/xqftts/thesaurus">  
<entry>  
<term>&amp;</term>  
<synonym>  
<term>and</term>  
<relationship>USE</relationship>  
</synonym>  
</entry>  
</thesaurus>
I was expecting the following query to return true (file path omitted for  
clarity):
'Frontier Science and Technology Research Foundation, Inc.' contains text  
'Frontier Science &amp; Technology Research Foundation, Inc.' using  
thesaurus at "thesaurus.xml"
but it returns false. Switching the order of the term and synonym makes no  
difference.
I tried getting around this using a stop word file (which includes 'and',  
'&', and '&amp;', just in case) but it does not work either.
Am I missing something?
Thanks,  
Ron