Thanks, Christian. You are right about the tokenization of ampersands. However, I still see unexpected behavior with the built-in stop words.
1. This works (using your clever stop word workaround, slightly modified with string-join):
let $sw := map:merge(
  for $sw in file:read-text-lines('stopwords.txt')
  return map { $sw : true() }
)
let $t1 := 'Frontier Science &amp; Technology Research Foundation, Inc.'
let $t2 := 'Frontier Science and Technology Research Foundation, Inc.'
let $q1 := string-join(ft:tokenize($t1)[not($sw(.))], ' ')
let $q2 := string-join(ft:tokenize($t2)[not($sw(.))], ' ')
where $q1 contains text { $q2 }
return <r>{ <q1>{ $q1 }</q1>, <q2>{ $q2 }</q2> }</r>
2. This fails:
let $t1 := 'Frontier Science &amp; Technology Research Foundation, Inc.'
let $t2 := 'Frontier Science and Technology Research Foundation, Inc.'
where $t1 contains text { $t2 } using stop words at 'stopwords.txt'
   or $t2 contains text { $t1 } using stop words at 'stopwords.txt'
return <r>{ <q1>{ $t1 }</q1>, <q2>{ $t2 }</q2> }</r>
Any idea why?
Thanks, Ron
On February 2, 2016 at 12:13:14 PM, Christian Grün (christian.gruen@gmail.com) wrote:
Hi Ron,
I’m pretty sure that the default tokenizer discards the ampersand and doesn’t pass it on as token at all.
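A quick way to check this is to look at the tokens directly. The following is only a minimal sketch, assuming the ft:tokenize function from the BaseX Full-Text Module with its default options; the shortened input strings and the ' | ' separator are chosen purely for illustration:

```xquery
(: Sketch: list the tokens the default tokenizer produces for both spellings. :)
(: The ampersand has to be written as &amp; inside an XQuery string literal.  :)
let $t1 := 'Frontier Science &amp; Technology'
let $t2 := 'Frontier Science and Technology'
return (
  <t1>{ string-join(ft:tokenize($t1), ' | ') }</t1>,
  <t2>{ string-join(ft:tokenize($t2), ' | ') }</t2>
)
```

If the ampersand is indeed discarded, the first list should contain no token for it, whereas the second should contain the token for "and".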
Hope this helps (…at least for understanding the query result), Christian
On Tue, Feb 2, 2016 at 6:10 PM, Ron Katriel <rkatriel@mdsol.com> wrote:
Hi,
Given this thesaurus entry
<thesaurus xmlns="http://www.w3.org/2007/xqftts/thesaurus">
  <entry>
    <term>&amp;</term>
    <synonym>
      <term>and</term>
      <relationship>USE</relationship>
    </synonym>
  </entry>
</thesaurus>
I was expecting the following query to return true (file path omitted for clarity):
'Frontier Science and Technology Research Foundation, Inc.' contains text 'Frontier Science &amp; Technology Research Foundation, Inc.' using thesaurus at "thesaurus.xml"
but it returns false. Switching the order of the term and synonym makes no difference.
I tried getting around this using a stop word file (which includes ‘and’, ‘&’, and ‘&amp;’, just in case), but that does not work either.
Am I missing something?
Thanks, Ron