Hi Christian,

I hope you had a good weekend!

Unfortunately, no, this doesn't help: BaseX still doesn't choose to use the full-text index on my content :(. This is what I am getting at the moment:

Compiling:

- pre-evaluating fn:collection("edil")

- simplifying descendant-or-self step(s)

- converting descendant::*:entry to child steps

- simplifying descendant-or-self step(s)

- removing context expression (.)

- rewriting where clause(s)

- simplifying flwor expression

Query:

declare variable $term as xs:string external := 'athgabāi.*'; declare variable $col as xs:string external := 'edil'; <results>{subsequence(ft:mark(for $x in collection($col)//entry where $x//text() contains text {$term} using diacritics insensitive using wildcards return $x), 1, 5000)}</results>

Optimized Query:

element results { (fn:subsequence(ft:mark((db:open-pre("edil",0), db:open-pre("edil",155748), ...)/*:sample/*:entry[descendant::text() contains text "athgabāi.*" using wildcards using language 'English']), 1, 5000)) }

I tried this as well with the same results:

Compiling:

- pre-evaluating fn:collection("edil")

- simplifying descendant-or-self step(s)

- converting descendant::*:entry to child steps

- removing context expression (.)

- rewriting where clause(s)

- simplifying flwor expression

Query:

declare variable $term as xs:string external := 'athgabāi.*'; declare variable $col as xs:string external := 'edil'; <results>{subsequence(ft:mark(for $x in collection($col)//entry where $x/descendant::*[text() contains text 'athgabāi.*' using diacritics insensitive using wildcards] return $x), 1, 5000)}</results>

Optimized Query:

element results { (fn:subsequence(ft:mark((db:open-pre("edil",0), db:open-pre("edil",155748), ...)/*:sample/*:entry[descendant::*[text() contains text "athgabāi.*" using wildcards using language 'English']]), 1, 5000)) }

These are the options set on the database:

Database Properties

Name: edil

Size: 194 MB

Nodes: 7951662

Documents: 19

Binaries: 0

Timestamp: 2014-08-15-17-00-29

Resource Properties

Input Path: /home/cyocum/temp/edil_src/xml_src

Input Size: 87 MB

Timestamp: 2014-08-15-16-46-31

Encoding: UTF-8

CHOP: true

Indexes

Up-to-date: true

TEXTINDEX: true

ATTRINDEX: true

FTINDEX: true

LANGUAGE:

STEMMING: false

CASESENS: false

DIACRITICS: true

STOPWORDS:

UPDINDEX: false

MAXCATS: 100

MAXLEN: 96

I hope this helps.

All the best,

Chris

On Tue, Aug 19, 2014 at 10:12 AM, Christian Grün <christian.gruen@gmail.com> wrote:

Hi Chris,

sorry for keeping you waiting; I’ve been offline over the weekend.

> Thank you again for all your help. Unfortunately, my documents are
> multi-language and multi-diacritics so my users expect it to match
> athgabáil, athgabail, and athgabāil as the same word. They also want
> wildcard searching to work in the same way.

This should be no problem, even with the full-text default settings.
An example: the following query...

/descendant::*[text() contains text 'athgabāi.*'
  using diacritics insensitive
  using wildcards]

...will give you three results for the following document...

<xml>
  <term>athgabáil</term>
  <term>athgabail</term>
  <term>athgabāil</term>
</xml>

...and the results will be retrieved by the full-text index, using the
default settings:

- applying full-text index for "athgabāi.*" using wildcards using
language 'English'

The solution that I mentioned in my last mail is required if you want
to do both diacritics sensitive and insensitive search.

Does this help?
Christian

> At the moment the query looks like this and it does not use the full text
> index:
>
> declare variable $term as xs:string external := 'athgab.*'; declare variable
> $col as xs:string external := 'edil'; <results>{subsequence(ft:mark(for $x
> in collection($col)//entry where $x//text() contains text {$term} using
> wildcards using diacritics insensitive order by
> fn:lower-case(fn:replace(($x//orth[1]/text())[1], '\p{P}|\d+','')) collation
> "?lang=ga" return $x), 1, 5000)}</results>
>
> If anyone has any suggestions, I would be grateful.
>
> All the best,
> Chris
>
>
> On Thu, Aug 14, 2014 at 10:35 PM, Christian Grün <christian.gruen@gmail.com>
> wrote:
>>
>> Hi Chris,
>>
>> as you already noted, the full-text index will only be utilized with
>> the options that you chose when creating the index. If you want to do more
>> fine-grained searches, it’s usually advisable to choose the most general
>> options for creating the index (case insensitive, diacritics insensitive,
>> etc.) and then refine the results in a second step. This can, e.g., look
>> as follows:
>>
>> declare function local:search($db, $terms) {
>>   for $result in db:open($db)//*[text() contains text { $terms }]
>>   return $result[text() contains text { $terms } using case sensitive]
>> };
>> local:search('factbook', ('German', 'English'))
>>
>> Hope this helps,
>> Christian
>>
>>
>>
>> On Thu, Aug 14, 2014 at 10:54 PM, Chris Yocum <cyocum@gmail.com> wrote:
>> > Hi Christian,
>> >
>> > Apologies for bringing this back up but if I use "using diacritics
>> > insensitive" in the full text search, it seems to turn full text
>> > searching off. I have diacritics true on the database. I am just
>> > surprised to see diacritics causing the full text searching to be
>> > turned off.
>> >
>> > All the best,
>> > Chris
>> >
>> > On Wed, Aug 13, 2014 at 01:18:26PM +0200, Christian Grün wrote:
>> >> Hi Chris,
>> >>
>> >> there are various caches involved when evaluating queries, but I can't
>> >> see for the given query where a cache may be utilized. However, your
>> >> query may be evaluated faster if you simplify the nested where clause:
>> >>
>> >> <results>{
>> >>   subsequence(
>> >>     ft:mark(
>> >>       for $x in collection($col)//entry
>> >>       where $x//text() contains text { $term } using wildcards
>> >>       order by fn:lower-case(
>> >>         fn:replace(($x//orth[1]/text())[1], '\p{P}|\d+', '')
>> >>       ) collation "?lang=ga"
>> >>       return $x
>> >>     ), 1, 5000
>> >>   )
>> >> }</results>
>> >>
>> >> You could as well use a predicate with position(), it may be evaluated
>> >> faster than subsequence (I'm not sure, though, because most time will
>> >> probably be spent for ordering all results):
>> >>
>> >> <results>{
>> >>   ft:mark(
>> >>     for $x in collection($col)//entry
>> >>     where $x//text() contains text { $term } using wildcards
>> >>     order by fn:lower-case(
>> >>       fn:replace(($x//orth[1]/text())[1], '\p{P}|\d+', '')
>> >>     ) collation "?lang=ga"
>> >>     return $x
>> >>   )[position() = 1 to 5000]
>> >> }</results>
>> >>
>> >> Could you please open the InfoView in the GUI, execute the query again
>> >> and check if the full-text index is applied?
>> >>
>> >> Christian
>> >>
>> >>
>> >>
>> >> On Wed, Aug 13, 2014 at 12:02 PM, Christopher Yocum <cyocum@gmail.com>
>> >> wrote:
>> >> > declare variable $term as xs:string external; declare variable $col
>> >> > as
>> >> > xs:string external; <results>{subsequence(ft:mark(for $x in
>> >> > collection($col)//entry where $x//text()[. contains text {$term}
>> >> > using
>> >> > wildcards] order by fn:lower-case(fn:replace(($x//orth[1]/text())[1],
>> >> > '\\p{P}|\\d+','')) collation \"?lang=ga\" return $x), 1,
>> >> > 5000)}</results>