Hi Christian,

I hope you had a good weekend!  

Unfortunately, no, this doesn't help: the optimizer still doesn't use the full-text index on my content :(.  This is what I am getting at the moment:

Compiling:
- pre-evaluating fn:collection("edil")
- simplifying descendant-or-self step(s)
- converting descendant::*:entry to child steps
- simplifying descendant-or-self step(s)
- removing context expression (.)
- rewriting where clause(s)
- simplifying flwor expression

Query:
declare variable $term as xs:string external := 'athgabāi.*'; declare variable $col as xs:string external := 'edil'; <results>{subsequence(ft:mark(for $x in collection($col)//entry where $x//text() contains text {$term} using diacritics insensitive using wildcards return $x), 1, 5000)}</results>

Optimized Query:
element results { (fn:subsequence(ft:mark((db:open-pre("edil",0), db:open-pre("edil",155748), ...)/*:sample/*:entry[descendant::text() contains text "athgabāi.*" using wildcards using language 'English']), 1, 5000)) }

I tried this as well with the same results:

Compiling:
- pre-evaluating fn:collection("edil")
- simplifying descendant-or-self step(s)
- converting descendant::*:entry to child steps
- removing context expression (.)
- rewriting where clause(s)
- simplifying flwor expression

Query:
declare variable $term as xs:string external := 'athgabāi.*'; declare variable $col as xs:string external := 'edil'; <results>{subsequence(ft:mark(for $x in collection($col)//entry where $x/descendant::*[text() contains text 'athgabāi.*' using diacritics insensitive using wildcards] return $x), 1, 5000)}</results>

Optimized Query:
element results { (fn:subsequence(ft:mark((db:open-pre("edil",0), db:open-pre("edil",155748), ...)/*:sample/*:entry[descendant::*[text() contains text "athgabāi.*" using wildcards using language 'English']]), 1, 5000)) }

These are the options set on the database:

Database Properties
 Name: edil
 Size: 194 MB
 Nodes: 7951662
 Documents: 19
 Binaries: 0
 Timestamp: 2014-08-15-17-00-29

Resource Properties
 Input Path: /home/cyocum/temp/edil_src/xml_src
 Input Size: 87 MB
 Timestamp: 2014-08-15-16-46-31
 Encoding: UTF-8
 CHOP: true

Indexes
 Up-to-date: true
 TEXTINDEX: true
 ATTRINDEX: true
 FTINDEX: true
 LANGUAGE: 
 STEMMING: false
 CASESENS: false
 DIACRITICS: true
 STOPWORDS: 
 UPDINDEX: false
 MAXCATS: 100
 MAXLEN: 96

I hope this helps.

All the best,
Chris


On Tue, Aug 19, 2014 at 10:12 AM, Christian Grün <christian.gruen@gmail.com> wrote:
Hi Chris,

sorry for keeping you waiting; I’ve been offline over the weekend.

> Thank you again for all your help.  Unfortunately, my documents are
> multi-language and multi-diacritics so my users expect it to match
> athgabáil, athgabail, and athgabāil as the same word. They also want
> wildcard searching to work in the same way.

This should be no problem, even with the full-text default settings.
An example: the following query...

  /descendant::*[text() contains text 'athgabāi.*'
    using diacritics insensitive
    using wildcards]

...will give you three results for the following document...

<xml>
  <term>athgabáil</term>
  <term>athgabail</term>
  <term>athgabāil</term>
</xml>

...and the results will be retrieved by the full-text index, using the
default settings:

- applying full-text index for "athgabāi.*" using wildcards using
language 'English'

The solution that I mentioned in my last mail is required if you want
to do both diacritics sensitive and insensitive search.
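
As a rough sketch of that two-step approach applied to diacritics: the
first pass uses the default match options so that the index can be
applied, and the second pass keeps only the exact, diacritics-sensitive
matches. This assumes the full-text index was created with the most
general (diacritics-insensitive) options; the function name
local:search-exact and the search term are only illustrative.

```xquery
(: hypothetical helper, not part of BaseX :)
declare function local:search-exact($db, $term) {
  (: first pass: default options, so the full-text index can be applied :)
  for $hit in db:open($db)//*[text() contains text { $term }]
  (: second pass: keep only hits that also match diacritics sensitively :)
  return $hit[text() contains text { $term } using diacritics sensitive]
};
local:search-exact('edil', 'athgabáil')
```

The InfoView should still report "applying full-text index" for the
first pass if the index options and the query options agree.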

Does this help?
Christian




> At the moment the query looks like this and it does not use the full text
> index:
>
> declare variable $term as xs:string external := 'athgab.*'; declare variable
> $col as xs:string external := 'edil'; <results>{subsequence(ft:mark(for $x
> in collection($col)//entry where $x//text() contains text {$term} using
> wildcards using diacritics insensitive order by
> fn:lower-case(fn:replace(($x//orth[1]/text())[1], '\p{P}|\d+','')) collation
> "?lang=ga" return $x), 1, 5000)}</results>
>
> If anyone has any suggestions, I would be grateful.
>
> All the best,
> Chris
>
>
> On Thu, Aug 14, 2014 at 10:35 PM, Christian Grün <christian.gruen@gmail.com>
> wrote:
>>
>> Hi Chris,
>>
>> as you already noted, the full-text index will only be utilized with
>> the options that you choose when creating an index. If you want to do
>> more fine-grained searches, it’s usually advisable to choose the most
>> general options for creating the index (case insensitive, diacritics
>> insensitive, etc.) and then refine the results in a second step.
>> This can e.g. look as follows:
>>
>>   declare function local:search($db, $terms) {
>>     for $result in db:open($db)//*[text() contains text { $terms }]
>>     return $result[text() contains text { $terms } using case sensitive]
>>   };
>>   local:search('factbook', ('German', 'English'))
>>
>> Hope this helps,
>> Christian
>>
>>
>>
>> On Thu, Aug 14, 2014 at 10:54 PM, Chris Yocum <cyocum@gmail.com> wrote:
>> > Hi Christian,
>> >
>> > Apologies for bringing this back up but if I use "using diacritics
>> > insensitive" in the full text search, it seems to turn full text
>> > searching off.  I have diacritics true on the database.  I am just
>> > surprised to see diacritics causing the full text searching to be
>> > turned off.
>> >
>> > All the best,
>> > Chris
>> >
>> > On Wed, Aug 13, 2014 at 01:18:26PM +0200, Christian Grün wrote:
>> >> Hi Chris,
>> >>
>> >> there are various caches involved when evaluating queries, but I can't
>> >> see for the given query where a cache may be utilized. However, your
>> >> query may be evaluated faster if you simplify the nested where clause:
>> >>
>> >> <results>{
>> >>   subsequence(
>> >>     ft:mark(
>> >>       for $x in collection($col)//entry
>> >>       where $x//text() contains text { $term } using wildcards
>> >>       order by fn:lower-case(
>> >>         fn:replace(($x//orth[1]/text())[1], '\p{P}|\d+','')
>> >>       ) collation "?lang=ga"
>> >>       return $x
>> >>     ), 1, 5000
>> >>   )
>> >> }</results>
>> >>
>> >> You could as well use a predicate with position(), it may be evaluated
>> >> faster than subsequence (I'm not sure, though, because most time will
>> >> probably be spent for ordering all results):
>> >>
>> >> <results>{
>> >>   ft:mark(
>> >>     for $x in collection($col)//entry
>> >>     where $x//text() contains text { $term } using wildcards
>> >>     order by fn:lower-case(
>> >>       fn:replace(($x//orth[1]/text())[1], '\p{P}|\d+','')
>> >>     ) collation "?lang=ga"
>> >>     return $x
>> >>   )[position() = 1 to 5000]
>> >> }</results>
>> >>
>> >> Could you please open the InfoView in the GUI, execute the query again
>> >> and check if the full-text index is applied?
>> >>
>> >> Christian
>> >>
>> >>
>> >>
>> >> On Wed, Aug 13, 2014 at 12:02 PM, Christopher Yocum <cyocum@gmail.com>
>> >> wrote:
>> >> > declare variable $term as xs:string external; declare variable $col
>> >> > as
>> >> > xs:string external; <results>{subsequence(ft:mark(for $x in
>> >> > collection($col)//entry where $x//text()[. contains text {$term}
>> >> > using
>> >> > wildcards] order by fn:lower-case(fn:replace(($x//orth[1]/text())[1],
>> >> > '\\p{P}|\\d+','')) collation \"?lang=ga\" return $x), 1,
>> >> > 5000)}</results>
>
>