Hi Gerrit,

Thanks for the suggestions. I would like to retain the original diacritics (for output purposes) and have them respected when matching (e.g., match acétazolamide to acétazolamide, but not acétazolamide to acetazolamide). I am looking for a simple solution that does not involve modifying the database or maintaining multiple copies, for the sake of both processing simplicity and storage efficiency.

Thanks,
Ron

On August 3, 2018 at 9:08:19 AM, Imsieke, Gerrit, le-tex (gerrit.imsieke@le-tex.de) wrote:

Hi Ron,

You can add an extra element (or attribute) to the content when
importing or modifying it. (Or another document in another database if
you like – you can create and later find such an index document by
giving it the same db:path as the original document.)

In this extra database, document, element and/or attribute, you can
recreate the original text, except that you normalize the characters
with diacritical marks to a canonical decomposition form and then strip
away the diacritical marks like this:

replace(normalize-unicode($input, 'NFKD'), '\p{Mn}', '')
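
As a minimal sketch of what that expression does, it can be wrapped in a function and applied to one of the drug names from this thread. The function name local:strip-diacritics is made up for illustration and is not part of BaseX:

```xquery
(: Hypothetical helper: decompose the string, then drop all
   combining marks (Unicode category Mn). :)
declare function local:strip-diacritics($input as xs:string) as xs:string {
  replace(normalize-unicode($input, 'NFKD'), '\p{Mn}', '')
};

(: The accented and the plain spelling collapse to the same string,
   so this comparison should evaluate to true. :)
local:strip-diacritics('acétazolamide') = 'acetazolamide'
```

Note that NFKD also folds compatibility characters (a ligature such as ﬁ becomes "fi"), so the stripped text can differ from the original in more than the accents.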

The full updating statement is beyond my cursory XQuery capabilities –
I’d probably do it in XSLT. Also I don’t know how to trigger an event
that would cause an update of the auxiliary fields when the underlying
data changes.
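
For what it is worth, such an updating statement might look roughly like the following in XQuery Update. This is an untested sketch under several assumptions: the attribute name "plain" and the helper local:strip-diacritics are invented here, the element path is borrowed from Ron's query quoted below, and the target elements are assumed not to carry such an attribute yet (otherwise the insert would fail):

```xquery
(: Hypothetical helper, same expression as above. :)
declare function local:strip-diacritics($input as xs:string) as xs:string {
  replace(normalize-unicode($input, 'NFKD'), '\p{Mn}', '')
};

(: Add a diacritics-free copy of each intervention name as an
   attribute (assumed name: plain) next to the original text. :)
for $name in db:open('CTGov')/clinical_study/intervention/intervention_name
return insert node
  attribute plain { local:strip-diacritics(string($name)) }
  into $name
```

The original element text stays untouched for output, and the stripped copy would have to be regenerated whenever the underlying data changes.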

Gerrit


On 03.08.2018 14:39, Ron Katriel wrote:
> Christian,
>
> Adding the "diacritics sensitive" option slows execution by a factor of 3.
> My script (fragment below), which joins two large databases, namely CT.gov
> <http://clinicaltrials.gov> and DrugBank, takes 2 hours without the
> option but 6 hours with it. Given the combinatorics involved, I am
> wondering if there is a better way to do this in BaseX.
>
> Thanks,
> Ron
>
>
> for $drug in db:open('DrugBank')/drugbank/drug
>  let $drug_name := $drug/name/text()
>  let $drug_synonyms := functx:value-union(
>    normalize-space(lower-case($drug/name)),
>    local:drug-synonyms($drug_name))
>  for $synonym_name in $drug_synonyms
>  ...
>  for $study in db:open('CTGov')/clinical_study[
>    intervention/intervention_name contains text { $synonym_name }
>      using case insensitive using diacritics sensitive]
>  ...
>
>
> Ron Katriel, Ph.D. | Principal Data Scientist | Medidata Solutions
> <http://www.mdsol.com/>
> 350 Hudson Street, 7th Floor, New York, NY 10014
> rkatriel@mdsol.com | direct: +1 201 337 3622 | mobile: +1 201 675 5598
> | main: +1 212 918 1800
>
> On August 1, 2018 at 12:41:26 PM, Ron Katriel (rkatriel@mdsol.com) wrote:
>
>> Thanks, Christian. Strange, prior to contacting you and on a hunch, I
>> tried adding the missing “using” keyword but still got the syntax
>> error. Anyway, everything is good now!
>>
>> Best,
>> Ron
>>
>> On August 1, 2018 at 3:57:51 AM, Christian Grün
>> (christian.gruen@gmail.com) wrote:
>>
>>> I have fixed the example in the doc.
>>> Best, Christian
>>>
>>>
>>> On Wed, Aug 1, 2018 at 5:08 AM Ron Katriel <rkatriel@mdsol.com> wrote:
>>> >
>>> > Hi,
>>> >
>>> > The following from your website (docs.basex.org/wiki/Full-Text)
>>> > appears to be syntactically incorrect:
>>> >
>>> > "'Äpfel' will not be found..." contains text "Apfel" diacritics sensitive
>>> >
>>> > In the BaseX GUI the keyword diacritics is underlined in red and the following error is reported
>>> >
>>> > Unexpected end of query: 'diacritic sens...'.
>>> >
>>> > This happens in version 8.6.4 and also the latest (9.0.2).
>>> >
>>> > Thanks,
>>> > Ron