Hi Günter,
You can take advantage of the Unicode normalization features of XQuery:
declare function local:normalize($string) {
  $string
  => normalize-unicode('NFKD')
  => replace('\p{IsCombiningDiacriticalMarks}', '')
};

for $text in ('Büchſe', 'Buͤchſe')
return local:normalize($text) contains text 'Büchse'
In a future version of BaseX, we want to incorporate Unicode decomposition into the XQuery Full Text tokenizer. For now, if you want to speed up your queries with an index, you can create a custom index structure in which all text strings are stored in a normalized representation [1].
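A rough sketch of such a custom index, under some assumptions: the database names 'edition' and 'edition-index' and the element name `entry` are made up for illustration, and the db: function names are those of BaseX 9 (later releases may have renamed some of them). It stores a normalized copy of every text node, together with the node id of the original, in a second database that has a full-text index:

```xquery
(: Sketch only: 'edition' and 'edition-index' are hypothetical database names. :)
declare function local:normalize($string as xs:string) as xs:string {
  $string
  => normalize-unicode('NFKD')
  => replace('\p{IsCombiningDiacriticalMarks}', '')
};

(: One <entry> per text node: the normalized string as content,
   the node id of the original text node as attribute. :)
let $index :=
  <index>{
    for $node in db:open('edition')//text()
    return <entry id="{ db:node-id($node) }">{
      local:normalize($node)
    }</entry>
  }</index>
(: Create the index database with a full-text index enabled. :)
return db:create('edition-index', $index, 'index.xml', map { 'ftindex': true() })
```

A lookup would then run `contains text` against the `entry` elements of the index database, with the search term passed through the same local:normalize function, and map the stored ids back to the original nodes with db:open-id. The index has to be rebuilt whenever the source database changes.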
Hope this helps,
Christian
[1] http://docs.basex.org/wiki/Indexes#Custom_Index_Structures
On Fri, Jul 26, 2019 at 5:39 PM Günter Dunz-Wolff <guenter.dunzwolff@gmail.com> wrote:
Hi all,
I have been working for some years on a digital edition of the works of a former German author. My transcriptions of those works contain many Gothic characters, such as the old German long s (Unicode: LATIN SMALL LETTER LONG S). For example: Büchſe (more precisely, Buͤchſe).
In my full-text search, my goal is that a user who searches for „Büchse“ gets both „Büchse“ and „Büchſe“ (with long s). Ideally, she should get „Büchse“, „Büchſe“ and „Buͤchſe“. How can I make //text[. contains text { }] treat s and ſ, and ü and uͤ, as the same character?
Thanks a lot for any help.
Best regards, Guenter