Hi all, for some years I have been working on a digital edition of the works of a German author of an earlier period. My transcriptions contain many blackletter characters, such as the old German long s (Unicode: LATIN SMALL LETTER LONG S, U+017F). For example: Büchſe (more precisely, Buͤchſe). My goal for the full-text search is that a user who searches for „Büchse“ finds both „Büchse“ AND „Büchſe“ (with long s). Ideally, she should find „Büchse“ AND „Büchſe“ AND „Buͤchſe“. How can I make //text[. contains text { }] treat s and ſ, and ü and uͤ, as the same character? Thanks a lot for any help. Best regards, Guenter
Hi Guenter, you should have a look at the fn:matches function [1] and work with regular expressions to perform this task. Best regards, Markus [1] http://www.xqueryfunctions.com/xq/fn_matches.html On 26.07.2019 at 17:39, Günter Dunz-Wolff wrote:
-- Markus Wittenberg Tel +49 (0)341 248 475 36 Mail wittenberg@axxepta.de ---- axxepta solutions GmbH Lehmgrubenweg 17, 88131 Lindau Amtsgericht Berlin HRB 97544B Geschäftsführer: Karsten Becke, Maximilian Gärber
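Markus's regex suggestion could be sketched roughly as follows. This is an illustrative Python sketch, not anything from fn:matches or BaseX itself; the variant table and the helper name `variant_pattern` are my own assumptions. The idea is to expand each letter of the search term into an alternation that also matches its historical spellings.

```python
import re

# Hypothetical variant table mapping modern letters to historical spellings.
# Extend it as further blackletter variants turn up in the transcriptions.
VARIANTS = {
    "s": "[sſ]",          # round s or long s (U+017F)
    "ü": "(?:ü|uͤ)",      # u-umlaut or u with combining small e (U+0364)
}

def variant_pattern(term: str) -> str:
    """Build a regex in which each letter also matches its old variants."""
    return "".join(VARIANTS.get(ch, re.escape(ch)) for ch in term.lower())

pattern = re.compile(variant_pattern("büchse"), re.IGNORECASE)

# All three spellings should be found by the same pattern:
matches = [w for w in ("Büchse", "Büchſe", "Buͤchſe") if pattern.search(w)]
```

The drawback of this approach, compared to normalization, is that the variant table has to be maintained by hand and the query cannot use a full-text index directly.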
Hi Günter,

You can take advantage of the Unicode normalization features of XQuery:

  declare function local:normalize($string) {
    $string
    => normalize-unicode('NFKD')
    => replace('\p{IsCombiningDiacriticalMarks}', '')
  };

  for $text in ('Büchſe', 'Buͤchſe')
  return local:normalize($text) contains text 'Büchse'

In a future version of BaseX, we want to incorporate Unicode decomposition into the XQuery Full Text tokenizer. For now, if you want to speed up your queries with an index, you can create a custom index structure in which all text strings are stored in a normalized representation [1].

Hope this helps,
Christian

[1] http://docs.basex.org/wiki/Indexes#Custom_Index_Structures

On Fri, Jul 26, 2019 at 5:39 PM Günter Dunz-Wolff <guenter.dunzwolff@gmail.com> wrote:
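Outside XQuery, the same NFKD-plus-strip normalization can be mirrored with Python's standard unicodedata module. This is a minimal sketch under the same assumptions as Christian's query; the function name `normalize` is local to this example:

```python
import unicodedata

def normalize(text: str) -> str:
    """NFKD-decompose, drop combining marks, then lowercase.

    NFKD maps the long s (U+017F) to a plain 's' and splits 'ü'
    (as well as the historical 'uͤ') into a base letter plus a
    combining mark, which is then removed.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()

# All three historical spellings collapse to the same search key:
for word in ("Büchse", "Büchſe", "Buͤchſe"):
    print(word, "->", normalize(word))  # each prints "... -> buchse"
```

Applying this function both to the indexed text and to the user's query term gives exactly the behaviour asked for: „Büchse“, „Büchſe“ and „Buͤchſe“ all reduce to the same string.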
participants (3):
- Christian Grün
- Günter Dunz-Wolff
- Markus Wittenberg