Re: [basex-talk] Full-Text Search long s

26 Jul 2019


Hi Günter,
You can take advantage of the Unicode normalization features of XQuery:
declare function local:normalize($string) {
  $string
  => normalize-unicode('NFKD')
  => replace('\p{IsCombiningDiacriticalMarks}', '')
};

for $text in ('Büchſe', 'Buͤchſe')
return local:normalize($text) contains text 'Büchse'
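To see why both spellings match, it can help to look at the intermediate values. The following is only a small sketch using standard XQuery 3.1 functions; the variable names are illustrative and not part of the function above:

```xquery
for $text in ('Büchſe', 'Buͤchſe')
(: NFKD splits a precomposed ü into u + U+0308 and maps the long s (U+017F) to a plain s :)
let $decomposed := normalize-unicode($text, 'NFKD')
(: U+0308 and U+0364 (combining small e) both lie in the block U+0300 to U+036F :)
let $stripped := replace($decomposed, '\p{IsCombiningDiacriticalMarks}', '')
return $stripped
```

Both inputs should reduce to the same base string without any marks. The search term 'Büchse' can still match it, because XQuery Full Text is diacritics insensitive by default.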
In a future version of BaseX, we want to incorporate Unicode
decomposition into the XQuery Full Text tokenizer. For now, if you
want to speed up your queries with an index, you can create a custom
index structure in which all text strings are stored in a normalized
representation [1].
Hope this helps
Christian
[1] http://docs.basex.org/wiki/Indexes#Custom_Index_Structures
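A custom index along these lines might look roughly as follows. This is a sketch under assumptions, not a tested recipe: the database name 'edition' and the element names are hypothetical, and the db: function names are those of the BaseX 9 Database Module (newer releases may use different names).

```xquery
(: Hypothetical database name; replace it with your own. :)
let $db := 'edition'
let $index := <index>{
  for $node in db:open($db)//text()
  (: Same normalization as in local:normalize above. :)
  let $norm := replace(
    normalize-unicode($node, 'NFKD'),
    '\p{IsCombiningDiacriticalMarks}', ''
  )
  (: Keep the node id so that the original text node can be looked up later. :)
  return <text id='{ db:node-id($node) }'>{ $norm }</text>
}</index>
(: Store the normalized strings in a second database with a full-text index. :)
return db:create($db || '-index', $index, 'index.xml', map { 'ftindex': true() })
```

A lookup would then search the normalized strings, for example db:open('edition-index')//text[. contains text 'Büchse'], and resolve each hit back to the original node with db:open-id('edition', $hit/@id). The index database has to be rebuilt whenever the original texts change.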
On Fri, Jul 26, 2019 at 5:39 PM Günter Dunz-Wolff
guenter.dunzwolff@gmail.com wrote:
Hi all,
I have been working for some years on a digital edition of the works of an earlier German author. My transcription of those works contains many Gothic-script characters such as the old German long s (Unicode: LATIN SMALL LETTER LONG S), for example Büchſe (more precisely, Buͤchſe).
In my full-text search, the goal is that a user who searches for „Büchse“ gets both „Büchse“ and „Büchſe“ (with long s). Ideally, she should get „Büchse“, „Büchſe“, and „Buͤchſe“. How can I make //text[. contains text { } treat s and ſ, and ü and uͤ, as the same character?
Thanks a lot for any help.
Best regards,
Guenter
