On 2012-06-22, Charles Kowalski alxarch@gmail.com wrote:
I am concerned, though, because this is not always the desired behavior. Sometimes (e.g. in an academic context) I could see the need for accent-sensitive searches. The optimal scenario would be to have these mappings in an easily parsable text file (just as the stopword list works). OTOH this would be reinventing the collation wheel (an oversimplified version of it). For example, the mapping above only covers modern Greek text; Ancient/Polytonic Greek has a lot more mappings that are not needed for modern Greek. Also, I'm pretty sure other languages have such needs too.
I'd like to hear your thoughts on this.
Yes, this is reinventing the wheel, and my impression is that Token.java is also a reinvented wheel, and I'm sorry to say that it doesn't look very round to me.
Unicode provides all the tools needed to make reinventing wheels unnecessary, in particular normalization forms, character properties, and the Unicode collation algorithm. These tools are already available in Java (and many other languages) and cover *all* of Unicode, not just small subsets. They provide sensible defaults for most use cases. Of course it should be possible to override the defaults for applications with special needs, but for most applications that shouldn't be necessary.
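For instance, accent- and case-insensitive comparison is exactly what java.text.Collator gives you at PRIMARY strength. A minimal sketch (the Greek locale is just an illustration, not a requirement):

    import java.text.Collator;
    import java.util.Locale;

    public class CollatorDemo {
        public static void main(String[] args) {
            // PRIMARY strength compares only base letters, ignoring accents and case.
            Collator collator = Collator.getInstance(new Locale("el"));
            collator.setStrength(Collator.PRIMARY);
            System.out.println(collator.compare("ά", "α") == 0); // expected: true
            System.out.println(collator.compare("Α", "α") == 0); // expected: true
        }
    }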
For example, no tables are necessary for stripping accents. Instead, you apply NFD (or NFKD) normalization (decomposing all characters) and then remove all characters with general category Mn (nonspacing marks, i.e. the accents).
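A minimal sketch of that in Java with java.text.Normalizer (the helper name stripAccents is just for illustration):

    import java.text.Normalizer;

    public class StripAccents {
        // Decompose to NFD, then drop all code points in category Mn (nonspacing marks).
        static String stripAccents(String input) {
            String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
            return decomposed.replaceAll("\\p{Mn}+", "");
        }

        public static void main(String[] args) {
            System.out.println(stripAccents("άνθρωπος")); // ανθρωπος
            System.out.println(stripAccents("café"));     // cafe
        }
    }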
The NFKD normalization form also allows you to match "ſ" with "s", "ﬀ" with "ff", "²" with "2", etc.
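Again with java.text.Normalizer, as a sketch:

    import java.text.Normalizer;

    public class CompatibilityDemo {
        public static void main(String[] args) {
            // NFKD maps compatibility characters to their nominal equivalents.
            System.out.println(Normalizer.normalize("ſ", Normalizer.Form.NFKD)); // s
            System.out.println(Normalizer.normalize("ﬀ", Normalizer.Form.NFKD)); // ff
            System.out.println(Normalizer.normalize("²", Normalizer.Form.NFKD)); // 2
        }
    }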
Please consult the Unicode standard; it's all there, so *please* don't try to invent new wheels.
Best regards