On 2012-06-22, Charles Kowalski alxarch@gmail.com wrote:
I am concerned, though, because this is not always the desired behavior. Sometimes (e.g. in an academic context) I could see the need for accent-sensitive searches. The optimal scenario would be to have these mappings in an easily parsable text file (just as the stopword list works). OTOH this would be reinventing the collation wheel (an oversimplified version of it). For example, the mapping above only covers modern Greek text; Ancient/Polytonic Greek has a lot more mappings that are not needed for modern Greek. Also, I'm pretty sure other languages have such needs too.
I'd like to hear your thoughts on this.
Yes, this is reinventing the wheel, and my impression is that Token.java is also a reinvented wheel, and I'm sorry to say that it doesn't look very round to me.
Unicode provides all the tools needed to make reinventing wheels unnecessary, in particular normalization forms, character properties, and the Unicode collation algorithm. These tools are already available in Java (and many other languages) and cover *all* of Unicode, not just small subsets. They provide sensible defaults for most use cases. Of course it should be possible to override the defaults for applications with special needs, but for most applications that shouldn't be necessary.
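For instance, accent- and case-insensitive comparison is exactly what java.text.Collator gives you at PRIMARY strength. A minimal sketch (the Greek locale is just an illustration, not a requirement):

    import java.text.Collator;
    import java.util.Locale;

    public class CollatorDemo {
        public static void main(String[] args) {
            // PRIMARY strength compares only base letters, ignoring accents and case.
            Collator collator = Collator.getInstance(new Locale("el"));
            collator.setStrength(Collator.PRIMARY);
            System.out.println(collator.compare("ά", "α") == 0); // expected: true
            System.out.println(collator.compare("Α", "α") == 0); // expected: true
        }
    }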
For example, no tables are necessary for stripping accents. Instead, you apply NFD (or NFKD) normalization (decomposing all characters) and then remove all characters with general category Mn (nonspacing marks, i.e. the accents).
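A minimal sketch of that in Java with java.text.Normalizer (the helper name stripAccents is just for illustration):

    import java.text.Normalizer;

    public class StripAccents {
        // Decompose to NFD, then drop all code points in category Mn (nonspacing marks).
        static String stripAccents(String input) {
            String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
            return decomposed.replaceAll("\\p{Mn}+", "");
        }

        public static void main(String[] args) {
            System.out.println(stripAccents("άνθρωπος")); // ανθρωπος
            System.out.println(stripAccents("café"));     // cafe
        }
    }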
The NFKD normalization form also allows you to match "ſ" with "s", "ﬀ" with "ff", "²" with "2", etc.
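Again with java.text.Normalizer, as a sketch:

    import java.text.Normalizer;

    public class CompatibilityDemo {
        public static void main(String[] args) {
            // NFKD maps compatibility characters to their nominal equivalents.
            System.out.println(Normalizer.normalize("ſ", Normalizer.Form.NFKD)); // s
            System.out.println(Normalizer.normalize("ﬀ", Normalizer.Form.NFKD)); // ff
            System.out.println(Normalizer.normalize("²", Normalizer.Form.NFKD)); // 2
        }
    }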
Please consult the Unicode standard; it's all there, so *please* don't try to invent new wheels.
Best regards