Re: [basex-talk] full text search collation

22 Jun 2012


Hi Christian,
...
If you'd manage to provide me with some appropriate tables for Greek
characters, I'll be glad to extend this mapping.
If I understood correctly: I used the Unicode code points from
http://unicode.org/charts/PDF/U0370.pdf
to produce the following mapping:
{'\u0390', 'ι'},
{'\u03b0', 'υ'},
{'\u03d3', 'Υ'},
{'\u03d4', 'Υ'},
{'\u0386', 'Α'},
{'\u0388', 'Ε'},
{'\u0389', 'Η'},
{'\u038a', 'Ι'},
{'\u03aa', 'Ι'},
{'\u03ca', 'ι'},
{'\u03ab', 'Υ'},
{'\u03cb', 'υ'},
{'\u038c', 'Ο'},
{'\u03ac', 'α'},
{'\u03cc', 'ο'},
{'\u03ad', 'ε'},
{'\u03cd', 'υ'},
{'\u038e', 'Υ'},
{'\u03ae', 'η'},
{'\u03ce', 'ω'},
{'\u038f', 'Ω'},
{'\u03af', 'ι'},
I am concerned, though, because this is not always the desired behavior.
Sometimes (e.g. in an academic context) I could see the need for
accent-sensitive searches.
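
To make the trade-off concrete, here is a minimal sketch of how the mapping above could be applied with a switch for the accent-sensitive case. The class GreekAccentMap, the method normalize and the diacriticsSensitive flag are hypothetical names for illustration only, not existing BaseX code. The pairs are the same as in the table above, written as Unicode escapes so the source does not depend on the file encoding.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not a BaseX class): applies the mapping listed above. */
final class GreekAccentMap {
  // (accented, base) pairs, same content as the table above, written as escapes.
  private static final char[][] PAIRS = {
    {'\u0390', '\u03b9'}, {'\u03b0', '\u03c5'}, {'\u03d3', '\u03a5'}, {'\u03d4', '\u03a5'},
    {'\u0386', '\u0391'}, {'\u0388', '\u0395'}, {'\u0389', '\u0397'}, {'\u038a', '\u0399'},
    {'\u03aa', '\u0399'}, {'\u03ca', '\u03b9'}, {'\u03ab', '\u03a5'}, {'\u03cb', '\u03c5'},
    {'\u038c', '\u039f'}, {'\u03ac', '\u03b1'}, {'\u03cc', '\u03bf'}, {'\u03ad', '\u03b5'},
    {'\u03cd', '\u03c5'}, {'\u038e', '\u03a5'}, {'\u03ae', '\u03b7'}, {'\u03ce', '\u03c9'},
    {'\u038f', '\u03a9'}, {'\u03af', '\u03b9'},
  };

  private static final Map<Character, Character> MAP = new HashMap<>();
  static {
    for (char[] pair : PAIRS) MAP.put(pair[0], pair[1]);
  }

  /** Returns the token unchanged if accents matter, else with accents mapped away. */
  static String normalize(String token, boolean diacriticsSensitive) {
    if (diacriticsSensitive) return token;
    StringBuilder sb = new StringBuilder(token.length());
    for (int i = 0; i < token.length(); i++) {
      char c = token.charAt(i);
      char mapped = MAP.getOrDefault(c, c); // characters without an entry pass through
      sb.append(mapped);
    }
    return sb.toString();
  }
}
```

With the flag off, a token and its accented variant end up as the same index term; with the flag on, they stay distinct.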
The optimal scenario would be to keep these mappings in an easily
parsable text file (just like the stopword list).
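
As a rough sketch of what such a file could look like, the parser below assumes a made-up format: one mapping per line, two hexadecimal code points separated by whitespace (source, then target), with blank lines and lines starting with # ignored. Both the format and the class name MappingFile are assumptions for illustration; nothing like this exists in BaseX as far as I know.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical parser for an assumed "hex hex" mapping file format. */
final class MappingFile {
  /** Parses lines such as "03ac 03b1" into a character-to-character map. */
  static Map<Character, Character> parse(List<String> lines) {
    Map<Character, Character> map = new HashMap<>();
    for (String raw : lines) {
      String line = raw.trim();
      // Skip blank lines and comments.
      if (line.isEmpty() || line.startsWith("#")) continue;
      String[] fields = line.split("\\s+");
      if (fields.length != 2) {
        throw new IllegalArgumentException("Malformed mapping line: " + raw);
      }
      // Only BMP characters are handled in this sketch.
      char from = (char) Integer.parseInt(fields[0], 16);
      char to = (char) Integer.parseInt(fields[1], 16);
      map.put(from, to);
    }
    return map;
  }
}
```

The lines could come from java.nio.file.Files.readAllLines(path, StandardCharsets.UTF_8), so a user could swap in a different mapping file per language.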
OTOH this would be reinventing the collation wheel (or rather an
oversimplified version of it).
For example, the mapping above only covers modern Greek text.
Ancient/Polytonic Greek has a lot more mappings
that are not needed for modern Greek. Also, I'm pretty sure other
languages have such needs too.
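
One way to avoid maintaining per-language tables is sketched below. It is a different technique from the hand-written mapping above: Unicode canonical decomposition (NFD) via the JDK's java.text.Normalizer, followed by removal of all combining marks. The class name DiacriticStripper is hypothetical. This should also cover polytonic Greek, since the precomposed letters in the Greek Extended block decompose into a base letter plus combining marks.

```java
import java.text.Normalizer;

/** Hypothetical helper: strips diacritics via Unicode decomposition. */
final class DiacriticStripper {
  static String strip(String text) {
    // NFD splits precomposed letters into base letter + combining marks.
    String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
    // \p{M} matches any Unicode mark, so only the base letters remain.
    return decomposed.replaceAll("\\p{M}+", "");
  }
}
```

The results are not identical to the table above in every case: U+03D3 and U+03D4 decompose to the hook symbol U+03D2 rather than to a plain capital upsilon, so a small explicit table may still be needed for such characters.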
I'd like to hear your thoughts on this.
...
Do you have a direct reference to your preferred Greek stemmer class?
The specific class I am referring to is:
http://lucene.apache.org/core/3_6_0/api/all/org/apache/lucene/analysis/el/Gr...
It is included in the lucene-3.6.0.tgz tarball under
contrib/analyzers/common/lucene-analyzers-3.6.0.jar
IIRC the Greek analyzer was introduced in the 3.5.0 release, hence it is
not present in the 3.4.0 stemmers jar.
Thanks,
alex
