Hi Christian,
If you manage to provide me with some appropriate tables for Greek characters, I'll be glad to extend this mapping.
Assuming I understood you correctly, I used the Unicode code points from
http://unicode.org/charts/PDF/U0370.pdf
to produce the following mapping:
{'\u0390', 'ι'},
{'\u03b0', 'υ'},
{'\u03d3', 'Υ'},
{'\u03d4', 'Υ'},
{'\u0386', 'Α'},
{'\u0388', 'Ε'},
{'\u0389', 'Η'},
{'\u038a', 'Ι'},
{'\u03aa', 'Ι'},
{'\u03ca', 'ι'},
{'\u03ab', 'Υ'},
{'\u03cb', 'υ'},
{'\u038c', 'Ο'},
{'\u03ac', 'α'},
{'\u03cc', 'ο'},
{'\u03ad', 'ε'},
{'\u03cd', 'υ'},
{'\u038e', 'Υ'},
{'\u03ae', 'η'},
{'\u03ce', 'ω'},
{'\u038f', 'Ω'},
{'\u03af', 'ι'},
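In case it helps to see how I imagine the table being used, here is a minimal sketch in Java. The class name GreekAccentFolder and the method fold are made up for illustration and are not part of any existing library; the pairs are the ones listed above, written with Unicode escapes so the source stays ASCII-only.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: folds accented modern Greek letters to their base letters.
class GreekAccentFolder {
    // {accented, unaccented} pairs, same order as the mapping in this mail.
    private static final char[][] PAIRS = {
        {'\u0390', '\u03b9'}, // small iota with dialytika and tonos
        {'\u03b0', '\u03c5'}, // small upsilon with dialytika and tonos
        {'\u03d3', '\u03a5'}, // upsilon with acute and hook symbol
        {'\u03d4', '\u03a5'}, // upsilon with diaeresis and hook symbol
        {'\u0386', '\u0391'}, // capital alpha with tonos
        {'\u0388', '\u0395'}, // capital epsilon with tonos
        {'\u0389', '\u0397'}, // capital eta with tonos
        {'\u038a', '\u0399'}, // capital iota with tonos
        {'\u03aa', '\u0399'}, // capital iota with dialytika
        {'\u03ca', '\u03b9'}, // small iota with dialytika
        {'\u03ab', '\u03a5'}, // capital upsilon with dialytika
        {'\u03cb', '\u03c5'}, // small upsilon with dialytika
        {'\u038c', '\u039f'}, // capital omicron with tonos
        {'\u03ac', '\u03b1'}, // small alpha with tonos
        {'\u03cc', '\u03bf'}, // small omicron with tonos
        {'\u03ad', '\u03b5'}, // small epsilon with tonos
        {'\u03cd', '\u03c5'}, // small upsilon with tonos
        {'\u038e', '\u03a5'}, // capital upsilon with tonos
        {'\u03ae', '\u03b7'}, // small eta with tonos
        {'\u03ce', '\u03c9'}, // small omega with tonos
        {'\u038f', '\u03a9'}, // capital omega with tonos
        {'\u03af', '\u03b9'}, // small iota with tonos
    };

    private static final Map<Character, Character> FOLD = new HashMap<>();

    static {
        for (char[] pair : PAIRS) {
            FOLD.put(pair[0], pair[1]);
        }
    }

    // Returns a copy of the input with every mapped character replaced;
    // characters without an entry are passed through unchanged.
    static String fold(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            Character mapped = FOLD.get(c);
            out.append(mapped != null ? mapped.charValue() : c);
        }
        return out.toString();
    }
}
```

Since every character involved is in the Basic Multilingual Plane, a plain char-by-char loop should be enough here.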
I am concerned, though, because this is not always the desired behavior. Sometimes (e.g. in an academic context) I could see the need for accent-sensitive searches. The optimal scenario would be to have these mappings in an easily parsable text file, just like the stopword lists. OTOH this would be reinventing an oversimplified version of the collation wheel. For example, the mapping above only covers modern Greek text; Ancient/polytonic Greek needs many more mappings that modern Greek does not. Also, I'm pretty sure other languages have such needs too.
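To illustrate the "collation wheel" point: a generic alternative to a hand-written table is to decompose the text with Unicode normalization and drop the combining marks. The sketch below uses java.text.Normalizer from the Java standard library; the class AccentStripper and the accentSensitive switch are only my own illustration, not something that exists in the code base. Note that it does not behave exactly like the table above for every character: U+03D3 and U+03D4 decompose to the hook symbol U+03D2 rather than to a plain capital upsilon.

```java
import java.text.Normalizer;

// Hypothetical helper: strips diacritics from any script via Unicode decomposition.
class AccentStripper {
    // When accentSensitive is true the text is returned untouched, which is
    // one way to keep accent-sensitive searches available as an option.
    static String strip(String input, boolean accentSensitive) {
        if (accentSensitive) {
            return input;
        }
        // NFD splits each precomposed letter into base letter + combining marks.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // \p{Mn} matches nonspacing marks (tonos, dialytika, breathings, ...).
        String stripped = decomposed.replaceAll("\\p{Mn}+", "");
        // Recompose whatever is left into the usual NFC form.
        return Normalizer.normalize(stripped, Normalizer.Form.NFC);
    }
}
```

This handles polytonic Greek and accented Latin text without any extra table entries, at the cost of less control over individual characters than an explicit mapping file would give.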
I'd like to hear your thoughts on this.
Do you have a direct reference to your preferred Greek stemmer class?
The specific class I am referring to is:
http://lucene.apache.org/core/3_6_0/api/all/org/apache/lucene/analysis/el/Gr...
It is included in the lucene-3.6.0.tgz tarball under contrib/analyzers/common/lucene-analyzers-3.6.0.jar. IIRC the Greek analyzer was introduced in the 3.5.0 release, hence it is not present in the 3.4.0 stemmers jar.
Thanks, alex