I found a minor bug in my Greek stemmer implementation. After removing two characters from the code, queries such as the following one...
"ΧΑΡΑΚΤΗΡΕΣ" contains text "χαρακτηρ" using stemming using language 'el'
...should now return the same results as the Lucene stemmer. Just try the latest snapshot.
Christian
PS: By the way, I noticed that Lucene also avoids Java's Unicode normalization and uses its own character mappings, most probably to improve performance. The following class is called by the Greek stemmer implementation:
http://www.docjar.com/html/api/org/apache/lucene/analysis/el/GreekLowerCaseF...
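To illustrate the difference between the two approaches, here is a minimal sketch in Python rather than Java. It is not Lucene's code: the function names and the mapping table are hypothetical and cover only a few modern Greek characters. The first function folds text through generic Unicode normalization; the second uses a small hand-written lookup table, which is roughly the idea behind a custom character mapping.

```python
import unicodedata


def fold_by_normalization(text: str) -> str:
    """Generic approach: decompose, drop combining marks, lower-case."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Map final sigma to regular sigma so both approaches agree.
    return stripped.lower().replace("ς", "σ")


# Hypothetical hand-written table (illustrative, not Lucene's actual mapping).
GREEK_FOLD = {
    "ά": "α", "έ": "ε", "ή": "η", "ί": "ι", "ό": "ο", "ύ": "υ", "ώ": "ω",
    "ϊ": "ι", "ϋ": "υ", "ΐ": "ι", "ΰ": "υ",
    "ς": "σ",
}


def fold_by_table(text: str) -> str:
    """Custom approach: lower-case each character, then look it up."""
    return "".join(GREEK_FOLD.get(ch.lower(), ch.lower()) for ch in text)


if __name__ == "__main__":
    for word in ("ά", "ΧΑΡΑΚΤΗΡΕΣ"):
        print(word, fold_by_normalization(word), fold_by_table(word))
```

The table variant only needs one dictionary lookup per character, which is presumably why a dedicated mapping can be faster than full normalization; I have not measured this.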
___________________________________
On Sat, Jun 23, 2012 at 2:24 PM, Christian Grün christian.gruen@gmail.com wrote:
Hi Αλέξανδρος,
> The stemmer OTOH does not seem to be working. I think it needs to be integrated in the same way that the other lucene stemmers are integrated, using the whole lucene-analyzers-3.6.0.jar instead of the lucene-stemmers-3.4.0.jar.
Thanks for your feedback; I already guessed that this might take a little bit more time. Could you provide us with some simple example queries and their expected result? Similar to..
"ά" contains text "α" → true
"..." contains text "..." using stemming using language "el" → ...
Thanks in advance, Christian