Hi Michael,

thanks for your links, I'll keep them in mind.

Christian

PS to everyone: if you believe that better Unicode support is a major concern for you, feel free to raise your hands.
___________________________
Thanks for your feedback. In a nutshell: yes, it's quite a challenge to satisfy the wild range of scenarios BaseX is used for, which is why we need to set priorities, and can't do justice to all users.
I agree. I believe, however, that everybody would benefit from good Unicode support :-)
As a matter of fact, the project is Open Source, and contributions are welcome and needed (as long as features are not financially sponsored).
I'm an open-source author myself, so I know what you mean. Unfortunately, I currently don't have any capacity left for contributing code to BaseX.
Btw, what's your opinion on the Lucene tokenizers and stemmers? As you may know, they also focus on performance. They also bypass Java's Unicode normalization algorithms and do everything by themselves, which is why they may be more relevant to us than the standard Java libraries.
My understanding is that this is mostly for historical reasons, not (primarily) for performance reasons; and some of them are probably hacks and workarounds. Lucene's support for Unicode has a number of problems, and I think they're now moving towards the use of ICU [1], see, e.g.,
http://2010.lucene-eurocon.org/sessions-track1-day1.html#2
Some of this is already available in contrib:
https://issues.apache.org/jira/browse/LUCENE-1488
http://lucene.apache.org/core/3_6_0/api/contrib-icu/index.html
I think it would be a good idea for BaseX to have a look at ICU.
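To make the normalization point a bit more concrete, here is a minimal sketch of what canonical normalization does to text before it is tokenized or compared. It uses only the standard java.text.Normalizer, so it is neither ICU nor Lucene nor BaseX code; the class name, the helper methods and the sample strings are made up for this illustration.

```java
import java.text.Normalizer;

// Hypothetical demo class; not part of BaseX, Lucene or ICU.
class NormalizationSketch {

    // Two strings are canonically equivalent if their NFC forms are identical.
    static boolean canonicallyEqual(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return na.equals(nb);
    }

    // Decompose to NFD, then drop all combining marks (Unicode category M).
    // This is a crude form of diacritics folding, shown only as an example.
    static String stripDiacritics(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        String precomposed = "\u00E9";  // e with acute as a single code point
        String decomposed = "e\u0301";  // e followed by a combining acute accent

        // A plain comparison treats the two spellings as different strings.
        System.out.println(precomposed.equals(decomposed));
        // After normalization they compare as equal.
        System.out.println(canonicallyEqual(precomposed, decomposed));
        // Diacritics folding of a word containing u with diaeresis.
        System.out.println(stripDiacritics("Z\u00FCrich"));
    }
}
```

A full-text index that skips this step would treat the two spellings of the same character as different tokens, which is the kind of problem a library such as ICU is meant to handle more completely.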
Best regards
Footnotes: [1] http://site.icu-project.org/
-- Dr.-Ing. Michael Piotrowski, M.A. mxp@cl.uzh.ch Institute of Computational Linguistics, University of Zurich Phone +41 44 63-54313 | OpenPGP public key ID 0x1614A044
- OUT NOW: Systems and Frameworks for Computational Morphology
- http://www.springeronline.com/978-3-642-23137-7