Hi Christian --
After various adventures re-learning Perl's encoding-management quirks, I generated a simple XML file of all the codepoints between 0x20 and 0xD7FF; that isn't the complete XML character range, but I thought it would be enough to be interesting.
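Roughly, the generation step looks like this (sketched here in Java rather than the Perl I actually used; the element names match the sample below, everything else -- file name, class name, escaping details -- is illustrative):

import java.io.PrintWriter;
import java.text.Normalizer;

// Sketch of the generation step: one <codepoint> element per character
// from U+0020 to U+D7FF. Element names match the sample below; the
// rest is illustrative, not the Perl original.
public class GenCodepoints {
  // decompose canonically, then drop combining marks
  static String strip(String s) {
    return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");
  }
  // minimal escaping so stripped glyphs like '<' keep the file well-formed
  static String esc(String s) {
    return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
  }
  public static void main(String[] args) throws Exception {
    try (PrintWriter out = new PrintWriter("codepoints.xml", "UTF-8")) {
      out.println("<codepoints>");
      for (int cp = 0x20; cp <= 0xD7FF; cp++) {
        String glyph = new String(Character.toChars(cp));
        String base = strip(glyph);
        out.printf("<codepoint><value>U+%04X</value><glyph>%s</glyph>"
            + "<nodiacritic>%s</nodiacritic><basevalue>U+%04X</basevalue></codepoint>%n",
            cp, esc(glyph), esc(base), base.isEmpty() ? cp : base.codePointAt(0));
      }
      out.println("</codepoints>");
    }
  }
}

The escaping helper is, of course, the moral of the well-formedness surprise I mention below.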
If I load that file into the current BaseX dev version (BaseX80-20141128.214728.zip) using the GUI, and *do* turn on full-text indexing and *do not* turn on diacritics,
//text()[contains(.,'<')]
gives me three hits.
<codepoint>
  <value>U+003C</value>
  <glyph><</glyph>
  <nodiacritic><</nodiacritic>
  <basevalue>U+003C</basevalue>
</codepoint>
<codepoint>
  <value>U+226E</value>
  <glyph>≮</glyph>
  <nodiacritic><</nodiacritic>
  <basevalue>U+003C</basevalue>
</codepoint>
I think there "should" be four against the relevant bit of XML with full-text search, since with diacritics off, U+226E should also match. (U+226E's ability to decompose into a less-than sign is one of my very favourite surprises involved in stripping diacritics. What do you mean the document stopped being well-formed...?)
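A quick way to see that decomposition concretely, using plain java.text.Normalizer (nothing BaseX-specific):

import java.text.Normalizer;

// U+226E NOT LESS-THAN canonically decomposes (NFD) into
// U+003C LESS-THAN SIGN + U+0338 COMBINING LONG SOLIDUS OVERLAY,
// so removing the combining mark leaves a bare '<'.
public class Decompose {
  public static void main(String[] args) {
    String nfd = Normalizer.normalize("\u226E", Normalizer.Form.NFD);
    for (char c : nfd.toCharArray()) {
      System.out.printf("U+%04X%n", (int) c);
    }
    // prints: U+003C, then U+0338
  }
}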
How to get the full-text search to confirm this is not obvious;
for $x in //text() where $x contains text { "A" } return $x
happily gives me 101 results, case- and diacritic-insensitively;
for $x in //text() where $x contains text { "<" } return $x
gives me nothing, presumably on the grounds that < isn't a letter.
I suspect ICU is the way to go; keeping an all-Unicode table up to date involves more suffering than anyone should willingly undertake. (I can probably still generate that table for you if you like.)
-- Graydon
On Sun, Nov 23, 2014 at 8:42 PM, Christian Grün <christian.gruen@gmail.com> wrote:
Hi Graydon,
Thanks for your detailed reply, much appreciated.
For today, I decided on a pragmatic solution that supports many more cases than before. I have added some more (glorious) mappings motivated by John Cowan's mail; they can now be found in a new class [1].
However, to push things a bit further, I have rewritten the code for removing diacritics. Normalized tokens may now have a different byte length than the original token, as I'm removing combining marks as well (starting from U+0300, among others).
As a result, the following query will now yield the expected result (true):
(: U+00E9 vs. U+0065 U+0301 :)
let $e1 := codepoints-to-string(233)
let $e2 := codepoints-to-string((101, 769))
return $e1 contains text { $e2 }
I will give some more thought to embracing full Unicode normalization. I fully agree that it makes sense to use standards whenever appropriate. However, one disadvantage for us is that standard normalization usually works on String data, whereas most textual data in BaseX is internally represented as byte arrays. A further challenge is that Java's Unicode support is no longer up to date. For example, I am checking for the diacritical combining marks from Unicode 7.0 (U+1AB0–U+1AFF) that are not detected as such by current versions of Java.
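For illustration, a minimal check of how the two libraries classify a codepoint from that block (assuming ICU4J on the classpath; a JRE whose tables predate Unicode 7.0, such as Java 8 with Unicode 6.2, reports it as unassigned):

import com.ibm.icu.lang.UCharacter;
import com.ibm.icu.lang.UCharacterCategory;

// U+1AB0 COMBINING DOUBLED CIRCUMFLEX ACCENT was added in Unicode 7.0
// (block: Combining Diacritical Marks Extended).
public class UnicodeVersions {
  public static void main(String[] args) {
    int cp = 0x1AB0;
    // false on a JRE whose Unicode tables predate 7.0 (type is UNASSIGNED)
    System.out.println(Character.getType(cp) == Character.NON_SPACING_MARK);
    // true with a current ICU4J
    System.out.println(UCharacter.getType(cp) == UCharacterCategory.NON_SPACING_MARK);
  }
}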
To support the new requirements of XQuery 3.1 (see e.g. [2]), we are already working with ICU [3]; it is loaded dynamically if it is found on the classpath. In the future, we could use it for all of our full-text operations as well, but the optional embedding comes at a price in terms of performance.
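As a sketch of the kind of comparison this enables (not our actual integration), an ICU collator at primary strength ignores both case and diacritics:

import com.ibm.icu.text.Collator;
import com.ibm.icu.util.ULocale;

// At primary strength, a UCA collator ignores case and accents,
// so U+00E9 ("é") compares equal to "E".
public class UcaSketch {
  public static void main(String[] args) {
    Collator coll = Collator.getInstance(ULocale.ROOT);
    coll.setStrength(Collator.PRIMARY);
    System.out.println(coll.compare("\u00E9", "E") == 0); // true
  }
}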
Looking forward to your feedback on the new snapshot,
Christian
[1] https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/ba...
[2] http://www.w3.org/TR/xpath-functions-31/#uca-collations
[3] http://site.icu-project.org/