Hi Graydon,
//text()[contains(.,'<')]
gives me three hits.
I think there "should" be four against the relevant bit of XML with full-text search, since with no diacritics, U+226E should match.
So you would expect this node to be returned as well?
<glyph>≮</glyph>
For this, you'll probably have to call normalize-unicode first:
//text()[contains(normalize-unicode(., 'NFD'), '<')]
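To illustrate why the NFD step helps, here is a minimal sketch in Python rather than XQuery, using the standard unicodedata module. It shows U+226E decomposing into "<" followed by a combining overlay, so a plain substring test succeeds after normalization. The helper name contains_after_nfd is made up for this example and is not part of any XQuery processor.

```python
import unicodedata


def contains_after_nfd(text, needle):
    """Return True if needle occurs in the NFD-normalized form of text."""
    return needle in unicodedata.normalize("NFD", text)


glyph = "\u226e"  # NOT LESS-THAN, the character inside <glyph>

# Canonical decomposition splits the glyph into a base character
# and a combining mark.
decomposed = unicodedata.normalize("NFD", glyph)
code_points = ["U+%04X" % ord(ch) for ch in decomposed]

print(code_points)
print("<" in glyph)                    # substring test on the raw glyph
print(contains_after_nfd(glyph, "<"))  # substring test after NFD
```

The raw glyph is a single code point, so contains() without normalization has nothing to match against.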
for $x in //text() where $x contains text { "<" } return $x
gives me nothing, presumably on the grounds that < isn't a letter.
Exactly. With "contains text", only letters can be found. It would generally be possible to write a tokenizer that also returns other characters as tokens, but there has been no use for that until now (and it would raise many new questions with regard to normalization, with and without ICU).
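As a rough illustration of the effect (not our actual tokenizer, just a simplified Python sketch with a hypothetical function name), a tokenizer that only keeps runs of letters never produces a token for "<", so a full-text search for it cannot match anything:

```python
def letter_tokens(text):
    """Split text into lowercase tokens made of letters only.

    Simplified stand-in for a letter-based full-text tokenizer;
    every non-letter character acts as a separator and is dropped.
    """
    tokens = []
    current = []
    for ch in text:
        if ch.isalpha():
            current.append(ch.lower())
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


print(letter_tokens("a < b"))   # the operator yields no token
print(letter_tokens("<"))       # a query for "<" alone is empty
print(letter_tokens("\u226e"))  # U+226E is a math symbol, not a letter
```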
(I can probably still generate that table for you if you like.)
I think that for now we will stick with the existing tokenization. If it turns out that we need more power, we could think about optionally providing support for ICU as well. However, I'll be glad to have your feedback if you find examples that are currently not, but should be, covered by our diacritics normalization mapping.
If you want to play around with our current ICU support, feel free to download the latest snapshot, add ICU to the classpath, and use the new XQuery 3.1 UCA collation. The new fn:collation-key() function is still a work in progress, but all other collation features should already be available when using the XQuery default string functions.
Thanks for your feedback, Christian