Hi Christian --
On Sat, Nov 29, 2014 at 6:03 PM, Christian Grün <christian.gruen@gmail.com> wrote:
Hi Graydon,
//text()[contains(.,'<')]
gives me three hits.
I think there "should" be four against the relevant bit of XML with full-text search, since with diacritics ignored, U+226E should match.
So you would expect this node to be returned as well?
<glyph>≮</glyph>
For this, you'll probably have to call normalize-unicode first:
//text()[contains(normalize-unicode(., 'NFD'),'<')]
With that query, absolutely I should only get three hits.
My expectation for "full text search" is that it searches the contents of text nodes. (Since I'm not sure there's a coherent way to describe "text" in XML that isn't "contents of text nodes".)
So I would expect that, with a full text search that ignores diacritics, I'd get four hits.
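To make the decomposition concrete, here is a minimal sketch in Python rather than XQuery, using only the standard unicodedata module. The helper names decompose and strip_marks are made up for this illustration, and it assumes that "ignoring diacritics" means NFD normalization followed by dropping combining marks, which may not be how BaseX's full-text engine handles it.

```python
import unicodedata

NOT_LESS_THAN = "\u226e"  # the character inside <glyph>≮</glyph>

def decompose(s):
    """Return the canonical decomposition (NFD) of s."""
    return unicodedata.normalize("NFD", s)

def strip_marks(s):
    """Decompose s, then drop every combining mark (assumed meaning of
    'diacritics insensitive' for this sketch only)."""
    return "".join(ch for ch in decompose(s) if not unicodedata.combining(ch))

decomposed = decompose(NOT_LESS_THAN)
print([f"U+{ord(ch):04X}" for ch in decomposed])  # ['U+003C', 'U+0338']
print("<" in NOT_LESS_THAN)                       # False: the precomposed form has no '<'
print("<" in decomposed)                          # True: NFD exposes the '<'
print(strip_marks(NOT_LESS_THAN))                 # <
```

So a plain substring test misses U+226E, while the same test on the NFD form, or on the form with combining marks removed, finds it.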
for $x in //text() where $x contains text { "<" } return $x
gives me nothing, presumably on the grounds that < isn't a letter.
Exactly. With "contains text", only letters can be found. It would generally be possible to write a tokenizer that also returns other characters as tokens, but there has been no use for that until now (and it would raise many new questions with regard to normalization, with and without ICU).
Entirely understood that the tokenizer only recognizes letters.
I don't think it's clear that "text" in "full text" means "groups of letters". Anything that isn't letters is sort of inherently partaking of the edge-case nature, but it's not too hard to imagine text with equations and strange effects from operators with a decomposable unicode representation.
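The difference between the two kinds of tokenizer can be sketched as follows. This is a hypothetical illustration in Python using regular expressions, not a description of the BaseX tokenizer; the function names letter_tokens and letter_and_symbol_tokens are invented for the example, and "letters" is loosely taken to mean letters and digits.

```python
import re

def letter_tokens(text):
    """Tokens made of letters and digits only; everything else is a separator."""
    return re.findall(r"[^\W_]+", text)

def letter_and_symbol_tokens(text):
    """As above, but each other non-space, non-underscore character
    becomes a token of its own."""
    return re.findall(r"[^\W_]+|[^\w\s]", text)

sample = "a < b"
print(letter_tokens(sample))             # ['a', 'b']
print(letter_and_symbol_tokens(sample))  # ['a', '<', 'b']
```

With the first tokenizer a search for "<" has nothing to match against; with the second it does, at the cost of having to decide how symbols such as U+226E should be normalized.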
[snip]
If you want to play around with our current ICU support, feel free to download the latest snapshot, add ICU to the classpath, and use the new XQuery 3.1 UCA collation. The new fn:collation-key() function is still a work in progress, but all other collation features should already be available when using the XQuery default string functions.
That's very interesting; thank you!
I shall see about taking a poke at that, and maybe trying to produce some performance numbers.
Thanks! Graydon