Re: [basex-talk] More Diacritic Questions

29 Nov 2014


Hi Graydon,
...
//text()[contains(.,'&lt;')]
gives me three hits.
I think there "should" be four against the relevant bit of XML
with full-text search, since with no diacritics, U+226E should match.
So you would expect this node to be returned as well?
<glyph>≮</glyph>
For this, you'll probably have to call normalize-unicode first:
//text()[contains(normalize-unicode(., 'NFD'), '&lt;')]
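
A minimal sketch of why the normalization step helps, assuming a small inline document (the glyphs wrapper element and its contents are made up for illustration) and the default codepoint collation. In NFD, U+226E decomposes into "<" (U+003C) followed by the combining long solidus overlay (U+0338), so a plain substring test can then see the "<":

```xquery
(: hypothetical sample data; only the glyph element name comes from the thread :)
let $doc :=
  <glyphs><glyph>&lt;</glyph><glyph>≮</glyph><glyph>a</glyph></glyphs>
return (
  (: code points of the decomposed character: 60, 824 :)
  string-to-codepoints(normalize-unicode('≮', 'NFD')),
  (: without normalization, only the literal "<" node matches: 1 :)
  count($doc//text()[contains(., '&lt;')]),
  (: after NFD, the U+226E node matches as well: 2 :)
  count($doc//text()[contains(normalize-unicode(., 'NFD'), '&lt;')])
)
```

Note that normalizing inside the predicate is evaluated per text node, so this is a sequential scan rather than an index lookup.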
...
for $x in //text()
where $x contains text { "<" }
return $x
gives me nothing, presumably on the grounds that < isn't a letter.
Exactly. With "contains text", only letters can be found. It would
generally be possible to write a tokenizer that also returns other
characters as tokens, but there has been no use for that until now
(and it would generate many new questions with regard to normalization,
with and without ICU).
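
As a rough illustration of the split described above (the sample strings are invented, and the behaviour of the first expression assumes a processor with XQuery Full Text support such as BaseX): words go through the full-text tokenizer, while a symbol needs a substring test on the decomposed string.

```xquery
(: letters are tokenized, so a diacritics-insensitive full-text match can apply :)
'crème brûlée' contains text 'creme' using diacritics insensitive,

(: "<" is not returned as a token, so fall back to a substring test on the NFD form :)
contains(normalize-unicode('a ≮ b', 'NFD'), '<')
```

Both expressions would be expected to yield true under these assumptions; the second one relies only on standard string functions and not on the full-text tokenizer.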
...
(I can probably still generate that table for you if you like.)
I think that for now we will stick with the existing tokenization. If
it turns out that we need more power, we could think about optionally
providing support for ICU as well. However, I'll be glad to have your
feedback if you find examples that are currently not, but should be,
covered by our diacritics normalization mapping.
If you want to play around with our current ICU support, feel free to
download the latest snapshot, add ICU to the classpath, and use the
new XQuery 3.1 UCA collation. The new fn:collation-key() function is
still a work in progress, but all other collation features should
already be available when using the XQuery default string functions.
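
A hedged sketch of what using the UCA collation with the default string functions could look like. The collation URI and the strength parameter are taken from the XQuery and XPath Functions 3.1 specification; the sample strings are invented, and whether substring functions such as contains() accept this collation depends on the processor and on ICU being on the classpath, so no results are shown:

```xquery
(: UCA collation URI as defined in XQuery and XPath Functions 3.1;
   strength=primary asks the collation to ignore accent and case differences :)
declare variable $uca := 'http://www.w3.org/2013/collation/UCA?strength=primary';

(: compare two strings that differ only in their accents :)
compare('resume', 'résumé', $uca),

(: substring test with the same collation; may raise an error if the
   collation does not support substring matching in the processor at hand :)
contains('crème brûlée', 'creme', $uca)
```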
Thanks for your feedback,
Christian
