Hi Christian --
After various adventures re-learning Perl's encoding-management quirks, I generated a simple XML file of all the codepoints between 0x20 and 0xD7FF; that isn't the complete XML character range, but I thought it would be enough to be interesting.
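Roughly, the generation step looks like this (sketched here in Java rather than the Perl I actually used; the element names match the sample below, everything else -- file name, class name, escaping details -- is illustrative):

import java.io.PrintWriter;
import java.text.Normalizer;

// Sketch of the generation step: one <codepoint> element per character
// from U+0020 to U+D7FF. Element names match the sample below; the
// rest is illustrative, not the Perl original.
public class GenCodepoints {
  // decompose canonically, then drop combining marks
  static String strip(String s) {
    return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");
  }
  // minimal escaping so stripped glyphs like '<' keep the file well-formed
  static String esc(String s) {
    return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
  }
  public static void main(String[] args) throws Exception {
    try (PrintWriter out = new PrintWriter("codepoints.xml", "UTF-8")) {
      out.println("<codepoints>");
      for (int cp = 0x20; cp <= 0xD7FF; cp++) {
        String glyph = new String(Character.toChars(cp));
        String base = strip(glyph);
        out.printf("<codepoint><value>U+%04X</value><glyph>%s</glyph>"
            + "<nodiacritic>%s</nodiacritic><basevalue>U+%04X</basevalue></codepoint>%n",
            cp, esc(glyph), esc(base), base.isEmpty() ? cp : base.codePointAt(0));
      }
      out.println("</codepoints>");
    }
  }
}

The escaping helper is, of course, the moral of the well-formedness surprise I mention below.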
If I load that file into the current BaseX dev version (BaseX80-20141128.214728.zip) using the GUI, and *do* turn on full-text indexing and *do not* turn on diacritics,
//text()[contains(.,'<')]
gives me three hits.
<codepoint>
  <value>U+003C</value>
  <glyph><</glyph>
  <nodiacritic><</nodiacritic>
  <basevalue>U+003C</basevalue>
</codepoint>
<codepoint>
  <value>U+226E</value>
  <glyph>≮</glyph>
  <nodiacritic><</nodiacritic>
  <basevalue>U+003C</basevalue>
</codepoint>
I think there "should" be four against the relevant bit of XML with full-text search, since with diacritics off, U+226E should also match. (U+226E's ability to decompose into a less-than sign is one of my very favourite surprises involved in stripping diacritics. What do you mean the document stopped being well-formed...?)
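A quick way to see that decomposition concretely, using plain java.text.Normalizer (nothing BaseX-specific):

import java.text.Normalizer;

// U+226E NOT LESS-THAN canonically decomposes (NFD) into
// U+003C LESS-THAN SIGN + U+0338 COMBINING LONG SOLIDUS OVERLAY,
// so removing the combining mark leaves a bare '<'.
public class Decompose {
  public static void main(String[] args) {
    String nfd = Normalizer.normalize("\u226E", Normalizer.Form.NFD);
    for (char c : nfd.toCharArray()) {
      System.out.printf("U+%04X%n", (int) c);
    }
    // prints: U+003C, then U+0338
  }
}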
How to get the full-text search to confirm this is not obvious;
for $x in //text() where $x contains text { "A" } return $x
happily gives me 101 results, case- and diacritic-insensitively;
for $x in //text() where $x contains text { "<" } return $x
gives me nothing, presumably on the grounds that < isn't a letter.
I suspect ICU is the way to go; keeping an all-Unicode table up to date involves more suffering than anyone should willingly undertake. (I can probably still generate that table for you if you like.)
-- Graydon
On Sun, Nov 23, 2014 at 8:42 PM, Christian Grün <christian.gruen@gmail.com> wrote:
Hi Graydon,
Thanks for your detailed reply, much appreciated.
For today, I decided on a pragmatic solution that supports many more cases than before. I have added some more (glorious) mappings motivated by John Cowan's mail; they can now be found in a new class [1].
However, to push things a bit further, I have rewritten the code for removing diacritics. Normalized tokens may now have a different byte length than the original token, as I'm removing combining marks as well (starting from U+0300, among others).
As a result, the following query will now yield the expected result (true):
(: U+00E9 vs. U+0065 U+0301 :)
let $e1 := codepoints-to-string(233)
let $e2 := codepoints-to-string((101, 769))
return $e1 contains text { $e2 }
I will give some more thought to embracing full Unicode normalization. I fully agree that it makes sense to use standards whenever appropriate. However, one disadvantage for us is that standard normalization usually works on String data, whereas most textual data in BaseX is internally represented as byte arrays. A further challenge is that Java's Unicode support is no longer up to date. For example, I am checking for the diacritical combining marks from Unicode 7.0 (U+1AB0–U+1AFF) that are not detected as such by current versions of Java.
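For illustration, a minimal check of how the two libraries classify a codepoint from that block (assuming ICU4J on the classpath; a JRE whose tables predate Unicode 7.0, such as Java 8 with Unicode 6.2, reports it as unassigned):

import com.ibm.icu.lang.UCharacter;
import com.ibm.icu.lang.UCharacterCategory;

// U+1AB0 COMBINING DOUBLED CIRCUMFLEX ACCENT was added in Unicode 7.0
// (block: Combining Diacritical Marks Extended).
public class UnicodeVersions {
  public static void main(String[] args) {
    int cp = 0x1AB0;
    // false on a JRE whose Unicode tables predate 7.0 (type is UNASSIGNED)
    System.out.println(Character.getType(cp) == Character.NON_SPACING_MARK);
    // true with a current ICU4J
    System.out.println(UCharacter.getType(cp) == UCharacterCategory.NON_SPACING_MARK);
  }
}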
To support the new requirements of XQuery 3.1 (see e.g. [2]), we are already working with ICU [3]; it is loaded dynamically if it is found on the classpath. In the future, we could use it for all of our full-text operations as well, but the optional embedding comes at a price in terms of performance.
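As a sketch of the kind of comparison this enables (not our actual integration), an ICU collator at primary strength ignores both case and diacritics:

import com.ibm.icu.text.Collator;
import com.ibm.icu.util.ULocale;

// At primary strength, a UCA collator ignores case and accents,
// so U+00E9 ("é") compares equal to "E".
public class UcaSketch {
  public static void main(String[] args) {
    Collator coll = Collator.getInstance(ULocale.ROOT);
    coll.setStrength(Collator.PRIMARY);
    System.out.println(coll.compare("\u00E9", "E") == 0); // true
  }
}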
Looking forward to your feedback on the new snapshot,
Christian
[1] https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/ba...
[2] http://www.w3.org/TR/xpath-functions-31/#uca-collations
[3] http://site.icu-project.org/