normalize-unicode function
Hi,

I tried some Japanese characters in fn:normalize-unicode.

FULLWIDTH DIGIT, FULLWIDTH LATIN CAPITAL LETTER:
fn:normalize-unicode("１２３４５６７８９０ａｂｃｄｅｆｇ", "NFKC") returns 1234567890abcdefg

FULLWIDTH EXCLAMATION MARK:
fn:normalize-unicode("！", "NFKC") returns !

FULLWIDTH LESS-THAN SIGN, FULLWIDTH GREATER-THAN SIGN:
fn:normalize-unicode("＜＞", "NFKC") returns <>

FULLWIDTH AMPERSAND:
fn:normalize-unicode("＆", "NFKC") returns &

These are as expected. But:

FULLWIDTH QUOTATION MARK, FULLWIDTH APOSTROPHE:
fn:normalize-unicode("”’", "NFKC") returns ”’

In this case, I think that "' should be returned.

Best regards,
Toshio HIRAI
Toshio san,

thanks for your e-mail.

> FULLWIDTH QUOTATION MARK, FULLWIDTH APOSTROPHE: fn:normalize-unicode("”’", "NFKC") returns ”’

As far as I can judge, the result is actually correct; it’s what Java’s standard Unicode algorithms return, and other XQuery processors (Saxon, Zorba, XMLPrime, etc.) return the same. I may need to do more research on how to normalize quotes the way you’d like, though.

Hope this helps (at least a little),
Christian
Hi Christian,

Thanks for your help. I checked the relevant part of the program and confirmed that java.text.Normalizer.normalize() returns the full-width double quotation mark that we usually use (U+201D) without converting it.

What I would hope for at present: U+201C and U+201D should return U+0022, and U+2018 and U+2019 should return U+0027. However, this would not be enough for users all over the world (U+201A, U+201E, other punctuation marks, etc.). Furthermore, I am not convinced that you should implement this if it means deviating from the standard behavior of Java.

Best regards,
Toshio

2012/6/28 Christian Grün <christian.gruen@gmail.com>:
> [...]
Dear Toshio,

On 28.06.2012 04:41, Toshio HIRAI wrote:
> FULLWIDTH QUOTATION MARK, FULLWIDTH APOSTROPHE: fn:normalize-unicode("”’", "NFKC") returns ”’
> In this case, I think that "' should be returned.

This is just a different representation of the same characters; BaseX only escapes quotes when necessary (e.g. inside attributes). This query, for example, returns `true()`:

normalize-unicode('&#xFF02;&#xFF07;', 'NFKC') eq '&quot;&apos;'

The first two character entities are *FULLWIDTH QUOTATION MARK* [1] and *FULLWIDTH APOSTROPHE* [2].

Hope that helps,
cheers, Leo

[1] http://www.fileformat.info/info/unicode/char/ff02/index.htm
[2] http://www.fileformat.info/info/unicode/char/ff07/index.htm
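For anyone who wants to check the distinction outside BaseX, the following is a minimal Java sketch using java.text.Normalizer, the JDK class mentioned earlier in the thread. The class name NfkcQuotes and the helper nfkc are made up for illustration; this is not BaseX code. It only shows that NFKC maps the fullwidth forms to ASCII, while the curly quotes U+201D and U+2019 have no compatibility mapping and come back unchanged.

```java
import java.text.Normalizer;

// Illustrative only: the class and method names are assumptions, not part of BaseX.
final class NfkcQuotes {
    // Apply Unicode normalization form NFKC to a string.
    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // U+FF02 FULLWIDTH QUOTATION MARK, U+FF07 FULLWIDTH APOSTROPHE
        String fullwidth = "\uFF02\uFF07";
        // U+201D RIGHT DOUBLE QUOTATION MARK, U+2019 RIGHT SINGLE QUOTATION MARK
        String curly = "\u201D\u2019";

        // Fullwidth forms carry a compatibility mapping to ASCII.
        System.out.println(nfkc(fullwidth).equals("\"'")); // prints true
        // The curly quotes have no decomposition, so NFKC returns them as they are.
        System.out.println(nfkc(curly).equals(curly));     // prints true
    }
}
```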
Hi Leo,

Thanks for your help. I now understand that what I had believed to be 'FULLWIDTH QUOTATION MARK' was actually 'RIGHT DOUBLE QUOTATION MARK'.

'RIGHT DOUBLE QUOTATION MARK' (U+201D): [1]
'RIGHT SINGLE QUOTATION MARK' (U+2019): [2]

However, *RIGHT DOUBLE QUOTATION MARK* is what gets entered when a Japanese user types on a Japanese keyboard.

normalize-unicode('&#xFF02;&#xFF07;', 'NFKC') eq '&quot;&apos;' returns true

Yes, I understand that. But with the Japanese glyphs,

normalize-unicode('”’', 'NFKC') eq '&quot;&apos;' returns false

This is the same as the following:

normalize-unicode('&#x201D;&#x2019;', 'NFKC') eq '&quot;&apos;' returns false

This had been troubling me; it seems to be a Unicode mapping problem in Java (for Japanese).

What I want to do is normalize the symbols and numbers that appear in Japanese text. However, if this is a problem peculiar to Japanese, it cannot be helped, and I could only get around it with the following workaround:

normalize-unicode(fn:translate('”’', '”’', '&quot;&apos;'), "NFKC") eq '&quot;&apos;' returns true

Is there a better way?

Thanks,
Toshio

[1] http://www.fileformat.info/info/unicode/char/201d/index.htm
[2] http://www.fileformat.info/info/unicode/char/2019/index.htm

2012/6/28 Leonard Wörteler <leonard.woerteler@uni-konstanz.de>:
> [...]
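The translate-then-normalize workaround can also be sketched on the Java side. This is only an illustration under stated assumptions: the class QuoteFolding and the method foldQuotesThenNfkc are hypothetical names, and the mapping covers just the four code points Toshio listed (U+201C, U+201D to U+0022; U+2018, U+2019 to U+0027). As noted in the thread, that set is not complete for other languages.

```java
import java.text.Normalizer;

// Illustrative only: names and mapping are assumptions taken from the thread, not BaseX code.
final class QuoteFolding {
    // Fold the four curly quotes named in the thread to ASCII, then apply NFKC
    // so that fullwidth digits, letters and symbols are normalized as well.
    static String foldQuotesThenNfkc(String s) {
        String folded = s
            .replace('\u201C', '"')   // LEFT DOUBLE QUOTATION MARK
            .replace('\u201D', '"')   // RIGHT DOUBLE QUOTATION MARK
            .replace('\u2018', '\'')  // LEFT SINGLE QUOTATION MARK
            .replace('\u2019', '\''); // RIGHT SINGLE QUOTATION MARK
        return Normalizer.normalize(folded, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Curly quotes become ASCII quotes.
        System.out.println(foldQuotesThenNfkc("\u201D\u2019").equals("\"'")); // prints true
        // Fullwidth digits one and two are still handled by NFKC itself.
        System.out.println(foldQuotesThenNfkc("\uFF11\uFF12\u201D").equals("12\"")); // prints true
    }
}
```

Doing the quote folding as a separate step before normalization keeps the deviation from standard NFKC explicit, which matches the concern raised above about not changing Java's standard behavior.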
I just noticed that this part had been hard-coded in the Japanese tokenizer...

https://github.com/BaseXdb/basex/blob/master/src/main/java/org/basex/util/ft...

2012/6/29 Toshio HIRAI <toshio.hirai@gmail.com>:
> [...]
participants (3)
- Christian Grün
- Leonard Wörteler
- Toshio HIRAI