Hi,
I tried fn:normalize-unicode with some Japanese (fullwidth) characters.
FULLWIDTH DIGIT, FULLWIDTH LATIN SMALL LETTER: fn:normalize-unicode("１２３４５６７８９０ａｂｃｄｅｆｇ", "NFKC") returns 1234567890abcdefg
FULLWIDTH EXCLAMATION MARK: fn:normalize-unicode("！", "NFKC") returns !
FULLWIDTH LESS-THAN SIGN, FULLWIDTH GREATER-THAN SIGN: fn:normalize-unicode("＜＞", "NFKC") returns <>
FULLWIDTH AMPERSAND: fn:normalize-unicode("＆", "NFKC") returns &
These results are as expected. But:
FULLWIDTH QUOTATION MARK, FULLWIDTH APOSTROPHE: fn:normalize-unicode("”’", "NFKC") returns ”’
In this case, I think that "' should be returned.
Best regards,
Toshio HIRAI
Toshio san,
thanks for your e-mail.
FULLWIDTH QUOTATION MARK, FULLWIDTH APOSTROPHE: fn:normalize-unicode("”’", "NFKC") returns ”’
As far as I can judge, the result is actually correct; it’s what Java’s standard Unicode algorithms return, and other XQuery processors (Saxon, Zorba, XMLPrime, etc.) return the same. I may need to do more research on how to normalize quotes the way you’d like them to be, though.
Hope this helps (at least a little), Christian
Hi Christian,
Thanks for your help. I checked the relevant part of the program and confirmed that java.text.Normalizer.normalize() returns the double quotation mark we usually use (U+201D) without converting it.
What I would hope for at present is that U+201C and U+201D are returned as U+0022, and U+2018 and U+2019 as U+0027.
However, this would not be enough for users worldwide (U+201A, U+201E, other punctuation marks, etc.). Furthermore, I am not convinced that you should implement it, since that would mean deviating from Java's standard behavior.
Best regards, Toshio
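For reference, the behavior described above can be reproduced with a minimal Java sketch. It assumes only the JDK's java.text.Normalizer; the class name NfkcQuoteCheck and the helper nfkc are made up for illustration and are not part of BaseX.

```java
import java.text.Normalizer;

public class NfkcQuoteCheck {
    // Illustrative helper: applies NFKC using the JDK's built-in normalizer.
    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // U+FF02 FULLWIDTH QUOTATION MARK, U+FF07 FULLWIDTH APOSTROPHE
        String fullwidth = "\uFF02\uFF07";
        // U+201D RIGHT DOUBLE QUOTATION MARK, U+2019 RIGHT SINGLE QUOTATION MARK
        String curly = "\u201D\u2019";

        // Fullwidth forms have compatibility mappings to ASCII.
        System.out.println(nfkc(fullwidth).equals("\"'")); // prints true
        // The curly quotes have no compatibility mapping and stay as they are.
        System.out.println(nfkc(curly).equals(curly));     // prints true
    }
}
```

The second check shows why the result looks unchanged: NFKC defines no mapping from U+201D or U+2019 to the ASCII quotes, so any such folding would have to be an extra step.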
Dear Toshio,
On 28.06.2012 at 04:41, Toshio HIRAI wrote:
FULLWIDTH QUOTATION MARK, FULLWIDTH APOSTROPHE: fn:normalize-unicode("”’", "NFKC") returns ”’
In this case, I think that "' should be returned.
This is just a different representation of the same characters; BaseX only escapes quotes when necessary (e.g. inside attributes). This query, for example, returns `true()`:
normalize-unicode('&#xFF02;&#xFF07;', 'NFKC') eq '&quot;&apos;'
The first two character entities are *FULLWIDTH QUOTATION MARK* [1] and *FULLWIDTH APOSTROPHE* [2].
Hope that helps, cheers, Leo
[1] http://www.fileformat.info/info/unicode/char/ff02/index.htm [2] http://www.fileformat.info/info/unicode/char/ff07/index.htm
Hi Leo,
Thanks for your help. I now understand that what I had believed to be 'FULLWIDTH QUOTATION MARK' was actually 'RIGHT DOUBLE QUOTATION MARK'.
'RIGHT DOUBLE QUOTATION MARK' (U+201D): [1] 'RIGHT SINGLE QUOTATION MARK' (U+2019): [2]
However, *RIGHT DOUBLE QUOTATION MARK* is the character that gets entered when a Japanese user types on a Japanese keyboard.
normalize-unicode('&#xFF02;&#xFF07;', 'NFKC') eq '&quot;&apos;' returns true
Yes, I understand that. But with the Japanese glyphs,
normalize-unicode('”’', 'NFKC') eq '&quot;&apos;' returns false
This sample is equivalent to the following one.
normalize-unicode('&#x201D;&#x2019;', 'NFKC') eq '&quot;&apos;' returns false
Although this troubled me earlier, it seems to be a Unicode mapping problem in Java (for Japanese).
What I want to do is normalize things such as symbols and numbers that appear in Japanese text. However, if this is a problem peculiar to Japanese, it cannot be helped; I had no choice but to work around it with the following solution.
normalize-unicode(fn:translate('”’', '”’', '&quot;&apos;'), "NFKC") eq '&quot;&apos;' returns true
Is there a better way to do this?
Thanks, Toshio
[1] http://www.fileformat.info/info/unicode/char/201d/index.htm [2] http://www.fileformat.info/info/unicode/char/2019/index.htm
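The translate-then-normalize workaround above could be sketched in Java along the following lines. This is only an illustration under stated assumptions: the class QuoteFolding and the method foldQuotesThenNfkc are hypothetical names, and the set of folded characters is limited to the four code points mentioned in this thread (U+201C, U+201D, U+2018, U+2019).

```java
import java.text.Normalizer;

public class QuoteFolding {
    // Hypothetical helper: folds the four curly quotes named in the thread
    // to their ASCII counterparts, then applies NFKC to the result.
    static String foldQuotesThenNfkc(String s) {
        String folded = s
            .replace('\u201C', '"')   // LEFT DOUBLE QUOTATION MARK
            .replace('\u201D', '"')   // RIGHT DOUBLE QUOTATION MARK
            .replace('\u2018', '\'')  // LEFT SINGLE QUOTATION MARK
            .replace('\u2019', '\''); // RIGHT SINGLE QUOTATION MARK
        return Normalizer.normalize(folded, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // U+201D and U+2019 are folded explicitly before normalization.
        System.out.println(foldQuotesThenNfkc("\u201D\u2019").equals("\"'")); // prints true
    }
}
```

Other punctuation such as U+201A or U+201E is deliberately left untouched here; extending the mapping would be an application-level decision rather than something NFKC provides.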
I have just noticed that this part is hard-coded in the Japanese tokenizer...
https://github.com/BaseXdb/basex/blob/master/src/main/java/org/basex/util/ft...
basex-talk@mailman.uni-konstanz.de