Hello,
I have a question about the BaseX ft:normalize function. What kind of Unicode normalization is performed by this function, and how might it be implemented using standard XPath functions?
Thanks in advance!
Tim
--
Tim A. Thompson (*he, him*)
Librarian for Applied Metadata Research
Yale University Library
www.linkedin.com/in/timathompson
timothy.thompson@yale.edu
Hi Tim,
The function is based on a custom BaseX tokenization, which includes normalization of case, removal of diacritics and (if enabled) language-based stemming. It would be rather challenging to implement the behavior with standard XPath (that’s mostly why we introduced ft:tokenize and ft:normalize). If you are looking for a starting point, you could begin with the FtTokenize Java class [1].
Hope this helps,
Christian
[1] https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/ba...
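
As a rough illustration of the two steps named above (case normalization and removal of diacritics), here is a minimal Python sketch. It is an approximation under stated assumptions, not the BaseX implementation: the function name `normalize_approx` is hypothetical, and it relies on Unicode NFD decomposition rather than on the custom tokenization in FtTokenize, so its results may differ from ft:normalize. Stemming is not covered.

```python
import unicodedata

def normalize_approx(text: str) -> str:
    """Approximate case normalization plus diacritic removal (not BaseX's own algorithm)."""
    # NFD splits precomposed characters into base letter + combining marks,
    # e.g. U+00FC (u with diaeresis) becomes "u" followed by U+0308.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop nonspacing marks (general category Mn), the class that \p{Mn} matches.
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return stripped.lower()

print(normalize_approx("Christian Gr\u00fcn"))  # christian grun
# Letters without a canonical decomposition are left untouched:
print(normalize_approx("S\u00f8ren").isascii())  # False
```

With standard XPath functions, the same idea would presumably look like `lower-case(replace(normalize-unicode($s, 'NFD'), '\p{Mn}', ''))`. As the second call shows, this approach leaves characters such as U+00F8 in place, so the output is not guaranteed to be plain ASCII.
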
Thanks, Christian. What is the effective character set used when diacritics are removed? Latin-1?
Tim
It’s US-ASCII (7 bit).
Thanks again. Does BaseX support any Unicode block properties, such as \p{InCombiningDiacriticalMarks}, in regex functions? \p{Mn} works, but \p{InCombiningDiacriticalMarks} doesn't seem to.
Tim
Hi Tim,
Yes, it does. The confusion may have been caused by a little typo [1]: block names take the prefix Is, not In. Try this:
matches('̀', '\p{IsCombiningDiacriticalMarks}')
Best,
Christian
[1] https://www.w3.org/TR/xmlschema-2/#charcter-classes
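
To make the difference between the two escapes discussed in this thread concrete, here is a small Python sketch. \p{Mn} selects a general category (nonspacing marks in any script), while the Combining Diacritical Marks block is a fixed code point range, U+0300 to U+036F. The helper names are hypothetical, and the block is checked by range because Python's standard `re` module has no \p{...} escapes.

```python
import unicodedata

# Code point range of the "Combining Diacritical Marks" block.
BLOCK_START, BLOCK_END = 0x0300, 0x036F

def in_combining_diacritical_marks(ch: str) -> bool:
    """True if the character lies in the Combining Diacritical Marks block."""
    return BLOCK_START <= ord(ch) <= BLOCK_END

def is_nonspacing_mark(ch: str) -> bool:
    """True if the character has general category Mn."""
    return unicodedata.category(ch) == "Mn"

grave = "\u0300"  # COMBINING GRAVE ACCENT
sheva = "\u05b0"  # HEBREW POINT SHEVA
print(in_combining_diacritical_marks(grave), is_nonspacing_mark(grave))  # True True
print(in_combining_diacritical_marks(sheva), is_nonspacing_mark(sheva))  # False True
```

The category test is therefore the broader of the two: it also catches marks that live outside that one block.
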
participants (2)
- Christian Grün
- Tim Thompson