Re: [basex-talk] Full-text lemmatizing and xml:lang

2 Jul 2017


Hi Kristian,
Right now, xml:lang attributes are completely ignored when indexing
full-text. It’s an interesting idea to exclude texts that are marked
with languages different to the one that is currently applied; I will
think about it.
However, I should have mentioned that the language option is mostly
irrelevant unless you use stemmers. Tokenization is pretty much the
same for Western texts, so searches like the following one…
'Добрый ДЕНЬ!' contains text 'день'
    using language 'en'
…will still give you the expected result. To some extent, this also
applies to Arabic texts:
'يوم سعيد' contains text 'يوم'
    using language 'en'
Things are definitely different if you work with Japanese or Chinese
texts. The following query yields false:
'今日は' contains text '今'
    using language 'en'
For more information on Japanese tokenization, see Toshio HIRAI’s
article in our wiki [1].
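For contrast with the tokenization examples above, here is a minimal
sketch of a case where the language option does come into play, namely
stemming. It assumes that an English stemmer is available in your
installation; the sample strings are made up for illustration:

```xquery
(
  (: No stemming: tokens are compared as they are, so 'houses' and
     'house' are not expected to match. :)
  'houses' contains text 'house',

  (: With stemming, both terms should be reduced to the same stem by
     the English stemmer, so a match is expected here. :)
  'houses' contains text 'house'
    using stemming
    using language 'en'
)
```

With a language for which no stemmer is available, the second
comparison may behave differently or be rejected, which is why the
option matters in this case.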
Hope this helps,
Christian
[1] http://docs.basex.org/wiki/Full-Text:_Japanese
...
How does BaseX behave if the database content is in many different languages
and is correctly marked with xml:lang attributes? Does the full-text index
consider this information and apply full-text indexing only to elements with
a matching language?
As a simple illustration (untested sketch): will the following code create
a full-text index only for the Russian text, or for both the Russian and the
English text?
db:create(
    'db-ft-ru',
    <texts>
      <text xml:lang="ru">something in Russian</text>
      <text xml:lang="en">something in English</text>
    </texts>,
    'texts.xml',
    map { 'ftindex': true(), 'language': 'ru' }
)
If BaseX does create the full-text index for both languages (the entries for
the English text would be useless noise), I would propose simply filtering
elements by their xml:lang attribute, according to the language given in the
options map passed along with ftindex. This should be simpler to implement
than the diversification suggested by Christian.
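Until something like this exists, the filtering could be approximated on
the user side before the database is created. The following is only a
sketch under assumptions: the variable names and the 'texts.xml' path are
made up for illustration, and it relies on the standard fn:lang function
to select elements by their xml:lang value:

```xquery
(: Hypothetical input; in practice this would be the real document. :)
let $texts :=
  <texts>
    <text xml:lang="ru">something in Russian</text>
    <text xml:lang="en">something in English</text>
  </texts>
(: Language used both for filtering and for the full-text index. :)
let $lang := 'ru'
(: Keep only the elements whose xml:lang matches the chosen language. :)
let $filtered := <texts>{ $texts/text[lang($lang)] }</texts>
return db:create(
  'db-ft-' || $lang,
  $filtered,
  'texts.xml',
  map { 'ftindex': true(), 'language': $lang }
)
```

This creates one database per language rather than filtering inside a
single index, so it is a workaround and not the proposed feature itself.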
Best regards,
Kristian K
