Hi Kristian,
Right now, xml:lang attributes are completely ignored when indexing full-text. It’s an interesting idea to exclude texts that are marked with languages different to the one that is currently applied; I will think about it.
However, I should have mentioned that the language option is mostly irrelevant unless you use stemmers. Tokenization is pretty much the same for Western texts, so searches like the following one…
'Добрый ДЕНЬ!' contains text 'день' using language 'en'
…will still give you the expected result. To some extent, this also applies to Arabic texts:
'يوم سعيد' contains text 'يوم' using language 'en'
Things are definitely different if you work with Japanese or Chinese texts. The following query yields false:
'今日は' contains text '今' using language 'en'
For more information on Japanese tokenization, see Toshio HIRAI’s article in our wiki [1].
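To illustrate the stemming point above, here is a minimal sketch (assuming a stemmer for the given language is available in your installation). With stemming enabled, the language option selects the stemmer, so it starts to matter:

  (: stemming with the English stemmer: 'houses' and 'house'
     should be reduced to the same stem :)
  'houses' contains text 'house' using stemming using language 'en',
  (: without stemming, the two tokens are compared as they are :)
  'houses' contains text 'house' using language 'en'

The first expression is expected to match and the second one is not; with a different language code, the result of the first expression may change, depending on the stemmer that is applied.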
Hope this helps, Christian
[1] http://docs.basex.org/wiki/Full-Text:_Japanese
How does BaseX behave if the database content is in many different languages and is correctly marked with xml:lang attributes? Does the full-text index consider this information and apply full-text indexing only to elements with a matching language?
As a simple illustration (untested): will the following code create a full-text index only for the Russian text, or for both the Russian and the English text?
db:create(
  'db-ft-ru',
  <texts>
    <text xml:lang="ru">something in Russian</text>
    <text xml:lang="en">something in English</text>
  </texts>,
  'texts',
  map { 'ftindex': true(), 'language': 'ru' }
)
If BaseX does create the full-text index for both languages (the English part of the index would then contain useless scramble), I would propose a simple filtering by xml:lang attribute, according to the language given in the options map next to ftindex. This should be simpler to implement than the diversifying suggested by Christian.
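Until something like this exists, the filtering could be done by hand before the database is created. The following is only a rough, untested sketch: the variable names $input and $russian are made up for illustration, and it relies on the standard fn:lang function, which checks the xml:lang value that applies to the context node:

  let $input :=
    <texts>
      <text xml:lang="ru">something in Russian</text>
      <text xml:lang="en">something in English</text>
    </texts>
  (: keep only the elements whose language is Russian :)
  let $russian := <texts>{ $input/text[lang('ru')] }</texts>
  return db:create(
    'db-ft-ru',
    $russian,
    'texts',
    map { 'ftindex': true(), 'language': 'ru' }
  )

The drawback is that the English texts are not stored in this database at all, so they would need a second database with its own language setting.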
Best regards Kristian K