Re: [basex-talk] multi-language full-text indexing

22 Apr 2015


Reminds me of an old GitHub issue. I have added a link to your
request: https://github.com/BaseXdb/basex/issues/59.
On Wed, Apr 22, 2015 at 11:35 AM, Goetz Heller heller@hellerim.de wrote:
...
Here's another addendum: Even if multi-language full-text indexing is not going to be implemented in the near future, it would still be a useful feature to be able to restrict full-text indexing to parts of a document, e.g.
CREATE FULL-TEXT INDEX ON DATABASE XY STARTING WITH (
        (path_a)/PART_A,
        (path_b)/PART_B, …
)
Kind regards,
Goetz
-----Original Message-----
From: Christian Grün [mailto:christian.gruen@gmail.com]
Sent: Wednesday, 22 April 2015 11:03
To: Goetz Heller
Cc: BaseX
Subject: Re: [basex-talk] multi-language full-text indexing
...
It is desirable to have
documents indexed by locale-specific parts, e.g.
I can see that this would absolutely make sense, but it would be quite some effort to implement it. There are also various conceptual issues related to XQuery Full Text: If you don't specify the language in the query, we'd need to dynamically decide which stemmers to use for the query strings, depending on the nodes that are currently targeted.
This would pretty much blow up the existing architecture.
As there are so many other types of index structures that could be helpful, depending on the particular use case, we usually recommend that users create additional BaseX databases, which can then serve as indexes. This can all be done in XQuery. I remember there have been various examples of this on this mailing list (see e.g. [1,2]).
Hope this helps,
Christian
[1] https://www.mail-archive.com/basex-talk@mailman.uni-konstanz.de/msg04837.htm...
[2] https://www.mail-archive.com/basex-talk@mailman.uni-konstanz.de/msg06089.htm...
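A minimal sketch of the separate-database approach Christian describes, assuming BaseX's db:create function and its options map (ftindex, language, stemming). The database name XY, the element LOCALIZED_PART_A and the @LANG attribute are borrowed from the request; the lower-case language codes and the XY-ft- prefix are hypothetical. Stopword lists are left out. In recent BaseX versions, db:open has been renamed to db:get.

```xquery
(: Assumption: database 'XY' exists and marks localized parts with @LANG.
   One small database per language is created, each with its own
   full-text index built for that language. :)
for $lang in ('de', 'en')
let $parts := db:open('XY')//LOCALIZED_PART_A[@LANG = $lang]
return db:create(
  'XY-ft-' || $lang,             (: hypothetical naming scheme :)
  <parts>{ $parts }</parts>,     (: copy of the language-specific parts :)
  'parts.xml',
  map { 'ftindex': true(), 'language': $lang, 'stemming': true() }
)
```

Since database updates only become visible after the creating query has finished, the lookup would be run as a separate query against the language-specific database:

```xquery
(: The match options should mirror those used when the index was built;
   otherwise the full-text index may not be applied. :)
db:open('XY-ft-de')//LOCALIZED_PART_A[
  . contains text 'Haus' using language 'de' using stemming
]
```

Each of these databases holds only one language, so each index stays small, which is the effect the request was after.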
...
CREATE FULL-TEXT INDEX ON DATABASE XY STARTING WITH (
(path_a)/LOCALIZED_PART_A[@LANG=$lang],
(path_b)/LOCALIZED_PART_B[@LG=$lang],…
) FOR LANGUAGE $lang IN (
BG,
DN,
DE WITH STOPWORDS filepath_de WITH STEM = YES,
EN WITH STOPWORDS filepath_en,
FR, …
)  [USING language_code_map]
and then to write full-text retrieval queries with a clause such as
‘FOR LANGUAGE BG’, for example. The index parts would be much smaller
and full-text retrieval therefore much faster. The language codes
would be mapped somehow to standard values recognized by BaseX in the
language_code_map file.
Are there any efforts towards such a feature?
