Re: [basex-talk] Full-text lemmatizing and xml:lang

3 Jul 2017

To be sure I understood you correctly:
...

If STEMMING is set to true, then the input to the stemmer should be
filtered by matching the xml:lang and the LANGUAGE option. Text that is sent
to the tokenizer could be left as is and not be filtered by matching
LANGUAGE (see next point).
So you would prefer to have all words indexed, but reduce the stemming
step to the chosen language, right?
To give an example:
<xml>
  <div xml:lang='de'>Häuser</div>
  <div xml:lang='en'>houses</div>
</xml>
If stemming is enabled, and if language is 'de', the index would
include the two terms 'Haus' (stemmed German form) and 'houses'
(original English form).
The query…
//div[text() contains text { "houses","Häuser" }
  using language 'de'
  using stemming
]
…would only return the German div element: the German stemmer rewrites
'Häuser' to 'Haus', which is found in the index, and 'houses' to 'hou',
which matches no index term.
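
To make the proposed behaviour concrete, here is a minimal Python sketch of it. It is not BaseX code: the names build_index, query and stem_de are hypothetical, and the "stemmer" is a toy lookup table holding only the two rewrites quoted above. It also assumes xml:lang is set directly on each div (no inheritance) and skips case folding.

```python
import xml.etree.ElementTree as ET

# Expanded attribute name that ElementTree uses for xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Toy stand-in for a German stemmer: only the two rewrites quoted in this thread.
TOY_GERMAN_STEMS = {"Häuser": "Haus", "houses": "hou"}


def stem_de(token):
    """Return the toy German stem, or the token unchanged if it is unknown."""
    return TOY_GERMAN_STEMS.get(token, token)


def build_index(xml_text, language):
    """Index every token, but stem only text whose xml:lang equals `language`."""
    index = {}
    for div in ET.fromstring(xml_text).iter("div"):
        same_lang = div.get(XML_LANG) == language
        for token in (div.text or "").split():
            term = stem_de(token) if same_lang else token
            index.setdefault(term, []).append(div.text)
    return index


def query(index, terms):
    """Stem every query term with the query language, then look it up."""
    hits = []
    for term in terms:
        for text in index.get(stem_de(term), []):
            if text not in hits:
                hits.append(text)
    return hits


DOC = """<xml>
  <div xml:lang='de'>Häuser</div>
  <div xml:lang='en'>houses</div>
</xml>"""

index = build_index(DOC, "de")
result = query(index, ["houses", "Häuser"])
print(sorted(index))  # ['Haus', 'houses']
print(result)         # ['Häuser']
```

The English div is still indexed, but since the query term 'houses' is stemmed to 'hou' before the lookup, it cannot be found with language 'de'.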
