Just to make sure I understood you correctly:
- If STEMMING is set to true, the input to the stemmer should be
filtered by comparing xml:lang with the LANGUAGE option. Text that is sent to the tokenizer could be left as is, without being filtered by LANGUAGE (see next point).
So you would prefer to have all words indexed, but reduce the stemming step to the chosen language, right?
To give an example:
<xml> <div xml:lang='de'>Häuser</div> <div xml:lang='en'>houses</div> </xml>
If stemming is enabled and the language is 'de', the index would include the two terms 'Haus' (stemmed German form) and 'houses' (original, unstemmed English form).
The query…
//div[text() contains text { "houses","Häuser" } using language 'de' using stemming ]
…would only return the German div element: the German stemmer rewrites 'Häuser' to 'Haus', which is in the index, and 'houses' to 'hou', which does not match the unstemmed English index term.
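
To make the behaviour I have in mind concrete, here is a minimal Python sketch of it. It is not BaseX code: the "stemmer" is only a lookup table holding the two rewrites mentioned above, and the names stem_de, build_index and query are made up for this illustration. Case folding and diacritics handling are left out.

```python
# Toy stand-in for the German stemmer (assumption): it only knows the two
# rewrites quoted in this message and leaves every other token unchanged.
GERMAN_STEMS = {"Häuser": "Haus", "houses": "hou"}


def stem_de(token):
    return GERMAN_STEMS.get(token, token)


def build_index(divs, language):
    """Map each index term to the positions of the divs that contain it."""
    index = {}
    for pos, (lang, text) in enumerate(divs):
        for token in text.split():
            # Proposed rule: every token is indexed, but only text whose
            # xml:lang matches the LANGUAGE option goes through the stemmer.
            term = stem_de(token) if lang == language else token
            index.setdefault(term, set()).add(pos)
    return index


def query(index, terms):
    """Stem every query term with the German stemmer, then look it up."""
    hits = set()
    for term in terms:
        hits |= index.get(stem_de(term), set())
    return sorted(hits)


# (xml:lang, text) pairs for the two div elements of the example document.
divs = [("de", "Häuser"), ("en", "houses")]
index = build_index(divs, "de")

print(sorted(index))                       # ['Haus', 'houses']
print(query(index, ["houses", "Häuser"]))  # [0]
```

Position 0 is the German div. The English div is still in the index under 'houses', but a stemmed German query can no longer reach it, which is the trade-off of this approach.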