Hi,
After reading Christian's answer (:-)), I thought it could be interesting to sort your docs according to @xml:lang and create a new DB next to your corpus:
----------------------------------
(: one database per distinct xml:lang value found in the input files :)
let $files := file:children('input-dir')[matches(., 'xml$')]
for $lang in distinct-values($files ! doc(.)//@xml:lang)
return db:create(
  'db-' || $lang,
  <root xml:lang="{$lang}">{
    for $file in $files
    return <text src="{$file}">{doc($file)//*[@xml:lang = $lang]//text()}</text>
  }</root>,
  'myfile',
  map { 'ftindex': true(), 'language': $lang }
)
----------------------------------
2017-06-27 20:49 GMT+02:00 Christian Grün christian.gruen@gmail.com:
Hi Kristian,
It is currently not possible to work with different languages in a single database. This is mostly because all normalized tokens will end up in the same internal index, and it would be a lot of effort to diversify this software behavior.
As Xavier pointed out (thanks!), the best way indeed is to create different databases, one per language. The following example has been inspired by Xavier’s proposal; it groups all files by their language and adopts the language in the name of the database:
for $path-group in file:children('input-dir')
where ends-with($path-group, '.xml')
(: the first xml:lang attribute found in each file decides its group :)
group by $lang := (doc($path-group)//@xml:lang)[1]
return db:create(
  'db-' || $lang,
  $path-group,
  (),
  map { 'ftindex': true(), 'language': $lang }
)
Hope this helps, Christian
On Tue, Jun 27, 2017 at 5:19 PM, Xavier-Laurent SALVADOR xavierlaurent.salvador@gmail.com wrote:
Hi Kristian,
This is useful for automatically creating databases according to the xml:lang attribute:
let $dir := '/Users/me/myDesktop/'
for $file in file:list($dir)[matches(., 'xml')]
let $flag := data(doc($dir || $file)/div/@xml:lang)
(: note: every iteration targets the same database name "DB" :)
return db:create("DB", $dir || $file, (), map { 'ftindex': true(), 'language': $flag })
Or you can call ft:tokenize on your string in your query, passing map { 'language': $flag } as the options.
Hope I understood the problem :) Else return 'sorry'
2017-06-27 16:57 GMT+02:00 Kristian Kankainen kristian@keeleleek.ee:
Hello
I have documents with text in several languages. When creating a database in BaseX I can choose *one* language for stemming for the full-text search index. Is there a way BaseX could lemmatize according to the element's xml:lang attribute?
Best regards Kristian K