Hello
I have documents with text in several languages. When creating a database in BaseX, I can choose only *one* language for stemming in the full-text search index. Is there a way for BaseX to lemmatize according to each element's xml:lang attribute?
Best regards Kristian K
Hi Kristian,
This may be useful for automatically creating databases according to the xml:lang attribute:
  let $dir := '/Users/me/myDesktop/'
  for $file in file:list($dir)[matches(., 'xml')]
  return
    let $flag := data(doc($dir || $file)/div/@xml:lang)
    return db:create("DB", $dir || $file, (), map { 'ftindex': true(), 'language': $flag })
Or you can call ft:tokenize on your string and pass map { 'language': $flag } in your query.
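As a rough illustration of the second suggestion: ft:tokenize accepts an options map, so the language can be chosen per call. The sample text and the value of $flag below are invented for illustration, and the 'stemming' option is added as an assumption so that the language choice has a visible effect.

```xquery
(: hypothetical values, not taken from the thread :)
let $flag := 'en'
let $text := 'The houses were painted'
(: returns the tokens of $text, stemmed with the stemmer for $flag :)
return ft:tokenize($text, map { 'language': $flag, 'stemming': true() })
```

The result is a sequence of strings, one per token, which could then be compared or stored as needed.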
Hope I understood the problem :) Else return 'sorry'
Hi Kristian,
It is currently not possible to work with different languages in a single database. This is mostly because all normalized tokens will end up in the same internal index, and it would be a lot of effort to diversify this software behavior.
As Xavier pointed out (thanks!), the best way indeed is to create different databases, one per language. The following example has been inspired by Xavier’s proposal; it groups all files by their language and adopts the language in the name of the database:
  for $path-group in file:children('input-dir')
  where ends-with($path-group, '.xml')
  group by $lang := (doc($path-group)//@xml:lang)[1]
  return db:create(
    'db-' || $lang,
    $path-group,
    (),
    map { 'ftindex': true(), 'language': $lang }
  )
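With one database per language, each query can name the language that matches the database it addresses. The following is only a sketch: the database names 'db-de' and 'db-en' are assumed to result from the 'db-' || $lang naming above, and the search terms are invented. It uses db:open as elsewhere in this thread; newer BaseX releases may provide this function under a different name.

```xquery
(
  (: German database, German stemmer :)
  db:open('db-de')//*[text() contains text 'Häuser'
    using stemming using language 'de'],
  (: English database, English stemmer :)
  db:open('db-en')//*[text() contains text 'houses'
    using stemming using language 'en']
)
```

The language in "using language" has to be a string literal, which is why the two lookups are written out separately rather than generated in a loop.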
Hope this helps, Christian
Hi,
After reading Christian's answer ( :-) ), I thought it could be interesting to sort your docs according to @xml:lang and create a new DB next to your corpus:
----------------------------------
  for $lang in distinct-values(
    file:children('input-dir')[matches(., 'xml$')] ! doc(.)//@xml:lang
  )
  return db:create(
    'db-' || $lang,
    <root xml:lang="{ $lang }">{
      for $file in file:children('/Users/xavier/Desktop/')[matches(., 'xml$')]
      return <text src="{ $file }">{ doc($file)//*[@xml:lang = $lang]//text() }</text>
    }</root>,
    'myfile',
    map { 'ftindex': true(), 'language': $lang }
  )
----------------------------------
Hello
Sorry for being slow in reception, being a full-time father of two kids is my only excuse.
Thank you for the enlightening answers. At first, creating separate databases felt wrong and stupid, but after a while it felt just right: it helps to organize the elements of different languages via aggregation instead of composition.
Here is what I came up with:
  (:~
   : This function takes a list of database names and optionally a list of
   : language codes. It creates separate full-text indexed databases for
   : lemmatized searching of each language contained in the original database.
   : If the list of language codes is empty, all existing values of xml:lang
   : found in the database are used.
   : The full-text databases are named 'dbname-ft-langcode'.
   : Another function normalizes the texts, removes duplicate entries and
   : inserts xml:id attributes.
   :)
  declare updating function keeleleek:create-ft-indices-for-each-lang(
    $db-names   as xs:string*,
    $lang-codes as xs:string*
  ) {
    for $db-name in $db-names
    let $langs :=
      if (not(empty($lang-codes)))
      then $lang-codes
      else distinct-values(db:open($db-name)//@xml:lang)
    for $lang in $langs
    let $lang-group := db:open($db-name)//*[@xml:lang = $lang]
    let $ft-db-name := concat($db-name, '-ft-', $lang)
    (: create full-text db for each language :)
    return db:create(
      $ft-db-name,
      <texts>{ $lang-group }</texts>,
      $ft-db-name,
      map { 'ftindex': true(), 'language': $lang }
    )
  };
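Once such a function has run, the resulting databases could be checked roughly as follows. This is only a sketch: it relies on the 'dbname-ft-langcode' naming described in the comment above, and the pattern used to recognise a trailing language code is an assumption, not something from the thread.

```xquery
(: list all databases whose name ends in '-ft-' plus a language code :)
for $name in db:list()
where matches($name, '-ft-[A-Za-z-]+$')
order by $name
return $name
```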
Cheers Kristian K
Kristian,
Out of curiosity, how are you linking the normalized texts in the -ft- database to the source documents? Is keeping a reference from the indexed text back to the source document a requirement in your application?
Thanks, Vincent
Perhaps a proposal below.
27.06.2017 21:49 Christian Grün wrote:
> It is currently not possible to work with different languages in a single database. This is mostly because all normalized tokens will end up in the same internal index, and it would be a lot of effort to diversify this software behavior.
How does BaseX behave if the database content is in many different languages and is correctly marked with xml:lang attributes? Does the full-text index consider this information and apply full-text indexing only to elements with a matching language?
As a simple illustration (untested): will the following code create a full-text index only for the Russian text, or for both the Russian and the English text?
  db:create(
    'db-ft-ru',
    <texts>
      <text xml:lang="ru">something in Russian</text>
      <text xml:lang="en">something in English</text>
    </texts>,
    'texts',
    map { 'ftindex': true(), 'language': 'ru' }
  )
If BaseX does create the full-text index for both languages (the English part of the index would contain useless scramble), I would propose simply filtering elements by xml:lang according to the language given in the options map. This should be simpler to implement than the diversification Christian mentioned.
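Until something like this exists in BaseX itself, the filtering can be approximated by hand before the database is created. A minimal sketch, reusing the sample elements from above; the variable names and the database name pattern are only illustrative.

```xquery
let $lang := 'ru'
let $texts :=
  <texts>
    <text xml:lang="ru">something in Russian</text>
    <text xml:lang="en">something in English</text>
  </texts>
return db:create(
  'db-ft-' || $lang,
  (: keep only the elements marked with the chosen language :)
  <texts>{ $texts/text[@xml:lang = $lang] }</texts>,
  'texts',
  map { 'ftindex': true(), 'language': $lang }
)
```

With this, only the Russian element ends up in the database, so nothing in another language reaches the full-text index.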
Best regards Kristian K
Hi Kristian,
Right now, xml:lang attributes are completely ignored when indexing full-text. It’s an interesting idea to exclude texts that are marked with languages different to the one that is currently applied; I will think about it.
However, I should have mentioned that the language option is mostly irrelevant unless you use stemmers. Tokenization is pretty much the same for Western texts, so searches like the following one…
'Добрый ДЕНЬ!' contains text 'день' using language 'en'
…will still give you the expected result. To some extent, this also applies to Arabic texts:
'يوم سعيد' contains text 'يوم' using language 'en'
Things are definitely different if you work with Japanese or Chinese texts. The following query yields false:
'今日は' contains text '今' using language 'en'
For more information on Japanese tokenization, see Toshio HIRAI’s article in our wiki [1].
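To see the point about stemmers in isolation, the two comparisons below differ only in the stemming option. This is a small sketch with made-up strings, and it assumes an English stemmer is available in the BaseX installation.

```xquery
(
  (: no stemming: 'houses' and 'house' are compared as distinct tokens :)
  'houses' contains text 'house' using language 'en',
  (: stemming: both sides go through the English stemmer first :)
  'houses' contains text 'house' using stemming using language 'en'
)
```

The first comparison should not match, since the tokens differ; the second is expected to match only if the stemmer reduces both words to the same stem.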
Hope this helps, Christian
[1] http://docs.basex.org/wiki/Full-Text:_Japanese
Hi Christian,
To refine the proposal: it would be great if the full-text index could be set up to consider xml:lang attributes in the following way:
* If STEMMING is set to true, the input to the stemmer should be filtered by matching xml:lang against the LANGUAGE option. Text that is sent to the tokenizer could be left as is and not be filtered by LANGUAGE (see next point).
* If STEMMING is set to false, I agree with you that the general tokenization strategy is okay. For correctness, though, it could still be extended to exclude all scripts that don't follow Western-centric tokenization algorithms.
* As for the DIACRITICS sensitivity option, what is given by Unicode and the collation used by the query is probably good enough.
What do you think?
Best regards Kristian K
To be sure I understood you correctly:
> - If STEMMING is set to true, then the input to the stemmer should be filtered by matching the xml:lang and the LANGUAGE option. Text that is sent to the tokenizer could be left as is and not be filtered by matching LANGUAGE (see next point).
So you would prefer to have all words indexed, but reduce the stemming step to the chosen language, right?
To give an example:
  <xml>
    <div xml:lang='de'>Häuser</div>
    <div xml:lang='en'>houses</div>
  </xml>
If stemming is enabled, and if language is 'de', the index would include the two terms 'Haus' (stemmed German form) and 'houses' (original English form).
The query…
//div[text() contains text { "houses","Häuser" } using language 'de' using stemming ]
…would only return the German div element (as the German stemmer rewrites 'Häuser' to 'Haus' and 'houses' to 'hou').
Yes, you are correct.
During index building, only <div xml:lang='de'>Häuser</div> is lemmatized, thus
//div[text() contains text { "houses","Häuser" } using language 'de' using stemming ]
returns only the element with Häuser. But a query without stemming and language:
//div[text() contains text { "houses","Häuser" }]
would return both elements.
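The indexing rule discussed here can be emulated in plain XQuery to see which terms it would produce. This is only a sketch of the proposal, not of how BaseX builds its index: local:proposed-terms is a hypothetical helper, and it assumes that ft:tokenize accepts the 'language' and 'stemming' options.

```xquery
(: hypothetical helper, not part of BaseX: emulates the proposed rule :)
declare function local:proposed-terms(
  $root     as element(),
  $language as xs:string
) as xs:string* {
  for $text in $root//text()[normalize-space()]
  (: stem only text whose inherited xml:lang matches the chosen language :)
  let $stem := lang($language, $text/..)
  return ft:tokenize($text, map { 'language': $language, 'stemming': $stem })
};

local:proposed-terms(
  <xml>
    <div xml:lang='de'>Häuser</div>
    <div xml:lang='en'>houses</div>
  </xml>,
  'de'
)
```

Under the proposal, the German word would come back in its stemmed form, while the English word would only be tokenized and otherwise left as it is.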
Best regards Kristian K
Thanks. I’ll keep this proposal in mind, and think about further implications. If we decided one day to make the full-text index updatable (which would be a nice feature, but a lot of work), we would probably need to reindex sub-trees with modified language attributes.
basex-talk@mailman.uni-konstanz.de