Re: [basex-talk] Full-text lemmatizing and xml:lang


Kristian Kankainen

1 Jul 2017

9:56 a.m.


Xavier-Laurent SALVADOR

1 Jul 2017

10:08 a.m.

New subject: Full-text lemmatizing and xml:lang

Hello Guys,

The reference from the indexed text back to the source document should be maintained globally in an @src attribute (for example), and it should obviously be preserved automatically by a term-to-term full-text query, so:

  let $pInTargetFtLang := (: some sentence of 2 or more words in db:open('-ft-lang') :)
  let $refIdInSource := doc('source.xml')//*[. contains text { $pInTargetFtLang } all words ordered]/id
  return $refIdInSource

should retrieve and maintain the original path. This is not "wrong and stupid" ;-), but a standard way of building fragmented linguistic corpora in order to reassemble them later in a REST app. A corpus is not just one database: it's a set of databases you have to mix for client use and display.
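
As a rough sketch of that idea, assuming a per-language database whose `text` elements carry an `src` attribute pointing at the source file (the database name `db-en` and the search phrase are hypothetical, not from the actual corpus):

```xquery
(: Sketch only: find <text> entries that contain all words of a phrase,
   in order, and report the source path stored in @src.
   'db-en' and the phrase are hypothetical placeholders. :)
for $hit in db:open('db-en')//text[
  . contains text 'example sentence' all words ordered
]
return string($hit/@src)
```

The returned paths can then be passed to `doc()` or matched against the source database, depending on how the corpus is laid out.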

br, x

2017-07-01 9:56 GMT+02:00 Kristian Kankainen kristian@keeleleek.ee:


Hello.

It's a dictionary, and words in the dictionary reference the texts with simple identifiers. The texts are used as examples and can be referenced by many different words. Thus no explicit bookkeeping is kept about which words use which texts as examples. This is rather done implicitly, and each text reference (from the dictionary to -ft-) also holds a lot of extra information, such as the quality of the example text with respect to the word. Statistics and summaries of this data are produced by separate queries and not held explicitly in either database.

Not sure whether I answered your question?

BR Kristian K

July 2017 3:29, "Lizzi, Vincent" <Vincent.Lizzi@taylorandfrancis.com> wrote:

Kristian,

Out of curiosity, how are you linking the normalized texts in the -ft- database to the source documents? Is keeping a reference from the indexed text back to the source document a requirement in your application?

Thanks,

Vincent

*From:* basex-talk-bounces@mailman.uni-konstanz.de [mailto:basex-talk-bounces@mailman.uni-konstanz.de] *On Behalf Of* Kristian Kankainen
*Sent:* Friday, June 30, 2017 5:27 PM
*To:* Xavier-Laurent SALVADOR <xavierlaurent.salvador@gmail.com>; Christian Grün <christian.gruen@gmail.com>
*Cc:* BaseX <basex-talk@mailman.uni-konstanz.de>
*Subject:* Re: [basex-talk] Full-text lemmatizing and xml:lang

Hello

Sorry for being slow in reception, being a full-time father of two kids is my only excuse.

Thank you for the enlightening answers. At first, creating a separate database felt wrong and stupid, but after a while it felt just right, as it helps to organize the different language elements via aggregation instead of composition.

Here is what I came up with:

(:~
 : This function takes a list of database names and optionally a list of
 : language codes. It creates separate full-text indexed databases for
 : lemmatized searching of each language contained in the original database.
 : If the list of language codes is empty, all existing values of xml:lang
 : found in the database are used.
 : The full-text databases are named 'dbname-ft-langcode'.
 : Another function normalizes the texts, removes duplicate entries and
 : inserts xml:id attributes.
 :)
declare updating function keeleleek:create-ft-indices-for-each-lang(
  $db-names as xs:string*,
  $lang-codes as xs:string*
) {
  for $db-name in $db-names
  let $langs :=
    if (not(empty($lang-codes))) then $lang-codes
    else distinct-values(db:open($db-name)//@xml:lang)
  for $lang in $langs
  let $lang-group := db:open($db-name)//*[@xml:lang = $lang]
  let $ft-db-name := concat($db-name, '-ft-', $lang)
  (: create full-text db for each language :)
  return
    db:create(
      $ft-db-name,
      <texts>{$lang-group}</texts>,
      $ft-db-name,
      map { 'ftindex': true(), 'language': $lang }
    )
};
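
Given the 'dbname-ft-langcode' naming scheme described above, the derived databases can be found again by name. A small sketch, where `dictionary` is a hypothetical source database name:

```xquery
(: Sketch only: list the per-language full-text databases derived from
   one source database and recover the language code from each name.
   'dictionary' is a hypothetical database name. :)
let $db-name := 'dictionary'
let $prefix := $db-name || '-ft-'
for $ft-db in db:list()[starts-with(., $prefix)]
return map {
  'database': $ft-db,
  'language': substring-after($ft-db, $prefix)
}
```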

Cheers Kristian K

28.06.2017 09:45 Xavier-Laurent SALVADOR wrote:

Hi,

After reading Christian's answer ( :-) ), I thought it could be interesting to sort your docs according to @xml:lang and create a new DB next to your corpus:

for $lang in distinct-values(
  file:children('input-dir')[matches(., 'xml$')] ! (doc(.)//@xml:lang)
)
return db:create(
  'db-' || $lang,
  <root xml:lang="{ $lang }">{
    for $file in file:children('/Users/xavier/Desktop/')[matches(., 'xml$')]
    (: compare against $lang; inside the predicate, "." would be the element itself :)
    return <text src='{ $file }'>{ doc($file)//*[@xml:lang = $lang]//text() }</text>
  }</root>,
  "myfile",
  map { 'ftindex': true(), 'language': $lang }
)

2017-06-27 20:49 GMT+02:00 Christian Grün christian.gruen@gmail.com:

Hi Kristian,

It is currently not possible to work with different languages in a single database. This is mostly because all normalized tokens will end up in the same internal index, and it would be a lot of effort to diversify this software behavior.

As Xavier pointed out (thanks!), the best way indeed is to create different databases, one per language. The following example has been inspired by Xavier’s proposal; it groups all files by their language and adopts the language in the name of the database:

for $path-group in file:children('input-dir')
where ends-with($path-group, '.xml')
(: file:children returns path strings, so open the document to read xml:lang :)
group by $lang := (doc($path-group)//@xml:lang)[1]
return db:create(
  'db-' || $lang,
  $path-group,
  (),
  map { 'ftindex': true(), 'language': $lang }
)
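
Once the documents are split this way, a query can select the database by language. The following is only a sketch: it assumes databases named 'db-' || $lang as above, and the helper `local:search` and the search term are hypothetical. It uses `db:open` as elsewhere in this thread; recent BaseX releases call the same function `db:get`.

```xquery
(: Sketch only: run a full-text search in the database that holds the
   given language. Assumes databases named 'db-<lang>' created with
   'ftindex' and 'language' options. local:search is a hypothetical helper. :)
declare function local:search(
  $lang as xs:string,
  $term as xs:string
) as node()* {
  db:open('db-' || $lang)//*[text() contains text { $term }]
};

local:search('en', 'house')
```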

Hope this helps, Christian

On Tue, Jun 27, 2017 at 5:19 PM, Xavier-Laurent SALVADOR xavierlaurent.salvador@gmail.com wrote:

Hi Kristian,

This is useful for automatically creating databases according to the xml:lang attribute:

let $dir := '/Users/me/myDesktop/'
for $file in file:list($dir)[matches(., 'xml')]
let $flag := data(doc($dir || $file)/div/@xml:lang)
(: note: with the fixed name "DB" this only works for a single file;
   several files need a distinct database name each :)
return db:create(
  "DB",
  $dir || $file,
  (),
  map { 'ftindex': true(), 'language': $flag }
)

Or you can "ft:tokenize" your string mapping {'language':$flag} into your query
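
A minimal sketch of that second route, assuming the `ft:tokenize` function of the BaseX Full-Text Module with its `language` and `stemming` options, and that stemmers for the chosen languages are available in the installation. The sample string and language codes are illustrative only:

```xquery
(: Sketch only: tokenize the same string with stemming under two
   language settings to compare the normalized tokens.
   The input string and the language codes are illustrative. :)
let $text := 'The houses were running'
for $lang in ('en', 'de')
return <tokens lang="{ $lang }">{
  ft:tokenize($text, map { 'stemming': true(), 'language': $lang })
}</tokens>
```

In a real query, `$lang` would come from the element's xml:lang value (the `$flag` above) rather than from a fixed list.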

Hope I understood the problem :) Else return 'sorry'

2017-06-27 16:57 GMT+02:00 Kristian Kankainen kristian@keeleleek.ee:

Hello

I have documents with text in several languages. When creating a database in BaseX I can choose *one* language for stemming for the full-text search index. Is there a way BaseX could lemmatize according to the element's xml:lang attribute?

Best regards Kristian K

-- This email may contain material for the sole use of the intended recipient. Any forwarding without express permission is prohibited. If you are not the intended recipient, please contact the sender and delete all copies.

