Re: [basex-talk] Full-text index: searches for common words in another node. Does it take a lot of time?

18 May 2020

Hi Sebastian,
Yes, I think your search on mark-identification suffers from the huge number of party-name nodes.
...
From what I remember, the reverse (inverted) index, which maps full-text tokens to node ids, is shared across all element names,
so filtering on the element name is done last.
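To illustrate, an index-backed evaluation of your query probably behaves roughly like the sketch below: every text node containing the token is fetched first, and the element name is only checked afterwards. The database name 'trademarks' is an assumption, and this is not necessarily the exact plan the optimizer produces.

```xquery
(: Sketch only: 'trademarks' is an assumed database name. :)
(: ft:search returns every indexed text node that contains the token,
   whatever its parent element is (mark-identification or party-name). :)
for $text in ft:search('trademarks', 'corporation')
(: The element-name filter is applied afterwards, once per hit. :)
where $text/parent::mark-identification
return $text/ancestor::case-file//mark-identification
```

With about a million hits for "corporation", almost all of them under party-name, most of the time would go into discarding index results.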
When I was using BaseX to handle the DOCDB patent database, I used to split each document into sub-documents containing only the keys and the text to be indexed, one per language and XML element, and then build separate databases.
That way I could create a dedicated full-text index for a single (element name, language) combination.
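Applied to your data, this could look like the minimal sketch below. The database names and the serial-number key element are assumptions (use whatever uniquely identifies a case-file), and db:open is the BaseX 9 name of the function (it was renamed db:get in BaseX 10). For a database of your size you would probably export the entries in batches rather than build one document in memory.

```xquery
(: Sketch only: database names and the serial-number key are assumptions. :)
let $entries :=
  for $case in db:open('trademarks')//case-file
  return
    (: keep only the key and the text to be indexed :)
    <entry key="{ $case/serial-number }">{
      $case/case-file-header/mark-identification/text()
    }</entry>
return db:create(
  'trademarks-mark-identification',  (: dedicated database :)
  <entries>{ $entries }</entries>,
  'entries.xml',
  map { 'ftindex': true() }          (: build a full-text index for it :)
)
```

A search would then hit the small index first and go back to the main database through the key:

```xquery
for $entry in db:open('trademarks-mark-identification')
              //entry[text() contains text 'corporation']
return db:open('trademarks')
       //case-file[serial-number = $entry/@key]//mark-identification
```

The second step stays fast only if the key lookup can itself use an index on the main database.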

Does that help?

I really appreciated working with BaseX at that time, because others were stuck in a kind of Java/relational mapping hell... I just had to add XML documents, reindex, and sometimes purge deleted items.

Best,
Fabrice

________________________________
From: BaseX-Talk <basex-talk-bounces@mailman.uni-konstanz.de> on behalf of Sebastian Guerrero <chapeti@gmail.com>
Sent: Monday, 18 May 2020 17:23
To: BaseX <basex-talk@mailman.uni-konstanz.de>
Subject: [basex-talk] Full-text index: searches for common words in another node. Does it take a lot of time?

Hi everybody.

I'm here again with my questions. Thank you for your patience. ^^

I have a database of trademarks with a full-text index on two elements: *:mark-identification,*:party-name. [1]

Here, "mark-identification" contains the name of the trademark, and "party-name" contains the name of the owner of the trademark.

I use the full-text index to search trademarks by name, for example:

for $results in //case-file[case-file-header/mark-identification/text() contains text {'basex'}]
return $results//mark-identification

returns all trademarks with "basex" in their name. It works like lightning: 15 ms to get 3 records among 2,134,434,598 nodes. Really a dream. [2]

But if I change the searched text from "basex" to a word that is common in "party-name", for example "corporation" (it has 1,096,187 occurrences in the full-text index, as shown in [1]; it is a very common word in the names of trademark owners):

for $results in //case-file[case-file-header/mark-identification/text() contains text {'corporation'}]
return $results//mark-identification

It takes a long time to get 6,715 records: 62,000 ms [3]

If I search for "live" (a common word in trademark names, but not in owner names), I get 5,875 records in 2,773 ms, which is out of all proportion to the 62,000 ms needed to get the 6,715 records for "corporation". [4]

So...

  *   Is this expected behaviour?
  *   Is there a way to specify which "section" of the full-text index should be used for the search? (I don't know... maybe something similar to "using stemming", but "using index 'mark-identification'".)

Please excuse me if I'm asking something illogical.

Best regards,
Sebastian

[1] https://imgur.com/uLla1Xt
[2] https://imgur.com/Fkcvv2O
[3] https://imgur.com/Hk71CNe
[4] https://imgur.com/P72k574

ETANCHAUD Fabrice