Re: [basex-talk] index features

22 Jan 2012

Hi Christian,

I would like to come back to some questions we discussed previously:

Quoting Christian Grün <christian.gruen@gmail.com>:

[...]
...
> To give more information, I'll have to look at the
> actual data; do you think you can provide me with a little document
> that exemplifies your observation?

Since I am not sure whether the behavior depends on my actual data, I
did not create a minimal example. Instead, I put a sample of my
collection, consisting of 4 smaller documents, online:
<http://oldphras.unibas.ch/test.tgz>

//*[text() contains text ('Kopf' ftand 'Sand' ftand 'stecken')
    using stemming using language "de"]
   [self::*:p or self::*:l]

gives 3 hits (in Wille, Suttner, and Cervantes)

//*[text() contains text ('Kopf' ftand 'Sand' ftand 'stecken')
    using stemming using language "de" distance at most 10 words]
   [self::*:p or self::*:l]

gives 2 hits (in Wille and Suttner)

//*[text() contains text "Kopf Sand stecken" all words
    using stemming using language "de" distance at most 10 words]
   [self::*:p or self::*:l]

gives 3 hits (in Wille, Suttner, and Cervantes); the "distance" option
seems to be ignored.
...
...
The second question is about "ftand" and "ftor".
//*[text() contains text ('Kopf' ftand 'Sand' ftand 'stecken')
    using stemming using language "de" distance at most 10 words]
   [self::*:p or self::*:l]

gives 2 hits (in Wille and Suttner)

//*[text() contains text ('Nase' ftand 'Sand' ftand 'stecken')
    using stemming using language "de" distance at most 10 words]
   [self::*:p or self::*:l]

gives 1 hit (in Müllenhoff)

Therefore, for

//*[text() contains text (('Nase' ftor 'Kopf') ftand 'Sand' ftand 'stecken')
    using stemming using language "de" distance at most 10 words]
   [self::*:p or self::*:l]

I would expect to get all 3 hits, but I actually get only 1 (the one in
Wille). It makes no difference whether I write ('Nase' ftor 'Kopf') or
('Kopf' ftor 'Nase'). Additionally, the highlighting is strange.

Ultimately, I would like to run searches like the following to speed up
annotating the data:

( Nase | Kopf | Hals ) & ( Sand | Schlinge ) & ( ziehen | stecken )
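
Written out in XQuery Full Text syntax, I assume such a search would
look roughly like the sketch below. The inline sample sentences and the
variable name $sample are made up for illustration and are not taken
from my collection; the operators and options are the same ones used in
the queries above.

```xquery
(: Hypothetical sample data for illustration only; not from the collection. :)
let $sample :=
  <text>
    <p>Man soll den Kopf nicht in den Sand stecken.</p>
    <p>Er wollte den Hals aus der Schlinge ziehen.</p>
    <l>Der Sand am Ufer war warm.</l>
  </text>
(: One ftor group per slot of the phrase; the groups are combined with ftand. :)
return $sample//*[text() contains text (
    ('Nase' ftor 'Kopf' ftor 'Hals') ftand
    ('Sand' ftor 'Schlinge') ftand
    ('ziehen' ftor 'stecken')
  ) using stemming using language "de" distance at most 10 words
][self::*:p or self::*:l]
```

Going by the specification, I would expect the two p elements to match
and the l element not to; given the ftor behavior described above, the
actual result may differ.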
...
...
>> The third question is about the full-text index itself. When applying fuzzy
>> search or using wildcards, the full-text index is not applied -- resulting
>> in a time out on my website, I need 341859.09 ms in the GUI for applying
>
> Currently, the choice has to be made between efficient fuzzy or
> wildcard matching (the latter being based on a Trie index structure).

So I can have either fuzzy matching, or stemming and wildcards. For
searching that is OK: I copied the collection and created the other
index for the copy. But since I want to update the collection after
searching, I would have to apply every update to both collections and
re-index both of them afterwards. Is this correct?

Best regards

Cerstin
-- 
Dr. phil. Cerstin Mahlow

Universität Basel
Deutsches Seminar
Nadelberg 4
4051 Basel
Schweiz

Tel:  +41 61 267 07 65
Fax: +41 61 267 34 40
Mail: cerstin.mahlow@unibas.ch
Web: http://www.oldphras.net
