Hi Javier,
- Does the ft:tokenize function tokenize on the fly, or are the tokens stored in the full-text index?
Tokenization is done on the fly. It would actually take much longer to look up the corresponding tokens for a text in the index. Moreover, this way you can tokenize arbitrary input strings. The following examples return true:
ft:tokenize("Naïve") = "naive"
deep-equal( ft:tokenize(<div><b>H</b>ello! (Everyone)</div>), ('hello', 'everyone') )
Tokenization is very fast in BaseX. The following query takes approx. 200 ms on my machine:
prof:time(prof:void( for $i in 1 to 1000000 return ft:tokenize(" Amidst the vogue enjoyed by existentialism and positivism in early 20th-century Europe, Adorno advanced a dialectical conception of natural history that critiqued the twin temptations of ontology and empiricism through studies of Kierkegaard and Husserl." ) ))
But you are completely right that the post-processing may be too slow if you need to order thousands or millions of index results. In this case, you could play around with the internally computed score value:
for $sentence score $score in //sentence
  [text() contains text { 'DNA', 'oxidation' }]
order by $score descending
return $sentence
The scoring model of BaseX takes into consideration the number of found terms, their frequency in a text, and the length of a text. The shorter the input text is, the higher scores will be (cited from [1]). Distances between words are not considered so far, though (volunteer implementors are welcome ;).
- I guess that if I search for something like { "DNA", "oxidation" }, I need to compute the distance for each term using index-of, don't I?
Exactly, that's one way. You can do all kinds of things with the returned tokens and, consequently, their positions in the sequence. Please check out the attached example for some more complex distance computations (it uses fold-left etc., not all of which may be required, so don't be frightened.. ;).
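Independently of the attachment, here is a minimal sketch of the index-of approach. The function name local:min-distance is made up for illustration, and it assumes that each search term is a single word, so that ft:tokenize returns exactly one token for it:

```xquery
(: Hypothetical helper: smallest difference between the token
   positions of two single-word terms in a text. Returns the
   empty sequence if one of the terms does not occur. :)
declare function local:min-distance(
  $text as xs:string,
  $a    as xs:string,
  $b    as xs:string
) as xs:integer? {
  (: tokenize the text; the terms are normalized the same way :)
  let $tokens := ft:tokenize($text)
  (: all positions at which each term occurs :)
  let $pos-a := index-of($tokens, ft:tokenize($a))
  let $pos-b := index-of($tokens, ft:tokenize($b))
  return min(
    for $i in $pos-a, $j in $pos-b
    return abs($i - $j)
  )
};

(: 'oxidation' is token 1, 'dna' is token 3 and 6: the result is 2 :)
local:min-distance(
  'Oxidation of DNA damages the DNA strand', 'DNA', 'oxidation'
)
```

Since the search terms are passed through ft:tokenize as well, they are compared in the same normalized form as the tokens of the text.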
If you only want to retrieve results in which the queried words occur within a maximum distance of each other, you could also try the 'distance' and 'window' keywords [2].
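As a rough sketch of the 'distance' filter, the following query runs on a small inline document that I made up for illustration (no database required). It uses 'ftand' so that both terms must occur:

```xquery
(: Made-up sample data, not taken from your database :)
let $doc :=
  <doc>
    <sentence>Oxidation of DNA is a common form of damage.</sentence>
    <sentence>DNA repair is unrelated, in this sentence, to any process of oxidation.</sentence>
  </doc>
return $doc/sentence[
  (: both terms must occur, with at most 2 words in between :)
  text() contains text 'DNA' ftand 'oxidation' distance at most 2 words
]
```

Only the first sentence should be returned here, as the two terms are far apart in the second one. Replacing the filter with 'window 3 words' would instead require both terms to occur within a span of three consecutive words.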
Hope this helps, Christian
[1] http://docs.basex.org/wiki/Full-Text#Scoring
[2] http://docs.basex.org/wiki/Full-Text#Positional_Filters