Hi Christian,
Quoting Christian Grün christian.gruen@gmail.com:
Thanks. I guess I cannot do everything directly within XQuery, e.g., merging adjacent marked elements into one continuous marking, so that "<mark>Korb</mark> <mark>geben</mark>" becomes "<mark>Korb geben</mark>" -- this will matter even more for queries with ftand or ftor.
Currently, the ft:mark() and ft:extract() functions are mainly used to highlight hits in search results, but we are always interested in extending our XQuery modules with helpful functions or additional arguments, so feel free to suggest new features (but I cannot guarantee when a particular request will be implemented). For example, the latest snapshot contains two new functions, ft:tokens() and ft:tokenize() [1], which were requested recently.
I noticed the tokenizing features and will probably use them as well.
Highlighting occurrences of search terms is probably the perfect solution for most XML data. However, I use BaseX as a substitute for a corpus query workbench: the texts I have to deal with are TEI-annotated but lack linguistic annotation -- most of them are non-modern German, so applying state-of-the-art NLP tools is impossible. Therefore I cannot run queries based on part of speech, syntactic structures, or lemmas.
The users will look for evidence of idiomatic phrases by searching for the main parts of such phrases. "den Kopf (nicht) in den Sand stecken" results in a query like "Kopf ftand Sand ftand stecken" -- since I don't have information on sentence boundaries, I use the "distance" option to make it likely that the query terms appear within one sentence. For this use case I would be interested in highlighting the potential "phrase" as a whole, i.e., everything from the first match to the last match.
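The desired behaviour can be sketched outside of BaseX. The following is only an illustration in Python, not BaseX or XQuery code: the function names mark_span and merge_adjacent_marks are made up, the input is assumed to be plain text (no nested XML), and matching is assumed to be exact and case-insensitive, without stemming or a distance check.

```python
import re


def mark_span(text, terms, tag="mark"):
    """Wrap everything from the first to the last matching token in <tag>.

    Assumption: a token matches if it equals one of the terms,
    ignoring case. No stemming and no distance window is applied.
    """
    wanted = {t.lower() for t in terms}
    # Collect word tokens together with their character offsets.
    hits = [m for m in re.finditer(r"\w+", text) if m.group().lower() in wanted]
    if not hits:
        return text  # nothing to highlight
    start, end = hits[0].start(), hits[-1].end()
    return f"{text[:start]}<{tag}>{text[start:end]}</{tag}>{text[end:]}"


def merge_adjacent_marks(marked, tag="mark"):
    """Join marked elements that are separated only by whitespace."""
    pattern = rf"</{re.escape(tag)}>(\s+)<{re.escape(tag)}>"
    # Drop the closing/opening tag pair and keep the whitespace between them.
    return re.sub(pattern, r"\1", marked)


if __name__ == "__main__":
    sentence = "Er wollte den Kopf nicht in den Sand stecken, sagte er."
    print(mark_span(sentence, ["Kopf", "Sand", "stecken"]))
    print(merge_adjacent_marks("<mark>Korb</mark> <mark>geben</mark>"))
```

The first function covers the "first match until last match" case, the second the "Korb geben" case from the earlier mail. A real implementation would have to work on the marked XML nodes rather than on strings, and would need to decide what happens when the span crosses element boundaries.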
I don't program in Java, so I cannot help with implementing such functionality, but I could help with specifying and testing it. The project I am working for is located at the University of Basel; if there is a need for an "official" cooperation, we could arrange that ;-)
Best
Cerstin