Thanks, Cerstin and Michael, for your suggestions.
Yes, all the use cases sound perfectly reasonable to me. When it comes to the implementation, however, I see quite a number of obstacles. One reason is that the full-text expression, as currently implemented, discards all element structure before tokenizing the text. This means that the following two queries are basically equivalent:
<a>X <b>Y</b> Z</a>[. contains text 'X Y']
<a>X <b>Y</b> Z</a>[data() contains text 'X Y']
As Cerstin indicated, you'll probably have to parse all text nodes individually:
ft:mark(//*[text() contains text {'X', 'Y'}])
This simple approach, however, won't work for phrases (multiple terms) that reach into descendant nodes.
Christian
While I concede this may be useful in numerous use cases (and may even seem obvious), it would take quite some time to get implemented, so... please don't expect too much magic for the moment. There will also be some conceptual issues that need to be resolved. As an example, which result would you expect for the following query?
ft:mark(<a>X <b>Y</b> Z</a>[. contains text 'X Y'])
I think it should be
<a><mark>X</mark> <b><mark>Y</mark></b> Z</a>
Each token from the search string would be enclosed in a <mark> element.
Exactly. While this probably wouldn't cover *all* possible scenarios, it would still cover most of the useful ones. In fact, it would be similar to http://www.raymondhill.net/blog/?p=272. It would also be applicable when ignoring elements in a search.
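To make the token-wise result concrete, here is a minimal sketch in plain XQuery 3.0 that walks the tree, tokenizes every text node individually and wraps matching tokens in <mark>. It does not use ft:mark or the full-text index. The helper name local:mark is made up for this illustration, and the matching is a simple case-insensitive string comparison (no stemming, no diacritics handling, no check that the terms form a phrase), so it only illustrates the idea and says nothing about how BaseX would implement it.

```xquery
(: Hypothetical helper: copies a node and wraps every token that equals
   one of $terms (case-insensitive) in a <mark> element.
   Plain string matching only; this is NOT the full-text engine. :)
declare function local:mark(
  $node  as node(),
  $terms as xs:string*
) as node()* {
  typeswitch ($node)
    case text() return
      (: split the text node into word tokens and the gaps between them :)
      for $part in analyze-string($node, '\w+')/*
      return
        if ($part/self::fn:match and lower-case($part) = $terms ! lower-case(.))
        then <mark>{ string($part) }</mark>
        else text { string($part) }
    case element() return
      (: rebuild the element so that the original hierarchy is preserved :)
      element { node-name($node) } {
        $node/@*,
        $node/node() ! local:mark(., $terms)
      }
    default return $node
};

local:mark(<a>X <b>Y</b> Z</a>, ('X', 'Y'))
```

For the sample fragment this should yield the structure shown above: X and Y are marked, and the <b> element stays in place. Since every token is tested on its own, 'X Y' is treated as two independent terms rather than as a phrase.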
For complex applications it may help to get the start and end character positions of the matches (essentially standoff markup), and the application could then do the highlighting itself on the basis of this information.
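A rough sketch of what such standoff information could look like, again in plain XQuery 3.0 and without the full-text engine. The function name local:offsets and the <hit> element with its attributes are assumptions made up for this example. Positions are 1-based and refer to the string value of the element, in the same way fn:substring counts characters.

```xquery
(: Hypothetical helper: returns one <hit> per matching token, with start
   and end character positions relative to the string value of $node. :)
declare function local:offsets(
  $node  as node(),
  $terms as xs:string*
) as element(hit)* {
  (: word tokens (fn:match) and the gaps between them (fn:non-match) :)
  let $parts := analyze-string(string($node), '\w+')/*
  for $part at $i in $parts
  (: start = total length of everything before this part, plus one :)
  let $start := sum($parts[position() < $i] ! string-length(.)) + 1
  where $part/self::fn:match
    and lower-case($part) = $terms ! lower-case(.)
  return
    <hit term="{ $part }"
         start="{ $start }"
         end="{ $start + string-length($part) - 1 }"/>
};

local:offsets(<a>X <b>Y</b> Z</a>, ('X', 'Y'))
```

The string value of the sample fragment is "X Y Z", so the two hits should be reported at positions 1 and 3. Summing the preceding lengths for every token is quadratic; for long texts a running total would be the better choice.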
[...]
If you don't need the inner elements, you may as well remove them from your document before applying ft:mark.
This is a great idea if you would like to know whether the search elements are somewhere in your text.
However, if you would like to show the results to end users (= humanities people) or to annotate the document further, it's not a good idea to destroy the original structure. Or maybe one would have to come up with some tricky workaround to first replace the hierarchical node with a flat one for searching, then annotate something and somehow replace the original hierarchical one with the annotated one preserving the original hierarchy.
And for searching only, the scenario is a TEI-document representing an old printed book with highlighting (e.g., some things in italics), foreign-language words printed in a different font, person names already marked, etc. The TEI rendering is intended to mimic the original printed page. When implementing a full-text search, the end user expects to see the highlighted search tokens within the rendered page. Therefore the "easiest" way is to search in descendant nodes and use ft:mark to highlight the hits, without any need to change the TEI rendering. This would also allow the end user to not only see the node where the search string was found, but scroll up and down to inspect the context of the node.
I fully agree, this is exactly what I need in my application: I don't want to retrieve snippets from the document, but I always have to display the full document with the hits highlighted.
What I'm going to do now is probably highlight the full paragraph which contains the node retrieved by the search, i.e., get the node ID, walk up the tree until I encounter a <p> and get its @xml:id, which I can then use in a CSS stylesheet. Or something like this. But this is clearly only an approximation.
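As a sketch, the approximation could look roughly like this in BaseX. The sample document, the search term and the CSS rule are invented for illustration. The name test *:p is used because real TEI documents are namespaced, and the generated selectors assume that the rendering copies @xml:id to an HTML id attribute.

```xquery
(: Hypothetical sample data; a real TEI document would come from a database. :)
let $doc :=
  <body>
    <p xml:id="p1">An old <hi rend="italic">printed</hi> book.</p>
    <p xml:id="p2">Nothing to see here.</p>
  </body>
(: find matching text nodes, walk up to the nearest enclosing paragraph :)
let $ids := distinct-values(
  for $hit in $doc//text()[. contains text 'printed']
  return $hit/ancestor::*:p[1]/@xml:id
)
(: one CSS rule per paragraph that contains at least one hit :)
return $ids ! ('#' || . || ' { background-color: yellow; }')
```

Only the first paragraph contains the term (inside <hi>), so only its ID should end up in the stylesheet. The hit itself is still not marked, which is why this remains an approximation.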
Best regards
-- 
Dr.-Ing. Michael Piotrowski, M.A.
mxp@cl.uzh.ch
Institute of Computational Linguistics, University of Zurich
Phone +41 44 63-54313 | OpenPGP public key ID 0x1614A044