Re: [basex-talk] Full-text search and mixed content

9 May 2012


Thanks, Cerstin and Michael, for your suggestions.
Yes, all the use cases sound perfectly reasonable to me. When it comes
to the implementation, though, I see quite a number of obstacles. One
of the reasons is that the full-text expression, as currently
implemented, discards all elements before tokenizing the texts. This
means that the following two queries are basically the same:
<a>X <b>Y</b> Z</a>[. contains text 'X Y']
<a>X <b>Y</b> Z</a>[data() contains text 'X Y']
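To illustrate why the two queries behave alike, here is a minimal sketch in
Python (not BaseX code). It assumes that "discarding all elements" amounts to
tokenizing the string value of the node; the function name tokens() and the
simple \w+ tokenizer are made up for this illustration.

```python
import re
import xml.etree.ElementTree as ET

def tokens(xml):
    """Tokenize the string value of the root element; markup is dropped first."""
    text = "".join(ET.fromstring(xml).itertext())
    return re.findall(r"\w+", text.lower())

# The nested <b> element leaves no trace in the token list.
print(tokens("<a>X <b>Y</b> Z</a>"))
print(tokens("<a>X Y Z</a>"))
```

Both calls print the same token list, so a phrase search cannot tell the
mixed-content version from the flat one.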
As Cerstin indicated, you'll probably have to parse all text nodes individually:
ft:mark(//*[text() contains text {'X', 'Y'}])
This simple approach, however, won't work for phrases (multiple
terms) that extend into descendant nodes.
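The limitation can be reproduced with a small Python sketch (again not BaseX
code). The helpers own_text_nodes(), contains_phrase() and
elements_matching() are hypothetical names; the sketch assumes whitespace
tokenization and checks only the text nodes that are direct children of each
element, which is roughly what text() selects.

```python
import xml.etree.ElementTree as ET

def own_text_nodes(elem):
    """Text nodes that are direct children of elem."""
    nodes = [elem.text] + [child.tail for child in elem]
    return [t for t in nodes if t]

def contains_phrase(text, phrase):
    """True if the tokens of phrase occur consecutively in text."""
    toks, want = text.lower().split(), phrase.lower().split()
    return any(toks[i:i + len(want)] == want
               for i in range(len(toks) - len(want) + 1))

def elements_matching(root, phrase):
    """Tags of elements with at least one own text node containing phrase."""
    return [e.tag for e in root.iter()
            if any(contains_phrase(t, phrase) for t in own_text_nodes(e))]

root = ET.fromstring("<a>X <b>Y</b> Z</a>")
print(elements_matching(root, "X"))    # single term, found in <a>
print(elements_matching(root, "Y"))    # single term, found in <b>
print(elements_matching(root, "X Y"))  # phrase crosses the <b> boundary
```

The single terms are found, but the phrase 'X Y' yields an empty list because
no individual text node contains both tokens.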
Christian
...
...
...
While I concede this may be useful in numerous use cases (and may even
seem obvious), it would take quite some time to get implemented, so...
please don't expect too much magic for the moment. There will also be
some conceptual issues that need to be resolved. As an example, which
result would you expect for the following query?
ft:mark(<a>X <b>Y</b> Z</a>[. contains text 'X Y'])
I think it should be
<a><mark>X</mark> <b><mark>Y</mark></b> Z</a>
Each token from the search string would be enclosed in a <mark> element.
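As a rough illustration of that expected result, the following Python sketch
wraps every matching token in a <mark> element while leaving the surrounding
elements untouched. It is not how ft:mark works internally; mark_tokens() is
a hypothetical helper, and it marks each term independently, so it ignores
whether the terms actually form a phrase.

```python
import re
from typing import Iterable
from xml.dom import minidom

def mark_tokens(xml: str, terms: Iterable[str]) -> str:
    """Wrap each occurrence of a search term in <mark>, keeping the structure."""
    doc = minidom.parseString(xml)
    wanted = {t.lower() for t in terms}

    def visit(node):
        # Iterate over a snapshot, since the child list is modified below.
        for child in list(node.childNodes):
            if child.nodeType == child.TEXT_NODE:
                # The capturing group keeps the word tokens in the split result.
                for piece in re.split(r"(\w+)", child.data):
                    if not piece:
                        continue
                    text = doc.createTextNode(piece)
                    if piece.lower() in wanted:
                        mark = doc.createElement("mark")
                        mark.appendChild(text)
                        node.insertBefore(mark, child)
                    else:
                        node.insertBefore(text, child)
                node.removeChild(child)
            elif child.nodeType == child.ELEMENT_NODE:
                visit(child)

    visit(doc.documentElement)
    return doc.documentElement.toxml()

print(mark_tokens("<a>X <b>Y</b> Z</a>", ["X", "Y"]))
# prints <a><mark>X</mark> <b><mark>Y</mark></b> Z</a>
```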
Exactly.  While this probably wouldn't cover *all* possible scenarios,
it would still cover most of the useful ones.  In fact, it would be
similar to http://www.raymondhill.net/blog/?p=272.  It would also be
applicable when ignoring elements in a search.
For complex applications it may help to get the start and end character
positions of the matches (essentially standoff markup), and the
application could then do the highlighting itself on the basis of this
information.
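A standoff-style result could look like the following Python sketch. It
assumes that positions are counted in the string value of the searched
element and that tokens are separated by whitespace; standoff_matches() is a
hypothetical name, not an existing BaseX function.

```python
import re
import xml.etree.ElementTree as ET

def standoff_matches(xml, phrase):
    """Return (start, end) character offsets of phrase in the string value."""
    text = "".join(ET.fromstring(xml).itertext())
    # Tokens may be separated by any amount of whitespace in the document.
    body = r"\s+".join(re.escape(tok) for tok in phrase.split())
    pattern = r"\b" + body + r"\b"
    return [(m.start(), m.end())
            for m in re.finditer(pattern, text, re.IGNORECASE)]

# The string value is "X Y Z", so the phrase covers offsets 0 to 3.
print(standoff_matches("<a>X <b>Y</b> Z</a>", "X Y"))  # prints [(0, 3)]
```

The application would then be free to decide how to render these ranges.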
[...]
...
...
If you don't need the inner elements, you may as well remove them from
your document before applying ft:mark().
This is a great idea if you only want to know whether the search
terms occur somewhere in your text.
However, if you would like to show the results to end users (=
humanities people) or to annotate the document further, it's not a
good idea to destroy the original structure. Otherwise one would have
to come up with some tricky workaround: first replace the
hierarchical node with a flat one for searching, then annotate the
flat copy, and finally somehow merge the annotations back into the
original node while preserving its hierarchy.
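One conceivable form of such a workaround, sketched in Python under the
assumption that a match is known as a character range in the flattened text:
record the offsets of every text chunk, then translate the range back into
slices of the original chunks. chunk_spans() and locate() are hypothetical
helpers, and the sketch only reports where the match lies; it does not
rewrite the document.

```python
import xml.etree.ElementTree as ET

def chunk_spans(xml):
    """Return (start, end, text) for every text chunk, in document order."""
    spans, pos = [], 0
    for chunk in ET.fromstring(xml).itertext():
        spans.append((pos, pos + len(chunk), chunk))
        pos += len(chunk)
    return spans

def locate(xml, start, end):
    """Map the flat range [start, end) to (chunk index, local start, local end)."""
    hits = []
    for index, (c_start, c_end, _chunk) in enumerate(chunk_spans(xml)):
        lo, hi = max(start, c_start), min(end, c_end)
        if lo < hi:  # the match overlaps this chunk
            hits.append((index, lo - c_start, hi - c_start))
    return hits

# Chunks are "X ", "Y" and " Z"; the range 0..3 ("X Y") touches the first two.
print(locate("<a>X <b>Y</b> Z</a>", 0, 3))  # prints [(0, 0, 2), (1, 0, 1)]
```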
Even for searching only, the typical scenario is a TEI document representing an
old printed book with highlighting (e.g., some things in italics),
foreign-language words printed in a different font, person names
already marked, etc. The TEI rendering is intended to mimic the
original printed page. When implementing a full-text search, the end
user expects to see the highlighted search tokens within the rendered
page. Therefore the "easiest" way is to search in descendant nodes and
use ft:mark to highlight the hits, without any need to change the TEI
rendering. This would also allow the end user to not only see the node
where the search string was found, but scroll up and down to inspect
the context of the node.
I fully agree, this is exactly what I need in my application: I don't
want to retrieve snippets from the document, but I always have to
display the full document with the hits highlighted.
What I'm going to do now is probably highlight the full paragraph which
contains the node retrieved by the search, i.e., get the node ID, walk
up the tree until I encounter a <p> and get its @xml:id, which I can
then use in a CSS stylesheet.  Or something like this.  But this is
clearly only an approximation.
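The walk up the tree might look roughly like this in Python. The sketch
assumes the hit is available as an element; enclosing_paragraph_id() is a
hypothetical helper, and a parent map is built because ElementTree elements
carry no parent pointers. Namespaces on <p> (such as the TEI namespace) are
ignored by comparing local names only.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def enclosing_paragraph_id(root, hit):
    """Return the xml:id of the nearest <p> that is hit itself or an ancestor."""
    parents = {child: parent for parent in root.iter() for child in parent}
    node = hit
    while node is not None:
        if node.tag.rsplit("}", 1)[-1] == "p":  # compare the local name
            return node.get(XML_ID)
        node = parents.get(node)
    return None

doc = ET.fromstring(
    '<body><p xml:id="p1">Old <hi>book</hi></p>'
    '<p xml:id="p2">by <name><hi>Y</hi></name></p></body>'
)
hit = doc.find(".//name/hi")
print(enclosing_paragraph_id(doc, hit))  # prints p2
```

The returned ID could then be used in a CSS selector to highlight the whole
paragraph.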
Best regards
--
Dr.-Ing. Michael Piotrowski, M.A. mxp@cl.uzh.ch
Institute of Computational Linguistics, University of Zurich
Phone +41 44 63-54313 | OpenPGP public key ID 0x1614A044

OUT NOW: Systems and Frameworks for Computational Morphology
http://www.springeronline.com/978-3-642-23137-7


BaseX-Talk mailing list
BaseX-Talk@mailman.uni-konstanz.de
https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk
