Re: [basex-talk] Full-text search and mixed content

9 May 2012


Hi Christian,
Quoting Christian Grün christian.gruen@gmail.com:
...
...
Thanks for the information, that's good to know. I think I'll file an
enhancement request then: For text-oriented applications (e.g., TEI
documents), it would be extremely useful if ft:mark would work with
descendant elements; typically you have lots of mixed content, with
elements containing rendering information or annotations, such as <hi>,
<orig>, <corr>, <persName>, <placeName>, <handShift>, etc.: These
elements don't interrupt the logical text flow.
While I concede this may be useful in numerous use cases (and may even
seem obvious), it would take quite some time to get implemented, so...
please don't expect too much magic for the moment. There will also be
some conceptual issues that need to be resolved. As an example, which
result would you expect for the following query?
ft:mark(<a>X <b>Y</b> Z</a>[. contains text 'X Y'])
I think it should be
<a><mark>X</mark> <b><mark>Y</mark></b> Z</a>
Each token from the search string would be enclosed in a <mark> element.
I once asked for continuous marking, i.e., a single opening and a single
closing mark tag around a search string consisting of several words.
That would not be suitable where I expect the hit to span inner
elements, because the markup would overlap. But this might be solved
with an option for ft:mark that applies the marking either continuously
or to each token found.
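
To make the per-token variant concrete, here is a minimal sketch outside BaseX, in Python with the standard xml.dom.minidom module. It is not how ft:mark works and the function name mark_tokens is made up for illustration. It also simplifies: every occurrence of a query word is marked, whether or not it belongs to a phrase hit.

```python
import re
from xml.dom import minidom


def mark_tokens(xml_text, query, tag="mark"):
    """Wrap each word of `query` found in any text node in a <tag> element.

    Hypothetical helper: it marks every occurrence of each query word,
    without checking that the words form a phrase.
    """
    wanted = {w.lower() for w in re.findall(r"\w+", query)}
    doc = minidom.parseString(xml_text)

    def visit(node):
        # Iterate over a snapshot, because the child list is modified below.
        for child in list(node.childNodes):
            if child.nodeType == child.TEXT_NODE:
                # Split the text into word runs and non-word runs, keeping both.
                for piece in re.findall(r"\w+|\W+", child.data):
                    if piece.lower() in wanted:
                        el = doc.createElement(tag)
                        el.appendChild(doc.createTextNode(piece))
                        node.insertBefore(el, child)
                    else:
                        node.insertBefore(doc.createTextNode(piece), child)
                node.removeChild(child)
            elif child.nodeType == child.ELEMENT_NODE:
                visit(child)

    visit(doc.documentElement)
    return doc.documentElement.toxml()


print(mark_tokens("<a>X <b>Y</b> Z</a>", "X Y"))
# prints: <a><mark>X</mark> <b><mark>Y</mark></b> Z</a>
```

Because each mark element stays inside a single text node, the inner <b> element is never crossed and no overlapping markup can arise.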
...
If you don't need the inner elements, you may as well remove them from
your document before applying ft:mark().
This is a great idea if you only want to know whether the search
terms occur somewhere in your text.
However, if you would like to show the results to end users (=
humanities people) or to annotate the document further, it's not a
good idea to destroy the original structure. Or maybe one would have
to come up with some tricky workaround: first replace the
hierarchical node with a flat one for searching, then annotate
the flat copy, and somehow map the annotations back onto the original
hierarchical node while preserving its hierarchy.
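
Such a workaround could look roughly like the following sketch, again in Python with xml.dom.minidom rather than in BaseX. The function names text_nodes and mark_phrase are hypothetical, only the first hit is handled, and the phrase is matched literally, ignoring case. The text nodes are concatenated into a flat string, the phrase is searched there, and the character offsets of the hit are mapped back onto the original text nodes.

```python
import re
from xml.dom import minidom


def text_nodes(node):
    """Yield all descendant text nodes in document order."""
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            yield child
        else:
            yield from text_nodes(child)


def mark_phrase(xml_text, phrase, tag="mark"):
    """Mark the first occurrence of `phrase`, even if it spans inner elements."""
    doc = minidom.parseString(xml_text)
    nodes = list(text_nodes(doc.documentElement))
    flat = "".join(n.data for n in nodes)  # markup-free copy used for searching
    hit = re.search(re.escape(phrase), flat, re.IGNORECASE)
    if hit is None:
        return doc.documentElement.toxml()

    pos = 0
    for n in nodes:
        start, end = pos, pos + len(n.data)
        pos = end
        # Part of the hit that falls inside this text node (node-local offsets).
        lo = max(hit.start(), start) - start
        hi = min(hit.end(), end) - start
        if lo >= hi:
            continue  # the hit does not touch this text node
        before, inside, after = n.data[:lo], n.data[lo:hi], n.data[hi:]
        parent = n.parentNode
        if before:
            parent.insertBefore(doc.createTextNode(before), n)
        el = doc.createElement(tag)
        el.appendChild(doc.createTextNode(inside))
        parent.insertBefore(el, n)
        if after:
            parent.insertBefore(doc.createTextNode(after), n)
        parent.removeChild(n)
    return doc.documentElement.toxml()


print(mark_phrase("<a>X <b>Y</b> Z</a>", "X Y"))
# prints: <a><mark>X </mark><b><mark>Y</mark></b> Z</a>
```

The original hierarchy stays intact, but a hit that crosses an element boundary ends up as one mark element per text node, which is exactly the overlap problem of continuous marking.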
And even for searching only, a typical scenario is a TEI document
representing an old printed book with highlighting (e.g., some things
in italics), foreign-language words printed in a different font, person
names already marked, etc. The TEI rendering is intended to mimic the
original printed page. With a full-text search, the end
user expects to see the highlighted search tokens within the rendered
page. Therefore the "easiest" way is to search in descendant nodes and
use ft:mark to highlight the hits, without any need to change the TEI
rendering. This would also allow the end user not only to see the node
where the search string was found, but also to scroll up and down to
inspect the context of the node.
But maybe for searching and displaying the results in the original  
document, one would have to develop a bigger application.
Best regards
Cerstin
-- 
Dr. phil. Cerstin Mahlow

Universität Basel
Departement Sprach- und Literaturwissenschaften
Fachbereich Deutsche Sprach- und Literaturwissenschaft
Nadelberg 4
4051 Basel
Schweiz

Tel:  +41 61 267 07 65
Fax: +41 61 267 34 40
Mail: cerstin.mahlow@unibas.ch
Web: http://www.oldphras.net

