Hi Christian,
Quoting Christian Grün christian.gruen@gmail.com:
//*[text() contains text "A" ftand ftnot 'C']
Thanks, this seems to work. However, I encountered strange behavior, which is probably related to mixed content.
Given this document:
<doc>
  <p>1 Ich fresse Dich mit Haut und Haar <pb/> und allem drum und dran.</p>
  <p>2 Ich fresse Dich mit Haut und <pb/> Haar und allem drum und dran.</p>
  <p>3 Ich fresse Dich mit Haut und Fell und allem drum und dran.</p>
  <p>4 Ich fresse Dich mit Haut und Pelz und allem drum und dran.</p>
  <p>5 Ich werde Dich mit Haut und Haar <pb/> und allem drum und dran fressen.</p>
  <p>6 Du kannst mich mit Haut und Haar und allem drum und dran fressen.</p>
</doc>
From this document I created a collection with whitespace chopping OFF and stemming for German ON, and then ran these queries:
(1) //*[text() contains text ("Haut" ftand "fressen") using stemming using language "de"]
(2) //*[text() contains text ("Haut" ftand "fressen" ftand ftnot "Haar") using stemming using language "de"]
(1) should return all <p> nodes, but does not return 5.
(2) should return 3 and 4, but returns 2, 3, and 4.
Is it correct that, when a node is searched, only the text _before_ the first child node is considered, i.e. for the first <p> node only up to "Haar", for the second only up to "und", and for the fifth only up to "Haar"?
So everything that follows a child node inside a given node is ignored? TEI documents contain a lot of nodes such as page breaks or line breaks, which carry no relevant text, only rendering information, so this is rather irritating. As far as I can see, there is no way to search the whole text of a paragraph or line node.
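To make the suspicion concrete, here is a small Python sketch. It does not call BaseX and says nothing definitive about how BaseX works internally; it only mimics two possible readings of the queries on the sample document: (a) every text node directly below a <p> is tested on its own, and (b) the whole string value of the <p> is tested. The helper names (tokens, has, matches, text_nodes, hits) are made up for illustration, and the prefixes "haut", "fress" and "haar" are a crude, assumed stand-in for real German stemming.

```python
import re
import xml.etree.ElementTree as ET

# The sample document from above, one <p> per line.
DOC = """<doc>
<p>1 Ich fresse Dich mit Haut und Haar <pb/> und allem drum und dran.</p>
<p>2 Ich fresse Dich mit Haut und <pb/> Haar und allem drum und dran.</p>
<p>3 Ich fresse Dich mit Haut und Fell und allem drum und dran.</p>
<p>4 Ich fresse Dich mit Haut und Pelz und allem drum und dran.</p>
<p>5 Ich werde Dich mit Haut und Haar <pb/> und allem drum und dran fressen.</p>
<p>6 Du kannst mich mit Haut und Haar und allem drum und dran fressen.</p>
</doc>"""


def tokens(text):
    """Lowercased word tokens of a string."""
    return re.findall(r"\w+", text.lower())


def has(text, stem):
    """Assumed stand-in for stemming: some token starts with the stem."""
    return any(tok.startswith(stem) for tok in tokens(text))


def matches(text, required, forbidden=()):
    """All required stems occur in the text and no forbidden stem does."""
    return (all(has(text, s) for s in required)
            and not any(has(text, s) for s in forbidden))


def text_nodes(p):
    """Text directly below <p>: the part before the first child element
    plus the part following each child element."""
    parts = [p.text or ""] + [child.tail or "" for child in p]
    return [s for s in parts if s.strip()]


def hits(required, forbidden=(), per_text_node=True):
    """Numbers of the paragraphs that match under the chosen reading."""
    result = []
    for p in ET.fromstring(DOC).iter("p"):
        whole = "".join(p.itertext())
        candidates = text_nodes(p) if per_text_node else [whole]
        if any(matches(c, required, forbidden) for c in candidates):
            result.append(int(tokens(whole)[0]))
    return result


QUERY_1 = dict(required=("haut", "fress"))
QUERY_2 = dict(required=("haut", "fress"), forbidden=("haar",))

for name, query in (("(1)", QUERY_1), ("(2)", QUERY_2)):
    print(name, "per text node:", hits(per_text_node=True, **query))
    print(name, "whole string: ", hits(per_text_node=False, **query))
```

Under reading (a) the sketch gives 1, 2, 3, 4, 6 for query (1) and 2, 3, 4 for query (2), which is what the queries actually returned. Under reading (b) it gives all six paragraphs for (1) and 3, 4 for (2); paragraph 1 is excluded either way because "Haar" occurs before the <pb/>. For this particular document, looking only at the text before the first child element produces the same hits as reading (a), so the sample alone cannot tell those two explanations apart.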
Best regards
Cerstin