Hi,
This may be a question about XQuery Full Text, or only about common usage (or misusage?) of XPath; in either case I hope it's on topic. Please tell me if not.
In BaseX [A]:
let $test :=
  <test>
    <p>The apple <em>never</em> falls far from the tree.</p>
    <p><!-- comment -->Apples and trees.</p>
    <p>Trees and <!-- comment --> apples.</p>
    <p><fruit>Apple</fruit> trees.</p>
  </test>
return $test/*[text() contains text ('apple' ftand 'tree') using stemming using language 'en']
This returns
<p> <!-- comment --> Apples and trees.</p>
As an experienced XPath user, this is what I expect, assuming "contains text" allows a sequence of nodes as its first argument (and returns true if any of them satisfies the test). Only the second 'p' element has a child text node whose value contains both "apple" and "tree".
Of course the problem in the others is the mixed content: in the first, an element node 'em' intervenes, while in the third, a comment intervenes, so both these cases contain text nodes with either "apple" or "tree", but not both. In the case of the fourth 'p', there is no text node child containing "apple" at all, only a grandchild.
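To make the mixed-content effect visible, here is a minimal sketch that lists, for each 'p', the child text nodes that text() selects next to the element's string value. It assumes the same $test fragment as above and the default boundary-space handling; the wrapper element names (p-info, text-node, string-value) are made up for illustration.

```xquery
let $test :=
  <test>
    <p>The apple <em>never</em> falls far from the tree.</p>
    <p><!-- comment -->Apples and trees.</p>
    <p>Trees and <!-- comment --> apples.</p>
    <p><fruit>Apple</fruit> trees.</p>
  </test>
for $p at $i in $test/p
return
  (: p-info, text-node and string-value are illustrative names only :)
  <p-info n="{ $i }" text-nodes="{ count($p/text()) }">{
    (: text() selects child text nodes only; 'em', 'fruit' and comments split or hide them :)
    (for $t in $p/text() return <text-node>{ string($t) }</text-node>),
    (: string() concatenates all descendant text nodes, ignoring markup and comments :)
    <string-value>{ string($p) }</string-value>
  }</p-info>
```

Each text-node entry is one item that "contains text" tests on its own in [A], which is why both search terms have to occur inside the same child text node.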
Assuming I want all four back, I can write either:
[B] return $test/*[string() contains text ('apple' ftand 'tree') using stemming using language 'en']
or
[C] return $test/*[. contains text ('apple' ftand 'tree') using stemming using language 'en']
In the case of [B], the string() function casts the element to a string, flattening its structure. [C] passes the element itself to the "contains text" operation, which happily has the same effect.
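The three variants can be put side by side in one query. This is only a sketch against the same inline $test fragment; the element name 'hits' and its attributes are illustrative, and the counts it reports should be checked against the results described above (one hit for [A], all four for [B] and [C]) rather than taken from this note.

```xquery
let $test :=
  <test>
    <p>The apple <em>never</em> falls far from the tree.</p>
    <p><!-- comment -->Apples and trees.</p>
    <p>Trees and <!-- comment --> apples.</p>
    <p><fruit>Apple</fruit> trees.</p>
  </test>
(: [A] each child text node is tested separately :)
let $a := $test/*[text() contains text ('apple' ftand 'tree')
                  using stemming using language 'en']
(: [B] the string value of the element is tested :)
let $b := $test/*[string() contains text ('apple' ftand 'tree')
                  using stemming using language 'en']
(: [C] the element node itself is passed to "contains text" :)
let $c := $test/*[. contains text ('apple' ftand 'tree')
                  using stemming using language 'en']
return <hits A="{ count($a) }" B="{ count($b) }" C="{ count($c) }"/>
```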
I have several related questions about this:
1. Unless I learn better, I'm going to prefer [B] or [C], because in my world, mixed content is common; is there any reason (performance or otherwise) to prefer [A] in cases where I know it will be robust? Is there any reason to prefer [B] or prefer [C]?
2. I see examples like [A] offered frequently in the XQuery literature, of "text()" being used apparently to refer to an element's string (text) value, not to its text node children. And I see this usage in running code. I can only imagine that those who write it are simply not aware that mixed content will complicate their queries like this; maybe they have just never thought about it, or they don't know what text() actually does. In any case, the error is pernicious, since nothing tells you the query you gave isn't the one you intended -- it even works, until the day it doesn't, and the cases where it gives correct but unwanted results may be rare.
But maybe I'm wrong and they just know something about XQuery, XQuery FT, or their tools, that I don't.
What do the experts say?
Cheers, Wendell
-- Wendell Piez | http://www.wendellpiez.com XML | XSLT | electronic publishing Eat Your Vegetables _____oo_________o_o___ooooo____ooooooo_^
On Monday 28 January 2013 16:27:18 Wendell Piez wrote:
I have several related questions about this:
- Unless I learn better, I'm going to prefer [B] or [C], because in my world, mixed content is common; is there any reason (performance or otherwise) to prefer [A] in cases where I know it will be robust?
When you use BaseX, there is a good chance that a full-text index will be used for [A] (if one is available in the database), so a significant performance gain can be achieved.
Is there any reason to prefer [B] or prefer [C]?
I don't think it makes any difference when you use BaseX, but Christian can tell you better.
- I see examples like [A] offered frequently in the XQuery literature, of "text()" being used apparently to refer to an element's string (text) value, not to its text node children. And I see this usage in running code. I can only imagine that those who write it are simply not aware that mixed content will complicate their queries like this; maybe they have just never thought about it, or they don't know what text() actually does. In any case, the error is pernicious, since nothing tells you the query you gave isn't the one you intended -- it even works, until the day it doesn't, and the cases where it gives correct but unwanted results may be rare.
But maybe I'm wrong and they just know something about XQuery, XQuery FT, or their tools, that I don't.
What do the experts say?
Hm, I'm not an expert, but this doesn't look like a question to me ;) Anyway, a quote from the W3C XQuery Full-Text spec [1] says it all:
"Some XML elements represent semantic markup, e.g., <title>. Others represent formatting markup, e.g., <b> to indicate bold. Semantic markup serves well as token boundaries. Some formatting markup serves well as token boundaries; for example, paragraphs are most commonly delimited by formatting markup. Other formatting markup may not serve well as token boundaries. Implementations are free to provide implementation-defined ways to differentiate between the markup's effect on token boundaries during tokenization."
So, the short answer is that BaseX does not provide a way to "differentiate between the markup's effect on token boundaries".
I hope this helps.
Regards, Dimitar
[1] http://www.w3.org/TR/xpath-full-text-10/#tq-ftsearch-xml
On Mon, 2013-01-28 at 16:27 -0500, Wendell Piez wrote: [...]
- I see examples like [A] offered frequently in the XQuery literature, of "text()" being used apparently to refer to an element's string (text) value not to its text node children.
I think it is common for people to suppose text(foo) returns foo as a string, i.e. as what they think of as text, rather than being a node test. I intended to mention this in the XQuery chapter of "Beginning XML" but I do not remember now whether or not I did so.
Liam
Hi Wendell,
On 28.01.2013 at 22:27, Wendell Piez wrote:
1. Unless I learn better, I'm going to prefer [B] or [C], because in my world, mixed content is common; is there any reason (performance or otherwise) to prefer [A] in cases where I know it will be robust? Is there any reason to prefer [B] or prefer [C]?
My world is a world of mixed content, too. So with queries like [A], you miss a lot of things you want to retrieve. However, [A] is the only way to make use of the index. So with [B] or [C] you might in principle get all the hits you are interested in, but in practice you will never get them because of performance issues.
Flattening the structure in the first place, i.e., getting rid of all non-structural information not really relevant for your particular query, and then applying [A] would be a bad idea when your user scenario involves inspecting the hits in the original context, i.e., including all formatting, and annotating hits back into the original text.
As I see it, the handling of mixed content is the biggest obstacle when working with BaseX in the Humanities.
For some reason, eXist seems capable of handling mixed content AND using the index. But when I experimented with it, it wasn't that stable, so I came back to BaseX, and my users know that it is very likely some hits will be missed when querying the corpus. However, for every "query" they are interested in, they formulate various XQueries with different search terms -- this way they get hold of almost everything eXist was able to find. I can show some examples at the BaseX user meeting in Prague.
Best regards
Cerstin -- Dr. phil. Cerstin Mahlow
Universität Basel Departement Sprach- und Literaturwissenschaften Fachbereich Deutsche Sprach- und Literaturwissenschaft Nadelberg 4 4051 Basel Schweiz
Tel: +41 61 267 07 65 Fax: +41 61 267 34 40 Mail: cerstin.mahlow@unibas.ch Web: http://www.oldphras.net/
Dear Wendell,
If you query structured documents, query [C] will be automatically optimized to [A]; this does not apply, however, if the addressed element contains other elements, as is the case for mixed content.
As Cerstin indicated, the text index is based on text nodes, so the main reason for using [A] is to take advantage of the full-text index. Discussion on this list and some more use cases have shown that the current solution is quite restrictive when it comes to mixed content. One future option could be to extend the full-text index to also support queries across element boundaries. Several indexing techniques exist for that approach, so it’s mainly a question of finding someone to implement it in our core.
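Until something like that exists, one possible workaround is to pre-select candidates through a single-term test on text nodes and only then check the full condition on the element's string value. The sketch below is an assumption, not a documented BaseX recipe: it runs against the inline $test fragment from the original post (where no index is involved at all), and whether a given BaseX version rewrites the first step for index access on a real database would have to be checked in the query info.

```xquery
let $test :=
  <test>
    <p>The apple <em>never</em> falls far from the tree.</p>
    <p><!-- comment -->Apples and trees.</p>
    <p>Trees and <!-- comment --> apples.</p>
    <p><fruit>Apple</fruit> trees.</p>
  </test>
return
  (: step 1: a single-term test on text nodes, the shape the index is built on :)
  $test//text()[. contains text 'apple' using stemming using language 'en']
    (: step 2: climb to the enclosing 'p' and test both terms on its string value :)
    /ancestor::p[. contains text ('apple' ftand 'tree')
                 using stemming using language 'en']
```

The pre-selection is only safe because every hit for 'apple' ftand 'tree' must contain 'apple' in some descendant text node; a term that is split across markup would still be missed by step 1.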
Hope this helps, feel free to ask for more, Christian
BaseX-Talk mailing list BaseX-Talk@mailman.uni-konstanz.de https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk