On Monday 28 January 2013 16:27:18 Wendell Piez wrote:
I have several related questions about this:
- Unless I learn better, I'm going to prefer [B] or [C], because in my world, mixed content is common; is there any reason (performance or otherwise) to prefer [A] in cases where I know it will be robust?
With [A] in BaseX, there is a good chance that a full-text index will be used (if one is available in the database), which can give a significant performance gain.
Is there any reason to prefer [B] or prefer [C]?
I don't think it makes any difference in BaseX, but Christian can tell you better.
- I see examples like [A] offered frequently in the XQuery literature, of "text()" being used apparently to refer to an element's string (text) value, not to its text node children. And I see this usage in running code. I can only imagine that those who write it are simply not aware that mixed content will complicate their queries like this; maybe they have just never thought about it, or they don't know what text() actually does. In any case, the error is pernicious, since nothing tells you the query you gave isn't the one you intended -- it even works, until the day it doesn't, and the cases where it gives correct but unwanted results may be rare.
But maybe I'm wrong and they just know something about XQuery, XQuery FT, or their tools, that I don't.
What do the experts say?
Hm, I'm not an expert, but this doesn't look like a question to me ;) Anyway, a quote from the W3C XQuery Full-Text spec [1] says it all:
"Some XML elements represent semantic markup, e.g., <title>. Others represent formatting markup, e.g., <b> to indicate bold. Semantic markup serves well as token boundaries. Some formatting markup serves well as token boundaries; for example, paragraphs are most commonly delimited by formatting markup. Other formatting markup may not serve well as token boundaries. Implementations are free to provide implementation-defined ways to differentiate between the markup's effect on token boundaries during tokenization."
So, the short answer is that BaseX does not provide a way to "differentiate between the markup's effect on token boundaries".
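As for the text() pitfall with mixed content raised above, here is a rough sketch of why it bites. It is written in Python rather than XQuery, using only the standard library's ElementTree, and it uses a plain substring test as a crude stand-in for "contains text" (no tokenization, no index). The helper names and the sample <p> elements are made up for illustration; they are not taken from [A], [B] or [C].

```python
import xml.etree.ElementTree as ET


def text_node_children(elem):
    """Roughly what text() selects: only the element's own text-node children."""
    # ElementTree stores the text after each child element as that child's tail.
    nodes = [elem.text] + [child.tail for child in elem]
    return [t for t in nodes if t]


def string_value(elem):
    """Roughly what string(.) returns: all descendant text, concatenated."""
    return "".join(elem.itertext())


def match_text_nodes(elem, phrase):
    # The phrase must fit entirely inside one single text-node child.
    return any(phrase in t for t in text_node_children(elem))


def match_string_value(elem, phrase):
    # The phrase is looked up in the whole string value, across inline markup.
    return phrase in string_value(elem)


# Hypothetical sample data: the same sentence without and with inline markup.
plain = ET.fromstring("<p>The quick brown fox</p>")
mixed = ET.fromstring("<p>The <b>quick</b> brown fox</p>")

for name, p in (("plain", plain), ("mixed", mixed)):
    print(name,
          match_text_nodes(p, "quick brown"),
          match_string_value(p, "quick brown"))
```

On the plain paragraph both tests agree. On the mixed one the text-node test silently misses the phrase, because "quick" lives inside <b> and is not a text node child of <p>. That is the "works until the day it doesn't" behaviour described in the question.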
I hope this helps.
Regards, Dimitar
[1] http://www.w3.org/TR/xpath-full-text-10/#tq-ftsearch-xml