text() vs string() - BaseX-Talk - mailman.uni-konstanz.de

28 Jan 2013

Hi,
This may be a question about XQuery Full Text, or only about common
usage (or misuse?) of XPath; in either case I hope it's on topic.
Please tell me if not.
In BaseX [A]:
let $test :=
  <test>
    <p>The apple <em>never</em> falls far from the tree.</p>
    <p><!-- comment -->Apples and trees.</p>
    <p>Trees and <!-- comment --> apples.</p>
    <p><fruit>Apple</fruit> trees.</p>
  </test>
return
  $test/*[text() contains text ('apple' ftand 'tree')
          using stemming using language 'en']
This returns
<p>
  <!-- comment -->
      Apples and trees.</p>
As an experienced XPath user, this is what I expect, assuming
"contains text" allows a sequence of nodes as its first argument (and
returns true if any of them satisfies the test). Only the second 'p'
element has a child text node whose value contains both "apple" and
"tree".
Of course the problem in the others is the mixed content: in the
first, an element node 'em' intervenes, while in the third, a comment
intervenes, so both these cases contain text nodes with either "apple"
or "tree", but not both. In the case of the fourth 'p', there is no
text node child containing "apple" at all, only a grandchild.
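The same point can be checked without Full Text at all. A minimal
sketch, assuming the same $test fragment as in [A]; the result element
and attribute names (p-info, t, n, text-nodes) are made up for
illustration only:

```xquery
(: Illustration only: list the text node children of each p.
   The names p-info, t, n and text-nodes are hypothetical. :)
let $test :=
  <test>
    <p>The apple <em>never</em> falls far from the tree.</p>
    <p><!-- comment -->Apples and trees.</p>
    <p>Trees and <!-- comment --> apples.</p>
    <p><fruit>Apple</fruit> trees.</p>
  </test>
for $p at $i in $test/p
return
  <p-info n="{$i}" text-nodes="{count($p/text())}">{
    (: one t element per text node child, in document order :)
    for $t in $p/text()
    return <t>{string($t)}</t>
  }</p-info>
```

The first and third 'p' should report two text nodes each and the
second and fourth one each, and only the single text node of the
second 'p' holds both words.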
Assuming I want all four back, I can write either:
[B] return
  $test/*[string() contains text ('apple' ftand 'tree')
          using stemming using language 'en']
or
[C] return
  $test/*[. contains text ('apple' ftand 'tree')
          using stemming using language 'en']
In the case of [B], the string() function returns the element's
string value (the concatenation of all its descendant text nodes),
flattening its structure. [C] passes the element itself to the
"contains text" operation, which happily has the same effect.
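To see the flattening that [B] relies on, here is a small sketch,
again assuming the same $test fragment. It only prints string values
and does not involve "contains text":

```xquery
(: Illustration only: the string value of each p, which is what
   string() hands to "contains text" in [B]. :)
let $test :=
  <test>
    <p>The apple <em>never</em> falls far from the tree.</p>
    <p><!-- comment -->Apples and trees.</p>
    <p>Trees and <!-- comment --> apples.</p>
    <p><fruit>Apple</fruit> trees.</p>
  </test>
for $p in $test/p
return string($p)
```

Each of the four strings contains both words: element boundaries and
comments disappear, and the text inside 'em' and 'fruit' is included.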
I have several related questions about this:
1. Unless I learn better, I'm going to prefer [B] or [C], because in
my world, mixed content is common; is there any reason (performance or
otherwise) to prefer [A] in cases where I know it will be robust? Is
there any reason to prefer [B] or prefer [C]?
2. I see examples like [A] offered frequently in the XQuery
literature, of "text()" being used apparently to refer to an element's
string (text) value, not to its text node children. And I see this
usage in running code. I can only imagine that those who write it are
simply not aware that mixed content will complicate their queries like
this; maybe they have just never thought about it, or they don't know
what text() actually does. In any case, the error is pernicious, since
nothing tells you the query you gave isn't the one you intended -- it
even works, until the day it doesn't, and the cases where it gives
correct but unwanted results may be rare.
But maybe I'm wrong and they just know something about XQuery, XQuery
FT, or their tools, that I don't.
What do the experts say?
Cheers, Wendell
--
Wendell Piez | http://www.wendellpiez.com
XML | XSLT | electronic publishing
Eat Your Vegetables
_____oo_________o_o___ooooo____ooooooo_^