Free Text - understanding - BaseX-Talk - mailman.uni-konstanz.de

14 Apr 2022


Dear BaseX people,
it would be kind if you could check my understanding of Free Text @ BaseX.
(1) Tokenization ignores element borders.
If correct, I suggest documenting this fact, see https://www.w3.org/TR/xpath-full-text-10/#TokenizationSec:
"In the absence of an implementation-defined way to differentiate, element markup (start tags, end tags, and empty-element tags) creates token boundaries."
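A minimal query to check point (1), offered as a sketch: the inline document and its element names are illustrative assumptions, not taken from a real data set. The two text nodes are adjacent with no whitespace between them, so the two tests distinguish the two possible behaviours.

```xquery
(: Hypothetical test document: "Stand" and "der" meet at an element border. :)
let $doc := <doc><t1>Stand</t1><t2>der</t2></doc>
return (
  (: true only if the element border does NOT create a token boundary :)
  $doc contains text 'Standder',
  (: true only if the element border DOES create a token boundary :)
  $doc contains text 'Stand der'
)
```

If tokenization ignores element borders, the first test should succeed and the second should fail; under the default behaviour quoted from the specification it would be the other way round.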
(2) Function ft:search can only find individual text nodes; it is not possible to apply the scope "phrase" or "all words" beyond the boundaries of an individual text node. So, for example, given a document
<doc>
    <t1>Stand </t1>
    <t2>der Information. Siehe unten.</t2>
</doc>
there is no way of searching for "Stand der Information" *and* obtaining information about the location of the match (in other words, searching via ft:search).
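The contrast in point (2) can be sketched as follows. This assumes a database named 'demo' (a hypothetical name) that contains the document above and was created with a full-text index; the option map uses the 'mode' option of ft:search.

```xquery
(: Assumption: database 'demo' holds the <doc> document and has a full-text index. :)

(: Index-based search: results are individual text nodes, so a phrase
   spanning <t1> and <t2> cannot be returned as one located match. :)
ft:search('demo', 'Stand der Information', map { 'mode': 'phrase' }),

(: Node-based test: evaluated against the whole <doc> element, but it only
   yields a boolean and gives no information about where the match is. :)
collection('demo')/doc contains text 'Stand der Information'
```

The second expression is the variant used in the examples under (3); it answers whether the phrase occurs, not where.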
(3) The unit "sentence" (as used, for example, in the qualifier "same sentence") is exclusively defined by the occurrences of "." (dot) characters. In particular, it is unrelated to text node boundaries. For example (with $doc bound to the document from (2)):
$doc contains text "Stand der Information siehe" same sentence
yields false.
$doc contains text "Stand der Information" same sentence
yields true.
(4) The unit "paragraph" (as for example used in the qualifier "same paragraph") is not delimited - "same paragraph" always applies.
A check would be highly appreciated!
Kind regards,
Hans-Jürgen
PS: I think there is a bug concerning "different sentence":
basex "'base.x' contains text 'base x' same sentence"
false
basex "'base.x' contains text 'base x' different sentence"
false
basex "'base x' contains text 'base x' same sentence"
true
basex "'base x' contains text 'base x' different sentence"
false
PPS: Thank you very much for the excellent implementation of Free Text - for several years, it has been in productive use by a mission critical service mapping format markup to semantic markup.