Dear BaseX people,

it would be kind if you could check my understanding of Free Text @ BaseX.

(1) Tokenization ignores element borders. If correct, I suggest documentation of the fact, see
https://www.w3.org/TR/xpath-full-text-10/#TokenizationSec
"In the absence of an implementation-defined way to differentiate, element markup (start tags, end tags, and empty-element tags) creates token boundaries."

(2) Function ft:search can only find individual text nodes, it is not possible to apply scope "phrase" or "all words" beyond the boundaries of an individual text node. So, for example, given a document

<doc> <t1>Stand </t1> <t2>der Information. Siehe unten.</t2></doc>

there is no way of searching for "Stand der Information" *and* obtain information about the location of the match (in other words - search via ft:search).

(3) The unit "sentence" (as for example used in the qualifier same sentence) is exclusively defined by the occurrences of "." (dot) characters. In particular, it is unrelated to text node boundaries. For example:

$doc contains text "Stand der Information siehe" same sentence
yields false.

$doc contains text "Stand der Information" same sentence
yields true.

(4) The unit "paragraph" (as for example used in the qualifier "same paragraph") is not delimited - "same paragraph" always applies.

A check would be highly appreciated!

Kind regards,
Hans-Jürgen

PS: I think there is a bug concerning "different sentence":

basex "'base.x' contains text 'base x' same sentence"
false
basex "'base.x' contains text 'base x' different sentence"
false
basex "'base x' contains text 'base x' same sentence"
true
basex "'base x' contains text 'base x' different sentence"
false

PPS: Thank you very much for the excellent implementation of Free Text - for several years, it has been in productive use by a mission critical service mapping format markup to semantic markup.
Hi Hans-Jürgen,
it would be kind if you could check my understanding of Free Text @ BaseX.
You mean Full Text?
(1) Tokenization ignores element borders.
You could say so. Before tokenization, a node that’s to be tokenized will be atomized, similar to when you apply fn:data to it. For example, the following function call returns 'hi' and 'there':
ft:tokenize(<div><b>H</b>i there</div>)
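To make the atomization step visible, the following sketch puts the atomized value next to the tokens derived from it. The variable name $node is only illustrative, and the comparison with fn:data is merely the analogy drawn above, not a statement about how BaseX implements tokenization internally.

```xquery
(: Illustrative sketch: $node is a hypothetical variable name. :)
let $node := <div><b>H</b>i there</div>
return (
  (: the atomized value: inline markup such as <b> leaves no trace :)
  data($node),
  (: the tokens computed for the same node :)
  ft:tokenize($node)
)
```

Since the <b> element disappears during atomization, "H" and "i" end up in one token rather than two.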
(2) Function ft:search can only find individual text nodes, it is not possible to apply scope "phrase" or "all words" beyond the boundaries of an individual text node.
Exactly. This could possibly change in a future version. Maybe you’ve seen the issue that I have mentioned in a previous mailing list thread [1]. I haven’t got any feedback on the proposal yet.
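Until then, one possible workaround might be to apply contains text to elements instead of text nodes, because the element is atomized before matching and the phrase may therefore span several text nodes. The sketch below is an assumption and not something proposed in this thread: it does not use ft:search or the full-text index, the variable names are hypothetical, and it only reports the innermost element whose content contains the phrase, not the exact position of the match.

```xquery
(: Sketch only: find the innermost element whose atomized content
   contains a phrase that crosses text node boundaries. :)
let $doc :=
  <doc>
    <t1>Stand </t1>
    <t2>der Information. Siehe unten.</t2>
  </doc>
let $phrase := 'Stand der Information'
for $e in $doc/descendant-or-self::*
(: keep elements that match while none of their child elements does :)
where ($e contains text { $phrase })
  and empty($e/*[. contains text { $phrase }])
return name($e)
```

For the example document this should point to the doc element, since neither t1 nor t2 contains the whole phrase on its own.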
(3) The unit "sentence" (as for example used in the qualifier same sentence) is exclusively defined by the occurrences of "." (dot) characters.
(4) The unit "paragraph" (as for example used in the qualifier "same paragraph") is not delimited - "same paragraph" always applies.
Unit detection is very basic. For Western languages, it’s currently limited to (3) dots, exclamation and question marks, and (4) to newlines [2].
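A small probe along these lines could be used to check the unit detection described above. The input string is made up for illustration; it contains a dot, an exclamation mark and a newline (written as &#10;) so that both sentence and paragraph boundaries occur. No results are listed here on purpose, as they depend on the BaseX version in use.

```xquery
(: Sketch: $text is a made-up input with two paragraphs. :)
let $text := 'First sentence. Second sentence!&#10;New paragraph?'
return (
  (: "first" and "second" are separated by a dot :)
  $text contains text 'first' ftand 'second' same sentence,
  $text contains text 'first' ftand 'second' different sentence,
  (: "first" and "second" share a line, "new" follows a newline :)
  $text contains text 'first' ftand 'second' same paragraph,
  $text contains text 'first' ftand 'new' different paragraph
)
```

If sentences end at dots, exclamation and question marks and paragraphs end at newlines, one would expect the first test to fail and the remaining three to succeed.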
basex "'base.x' contains text 'base x' different sentence" false
Surprising indeed; I will look at that [3].
Thanks and cheers, Christian
[1] https://github.com/BaseXdb/basex/issues/2079
[2] https://github.com/BaseXdb/basex/blob/da1e55d0214e44c1532f121c282021db50a9aa...
[3] https://github.com/BaseXdb/basex/issues/2088
Hi Hans-Jürgen,
PS: I think there is a bug concerning "different sentence":
basex "'base.x' contains text 'base x' same sentence" false
basex "'base.x' contains text 'base x' different sentence" false
After some attempts, I decided to stick with the current solution, as I believe it’s formally correct (albeit counter-intuitive). The specification merely indicates that:
“A scope selection selects matches which satisfy the operand full-text selection and for which the matched tokens and phrases are contained in the same scope or in different scopes.” [1]
It does not elaborate on what should happen when a phrase spans multiple scopes (sentences, paragraphs), and I didn’t manage to define concise rules that provide consistent results without considering various edge cases.
If your use case allows you to ignore the difference between token and phrase matches, it’s advisable to use the following syntax:
let $input := 'base.x'
let $tokens := ft:tokenize('base x')
return $input contains text { $tokens } all different sentence
Hope this helps, Christian
Many thanks for checking, Christian! I'll study the spec and get back to you should I come to a different conclusion.

Kind regards,
Hans-Jürgen
On Monday, 25 April 2022, 12:00:41 CEST, Christian Grün christian.gruen@gmail.com wrote:
[...]
[1] https://www.w3.org/TR/xpath-full-text-10/#ftscope
basex-talk@mailman.uni-konstanz.de