Hello
I encountered some strange things when tokenizing text. Sample runnable
code is added below. Here is my list of problems:
1) the regular expression "(\.){3}" does not give the same result as
"(\.\.\.)" when used with analyze-string(). Shouldn't they be equivalent?
2) a very annoying space character is placed next to the newline
produced by out:nl(). It appears after the newline if out:nl() is called
at the beginning of an element, and before the newline if out:nl() is
called at the end of an element.
E.g., the serialized output contains either "<s> this" or ". </s>"
(with the newline omitted here).
Sample runnable code:
for $text in ("this one... is the first.", "this one is second.")
return
  <s>{
    out:nl(),
    string-join(
      analyze-string($text, '(\.){3}|[\W]')//text()[not(.=" ")],
      out:nl()
    ),
    out:nl()
  }</s>
,
"--------- VERSUS --------- "
,
for $text in ("this one... is the first.", "this one is second.")
return
  <s>{
    out:nl(),
    string-join(
      analyze-string($text, '(\.\.\.)|[\W]')//text()[not(.=" ")],
      out:nl()
    ),
    out:nl()
  }</s>
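
A reduced test case may help to narrow both problems down. This is only a
sketch: the input string 'one...' and the literal "this" are made up for
illustration, and it assumes the same setup as above (out:nl() from the
BaseX Output Module, analyze-string() from the standard function library).
The first expression returns the raw analyze-string() result for each
pattern, so the structure of the two matches can be compared directly. The
second builds the same kind of element content without analyze-string(), to
check whether the extra space depends on it at all.

```xquery
(: Problem 1: raw analyze-string() result for each of the two patterns :)
for $pattern in ('(\.){3}', '(\.\.\.)')
return analyze-string('one...', $pattern),

(: Problem 2: newline, a plain string, newline, with no analyze-string() :)
<s>{ out:nl(), "this", out:nl() }</s>
```

If the second expression shows the same stray space, the regular expression
is probably not involved in problem 2.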