Hello
I encountered some strange things when tokenizing text. Sample runnable code is added below. Here is my list of problems:
1) the regular expression "(.){3}" doesn't match the same as "(...)". Shouldn't they be equal?
2) a very annoying whitespace is placed next to the newline of out:nl(). It is placed before the newline if out:nl() is called at the beginning of an element, or after the newline if out:nl() is called at the end of an element.
E.g., the serialized output is either "<s> this" or ". </s>".
Sample runnable code:
for $text in ("this one... is the first.", "this one is second.")
return <s>{
  out:nl(),
  string-join(
    analyze-string($text, '(.){3}|[\W]')//text()[not(. = " ")],
    out:nl()
  ),
  out:nl()
}</s>,
"--------- VERSUS --------- ",
for $text in ("this one... is the first.", "this one is second.")
return <s>{
  out:nl(),
  string-join(
    analyze-string($text, '(...)|[\W]')//text()[not(. = " ")],
    out:nl()
  ),
  out:nl()
}</s>
Hi Kristian,
> - the regular expression "(.){3}" doesn't match the same as "(...)".
> Shouldn't they be equal?
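They look similar indeed, but they are not equivalent. In "(.){3}", the group contains a single dot and is repeated three times, so all three characters are part of the resulting match, but they are not all part of the subordinate match group. "(.{3})" is probably what you are looking for: there, the repetition sits inside the group.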
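To see the difference, the three patterns can be compared on the same input. This is only a minimal sketch: the input 'abc' and the wrapper elements <result> and <token> are made up for illustration and are not part of the original query.

```xquery
(: Minimal sketch: list the text nodes that analyze-string() produces
   for each pattern. The input 'abc' and the element names <result>
   and <token> are illustrative assumptions. :)
for $regex in ('(.){3}', '(...)', '(.{3})')
return <result regex="{ $regex }">{
  for $t in analyze-string('abc', $regex)//text()
  return <token>{ string($t) }</token>
}</result>
```

For "(...)" and "(.{3})" I would expect a single token per match. For "(.){3}" the match should be split into more than one text node, because only part of it lies inside the group; that is what breaks the tokenization in the original query.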
> - a very annoying whitespace is placed next to the newline of out:nl(). It
> is placed before the newline if out:nl() is called at the beginning of an
> element, or after the newline if out:nl() is called at the end of an element.
This is not related to out:nl(), but to the way XQuery node construction works: adjacent atomic values in an enclosed expression are joined with a space (“The individual strings resulting from the previous step are merged into a single string by concatenating them with a single space character between each pair.” [1]). Simply use string-join() for concatenating all of the results, as you already do for the tokens:
let $regex := '(.{3})|[\W]'
let $text := "this one... is the first."
return <s>{
  string-join(
    (
      out:nl(),
      analyze-string($text, $regex)//text()[not(. = " ")],
      out:nl()
    ),
    out:nl()
  )
}</s>
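The node construction rule can also be seen in isolation. The following is a minimal sketch with the made-up values 'x' and 'y' and a made-up element name <a>:

```xquery
(: Minimal sketch; 'x', 'y' and the element name <a> are illustrative. :)

(: Two adjacent atomic values: a single space is inserted, giving <a>x y</a>. :)
<a>{ 'x', 'y' }</a>,

(: One pre-joined string: nothing is inserted, giving <a>xy</a>. :)
<a>{ string-join(('x', 'y'), '') }</a>
```

The same happens with out:nl(): its result is just another string in the sequence, so a space ends up between it and the neighbouring string unless everything is joined into one string first.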
Best, Christian
basex-talk@mailman.uni-konstanz.de