Hi Mark,
You should be able to build your own word boundary matcher using lookahead/lookbehind.
import module namespace functx = "http://www.functx.com" at "/Users/jenserat/Downloads/functx-1.0-nodoc-2007-01.xq";
let $string := "Lorem ipsum dolor sit amet"
let $m := "(?<!\w)(?=\w)" (: \m, matches start of a word :)
let $M := "(?<=\w)(?!\w)" (: \M, matches end of a word :)
let $b := string-join(("(", $m, "|", $M, ")")) (: \b, matches both :)
let $pattern := string-join(($M, ".{5}", $m))
return <result>
<does-match>{matches($string, $pattern)}</does-match>
<matches>{for $match in functx:get-matches($string, $pattern)
return <match>{$match}</match>}</matches>
</result>
Fortunately, `\m` and `\M` were available ready for use at [regular-expressions.info][1], a great regex reference (though the expressions are not very complex once you know about lookahead and lookbehind). In any case, `\b` shouldn't be anything but `(\m|\M)`.
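If you want to convince yourself of that equivalence outside of XQuery, here is a minimal Java sketch. It assumes the patterns end up in `java.util.regex` (which is where `\b` is available anyway); the class and method names are made up for illustration. It collects the offsets of all zero-length matches for the combined lookaround pattern and for `\b` on the same ASCII sample string, so the two lists can be compared.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BoundaryPositions {
    // Same sub-patterns as $m and $M in the query above.
    static final String WORD_START = "(?<!\\w)(?=\\w)";
    static final String WORD_END = "(?<=\\w)(?!\\w)";

    // Collect the start offsets of all matches of regex in input.
    static List<Integer> positions(String regex, String input) {
        List<Integer> result = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            result.add(matcher.start());
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "Lorem ipsum dolor sit amet";
        // Lookaround-based replacement for \b ...
        System.out.println(positions(WORD_START + "|" + WORD_END, input));
        // ... and the built-in boundary matcher for comparison.
        System.out.println(positions("\\b", input));
    }
}
```

For this input both lines should list the same offsets, i.e. every position where a word starts or ends.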
Unfortunately, there seems to be an issue in `functx:get-matches`, as its results are inconsistent with those of `matches`. Try querying `string-join(($M, ".{5}", $m))`, which should match " sit ".
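To have a reference result to compare against, the same pattern can be run directly through `java.util.regex`. This is only a sketch with hypothetical class and method names, and it says nothing about what FunctX does internally; it just shows what a plain find-all loop returns for the sample string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BoundedFiveChars {
    static final String WORD_START = "(?<!\\w)(?=\\w)"; // $m in the query
    static final String WORD_END = "(?<=\\w)(?!\\w)";   // $M in the query

    // Return every substring of input matched by regex, in order.
    static List<String> allMatches(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "Lorem ipsum dolor sit amet";
        // End of a word, exactly five characters, start of a word.
        String pattern = WORD_END + ".{5}" + WORD_START;
        for (String match : allMatches(pattern, input)) {
            System.out.println("[" + match + "]");
        }
    }
}
```

The only stretch of exactly five characters between a word end and a word start is " sit ", so this prints `[ sit ]` once.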
Not completely satisfying, but you will stay compatible with this workaround.
Regards from Lake Constance, Germany,
Jens
[1]: http://www.regular-expressions.info/wordboundaries.html
--
Jens Erat
[phone]: tel:+49-151-56961126
[mail]: mailto:email@jenserat.de
[jabber]: xmpp:jabber@jenserat.de
[web]: http://www.jenserat.de
PGP: 350E D9B6 9ADC 2DED F5F2 8549 CBC2 613C D745 722B
On 07.01.2012 at 21:55, Christian Grün wrote:
> Dear Mark,
>
> I'm sorry to tell you that the boundary matcher \b is not officially
> supported by XQuery [1,2]; this is why it is not supported anymore by
> the latest version of BaseX. If you want to have this feature provided
> in XQuery 3.0 or a future version, you are invited to submit a small
> feature request in the W3 Issue Tracker [3].
>
> All the best,
> Christian
>
> [1] http://www.w3.org/TR/xpath-functions/#regex-syntax
> [2] http://www.w3.org/TR/xmlschema-2/#regexs
> [3] https://www.w3.org/Bugs/Public/
>
>
> On Sat, Jan 7, 2012 at 9:37 PM, Mark Boon <tesujisoftware@gmail.com> wrote:
>> It's been a while since I did anything with BaseX. The last version I used
>> was BaseX 6.5. Today I downloaded the latest version 7.0.2 and ran into a
>> problem when trying to revive an old project. I get messages saying
>> " [FORX0002] Invalid escape character: '\b'". Since the use of \b is rather
>> essential to my project I had little choice but to investigate, so I
>> retrieved the BaseX source code to see if I could determine the problem. I
>> found the following in RegEx.java, line 88:
>>
>> if("0123456789cCdDniIrsStwW|.-^$?*+{}()[]\\".indexOf(c) == -1)
>>   REGESC.thrw(input, c);
>>
>> I believe that's a mistake and should include 'b' and read:
>>
>> if("0123456789bcCdDniIrsStwW|.-^$?*+{}()[]\\".indexOf(c) == -1)
>>   REGESC.thrw(input, c);
>>
>> When I make that modification my program works again as before. There may be
>> more missing? I'm not especially a RegExp expert. And I'd like to make sure
>> this modification didn't inadvertently break something else that I just
>> haven't noticed yet.
>>
>> Mark Boon
>>
>>
>>
>>
>> _______________________________________________
>> BaseX-Talk mailing list
>> BaseX-Talk@mailman.uni-konstanz.de
>> https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk
>>