It's been a while since I did anything with BaseX. The last version I used was BaseX 6.5. Today I downloaded the latest version, 7.0.2, and ran into a problem when trying to revive an old project. I get messages saying "[FORX0002] Invalid escape character: '\b'". Since the use of \b is rather essential to my project, I had little choice but to investigate, so I retrieved the BaseX source code to see if I could determine the problem. I found the following in RegEx.java, line 88:
if("0123456789cCdDniIrsStwW|.-^$?*+{}()[]\\".indexOf(c) == -1) REGESC.thrw(input, c);
I believe that's a mistake and should include 'b' and read:
if("0123456789bcCdDniIrsStwW|.-^$?*+{}()[]\\".indexOf(c) == -1) REGESC.thrw(input, c);
When I make that modification, my program works again as before. Are there perhaps more missing? I'm not exactly a regex expert. I'd also like to make sure this modification didn't inadvertently break something else that I just haven't noticed yet.
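To make the effect of the one-character change concrete, here is a minimal Java sketch of the same kind of whitelist check. It is not the BaseX code: the class name `EscapeCheckSketch`, the constants, and the helper `isAllowedEscape` are made up for illustration, and the trailing `\\` in both lists is an assumption about how the quoted string ends in the real source.

```java
final class EscapeCheckSketch {
  // Escape characters as quoted from RegEx.java (no 'b').
  // Assumption: the list ends with an escaped backslash.
  static final String ORIGINAL = "0123456789cCdDniIrsStwW|.-^$?*+{}()[]\\";

  // The same list with 'b' added, as in the proposed modification.
  static final String PATCHED = "0123456789bcCdDniIrsStwW|.-^$?*+{}()[]\\";

  /** Returns true if the character after a backslash is in the given list. */
  static boolean isAllowedEscape(final String allowed, final char c) {
    return allowed.indexOf(c) != -1;
  }

  public static void main(final String[] args) {
    // Compare how both lists treat a few escape characters.
    for (final char c : new char[] {'b', 'd', 'w', 'z'}) {
      System.out.printf("\\%c original=%b patched=%b%n", c,
          isAllowedEscape(ORIGINAL, c), isAllowedEscape(PATCHED, c));
    }
  }
}
```

The only escape whose treatment differs between the two lists is `\b`. The sketch only covers the check itself; it says nothing about what happens to an accepted pattern afterwards.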
Mark Boon
Dear Mark,
I'm sorry to tell you that the boundary matcher \b is not officially supported by XQuery [1,2]; this is why it is no longer supported by the latest version of BaseX. If you want to have this feature provided in XQuery 3.0 or a future version, you are invited to submit a small feature request in the W3C issue tracker [3].
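For readers unfamiliar with the boundary matcher, the following sketch shows what \b does in Java's own `java.util.regex` engine. It illustrates Java behaviour only, not XQuery; the class `JavaWordBoundarySketch` and the helper `containsWord` are hypothetical names chosen for this example.

```java
import java.util.regex.Pattern;

final class JavaWordBoundarySketch {
  /** Returns true if 'word' occurs in 'text' delimited by word boundaries. */
  static boolean containsWord(final String text, final String word) {
    // Pattern.quote keeps the word literal; \b asserts a boundary on each side.
    final Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b");
    return p.matcher(text).find();
  }

  public static void main(final String[] args) {
    // "sit" stands alone in the first text, but is buried in "website".
    System.out.println(containsWord("Lorem ipsum dolor sit amet", "sit"));
    System.out.println(containsWord("a website", "sit"));
  }
}
```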
All the best, Christian
[1] http://www.w3.org/TR/xpath-functions/#regex-syntax
[2] http://www.w3.org/TR/xmlschema-2/#regexs
[3] https://www.w3.org/Bugs/Public/
On Sat, Jan 7, 2012 at 9:37 PM, Mark Boon tesujisoftware@gmail.com wrote: [...]
BaseX-Talk mailing list BaseX-Talk@mailman.uni-konstanz.de https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk
Hi Mark,
You should be able to build your own word boundary matcher using lookahead/lookbehind.
import module namespace functx = "http://www.functx.com" at "/Users/jenserat/Downloads/functx-1.0-nodoc-2007-01.xq";

let $string := "Lorem ipsum dolor sit amet"
let $m := "(?<!\w)(?=\w)" (: \m, matches start of a word :)
let $M := "(?<=\w)(?!\w)" (: \M, matches end of a word :)
let $b := string-join(("(", $m, "|", $M, ")")) (: \b, matches both :)
let $pattern := string-join(($M, ".{5}", $m))
return
  <result>
    <does-match>{matches($string, $pattern)}</does-match>
    <matches>{
      for $match in functx:get-matches($string, $pattern)
      return <match>{$match}</match>
    }</matches>
  </result>
Fortunately, `\m` and `\M` were available ready-made at [regular-expressions.info][1], a great regex reference (though those regexes are not very complex once you know about lookahead and lookbehind). And `\b` shouldn't be anything other than `(\m|\M)`.
Unfortunately, there seems to be some issue in `functx:get-matches`, as there are inconsistencies between `matches` and the functx function. Try querying with `string-join(($M, ".{5}", $m))`, which should match " sit ".
Not completely satisfying, but you will stay compatible with this workaround.
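The claim that the two lookaround patterns together behave like \b can be checked outside XQuery. The sketch below does so with Java's `java.util.regex`, on the ASCII sample string from the query; the class `LookaroundBoundarySketch` and its helpers are hypothetical names, and the result says nothing about whether a given XQuery processor accepts lookaround constructs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LookaroundBoundarySketch {
  // Start of a word: no word character before, a word character after.
  static final String WORD_START = "(?<!\\w)(?=\\w)";
  // End of a word: a word character before, no word character after.
  static final String WORD_END = "(?<=\\w)(?!\\w)";
  // Either of the two, intended as a stand-in for \b.
  static final String BOUNDARY = "(?:" + WORD_START + "|" + WORD_END + ")";

  /** Collects the start offsets of all matches of 'regex' in 'text'. */
  static List<Integer> positions(final String regex, final String text) {
    final List<Integer> result = new ArrayList<>();
    final Matcher m = Pattern.compile(regex).matcher(text);
    while (m.find()) result.add(m.start());
    return result;
  }

  /** Returns the first match of 'regex' in 'text', or null if there is none. */
  static String firstMatch(final String regex, final String text) {
    final Matcher m = Pattern.compile(regex).matcher(text);
    return m.find() ? m.group() : null;
  }

  public static void main(final String[] args) {
    final String text = "Lorem ipsum dolor sit amet";
    // Both lines should list the same offsets for this ASCII input.
    System.out.println(positions(BOUNDARY, text));
    System.out.println(positions("\\b", text));
    // End of a word, five arbitrary characters, start of a word.
    System.out.println("[" + firstMatch(WORD_END + ".{5}" + WORD_START, text) + "]");
  }
}
```

With this input, the only stretch of exactly five characters that runs from the end of one word to the start of the next is " sit ", which is the match the query above is expected to produce.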
Regards from Lake Constance, Germany, Jens
[1]: http://www.regular-expressions.info/wordboundaries.html
Hi Jens,
Thank you for the answer. I'm not sure I understand it, though; it seems more complicated than I'm willing to go through.
Instead I might follow Christian's suggestion and submit a request for \b to be added to the XQuery 3 specification. Until then I can always live with the small modification or fall back on an older version of BaseX.
Mark
On Sat, Jan 7, 2012 at 12:13 PM, Jens Erat jens.erat@uni-konstanz.de wrote: [...]
On 07.01.2012 at 21:55, Christian Grün wrote: [...]