Thanks to Christian Grün's prompt response to my question about attributes, I upgraded to BaseX 6.5.1 the other day. And I've run into an unexpected behavior.
I have several versions of the Unicode database in XML (the Unicode consortium started shipping XML versions with 5.1.0, and I've created XML documents with the information I need for all the earlier versions); they are all in directory ~/2011/Unicode.
But when I ask to create a new database in the GUI, giving that directory as the path and accepting the default pattern of *.xml, the only document in the resulting database appears to be the small schemas.xml file that nXML mode placed in one of the subdirectories of ~/2011/Unicode when I edited an XSLT stylesheet there.
What I was expecting was that all the XML documents in that subtree of the file directory would be added -- I think that was the behavior in earlier versions.
Has something changed? Or have I gotten out of practice and done something wrong?
I hope this is clear; I can try to explain further if needed.
Thanks!
On Mar 11, 2011, at 5:12 PM, C. M. Sperberg-McQueen wrote:
But when I ask to create a new database in the GUI, giving that directory as the path and accepting the default pattern of *.xml, the only document in the resulting database appears to be the small schemas.xml file that nXML mode placed in one of the subdirectories of ~/2011/Unicode when I edited an XSLT stylesheet there.
I should perhaps say that none of the various XML files in the subtree are actually in the Unicode directory. They are all in subdirectories, as the following reformatted shell log illustrates:
$ pwd
/Users/cmsmcq/2011/Unicode
$ ls *.xml */*.xml
ls: *.xml: No such file or directory
2.0.0/unicode.blocks.xml
2.1.9/unicode.blocks.xml
3.0.0/unicode.blocks.xml
3.1.0/unicode.blocks.xml
3.2.0/unicode.blocks.xml
4.0.0/unicode.blocks.xml
4.0.1/unicode.blocks.xml
4.1.0/unicode.blocks.xml
5.0.0/unicode.blocks.xml
5.1.0/ucd.nounihan.grouped.xml
5.2.0/ucd.nounihan.grouped.xml
6.0.0/ucd.nounihan.grouped.xml
bin/schemas.xml
$
The one that shows up in the collection is bin/schemas.xml, which is (coincidentally?) the last one in the list produced by ls.
I can't tell whether the other documents are being processed or not: BaseX indexing is too fast for me to tell whether it's actually handling a document or skipping it :).
Seems to be related to the file name containing more than one dot: ucd.xml will be inserted during db creation while ucd.nounihan.grouped.xml won't. Same (non-) effect when adding the directory later in the GUI (through "Database > Add documents..."). But when you point to the file ucd.nounihan.grouped.xml instead of to the directory, it will be imported.
Gerrit
Trying to understand the arcane regex construction routine in https://github.com/BaseXdb/basex/blob/master/src/main/java/org/basex/io/IOFi..., public static String regex(final String filter)
Suppose filter is '*.xml'
glob is '*.xml'
sb.length() is 0 initially
then, because ch = glob.charAt(0) == "*":
  sb.append("[^.]") => sb: '[^.]'
  sb.append(ch) => sb: '[^.]*'
glob.charAt(1) == ".":
  suf = true
  sb.append('\\') => sb: '[^.]*\' (will '\\' really append a single '\'?)
  sb.append(ch) => sb: '[^.]*\.'
'x', 'm', and 'l' will simply be appended:
  => sb: '[^.]*\.xml'
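The steps traced above can be replayed in a few lines of Java. This is only a sketch reconstructed from the trace, not the BaseX source: the class name GlobTrace and the method toRegex are hypothetical, and only the three cases mentioned in the trace (a star, a dot, any other character) are handled. It also answers the question in parentheses: the Java character literal '\\' is a single backslash.

```java
// Hypothetical reconstruction of the traced steps; NOT the BaseX implementation.
final class GlobTrace {
    /** Converts a glob to a regex string following the three traced cases. */
    static String toRegex(final String glob) {
        final StringBuilder sb = new StringBuilder();
        for (int i = 0; i < glob.length(); i++) {
            final char ch = glob.charAt(i);
            if (ch == '*') {
                // a star becomes "any run of non-dot characters"
                sb.append("[^.]").append(ch);
            } else if (ch == '.') {
                // '\\' in Java source denotes ONE backslash character
                sb.append('\\').append(ch);
            } else {
                // all other characters are appended unchanged
                sb.append(ch);
            }
        }
        return sb.toString();
    }

    public static void main(final String[] args) {
        System.out.println(toRegex("*.xml")); // prints [^.]*\.xml
    }
}
```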
Then, in https://github.com/BaseXdb/basex/blob/master/src/main/java/org/basex/build/x..., line 53: filter = !path.isDir() ? null : Pattern.compile(IOFile.regex(pr.get(Prop.CREATEFILTER)));
filter is a java.util.regex.Pattern, and it will be matched against the file name (without directory parts) of each candidate resource.
I think the string 'ucd.nounihan.grouped.xml' should match the regex '[^.]*\.xml', but obviously it doesn't.
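One possible explanation, offered as an assumption since the thread does not show how the compiled Pattern is applied: in java.util.regex, Matcher.matches() succeeds only if the entire input matches, while Matcher.find() succeeds if any substring matches. The sketch below (class and method names are made up for illustration) shows that the same pattern accepts or rejects the multi-dot name depending on which call is used.

```java
import java.util.regex.Pattern;

// Illustration only; whether BaseX calls matches() or find() is an assumption here.
final class MatchVsFind {
    static final Pattern FILTER = Pattern.compile("[^.]*\\.xml");

    /** True if the whole file name matches the pattern. */
    static boolean wholeName(final String name) {
        return FILTER.matcher(name).matches();
    }

    /** True if some part of the file name matches the pattern. */
    static boolean anywhere(final String name) {
        return FILTER.matcher(name).find();
    }

    public static void main(final String[] args) {
        System.out.println(wholeName("ucd.xml"));                  // true
        System.out.println(wholeName("ucd.nounihan.grouped.xml")); // false
        System.out.println(anywhere("ucd.nounihan.grouped.xml"));  // true
        System.out.println(anywhere("somename.xml.svn-base"));     // true
    }
}
```

With whole-string matching the pattern behaves as if it were written with ^ and $ around it, even though no anchors appear in the regex string.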
I once (with 6.5) encountered another undesired behaviour: files named somename.xml.svn-base were indexed, too. This behaviour is absent in 6.5.1. So it seems as if the regex in 6.5.1 is anchored at the beginning and the end of the string: '^[^.]*\.xml$'. But this doesn't become obvious from looking at the code. Maybe the team can clarify. The desired regex looks like '[^.]*\.xml$', that is, it should match 'ucd.nounihan.grouped.xml' but not 'somename.xml.svn-base'. '[^.]*\.xml$' may be shortened to '\.xml$'.
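A quick check of the desired behaviour, as a sketch with hypothetical names: the shortened pattern only works when it is searched for with find(), because it is anchored at the end but not at the start. For a plain suffix test, String.endsWith gives the same answer without a regex.

```java
import java.util.regex.Pattern;

// Sketch of the desired filter; SuffixFilter and its methods are illustrative names.
final class SuffixFilter {
    // literal dot, then "xml", then end of input
    static final Pattern XML_SUFFIX = Pattern.compile("\\.xml$");

    /** Regex variant: must be applied with find(), not matches(). */
    static boolean byRegex(final String name) {
        return XML_SUFFIX.matcher(name).find();
    }

    /** Equivalent plain-string variant. */
    static boolean bySuffix(final String name) {
        return name.endsWith(".xml");
    }

    public static void main(final String[] args) {
        System.out.println(byRegex("ucd.nounihan.grouped.xml")); // true
        System.out.println(byRegex("somename.xml.svn-base"));    // false
        System.out.println(bySuffix("ucd.nounihan.grouped.xml")); // true
        System.out.println(bySuffix("somename.xml.svn-base"));    // false
    }
}
```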
Gerrit
Dear Michael, and thanks Gerrit,
I agree, the current glob syntax conversion is still unsatisfactory. Unfortunately, I haven't come across a standard globbing algorithm that fulfills all of our needs, which is why the current implementation in BaseX is the result of various user requests from both the Linux and Windows world. Some details on the differences before and after Version 6.5: The following glob syntax
*.
now returns all file names without suffixes; it is internally rewritten to the following regex:
'^[^.]*$'
As Gerrit noticed, the rewritten regex for *.xml looks like
'^[^.]*\.xml$'
This syntax disallows any dots other than the single suffix dot.
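To see the effect on the directory from the original report, here is a small sketch. It assumes the intended pattern is ^[^.]*\.xml$ (non-dot characters, one literal dot, then xml), which is what "disallows any dots other than the single suffix dot" implies, and that the pattern is tested against the file name without its directory part. The class and method names are hypothetical.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Sketch under the assumptions stated above; not BaseX code.
final class ListingFilter {
    static final Pattern SINGLE_DOT = Pattern.compile("^[^.]*\\.xml$");

    /** Keeps the paths whose last segment (the file name) matches the pattern. */
    static List<String> accepted(final List<String> paths) {
        return paths.stream()
            .filter(p -> SINGLE_DOT.matcher(p.substring(p.lastIndexOf('/') + 1)).matches())
            .collect(Collectors.toList());
    }

    public static void main(final String[] args) {
        // a sample of the paths from the shell log earlier in the thread
        final List<String> paths = List.of(
            "2.0.0/unicode.blocks.xml",
            "5.1.0/ucd.nounihan.grouped.xml",
            "bin/schemas.xml");
        System.out.println(accepted(paths)); // prints [bin/schemas.xml]
    }
}
```

Only schemas.xml has a single dot in its name, which is consistent with it being the one document that ended up in the database.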
As a temporary solution, I'd recommend trying no filter at all (*), including all necessary dots (*.*.*.xml), or using several runs to get all files into the database.
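The "several runs" workaround can be pictured as follows. This sketch assumes that every star is translated to a run of non-dot characters and every dot to a literal dot; the thread confirms this only for the leading star, so treat the translation (and the names SeveralRuns, toRegex, picked) as hypothetical.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Sketch of the workaround under an ASSUMED glob translation; not BaseX code.
final class SeveralRuns {
    /** Assumed translation: '.' becomes a literal dot, '*' becomes non-dot characters. */
    static String toRegex(final String glob) {
        return glob.replace(".", "\\.").replace("*", "[^.]*");
    }

    /** File names that a run with the given glob would pick up. */
    static List<String> picked(final String glob, final List<String> names) {
        final Pattern p = Pattern.compile(toRegex(glob));
        return names.stream()
            .filter(n -> p.matcher(n).matches())
            .collect(Collectors.toList());
    }

    public static void main(final String[] args) {
        final List<String> names = List.of(
            "schemas.xml", "unicode.blocks.xml", "ucd.nounihan.grouped.xml");
        System.out.println(picked("*.xml", names));     // [schemas.xml]
        System.out.println(picked("*.*.xml", names));   // [unicode.blocks.xml]
        System.out.println(picked("*.*.*.xml", names)); // [ucd.nounihan.grouped.xml]
    }
}
```

Under that assumption, one run per dot count covers every file name in the directory tree.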
Sorry about that; we'll work on a better solution (see also: https://github.com/BaseXdb/basex/issues/41).
Best, Christian
On Mar 12, 2011, at 10:36 AM, Christian Grün wrote:
Dear Michael, and thanks Gerrit,
.... Some details on the differences before and after Version 6.5: The following glob syntax
*.
now returns all file names without suffixes; it is internally rewritten to the following regex:
'^[^.]*$'
As Gerrit noticed, the rewritten regex for *.xml looks like
'^[^.]*\.xml$'
Ah, ok. That helps.
This syntax disallows any dots other than the single suffix dot.
As a temporary solution, I'd recommend to try no filter at all (*), include all necessary dots (*.*.*.xml), or use several runs to get all files into the database.
I can live with that. I'm reassured to learn that the difference in behavior is not my imagination, and that there's a way to get what I need.
Thank you for the clarification.
Dear Michael,
I guess it took more than an eternity this time due to other pressing issues, but I just wanted to let you know that I've fixed the glob parser for finding files with more than one dot in the filename…
https://github.com/BaseXdb/basex/issues/41 http://files.basex.org/releases/latest/
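For comparison, the glob matcher that ships with the JDK already treats dots in a file name as ordinary characters. The sketch below uses java.nio.file.PathMatcher; it is not the fix referenced above and says nothing about how BaseX implements it. The class name NioGlob is made up for illustration.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

// JDK glob matching, shown only for comparison; NOT the BaseX glob parser.
final class NioGlob {
    static final PathMatcher XML =
        FileSystems.getDefault().getPathMatcher("glob:*.xml");

    /** True if the bare file name matches the glob *.xml. */
    static boolean accepts(final String fileName) {
        return XML.matches(Paths.get(fileName));
    }

    public static void main(final String[] args) {
        System.out.println(accepts("schemas.xml"));              // true
        System.out.println(accepts("ucd.nounihan.grouped.xml")); // true
        System.out.println(accepts("somename.xml.svn-base"));    // false
    }
}
```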
All the best, Christian ___________________________
Christian Grün Uni KN, Box 188 78457 Konstanz, Germany http://www.inf.uni-konstanz.de/~gruen
basex-talk@mailman.uni-konstanz.de