Dear Michael,
Thanks for the patched 6.5 jar; it helped with the diagnostics.
But of course it was my fault:
Despite my conviction that I had thoroughly checked every path in every file, the path given in the @rewritePrefix of the catalog file entry was wrong. And instead of complaining that the resolution specified in the catalog file wasn't feasible, the resolver silently resolved to the original URL. This seems to be standards-compliant behavior: “If the processor attempts to load a resource and fails (because the resource does not exist or is not reachable, for example), it must recover by ignoring the catalog entry file that failed and proceeding.” [1]
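Since the resolver falls back silently, a small check of the catalog before importing can catch a wrong @rewritePrefix early. The following Python sketch is not part of BaseX; the function name broken_rewrites and the sample catalog entries are made up for illustration. It assumes that relative rewritePrefix values are resolved against the location of the catalog file, ignores xml:base, and only checks targets on the local file system:

```python
import os
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path
from urllib.parse import urljoin, urlparse
from urllib.request import url2pathname

# Namespace of OASIS XML Catalogs.
CATALOG_NS = "urn:oasis:names:tc:entity:xmlns:xml:catalog"


def broken_rewrites(catalog_path):
    """Return (start_string, target) pairs whose local rewrite target is missing."""
    catalog_path = Path(catalog_path).resolve()
    base = catalog_path.as_uri()
    broken = []
    root = ET.parse(catalog_path).getroot()
    for tag, start_attr in (("rewriteSystem", "systemIdStartString"),
                            ("rewriteURI", "uriStartString")):
        for entry in root.iter(f"{{{CATALOG_NS}}}{tag}"):
            prefix = entry.get("rewritePrefix")
            if prefix is None:
                continue
            # Assumption: relative prefixes are relative to the catalog file
            # (xml:base is ignored in this sketch).
            target = urljoin(base, prefix)
            parsed = urlparse(target)
            if parsed.scheme != "file":
                continue  # remote targets are not checked here
            if not os.path.exists(url2pathname(parsed.path)):
                broken.append((entry.get(start_attr), target))
    return broken


# Hypothetical catalog: one entry points to an existing folder, one does not.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "dtds").mkdir()
    catalog = Path(tmp) / "catalog.xml"
    catalog.write_text(f"""<catalog xmlns="{CATALOG_NS}">
  <rewriteSystem systemIdStartString="http://www.w3.org/TR/xhtml1/DTD/"
                 rewritePrefix="dtds/"/>
  <rewriteSystem systemIdStartString="http://example.org/dtd/"
                 rewritePrefix="missing/"/>
</catalog>""", encoding="utf-8")
    result = broken_rewrites(catalog)
    for start, target in result:
        print(f"{start} -> {target} (target does not exist)")
```

Only the second entry is reported here, because its target folder was never created. A check like this would not replace the resolver's own diagnostics, but it flags exactly the kind of mistake described above.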
But there are some things I learned:
- When importing XHTML files, using a catalog resolver will speed things up significantly: without a catalog, it took approx. 2 minutes to import a test file of 87 kB; with a catalog resolver, it took approx. a quarter of a second.
- It didn’t matter much whether I used the internal (approx 245 ms) or the Apache XML Commons resolver (approx 260 ms).
- The XHTML file doesn’t need to be valid, just well-formed (DTD is only used for entity resolution).
- With the internal parser (nb, both parser and resolver may be internal or external), the time is reduced to approx. 100 ms.
- But then the entities won’t be resolved (maybe some people even prefer this, we don’t). If you want to resolve them using the internal parser, you’ll have to do SET ENTITY ON and SET DTD ON, upon which the parser complains: “Error: "xhtml1-strict.dtd" (Line 28): "xhtml-lat1.ent" could not be parsed.” This message appeared no matter what the CATFILE setting was.
⇒ If you have to deal with data that a) contains a DOCTYPE declaration (we are speaking of XML data here, so this means there is a system identifier in the DOCTYPE declaration that the parser will try to access) and b) if you want to resolve the entities (to Unicode characters or whatever is in the entity), then you should consider mirroring the DTD and using a catalog file.
- An observation that was not self-evident to me: the paths to both the files to be added and the catalog file (CATFILE option) may be given relative to the *server’s* working directory.
- As a corollary, if you plan to use relative paths in your commands: always start your server from the same working directory.
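To illustrate the last two points: a relative path only has a meaning together with the working directory of the process that interprets it. The following Python sketch shows the effect in general terms; it does not call BaseX, and the directories and the helper name resolve_against are made up for illustration:

```python
from pathlib import PurePosixPath


def resolve_against(working_dir, path):
    """Resolve `path` the way a process started in `working_dir` would."""
    p = PurePosixPath(path)
    # Absolute paths are independent of the working directory.
    return p if p.is_absolute() else PurePosixPath(working_dir) / p


# The same relative CATFILE value, seen from two hypothetical server start directories.
for cwd in ("/home/gerrit/project", "/opt/basex"):
    print(cwd, "->", resolve_against(cwd, "cats/catalog.xml"))
```

The same relative value ends up pointing to two different files, which is why starting the server from a fixed directory (or using absolute paths) avoids surprises.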
That’s it for today. Thanks again, Michael, for your quick and really useful response on a Sunday afternoon.
-Gerrit
[1] http://www.oasis-open.org/committees/entity/spec-2001-08-06.html#s.res.fail
On 23.01.2011 20:07, Michael Seiferle wrote:
Hi Gerrit,
glad you can finally put it to use; quite sad it didn't work in the first place.
As you assumed, the resolver is not invoked on your side at all... I guess you already turned off the internal parser, at least using the GUI this should happen automatically.
To further investigate this issue:
Are you using the internal (com.sun... which might be missing on your machine) or an external [1] resolver? The external resolver has precedence over the internal package if found in the classpath. I attached a modified JAR that outputs the class of the used resolver [2].
It should output something along the lines of:
add /Users/michael/Desktop/w3c.html
*Using com.sun.org.apache.xml.internal.resolver.CatalogManager@29e97f9f for parsing*...
to System.out, so you might want to use it in "local" mode (GUI or BaseXConsole). I used a catalog file similar to yours, simply updated the paths to the local files. It works and suffers from being unable to resolve linked files:
/Users/michael/tmp/cats/dtds/xhtml-lat1.ent (No such file or directory)
To use an external resolver you have to add resolver.jar to your classpath:
$ java -cp basex-6.5.jar:xml-commons-resolver-1.2/resolver.jar org.basex.BaseX
INTPARSE: OFF
SET CATFILE /Users/michael/tmp/cats/catalog.xml
CATFILE: /Users/michael/tmp/cats/catalog.xml
open TEST
Database 'TEST' opened in 364.32 ms.
add /Users/michael/Desktop/w3c.html
*Using org.apache.xml.resolver.CatalogManager@7290cb03 for parsing*... /Users/michael/tmp/cats/dtds/xhtml-lat1.ent (No such file or directory)
Could you please check the output the modified jar generates? (Maybe using [1] renders the error report obsolete anyway.)
Kind regards
Michael
P.S. You may skip creating that properties file, we set some default options while initializing the CatalogResolver in org.basex.build.xml.CatalogResolverWrapper.java
P.P.S. Adding a little how-to to our Wiki would be great! Besides your input I have no further experience with Catalog Resolving; suggestions or issues you think that might be of general interest are very welcome. Article stubs even more, I will start one tomorrow :-)
[1] http://apache.myamplifiers.com//xml/commons/xml-commons-resolver-1.2.tar.gz
[2] http://dl.dropbox.com/u/603903/basex-6.5-rslv.jar
On 23.01.2011 14:42, Imsieke, Gerrit, le-tex wrote:
Dear Team (especially Michael),
I'm trying to use an XML catalog during import (for the first time admittedly), but the catalog is obviously not being used.
It works on my machine using the following approach (internal parser is disabled):
BaseX 6.5 [Standalone] Try "help" to get more information.
GET INTPARSE
INTPARSE: false
SET CATFILE /Users/michael/tmp/cats/catalog.xml
CATFILE: /Users/michael/tmp/cats/catalog.xml
GET CATFILE
CATFILE: /Users/michael/tmp/cats/catalog.xml
CREATE DB test
Database 'test' created in 367.64 ms.
open test
Database 'test' opened in 3.17 ms.
add /Users/michael/Desktop/w3c.html
/Users/michael/tmp/cats/dtds/xhtml-lat1.ent (No such file or directory)
--
Gerrit Imsieke
Geschäftsführer / Managing Director
le-tex publishing services GmbH
Weissenfelser Str. 84, 04229 Leipzig, Germany
Phone +49 341 355356 110, Fax +49 341 355356 510
gerrit.imsieke@le-tex.de, http://www.le-tex.de
Registergericht / Commercial Register: Amtsgericht Leipzig
Registernummer / Registration Number: HRB 24930
Geschäftsführer: Gerrit Imsieke, Svea Jelonek, Thomas Schmidt, Dr. Reinhard Vöckler