Using the catalog feature does not automatically use the grammar cache—that has to be explicitly enabled as part of the parser configuration—it’s something I’ve done in the past for, e.g., the DITA Open Toolkit.
I tried to do it for the BaseX parser but ran into a roadblock and didn’t have the time or motivation to push on it harder, since I already had code that can supply the DITA-defined attribute default values in XQuery code when we need them.
For comparison: with our 60K DITA doc set, it takes about 2 minutes to load without DTD processing and about 2 hours with it, because of the DTD processing overhead. With a grammar cache implemented it would probably take less than 3 minutes to load everything with DTDs.
In my case, it was easier to just not use DTDs than to fix the underlying Java code. I suspect that for the vast majority of BaseX users, DTDs are either not an option at all or their DTDs are not the monsters that the DITA DTDs are.
For incoming docs, if you’re turning DTD processing off you still have to strip out the DOCTYPE declarations, as the parser is otherwise still obligated to resolve entity references—which is part of my motivation for a pre-processor.
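Such a pre-processor can be quite small. The sketch below (a hypothetical helper, not Eliot’s actual code) strips a DOCTYPE declaration, including a simple internal subset, before the document reaches a non-DTD-aware parser. The regex handles one level of `[...]` internal subset, which covers the common case; truly robust stripping needs a real lexer.

```python
import re

# Matches <!DOCTYPE ...> with an optional single-level [...] internal subset.
DOCTYPE_RE = re.compile(r"<!DOCTYPE[^>\[]*(\[[^\]]*\])?[^>]*>", re.DOTALL)

def strip_doctype(xml_text: str) -> str:
    """Remove the DOCTYPE declaration so no entity definitions survive."""
    return DOCTYPE_RE.sub("", xml_text, count=1)
```

With the declaration gone, a parser configured without DTD support never has entity references it is obligated to resolve from the internal subset.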
The latest versions of libxml2 and the Python lxml library have very strict controls, making it safe to use them to sanitize incoming docs. I’m not sure how Java parsers compare because I haven’t had to worry about it in a Java context (it’s actually a problem for us that Python’s lxml and libxml2 are so strict, because the DITA DTDs exceed the default limits and can’t be processed with lxml after v4.9.4 ☹, which is why I’m familiar with their implementation of entity expansion limits).
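A cheap pre-check along these lines can also be done with nothing but the Python standard library—this is a sketch of the general idea, not lxml’s mechanism: expat fires its entity-declaration handler before any expansion happens, so a sanitizer can reject documents that declare entities at all.

```python
import xml.parsers.expat

class EntityDeclFound(Exception):
    """Raised as soon as the parser sees any entity declaration."""

def has_entity_decls(xml_text: str) -> bool:
    """Return True if the document declares entities (a cheap
    billion-laughs pre-check using only the standard library)."""
    parser = xml.parsers.expat.ParserCreate()

    def on_entity_decl(name, is_param, value, base,
                       system_id, public_id, notation_name):
        # Declaration seen: abort parsing before anything can expand.
        raise EntityDeclFound(name)

    parser.EntityDeclHandler = on_entity_decl
    try:
        parser.Parse(xml_text, True)
    except EntityDeclFound:
        return True
    return False
```

Rejecting (or routing to quarantine) any document for which this returns True is a blunter policy than configurable expansion limits, but for a service that never expects DTDs it is also simpler to reason about.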
Cheers,
E.
_____________________________________________
Eliot Kimber
Sr. Staff Content Engineer
O: 512 554 9368
servicenow
servicenow.com
LinkedIn | X | YouTube | Instagram
From:
Christian Grün <christian.gruen@gmail.com>
Date: Friday, March 14, 2025 at 9:39 AM
To: Nico Verwer (Rakensi) <nverwer@rakensi.com>
Cc: basex-talk@mailman.uni-konstanz.de <basex-talk@mailman.uni-konstanz.de>
Subject: [basex-talk] Re: Protecting against XML vulnerabilities
> Is there a way to set parser properties like `jdk.xml.entityExpansionLimit` in BaseX?
By default, more recent versions of the JDK have static entity expansion limits. Maybe those are not strict enough? Do you have an example at hand that causes problems?
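(For anyone wanting to tighten those limits explicitly, the standard JAXP processing-limit properties can in principle be passed to the JVM that runs BaseX. A sketch, assuming the BaseX start scripts honour the `BASEX_JVM` environment variable—verify the variable name and property names against your BaseX and JDK versions:)

```shell
# JAXP limits are plain JVM system properties; 0 would mean "no limit".
export BASEX_JVM="-Djdk.xml.entityExpansionLimit=1000 -Djdk.xml.totalEntitySizeLimit=100000"
basexserver
```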
> I am using the internal parser with the DTD option set to false, but this is still vulnerable to the one billion laughs attack.
Thanks for the hint. I have improved the entity expansion checks in our internal XML parser [1]. If you find an example that will not be caught by our (very simple) heuristics, feel free to share
it with us.
I agree with Eliot that it can be hazardous to process arbitrary external contents (you are probably aware of that, too). Good firewall/proxy settings may be able to tackle some of the issues
that will not be handled during XML parsing.
And @Eliot, with regard to caching: Have you played around with the XML Catalog feature?
Thank you, Eliot Kimber, for your response:
> These vulnerabilities are only an issue if you allow untrusted users to supply XML documents with DTDs.
My application will be open to the outer world, so there will be untrusted users. We do not use DTDs, but DTDs are just one vulnerability.
> [...] pre-parse them before supplying them to BaseX,
> My solution is to simply not use DTD-aware parsing, [...]
I am using the internal parser with the DTD option set to false, but this is still vulnerable to the one billion laughs attack.
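(For reference, the classic billion laughs payload needs only a handful of lines: each entity expands to ten copies of the previous one, so nine levels of tenfold expansion yield roughly 10^9 copies of the innermost string from a tiny input document.)

```xml
<?xml version="1.0"?>
<!DOCTYPE lolz [
  <!ENTITY lol "lol">
  <!ENTITY lol1 "&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;">
  <!ENTITY lol2 "&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;">
  <!-- ... further levels up to ... -->
  <!ENTITY lol9 "&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;">
]>
<lolz>&lol9;</lolz>
```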
My next action will be to try to install my own parser into BaseX, which will be an interesting exercise...