I’d like to add some more info on why we initially decided to chop whitespace, and why a sudden change of the default value may break existing applications (if you know the details, simply skip this section):
Many XML documents contain whitespace-only text nodes for properly indenting elements. In highly structured data (i.e., when not working with mixed content), these nodes are in fact completely irrelevant. For example, if the following document…
<xml> <a>X</a> </xml>
…is parsed with CHOP set to true, we will get a document with a single text node. The following query…
for $t in //text() return replace node $t with 'x'
…will generate the following result:
<xml> <a>x</a> </xml>
If we set CHOP to false, the document will have three text nodes, two of them whitespace-only, and the same query will create the following result document:
<xml>x<a>x</a>x</xml>
This is just one example to demonstrate that a sudden change of the default for CHOP would most probably lead to unwanted side effects in existing applications. Another side effect: databases are expected to increase in size, as all whitespace nodes will get their own node ids, will be fully stored and indexed, etc.
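For readers who want to reproduce the two results outside of BaseX, here is a rough Python sketch. It is only an emulation, not BaseX code: the helper names (text_nodes, parse, replace_all_text) are made up for this illustration, and the chop step is a simplified assumption (trim each text node and drop it if nothing is left). The output is printed without indentation.

```python
from xml.dom import minidom

def text_nodes(node):
    """Collect all text nodes below `node` in document order."""
    found = []
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            found.append(child)
        else:
            found.extend(text_nodes(child))
    return found

def parse(xml, chop):
    """Parse `xml`; with chop=True, trim text nodes and drop empty ones.

    This only approximates the CHOP option (assumption for illustration).
    """
    doc = minidom.parseString(xml)
    if chop:
        for t in text_nodes(doc):
            t.data = t.data.strip()
            if not t.data:
                t.parentNode.removeChild(t)
    return doc

def replace_all_text(doc, value):
    """Mimic: for $t in //text() return replace node $t with value."""
    for t in text_nodes(doc):
        t.data = value
    return doc.documentElement.toxml()

SOURCE = "<xml> <a>X</a> </xml>"

chopped = parse(SOURCE, chop=True)
kept = parse(SOURCE, chop=False)

print(len(text_nodes(chopped)), replace_all_text(chopped, "x"))
# 1 <xml><a>x</a></xml>
print(len(text_nodes(kept)), replace_all_text(kept, "x"))
# 3 <xml>x<a>x</a>x</xml>
```

The same query touches one node in the first case and three in the second, which is the kind of difference an existing application would suddenly see.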
However, I completely agree that the removal of whitespace may lead to serious changes in mixed content, and I readily admit that we weren’t aware of all the implications some years ago when we started designing the database. While I still believe that our storage copes pretty well with today’s requirements, I would love to have some weeks off to completely rebuild it and include optimizations for all kinds of features that are relevant today (including larger ranges for node ids and namespaces, or support for other tree formats such as JSON).
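To make the mixed-content problem mentioned above concrete, here is a second small Python sketch. Again, this is an emulation under the same simplified assumption about chopping (trim text nodes, drop empty ones), and the sample document and the helper names string_value and chop are invented for this illustration.

```python
from xml.dom import minidom

def string_value(node):
    """Concatenate all descendant text, similar to XPath string()."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(string_value(c) for c in node.childNodes)

def chop(node):
    """Trim every text node and drop those that end up empty (simplified)."""
    for child in list(node.childNodes):
        if child.nodeType == child.TEXT_NODE:
            child.data = child.data.strip()
            if not child.data:
                node.removeChild(child)
        else:
            chop(child)

# Hypothetical mixed-content input: the space between </b> and <i> matters.
MIXED = "<p><b>Hello</b> <i>world</i></p>"

doc = minidom.parseString(MIXED)
before = string_value(doc.documentElement)
chop(doc.documentElement)
after = string_value(doc.documentElement)

print(repr(before))  # 'Hello world'
print(repr(after))   # 'Helloworld'
```

Here the whitespace-only node is the word separator, so dropping it changes the text itself and not just the layout.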
Thanks for reading,
Christian
___________________________
On Sat, Apr 13, 2013 at 8:28 AM, Liam R E Quin liam@w3.org wrote:
On Fri, 2013-04-05 at 11:31 +0200, Dirk Kirsten wrote:
So if you could point out some details as why this is not conforming behaviour, this would be interesting.
It's a requirement in the XML Spec that the XML parser pass all whitespace back to the application. Some whitespace may be marked as not significant - that is only possible if there's a DTD and the space is in a context where only elements would be valid, not #PCDATA. There's no formal specification, although constructing an XDM instance from an infoset, and constructing an infoset from XML, does not entail discarding these spaces: Chopping internal whitespace nodes in mixed content contexts is not sanctioned by any version of any XML specification, with any setting of xml:space. I think the onus would be on you to justify the non-standard behaviour.
On the other hand I can see its uses too. But I don't want it, and always turn it off with BaseX :-)
Best,
Liam
-- Liam Quin - XML Activity Lead, W3C, http://www.w3.org/People/Quin/ Pictures from old books: http://fromoldbooks.org/ Ankh: irc.sorcery.net irc.gnome.org freenode/#xml
BaseX-Talk mailing list BaseX-Talk@mailman.uni-konstanz.de https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk