Re: [basex-talk] whitespace around comments

13 Apr 2013


I’d like to add some more info on why we initially decided to chop
whitespace, and why a sudden change of the default value may break
existing applications (if you know the details, simply skip this
section):
Many XML documents contain whitespace-only text nodes for properly
indenting elements. In highly structured data (i.e., when not working
with mixed content), these nodes are in fact completely irrelevant.
For example, if the following document…
<xml>
  <a>X</a>
</xml>
…is parsed with CHOP set to true, we will get a document with a single
text node. The following query…
for $t in //text()
  return replace node $t with 'x'
…will generate the following result:
<xml>
  <a>x</a>
</xml>
If we set CHOP to false, the document will have three text nodes, two
of them whitespace-only, and the same query will create the following
result document:
<xml>x<a>x</a>x</xml>
This is just one example to demonstrate that a sudden change of the
default for CHOP would most probably lead to unwanted side effects in
existing applications. Another side effect: databases are expected to
increase in size, as all whitespace nodes will get their own node ids,
will be fully stored and indexed, etc.
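
The node counts in the example above can be reproduced outside BaseX. The following is only a minimal sketch in Python, using the standard library's xml.dom.minidom rather than BaseX itself; the helper chop() is a hypothetical stand-in for CHOP=true and simply drops whitespace-only text nodes after parsing. The serialized output is not re-indented here, so the chopped result appears on a single line.

```python
from xml.dom import minidom

SOURCE = "<xml>\n  <a>X</a>\n</xml>"


def text_nodes(node):
    """Return all text nodes below `node` in document order."""
    found = []
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            found.append(child)
        else:
            found.extend(text_nodes(child))
    return found


def chop(doc):
    """Hypothetical stand-in for CHOP=true: remove whitespace-only text nodes."""
    for t in text_nodes(doc):
        if not t.data.strip(" \t\r\n"):
            t.parentNode.removeChild(t)
    return doc


def replace_all_text(doc, value="x"):
    """Rough equivalent of: for $t in //text() return replace node $t with 'x'."""
    for t in text_nodes(doc):
        t.data = value
    return doc.documentElement.toxml()


def run(chopped):
    """Parse SOURCE, optionally chop it, and return (text node count, result)."""
    doc = minidom.parseString(SOURCE)
    if chopped:
        chop(doc)
    count = len(text_nodes(doc))
    return count, replace_all_text(doc)


if __name__ == "__main__":
    print(run(True))
    print(run(False))
```

With chopping, the sketch sees one text node; without it, three, and the replacement then also overwrites the two indentation nodes.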
However, I completely agree that the removal of whitespace may lead
to serious changes in mixed content, and I readily admit that we
weren’t aware of all the implications some years ago when we
started off designing the database. While I still believe that our
storage copes pretty well with today’s requirements, I would love to
have some weeks off to completely rebuild it, and include
optimizations for all kinds of features that are relevant today
(including larger ranges for node ids and namespaces, or support for
other tree formats such as JSON).
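
To make the mixed-content concern mentioned above concrete: in the sketch below, the only thing separating two inline elements is a whitespace-only text node, so dropping it changes the text itself. This is again plain Python with xml.dom.minidom, not BaseX; the sample document and the helper names are made up for illustration.

```python
from xml.dom import minidom

# Mixed content: the space between <b> and <i> is a whitespace-only text node.
MIXED = "<p><b>Hello</b> <i>world</i></p>"


def string_value(node):
    """Concatenate all descendant text, like the XPath string value."""
    parts = []
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            parts.append(child.data)
        else:
            parts.append(string_value(child))
    return "".join(parts)


def drop_whitespace_only(node):
    """Hypothetical chopping step: remove whitespace-only text nodes in place."""
    for child in list(node.childNodes):
        if child.nodeType == child.TEXT_NODE:
            if not child.data.strip(" \t\r\n"):
                node.removeChild(child)
        else:
            drop_whitespace_only(child)


def reading(chopped):
    """Return the text of <p>, with or without the chopping step."""
    doc = minidom.parseString(MIXED)
    if chopped:
        drop_whitespace_only(doc.documentElement)
    return string_value(doc.documentElement)


if __name__ == "__main__":
    print(repr(reading(False)))
    print(repr(reading(True)))
```

Without chopping the paragraph reads "Hello world"; with it, the two words run together.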
Thanks for reading,
Christian
___________________________
On Sat, Apr 13, 2013 at 8:28 AM, Liam R E Quin liam@w3.org wrote:
...
On Fri, 2013-04-05 at 11:31 +0200, Dirk Kirsten wrote:
...
So if you could point out some details as why this is not conforming
behaviour, this would be interesting.
It's a requirement in the XML Spec that the XML parser pass all
whitespace back to the application. Some whitespace may be marked as not
significant - that is only possible if there's a DTD and the space is in
a context where only elements would be valid, not #PCDATA. There's no
formal specification, although constructing an XDM instance from an
infoset, and constructing an infoset from XML, does not entail
discarding these spaces:
Chopping internal whitespace nodes in mixed content contexts is not
sanctioned by any version of any XML specification, with any setting of
xml:space. I think the onus would be on you to justify the non-standard
behaviour.
On the other hand I can see its uses too. But I don't want it, and
always turn it off with BaseX :-)
Best,
Liam
--
Liam Quin - XML Activity Lead, W3C, http://www.w3.org/People/Quin/
Pictures from old books: http://fromoldbooks.org/
Ankh: irc.sorcery.net irc.gnome.org freenode/#xml
