Dirk,
On 2013-04-05, Dirk Kirsten <dk@basex.org> wrote:
You are certainly right that with mixed content and the example you have given here, chopping does make a semantic difference. However, you can disable this behaviour so BaseX does what you want. So the only reason I see why one should change the default behaviour would be because the default is not conformant to some XML standard. However, I cannot find any specifics in the spec about which is the expected behaviour, so in my opinion BaseX is doing nothing wrong here.
Well, if you agree that chopping may alter the semantics of a document, wouldn't you agree that applying such a transformation *by default* is a bad idea?
With respect to the XML specification, section 2.10 "White Space Handling" says:
An XML processor MUST always pass all characters in a document that are not markup through to the application.
Yes, the spec is vague with respect to whitespace handling, and the existence of the xml:space attribute shows that different behaviors--including potentially corrupting ones--are possible. I would therefore interpret the spec to mean that by default all characters should be preserved, but that other behaviors are possible.
I see that this behaviour might be surprising for some users, but this might as well be the case if it were the other way round.
No, because their documents wouldn't be corrupted. You can easily remove all whitespace afterwards if you decide you don't need it, but once it's gone, it's gone and cannot be restored. That's the problem.
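To make this concrete, here is a small Python sketch. It is not BaseX code: the `chop` helper is my own hypothetical approximation of trimming leading and trailing whitespace from text nodes, and the two sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET

def chop(elem):
    """Hypothetical approximation of chopping: trim every text node in place."""
    for node in elem.iter():
        # .text is the text before the first child, .tail the text after the element;
        # a text node that is empty after trimming is dropped (set to None).
        node.text = (node.text or "").strip() or None
        node.tail = (node.tail or "").strip() or None
    return elem

def text_of(elem):
    """Concatenated character data of an element, in document order."""
    return "".join(elem.itertext())

# Two documents that differ only by one space between inline elements.
spaced = ET.fromstring("<p><b>Hello</b> <i>world</i></p>")
joined = ET.fromstring("<p><b>Hello</b><i>world</i></p>")

before = text_of(spaced)       # text content as authored
after = text_of(chop(spaced))  # text content after chopping

print(repr(before))  # 'Hello world'
print(repr(after))   # 'Helloworld'

# After chopping, the two documents serialize identically,
# so the original cannot be recovered from the chopped form.
print(ET.tostring(spaced) == ET.tostring(chop(joined)))  # True
```

Since two different inputs end up as the same stored document, no later processing step can tell which one was imported.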
Additionally, if we changed this now, it would break application code, and unless there is a good reason (i.e. BaseX is actually doing something wrong or non-compliant), I don't see why one should change the default.
Well, I'm not on a crusade or anything, so if you believe that it's a good idea to corrupt, by default, all documents containing mixed content on import, or if this behavior must be kept for compatibility, so be it. I just wanted to point out that whitespace chopping may, in fact, alter the semantics of documents--it's not as harmless as it may seem.
Best regards