Yes, you are certainly right. I think it was around 2007 when we started chopping whitespace by default, even though we knew this didn't comply with the specification. One reason was that we rarely worked with mixed-content data at that time; in addition, whitespace indentation increased the size of databases and led to worse rendering results in the built-in visualizations (our first users were confused about that).

Maybe we’ll switch the default in a future version of BaseX.




Jos van den Oever <jos@vandenoever.info> wrote on Tue, 16 Feb 2021, 23:36:
Thanks for the context.

Still, it does not explain the difference in behavior between doc() and
parse-xml().
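
To make the difference concrete, here is a minimal Python sketch of what
chopping appears to amount to, judging only from the output reported in this
thread: leading and trailing whitespace is trimmed from every text node. This
is not BaseX's implementation; the helper name chop() is hypothetical and the
standard-library ElementTree parser merely stands in for an XML parser.

```python
import xml.etree.ElementTree as ET


def chop(elem):
    """Trim leading/trailing whitespace from every text node under elem (in place).

    Hypothetical helper that mimics the observed effect of chopping;
    it is not how BaseX implements the CHOP option.
    """
    for node in elem.iter():
        # .text is the text before the first child, .tail the text after the element.
        node.text = (node.text or "").strip() or None
        node.tail = (node.tail or "").strip() or None
    return elem


source = "<a> a b <a> c </a> d e </a>"

# Parsing without chopping keeps the whitespace, as parse-xml() does in the report.
preserved = ET.tostring(ET.fromstring(source), encoding="unicode")

# Parsing followed by chopping gives the shape reported for doc('test.xml').
chopped = ET.tostring(chop(ET.fromstring(source)), encoding="unicode")

print(preserved)
print(chopped)
```

The second line printed has the same shape as the doc('test.xml') result quoted
further down, which suggests the two functions differ only in whether this
trimming step is applied.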

As far as I understand the XDM specification, whitespace may be ignored by the
parser if there is a DTD or XML Schema that says that an element is not PCDATA
(DTD) or mixed (XML Schema). In the absence of (support for) schemas, all
whitespace should be left in. Wendell Piez has written about this in detail.

Whitespace in XML is tricky. E.g., indenting XML cannot be done well without
knowing which elements are PCDATA/mixed.
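
A small Python sketch of the indentation problem, assuming Python 3.9 or later
for xml.etree.ElementTree.indent(). The sample element is made up for
illustration. The indenter has no schema information, so it treats the single
space between two inline elements as insignificant and replaces it with a line
break plus indentation, which changes the text content of the paragraph.

```python
import xml.etree.ElementTree as ET

# Hypothetical mixed-content sample: the space between </b> and <i> is significant.
mixed = ET.fromstring("<p>Hello <b>big</b> <i>world</i></p>")

before = "".join(mixed.itertext())

# indent() rewrites whitespace-only text between elements; it cannot know
# that <p> has mixed content because no DTD or schema is consulted.
ET.indent(mixed)

after = "".join(mixed.itertext())

print(repr(before))
print(repr(after))
print(ET.tostring(mixed, encoding="unicode"))
```

The two repr() lines differ, so a tool that pretty-prints without knowing the
content model has altered the document's text.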

Now that I know about the CHOP option, I can use BaseX predictably. And the
legacy reasons for keeping it set are understandable.

Best regards,
Jos

On Tuesday, 16 February 2021 23:10:05 CET Christian Grün wrote:
> There is an old (and still open) issue on GitHub [1] that might give you
> some more insight into the history of whitespace chopping in BaseX.
>
> Hope this helps
> Christian
>
> [1] https://github.com/BaseXdb/basex/issues/913
>
>
>
>
> Jos van den Oever <jos@vandenoever.info> wrote on Tue, 16 Feb 2021, 22:41:
> > Hi Christian,
> >
> > Yes, writing 'CHOP=OFF' in .basex stops the vanishing of whitespace.
> >
> > But where in the XQuery or XDM spec does it say that whitespace handling
> > when parsing is implementation dependent?
> >
> > Cheers,
> > Jos
> >
> > On Tuesday, 16 February 2021 22:10:30 CET Christian Grün wrote:
> > > Hi Jos,
> > >
> > > Whitespace will be preserved if the CHOP option is disabled. You can make
> > > this the default by adding CHOP=false to your .basex configuration file
> > > [1,2].
> > >
> > > Hope this helps,
> > > Christian
> > >
> > > [1] https://docs.basex.org/wiki/Full-Text#Mixed_Content
> > > [2] https://docs.basex.org/wiki/Configuration
> > >
> > >
> > >
> > >
> > > Jos van den Oever <jos@vandenoever.info> wrote on Tue, 16 Feb 2021, 22:00:
> > > > Dear all,
> > > >
> > > > First off: BaseX is great to work with. I use it for a few statically
> > > > generated websites.
> > > >
> > > > But I recently found what might be a bug.
> > > >
> > > > Some whitespace vanishes when loading XML files. E.g. this XML file:
> > > >
> > > > ```test.xml
> > > > <a> a b <a> c </a> d e </a>
> > > > ```
> > > >
> > > > run like this:
> > > >
> > > > doc('test.xml')
> > > >
> > > > gives:
> > > >
> > > > <a>a b<a>c</a>d e</a>
> > > >
> > > > But running this:
> > > >
> > > > ```
> > > > parse-xml('<a> a b <a> c </a> d e </a>')
> > > > ```
> > > >
> > > > retains the whitespace.
> > > >
> > > > I've tested this with BaseX 7.0, 8.0, 9.0 and 9.4.6.
> > > >
> > > > Running this in saxon-he-10.3.jar retains the whitespace.
> > > >
> > > > I can work around this issue by placing xml:space="preserve" in the
> > > > document element.
> > > >
> > > > I cannot come up with a scenario in which discarding whitespace during
> > > > parsing is OK when no DTD or XML Schema is provided.
> > > >
> > > > Best regards,
> > > > Jos