I am experiencing unexpected behavior with a database I am working with in BaseX 8.3.1. The database is a collection of information about trials in the late Roman Republic (see http://tlrr.blackmesatech.com/ for more information), and while the upper-level elements have only element content, most of the actual data values are mixed content.
I reloaded the data the other day, having run some cleanup processes on it to regularize the whitespace and make the XML source more readable. In one trial record, for example, the information about the defendant looks like this:
<defGrp> <defendant> <namelist> <person-entry> <person pid="pSulpicius58Ser.Galba" ix="2" form="Sulpicius (+58), Ser. Galba" >Ser. Sulpicius Galba (58)</person> cos. 144 spoke <i>pro se</i> (<i>ORF</i> 19.II, III)</person-entry> </namelist> </defendant> </defGrp>
(This says that the defendant in the case was one Servius Sulpicius Galba, whose biography is given as the 58th entry under "Galba" in the Pauly/Wissowa Reallexikon, that this man was consul in 144 BC, that he spoke on his own behalf, and that the extant fragments of his speech are printed in the collection Oratorum Romanorum Fragmenta (ORF) as items 19.II and 19.III.)
After a little research, I learned (I think) how to make the default settings for the database have the value CHOP = false (I call db:create($dbname, (), (), map { "chop": false() }) to create the db), and also (redundantly, I hope) to specify CHOP = false as an option on the db:add() and db:replace() calls I am using to reload records in the database.
When the web front end retrieves the individual trial record whose defendant information is shown above, I get a result that looks essentially like what is shown above. When a different query retrieves just portions of the trial record, using the expression
<trial id="{$e/@id}" tlrr1="{$e/@tlrr1}" doc="{document-uri(root($e))}">{ $e/date, $e/ccGrp, $e/defGrp(: /defendant :), $e/ppGrp(: /prosecutor :), $e/partiesGrp, $e/advGrp }</trial>
the defendant information looks like this, according to both Safari and Opera:
<defGrp> <defendant> <namelist> <person-entry> <person pid="pSulpicius58Ser.Galba" ix="2" form="Sulpicius (+58), Ser. Galba">Ser. Sulpicius Galba (58)</person>cos. 144 spoke<i>pro se</i>(<i>ORF</i>19.II, III)</person-entry> </namelist> </defendant> </defGrp>
Note that within the person-entry element, the whitespace adjacent to the 'person' and 'i' elements has disappeared.
It looks almost as if some queries were stripping whitespace as part of the query, or as part of returning a result. To confuse me even further, dynamic queries using the dba application on the server return data with the whitespace chopped.
Is there something obvious I am overlooking or doing wrong?
Actually, I guess I have two questions: first, I'd like to figure out why BaseX is currently behaving as it does. And then I'd like to make it behave differently.
I realize now that the documents I just updated all had xml:space="preserve" on their root elements, because I couldn't make this work last time I tried, either. I would much much rather avoid resorting to that again, if I can, since it feels like a hack and it complicates processing of the data.
I will try to construct a minimum repeatable example that illustrates the problem, but I have not done so yet.
thanks for any help anyone on the list can provide,
Michael
On Jun 16, 2016, at 10:32 PM, C. M. Sperberg-McQueen wrote:
I am experiencing unexpected behavior with a database I am working with in BaseX 8.3.1. ...
I will try to construct a minimum repeatable example that illustrates the problem, but I have not done so yet.
One attempt to reduce things to the essentials is:
1. I've placed one input document at http://tlrr.blackmesatech.com/2016/06/ZAA.xml
2. Running curl http://tlrr.blackmesatech.com/2016/06/ZAA.xml | grep person shows whitespace in the data, as shown in (A) below. As can be seen, I've added xml:space to one 'person-entry' element as an experiment.
3. Running the following updating query in the Queries interface in the database server produces a 'Query successful' message.
(: load a single document :)
let $options-map := map { "chop": false(), "intparse": true() }
let $host := "http://tlrr.blackmesatech.com", $path := "trials/ZAA.xml", $uri := concat($host, '/2016/06/ZAA.xml'), $doc := doc($uri)
return db:replace('tlrr1-alpha', $path, $doc, $options-map)
4. Running the command
curl --user ... http://modeleditions.blackmesatech.com/BaseX831/rest/tlrr1-alpha/trials/ZAA.... | grep person
with a userid assigned read-only access to the database produces the results shown in (B) below, which show the whitespace being stripped, despite (a) the database having been created with
db:create('tlrr1-alpha',(),(), map { "chop" : false() })
and (b) the use of map { "chop" : false() } in the update query shown above.
5. Trying this with "chop" : false(), "chop" : 'false', "chop" : 0 does not change the result. Nor does "chop" : true(), which I tried just in case I was reading the documentation wrong. Including "intparse" : true() also has no visible effect.
As the URI of the REST interface suggests, the server is running BaseX 8.3.1. The documentation says the XML parsing options were added to db:replace in version 7.9.
I'm close to my wits' end. Why is the CHOP option not working as advertised? Or what am I doing wrong in trying to set it?
Michael
p.s. the earlier report that some queries returned stripped text nodes and others returned unstripped text nodes appears to be irreproducible. Perhaps it was caused by stale caches.
......
(A) output of curl http://tlrr.blackmesatech.com/2016/06/ZAA.xml | grep person
Note white space after 'person' elements and elsewhere.
<person-entry> <person pid="pSulpicius58Ser.Galba" form="Sulpicius (+58), Ser. Galba">Ser. Sulpicius Galba (58)</person> cos. 144 spoke <i>pro se</i> (<i>ORF</i> 19.II, III)</person-entry> <person-entry xml:space="preserve"> <person pid="pFulvius95Q.Nobilior" ix="3" form="Fulvius (+95), Q. Nobilior">Q. Fulvius Nobilior (95)</person> cos. 153, cens. 136</person-entry> <person-entry> <person pid="pCornelius91L.Cethegus" form="Cornelius (+91), L. Cethegus">L. Cornelius Cethegus (91)</person> </person-entry> <person-entry> <person pid="pPorcius9M.Cato" ix="4" form="Porcius (++9), M. Cato">M. Porcius Cato (9)</person> cos. 195, cens. 184 (<i>ORF</i> 8.LI)</person-entry> <person-entry> <person pid="pScribonius18L.Libo" ix="4" form="Scribonius (+18), L. Libo">L. Scribonius Libo (18)</person> tr. pl. 149 (<i>promulgator</i>)</person-entry>
(B) output of curl --user ... http://modeleditions.blackmesatech.com/BaseX831/rest/tlrr1-alpha/trials/ZAA.... | grep person
Note absence of whitespace after 'person' elements, except in the entry for Q. Fulvius Nobilior.
<person-entry> <person pid="pSulpicius58Ser.Galba" ix="2" form="Sulpicius (+58), Ser. Galba">Ser. Sulpicius Galba (58)</person>cos. 144 spoke<i>pro se</i>(<i>ORF</i>19.II, III)</person-entry> <person-entry xml:space="preserve"> <person pid="pFulvius95Q.Nobilior" ix="3" form="Fulvius (+95), Q. Nobilior">Q. Fulvius Nobilior (95)</person> cos. 153, cens. 136</person-entry> <person-entry> <person pid="pCornelius91L.Cethegus" ix="4" form="Cornelius (+91), L. Cethegus">L. Cornelius Cethegus (91)</person> </person-entry> <person-entry> <person pid="pPorcius9M.Cato" ix="4" form="Porcius (++9), M. Cato">M. Porcius Cato (9)</person>cos. 195, cens. 184 (<i>ORF</i>8.LI)</person-entry> <person-entry> <person pid="pScribonius18L.Libo" ix="4" form="Scribonius (+18), L. Libo">L. Scribonius Libo (18)</person>tr. pl. 149 (<i>promulgator</i>)</person-entry>
Dear Michael,
As you correctly guessed, if you want to preserve whitespace, you will need to set the CHOP option to false. I remember there has been discussion around this option on this list more than once. It turned out that changing the default to 'false' would cause quite a lot of surprises, because the visualizations, the database layout etc. have been tailored to work best without superfluous whitespace. But whitespace chopping is surely not what you would expect when working with full-text [1].
By the way, I never stopped wondering why only 'preserve' and 'default' are allowed as values for the xml:space attribute. As one of the renowned editors of the spec, can you tell why a 'strip' value was omitted back then?
Please note that 'chop' in combination with db:create will only take effect if you specify actual input with this command [2]. If you want to deactivate whitespace chopping globally, you can specify this option in the .basex configuration file or (if you are working with RESTXQ, REST, etc.) add it in the web.xml file.
Hope this helps, Christian
[1] http://docs.basex.org/wiki/Full-Text#Mixed_Content [2] http://docs.basex.org/wiki/Database_Module#db:create
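A minimal sketch of the point about db:create, assuming the database name, target path and document URI used earlier in this thread: the "chop" option applies to input that db:create itself parses, so the input is passed here as a URI string instead of being left empty. It has not been run against 8.3.1 and is meant only as an illustration.

```xquery
(: Sketch only. 'tlrr1-alpha', 'trials/ZAA.xml' and the URI are taken
   from the earlier messages; adjust them to your own setup. :)
db:create(
  'tlrr1-alpha',
  (: input given as a URI string, so that db:create does the parsing :)
  'http://tlrr.blackmesatech.com/2016/06/ZAA.xml',
  'trials/ZAA.xml',
  map { 'chop': false() }
)
```

With an empty input sequence, as in db:create('tlrr1-alpha', (), (), ...), there is nothing for the option to act on, which matches the behaviour described above.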
--
- C. M. Sperberg-McQueen, Black Mesa Technologies LLC
- http://www.blackmesatech.com
- http://cmsmcq.com/mib
- http://balisage.net
On Jun 18, 2016, at 6:35 AM, Christian Grün wrote:
Dear Michael,
As you correctly guessed, if you want to preserve whitespaces, you will need to set the CHOP option to false. I remember there has been discussion around this option on this list more than once.
Yes. What puzzles me is that calling db:replace with a fourth argument of map { "chop" : false() } appears not to have any effect in the database in question. (I still have not put together a minimal reproducible example; for the moment I solved the problem by adding xml:space="preserve" to each mixed-content element. I hope to come back to the issue of the chop option when my current rush is past.)
By the way, I never stopped wondering why only 'preserve' and 'default' are allowed as values for the xml:space attribute. As one of the renowned editors of the spec, can you tell why a 'strip' value was omitted back then?
The short answer is no, I cannot. (My prayers have been answered! There are some details of the design discussions of 1996 that I cannot remember!)
I have just spent much more time than I intended trying to find the relevant parts of the discussion in the email archive at
http://lists.w3.org/Archives/Public/w3c-sgml-wg/
Before being named 'xml:space', the attribute in question appears to have gone by the name 'xml-space' or '-xml-space' (as the group's attempts to reserve a portion of the namespace for itself changed over time). As far as I can tell, the discussions on whitespace handling began in September 1996 and may have been mostly concluded by December of that year.
A document containing the group's summaries of design decisions mentions (what became) the xml:space attribute in decisions of 29 October and again on 18 December, and again on 4 June 1997.
http://www.w3.org/XML/9712-reports.html
The two values were labeled 'KEEP' and 'COLLAPSE' in the draft of 14 November 1996 (which appears to be the oldest one in the W3C technical-reports area); 'COLLAPSE' was later renamed to 'DEFAULT'. Later proposals to introduce a third value with a name like REMOVE or DISCARD did come up, but appear never to have gotten any traction.
http://www.w3.org/TR/WD-xml-961114#sec2.7
Speaking for myself, I think a better heuristic than dropping all whitespace-only text nodes and removing leading and trailing whitespace would be dropping whitespace-only text nodes only if every text-node seen so far as a child of this parent has been whitespace-only, and stripping leading whitespace only after a start-tag and trailing whitespace only after an end-tag.
The first would prevent the loss of the inter-word whitespace in
<p>This <em>is</em> <strong>IMPORTANT</strong></p>
But it may be as impossible for BaseX to change the details of the CHOP option as it is to change the default value for the CHOP option from true to false.
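A rough sketch of this heuristic as an XQuery function might look like the following. The function name local:chop is made up for illustration, and the code rests on two assumptions that go beyond the wording above: "after a start-tag" is read as "the text node is the first child of its parent", and the trailing-whitespace rule is read as "the text node is the last child of its parent, i.e. directly before the parent's end-tag". xml:space is ignored entirely. The boundary-space declaration is needed because direct element constructors would otherwise drop the whitespace-only text node before the function ever sees it.

```xquery
declare boundary-space preserve;

(: Hypothetical helper, not part of BaseX: applies the proposed
   heuristic to an already parsed, whitespace-preserving tree. :)
declare function local:chop($n as node()) as node()? {
  typeswitch ($n)
    case document-node() return
      document { $n/node() ! local:chop(.) }
    case element() return
      element { node-name($n) } {
        $n/@*,
        $n/node() ! local:chop(.)
      }
    case text() return
      (: rule 1: drop a whitespace-only text node only if all earlier
         text siblings were whitespace-only as well :)
      let $only-ws-so-far :=
        every $t in $n/preceding-sibling::text()
        satisfies normalize-space($t) = ''
      return
        if (normalize-space($n) = '' and $only-ws-so-far) then ()
        else
          (: rule 2 (assumed reading): strip leading whitespace only in a
             first child, trailing whitespace only in a last child :)
          let $s := string($n)
          let $s := if (empty($n/preceding-sibling::node()))
                    then replace($s, '^\s+', '') else $s
          let $s := if (empty($n/following-sibling::node()))
                    then replace($s, '\s+$', '') else $s
          return if ($s = '') then () else text { $s }
    default return $n
};

(: Under these assumptions the string value should be
   'This is IMPORTANT', with the inter-word space kept. :)
string(local:chop(<p>This <em>is</em> <strong>IMPORTANT</strong></p>))
```

Whitespace-only text nodes between purely structural elements are still dropped by rule 1, since no non-whitespace text has been seen in their parent.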
Please note that 'chop' in combination with db:create will only take effect if you specify actual input with this command [2].
Thank you; I had not realized that (I imagined it was somehow setting a default for the collection being created).
If you want to deactivate whitespace chopping globally, you can specify this option in the .basex configuration file or (if you are working with RESTXQ, REST, etc.) add it in the web.xml file.
Aha. That may be the thing to do.
Hope this helps,
As always, it does. Thank you very much.
Michael
Yes. What puzzles me is that calling db:replace with a fourth argument of map { "chop" : false() } appears not to have any effect in the database in question.
This is probably because the input you are specifying consists of nodes, for which whitespace has already been chopped in a previous step. I have now tried to explain this better in our documentation [1].
I have just spent much more time than I intended trying to find the relevant parts of the discussion in the email archive at
Thanks a lot, it was really interesting to read! Although it’s obviously too late to change anything about the status quo.
Speaking for myself, I think a better heuristic than dropping all whitespace-only text nodes and removing leading and trailing whitespace would be dropping whitespace-only text nodes only if every text-node seen so far as a child of this parent has been whitespace-only, and stripping leading whitespace only after a start-tag and trailing whitespace only after an end-tag.
Your idea sounds very appealing to me; I have just added it to our issue tracker [2]. I will still need to think about all the minor and hidden implications, but I think we could definitely live with changing the default behaviour:
* Nothing would change for highly structured input
* For mixed content, things can only get better.
If there turn out to be cases in which relevant whitespace gets lost, we could still fine-tune your heuristics.
All the best, Christian
[1] http://docs.basex.org/wiki/Database_Module#db:replace [2] https://github.com/BaseXdb/basex/issues/913
On Jun 21, 2016, at 12:43 AM, Christian Grün wrote:
Yes. What puzzles me is that calling db:replace with a fourth argument of map { "chop" : false() } appears not to have any effect in the database in question.
This is probably because the input you are specifying consists of nodes, for which whitespace has already been chopped in a previous step. I have now tried to explain this better in our documentation [1].
[Headslap] D'oh! Thank you very much.
I have created two simple examples to try to teach myself what is going on here. Is the characterization correct?
db:add("DB", "http://example.com/doc.xml", "doc.xml", map { "chop": false() }) -- parses the file at the URI http://example.com/doc.xml with the CHOP option turned off (so whitespace is preserved), and adds it to database DB.
db:add("DB", doc("http://example.com/doc.xml"), "doc.xml", map { "chop": false() }) -- parses the file at the URI http://example.com/doc.xml with the default parser settings and adds it to database DB. Note that the CHOP setting in the fourth argument has no effect, since the document is parsed by the doc() function, not the db:add function.
If you think it helpful, feel free to add these to the documentation for db:add() or db:replace; it might help even readers like me to understand what is going on.
best,
Michael
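Applied to the db:replace query from earlier in the thread, the same reasoning suggests passing the URI as a string so that db:replace does the parsing itself. This is a sketch under that assumption, reusing the database name, path and URI from the earlier message; it has not been verified against 8.3.1.

```xquery
(: load a single document, letting db:replace parse it :)
let $options-map := map { "chop": false() }
let $host := "http://tlrr.blackmesatech.com",
    $path := "trials/ZAA.xml",
    $uri  := concat($host, '/2016/06/ZAA.xml')
(: $uri is passed as a string; doc($uri) would hand over nodes that
   were already parsed with the default CHOP setting :)
return db:replace('tlrr1-alpha', $path, $uri, $options-map)
```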
To add to these examples, I think another way to add a document with whitespace preserved is:
declare option db:chop 'false'; db:add("DB", doc("http://example.com/doc.xml"), "doc.xml")
Is this equivalent to db:add("DB", "http://example.com/doc.xml", "doc.xml", map { "chop": false() }) ?
Vincent
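One way to compare the two variants empirically is to load the document each way and then, in a separate query (db:add is an updating function, so its effect is not visible within the same query), count the text nodes that still begin or end with whitespace. This is only a sketch: "DB" and "doc.xml" are the placeholder names from the examples above, and db:open is the function name in the 8.x releases.

```xquery
(: Run after the updating query has finished.
   0 suggests that whitespace was chopped (apart from content under
   xml:space="preserve"); a positive count suggests it was kept. :)
count(
  db:open('DB', 'doc.xml')//text()[matches(., '^\s|\s$')]
)
```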
From: basex-talk-bounces@mailman.uni-konstanz.de [mailto:basex-talk-bounces@mailman.uni-konstanz.de] On Behalf Of C. M. Sperberg-McQueen Sent: Friday, June 24, 2016 1:12 PM To: Christian Grün christian.gruen@gmail.com Cc: C. M. Sperberg-McQueen cmsmcq@blackmesatech.com; BaseX basex-talk@mailman.uni-konstanz.de Subject: Re: [basex-talk] unexpected whitespace-handling behavior in BaseX 8.3.1
Thanks for the examples, I’ll add them to the documentation (as usual, we are happy about external edits in our Wiki).
Regarding the fine-tuning of the CHOP algorithm, I still want to systematically investigate the effects on a larger number of mixed-content documents (still in the pipeline, hopefully not for too long).
Hi Michael,
Speaking for myself, I think a better heuristic than dropping all whitespace-only text nodes and removing leading and trailing whitespace would be dropping whitespace-only text nodes only if every text-node seen so far as a child of this parent has been whitespace-only, and stripping leading whitespace only after a start-tag and trailing whitespace only after an end-tag.
Some more time passed, and I finally tried to rewrite your little proposal into some XQuery code to test the implications. When executing it, I end up with the following result…
<p>This <em>is</em><strong>IMPORTANT</strong></p>
Maybe it’s because the removal of leading OR trailing whitespace can also lead to a zero-length text node? Maybe I should simply spend some more time on thinking about it? ;)
I have attached a mini example to this mail; suggestions (from everyone) are welcome.
Christian
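The zero-length effect suspected above can be reproduced in isolation. The snippet below is not the attached example (which is not shown here); it is a small sketch, assuming that leading and trailing whitespace are stripped from every text child regardless of its position, and it prints each text child of the sample paragraph before and after stripping.

```xquery
(: needed so the constructor keeps the whitespace-only text node :)
declare boundary-space preserve;

let $p := <p>This <em>is</em> <strong>IMPORTANT</strong></p>
for $t in $p/text()
(: strip leading and trailing whitespace unconditionally :)
let $stripped := replace(replace($t, '^\s+', ''), '\s+$', '')
return concat('[', $t, '] -> [', $stripped, ']')
(: expected: '[This ] -> [This]' and '[ ] -> []' :)
```

The second text node consists of a single space, so stripping either end leaves an empty string, and the space between the em and strong elements disappears.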