Hello,
I was stressing the BaseX 8.6 planner/optimizer when I noticed that expressions like `count(//elem)` are not optimized at all, even though they are correctly indexed, as demonstrated by `index:element-names()`.
The current database is a 300 MB TEI document. All the elements are in the `http://www.tei-c.org/ns/1.0` namespace.
The following test case will report the correct number, but it will take a couple of seconds to run, instead of a few milliseconds.
```
declare namespace tei="http://www.tei-c.org/ns/1.0";
let $n := index:element-names("monier")[. = 're']/@count
let $c := count(//tei:re)
return <res><in-index>{$n}</in-index><in-doc>{$c}</in-doc></res>
```
I wonder if the presence of the namespace somehow confuses the optimizer. The same problem can be observed running the same test case with
```
declare default element namespace "http://www.tei-c.org/ns/1.0";
[...]
let $c := count(//re)
```
Regards,
-- Gioele Barabucci gioele.barabucci@uni-koeln.de
Hi Gioele,
I wonder if the presence of the namespace somehow confuses the optimizer.
Exactly, that’s the reason. For some historical reason (but not such a wise one, as most quoted “historical reasons” are), we decided to index the node names without considering the namespace URI. As a result, the index:element-names function will yield…
<entry count="2">xml</entry>
…for the following document:
<xml> <xml xmlns='uri'/> </xml>
For the same reason, various optimizations that are based on the database statistics will only take effect if a document contains no namespace declaration, or at most one global one. In various cases, optimizations could still be made possible (e.g. if we know that the element/attribute names with and without namespace URIs are distinct), but that hasn’t been implemented so far.
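To make the ambiguity concrete, here is a minimal sketch that rebuilds the two-element document from above in memory (no database involved) and counts the `xml` elements in three ways. It only uses standard XQuery 3.0 name tests; the result element `counts` and its attribute names are made up for illustration.

```
(: The example document: an outer <xml> in no namespace,
   an inner <xml> in the namespace 'uri'. :)
let $doc := document { <xml><xml xmlns="uri"/></xml> }
return <counts
  no-namespace="{ count($doc//xml) }"
  uri-namespace="{ count($doc//Q{uri}xml) }"
  any-namespace="{ count($doc//*:xml) }"/>
```

The first two counts are 1 each and the third is 2. A name index that stores a single entry `xml` with `count="2"` matches only the wildcard test, so it cannot answer `count(//xml)` for this document without visiting the nodes.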
Cheers, Christian
On 22.02.2017 at 15:53, Christian Grün wrote:
I wonder if the presence of the namespace somehow confuses the optimizer.
For the same reason, various optimizations that are based on the database statistics will only take effect if a document contains no namespace declaration, or at most one global one.
Hi Christian, thank you for the explanation.
I have a couple of further questions and a request.
The first question: in my case all the elements are in one namespace, declared in the root element. Shouldn't the optimizations kick in?
The other question: what can be done right now to work around this problem? Is there any workaround aside from reimporting the database with the namespace information stripped?
And now the request. Would it be possible to document in the wiki which optimizations are not possible when namespaces (or more than one namespace) are used?
Regards,
Hi Gioele,
The first question: in my case all the elements are in one namespace, declared in the root element. Shouldn't the optimizations kick in?
I must admit it’s been a while since we worked on these optimizations. If you are interested in digging deeper, the Path.size function could be a good entry point [1].
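Until the optimizer handles this case, one possible stopgap is to read the count from the name index directly, as the original test case already does. The sketch below wraps that lookup in a helper; `local:indexed-count` is a hypothetical name, and the approach assumes that every element with the given local name lives in the same namespace and that the database statistics are up to date.

```
(: Hypothetical helper: returns the occurrence count stored in the
   element name index, or 0 if the name is not indexed.
   The index ignores namespace URIs, so the result is only the
   namespace-qualified count if the local name is used in one namespace. :)
declare function local:indexed-count(
  $db   as xs:string,
  $name as xs:string
) as xs:integer {
  xs:integer((index:element-names($db)[. = $name]/@count, 0)[1])
};

local:indexed-count("monier", "re")
```

For the TEI database discussed here, this would stand in for `count(//tei:re)`. It is not a general replacement, because a document that mixes namespaces would be over-counted.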
Is there any workaround aside from reimporting the database with the namespace information stripped?
Stripping namespaces (via STRIPNS [2]) would surely be an option.
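As a rough sketch of what such a reimport could look like from XQuery: the call below assumes that `db:create` accepts parsing options such as `stripns` in lower case in its options map, as described in the Database Module documentation. The database name `monier-nons` and the input path are placeholders, not names from this thread.

```
(: Placeholder names: adjust the target database and the input path.
   'stripns' removes the namespaces from all elements and attributes
   while the document is parsed. :)
db:create(
  "monier-nons",
  "/path/to/monier.xml",
  "monier.xml",
  map { "stripns": true() }
)
```

Since the stored elements no longer carry a namespace, later queries against the new database have to drop the prefix as well, e.g. `count(db:open("monier-nons")//re)` instead of `count(//tei:re)`.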
And now the request. Would it be possible to document in the wiki which optimizations are not possible when namespaces (or more than one namespace) are used?
It would take me quite some time to get this all documented, but let’s see what I can do.
Christian
[1] https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/ba...
[2] http://docs.basex.org/wiki/Options#STRIPNS
Hi Christian,
Does this mean, for a given set of XML files that have several namespace declarations attached to the root element including a default namespace, if namespaces are removed when these XML files are loaded into BaseX, for example by using the “Strip namespaces” option in the GUI, BaseX may be able to use additional query optimizations?
If the answer is yes, I may have to try re-importing several databases with “Strip namespaces” turned on to see if there is any difference in query performance.
Thank you, Vincent
From: basex-talk-bounces@mailman.uni-konstanz.de [mailto:basex-talk-bounces@mailman.uni-konstanz.de] On Behalf Of Christian Grün
Sent: Wednesday, February 22, 2017 9:54 AM
To: Gioele Barabucci gioele@svario.it
Cc: BaseX basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] count(//elem) not optimized, even though `elem` is in the index
Hi Vincent,
Does this mean, for a given set of XML files that have several namespace declarations attached to the root element including a default namespace, if namespaces are removed when these XML files are loaded into BaseX, for example by using the “Strip namespaces” option in the GUI, BaseX may be able to use additional query optimizations?
Definitely yes, but it depends on your queries and on whether they contain patterns that can be optimized by taking advantage of the database statistics. Feel free to report back to us.
Christian