Hi Christian,
Does this mean that, for a set of XML files with several namespace declarations on the root element (including a default namespace), BaseX may be able to apply
additional query optimizations if the namespaces are removed when the files are loaded, for example via the “Strip namespaces” option in the GUI?
If so, I may try re-importing several databases with “Strip namespaces” turned on to see whether query performance changes.
Thank you,
Vincent
From: basex-talk-bounces@mailman.uni-konstanz.de [mailto:basex-talk-bounces@mailman.uni-konstanz.de]
On Behalf Of Christian Grün
Sent: Wednesday, February 22, 2017 9:54 AM
To: Gioele Barabucci <gioele@svario.it>
Cc: BaseX <basex-talk@mailman.uni-konstanz.de>
Subject: Re: [basex-talk] count(//elem) not optimized, even though `elem` is in the index
Hi Gioele,
> I wonder if the presence of the namespace somehow confuses the optimizer.
Exactly, that’s the reason. For some historical reason (but not such a
wise one, as most quoted “historical reasons” are), we decided to
index the node names without considering the namespace URI. As a
result, the index:element-names function will yield…
<entry count="2">xml</entry>
…for the following document:
<xml>
<xml xmlns='uri'/>
</xml>
For the same reason, various optimizations that are based on the
database statistics will only take effect if a document contains
no namespace declaration, or at most one global one. In other cases,
optimizations would still be possible (e.g. if we know that the
element/attribute names with and without namespace URIs are distinct),
but that hasn’t been implemented so far.
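To make the ambiguity concrete, here is a minimal sketch against the
two-element document above. It builds the document in memory rather
than in a database, so it only illustrates why a count keyed on the
bare name cannot answer a namespace-aware path; it does not show what
the optimizer does internally. The variable names are made up for
illustration.

```
(: The sample document from above, built in memory (no database needed). :)
let $doc := document {
  <xml>
    <xml xmlns='uri'/>
  </xml>
}
(: Elements named "xml" in no namespace: only the outer element. :)
let $no-ns := count($doc//xml)
(: Elements named "xml" in the namespace 'uri': only the inner element. :)
let $uri-ns := count($doc//Q{uri}xml)
(: Elements with local name "xml" in any namespace: both elements. :)
let $any-ns := count($doc//*:xml)
return <counts no-ns="{ $no-ns }" uri-ns="{ $uri-ns }" any-ns="{ $any-ns }"/>
```

The first two counts are 1 each and the wildcard count is 2, which is
the figure the name index reports for `xml`. A lookup by name alone
can therefore only stand in for the namespace-aware count when no
second namespace is in play.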
Cheers,
Christian
> I was stressing the BaseX 8.6 planner/optimizer when I noticed that
> expressions like `count(//elem)` are not optimized at all, even though they
> are correctly indexed, as demonstrated by `index:element-names()`.
>
> The current database is a 300 MB TEI document. All the elements are in the
> `http://www.tei-c.org/ns/1.0` namespace.
>
> The following test case will report the correct number, but it will take a
> couple of seconds to run, instead of a few milliseconds.
>
> ```
> declare namespace tei="http://www.tei-c.org/ns/1.0";
>
> let $n := index:element-names("monier")[. = 're']/@count
>
> let $c := count(//tei:re)
>
> return <res><in-index>{$n}</in-index><in-doc>{$c}</in-doc></res>
> ```
>
> I wonder if the presence of the namespace somehow confuses the optimizer.
> The same problem can be observed running the same test case with
>
> ```
> declare default element namespace "http://www.tei-c.org/ns/1.0";
> [...]
> let $c := count(//re)
> ```
>
> Regards,
>
> --
> Gioele Barabucci <gioele.barabucci@uni-koeln.de>
>