count(//elem) not optimized, even though `elem` is in the index

Gioele Barabucci

22 Feb 2017, 6 a.m.

Hello,

I was stress-testing the BaseX 8.6 planner/optimizer when I noticed that expressions like `count(//elem)` are not optimized at all, even though the elements are correctly indexed, as shown by `index:element-names()`.

The current database is a 300 MB TEI document. All the elements are in the `http://www.tei-c.org/ns/1.0` namespace.

The following test case will report the correct number, but it will take a couple of seconds to run, instead of a few milliseconds.

```
declare namespace tei="http://www.tei-c.org/ns/1.0";

let $n := index:element-names("monier")[. = 're']/@count
let $c := count(//tei:re)
return <res><in-index>{$n}</in-index><in-doc>{$c}</in-doc></res>
```

I wonder if the presence of the namespace somehow confuses the optimizer. The same problem can be observed when running the same test case with:

```
declare default element namespace "http://www.tei-c.org/ns/1.0";
[...]
let $c := count(//re)
```

Regards,

-- Gioele Barabucci gioele.barabucci@uni-koeln.de

Christian Grün

22 Feb, 9:53 a.m.

Hi Gioele,

> I wonder if the presence of the namespace somehow confuses the optimizer.

Exactly, that’s the reason. For some historical reason (but not such a wise one, as most quoted “historical reasons” are), we decided to index the node names without considering the namespace URI. As a result, the index:element-names function will yield…

…for the following document:

For the same reason, various optimizations that are based on the database statistics will only take effect if a document contains no namespace declaration, or at most one global one. In various cases, optimizations could still be made possible (e.g. if we know that the element/attribute names with and without namespace URIs are distinct), but that hasn’t been implemented so far.
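To illustrate why name-only statistics cannot safely answer a namespace-qualified count, here is a minimal sketch in plain XQuery. It does not use the BaseX index itself; the document, the `urn:example:*` namespaces and the variable names are hypothetical, and the grouping only imitates statistics that ignore the namespace URI.

```xquery
(: Hypothetical document: the name "re" is used in two different namespaces. :)
let $doc :=
  <root xmlns="urn:example:a">
    <re/>
    <re/>
    <other xmlns="urn:example:b"><re/></other>
  </root>

(: Name-only statistics: group elements by their name as written,
   ignoring the namespace URI. :)
let $by-name :=
  for $e in $doc/descendant-or-self::*
  group by $name := name($e)
  return <entry count="{ count($e) }">{ $name }</entry>

(: Namespace-aware count of one specific expanded name. :)
let $in-a := count($doc//Q{urn:example:a}re)

return <res>{ $by-name }<in-a>{ $in-a }</in-a></res>
```

In this sketch the name-only grouping counts three `re` elements, while only two of them are in `urn:example:a`, so a count taken from name-only statistics would be wrong for the namespace-qualified query.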

Cheers, Christian

Gioele Barabucci

10:13 a.m.

On 22.02.2017 at 15:53, Christian Grün wrote:

> > I wonder if the presence of the namespace somehow confuses the optimizer.
>
> For the same reason, various optimizations that are based on the database statistics will only take effect if a document contains no namespace declaration, or at most one global one.

Hi Christian, thank you for the explanation.

I have a couple of further questions and a request.

The first question: in my case all the elements are in one namespace, declared in the root element. Shouldn't the optimizations kick in?

The other question: what can be done right now to work around this problem? Is there any workaround apart from reimporting the database with the namespace information stripped?

And now the request. Would it be possible to document in the wiki which optimizations are not possible when namespaces (or more than one namespace) are used?

Regards,

-- Gioele Barabucci gioele@svario.it

Christian Grün

6:09 p.m.

Hi Gioele,

> The first question: in my case all the elements are in one namespace, declared in the root element. Shouldn't the optimizations kick in?

I must admit it’s been a while since we worked on these optimizations. If you are interested in digging deeper, the Path.size function could be a good entry point [1].

> Any workaround aside reimporting the database stripping the namespace information?

Stripping namespaces (via STRIPNS [2]) would surely be an option.
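As a rough sketch of that workaround, the database could be re-created with the option enabled from XQuery. The database name `monier-nons` and the input path are hypothetical placeholders, and the lower-case `stripns` key assumes that `db:create` accepts parsing options in its options map.

```xquery
(: Sketch only: 'monier-nons' and the input path are hypothetical placeholders. :)
db:create(
  'monier-nons',            (: name of the new database :)
  '/path/to/monier.xml',    (: input document :)
  'monier.xml',             (: target path inside the database :)
  map { 'stripns': true() } (: strip all namespaces while parsing :)
)

(: Afterwards, in a separate query, the elements can be addressed without a namespace:
   count(db:open('monier-nons')//re)
:)
```

The count has to run as a separate query because `db:create` is an updating function. `db:open` is the function name in the BaseX 8.x line discussed here; newer releases may use a different name.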

> And now the request. Would it be possible to document in the wiki which optimizations are not possible when namespaces (or more than one namespace) are used?

It would take me quite some time to get this all documented, but let’s see what I can do.

Christian

[1] https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/ba...
[2] http://docs.basex.org/wiki/Options#STRIPNS

Lizzi, Vincent

5:15 p.m.

Hi Christian,

Does this mean that, for a set of XML files with several namespace declarations on the root element (including a default namespace), BaseX may be able to apply additional query optimizations if the namespaces are removed when the files are loaded, for example via the “Strip namespaces” option in the GUI?

If the answer is yes, I may have to try re-importing several databases with “Strip namespaces” turned on to see if there is any difference in query performance.

Thank you, Vincent

Christian Grün

6:11 p.m.

Hi Vincent,

> Does this mean, for a given set of XML files that have several namespace declarations attached to the root element including a default namespace, if namespaces are removed when these XML files are loaded into BaseX, for example by using the “Strip namespaces” option in the GUI, BaseX may be able to use additional query optimizations?

Definitely yes, but it surely depends on your queries, and on whether they contain patterns that can be optimized by taking advantage of the database statistics. Feel free to report back to us.

Christian
