Problem with Wikipedia database (or a more general namespace efficiency problem?)

List overview All Threads
Download

newer

older

How to delete a node with PHP?

EXPath ZIP Module

NewIntellectual

27 Feb 2011 27 Feb '11

11:31 a.m.

As both a stress test and to experiment, I created a database using a recent complete (current page) dump of English Wikipedia, a hefty file of 30.5 GB. I don't have enough memory apparently to create a full-text index of all of that text, so I created the DB without one.

My first testing came up empty until I realized that I needed to deal with the namespace (ugh). Then I tried:

declare default element namespace "http://www.mediawiki.org/xml/export-0.4/ "; //siteinfo

This contains a small amount of data and occurs only once in the document (at /mediawiki/siteinfo). However, it's extremely slow (~33 seconds on my system). The query plan is:

Timing: - Parsing: 0.35 ms - Compiling: 0.22 ms - Evaluating: 33316.32 ms - Printing: 0.3 ms - Total Time: 33317.19 ms

My surmise is that millions of node names are being checked rather than a path index being used to rapidly access the appropriate node(s). I don't think such a simple query should fail to be properly optimized. Another surmise is that it's related to namespaces not being indexed (?). While personally I very much dislike namespaces, they are common, and they have to be efficiently handled.

To see if it made a difference, I also tried an explicitly named namespace test:

declare namespace w="http://www.mediawiki.org/xml/export-0.4/"; //w:siteinfo

This results in:

<IterPath> <Root/> <IterStep axis="descendant" test="w:siteinfo"/> </IterPath> Timing: - Parsing: 0.33 ms - Compiling: 0.07 ms - Evaluating: 54288.51 ms - Printing: 0.3 ms - Total Time: 54289.23 ms

So performance is even worse.

Attachments:

attachment.html (text/html — 2.5 KB)

Show replies by date

Christian Grün

27 Feb 27 Feb

5:42 p.m.

Hi Phil,

...

declare default element namespace "http://www.mediawiki.org/xml/export-0.4/"; //siteinfo

If you know that this node will occur only once, the most efficient option will be to use a positional predicate:

( //*:siteinfo ) [1]

But you may be surprised that the following query is evaluated very quickly:

count(//*:siteinfo)

This means that the path index has indeed enough information to allow for a faster evaluation: we're not saving direct references to the target nodes (as such an index would get very large for e.g. the Wikipedia page element), but we're saving the number of distinct node paths. As a result, we could rewrite your query into

( //*:siteinfo ) [position() <= 1]

We haven't included this optimization yet, as the additional predicate may slow down other queries; but in your case, it would clearly speed up the evaluation time to a few milliseconds (if at all). I have added a GitHub issue to remember your thoughts:

https://github.com/BaseXdb/basex/issues#issue/29

...

While personally I very much dislike namespaces, they are common, and they have to be efficiently handled.

Namespaces, a great topic... It's true that name tests with prefixes will be evaluated slower than queries without prefixes (i.e., prefix wildcards). This is something most XQuery implementations suffer from, as the complex nature of namespaces does not enable simple reference checks. Indeed, most members of the W3 XML Query Working Group regret that namespaces have not been specified much simpler; due to all legacy issues, history cannot be reverted in that aspect.

After all, however, I was surprised to see that your query nearly took twice the time as the one without namespaces; I'd have expected a slowdown of maybe 10-15%. To conclude this: if you want faster queries, you should declare global namespaces, or simply use wildcards.

Hope this helps, Christian

Jan Vlčinský (CAD)

28 Feb 28 Feb

1:52 a.m.

Hi, I can confirm, that positional argument can be sometime quite costly. I had xpath in my xquery, which selects elements according to my criteria and pick up any of them, just to make sure I will get only one item in the result and not a sequence.

Original query was something like $ltn//PredefinedLocation[Location/TmcLinear/@locationCode = $primarycode]/Location[1] and replacing it by ($ltn//PredefinedLocation[Location/TmcLinear/@locationCode = $primarycode]/Location)[1]

speeded up my query from 30s to 35ms.

One of reasons could be, my original xpath was simply wrong and was still returning a sequence. But I realized, that using indexes it is probably quite costly to determine node position.

Jan

2011/2/27 Christian Grün christian.gruen@gmail.com

...

Hi Phil,

...
declare default element namespace "http://www.mediawiki.org/xml/export-0.4/"; //siteinfo

If you know that this node will occur only once, the most efficient option will be to use a positional predicate:

( //*:siteinfo ) [1]

But you may be surprised that the following query is evaluated very quickly:

count(//*:siteinfo)

This means that the path index has indeed enough information to allow for a faster evaluation: we're not saving direct references to the target nodes (as such an index would get very large for e.g. the Wikipedia page element), but we're saving the number of distinct node paths. As a result, we could rewrite your query into

( //*:siteinfo ) [position() <= 1]

We haven't included this optimization yet, as the additional predicate may slow down other queries; but in your case, it would clearly speed up the evaluation time to a few milliseconds (if at all). I have added a GitHub issue to remember your thoughts:

https://github.com/BaseXdb/basex/issues#issue/29

...
While personally I very much dislike namespaces, they are common, and they have to be efficiently handled.

Namespaces, a great topic... It's true that name tests with prefixes will be evaluated slower than queries without prefixes (i.e., prefix wildcards). This is something most XQuery implementations suffer from, as the complex nature of namespaces does not enable simple reference checks. Indeed, most members of the W3 XML Query Working Group regret that namespaces have not been specified much simpler; due to all legacy issues, history cannot be reverted in that aspect.

After all, however, I was surprised to see that your query nearly took twice the time as the one without namespaces; I'd have expected a slowdown of maybe 10-15%. To conclude this: if you want faster queries, you should declare global namespaces, or simply use wildcards.

Hope this helps, Christian _______________________________________________ BaseX-Talk mailing list BaseX-Talk@mailman.uni-konstanz.de https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk

-- *Ing. Jan Vlčinský* CAD programy Slunečnicová 338/3, 734 01 Karviná Ráj, Czech Republic tel: +420-597 602 024; mob: +420-608 979 040 skype: janvlcinsky; GoogleTalk: jan.vlcinsky@gmail.com http://cz.linkedin.com/in/vlcinsky

NewIntellectual

6:50 p.m.

For testing, I edited out the namespace decl. in the root element to remove that issue, and performed another test:

//siteinfo

I find it very disturbing that it still took 47 seconds to retrieve the single element, and that this:

/mediawiki/siteinfo

still took about 8 seconds.

Granted, most documents are *much* smaller than the entirety of current English Wikipedia pages, but it's a very valid use case and I think BaseX should dominate the world of XML databases.

I have several thoughts, but I think this may be the key issue:

On Sun, Feb 27, 2011 at 5:42 PM, Christian Grün christian.gruen@gmail.comwrote:

...

we're not saving direct references to the target nodes (as such an index would get very large for e.g. the Wikipedia page element),

I believe this premise should be challenged. Surely the performance issue above would be solved if indeed there were more specific indices to particular elements.

I don't think it's unreasonable to at least permit such an index to be created optionally. After all, consider text indexing. For most documents, the text itself is comprised of many more words than XML elements, yet that is considered a common and valid index to form. Why would 'page' as a textual word be more "indexable" than the <page> element?

Christian Grün

7:23 p.m.

...

I find it very disturbing that it still took 47 seconds to retrieve the single element, and that this: /mediawiki/siteinfo still took about 8 seconds.

True, that sounds slow.. Although I guess it's still faster than most other implementations available out there. Of course that's not reason enough to keep it like that. Currently, however, some other todos have more priority..

...

I believe this premise should be challenged. Surely the performance issue above would be solved if indeed there were more specific indices to particular elements. I don't think it's unreasonable to at least permit such an index to be created optionally.

Again, I agree that an optional index might be a nice addition, but it takes some time to implement it (and, by experience, using positional predicates will often do the trick as well). Btw, everyone is free to contribute to our project by either branching and extending our code, or speeding up the development of certain features by supporting us financially (which some people have done already)..

Thanks, Christian

NewIntellectual

11:41 p.m.

I was thinking about the element indexing issue and a possible quick solution comes to mind. It would be something of a hideous hack, but I suspect that it would be much easier to implement than a new kind of index.

The idea would be (if this indexing option were selected) to invisibly add a new attribute to each element; perhaps named "_elname_", something which is unlikely to clash with a normally used name (of course the name should be documented and known to avoid). The attribute would then be defined as:

where ELEMENT was any particular string. e.g. every <page ...> would become <page _elname_="page" ...>

If my thought is correct, then every reference to an element could then be invisibly replaced with:

ELEMENT[@_elname_="ELEMENT"]

which is effectively an identity assuming that the attribute exists for each element. This would seem to work whether the element is referenced alone or with additional predicates. The goal here would be to use the existing attribute indices to speed up finding those elements.

I haven't tried it yet with the Wikipedia example but I have noticed that searches which do use indices execute very rapidly, as would be expected.

There may be some lethal flaw, but I'd like the opinions of the BaseX developers on this approach.

NewIntellectual

11:46 p.m.

One more thought on the whole issue of database size, particularly indices: I would say that time is a far greater consideration than disk space now. A terabyte is cheap; but there is no way to add more seconds per second.

Jan Vlčinský (CAD)

1 Mar 1 Mar

1:11 a.m.

Hi I also agree, that speed is always strongly demanded commodity. However, learning from XML database implementations, I would be careful with taking enough space as granted. My original preferred XML database implementation was Sedna XML database, but space requirements were sometime too high like 100 times the space of original docs. They were giving some hints, how to make it smaller by restructuring used XML structures, but this completely destroys the beauty of XML databases offering solution for just existing documents one has at the moment without any additional constrains.

Jan

2011/3/1 NewIntellectual newintellectual@gmail.com

...

One more thought on the whole issue of database size, particularly indices: I would say that time is a far greater consideration than disk space now. A terabyte is cheap; but there is no way to add more seconds per second. _______________________________________________ BaseX-Talk mailing list BaseX-Talk@mailman.uni-konstanz.de https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk

Christian Grün

6:13 a.m.

...this is what I can confirm. An optional inclusion of direct element nodes references might indeed be the best choice to keep BaseX database instances light-weight (exactly for this reason, the full-text index is optional as well). But we'll at least have another look at the addition of pos. preds in short term.

Christian __________________________

On Tue, Mar 1, 2011 at 7:11 AM, Jan Vlčinský (CAD) jan.vlcinsky@cad-programs.com wrote:

...

Hi I also agree, that speed is always strongly demanded commodity. However, learning from XML database implementations, I would be careful with taking enough space as granted. My original preferred XML database implementation was Sedna XML database, but space requirements were sometime too high like 100 times the space of original docs. They were giving some hints, how to make it smaller by restructuring used XML structures, but this completely destroys the beauty of XML databases offering solution for just existing documents one has at the moment without any additional constrains. Jan

2011/3/1 NewIntellectual newintellectual@gmail.com

...
One more thought on the whole issue of database size, particularly indices: I would say that time is a far greater consideration than disk space now. A terabyte is cheap; but there is no way to add more seconds per second. _______________________________________________ BaseX-Talk mailing list BaseX-Talk@mailman.uni-konstanz.de https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk

-- Ing. Jan Vlčinský CAD programy Slunečnicová 338/3, 734 01 Karviná Ráj, Czech Republic tel: +420-597 602 024; mob: +420-608 979 040 skype: janvlcinsky; GoogleTalk: jan.vlcinsky@gmail.com http://cz.linkedin.com/in/vlcinsky

Jan Vlčinský (CAD)

7:21 a.m.

There are few things about speed

- simplicity default setup shall satisfy initial use of the database - need for production optimization if is the query serving every day, then there is the rule, optimization step will come later or sooner. - required additional index resouces like time and CPU to generate or update them, space requirements...

I like BaseX for the simplicity to start and use it. Optimization is often not so much difficult or needed, but wiki entry about optimization would be good tool. Extra indexing options too.

Jan

2011/3/1 Christian Grün christian.gruen@gmail.com

...

...this is what I can confirm. An optional inclusion of direct element nodes references might indeed be the best choice to keep BaseX database instances light-weight (exactly for this reason, the full-text index is optional as well). But we'll at least have another look at the addition of pos. preds in short term.

Christian __________________________

On Tue, Mar 1, 2011 at 7:11 AM, Jan Vlčinský (CAD) jan.vlcinsky@cad-programs.com wrote:

...
Hi I also agree, that speed is always strongly demanded commodity. However, learning from XML database implementations, I would be careful

with

...
taking enough space as granted. My original preferred XML database implementation was Sedna XML database, but space requirements were sometime too high like 100 times the space of original docs. They were giving some hints, how to make it smaller by restructuring used XML structures, but this completely destroys the

beauty

...
of XML databases offering solution for just existing documents one has at the moment without any additional constrains. Jan

2011/3/1 NewIntellectual newintellectual@gmail.com

...
One more thought on the whole issue of database size, particularly indices: I would say that time is a far greater consideration than disk space now. A terabyte is cheap; but there is no way to add more seconds

per

...
...
second. _______________________________________________ BaseX-Talk mailing list BaseX-Talk@mailman.uni-konstanz.de https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk

-- Ing. Jan Vlčinský CAD programy Slunečnicová 338/3, 734 01 Karviná Ráj, Czech Republic tel: +420-597 602 024; mob: +420-608 979 040 skype: janvlcinsky; GoogleTalk: jan.vlcinsky@gmail.com http://cz.linkedin.com/in/vlcinsky

Godmar Back

7:49 a.m.

Let me ask a naive question.

Christian Grün

9:30 a.m.

...

[...] BaseX is not required to output XML that preserves prefixes and default namespaces when a query is made; any equivalent XML that expresses the query result is acceptable.

5252

Age (days ago)

5254

Last active (days ago)

basex-talk@mailman.uni-konstanz.de

11 comments

4 participants

tags (0)

participants (4)

Christian Grün
Godmar Back
Jan Vlčinský (CAD)
NewIntellectual