As both a stress test and an experiment, I created a database from a recent complete (current page) dump of English Wikipedia, a hefty 30.5 GB file. I apparently don't have enough memory to build a full-text index over all of that text, so I created the database without one.

My first tests came up empty until I realized that I needed to deal with the namespace (ugh). Then I tried:

declare default element namespace "http://www.mediawiki.org/xml/export-0.4/";

//siteinfo

The siteinfo element contains a small amount of data and occurs only once in the document (at /mediawiki/siteinfo). However, the query is extremely slow (~33 seconds on my system). The query plan is:

<Root/>

</IterPath>

Timing:

- Parsing: 0.35 ms

- Compiling: 0.22 ms

- Evaluating: 33316.32 ms

- Printing: 0.3 ms

- Total Time: 33317.19 ms

My guess is that millions of node names are being checked one by one, rather than a path index being used to jump straight to the matching node(s). Such a simple query should not fail to be optimized. Another guess is that the slowdown is related to namespaces not being indexed (?). Personally I very much dislike namespaces, but they are common and have to be handled efficiently.
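
One comparison that might help separate these two guesses, sketched here and not yet timed: address the element by its absolute path, so that only child steps from the root are requested instead of a descendant scan. This assumes the root element really is mediawiki in the same namespace, as the location given above suggests.

```xquery
declare default element namespace "http://www.mediawiki.org/xml/export-0.4/";

(: Two child steps from the document root; no descendant axis involved. :)
/mediawiki/siteinfo
```

If this returns quickly, the cost is presumably in how the descendant step (//) is evaluated rather than in the namespace handling itself.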

To see if it made a difference, I also tried an explicitly named namespace test:

declare namespace w="http://www.mediawiki.org/xml/export-0.4/";

//w:siteinfo

This results in:

<Root/>

</IterPath>

Timing:

- Parsing: 0.33 ms

- Compiling: 0.07 ms

- Evaluating: 54288.51 ms

- Printing: 0.3 ms

- Total Time: 54289.23 ms

So performance is even worse with the explicit prefix (~54 seconds versus ~33 seconds).
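
A further variation that may be worth trying is a namespace wildcard, which matches on the local name only and needs no namespace declaration. This is just a sketch of a test, not a measured result; the `*:siteinfo` form is standard XPath 2.0 name-test syntax.

```xquery
(: Match any element whose local name is siteinfo, in any namespace. :)
//*:siteinfo
```

Comparing its timing against the two queries above might show whether namespace resolution contributes to the cost at all.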