Hi all -
An interesting XQuery question came up on reddit[1] over the weekend, and I'm curious about a couple of aspects of the problem.
To recreate, the sample data is at timecenters[2] in the employeeTemporalDataset.zip download (specifically, the `departments.xml` and the three compressed directories of XML in `employees.tar.gz`). I created a database from employees.tar.gz (thanks for letting us parse XML in archives!), added the `departments.xml`, and then `Optimized All`.
The initial query was ``` for $emp in /employees/employee[@tend = '9999-01-01'] let $curdept := $emp/deptno[@tend = '9999-01-01'] return $emp/lastname || " " || $curdept || " " || /departments/department[deptno = $curdept]/deptname ``` and it is slow (~100 minutes, 5999201.28 ms). The original poster came back later with a modified query that is significantly faster[3], but I was mostly wondering about the whys of the slowness. I think (but am not sure) that this is a join, and since the initial `for` binding ($emp) is pretty big (~240K), the processor has to parse through the $emp result sequence and match to values in the /departments part of the database; is that a correct assumption?
Again, I'm assuming that in the second, faster, query, the processor has two sequences ($d and $e) and is able to pull the joined data together much more quickly?
Thanks in advance for any light you can shed on these questions. Best, Bridger
[1] https://www.reddit.com/r/xml/comments/c3mb86/simple_xquery_optimization/ [2] http://timecenter.cs.aau.dk/software.htm [3] improved query: ``` for $d in /departments/department for $e in /employees/employee[deptno[@tend='9999-01-01'] = $d/deptno] return $e/lastname || " " || $d/deptno || " " || $d/deptname ```
Hi Bridger,
Thanks for the pointer to the challenging query.
[…] I was mostly wondering about the whys of the slowness. I think (but am not sure) that this is a join, and since the initial `for` binding ($emp) is pretty big (~240K), the processor has to parse through the $emp result sequence and match to values in the /departments part of the database; is that a correct assumption?
Exactly! The query will be evaluated much faster if the final departments lookup will be removed.
The main reason for the slowness is that the departments document is just one of the 15.000 documents in the database, and each document will be checked for a “departments” root element (and this will happen 240.124 times). Well, to be more precise, the last path expression will actually be rewritten to an index lookup (see the Info View for more details):
db:text("employees", $curdept_1)/parent::*:deptno/parent::*:department/*:deptname)
The problem is a similar one: The full database index returns a lot of text nodes for the looked up string, which will all need to be matched against the specified path (just the inverse way round).
The proposed solution is a clever one indeed, because it chooses to minimize the total number of comparisons in the join. And there are several other ways to speed up the query (I’m just listing three):
1. The departments document can be stored in a separate database:
let $tend := '9999-01-01' for $emp in db:open('employees')/employees/employee[@tend = $tend] let $curdept := $emp/deptno[@tend = $tend] let $deptname := db:open('departments')/departments/department[deptno = $curdept]/deptname return $emp/lastname || " " || $curdept || " " || $deptname
If the performance is not good enough, the indexes can be further tuned for each single database (e. g. by only indexing @tend attributes in the employees database).
2. If you want to stick to a single database, you can explicitly address the single departments.xml document:
let $tend := '9999-01-01' for $emp in db:open('employees')/employees/employee[@tend = $tend] let $curdept := $emp/deptno[@tend = $tend] let $deptname := db:open('employees', 'departments.xml')/departments/department[deptno eq $curdept]/deptname return $emp/lastname || " " || $curdept || " " || $deptname
There is one more hidden optimization here. I used "eq" instead of "=":
• In the given comparison, we compare the results of a path expression (deptno) and a variable reference ($curdept). • Value comparisons (eq, ne, etc.) result in an error if one of the operands yields more than one item. • At compile time, however, we cannot tell if both operands will yield at most a single item. • An index rewriting would possibly suppress errors caused by the original expression. • Thus, we do not rewrite the expression to an index lookup.
I’m suppressing the index lookup as the sequential way of processing the path will be faster (it would take much more time to evaluate the inversed index path). – There is currently no way to disable index rewritings explicitly; maybe I’ll introduce this in a future version.
3. Another solution is to use a map:
let $deptnames := map:merge( /departments/department/map:entry(deptno, data(deptname)) ) let $tend := '9999-01-01' for $emp in /employees/employee[@tend = $tend] let $curdept := $emp/deptno[@tend = $tend] return $emp/lastname || " " || $curdept || " " || $deptnames($curdept)
This query will be evaluated faster than the others: All department names are cached in advance and won’t lead to additional database traversals. I assume it will be tough to get this one faster… but maybe I’m wrong?
Hope this helps, Christian
PS: I think it would be pretty hard to rewrite this query automatically, as the rewriting would be based on various fuzzy heuristics. But who knows…
basex-talk@mailman.uni-konstanz.de