Hello everyone
I’ve recently started to use and stumbled upon a problem with the indices in basex that I do not fully understand and was hoping for some clarification.
So basically I’ve got a bunch of xml files in my database with medical ICD-10 codes that all look like this:
<classification name=„some-name“ year=„2017“ > <code> <value>03R80JZ</value> <final>true</final> </code> <code> <value>2W3JX1Z</value> <final>false</final> </code> … ... </classification>
The db can contain multiple of these files, one for each name and year, and each classification can contain up to 100k of those codes (all values are unique). What I’d like to do now is, given a list of codes, check if those codes actually exist in the classification for a given name and source. So I came up with this straight-forward piece of code:
let $proc := ('F0723EZ','0JWS37Z','0W054JZ','0NQ00ZZ','03R80JZ','0PW537Z','0RW940Z','0PPD0KZ','2W3JX1Z','0D9B8ZZ','0SWC30Z','0PRJ4KZ','025R0ZZ','0T1347C','0S950ZX','008J4ZZ','0D110K4','0XUV4JZ','0GTN4ZZ','4A133J1’) for $p in $proc return /classification[@name=‚some-name‘ and @year=2017]/code[value = $p]
This does work fine, however it takes roughly one second to evaluate the query, and in the Query Info Window I can see that the text index of the db is not being used. However I do see that the attribute index is being queried. Now I noticed that, if I drop the attribute index, the query is evaluated basically instantaneous and can see that the query is rewritten to use the text index as well.
So my question is:
Is this normal, i.e. is it not possible to use multiple indices in one query at once? Or is this a problem where the „sub-query“ …./code[value = $p] is only evaluated on the result of the first query, thus the text index is not being considered? If so, is there a way (without rewriting the whole query) to make basex use both indices when compiling the query?
Thanks a lot already and all the best
Clemens Müthing
Dear Clemens,
Welcome to the list, and thanks for the helpful example.
BaseX tries to select the index structure that yields the fastest results. As the input for $p is variable, cost estimation is conservative here, and the attribute index is chosen instead.
However, you can explicitly access the text index (see [1] for more details) and traverse your path in reverse order:
let $proc := ('F0723EZ',...) for $p in $proc return db:text('classification', $p)/ parent::value/ parent::code/ parent::classification[@name='some-name' and @year=2017]
Hope this helps? Christian
[1] http://docs.basex.org/wiki/Database_Module#db:text
On Wed, Jan 25, 2017 at 11:13 AM, Clemens Müthing clemensm@posteo.de wrote:
Hello everyone
I’ve recently started to use and stumbled upon a problem with the indices in basex that I do not fully understand and was hoping for some clarification.
So basically I’ve got a bunch of xml files in my database with medical ICD-10 codes that all look like this:
<classification name=„some-name“ year=„2017“ > <code> <value>03R80JZ</value> <final>true</final> </code> <code> <value>2W3JX1Z</value> <final>false</final> </code> … ... </classification>
The db can contain multiple of these files, one for each name and year, and each classification can contain up to 100k of those codes (all values are unique). What I’d like to do now is, given a list of codes, check if those codes actually exist in the classification for a given name and source. So I came up with this straight-forward piece of code:
let $proc := ('F0723EZ','0JWS37Z','0W054JZ','0NQ00ZZ','03R80JZ','0PW537Z','0RW940Z','0PPD0KZ','2W3JX1Z','0D9B8ZZ','0SWC30Z','0PRJ4KZ','025R0ZZ','0T1347C','0S950ZX','008J4ZZ','0D110K4','0XUV4JZ','0GTN4ZZ','4A133J1’) for $p in $proc return /classification[@name=‚some-name‘ and @year=2017]/code[value = $p]
This does work fine, however it takes roughly one second to evaluate the query, and in the Query Info Window I can see that the text index of the db is not being used. However I do see that the attribute index is being queried. Now I noticed that, if I drop the attribute index, the query is evaluated basically instantaneous and can see that the query is rewritten to use the text index as well.
So my question is:
Is this normal, i.e. is it not possible to use multiple indices in one query at once? Or is this a problem where the „sub-query“ …./code[value = $p] is only evaluated on the result of the first query, thus the text index is not being considered? If so, is there a way (without rewriting the whole query) to make basex use both indices when compiling the query?
Thanks a lot already and all the best
Clemens Müthing
Hello Christian
Thanks, that was helpful. I also had the idea to rewrite the query so that the text index is being selected, which makes the query basically instantaneous.
let $proc := ('F0723EZ‘,’... for $p in $proc let $codes := /classification/code[value = $p] for $code in $codes where $code/..[@year = 2017 and @name = ’some-name'] return $code
Which brings up another question: I guess that in this case the query planer uses the text index (as indicated in the Query Info as well), and then only needs to check on the parents of the few results returned by the first query if they satisfy the where clause. So in this case only the text index is being used, but not the attribute index. Would it theoretically be possible to write the query in a way that both the text index and the attribute index are being used (like a multi-column index in SQL), or would I have to create a custom index for this?
Thanks
Clemens
Am 25.01.2017 um 11:51 schrieb Christian Grün christian.gruen@gmail.com:
Dear Clemens,
Welcome to the list, and thanks for the helpful example.
BaseX tries to select the index structure that yields the fastest results. As the input for $p is variable, cost estimation is conservative here, and the attribute index is chosen instead.
However, you can explicitly access the text index (see [1] for more details) and traverse your path in reverse order:
let $proc := ('F0723EZ',...) for $p in $proc return db:text('classification', $p)/ parent::value/ parent::code/ parent::classification[@name='some-name' and @year=2017]
Hope this helps? Christian
[1] http://docs.basex.org/wiki/Database_Module#db:text
On Wed, Jan 25, 2017 at 11:13 AM, Clemens Müthing clemensm@posteo.de wrote:
Hello everyone
I’ve recently started to use and stumbled upon a problem with the indices in basex that I do not fully understand and was hoping for some clarification.
So basically I’ve got a bunch of xml files in my database with medical ICD-10 codes that all look like this:
<classification name=„some-name“ year=„2017“ > <code> <value>03R80JZ</value> <final>true</final> </code> <code> <value>2W3JX1Z</value> <final>false</final> </code> … ... </classification>
The db can contain multiple of these files, one for each name and year, and each classification can contain up to 100k of those codes (all values are unique). What I’d like to do now is, given a list of codes, check if those codes actually exist in the classification for a given name and source. So I came up with this straight-forward piece of code:
let $proc := ('F0723EZ','0JWS37Z','0W054JZ','0NQ00ZZ','03R80JZ','0PW537Z','0RW940Z','0PPD0KZ','2W3JX1Z','0D9B8ZZ','0SWC30Z','0PRJ4KZ','025R0ZZ','0T1347C','0S950ZX','008J4ZZ','0D110K4','0XUV4JZ','0GTN4ZZ','4A133J1’) for $p in $proc return /classification[@name=‚some-name‘ and @year=2017]/code[value = $p]
This does work fine, however it takes roughly one second to evaluate the query, and in the Query Info Window I can see that the text index of the db is not being used. However I do see that the attribute index is being queried. Now I noticed that, if I drop the attribute index, the query is evaluated basically instantaneous and can see that the query is rewritten to use the text index as well.
So my question is:
Is this normal, i.e. is it not possible to use multiple indices in one query at once? Or is this a problem where the „sub-query“ …./code[value = $p] is only evaluated on the result of the first query, thus the text index is not being considered? If so, is there a way (without rewriting the whole query) to make basex use both indices when compiling the query?
Thanks a lot already and all the best
Clemens Müthing
Hi Clemens,
I guess that in this case the query planer uses the text index (as indicated in the Query Info as well), and then only needs to check on the parents of the few results returned by the first query if they satisfy the where clause.
Exactly.
Would it theoretically be possible to write the query in a way that both the text index and the attribute index are being used
Absolutely, it could e.g. look like this:
let $texts := db:text('classification', ('F0723EZ', '...'))/ parent::value/parent::code/parent::classification let $attributes := db:attribute('classification', 'some-name')/ parent::classification[@year = '2017'] return $texts intersect $attributes
Cheers, Christian
Hi Christian
Thanks, didn’t think of that option :-)
Greetings
Clemens
Am 27.01.2017 um 19:27 schrieb Christian Grün christian.gruen@gmail.com:
Hi Clemens,
I guess that in this case the query planer uses the text index (as indicated in the Query Info as well), and then only needs to check on the parents of the few results returned by the first query if they satisfy the where clause.
Exactly.
Would it theoretically be possible to write the query in a way that both the text index and the attribute index are being used
Absolutely, it could e.g. look like this:
let $texts := db:text('classification', ('F0723EZ', '...'))/ parent::value/parent::code/parent::classification let $attributes := db:attribute('classification', 'some-name')/ parent::classification[@year = '2017'] return $texts intersect $attributes
Cheers, Christian
basex-talk@mailman.uni-konstanz.de