Giuseppe G. A. Celano

28 Nov 2019

1:45 a.m.

Hi,

I have the following query:

count(
  for $r in doc("hib_parses.xml")//row
  let $i := doc("hib_lemmas.xml")//row[field[@name="lemma_lang_id"][. = "3"]]
  where $r/field[@name="lemma_id"] = $i/field[@name="lemma_id"]
  return $r
)

I have noticed that the where clause needs to be changed to $r/field[@name="lemma_id"]/text() = $i/field[@name="lemma_id"]/text() in order to get a result (otherwise the query never seems to finish). I am wondering whether this is a BaseX issue, since I would assume that the two forms of the where clause are equivalent (because of atomization). I have also noticed that appending /data() instead of /text() does not work either. Thanks!
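
For reference, the variant with text() steps described above can be written out in full as follows. This is only a sketch assembled from the original query and the modified where clause; nothing else is changed, and it has not been re-run here:

```xquery
(: Same query as above; only the where clause carries explicit text() steps :)
count(
  for $r in doc("hib_parses.xml")//row
  let $i := doc("hib_lemmas.xml")//row[field[@name="lemma_lang_id"][. = "3"]]
  where $r/field[@name="lemma_id"]/text() = $i/field[@name="lemma_id"]/text()
  return $r
)
```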

Best, Giuseppe

Christian Grün

28 Nov 2019

8:05 a.m.

Could you additionally share some sample data with us, or indicate the skeleton/schema of your database documents?

Thanks in advance, Christian

Christian Grün

5:54 p.m.

Hi Giuseppe,

Thanks for passing your data sets on to me. Some background information:

• If you look at the query info, you’ll see that your query won’t be rewritten for index access.

• Without index access, your query will need to perform the impressive number of 1440254 * 17573 ≈ 25 billion comparisons.

• The optimized version of the query with text() steps can be evaluated much faster, as it utilizes both the text and the attribute index:

db:text("hib_parses", db:attribute("hib_lemmas", "lemma_id") ..../parent::row)

• A and A/text() cannot be treated identically by the query processor: an element node may have more than one text node (an example: <A>a<_/>b</A>). The atomized result of A will always be a single value, whereas A/text() will give you two values in this example.

• In some cases, the optimizer will implicitly add text() steps to path expressions if a) it is possible at compile time to determine that the elements addressed by a given step have only single text nodes, and b) the query will not yield different results. In the next step, paths with trailing text() steps may then be rewritten for index access.

• Some optimizations are restricted to documents without namespaces. Adding the text() step is one of them, so this could be the reason why you need to add this step manually.
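
To make the difference between atomization and text() steps concrete, here is a small self-contained sketch based on the <A>a<_/>b</A> example above. It uses only standard XQuery and an inline element rather than the data sets discussed in this thread; the values in the comments follow from the XQuery atomization rules:

```xquery
(: Element with mixed content: two text nodes separated by an empty child :)
let $a := <A>a<_/>b</A>
return (
  count(data($a)),    (: 1: atomization yields the single value "ab" :)
  count($a/text()),   (: 2: the two text nodes "a" and "b" :)
  $a = 'ab',          (: true: the atomized element equals "ab" :)
  $a/text() = 'ab'    (: false: neither text node equals "ab" :)
)
```

Because the two comparisons can disagree for such elements, a processor can only swap one form for the other when it can rule out mixed content of this kind.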

Hope this helps, Christian

PS: I will see if there’s a chance to enable the discussed optimization for documents with namespaces.

Giuseppe G. A. Celano

29 Nov 2019

7:23 p.m.

Hi Christian,

Thank you very much for this detailed explanation! If I understand correctly, index access, which makes everything faster, is an optimization that is independent of XQuery per se. This explains why it is applied only under certain circumstances, regardless of whether two XQuery expressions are supposed to return the same result. Thanks.

Best, Giuseppe

basex-talk@mailman.uni-konstanz.de
