[basex-talk] Understanding the path index

25 Oct 2019


      Hi,
first of all: thank you, the fix for [1] did the trick and in 9.2.4 the
query is working as expected. 
Today, I come back to you with another challenge in performance which
again seems to have something to do with indexing(?). So here's the
situation:
I have two databases "abc" and "def". "abc" contains 1 xml doc with
about 150.000 nodes and "def" contains 1 xml doc with about 400.000
nodes. Both are similarly strutured and have their up-to-date text and
attr indexes. The xml docs look both (simplified) like the following:
rdf:RDF
  <ore:Aggregation rdf:about="123">
     <edm:object rdf:resource="urn1"/>
     <...>
  </ore:Aggregation>
  <edm:WebResource rdf:about="urn1">
     <...>
  </edm:WebResource>
  <ore:Aggregation rdf:about="124">
     <edm:object rdf:resource="urn2"/>
     <...>
  </ore:Aggregation>
  <edm:WebResource rdf:about="urn2">
     <...>
  </edm:WebResource>
  <ore:Aggregation rdf:about="125">
     <edm:object rdf:resource="urn2"/>
     <edm:object rdf:resource="urn3"/>
     <...>
  </ore:Aggregation>
  <edm:WebResource rdf:about="urn3">
     <...>
  </edm:WebResource>
  <...>
</rdf:RDF>
So one aggregation refers to one (or more) web resources. I boiled down
my original query to the following purpose to keep it simple: for each
aggregation give me the corresponding web resource.
for $agg in db:open($db_name)/rdf:RDF/ore:Aggregation
   return for $urn in $agg/edm:object/@rdf:resource
     return (# db:enforceindex #)
   {db:open($db_name)/rdf:RDF/edm:WebResource[@rdf:about=$urn]}
For both databases the query gives me the required result and the query
info tells me that the attribute index for $urn is applied in both
cases (this is also the case if I leave out the pragma). However, oddly
enough, for the "larger" database "def" with a larger attribute index
it takes roughly 1 second while the "smaller" database "abc" with a
smaller attribute index takes 20 seconds. This is not very long but the
original query is more complicated and I have bigger databases with the
same structure where it starts to matter.
The only (and I think important) difference between "abc" and "def" is
that "abc" contains only 3 web resources that all 5.000 aggregations
refer to. While in "def" each aggregation refers to a particular web
resource (== 7.000 aggregations and 7.000 web resources).
With index:facets I had a look at the facet values and learned that
there is a "maximum number of distinct values to store per name". Is
there a difference in performance because of that? Maybe I do not get
the index structures but it feels strange that it takes longer to find
the correct attribute in a range of 3 different values than in a range
of 7.000. Maybe there is also another problem in my query, databases or
my reasoning that I do not see? Either way, I need help in
understanding this phenomenon :-)
I hope you could follow, please don't hesitate to ask if you need
anything to reproduce this situation (I am using BaseX 9.2.4).
Julia
[1] 
https://mailman.uni-konstanz.de/pipermail/basex-talk/2019-July/014511.html

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

[basex-talk] Understanding the path index