Re: [basex-talk] Understanding the path index

25 Oct 2019

Julia, all,

apologies - I hit 'send' a bit too quickly. I wanted to ask how you were
ingesting RDF, what the size of your input looked like, and how you were
dealing with namespaces. I have some RDF/MADS data from the Library of
Congress and I've never been able to ingest it into BaseX: even after
stripping namespaces, etc., things seem to run out of memory or stall.

Would you be willing to share some details about your data?

Thanks very much.
Best,
Bridger

On Fri, Oct 25, 2019 at 3:41 PM Bridger Dyson-Smith <bdysonsmith@gmail.com>
wrote:
Hi Julia -
Preface: let me be clear when I say that I've wondered about some of this
myself, so I don't think I have an answer for you. That being said, I
wonder if this is a grouping/data modeling problem: i.e. you have 5,000
aggregations that each refer to 1 of 3 web resources vs ~7,000
aggregations, each with its own distinct web resource.
If you created 3 databases for "abc" ( hm... "a", "b", and "c"? ☺), one
for each web resource (i.e. where there would be only
aggregations-to-the-specific-web-resource), would that help with query
times at all? It might necessitate a bit of pre-processing in your creation
step though.
In any event, I hope those random thoughts are helpful in some way.
Best,
Bridger
On Fri, Oct 25, 2019 at 10:24 AM Beck, Julia <J.Beck@ub.uni-frankfurt.de>
wrote:
Hi,
first of all: thank you, the fix for [1] did the trick and in 9.2.4 the
query is working as expected.
Today, I come back to you with another challenge in performance which
again seems to have something to do with indexing(?). So here's the
situation:
I have two databases "abc" and "def". "abc" contains 1 xml doc with
about 150,000 nodes and "def" contains 1 xml doc with about 400,000
nodes. Both are similarly structured and have up-to-date text and
attr indexes. Both xml docs look (simplified) like the following:
<rdf:RDF>
  <ore:Aggregation rdf:about="123">
     <edm:object rdf:resource="urn1"/>
     <...>
  </ore:Aggregation>
  <edm:WebResource rdf:about="urn1">
     <...>
  </edm:WebResource>
  <ore:Aggregation rdf:about="124">
     <edm:object rdf:resource="urn2"/>
     <...>
  </ore:Aggregation>
  <edm:WebResource rdf:about="urn2">
     <...>
  </edm:WebResource>
  <ore:Aggregation rdf:about="125">
     <edm:object rdf:resource="urn2"/>
     <edm:object rdf:resource="urn3"/>
     <...>
  </ore:Aggregation>
  <edm:WebResource rdf:about="urn3">
     <...>
  </edm:WebResource>
  <...>
</rdf:RDF>
So one aggregation refers to one (or more) web resources. I boiled down
my original query to the following purpose to keep it simple: for each
aggregation give me the corresponding web resource.
for $agg in db:open($db_name)/rdf:RDF/ore:Aggregation
return
  for $urn in $agg/edm:object/@rdf:resource
  return (# db:enforceindex #) {
    db:open($db_name)/rdf:RDF/edm:WebResource[@rdf:about = $urn]
  }
For both databases the query gives me the required result and the query
info tells me that the attribute index for $urn is applied in both
cases (this is also the case if I leave out the pragma). However, oddly
enough, for the "larger" database "def" with a larger attribute index
it takes roughly 1 second while the "smaller" database "abc" with a
smaller attribute index takes 20 seconds. This is not very long but the
original query is more complicated and I have bigger databases with the
same structure where it starts to matter.
The only (and I think important) difference between "abc" and "def" is
that "abc" contains only 3 web resources that all 5,000 aggregations
refer to, while in "def" each aggregation refers to its own web
resource (== 7,000 aggregations and 7,000 web resources).
With index:facets I had a look at the facet values and learned that
there is a "maximum number of distinct values to store per name". Is
there a difference in performance because of that? Maybe I do not get
the index structures but it feels strange that it takes longer to find
the correct attribute among 3 distinct values than among 7,000
distinct values. Maybe there is also another problem in my query, databases or
my reasoning that I do not see? Either way, I need help in
understanding this phenomenon :-)
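One way to probe this, offered as a sketch rather than a confirmed
diagnosis: the attribute index is looked up by value and the attribute
name is checked afterwards, so a lookup for "urn1" may also return every
edm:object/@rdf:resource that carries the same value. The query below
counts, per referenced URN, how many attribute nodes the value lookup
returns and how many of those are actually rdf:about attributes. The
namespace URIs and the default database name are assumptions and need
to be adjusted to the real data.

```xquery
(: Diagnostic sketch: attribute-index hits per referenced URN.
   The namespace URIs below are assumed, not taken from the thread. :)
declare namespace rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
declare namespace ore = "http://www.openarchives.org/ore/terms/";
declare namespace edm = "http://www.europeana.eu/schemas/edm/";

(: Assumed default; bind to "def" for the comparison run. :)
declare variable $db_name as xs:string external := "abc";

for $urn in distinct-values(
  db:open($db_name)/rdf:RDF/ore:Aggregation/edm:object/@rdf:resource
)
(: All attribute nodes with this string value, whatever their name. :)
let $hits := db:attribute($db_name, $urn)
order by count($hits) descending
return
  <value urn="{ $urn }"
         hits="{ count($hits) }"
         about="{ count($hits/self::attribute(rdf:about)) }"/>
```

If "abc" reports thousands of hits per value but only one rdf:about
among them, every one of the 5,000 lookups would have to filter a long
candidate list, which could account for the difference; "def" would be
expected to show only a handful of hits per value.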
I hope you could follow, please don't hesitate to ask if you need
anything to reproduce this situation (I am using BaseX 9.2.4).
Julia
[1]
https://mailman.uni-konstanz.de/pipermail/basex-talk/2019-July/014511.html