On Fri, Oct 25, 2019 at 3:41 PM Bridger Dyson-Smith <bdysonsmith@gmail.com> wrote:

Hi Julia -

Preface: let me be clear when I say that I've wondered about some of this myself, so I don't think I have an answer for you. That being said, I wonder if this is a grouping/data modeling problem: i.e. you have 5,000 aggregations refer to 1 of 3 web resources vs ~7,000 aggregations each with their own distinct web resource.

If you created 3 databases for "abc" ( hm... "a", "b", and "c"? ☺), one for each web resource (i.e. where there would be only aggregations-to-the-specific-web-resource), would that help with query times at all? It might necessitate a bit of pre-processing in your creation step though.

In any event, I hope those random thoughts are helpful in some way.
Best,
Bridger

On Fri, Oct 25, 2019 at 10:24 AM Beck, Julia <J.Beck@ub.uni-frankfurt.de> wrote:
Hi,

first of all: thank you, the fix for [1] did the trick and in 9.2.4 the
query is working as expected.
Today, I come back to you with another challenge in performance which
again seems to have something to do with indexing(?). So here's the
situation:
I have two databases "abc" and "def". "abc" contains 1 xml doc with
about 150.000 nodes and "def" contains 1 xml doc with about 400.000
nodes. Both are similarly strutured and have their up-to-date text and
attr indexes. The xml docs look both (simplified) like the following:

<rdf:RDF>
<ore:Aggregation rdf:about="123">
<edm:object rdf:resource="urn1"/>
<...>
</ore:Aggregation>
<edm:WebResource rdf:about="urn1">
<...>
</edm:WebResource>
<ore:Aggregation rdf:about="124">
<edm:object rdf:resource="urn2"/>
<...>
</ore:Aggregation>
<edm:WebResource rdf:about="urn2">
<...>
</edm:WebResource>
<ore:Aggregation rdf:about="125">
<edm:object rdf:resource="urn2"/>
<edm:object rdf:resource="urn3"/>
<...>
</ore:Aggregation>
<edm:WebResource rdf:about="urn3">
<...>
</edm:WebResource>
<...>
</rdf:RDF>

So one aggregation refers to one (or more) web resources. I boiled down
my original query to the following purpose to keep it simple: for each
aggregation give me the corresponding web resource.

for $agg in db:open($db_name)/rdf:RDF/ore:Aggregation
return for $urn in $agg/edm:object/@rdf:resource
return (# db:enforceindex #)
{db:open($db_name)/rdf:RDF/edm:WebResource[@rdf:about=$urn]}

For both databases the query gives me the required result and the query
info tells me that the attribute index for $urn is applied in both
cases (this is also the case if I leave out the pragma). However, oddly
enough, for the "larger" database "def" with a larger attribute index
it takes roughly 1 second while the "smaller" database "abc" with a
smaller attribute index takes 20 seconds. This is not very long but the
original query is more complicated and I have bigger databases with the
same structure where it starts to matter.

The only (and I think important) difference between "abc" and "def" is
that "abc" contains only 3 web resources that all 5.000 aggregations
refer to. While in "def" each aggregation refers to a particular web
resource (== 7.000 aggregations and 7.000 web resources).

With index:facets I had a look at the facet values and learned that
there is a "maximum number of distinct values to store per name". Is
there a difference in performance because of that? Maybe I do not get
the index structures but it feels strange that it takes longer to find
the correct attribute in a range of 3 different values than in a range
of 7.000. Maybe there is also another problem in my query, databases or
my reasoning that I do not see? Either way, I need help in
understanding this phenomenon :-)

I hope you could follow, please don't hesitate to ask if you need
anything to reproduce this situation (I am using BaseX 9.2.4).

Julia

[1]
https://mailman.uni-konstanz.de/pipermail/basex-talk/2019-July/014511.html