Hi,
first of all: thank you, the fix for [1] did the trick and in 9.2.4 the query is working as expected. Today, I come back to you with another challenge in performance which again seems to have something to do with indexing(?). So here's the situation: I have two databases "abc" and "def". "abc" contains 1 xml doc with about 150.000 nodes and "def" contains 1 xml doc with about 400.000 nodes. Both are similarly strutured and have their up-to-date text and attr indexes. The xml docs look both (simplified) like the following:
rdf:RDF <ore:Aggregation rdf:about="123"> <edm:object rdf:resource="urn1"/> <...> </ore:Aggregation> <edm:WebResource rdf:about="urn1"> <...> </edm:WebResource> <ore:Aggregation rdf:about="124"> <edm:object rdf:resource="urn2"/> <...> </ore:Aggregation> <edm:WebResource rdf:about="urn2"> <...> </edm:WebResource> <ore:Aggregation rdf:about="125"> <edm:object rdf:resource="urn2"/> <edm:object rdf:resource="urn3"/> <...> </ore:Aggregation> <edm:WebResource rdf:about="urn3"> <...> </edm:WebResource> <...> </rdf:RDF>
So one aggregation refers to one (or more) web resources. I boiled down my original query to the following purpose to keep it simple: for each aggregation give me the corresponding web resource.
for $agg in db:open($db_name)/rdf:RDF/ore:Aggregation return for $urn in $agg/edm:object/@rdf:resource return (# db:enforceindex #) {db:open($db_name)/rdf:RDF/edm:WebResource[@rdf:about=$urn]}
For both databases the query gives me the required result and the query info tells me that the attribute index for $urn is applied in both cases (this is also the case if I leave out the pragma). However, oddly enough, for the "larger" database "def" with a larger attribute index it takes roughly 1 second while the "smaller" database "abc" with a smaller attribute index takes 20 seconds. This is not very long but the original query is more complicated and I have bigger databases with the same structure where it starts to matter.
The only (and I think important) difference between "abc" and "def" is that "abc" contains only 3 web resources that all 5.000 aggregations refer to. While in "def" each aggregation refers to a particular web resource (== 7.000 aggregations and 7.000 web resources).
With index:facets I had a look at the facet values and learned that there is a "maximum number of distinct values to store per name". Is there a difference in performance because of that? Maybe I do not get the index structures but it feels strange that it takes longer to find the correct attribute in a range of 3 different values than in a range of 7.000. Maybe there is also another problem in my query, databases or my reasoning that I do not see? Either way, I need help in understanding this phenomenon :-)
I hope you could follow, please don't hesitate to ask if you need anything to reproduce this situation (I am using BaseX 9.2.4).
Julia
[1] https://mailman.uni-konstanz.de/pipermail/basex-talk/2019-July/014511.html