Julia, all,
apologies - I hit 'send' a bit too quickly. I wanted to ask how you were ingesting RDF, what the size of your input looked like, and how you were dealing with namespaces. I have some RDF/MADS data from the Library of Congress and I've never been able to ingest it into BaseX: even after stripping namespaces, etc., things seem to run out of memory or stall.
Would you be willing to share some details about your data?
Thanks very much. Best, Bridger
On Fri, Oct 25, 2019 at 3:41 PM Bridger Dyson-Smith bdysonsmith@gmail.com wrote:
Hi Julia -
Preface: let me be clear when I say that I've wondered about some of this myself, so I don't think I have an answer for you. That being said, I wonder if this is a grouping/data modeling problem: i.e. you have 5,000 aggregations that each refer to 1 of 3 web resources vs ~7,000 aggregations, each with its own distinct web resource.
If you created 3 databases for "abc" ( hm... "a", "b", and "c"? ☺), one for each web resource (i.e. where there would be only aggregations-to-the-specific-web-resource), would that help with query times at all? It might necessitate a bit of pre-processing in your creation step though.
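A related idea that would not need separate databases, as an untested sketch: group the references by the URN they point to, so each web resource is looked up once per distinct URN (3 times in "abc") instead of once per aggregation. The namespace URIs, the "abc" value for $db_name, and the <match> wrapper element are my assumptions, not something taken from your data, and I can't promise it is actually faster in BaseX.

```xquery
(: Namespace URIs are assumed; use whatever your documents declare. :)
declare namespace rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
declare namespace ore = "http://www.openarchives.org/ore/terms/";
declare namespace edm = "http://www.europeana.eu/schemas/edm/";

(: Assumed database name, taken from the example in this thread. :)
declare variable $db_name := "abc";
(: Bound globally so the group by clause below does not rebind it. :)
declare variable $root := db:open($db_name)/rdf:RDF;

(: Collect every edm:object reference and group by the URN it points to. :)
for $ref in $root/ore:Aggregation/edm:object/@rdf:resource
group by $urn := string($ref)
(: One lookup per distinct URN rather than one per aggregation. :)
let $resource := $root/edm:WebResource[@rdf:about = $urn]
return
  (: <match> is a hypothetical wrapper, only there to show the grouping. :)
  <match urn="{ $urn }" references="{ count($ref) }">{ $resource }</match>
```

Note that this returns each web resource once, together with a count of the references to it, so the result shape differs from your original query, which returns the web resource again for every reference.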
In any event, I hope those random thoughts are helpful in some way. Best, Bridger
On Fri, Oct 25, 2019 at 10:24 AM Beck, Julia J.Beck@ub.uni-frankfurt.de wrote:
Hi,
first of all: thank you, the fix for [1] did the trick and in 9.2.4 the query is working as expected. Today, I come back to you with another performance challenge, which again seems to have something to do with indexing(?). So here's the situation: I have two databases "abc" and "def". "abc" contains 1 xml doc with about 150.000 nodes and "def" contains 1 xml doc with about 400.000 nodes. Both are similarly structured and have up-to-date text and attribute indexes. Both xml docs look (simplified) like the following:
<rdf:RDF>
  <ore:Aggregation rdf:about="123">
    <edm:object rdf:resource="urn1"/>
    <...>
  </ore:Aggregation>
  <edm:WebResource rdf:about="urn1">
    <...>
  </edm:WebResource>
  <ore:Aggregation rdf:about="124">
    <edm:object rdf:resource="urn2"/>
    <...>
  </ore:Aggregation>
  <edm:WebResource rdf:about="urn2">
    <...>
  </edm:WebResource>
  <ore:Aggregation rdf:about="125">
    <edm:object rdf:resource="urn2"/>
    <edm:object rdf:resource="urn3"/>
    <...>
  </ore:Aggregation>
  <edm:WebResource rdf:about="urn3">
    <...>
  </edm:WebResource>
  <...>
</rdf:RDF>
So one aggregation refers to one (or more) web resources. I boiled down my original query to the following purpose to keep it simple: for each aggregation give me the corresponding web resource.
for $agg in db:open($db_name)/rdf:RDF/ore:Aggregation
return
  for $urn in $agg/edm:object/@rdf:resource
  return (# db:enforceindex #) {
    db:open($db_name)/rdf:RDF/edm:WebResource[@rdf:about=$urn]
  }
For both databases the query gives me the required result and the query info tells me that the attribute index for $urn is applied in both cases (this is also the case if I leave out the pragma). However, oddly enough, for the "larger" database "def" with a larger attribute index it takes roughly 1 second while the "smaller" database "abc" with a smaller attribute index takes 20 seconds. This is not very long but the original query is more complicated and I have bigger databases with the same structure where it starts to matter.
The only (and I think important) difference between "abc" and "def" is that "abc" contains only 3 web resources that all 5.000 aggregations refer to. While in "def" each aggregation refers to a particular web resource (== 7.000 aggregations and 7.000 web resources).
With index:facets I had a look at the facet values and learned that there is a "maximum number of distinct values to store per name". Is there a difference in performance because of that? Maybe I do not get the index structures but it feels strange that it takes longer to find the correct attribute in a range of 3 different values than in a range of 7.000. Maybe there is also another problem in my query, databases or my reasoning that I do not see? Either way, I need help in understanding this phenomenon :-)
I hope you could follow, please don't hesitate to ask if you need anything to reproduce this situation (I am using BaseX 9.2.4).
Julia
[1] https://mailman.uni-konstanz.de/pipermail/basex-talk/2019-July/014511.html