Dear Christian,
Thank you, your suggestion is indeed 4 s faster on my machine than my code. This is quite impressive; I am now below 20 s. Not ideal, but a good start. If I try hard not to run this unless necessary, I am willing to leave it at that.
I also tried some of the ideas from your code, but with a traditional for loop. It looks like that is even faster:
declare namespace _ =
"https://www.oeaw.ac.at/acdh/tools/vle/util";
let $sorted-ascending := for $key in
collection('_qdb-TEI-02__cache')//*[@order="none"]/_:d/@vutlsk
order by data($key) ascending
return $key
let $sorted-ascending-archiv := for $key in
collection('_qdb-TEI-02__cache')//*[@order="none"]/_:d/@vutlsk-archiv
order by data($key) ascending
return $key
return (db:replace("_qdb-TEI-02__cache", 'ascending_cache.xml',
<_:dryed order="ascending"
ids="{string-join(subsequence($sorted-ascending, 1,
15000)/../(@ID, @xml:id), ' ')}"/>),
db:replace("_qdb-TEI-02__cache", 'descending_cache.xml',
<_:dryed order="descending"
ids="{string-join(subsequence(reverse($sorted-ascending), 1,
15000)/../(@ID, @xml:id), ' ')}"/>),
db:replace("_qdb-TEI-02__cache", 'ascending-archiv_cache.xml',
<_:dryed order="ascending" label="archiv"
ids="{string-join(subsequence($sorted-ascending-archiv, 1,
15000)/../(@ID, @xml:id), ' ')}"/>),
db:replace("_qdb-TEI-02__cache", 'descending-archiv_cache.xml',
<_:dryed order="descending" label="archiv"
ids="{string-join(subsequence(reverse($sorted-ascending-archiv),
1, 15000)/../(@ID, @xml:id), ' ')}"/>))
This takes only 11s on my machine.
One thing I think I also noticed previously: the parent axis is rather slow. Do you agree with that, or am I imagining something?
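If the parent axis really is the bottleneck, one idea I could try is to iterate over the _:d elements themselves and sort on the attribute, so the IDs are reachable by a child step instead of going back up from the key attribute. A sketch only (untested, using the same collection, element and attribute names as above):

```xquery
declare namespace _ =
  "https://www.oeaw.ac.at/acdh/tools/vle/util";

(: sort the _:d elements directly on @vutlsk, so @ID/@xml:id
   can be read without the parent axis :)
let $sorted-ds :=
  for $d in collection('_qdb-TEI-02__cache')//*[@order="none"]/_:d
  order by data($d/@vutlsk) ascending
  return $d
return string-join(
  subsequence($sorted-ds, 1, 15000)/(@ID, @xml:id), ' ')
```

Whether this is actually faster than sorting the attributes and stepping to `..` I cannot say without measuring.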
Some replies to your comments below:
Some spontaneous ideas:
• You could try to evaluate redundant expressions once and bind them to a variable instead (see the attached code).
I am not 100% sure which redundant expressions you saw in my code. Is this about using reverse() instead of having two for loops?
• You could save each document to a separate database via db:create (depending on your data, this may be faster than replacements in a single database), or save all new elements in a single document.
I tried that now, and it makes no difference whether I do a db:replace in _qdb-TEI-02__cache or create a separate db for each document with db:create. I already adjusted attrinclude so it ignores the ids attribute.
• Instead of creating full index structures with each update operation, you may save a lot of time if you only update parts of the data that have actually changed.
I thought about that but could not imagine how to do it. The most probable kind of change that affects the sort order is something like removing a space at the start, removing a "(", or changing the first letter. Any minimal update here would probably still mean sorting all 2.4 million entries.
• If that's close to impossible (because the types of updates are too manifold), you could work with daily databases that only contain incremental changes, and merge them with the main database every night.
I don't quite get how I would apply incremental changes to the entries ordered by a key. I do an incremental update by just fetching the updated pre values for the database that was changed. That is reasonably fast, even with incremental attribute index updates.
2,4 million tags are a lot, though; and the string length of the created attribute values seem to exceed 100.000 characters, which is a lot, too. What will you do with the resulting documents?
As I mentioned, this is a custom index over a set of databases containing 2.4 million TEI entry elements with data. There are more than 700 databases with about 3,500 entries each, and an update only ever touches one of them. That part is quite fast.
I was not sure what counts as a lot of data in BaseX. I have a feeling that my dataset is no longer medium-sized, but I don't know up to what size a dataset can be expected to give reasonable performance. I have to say that searching this data in BaseX has been a very fast and pleasant experience. It is just editing it entry by entry that is tricky.
These really big attribute string values are one part of a two-step lookup I want to use to implement a paging feature (at least over some of the 2.4 million entries).
The RESTXQ user can ask for a result page of 10, 25 or 100 entries and specify a page in alphabetical order of one of the sort keys. The worst case is that the user does not specify any other filter criteria. If she does, things are fast enough in all my realistic scenarios, which involve only one or two databases, so around 7,000 index entries. I implemented getting a page out of all entries with sorting and subsequence, but that means it takes 8 s or more for the first page to show. That is too long.
Using this code
declare namespace _ =
"https://www.oeaw.ac.at/acdh/tools/vle/util";
let $all :=
collection("_qdb-TEI-02__cache")//_:dryed[@order='ascending' and
not(@label)]/tokenize(@ids)
return db:attribute("_qdb-TEI-02__cache", subsequence($all, 1000,
25))[. instance of attribute(ID) or . instance of
attribute(xml:id)]/..!db:open-pre(./@db_name, ./@pre)
showing a page takes about 500 ms.
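To make the page arithmetic explicit, the lookup above could be wrapped in a small helper that maps a 1-based page number and page size to the subsequence offset. This is only a sketch; local:page, $page and $page-size are names I made up, the body follows the query above:

```xquery
declare namespace _ =
  "https://www.oeaw.ac.at/acdh/tools/vle/util";

(: hypothetical helper: return one page of entries from the
   cached, pre-sorted ID list; $page is 1-based :)
declare function local:page($page as xs:integer,
                            $page-size as xs:integer) {
  let $all :=
    collection("_qdb-TEI-02__cache")
      //_:dryed[@order = 'ascending' and not(@label)]/tokenize(@ids)
  let $start := ($page - 1) * $page-size + 1
  return db:attribute("_qdb-TEI-02__cache",
      subsequence($all, $start, $page-size))
    [. instance of attribute(ID) or . instance of attribute(xml:id)]
    /.. ! db:open-pre(@db_name, @pre)
};
```

local:page(1, 25) would then return the first 25 entries, and local:page(41, 25) corresponds to subsequence($all, 1001, 25).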
Best regards
Omar Siam