Executing optimized XQuery in RestXQ without having to deal with global lock situations - BaseX-Talk - mailman.uni-konstanz.de

2 May 2022


What I like about BaseX is that it is very good at optimizing 
self-contained queries about the size a user can read and understand [1] 
[2] and that it has a DB locking system for transaction management [3] 
that is robust and easy to understand.
What I don’t like so much about BaseX is that these two mechanisms don’t 
work very well with complex code that is split into various modules. I 
use modules for code that may be shared among projects or just as a 
means of grouping common concerns in one module.
That I don’t like this behaviour does not mean I know (or have any hope) 
that this can be solved in a better way without at least making 
unpleasant sacrifices elsewhere. It is just the setting I have to deal 
with.
When BaseX can no longer determine which DBs a query uses, it falls 
back on two assumptions: that there are no indexes, so automatic 
optimization in this regard stops, and that every DB known to BaseX is 
used in the query, so it acquires a global lock. [4]
With read-only queries this is not much of a problem. The use of 
indexes in queries can be forced with functions or with the 
db:enforceindex pragma [5].
Problems start showing when trying to implement a CRUD RestXQ 
application. Create, update and delete can be implemented using the 
XQuery Update standard, but this gets slow and cumbersome when, for 
many read operations, BaseX cannot determine which DBs they use and 
therefore holds a global read lock. That of course means that no global 
write lock can be acquired until all read operations are finished on 
all DBs known to a BaseX instance.
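The effect of lock scope described here can be modeled in a few lines 
of plain Java. This is only a toy model, not BaseX's actual lock 
manager: the class LockScopeSketch, its methods and the database names 
dict_a and dict_b are made up for illustration, and 
java.util.concurrent's ReentrantReadWriteLock merely stands in for 
whatever BaseX does internally.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Toy model of global versus per-database locking (hypothetical, not BaseX code). */
final class LockScopeSketch {
  /** One lock shared by all databases: what a "global lock" amounts to. */
  static final ReentrantReadWriteLock GLOBAL = new ReentrantReadWriteLock();

  /** One lock per database name, created on first use. */
  static final Map<String, ReentrantReadWriteLock> PER_DB = new ConcurrentHashMap<>();

  static ReentrantReadWriteLock lockFor(String db) {
    return PER_DB.computeIfAbsent(db, name -> new ReentrantReadWriteLock());
  }

  /**
   * Holds a read lock on readerLock and reports whether a write lock on
   * writerLock could be taken at that moment. The same thread plays both
   * roles; ReentrantReadWriteLock does not allow upgrading a read lock to
   * a write lock, so the outcome matches the two-thread case.
   */
  static boolean writerCanStart(ReentrantReadWriteLock readerLock,
                                ReentrantReadWriteLock writerLock) {
    readerLock.readLock().lock();
    try {
      boolean acquired = writerLock.writeLock().tryLock();
      if (acquired) writerLock.writeLock().unlock();
      return acquired;
    } finally {
      readerLock.readLock().unlock();
    }
  }

  public static void main(String[] args) {
    // Reader and writer both fall back to the global lock: the writer waits.
    System.out.println(writerCanStart(GLOBAL, GLOBAL));                     // false
    // Reader locks dict_a only, writer locks dict_b only: no conflict.
    System.out.println(writerCanStart(lockFor("dict_a"), lockFor("dict_b"))); // true
    // Reader and writer on the same database still exclude each other.
    System.out.println(writerCanStart(lockFor("dict_a"), lockFor("dict_a"))); // false
  }
}
```

The first case is the situation above: as long as any read operation 
holds the global read lock, an update to a completely unrelated 
database has to wait.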
This is especially problematic when one instance of BaseX with a RestXQ 
application is used to serve data from independent databases. Say one 
instance of BaseX has a RestXQ API that serves a lot of different 
dictionaries for different natural languages. This is my use case. 
Although the content of dictionary entries is different, the parts in 
the TEI/XML I try to manipulate, that are created, read, updated or 
deleted, are the same. So, a common API should handle many independent 
dictionaries, edited by many users, using one instance of BaseX.
Also, when working with my biggest XML database of several GB I ran into 
problems when reindexing after an update. Reindexing all those GB of 
data takes too long and makes small updates in there impossible.
Why not multiple instances of BaseX? For better or worse, BaseX runs 
in a JVM, and even after I tried to minimize the memory footprint of an 
idle BaseX it is still a little less than 300 MB. We run a lot of 
services here on shared servers, so RAM usage matters. RAM usage is 
also part of the cost when using commercial cloud services, although 
of course not running BaseX at all when it is not used is best if you 
pay per minute. And, as recently discussed on the list, BaseX, like 
any Java program, gets optimized by the JVM while running. Those 
optimizations, as well as caching, benefit all the data hosted in one 
instance but would, I assume, be less efficient with multiple 
instances.
So how do I achieve four goals:
* Keep the XQuery short and concise because that is what the optimizer 
can handle best?
* Keep the code separated into modules that deal with one particular aspect?
* Use RestXQ and not another technique to actually implement the RESTful 
API?
* All this while being able to split GB of XML data into portions that 
can be reindexed in a reasonable amount of time?
The two things that help a lot here are eval functions like 
xquery:eval [6] and String Constructors [7].
Say, I want to run a query but on different collections (databases). I 
can do this by having a list of collections and executing the actual 
query in a for loop with the concrete collection as a variable.
If I just write the XQuery code down like this, the problem is that the 
optimizer would need to evaluate the query to find out which databases 
to lock and which indexes can be used. BaseX is not built to do this 
(yet). It does not do a mock run of the query. So, it decides that a 
global lock needs to be used. Depending on the use of XQuery Update, 
either a global write lock or a global read lock is acquired. This is 
easy to understand but does not help with performance here.
If I want to make the situation worse for the optimizer I can use 
xquery:eval. That of course makes the XQuery code totally opaque to the 
optimizer. A global lock is guaranteed.
Still, another eval function is a solution here: jobs:eval from the 
jobs module [8].
If I break up my code into jobs only these jobs hold locks for as long 
as they run. This can be a much shorter period of time than what it 
takes to run a whole RestXQ request. It is also possible to find a place 
that needs to be changed in a number of databases and then only write 
lock one of them to change something.
So, if my data is stored in not one but several database files I can 
make them look like one big XML for API purposes, but still have small 
enough independent parts that can be indexed separately so updates with 
reindexing are relatively fast.
If I have a search I want to perform on parts of databases that are in 
principle independent, like dictionary entries in a large dictionary, I 
can do this in parallel on each database.
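Since jobs are (Java) threads, the fan-out idea can be modeled in plain 
Java. The sketch below is not BaseX code and does not call the Jobs 
Module. FanOutSketch, searchAll and the searchOne callback are 
hypothetical names, and the pool size of 8 is an arbitrary choice. It 
only shows the pattern: start one task per database, then collect the 
results in the order of the database list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.BiFunction;

/** Toy model of "run the same search on every database in parallel". */
final class FanOutSketch {
  /**
   * Runs searchOne(db, term) once per database on a small thread pool and
   * returns the results in the same order as the databases list.
   */
  static <R> List<R> searchAll(List<String> databases, String term,
                               BiFunction<String, String, R> searchOne) {
    int threads = Math.max(1, Math.min(databases.size(), 8));
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      // Start one task per database; each task only touches its own database.
      List<Future<R>> pending = new ArrayList<>();
      for (String db : databases) {
        Callable<R> task = () -> searchOne.apply(db, term);
        pending.add(pool.submit(task));
      }
      // Wait for every task and keep the original order.
      List<R> results = new ArrayList<>();
      for (Future<R> future : pending) {
        try {
          results.add(future.get());
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
          // Surface the error of the failed task instead of hiding it.
          throw new IllegalStateException("search failed", e.getCause());
        }
      }
      return results;
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    // Stand-in for a real per-database query: just labels the input.
    List<String> hits = searchAll(Arrays.asList("dict_a", "dict_b"), "house",
        (db, term) -> db + ":" + term);
    System.out.println(hits); // [dict_a:house, dict_b:house]
  }
}
```

Each task stands for one short job that would lock only its own 
database for as long as it runs.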
I tried to implement this idea with jobs:eval and it actually worked 
very well. Only the interface of the function was cumbersome for the 
way I wanted to use it.
So, I wrote a wrapper around jobs:eval and jobs:wait that makes it easy 
to generate small self-contained XQuery code [9] using String 
Constructors [10] and some other functions used for querying the 
structure of the data stored in BaseX like listing and filtering 
databases by name [11].
Another goal for this util:eval(s) function was to make it still 
easy to see errors [12].
A typical use is something like: run a filter query in all databases 
[13] that are found using a database name filter in some settings 
database [14] and use a string for comparison from a request URL 
parameter [15].
Another typical use: find an entry out of a few million and replace it 
with an updated version. Of course, with reindexing [16].
What were some (unexpected) problems?
Because jobs, and especially write jobs, now lock databases while the 
RestXQ code is running, the RestXQ code itself must not hold any read 
or write lock. That is possible in BaseX, but some functions force a 
global lock, for example db:list. I think there are good reasons why 
you want a global lock, and therefore atomicity during a query, when 
you ask for the list of databases. Of course, my code happens to need 
to list databases quite often, and it should not hold a global lock 
after getting the list. My list of databases may change during a 
RestXQ call, but I don’t care about that situation yet. I think it 
does not matter to me.
There is now also a simple solution: outsource db:list to its own job [17].
I also remember there was a problem with an automatic conversion of 
RestXQ parameters creating a randomly named lock. But it was no problem 
to do the conversion explicitly in XQuery code and so have the RestXQ 
code not hold any lock again.
Now there already was a question on the mailing list about BaseX 
behaviour in a multithreaded environment. I don’t use BaseX.jar in 
such a way with my own Java code, but jobs are (Java) threads. The 
interesting thing here is that with a lot of threads (say 700) that 
don’t lock each other, a bottleneck shows in the way BaseX handles file 
access. At least the Java profiler showed me this as a primary source of 
wasted time [18].
If I understand it correctly, file access is, as usual, done in 4 KB 
portions which are read into a buffer, and smaller parts are accessed 
from there. This is by far the most efficient way to do this on any 
current operating system and file system. But the handling of this 
buffer needs protection against the buffer being manipulated by 
different threads.
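A minimal Java sketch of such a block buffer, under my own assumptions: 
BlockBufferSketch and its members are made-up names, and this is not 
the code referenced in [18]. It shows why the buffer needs protection. 
Seeking, refilling the buffer and reading from it have to happen as one 
step, which is done here with synchronized, so every thread using the 
same reader queues up at that point.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Toy block-buffered reader (hypothetical, not BaseX's implementation). */
final class BlockBufferSketch implements AutoCloseable {
  static final int BLOCK_SIZE = 4096;

  private final RandomAccessFile file;
  private final byte[] buffer = new byte[BLOCK_SIZE];
  private long bufferedBlock = -1; // index of the block currently in the buffer
  private int valid;               // number of valid bytes in the buffer
  private int blockLoads;          // how often a block was fetched from the file

  BlockBufferSketch(Path path) throws IOException {
    file = new RandomAccessFile(path.toFile(), "r");
  }

  /** Returns the byte at pos as 0..255, or -1 beyond the end of the file. */
  synchronized int read(long pos) throws IOException {
    long block = pos / BLOCK_SIZE;
    if (block != bufferedBlock) {
      // Replace the buffer content; no other thread may interleave here.
      file.seek(block * BLOCK_SIZE);
      valid = 0;
      int n;
      while (valid < BLOCK_SIZE
          && (n = file.read(buffer, valid, BLOCK_SIZE - valid)) > 0) {
        valid += n;
      }
      bufferedBlock = block;
      blockLoads++;
    }
    int offset = (int) (pos % BLOCK_SIZE);
    return offset < valid ? buffer[offset] & 0xFF : -1;
  }

  synchronized int blockLoads() {
    return blockLoads;
  }

  @Override
  public synchronized void close() throws IOException {
    file.close();
  }

  /** Writes a three-block temp file, reads five bytes, returns the block loads. */
  static int demo() {
    try {
      Path tmp = Files.createTempFile("blocks", ".bin");
      try {
        byte[] data = new byte[3 * BLOCK_SIZE];
        for (int i = 0; i < data.length; i++) data[i] = (byte) (i / BLOCK_SIZE + 1);
        Files.write(tmp, data);
        try (BlockBufferSketch reader = new BlockBufferSketch(tmp)) {
          int sum = reader.read(0) + reader.read(10) + reader.read(4095) // block 0
              + reader.read(4096)                                        // block 1
              + reader.read(2L * BLOCK_SIZE + 5);                        // block 2
          if (sum != 8) throw new IllegalStateException("unexpected data: " + sum);
          return reader.blockLoads();
        }
      } finally {
        Files.deleteIfExists(tmp);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(demo()); // 3: five reads, but only three block loads
  }
}
```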
All I found in the JDK for this was the Java nio stream system [19], 
which is a performance nightmare: it tries to guarantee quite a few 
thread-related consistency properties [20] and seems really slow. This 
seems to be a well-known fact documented numerous times on the 
internet [21]. I also tried one of the tests BaseX contains [22] 
together with an attempt to use FileChannel instead of the current 
RandomAccessFile-based implementation and found the documented 
behaviour: Java nio file classes are no replacement for the current 
implementation when it comes to performance.
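For comparison, this is roughly what a block read through FileChannel 
looks like. It is a sketch under my own assumptions, not the code from 
my experiment and not BaseX code; ChannelReadSketch, readBlock and the 
block size constant are names chosen for illustration. The positional 
read(ByteBuffer, long) call does not change the channel's position, and 
FileChannel is documented as safe for use by multiple concurrent 
threads, so no synchronized block is needed around it. The sketch says 
nothing about speed.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Toy positional block read via FileChannel (hypothetical, not BaseX code). */
final class ChannelReadSketch {
  static final int BLOCK_SIZE = 4096;

  /** Reads the block with the given index; the last block may be shorter. */
  static byte[] readBlock(FileChannel channel, long blockIndex) throws IOException {
    ByteBuffer buf = ByteBuffer.allocate(BLOCK_SIZE);
    long start = blockIndex * BLOCK_SIZE;
    while (buf.hasRemaining()) {
      // Positional read: leaves the channel's own position untouched.
      int n = channel.read(buf, start + buf.position());
      if (n < 0) break; // end of file
    }
    byte[] out = new byte[buf.position()];
    buf.flip();
    buf.get(out);
    return out;
  }

  /** Writes two full blocks plus 100 bytes and returns the last block's length. */
  static int demo() {
    try {
      Path tmp = Files.createTempFile("channel", ".bin");
      try {
        byte[] data = new byte[2 * BLOCK_SIZE + 100];
        for (int i = 0; i < data.length; i++) data[i] = (byte) (i / BLOCK_SIZE + 1);
        Files.write(tmp, data);
        try (FileChannel channel = FileChannel.open(tmp, StandardOpenOption.READ)) {
          byte[] second = readBlock(channel, 1);
          if (second.length != BLOCK_SIZE || second[0] != 2) {
            throw new IllegalStateException("unexpected content in block 1");
          }
          return readBlock(channel, 2).length;
        }
      } finally {
        Files.deleteIfExists(tmp);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(demo()); // 100
  }
}
```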
Looking at other databases I saw they implement something OS dependent 
but it is hard to compare [23].
[1] https://docs.basex.org/wiki/Indexes
[2] https://docs.basex.org/wiki/XQuery_Optimizations
[3] https://docs.basex.org/wiki/Transaction_Management
[4] https://docs.basex.org/wiki/Transaction_Management#Limitations
[5] https://docs.basex.org/wiki/Indexes#Enforce_Rewritings
[6] https://docs.basex.org/wiki/XQuery_Module#xquery:eval
[7] https://www.w3.org/TR/2017/REC-xquery-31-20170321/#id-string-constructors
[8] https://docs.basex.org/wiki/Jobs_Module#jobs:eval
[9] https://github.com/acdh-oeaw/vleserver_basex/blob/main/vleserver/util.xqm#L5...
[10] https://github.com/acdh-oeaw/vleserver_basex/blob/main/vleserver/data/access...
[11] https://github.com/acdh-oeaw/vleserver_basex/blob/main/vleserver/data/profil...
[12] https://github.com/acdh-oeaw/vleserver_basex/blob/main/vleserver/util.xqm#L9...
[13] https://github.com/acdh-oeaw/vleserver_basex/blob/main/vleserver/data/access...
[14] https://github.com/acdh-oeaw/vleserver_basex/blob/main/vleserver/data/access...
[15] https://github.com/acdh-oeaw/vleserver_basex/blob/main/vleserver/entries.xqm...
[16] https://github.com/acdh-oeaw/vleserver_basex/blob/main/vleserver/data/access...
[17] https://github.com/acdh-oeaw/vleserver_basex/blob/main/vleserver/dicts.xqm#L...
[18] https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/ba... 
and other read methods there
[19] https://blogs.oracle.com/javamagazine/post/java-nio-nio2-buffers-channels-as...
[20] https://docs.oracle.com/javase/7/docs/api/java/nio/channels/FileChannel.html 
"The view of a file provided by an instance of this class is guaranteed 
to be consistent with other views of the same file provided by other 
instances in the same program."
[21] https://www.mathematik.uni-marburg.de/~alexmaurer/files/NioVsIo.pdf 
as an example. Maybe more recent evaluations with Java 11 or 17 would 
show better nio or nio.2 performance?
[22] https://github.com/BaseXdb/basex/blob/master/basex-core/src/test/java/org/ba...
[23] https://github.com/neo4j/neo4j/blob/4.4/community/native/src/main/java/org/n...
Best regards
-- 
Mag. Ing. Omar Siam
Austrian Center for Digital Humanities and Cultural Heritage
Österreichische Akademie der Wissenschaften | Austrian Academy of Sciences
Stellvertretende Behindertenvertrauensperson | Deputy representative for disabled persons
Wohllebengasse 12-14, 1040 Wien, Österreich | Vienna, Austria
T: +43 1 51581-7295
omar.siam@oeaw.ac.at | www.oeaw.ac.at/acdh