What I like about BaseX is that it is very good at optimizing self-contained queries about the size a user can read and understand [1] [2] and that it has a DB locking system for transaction management [3] that is robust and easy to understand.
What I don’t like so much about BaseX is that these two mechanisms don’t work very well with complex code that is split into several modules. I use modules for code that may be shared among projects, or just as a means of grouping common concerns in one place. That I don’t like this behaviour does not mean I know (or have any hope) that it can be solved in a better way without at least making unpleasant sacrifices elsewhere. It is just the setting I have to deal with.
When BaseX can no longer determine which DBs are used in a query and which are not, it falls back on two assumptions: that there are no indexes, so automatic optimization in this regard stops, and that all DBs known to BaseX are used in that query, so it acquires a global lock. [4]
For read-only queries this is not much of a problem. The use of indexes can be forced with functions or with the db:enforceindex pragma [5].
Problems start to show when implementing a CRUD RestXQ application. Create, update and delete can be implemented with the XQuery Update standard, but this gets slow and cumbersome when BaseX cannot determine which DBs many of the read operations use, so each of them holds a global read lock. That means no global write lock can be acquired until all read operations on all DBs known to the BaseX instance have finished.
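As a minimal sketch of the pragma mentioned above, modelled on the example in the Indexes documentation [5]: the database names, the element name entry and the ID value are placeholders, not names from the application discussed here, and namespaces are left out for brevity.

```xquery
(: Hypothetical database names and entry ID, for illustration only. :)
(: db:open is the BaseX 9 function name that was current when this was written. :)
(# db:enforceindex #) {
  for $name in ('dict_a', 'dict_b')
  (: The database name is only known at runtime here, so the pragma tells
     the optimizer to rewrite the predicate for index access anyway. :)
  return db:open($name)//entry[@xml:id = 'entry_0001']
}
```

The pragma only addresses index use. It does not change which locks are acquired for the query.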
This is especially problematic when one instance of BaseX with a RestXQ application is used to serve data from independent databases. Say one instance of BaseX has a RestXQ API that serves a lot of different dictionaries for different natural languages. This is my use case. Although the content of the dictionary entries is different, the parts of the TEI/XML I try to manipulate, the ones that are created, read, updated or deleted, are the same. So a common API should handle many independent dictionaries, edited by many users, using one instance of BaseX.
Also, when working with my biggest XML database, which is several GB in size, I ran into problems when reindexing after an update. Reindexing all those GB of data takes too long and makes small updates practically impossible.
Why not multiple instances of BaseX? For better or worse, BaseX runs in a JVM, and even after I tried to minimize the memory footprint of an idle BaseX it is still a little less than 300 MB. We run a lot of services here on shared servers, so RAM usage matters. RAM usage is also part of the costs when using commercial cloud services, although not running BaseX at all while it is unused is of course best if you pay per minute. And, as recently discussed on the list: BaseX, like any Java program, gets optimized by the JVM while running. Those optimizations, as well as caching, benefit all the data hosted in one instance but would, I assume, be less effective with multiple instances.
So how do I achieve four goals:
* Keep the XQuery short and concise because that is what the optimizer can handle best?
* Keep the code separated into modules that each deal with one particular aspect?
* Use RestXQ and not another technique to actually implement the RESTful API?
* All this while being able to split GB of XML data into portions that can be reindexed in a reasonable amount of time?
Two things help a lot here: eval functions like xquery:eval [6] and String Constructors [7]. Say I want to run a query on different collections (databases). I can do this by keeping a list of collections and executing the actual query in a for loop with the concrete collection as a variable. If I just write the XQuery code down like this, the optimizer would need to evaluate the query to find out which databases to lock and which indexes can be used. BaseX is not built to do this (yet); it does not mock-run the query. So it decides that a global lock is needed: depending on the use of XQuery Update, either a global write lock or a global read lock is acquired. That is easy to understand but does not help with performance here. If I want to make the situation worse for the optimizer, I can use xquery:eval. That of course makes the XQuery code totally opaque to the optimizer, and a global lock is guaranteed.
Still, another eval function is a solution here: jobs:eval from the Jobs Module [8]. If I break up my code into jobs, only these jobs hold locks, and only for as long as they run. This can be a much shorter period of time than it takes to run a whole RestXQ request. It is also possible to search a number of databases for the place that needs to be changed and then write lock only the one that contains it. So, if my data is stored not in one but in several database files, I can make them look like one big XML document for API purposes and still have independent parts small enough to be indexed separately, so updates with reindexing are relatively fast.
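To illustrate what a String Constructor [7] contributes, here is a small sketch that generates query text in which the database name appears as a literal. The function name local:entry-query, the database name and the entry ID are hypothetical, and namespaces are left out.

```xquery
(: Builds the text of a small self-contained query.
   All names and values are placeholders for illustration. :)
declare function local:entry-query(
  $db as xs:string,
  $id as xs:string
) as xs:string {
  (: `{ ... }` interpolates the value into the surrounding query text :)
  ``[
    db:open("`{ $db }`")//entry[@xml:id = "`{ $id }`"]
  ]``
};

local:entry-query('dict_a', 'entry_0001')
```

The generated string is a query with a fixed database name, which is the kind of short query the optimizer and the lock detection can analyse once it is evaluated on its own. Values that come from a request are probably better passed as bound variables than pasted into the query text.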
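A minimal sketch of the job pattern described above, following the cache/wait/result idiom from the Jobs Module documentation [8]. The database and entry names are placeholders.

```xquery
(: The calling query touches no database itself; only the job does. :)
let $query := ``[
  db:open("dict_a")//entry[@xml:id = "entry_0001"]
]``
(: 'cache': true() keeps the job result until it is fetched :)
let $id := jobs:eval($query, map { }, map { 'cache': true() })
return (
  jobs:wait($id),    (: block until the job has finished :)
  jobs:result($id)   (: fetch the cached result; it can be fetched only once :)
)
```

Because the database name is a literal inside the job's query, the lock the job holds should be limited to that one database and to the lifetime of the job.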
If I have a search I want to perform on parts of databases that are in principle independent, like dictionary entries in a large dictionary, I can do this in parallel on each database.
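A sketch of such a parallel search, assuming three hypothetical databases that each hold part of one dictionary. The search value is passed as a bound external variable instead of being written into the query text.

```xquery
(: Hypothetical database names; one job is started per database. :)
let $dbs := ('dict_a_part1', 'dict_a_part2', 'dict_a_part3')
let $ids :=
  for $db in $dbs
  return jobs:eval(
    ``[
      declare variable $id external;
      db:open("`{ $db }`")//entry[@xml:id = $id]
    ]``,
    map { 'id': 'entry_0001' },   (: value bound to $id in each job :)
    map { 'cache': true() }
  )
return (
  $ids ! jobs:wait(.),    (: wait for all jobs :)
  $ids ! jobs:result(.)   (: then collect the results in database order :)
)
```

All jobs are scheduled before the first wait, so they can run side by side as far as the server configuration allows.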
I tried to implement this idea with jobs:eval and it actually worked very well. Only the interface of the function was cumbersome for the way I wanted to use it.
So, I wrote a wrapper around jobs:eval and jobs:wait that makes it easy to generate small self-contained XQuery code [9] using String Constructors [10] and some other functions used for querying the structure of the data stored in BaseX like listing and filtering databases by name [11].
Another goal for this util:eval(s) function was to keep errors easy to see [12].
A typical use is something like: run a filter query in all databases [13] that are found using a database name filter in some settings database [14] and use a string for comparison from a request URL parameter [15]. Find an entry out of a few million and replace it with an updated version. Of course, with reindexing [16].
What were some (unexpected) problems? Because jobs, and especially write jobs, now lock databases while the RestXQ code is running, the RestXQ code itself must not hold any read or write lock. That is possible in BaseX, but some functions force a global lock, for example db:list. I think there are good reasons why you want a global lock, and therefore atomicity during a query, when you ask for the list of databases. Of course, my code happens to need to list databases quite often, and it should not keep holding a global lock after getting the list. My list of databases may change during a RestXQ call, but I don’t care about that situation yet; I think it does not matter to me. There is now also a simple solution: outsource db:list to its own job [17].
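A sketch of how an error raised inside a job can be made visible to the caller. It assumes that jobs:result re-raises a cached error, which is how the Jobs Module documentation [8] appears to describe it. This is not the util:eval wrapper referenced above, only an illustration of the idea.

```xquery
(: The job deliberately fails so the error path can be seen. :)
let $id := jobs:eval(
  'error((), "something went wrong inside the job")',
  map { },
  map { 'cache': true() }
)
return (
  jobs:wait($id),
  try {
    jobs:result($id)
  } catch * {
    (: report error code and message instead of failing silently :)
    <error code="{ $err:code }">{ $err:description }</error>
  }
)
```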
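The following sketch puts these steps together under stated assumptions: local:run is a hypothetical helper (not the util:eval wrapper from [9]), the database name prefix, element names and entry ID are placeholders, and namespaces are left out. db:list runs in its own short job, read jobs locate the database that contains the entry, and only that database is write locked for the update and the subsequent db:optimize.

```xquery
(: Hypothetical helper: run one query as a job and return its result. :)
declare function local:run($query as xs:string) as item()* {
  let $id := jobs:eval($query, map { }, map { 'cache': true() })
  return (jobs:wait($id), jobs:result($id))
};

(: The global lock needed by db:list is held only while this job runs. :)
let $dbs := local:run('db:list()[starts-with(., "dict_a")]')
(: Read jobs: find the first database that contains the entry. :)
let $hit := (
  for $db in $dbs
  where local:run(``[
    exists(db:open("`{ $db }`")//entry[@xml:id = "entry_0001"])
  ]``)
  return $db
)[1]
where exists($hit)
(: Write job: only the database in $hit is write locked. :)
return local:run(``[
  replace node db:open("`{ $hit }`")//entry[@xml:id = "entry_0001"]
    with <entry xml:id="entry_0001">updated</entry>,
  db:optimize("`{ $hit }`")
]``)
```

Between the read jobs and the write job the data may change, so a real implementation would have to decide how to handle an entry that has disappeared in the meantime.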
I also remember there was a problem with an automatic conversion of RestXQ parameters creating a randomly named lock. But it was no problem to do the conversion explicitly in XQuery code and so have the RestXQ code not hold any lock again.
There already was a question on the mailing list about BaseX behaviour in a multithreaded environment. I don’t use BaseX.jar that way with my own Java code, but jobs are (Java) threads. The interesting thing is that with a lot of threads (say 700) that don’t lock each other, a bottleneck shows in the way BaseX handles file access. At least the Java profiler showed me this as a primary source of wasted time [18].
If I get it correctly, file access is, as usual, done in 4 KB portions which are read into a buffer, and smaller parts are accessed from there. This is by far the most efficient way to do this on any current operating system and file system. But now the handling of this buffer needs protection against the buffer being manipulated by different threads.
All I found in the JDK for this was the Java NIO stream system [19], which is a performance nightmare: it tries to guarantee quite a few thread-related consistencies [20] and seems really slow. This seems to be a well-known fact, documented numerous times on the internet [21]. I also tried replacing the current RandomAccessFile-based implementation with FileChannel and ran one of the tests BaseX contains [22]. I found the documented behaviour: the Java NIO file classes are no replacement for the current implementation when it comes to performance.
Looking at other databases I saw they implement something OS dependent but it is hard to compare [23].
[1] https://docs.basex.org/wiki/Indexes
[2] https://docs.basex.org/wiki/XQuery_Optimizations
[3] https://docs.basex.org/wiki/Transaction_Management
[4] https://docs.basex.org/wiki/Transaction_Management#Limitations
[5] https://docs.basex.org/wiki/Indexes#Enforce_Rewritings
[6] https://docs.basex.org/wiki/XQuery_Module#xquery:eval
[7] https://www.w3.org/TR/2017/REC-xquery-31-20170321/#id-string-constructors
[8] https://docs.basex.org/wiki/Jobs_Module#jobs:eval
[9] https://github.com/acdh-oeaw/vleserver_basex/blob/main/vleserver/util.xqm#L5...
[10] https://github.com/acdh-oeaw/vleserver_basex/blob/main/vleserver/data/access...
[11] https://github.com/acdh-oeaw/vleserver_basex/blob/main/vleserver/data/profil...
[12] https://github.com/acdh-oeaw/vleserver_basex/blob/main/vleserver/util.xqm#L9...
[13] https://github.com/acdh-oeaw/vleserver_basex/blob/main/vleserver/data/access...
[14] https://github.com/acdh-oeaw/vleserver_basex/blob/main/vleserver/data/access...
[15] https://github.com/acdh-oeaw/vleserver_basex/blob/main/vleserver/entries.xqm...
[16] https://github.com/acdh-oeaw/vleserver_basex/blob/main/vleserver/data/access...
[17] https://github.com/acdh-oeaw/vleserver_basex/blob/main/vleserver/dicts.xqm#L...
[18] https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/ba... and other read methods there
[19] https://blogs.oracle.com/javamagazine/post/java-nio-nio2-buffers-channels-as...
[20] https://docs.oracle.com/javase/7/docs/api/java/nio/channels/FileChannel.html: "The view of a file provided by an instance of this class is guaranteed to be consistent with other views of the same file provided by other instances in the same program."
[21] https://www.mathematik.uni-marburg.de/~alexmaurer/files/NioVsIo.pdf as an example. Maybe more recent evaluations of Java 11 or 17 NIO or NIO.2 performance are better?
[22] https://github.com/BaseXdb/basex/blob/master/basex-core/src/test/java/org/ba...
[23] https://github.com/neo4j/neo4j/blob/4.4/community/native/src/main/java/org/n...
Best regards
> a bottleneck shows in the way BaseX handles file access.
I wonder if this issue is relevant? https://github.com/BaseXdb/basex/issues/1574
/Andy
On Mon, 2 May 2022 at 16:11, Omar Siam Omar.Siam@oeaw.ac.at wrote:
Best regards
-- Mag. Ing. Omar Siam Austrian Center for Digital Humanities and Cultural Heritage Österreichische Akademie der Wissenschaften | Austrian Academy of Sciences Stellvertretende Behindertenvertrauensperson | Deputy representative for disabled persons Wohllebengasse 12-14, 1040 Wien, Österreich | Vienna, Austria T: +43 1 51581-7295 omar.siam@oeaw.ac.at | www.oeaw.ac.at/acdh
On 25.05.2022 at 10:37, Andy Bunce wrote:
>> a bottleneck shows in the way BaseX handles file access.
> I wonder if this issue is relevant? https://github.com/BaseXdb/basex/issues/1574
> /Andy
Yes, that is exactly the part I suspect of slowing things down.
basex-talk@mailman.uni-konstanz.de