Hi list,
I use BaseX a lot here at my institution as an XML database that provides data through RESTful APIs [1] or serves HTML, JS and CSS with RestXQ, which in turn use HTTP requests to fetch more data [2]. Our main use of BaseX databases is for TEI/XML encoded dictionaries of various sizes. That means we mainly want to search for, retrieve and change small tei:entry parts in larger TEI/XML documents.
BaseX has been amazingly stable over the last few years (compared to some other open source XML databases existing today) and has a very solid set of built-in functions that are almost always sufficient to get a job done. I also like BaseX because I am pretty sure I could always understand why things did not work and what I could do about it, even without doing any Java programming (although I have seen some weirdness or other over the last few years).
I created or ported some XQuery modules [3][4][6][7] and a containerization environment [5] that make my life easier. I would like to share them here and maybe have a discussion about my implementations and about whether and how others can make use of them (for example, could [6] and [7] become EXPath packages? I see some obstacles in the way RestXQ annotations work).
I will try to give an introduction to each of the modules in separate mails.
Finally, I am writing to the list now because I have a performance problem with a CRUD API I created [1] for the task mentioned above when using it with a dataset of about 7 GB, measured by the BaseX databases that make it up. This API uses many of the modules and techniques I came up with, so I thought it might be helpful to talk about those parts first. I hope that they may be useful to others as well.
I tried to get creative in finding a way to optimize querying the data without many long-lasting global lock situations and with BaseX using indexes as much as possible (this started before the db:enforceindex pragma was introduced and still works for me as expected without it), while still writing the RESTful API in BaseX' implementation of RestXQ. That is why I created [7]. It heavily uses (abuses?) BaseX' jobs module. It allows me to query several smaller BaseX databases in parallel and present them as if they were one big XML DB, which vastly improves performance on update and reindex, up to a point. Still, there is a file-based lock (or is it even class-based?) [8], or so I think the JVM profiling tells me, that severely limits the number of (read) operations that can be done over the API before users have to wait so long that they think the operation failed. As I see it, this is a multithreading problem.
Or maybe I overlooked something that would solve my problems without all the creative stuff I tried? That will probably become obvious once I have explained the current implementation in more detail, which I intend to do in the next few days.
[1] https://vle-curation.acdh.oeaw.ac.at/openapi/,
https://github.com/acdh-oeaw/vleserver_basex
[2] https://vicav.acdh.oeaw.ac.at/,
https://github.com/acdh-oeaw/vicav-app
[3] https://github.com/acdh-oeaw/openapi4restxq
[4] https://github.com/acdh-oeaw/api-problem4restxq
[5] https://github.com/simar0at/heroku-buildpack-basex
[6] https://github.com/acdh-oeaw/vicav-app/blob/master/http.xqm,
https://github.com/acdh-oeaw/openapi4restxq/blob/master_basex/swagger-ui.xqm
[7] https://github.com/acdh-oeaw/vleserver_basex/blob/main/vleserver/util.xqm
[8] https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/basex/io/random/DataAccess.java#L184
and other read methods there
Best regards
-- 
Mag. Ing. Omar Siam
Austrian Center for Digital Humanities and Cultural Heritage
Österreichische Akademie der Wissenschaften | Austrian Academy of Sciences
Stellvertretende Behindertenvertrauensperson | Deputy representative for disabled persons
Wohllebengasse 12-14, 1040 Wien, Österreich | Vienna, Austria
T: +43 1 51581-7295
omar.siam@oeaw.ac.at | www.oeaw.ac.at/acdh