Over the last couple of years I’ve developed the Mirabel system, which provides DITA link management and query features over large volumes of content (the ServiceNow product documentation source). In particular, it knows what links to what and lets you view the content with all link information available.
For a given version of the product docs we have about 60K DITA topics and 100 root maps that organize those topics into publications.
The primary job of Mirabel is to capture all the hyperlink details as defined by the DITA source and enable queries about the element-to-element and document-to-document relationships established by those links.
My implementation approach for loading the link knowledge uses a multi-step process:
Convert the XQuery map to a single XML document and store it in the link record-keeping database. The resulting database takes about 150MB of storage.
This third step can take two to three hours: at 0.1 to 0.2 seconds per topic, 60K topics works out to roughly 1.7 to 3.3 hours, and 0.1 seconds per topic is about as fast as the link processing can go, based on my testing.
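To make the conversion-and-store step concrete, it amounts to something like the following (a simplified sketch only: the database name, document name, and record structure here are made up, and the real link records are much richer):

  (: Build a single XML document from the in-memory link map and store it
     in the temporary record-keeping database (assumed to exist already).
     All names and the toy data are illustrative. :)
  let $link-map := map {
    'topic-001.dita': ('topic-002.dita', 'topic-003.dita'),
    'topic-002.dita': 'topic-003.dita'
  }
  let $links-doc :=
    element link-index {
      map:for-each($link-map, function($source, $targets) {
        element link {
          attribute source { $source },
          $targets ! element target { . }
        }
      })
    }
  return db:put('linkrecords-temp', $links-doc, 'link-index.xml')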
This is all done using temporary databases so as not to disturb the working databases used by the running Mirabel web application. The work is performed by a BaseX “worker” server, not the main server that serves the web site: I essentially run one BaseX HTTP server for each core on my server and allocate work to them based on load, so queries coming from the web app are never allocated to a worker that is currently doing a content update.
Once all the new link data is loaded, the temporary databases are swapped into production by renaming the production databases, renaming the temp databases to their production names, and then dropping the old databases. (Writing this out, I realize I don’t know how to pause, or wait for, active queries against the in-production databases to finish so that I can do the swap.)
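For reference, the swap itself is just a short command script along these lines (database names are illustrative; the content database is swapped the same way in the same script):

  # Move the production database out of the way, promote the temp
  # database, then drop the old one.
  ALTER DB linkrecords linkrecords-old
  ALTER DB linkrecords-temp linkrecords
  DROP DB linkrecords-old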
Because all the index entries use node IDs, the content database and the record-keeping databases have to be put into production at the same time; otherwise the content node IDs will be out of sync with the node IDs recorded in the index. I’m working on the assumption that renaming databases is essentially instantaneous, so I can rely on renaming to swap the temp databases into production reliably.
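Spelled out, the coupling between the two databases is roughly this (a sketch with illustrative database and element names; the real index records carry much more than the bare IDs):

  declare variable $content-db := 'content';  (: illustrative name :)

  (: at index-build time: record the persistent node ID of each content node :)
  let $ids :=
    for $topic in db:get($content-db)//*:topic
    return db:node-id($topic)

  (: at query time: a stored ID is resolved back to the node it identifies,
     which only works if the content database is the same build the IDs came from :)
  for $id in $ids
  return db:get-id($content-db, $id)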
I use my job orchestration module (https://github.com/ekimbernow/basex-orchestration) to manage the sequence of operations, where each job calls the next job in the sequence once it has finished.
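The underlying chaining primitive is simply that, when one job finishes, it queues the next one; in plain BaseX terms that is something like the sketch below (assuming the Job Module’s job:eval(); the module namespace, function name, and option shown are made up, not the orchestration module’s actual API):

  (: queue the next step of the update sequence as a separate job;
     the query string, module URI, and 'cache' option are illustrative :)
  job:eval(
    "import module namespace lr = 'urn:example:linkrecords' at 'linkrecords.xqm';
     lr:build-where-used-index()",
    map { },
    map { 'cache': true() }
  )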
This process works reliably for smaller volumes of content—for example, a content set with only a couple of thousand topics and four or five root maps.
But at full scale I’m consistently seeing that the link record-keeping database, which has only two large XML documents in it, never completes optimization: the list of databases shows the database with two resources in it, but when you open the database’s own page they do not show up, and the job that performs the optimization never completes, leaving the database in a locked state. This means the new where-used index can’t be put into production.
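For clarity, the optimization step itself is nothing exotic: essentially a call along these lines on the freshly built database (name illustrative; I’m showing the full-optimize variant here):

  (: optimize the temporary record-keeping database after loading;
     at full scale this is the job that never completes :)
  db:optimize('linkrecords-temp', true())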
I suspect I’m going about this the wrong way to make the best use of BaseX and to avoid this problem with very large databases, but I don’t see any obvious alternative approach. It feels like I’m missing something fundamental or making a silly error that I can’t see.
So my question:
How would you solve this problem?
In particular, how would you go about constructing the where-used index in a way that works best with BaseX?
Or maybe the question is “should I be updating the in-production database with the new data and doing the swap into production within the database itself?” (i.e., by replacing or renaming the where-used index document rather than renaming the database itself).
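Concretely, that alternative would boil down to replacing the where-used document inside the production database in a single update, along these lines (database and document names are illustrative):

  declare variable $new-where-used as document-node() external;

  (: overwrite the where-used document in place; db:put replaces any
     existing resource at the given path :)
  db:put('linkrecords', $new-where-used, 'where-used.xml')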
I am currently using BaseX 11.6 and can move to 12 once it is released.
Thanks,
Eliot
_____________________________________________
Eliot Kimber
Sr. Staff Content Engineer
O: 512 554 9368
servicenow