Over the last couple of years I’ve developed the Mirabel system, which provides DITA link management and query features over large volumes of content (the ServiceNow product documentation source). In particular, it knows what links to what and enables viewing the content with all link information available.

 

For a given version of the product docs we have about 60K DITA topics and 100 root maps that organize those topics into publications.

 

The primary job of Mirabel is to capture all the hyperlink details as defined by the DITA source and enable queries about the element-to-element and document-to-document relationships established by those links.

 

My implementation approach for loading the link knowledge uses a multi-step process:

 

  1. Load the entire source content into a database
  2. Create a “key space” database that reflects the DITA key-to-resource mappings defined by each root DITA map. The key spaces are XQuery maps from key names to resources identified by their database node IDs (essentially, each use of a topic from a map has an associated unique key by which that use of the topic can be referenced). The key spaces are a prerequisite for resolving key-based cross references from one topic to other topics in the context of some root map, either the same root map or a different one. (A much-simplified sketch of a key space follows the list.)
  3. Create a “link record keeping” database that contains the “where used” index for the content.

    The where-used index maps element node IDs to a record of every reference to that node (cross references, content references, topic references from maps). It is the core data used to determine where a given map or topic is used, answering questions like “what publications use this topic?” or “is this topic used at all?”. The where-used table is constructed as an XQuery map that is then converted to XML for storage; I implemented this before BaseX added direct storage of maps, but given the size I think it still makes sense to store it as XML (I could be wrong). A simplified sketch of this construction also follows the list.
    1. Process all map-to-map and map-to-topic references and create the initial map entries, one for each map and topic.
    2. For topics referenced from maps, process all topic-to-topic references and update the records for each target topic to reflect the references to it. The map context of a given topic determines the targets of key references from that topic, so it is necessary to process the topics in the context of the root maps that use them (in DITA, root maps determine the key-to-resource bindings to which key references resolve).
    3. For topics not referenced from any maps, add entries for them to the where-used table and process any topic-to-topic references (key references cannot be resolved but direct URI references can be).


    4. Convert the XQuery map to a single XML document and store it in the link record keeping database. The resulting database takes about 150MB of storage.
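
To make this concrete, here is a much-simplified sketch of the kind of key space the second step builds. The database and map names are illustrative, and the real code walks the resolved map hierarchy and handles key scopes, precedence, and relative paths:

  (: A key space as an XQuery map from key name to the database node ID of the
     referenced resource, limited to one map document and to direct,
     database-relative hrefs. :)
  let $content-db := 'mirabel-content-temp'
  let $root-map := db:get($content-db, 'maps/product.ditamap')/*
  return map:merge(
    for $keydef in $root-map//*[@keys][@href]
    let $topic := head(db:get($content-db, $keydef/@href)/*)
    where exists($topic)
    for $key in tokenize($keydef/@keys)
    return map:entry($key, db:node-id($topic)),
    map { 'duplicates': 'use-first' }   (: first definition of a key wins :)
  )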
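
And here is an equally simplified sketch of the where-used construction and its conversion to XML for storage. Again the names are illustrative, reference resolution is reduced to direct hrefs, and I’m assuming the BaseX 10+ db:get/db:put function names; the real code resolves key references in root-map context and handles fragments and conrefs:

  (: Build a map from target node ID to the node IDs of every element that
     references it, then serialize the map as one XML document. :)
  let $content-db := 'mirabel-content-temp'
  let $links-db := 'mirabel-links-temp'
  let $where-used := map:merge(
    for $ref in db:get($content-db)//*[@href]
    let $target := head(db:get($content-db, substring-before($ref/@href || '#', '#'))/*)
    where exists($target)
    group by $target-id := db:node-id($target)
    return map:entry($target-id, $ref ! db:node-id(.))
  )
  let $index :=
    <where-used>{
      map:for-each($where-used, function($target-id, $ref-ids) {
        <uses target="{$target-id}">{ $ref-ids ! <use ref="{.}"/> }</uses>
      })
    }</where-used>
  return db:put($links-db, $index, 'where-used.xml')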

 

This third step can take two to three hours or more: 60K topics at 0.2 seconds per topic is about 3.3 hours, and 0.1 seconds per topic is about as fast as the link processing can go based on my testing.

 

This is all done using temporary databases so as not to disturb the working databases used by the running Mirabel web application. The work is performed by a BaseX “worker” server, not the main server that serves the web site: I essentially have one BaseX HTTP server for each core on my server and allocate work to them based on load, so queries coming from the web app will not be routed to a worker that is currently running a content update process.
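
Purely to illustrate the split (the actual allocation is handled by my orchestration module), handing a query to a worker instance looks roughly like this; the host, port, and credentials are made up:

  (: Post a query to one worker's REST endpoint so the main server stays free. :)
  declare namespace http = 'http://expath.org/ns/http-client';
  let $worker := 'http://localhost:9001/rest'
  let $body :=
    <query xmlns="http://basex.org/rest">
      <text>db:optimize('mirabel-links-temp')</text>
    </query>
  return http:send-request(
    <http:request method="post" username="admin" password="admin"
                  auth-method="basic">
      <http:body media-type="application/xml"/>
    </http:request>,
    $worker, $body)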

 

Once all the new link data is loaded, the temporary databases are swapped into production by renaming the production databases, renaming the temp databases to the production names, and then dropping the old databases. (Writing this up, I realize I don’t know how to pause or wait for active queries against the in-production databases to finish before I do the swap.)

 

Because all the index entries use node IDs, the content database and the record-keeping databases have to be put into production at the same time; otherwise the content node IDs will be out of sync with the indexed record IDs. I’m working on the assumption that renaming a database is essentially instantaneous, so I can rely on renaming to swap the temp databases into production reliably.
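
Concretely, the swap amounts to something like this, where the database names are illustrative and each numbered step is run as its own job, in sequence, by the orchestration module:

  (: Job 1: move the production databases aside :)
  db:alter('mirabel-content', 'mirabel-content-old'),
  db:alter('mirabel-links', 'mirabel-links-old')

  (: Job 2: promote the temp databases to the production names :)
  db:alter('mirabel-content-temp', 'mirabel-content'),
  db:alter('mirabel-links-temp', 'mirabel-links')

  (: Job 3: drop the old databases :)
  db:drop('mirabel-content-old'),
  db:drop('mirabel-links-old')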

 

I use my job orchestration module (https://github.com/ekimbernow/basex-orchestration) to manage the sequence of operations, where each job calls the next job in the sequence once it has finished.

 

This process works reliably for smaller volumes of content—for example, a content set with only a couple of thousand topics and four or five root maps.

But at full scale I’m consistently seeing that the link record keeping database, which contains only two large XML documents, never completes optimization: the database listing shows the database with two resources in it, but when I open the database’s page they do not show up, and the job that performs the optimization never completes, leaving the database in a locked state. This means the new where-used index can’t be put into production.

 

I feel like I’m going about this the wrong way to make the best use of BaseX and avoid this problem with very large databases, but I don’t see any obvious alternative approaches. It feels like I’m missing something fundamental or making a silly error that I can’t see.

 

So my question:

 

How would you solve this problem?

 

In particular, how would you go about constructing the where-used index in a way that works best with BaseX?

 

Or maybe the question is “should I be updating the in-production database with the new data and doing the swap into production within the database itself?” (i.e., by replacing or renaming the where-used index document rather than renaming the database itself).
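
If that is the better direction, I assume the in-place update would be something like the following, where the names are again illustrative and db:put replaces the resource at the given path; this still assumes the content database’s node IDs have not changed:

  (: Replace the where-used document inside the production link database
     rather than swapping whole databases. :)
  let $new-index := db:get('mirabel-links-temp', 'where-used.xml')
  return db:put('mirabel-links', $new-index, 'where-used.xml')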

 

I am currently using BaseX 11.6 and can move to 12 once it is released.

 

Thanks,

 

Eliot

_____________________________________________

Eliot Kimber

Sr. Staff Content Engineer

O: 512 554 9368

 

servicenow

 

servicenow.com

LinkedIn | X | YouTube | Instagram