Hi,
we're using BaseX to store multiple collections of documents (we call them records).
These records are produced programmatically, by parsing an incoming stream on a server application and turning it into a document of the form
<record id="123" version="1"> ... </record>
So far I've taken the following approach:
- each collection of records is its own database in BaseX, for easier management
- on insertion (a sketch follows this list):
  - set the session's autoflush to false
  - iterate over the records, adding each via add(id, document)
  - every 10,000 records, flush
  - finally, flush once more
  - create the attribute index
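For reference, here's a minimal sketch of that insertion loop, using ClientSession from the BaseX client API (org.basex.api.client in recent versions; older releases have it under org.basex.server). The host, credentials, database name, and the incomingRecords() source are placeholders:

    import java.io.ByteArrayInputStream;
    import java.io.InputStream;
    import java.nio.charset.StandardCharsets;
    import org.basex.api.client.ClientSession;

    public class BulkInsert {
      public static void main(String[] args) throws Exception {
        // placeholder connection details
        try (ClientSession session =
            new ClientSession("localhost", 1984, "admin", "admin")) {
          session.execute("OPEN col1");            // target database
          session.execute("SET AUTOFLUSH false");  // defer flushing to us
          int count = 0;
          for (String record : incomingRecords()) { // hypothetical record source
            InputStream in = new ByteArrayInputStream(
                record.getBytes(StandardCharsets.UTF_8));
            session.add("record-" + count + ".xml", in);
            if (++count % 10000 == 0) session.execute("FLUSH");
          }
          session.execute("FLUSH");                // flush the remainder
          session.execute("CREATE INDEX attribute");
        }
      }

      // stand-in for the server application's parsed stream
      static Iterable<String> incomingRecords() {
        return java.util.List.of(
            "<record id=\"123\" version=\"1\">...</record>");
      }
    }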
So for example now we have:
    Name     Resources   Size         Input Path
    ------------------------------------------------
    col1         14141     19815190
    col2         14750     16697081
    col3         84450    253593687
    col4       1012477   2107593252
    col5        126058    186315175
    col6         13767     14640701
    col7        815991    730536864
    col8         31189     39598405
    col9         24733     91277637
    col10       171906    202392553
    ...
and there'll be quite a bit more coming in.
This kind of bulk insertion can also happen concurrently (I've set up an actor pool of five for the moment); a rough sketch of the setup follows.
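To make the concurrent setup concrete, here's roughly its shape, with a plain ExecutorService of five workers standing in for the actors; each worker uses its own ClientSession (sessions aren't meant to be shared across threads), and the host, credentials, and database names are placeholders:

    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import org.basex.api.client.ClientSession;

    public class ConcurrentLoad {
      public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(5); // mirrors the pool of five
        for (String db : List.of("col1", "col2", "col3")) {     // placeholder names
          pool.submit(() -> {
            // one session per worker; each bulk-inserts into its own database
            try (ClientSession s =
                new ClientSession("localhost", 1984, "admin", "admin")) {
              s.execute("OPEN " + db);
              s.execute("SET AUTOFLUSH false");
              // ... add records as in the insertion sketch above ...
              s.execute("FLUSH");
            }
            return null;
          });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
      }
    }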
My questions are:
- is this the most performant approach, or would it make sense to e.g. build one stream on the fly and somehow turn it into an InputStream to be sent via add?
- is there a performance cost in adding with an ID? We don't really need the IDs, since we retrieve records via a query (sketched below), and those resources aren't really files on the file system anyway
- is there a performance penalty in doing this kind of parsing concurrently?
- are there any JVM parameters that would help speed this up? I haven't quite figured out how to pass JVM parameters when starting basexserver via the command line. It looks like BaseX gave itself an -Xmx of 1866006528 bytes (but that machine has 8 GB, so it could in theory get more).
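For context, retrieval looks essentially like the following attribute lookup; the database name and id value are placeholders, and I'm assuming the attribute index turns the predicate into an index lookup:

    import org.basex.api.client.ClientQuery;
    import org.basex.api.client.ClientSession;

    public class Lookup {
      public static void main(String[] args) throws Exception {
        try (ClientSession session =
            new ClientSession("localhost", 1984, "admin", "admin")) {
          // fetch a record by its id attribute
          ClientQuery query = session.query(
              "db:open('col1')//record[@id = '123']");
          while (query.more()) {
            System.out.println(query.next());
          }
          query.close();
        }
      }
    }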
Thanks!
Manuel