BaseX scaling and real-time analytics - BaseX-Talk - mailman.uni-konstanz.de

24 Apr 2017


I'm evaluating BaseX to characterize its performance for my use case.
In my use case, I am receiving XML files over the network at a rate of
about 100 / sec. I want to insert them into BaseX and perform real-time
analytics. These XML files range from 20 to 50 nodes and in general have a
similar structure. I will be periodically running XQuery queries on this dataset.
I am using the network API in Python to add documents in a tight loop to
determine throughput. I notice that when the database is empty, adding 1000
documents takes about 1 second. When the database is loaded up with 30,000
documents, adding these 1000 documents takes about 5 seconds. I was able to
get much better scaling with AUTOFLUSH off, and periodically executing a
FLUSH command. So far, with a million documents in the database, adding
documents seems to take close to constant time, which is great.
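
A minimal sketch of this batching pattern, assuming a session object that exposes `execute(command)` and `add(path, content)` in the way the BaseXClient Python module's `Session` does, with a database already opened on it. The function name `add_in_batches`, the `flush_every` parameter, and its default of 1000 are illustrative choices, not values taken from the measurements above.

```python
def add_in_batches(session, docs, flush_every=1000):
    """Add (path, xml) pairs to the open database, flushing periodically.

    `session` is assumed to behave like a BaseXClient Session:
    execute(command) runs a BaseX command, add(path, content) adds a document.
    """
    # Stop BaseX from flushing buffers to disk after every single update.
    session.execute("SET AUTOFLUSH false")

    pending = 0  # documents added since the last explicit flush
    total = 0    # documents added overall
    for path, xml in docs:
        session.add(path, xml)
        pending += 1
        total += 1
        if pending >= flush_every:
            # Persist the whole batch in one go.
            session.execute("FLUSH")
            pending = 0

    # Flush whatever is left over from the last, incomplete batch.
    if pending:
        session.execute("FLUSH")
    return total
```

With AUTOFLUSH off, anything added since the last FLUSH may be lost if the server stops unexpectedly, so the batch size is a trade-off between throughput and how much data is at risk.
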
I haven't gotten to optimizing queries yet. With some simple aggregate
queries like count(//), the query time appears to scale linearly with
the number of documents in the database.
Are there any general strategies for optimizing real-time analytics use
cases? Any options that can be tuned to improve document insertion and
query speed and scaling? Or perhaps indexing options that might work
better for a dataset that is always changing and increasing?
Thanks,
- Simon