I'm doing a BaseX evaluation to characterize its performance with my use case.
In my use case, I am receiving XML files over the network at a rate of about 100 / sec. I want to insert them into BaseX and perform real-time analytics. These XML files range from 20 to 50 nodes and in general have a similar structure. I will be periodically running XQuery expressions on this dataset.
I am using the network API in Python to add documents in a tight loop to determine throughput. I notice that when the database is empty, adding 1000 documents takes about 1 second. When the database is loaded up with 30,000 documents, adding these 1000 documents takes about 5 seconds. I was able to get much better scaling with AUTOFLUSH off and periodically executing a flush command. So far, with a million documents in the database, adding documents seems to take close to constant time, which is great.
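For readers who want to reproduce this, here is a minimal sketch of such a loop. It assumes `session` behaves like the `Session` class of the BaseXClient Python module (an `execute(command)` method and an `add(path, content)` method) and that a database is already open. The function name `bulk_add`, the `flush_every` interval, and the document paths are made up for illustration and are not taken from the original test.

```python
def bulk_add(session, docs, flush_every=1000):
    """Add (path, xml) pairs, flushing to disk every `flush_every` documents.

    Assumption: `session` exposes execute(command) and add(path, content),
    as the BaseXClient Session class does, and a database is already open.
    """
    # Stop BaseX from flushing buffers to disk after every single update.
    session.execute("SET AUTOFLUSH false")
    pending = 0
    for path, xml in docs:
        session.add(path, xml)
        pending += 1
        if pending >= flush_every:
            session.execute("FLUSH")  # write buffered updates to disk
            pending = 0
    if pending:
        session.execute("FLUSH")  # flush the final partial batch
```

A smaller `flush_every` limits how much unflushed data is at risk if the server stops unexpectedly; a larger one reduces disk writes. The right value would have to be measured.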
I haven't gotten to optimizing queries yet. Doing some simple aggregate queries like count(//), it looks like the query time scales linearly with the number of documents in the database.
Are there any general strategies for optimizing real-time analytics use cases? Are there options that can be tuned to improve document insertion speed, query speed, and scaling? Or perhaps indexing options that might work better for a dataset that is always changing and growing?
Thanks, - Simon
Hi Simon,
> I haven't gotten to optimizing queries yet. Doing some simple aggregate queries like count(//), it looks like the query time scales linearly with the number of documents in the database.
This is true, and it depends a lot on the types of queries you are performing:
• If your database is fully optimized (which can be attained by running the 'optimize' command, or 'db:optimize', on the database), many count() calls can be evaluated in constant time, because the database statistics will be utilized for evaluation.
• If you activate the 'updindex' flag for a database, value indexes can be kept up-to-date, and queries such as //*[text() = 'abc'] will be evaluated by the index [1]. The InfoView panel [2] can be checked to see how queries will be compiled.
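The two points above can be sketched from a Python client as follows. This assumes `session` follows the BaseXClient `Session` interface, where `execute(command)` returns the result as a string, and that UPDINDEX is set before the database is created. The helper names and the query passed in are illustrative only.

```python
def create_indexed_db(session, name):
    """Create a database whose value indexes are updated incrementally."""
    # Assumption: UPDINDEX only applies to databases created after it is set.
    session.execute("SET UPDINDEX true")
    session.execute("CREATE DB " + name)


def optimize_and_query(session, name, query):
    """Optimize the database, then run `query` and return its string result."""
    session.execute("OPEN " + name)
    session.execute("OPTIMIZE")  # rebuilds index structures and statistics
    return session.execute("XQUERY " + query)
```

For example, `optimize_and_query(session, "docs", "count(//*[text() = 'abc'])")` would run the query from the second bullet against a freshly optimized database; whether the index is actually used is best confirmed from the compiled query plan.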
Please note that databases in BaseX are pretty light-weight. For example, you can access more than one database in a single XQuery expression. If you have free time slots for maintenance, you could work with one or more static databases, which are completely indexed and optimized, and one incremental database for new documents. During maintenance, you can move the new documents into the static database pool and benefit from all query optimizations.
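As a rough illustration of querying the static and incremental databases together, the helper below builds a single XQuery expression that sums a count over several databases. It assumes the databases can be addressed by name through `collection()`; the function name `count_across` and the database names in the usage note are hypothetical.

```python
def count_across(db_names, path="//*"):
    """Build an XQuery string that sums count(<path>) over several databases.

    Assumption: each name is a plain database name (no quotes) that
    collection() can resolve.
    """
    if not db_names:
        return "0"  # nothing to count; still a valid XQuery expression
    parts = ["count(collection('%s')%s)" % (name, path) for name in db_names]
    return " + ".join(parts)
```

Calling `count_across(["static", "incoming"])` gives one expression covering both databases, which could then be sent to the server as a single query.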
Hope this helps, Christian
[1] http://docs.basex.org/wiki/Indexes
[2] http://docs.basex.org/wiki/Graphical_User_Interface#Visualizations
basex-talk@mailman.uni-konstanz.de