Hi Martín,
AUTOFLUSH=false TEXTINDEX=false ATTRINDEX=false
Looks like a sound way to do it. If consistency is critical, you'll need to ensure that your data will be flushed once in a while.
As Fabrice indicated in an earlier answer (thanks!), you could also do some testing with the ADD command or db:add. By default, our REST API checks if a newly added document already exists in the database. If you know that your added documents will always be new, you can skip this existence check. This way, you can easily store more than a million documents in a single database in one hour [1]. If you go this way, you should probably start with a new database, because the first call of a replace operation will create an additional document index, which is maintained from then on.
It would obviously be more convenient to use the existing REST API for that. We could possibly introduce a query parameter to the PUT method in order to skip the existence check.
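As a rough sketch of the ADD-based approach with delayed flushing, the following Python helper only builds the text of a BaseX command script; it does not talk to a server. The function name `build_add_script`, the database name, the file paths and the flush interval are all made up for illustration. SET, CREATE DB, ADD, FLUSH and CLOSE are regular BaseX commands, but whether these exact settings suit your setup is something you would need to test.

```python
def build_add_script(db, docs, flush_every=1000):
    """Return BaseX command-script text that ADDs documents and flushes periodically.

    db:    name of a NEW database (hypothetical; CREATE DB replaces an existing one)
    docs:  iterable of (target_path, input_file) pairs; assumed to contain no whitespace
    """
    lines = [
        "SET AUTOFLUSH false",   # do not flush after every update
        "SET TEXTINDEX false",   # skip text index creation
        "SET ATTRINDEX false",   # skip attribute index creation
        f"CREATE DB {db}",       # start with a fresh, empty database
    ]
    pending = 0
    for target_path, input_file in docs:
        # ADD does not check whether the target path already exists
        lines.append(f"ADD TO {target_path} {input_file}")
        pending += 1
        if pending == flush_every:
            lines.append("FLUSH")  # write buffered data to disk once in a while
            pending = 0
    if pending:
        lines.append("FLUSH")      # flush whatever is left at the end
    lines.append("CLOSE")
    return "\n".join(lines)


if __name__ == "__main__":
    docs = [(f"log{i}.xml", f"/tmp/logs/log{i}.xml") for i in range(5)]
    print(build_add_script("logs_test", docs, flush_every=2))
```

The resulting text could be saved as a command script and run with a BaseX client. A smaller `flush_every` limits how much data is at risk if the server stops unexpectedly, at the cost of more disk writes.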
Hope this helps, Christian
[1] http://docs.basex.org/wiki/Twitter
Also, I'm using the C# client instead of the REST one, along with a pool of connections so as to avoid issuing an extra Open() call each time a file is sent to the server. Inserting 5000 files into a 1.2 GB database now takes 50 secs. That is still slower than inserting into an empty database, but a lot faster than the 6 minutes I was getting on a DB half the size.
Now I need to see the drawbacks of this configuration for our purposes, but just wanted to share this.
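For readers who want to try the connection-pooling idea mentioned above, here is a minimal, generic sketch in Python rather than C#. It is not the code used in this test: the `ConnectionPool` class and the `factory` callable are hypothetical, and `factory` stands for whatever creates a session and opens the database once.

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    """Keeps a fixed number of already-opened connections for reuse."""

    def __init__(self, factory, size):
        # factory() is assumed to connect AND open the database, so the
        # open cost is paid once per pooled connection, not once per file.
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(factory())

    @contextmanager
    def connection(self, timeout=None):
        # Blocks until a connection is free (raises queue.Empty on timeout).
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            # Always hand the connection back, even if the caller failed.
            self._idle.put(conn)


if __name__ == "__main__":
    opened = []

    def fake_factory():
        # Stand-in for a real client session; records how many were opened.
        opened.append(object())
        return opened[-1]

    pool = ConnectionPool(fake_factory, size=4)
    for _ in range(5000):
        with pool.connection() as conn:
            pass  # send one file over conn here
    print(len(opened))
```

A real pool would also need to detect and replace broken connections, which this sketch leaves out.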
Thanks, Martín.
From: ferrari_martin@hotmail.com
To: christian.gruen@gmail.com
Date: Thu, 30 Jul 2015 00:46:17 +0000
CC: basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] Performance and heavy load
Hi Christian, I've dug more into this problem. We've installed BaseX 8.2.3 on our Linux box. It looks like insertions get slower as the DB grows. With an empty database, I'm able to insert 5000 files of 10 KB each in 104 secs. However, with a DB of around 800 MB, the same test takes around six minutes to complete. I've tried with the REST interface and the C# client, with similar results. I've also tried using add instead of replace, and experimented with PARALLEL values of 1, 8 and 16, as suggested by Fabrice and Maximilan.
Our volume is really huge: we have several BaseX databases to which we add files all the time. Basically, we're logging requests and responses from different external services into BaseX. Maybe this is not a good use of BaseX? I don't think we can split the DBs, as it would result in too many DBs to manage.
I've also spotted some people asking about this, but with no resolution to their problems:
https://mailman.uni-konstanz.de/pipermail/basex-talk/2013-December/005990.ht...
https://mailman.uni-konstanz.de/pipermail/basex-talk/2013-December/005995.ht...
http://stackoverflow.com/questions/25113900/inserting-millions-of-xml-files-...
This is an excerpt from the logs, just to see how the test adds files:
REST interface
01:28:35.662 xx.yy.zz.ww:57162 admin REQUEST [PUT] http://xx.yy.zz.ww:8984/rest/mferrari_test_1/prueba55003.xml
01:28:35.719 xx.yy.zz.ww:57162 admin 201 0 resource(s) replaced in 21.27 ms. 57.9 ms
C# commands
01:48:51.530 xx.yy.zz.ww:62284 admin REQUEST OPEN mferrari_test_1 41.36 ms
01:48:51.531 xx.yy.zz.ww:62282 admin REQUEST ADD TO prueba070006.xml [...] 3.91 ms
01:48:51.568 xx.yy.zz.ww:62278 admin OK Resource(s) added in 123.96 ms. 125.52 ms
Thanks! Martín.
From: christian.gruen@gmail.com
Date: Tue, 28 Jul 2015 15:12:48 +0200
Subject: Re: [basex-talk] Performance and heavy load
To: ferrari_martin@hotmail.com
CC: basex-talk@mailman.uni-konstanz.de
Out of interest: Do you use a recent version of BaseX?
On Tue, Jul 28, 2015 at 3:34 AM, Martín Ferrari ferrari_martin@hotmail.com wrote:
Hi guys, I'm quite new to BaseX. I've read a bit already, but perhaps you can help so I can investigate further. We are having a performance problem with our BaseX server. We're running it on a VM, and hitting it from around 5 web servers.
Under no stress, I get this timing from the log for a 1191-byte file:
00:01:23.526 ww.aa.yy.xx:56312 admin REQUEST [PUT] http://basex.xxxxxx:8984/rest/PaymentLogs_1/WRP.BR-4273791-1_PaymentGateway_...
00:01:24.967 ww.aa.yy.xx:56312 admin 201 1 resource(s) replaced in 1401.17 ms. 1441.24 ms
A call to /rest takes about 4-5 ms (it's called about once every 2 seconds, though it's not needed):
00:01:23.520 ww.aa.yy.zz:56312 admin REQUEST [GET] http://basex.xxxxxxxx:8984/rest
00:01:23.524 ww.aa.yy.xx:56312 admin 200 4.67 ms
Is a time of 1400 ms normal for storing one XML file of less than 2 KB (storing a 10 KB file took 1200 ms, so I'm not sure size matters that much)?
Also, when the load gets heavier, from 7 to 12 files per second, the BaseX server quickly gets slower and then takes minutes to respond, until it finally starts giving errors about the database being currently opened by another process, and about too many open files. Many connections remain in the CLOSE_WAIT state, and the server is no longer usable.
Is it reasonable to expect to [PUT] more than 10 files per second, some of them larger than 10 KB? We're using it for logging, so that's a lot of XML files. If it's reasonable to use it that way, I'll dig more into optimizing it. Is anyone using it in a similar way?
Thanks, Martín.