Hi Christian,
I've run the revised Java example you attached on my machine.
When inserting the same 16 KB XML structure 56,000 times, the BaseX server logs insert times that begin at ~40 ms and end at ~150 ms (see the attached AddDocs2_using_56000_19KB_from_console.log). The thread dump of the BaseX JVM is attached as 56000_19KB_xml_java.hprof.txt.
When inserting the same 19 KB XML structure 100,000 times, the duration of the persist operations begins at ~40 ms and ends at ~250 ms (see AddDocs2_using_100000_19KB_xml_from_console.log). The related JVM thread dump is attached as 100000_19KB_xml_java.hprof.txt.
When inserting a 160 KB XML structure 100,000 times, the duration of the persist operation starts at ~45 ms and reaches ~2000 ms after 68,000 persist invocations and 16 hours of run time (!) (see the attached AddDocs2_using_100000_160KB_1from2.log and AddDocs2_using_100000_160KB_2from2.log). The related JVM thread dump is attached as 100000_160KB_xml_java.hprof.txt.
There are actually two AUTOFLUSH-related issues. When AUTOFLUSH is active (explicitly or implicitly), the insert operation durations always increase monotonically, independent of the system. When AUTOFLUSH is explicitly deactivated (in order to achieve a constant insert operation duration), the probability of losing data increases.
Regards, Lucian
________________________________________
From: Christian Grün [christian.gruen@gmail.com]
Sent: Saturday, 14 January 2017 12:09
To: Bularca, Lucian
Cc: basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] Severe performance loss when persisting more than 5000 XML data structures of 160 KB each.
Hi Lucian,
I have a hard time reproducing the reported behavior. The attached, revised Java example (without AUTOFLUSH) required around 30 ms for the first documents and 120 ms for the last documents, which is still pretty far from what you’ve been encountering:
"... would rise from ~10 ms at the start to ~2500 ms"
But obviously something weird has been going on in your setup. Let’s see what alternatives we have…
• Could you possibly try to update my example code such that it shows the reported behavior? Ideally with small input, in order to speed up the process. Maybe the runtime increase can also be demonstrated after 1,000 or 10,000 documents...
• You could also send me a list of the files of your test_database directory; maybe the file sizes indicate some unusual patterns.
• You could start BaseXServer with the JVM flag -Xrunhprof:cpu=samples (to be inserted in the basexserver script), start the server, run your script, stop the server directly afterwards, and send me the result file, which will be stored in the directory from where you started BaseX (java.hprof.txt).
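For the first point, a small timing harness can make a runtime increase visible after a few thousand documents. The following is only a minimal sketch, not the attached example: the class name InsertTimer, the method averagePerWindow and the in-memory stand-in workload are all assumptions made for illustration. The insert callback is where a real database call would go.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.IntConsumer;

/** Times a sequence of insert operations and averages them per window (hypothetical helper). */
final class InsertTimer {

    /**
     * Calls insert for i = 0..count-1 and returns the mean duration in
     * milliseconds of each consecutive window of `window` calls.
     * A shorter final window is averaged over the calls it actually contains.
     */
    static List<Double> averagePerWindow(int count, int window, IntConsumer insert) {
        if (count < 0 || window <= 0) {
            throw new IllegalArgumentException("count must be >= 0 and window > 0");
        }
        List<Double> averages = new ArrayList<>();
        long windowNanos = 0;
        int inWindow = 0;
        for (int i = 0; i < count; i++) {
            long start = System.nanoTime();
            insert.accept(i);
            windowNanos += System.nanoTime() - start;
            inWindow++;
            if (inWindow == window || i == count - 1) {
                averages.add(windowNanos / 1e6 / inWindow);
                windowNanos = 0;
                inWindow = 0;
            }
        }
        return averages;
    }

    public static void main(String[] args) {
        // Stand-in workload (assumption): appends to a list instead of a database.
        List<String> store = new ArrayList<>();
        int window = 1_000;
        List<Double> averages = averagePerWindow(10_000, window,
                i -> store.add("<doc id='" + i + "'/>"));
        for (int w = 0; w < averages.size(); w++) {
            System.out.println(String.format(Locale.ROOT,
                    "%d: %.4f ms per insert", (w + 1) * window, averages.get(w)));
        }
    }
}
```

If the averages grow from window to window with a real insert call plugged in, the increase is reproduced; if they stay flat, the cause probably lies elsewhere in the setup.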
Best, Christian
On Wed, Jan 11, 2017 at 4:57 PM, Christian Grün christian.gruen@gmail.com wrote:
Hi Lucian,
Thanks for your analysis. Indeed I’m wondering about the monotonic delay caused by auto flushing the data; this hasn’t always been the case. I’m wondering even more why no one else has noticed this recently... Maybe it was introduced not too long ago. It may take some time to find the culprit, but I’ll keep you updated.
All the best, Christian
On Wed, Jan 11, 2017 at 2:46 PM, Bularca, Lucian Lucian.Bularca@mueller.de wrote:
Hi Christian,
I've compared the persistence time series of your example code and mine, in all possible combinations of the following scenarios:
- with and without "set intparse on"
- using my prepared test data and your test data
- closing and reopening the DB connection after every n-th insert operation (where n in {5, 100, 500, 1000})
- with and without "set autoflush on".
I finally found out that the only relevant variable that influences the insert operation duration is the value of the AUTOFLUSH option.
If AUTOFLUSH = OFF when opening a database, the persistence durations remain relatively constant (about 43 ms on my machine) during the entire sequence of insert operations (50,000 or 100,000 times), for all combinations named above.
If AUTOFLUSH = ON when opening a database, the persistence durations increase monotonically, for all combinations named above.
With AUTOFLUSH = ON, the persistence duration is directly proportional to the number of DB clients executing these insert operations, and to the length of the sequence of insert operations executed by a DB client.
In my opinion, this behaviour is an issue in BaseX, because AUTOFLUSH is implicitly set to ON (see the BaseX documentation, http://docs.basex.org/wiki/Options#AUTOFLUSH), so DB clients must explicitly set AUTOFLUSH = OFF in order to keep the insert operation durations relatively constant over time. Additionally, not flushing data explicitly increases the risk of data loss (see the BaseX documentation, http://docs.basex.org/wiki/Options#AUTOFLUSH), but clients who repeatedly execute the FLUSH command increase the durations of the subsequent insert operations.
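The trade-off described here (AUTOFLUSH off for constant insert times, explicit flushes to limit data loss) can be made explicit with a wrapper that flushes after every n-th insert. This is a rough sketch under assumptions: DocStore and PeriodicFlusher are hypothetical names, not BaseX types, and the demo uses an in-memory stand-in. With a real BaseX session, the assumed counterparts would be the SET AUTOFLUSH false and FLUSH commands. Whether a given interval actually avoids the growing durations would still have to be measured.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical minimal view of a database session; not a BaseX type. */
interface DocStore {
    void add(String path, String xml);
    void flush(); // assumed to correspond to the FLUSH command
}

/** Flushes the wrapped store after every `interval` additions. */
final class PeriodicFlusher {
    private final DocStore store;
    private final int interval;
    private int pending; // additions since the last flush

    PeriodicFlusher(DocStore store, int interval) {
        if (interval <= 0) throw new IllegalArgumentException("interval must be > 0");
        this.store = store;
        this.interval = interval;
    }

    void add(String path, String xml) {
        store.add(path, xml);
        if (++pending >= interval) flushNow();
    }

    /** Flushes any remaining unflushed additions; call before closing the session. */
    void close() {
        if (pending > 0) flushNow();
    }

    int pending() {
        return pending;
    }

    private void flushNow() {
        store.flush();
        pending = 0;
    }
}

final class PeriodicFlusherDemo {
    public static void main(String[] args) {
        int[] flushes = {0};
        List<String> docs = new ArrayList<>();
        // In-memory stand-in (assumption): records paths and counts flushes.
        DocStore inMemory = new DocStore() {
            public void add(String path, String xml) { docs.add(path); }
            public void flush() { flushes[0]++; }
        };
        PeriodicFlusher flusher = new PeriodicFlusher(inMemory, 100);
        for (int i = 0; i < 250; i++) {
            flusher.add("doc" + i + ".xml", "<doc/>");
        }
        flusher.close();
        // prints "250 documents, 3 flushes"
        System.out.println(docs.size() + " documents, " + flushes[0] + " flushes");
    }
}
```

With this policy, at most interval documents are unflushed at any time, so the interval bounds the possible data loss.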
Regards, Lucian
From: Christian Grün [christian.gruen@gmail.com]
Sent: Tuesday, 10 January 2017 17:33
To: Bularca, Lucian
Cc: Dirk Kirsten; basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] Severe performance loss when persisting more than 5000 XML data structures of 160 KB each.
Hi Lucian,
I couldn’t run your code example out of the box. 24 hours sounds pretty alarming, though, so I have written my own example (attached). It creates 50,000 XML documents, each sized around 160 KB. It’s not as fast as I had expected, but the total runtime is around 13 minutes, and it only slows down a little when adding more documents...
10000: 125279.45 ms
20000: 128244.23 ms
30000: 130499.9 ms
40000: 132286.05 ms
50000: 134814.82 ms
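To put these figures into per-document terms, the arithmetic can be sketched as follows. This assumes that each figure is the time spent on one batch of 10,000 documents (the message does not say so explicitly); BatchTimings and its methods are hypothetical names.

```java
import java.util.Locale;

/** Per-document averages from the batch timings above (hypothetical helper). */
final class BatchTimings {

    /** Mean time per document in a batch, in milliseconds. */
    static double perDocumentMs(double batchMs, int batchSize) {
        return batchMs / batchSize;
    }

    /** Relative increase from the first to the last value, e.g. 0.05 for 5%. */
    static double relativeIncrease(double first, double last) {
        return (last - first) / first;
    }

    public static void main(String[] args) {
        // Figures copied from the message; the batch size is an assumption.
        double[] batchMs = {125279.45, 128244.23, 130499.9, 132286.05, 134814.82};
        int batchSize = 10_000;
        for (int i = 0; i < batchMs.length; i++) {
            System.out.println(String.format(Locale.ROOT, "%d: %.2f ms per document",
                    (i + 1) * batchSize, perDocumentMs(batchMs[i], batchSize)));
        }
        double increase = relativeIncrease(batchMs[0], batchMs[batchMs.length - 1]);
        System.out.println(String.format(Locale.ROOT,
                "increase from first to last batch: %.1f%%", 100 * increase));
    }
}
```

Under that assumption, the average goes from roughly 12.5 ms to 13.5 ms per document, an increase of about 7.6% between the first and the last batch.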
Maybe you could compare the code with yours, and we can find out what causes the delay?
Best, Christian
On Tue, Jan 10, 2017 at 4:44 PM, Bularca, Lucian Lucian.Bularca@mueller.de wrote:
Hi Dirk,
Of course, querying millions of data entries in a single database raises problems. This is equally problematic for all databases, not only for BaseX, and certain storage strategies will be mandatory in production.
The actual problem is that adding 50,000 XML structures of 160 KB each took 24 hours because of that inexplicable monotonic increase of the insert operation durations.
I'd really appreciate it if someone could explain this behaviour, or if a counterexample could demonstrate that the cause of this behaviour lies in the test case and is not inherent to the DB.
Regards, Lucian