Hello,
as part of a performance evaluation of persisting XML data structures in a BaseX database, we have observed steadily decreasing persistence rates, inversely proportional to the size of the database.
This behavior is explainable and would also be acceptable, if the persistence duration for an ~ 160 KB XML data structure did not rise from ~ 10 ms at the beginning to ~ 2500 ms after ~ 50,000 persistence operations.
In our test, we store 100,000 distinct XML data structures of roughly 160 KB each into a BaseX database via the Java API, measuring both the total duration and the duration of each individual persistence operation. The BaseX database was started in HTTP mode (basexhttp) with -Xmx 4048m.
The measurements above remained the same regardless of whether all XML data structures were stored within a single session, or whether the socket (DB connection) was closed and reopened after every 500 persistence operations. Indexing the database in between (via the GUI's "Optimize All" or "Create Text Index") did not influence or improve the persistence rates.
An example of the test classes we used (exemplary only, not compilable!) can be found in the attachment BaseXClient.java.zip to this e-mail.
In general, are persistence rates of no better than 160 KB / 2500 ms to be expected once more than 30,000 entries exist in BaseX, or can we drastically improve these persistence times (and if so, how)?
Best regards, Lucian Bularca
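The measurement loop described above can be sketched as follows. This is a minimal, self-contained harness under stated assumptions: documents of roughly 160 KB are generated in memory, and the actual session.add call (from the BaseXClient example class shipped with BaseX) is shown only as a comment, since it requires a running server.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

public class AddTimingSketch {

    // Build a distinct XML document of roughly 160 KB for iteration i.
    static String buildDoc(int i) {
        StringBuilder sb = new StringBuilder("<doc id=\"" + i + "\">");
        while(sb.length() < 160 * 1024) {
            sb.append("<entry>payload-").append(i).append("</entry>");
        }
        return sb.append("</doc>").toString();
    }

    public static void main(String[] args) throws Exception {
        // In a real run, open a client session first, e.g.:
        // BaseXClient session = new BaseXClient("localhost", 1984, "admin", "admin");
        // session.execute("CREATE DB test");
        for(int i = 0; i < 5; i++) {  // 100,000 iterations in the real test
            String doc = buildDoc(i);
            long start = System.nanoTime();
            // session.add("doc" + i + ".xml",
            //     new ByteArrayInputStream(doc.getBytes(StandardCharsets.UTF_8)));
            long ms = (System.nanoTime() - start) / 1_000_000;
            // Log the per-operation duration, not just the total runtime.
            System.out.println(i + ": " + ms + " ms, " + doc.length() + " bytes");
        }
    }
}
```

The key point of the harness is that each insertion is timed individually, so a monotonic slowdown becomes visible.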
Hello Lucian,
please be aware that this is an English-speaking mailing list: we have many users from all over the world, and the list is intended to help everyone. But as most of our team members are German (well, and Bavarians...), we of course understand it. Hence, I answer in English (for all others: Lucian seems to be seeing performance issues when adding many documents).
First of all, are you sure your tests sufficiently measure the add performance? Looking at your file TestBaseXClient.java, it seems not to record the runtimes of the individual insertions, but just the overall runtime of, in this case, 100,000 insertions.
Also, at least in the example you provided, you do some other work as well (especially printing to sysout), which obviously also has a performance impact.
Optimizing or creating indexes in the middle of a mass update should not increase the speed: it builds the indexes, which are invalidated again by the subsequent updates, so I would not expect any speed-up here.
What version of BaseX did you use?
Did you set AUTOFLUSH (see http://docs.basex.org/wiki/Options#AUTOFLUSH) to false? This should benefit performance.
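For reference, the AUTOFLUSH setting can be applied per session before the additions; a minimal command sequence could look like this (a sketch; the database and document names are placeholders):

```
SET AUTOFLUSH false
OPEN test_database
ADD doc1.xml
ADD doc2.xml
FLUSH
```

With AUTOFLUSH off, an explicit FLUSH writes the pending data to disk.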
In general it is also a good architectural approach to split documents up into many databases instead of having one large database. Given that you can access as many databases as you want in one query, you will not lose any query capabilities, whereas with a single large database you might at some point encounter certain limits. However, I would expect 100,000 documents added to be much of a problem.
As a side note: as it seems you are evaluating BaseX, and I guess you are doing this for a reason, it might be faster/easier to talk to our BaseX team members, who can of course help you evaluate your problem and identify whether BaseX is the right choice for your use case. Take a look at http://basexgmbh.de/ for our commercial offerings.
Cheers
Dirk
Hi Dirk,
thanks for your fast reply :)
Regarding the performance measurements, I forgot to mention that I based my statements on the entries in the BaseX log file (see attached basex.log). The System.out call in each iteration is only meant to log the sequence number of the added XML structure, not the duration of a persist operation. This System.out does have an impact on the overall performance, but it cannot explain the monotonic increase of the insert operation durations (see the attached basex.log file). After 24 hours of inserting XML test structures, only half of the 100,000 test structures had been added to the database, at a rate of at most 1 structure / 2 seconds.
All these tests were made against version 8.5.3 of BaseX.
In production, we expect peaks of 2.7 * 10^5 XML structures to persist per 24 hours (~ 31 XML structures per second). Do you mean with "However, I would expect 100,000 documents added to be much of a problem." that persisting 100,000 XML structures in the BaseX database is problematic?
Regards, Lucian
--
Dirk Kirsten, BaseX GmbH, http://basexgmbh.de
|-- Firmensitz: Blarerstrasse 56, 78462 Konstanz
|-- Registergericht Freiburg, HRB: 708285, Geschäftsführer:
|   Dr. Christian Grün, Dr. Alexander Holupirek, Michael Seiferle
`-- Phone: 0049 7531 91 68 276, Fax: 0049 7531 20 05 22
Hi Lucian,
sorry, I obviously forgot a very important word: "not". I would NOT expect adding 100,000 documents to be much of a problem. Sorry for the confusion.
The log file is interesting, it certainly looks like the performance is degrading. I can't say much about it, but I am sure Christian (our head architect) will give you some pointers when he has time to answer.
However, given the description of your problem, I would advise you to rethink your architecture in general. So many constant updates do not seem very performant to me when done on only a single database. So maybe you want to split up your data; e.g., you could put all documents of a certain day into a separate database.
Or you could have one "up-to-date" database, which you always update, and transfer its entries into another database during low-load times. That other database could have proper indexes and whatever else you need.
Because otherwise you will run into problems when querying your data. I guess you don't want to just store your data; you want to do something with it, don't you? Just storing data without using it seems a bit useless... And for that you probably want some indexes, and keeping an index up to date under constant updates is quite costly.
To sum it up: I think you want to split up your data in some way into several databases.
However, I understand that you will still have something like 100,000 documents in a database (which should be fine), so your current performance issue will still exist. My comment is aimed more at your general architecture.
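The per-day split suggested above could be as simple as deriving the target database name from the document's date. A small sketch (the naming scheme is made up for illustration; the cross-database query in the comment uses BaseX's db:open function):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class DbRouting {

    // Map a document's date to a database name,
    // e.g. 2017-01-10 -> "docs-2017-01-10".
    static String dbNameFor(LocalDate date) {
        return "docs-" + date.format(DateTimeFormatter.ISO_LOCAL_DATE);
    }

    public static void main(String[] args) {
        LocalDate day = LocalDate.of(2017, 1, 10);
        System.out.println(dbNameFor(day));
        // A single XQuery can still span several of these databases, e.g.:
        // for $e in (db:open('docs-2017-01-09'), db:open('docs-2017-01-10'))//entry
        // return $e
    }
}
```

Each daily database stays small, so updates and index maintenance remain cheap, while queries can still address any set of days.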
Cheers
Dirk
Hi Dirk,
of course, querying millions of entries in a single database raises problems. This is equally problematic for all databases, not only for BaseX, and certain storage strategies will be mandatory in production.
The actual problem is that adding 50,000 XML structures of 160 KB each took 24 hours, because of that inexplicable monotonic increase in the insert operation durations.
I would really appreciate it if someone could explain this behaviour, or if a counterexample could demonstrate that its cause lies in the test case rather than in the database itself.
Regards, Lucian
Hi Lucian,
I couldn’t run your code example out of the box. 24 hours sounds pretty alarming, though, so I have written my own example (attached). It creates 50,000 XML documents, each sized around 160 KB. It’s not as fast as I had expected, but the total runtime is around 13 minutes, and it only slows down a little when adding more documents...
10000: 125279.45 ms
20000: 128244.23 ms
30000: 130499.9 ms
40000: 132286.05 ms
50000: 134814.82 ms
Maybe you could compare the code with yours, and we can find out what causes the delay?
Best, Christian
Hi Christian,
I've made a comparison of the persistence time series running your example code and mine, in all possible combinations of the following scenarios:
- with and without "set intparse on"
- using my prepared test data and your test data
- closing and reopening the DB connection after every n-th insert operation (where n in {5, 100, 500, 1000})
- with and without "set autoflush on".
I finally found out that the only relevant variable influencing the insert operation duration is the value of the AUTOFLUSH option.
If AUTOFLUSH = OFF when opening a database, the persistence durations remain relatively constant (about 43 ms on my machine) during the entire insert sequence (50,000 or 100,000 operations), for all combinations named above.
If AUTOFLUSH = ON when opening a database, the persistence durations increase monotonically, for all combinations named above.
With AUTOFLUSH = ON, the persistence duration is directly proportional to the number of DB clients executing these insert operations, and likewise to the length of the insert sequence executed by a DB client.
In my opinion, this behaviour is an issue in BaseX, because AUTOFLUSH is implicitly set to ON (see the BaseX documentation, http://docs.basex.org/wiki/Options#AUTOFLUSH), so DB clients must explicitly set AUTOFLUSH = OFF in order to keep the insert operation durations relatively constant over time. Additionally, not flushing data explicitly increases the risk of data loss (see the same documentation), but clients that repeatedly execute the FLUSH command increase the durations of the subsequent insert operations.
Regards, Lucian
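The trade-off described above (AUTOFLUSH off for speed, explicit FLUSH to bound potential data loss) suggests flushing every n additions. A minimal sketch of that compromise; the server calls are commented out since they need a running BaseX instance, and n = 500 is an arbitrary illustration, not a recommendation:

```java
public class BatchedFlushSketch {

    // Flush after every n successful additions.
    static boolean shouldFlush(int added, int n) {
        return added > 0 && added % n == 0;
    }

    public static void main(String[] args) {
        int n = 500;
        // session.execute("SET AUTOFLUSH false");
        for(int i = 1; i <= 2000; i++) {
            // session.add("doc" + i + ".xml", input(i));
            if(shouldFlush(i, n)) {
                // session.execute("FLUSH");
                System.out.println("flush after document " + i);
            }
        }
    }
}
```

At most n documents are unflushed at any time, while the per-insert cost of auto-flushing is avoided.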
Hi Lucian,
Thanks for your analysis. Indeed I’m wondering about the monotonic delay caused by auto-flushing the data; this hasn’t always been the case. I’m wondering even more why no one else has noticed this recently... maybe it’s not too long ago that this was introduced. It may take some time to find the culprit, but I’ll keep you updated.
All the best, Christian
On Wed, Jan 11, 2017 at 2:46 PM, Bularca, Lucian Lucian.Bularca@mueller.de wrote:
Hi Christian,
I've made a comparation of the persistence time series running your example code and mine, in all possible combinations of following scenarios:
- with and without "set intparse on"
- using my prepared test data and your test data
- closing and opening the DB connection each "n"-th insertion operation (where n in {5, 100, 500, 1000})
- with and without "set autoflush on".
I finally found out, that the only relevant variable that influence the insert operation duration is the value of the AUTOFLASH option.
If AUTOFLASH = OFF when opening a database, then the persistence durations remains relative constant (on my machine about 43 ms) during the entire insert operations sequence (50.000 or 100.000 times), for all possible combinations named above.
If AUTOFLASH = ON when opening a database, then the persistence durations increase monotonic, for all possible combinations named above.
The persistence duration, if AUTOFLASH = ON, is directly proportional to the number of DB clients executing these insert operations, respectively to the sequence length of insert operations executed by a DB client.
In my opinion, this behaviour is an issue of BaseX, because AUTOFLASH is implcitly set to ON (see BaseX documentation http://docs.basex.org/wiki/Options#AUTOFLUSH), so DB clients must explicitly set AUTOFLASH = OFF in order to keep the insert operation durations relatively constant over time. Additionally, no explicitly flushing data, increases the risk of data loss (see BaseX documentation http://docs.basex.org/wiki/Options#AUTOFLUSH), but clients how repeatedly execute the FLUSH command increase the durations of the subsequent insert operations.
Regards, Lucian
From: Christian Grün [christian.gruen@gmail.com] Sent: Tuesday, 10 January 2017 17:33 To: Bularca, Lucian Cc: Dirk Kirsten; basex-talk@mailman.uni-konstanz.de Subject: Re: [basex-talk] Severe performance degradation when persisting more than 5000 XML data structures of 160 KB each.
Hi Lucian,
I couldn’t run your code example out of the box. 24 hours sounds pretty alarming, though, so I have written my own example (attached). It creates 50,000 XML documents, each sized around 160 KB. It’s not as fast as I had expected, but the total runtime is around 13 minutes, and it only slows down a little when adding more documents...
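The attached generator isn't reproduced here; as a rough, self-contained sketch (element names and structure invented for illustration), a document of roughly 160 KB can be built like this:

```java
import java.util.Random;

// Hypothetical generator, similar in spirit to the attached example:
// fills an XML document with random <entry> elements until it reaches
// approximately the requested size in characters.
public class DocGenerator {
  static String generate(int targetBytes, long seed) {
    Random rnd = new Random(seed);
    StringBuilder sb = new StringBuilder("<doc>");
    while (sb.length() < targetBytes - 16) {
      sb.append("<entry id=\"").append(rnd.nextInt(1_000_000)).append("\">")
        .append(Long.toHexString(rnd.nextLong()))
        .append("</entry>");
    }
    return sb.append("</doc>").toString();
  }

  public static void main(String[] args) {
    String xml = generate(160 * 1024, 42);
    // The result should land within a few dozen bytes of the 160 KB target.
    System.out.println(xml.length() >= 160 * 1024 - 64
                    && xml.length() <= 160 * 1024 + 64);
  }
}
```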
10000: 125279.45 ms
20000: 128244.23 ms
30000: 130499.9 ms
40000: 132286.05 ms
50000: 134814.82 ms
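Per-block figures like these can be collected with a small harness. In this sketch the actual insert call (in the real test, a session add) is replaced by a placeholder operation, so the harness itself is self-contained and runnable anywhere:

```java
import java.util.ArrayList;
import java.util.List;

public class InsertTimer {
  // Stand-in for whatever per-document operation is being measured.
  interface Op { void run(int i); }

  // Times each call individually and returns the average duration in ms
  // for every block of `blockSize` calls, so a gradual slowdown shows up
  // as growing block averages rather than being hidden in one total.
  static List<Double> measure(Op op, int total, int blockSize) {
    List<Double> blockAverages = new ArrayList<>();
    long blockNanos = 0;
    for (int i = 1; i <= total; i++) {
      long t0 = System.nanoTime();
      op.run(i);
      blockNanos += System.nanoTime() - t0;
      if (i % blockSize == 0) {
        blockAverages.add(blockNanos / 1e6 / blockSize); // avg ms per op
        blockNanos = 0;
      }
    }
    return blockAverages;
  }

  public static void main(String[] args) {
    // Placeholder operation; the real harness would call the DB client here.
    List<Double> avgs = measure(i -> { }, 1000, 100);
    System.out.println(avgs.size()); // one average per block of 100
  }
}
```

Recording per-block (or per-call) durations instead of a single total is what makes a monotonic increase visible at all.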
Maybe you could compare the code with yours, and we can find out what causes the delay?
Best, Christian
On Tue, Jan 10, 2017 at 4:44 PM, Bularca, Lucian Lucian.Bularca@mueller.de wrote:
Hi Dirk,
Of course, querying millions of data entries in a single database raises problems. This is equally problematic for all databases, not only BaseX, and certain storage strategies will be mandatory at production time.
The actual problem is that adding 50,000 XML structures of 160 KB each took 24 hours, because of that inexplicable monotonic increase of the insert operation durations.
I would really appreciate it if someone could explain this behaviour, or if a counterexample could demonstrate that the cause lies in the test case rather than in the database itself.
Regards, Lucian
Hi Lucian,
I have a hard time reproducing the reported behavior. The attached, revised Java example (without AUTOFLUSH) required around 30 ms for the first documents and 120 ms for the last documents, which is still pretty far from what you’ve been encountering:
would go from ~10 ms at the beginning up to ~2500 ms
But obviously something weird has been going on in your setup. Let’s see what alternatives we have…
• Could you possibly try to update my example code such that it shows the reported behavior? Ideally with small input, in order to speed up the process. Maybe the runtime increase can also be demonstrated after 1,000 or 10,000 documents...
• You could also send me a list of the files in your test_database directory; maybe the file sizes indicate some unusual patterns.
• You could start BaseXServer with the JVM flag -Xrunhprof:cpu=samples (to be inserted in the basexserver script), start the server, run your script, stop the server directly afterwards, and send me the result file, which will be stored in the directory from where you started BaseX (java.hprof.txt).
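For the profiling suggestion, the modified server start might look like this (an assumption about a Unix-style setup; the exact jar name and classpath depend on the distribution, and the HPROF agent only exists on older JDKs, up to Java 8):

```shell
# Start the BaseX server with the legacy HPROF sampling profiler attached.
# After the server is stopped, the results land in java.hprof.txt in the
# working directory.
java -Xmx4g -Xrunhprof:cpu=samples,depth=16 -cp BaseX.jar org.basex.BaseXServer
```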
Best, Christian
On Wed, Jan 11, 2017 at 4:57 PM, Christian Grün christian.gruen@gmail.com wrote: […]
Possibly related, but I'm not sure:
When creating millions of databases in a loop in the same session, I found that after some thousands I'd get an OOM error from BaseX. This seemed odd to me because, after each iteration, the database creation query was closed (and I'd expect GC to run at such a time?). To bypass this I just closed the session and opened a new one after every couple of thousand iterations of the loop.
Maybe there is a (small) memory leak somewhere in BaseX that only becomes noticeable (and annoying) after hundreds of thousands or even millions of queries?
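The recycling pattern described above, as a sketch: `Session` here is a stand-in interface (with BaseX it would be the client session class), and the commands are illustrative.

```java
public class SessionRecycler {
  // Minimal stand-in for a client session; close() narrows AutoCloseable
  // so callers need not handle a checked exception in this sketch.
  interface Session extends AutoCloseable {
    void execute(String command);
    @Override void close();
  }

  static int sessionsOpened = 0;

  // Stand-in for opening a new client connection to the server.
  static Session open() {
    sessionsOpened++;
    return new Session() {
      public void execute(String command) { /* no-op in this sketch */ }
      public void close() { }
    };
  }

  public static void main(String[] args) {
    final int TOTAL = 10_000, RECYCLE_EVERY = 2_000;
    Session session = open();
    for (int i = 1; i <= TOTAL; i++) {
      session.execute("CREATE DB db" + i);
      if (i % RECYCLE_EVERY == 0) {   // drop accumulated client-side state
        session.close();
        session = open();
      }
    }
    session.close();
    System.out.println(sessionsOpened); // initial session + 5 recycles = 6
  }
}
```

If the OOM disappears with this workaround, the retained memory is tied to the session lifetime, which would indeed point at a leak per session rather than per query.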
-----Original Message----- From: basex-talk-bounces@mailman.uni-konstanz.de [mailto:basex-talk-bounces@mailman.uni-konstanz.de] On behalf of Christian Grün Sent: Saturday, 14 January 2017 12:09 To: Bularca, Lucian Lucian.Bularca@mueller.de CC: basex-talk@mailman.uni-konstanz.de Subject: Re: [basex-talk] Severe performance degradation when persisting more than 5000 XML data structures of 160 KB each.
Hi Christian,
I've run your attached revised Java example on my machine. When inserting the same 16 KB XML structure 56,000 times, the BaseX server logs insertion times beginning at ~40 ms and ending at ~150 ms (see attached AddDocs2_using_56000_19KB_from_console.log). The thread dump of the BaseX JVM is attached as 56000_19KB_xml_java.hprof.txt. When inserting the same 19 KB XML structure 100,000 times, the duration of the persist operations begins at ~40 ms and ends at ~250 ms (see AddDocs2_using_100000_19KB_xml_from_console.log). The related JVM thread dump is attached as 100000_19KB_xml_java.hprof.txt. When inserting a 160 KB XML structure 100,000 times, the persist operation duration starts at ~45 ms and reaches ~2000 ms after 68,000 persist invocations and 16 hours of runtime (!) (see attached AddDocs2_using_100000_160KB_1from2.log and AddDocs2_using_100000_160KB_2from2.log). The related JVM thread dump is attached as 100000_160KB_xml_java.hprof.txt.
There are actually two AUTOFLUSH-related issues: when AUTOFLUSH is (explicitly or implicitly) active, there is always a system-independent, monotonic increase of the insert operation durations. When AUTOFLUSH is explicitly deactivated (in order to achieve a constant insert operation duration), the probability of losing data increases.
Regards, Lucian ________________________________________ From: Christian Grün [christian.gruen@gmail.com] Sent: Saturday, 14 January 2017 12:09 To: Bularca, Lucian Cc: basex-talk@mailman.uni-konstanz.de Subject: Re: [basex-talk] Severe performance degradation when persisting more than 5000 XML data structures of 160 KB each.
Hi Lucian,
Thanks for taking the time to rerun the tests and do some profiling.
When inserting a 160 KB XML structure 100,000 times, the persist operation duration starts at ~45 ms and reaches ~2000 ms after 68,000 persist invocations and 16 hours of runtime (!)
Indeed this differs quite a lot from the tests I have made so far, and from the patterns I am used to.
It was helpful to have a look into the Java profiling files: A plain FileOutputStream.open call takes most of the time, while it’s hardly measurable in my own tests. Do you work with a local file system? Maybe the file listing of your database directory could shed some more light here.
I would additionally assume that you were closing and reopening your database after each addition, right? Obviously this makes sense if no bulk operations take place; it’s just different from what I did in my tests.
There are actually two AUTOFLUSH-related issues: […]
You are obviously right: disabling AUTOFLUSH should be reserved for bulk operations, and avoided if persistence of data is critical. And the addition of documents will always be faster if the database is small (but it should definitely not take more than a second to add a single small document).
Best, Christian