Hi Lucian,
sorry, I obviously forgot a very important word: "not". I would NOT expect adding 100,000 documents to be much of a problem. Sorry for the confusion.
The log file is interesting; it certainly looks like the
performance is degrading. I can't say much about it, but I am sure
Christian (our head architect) will give you some pointers when he
has time to answer.
However, given the description of your problem, I would advise you
to rethink your architecture in general. So many constant updates
do not seem very performant to me when they are all done on a
single database. So maybe you want to split up your data, e.g.
you could put all documents of a certain day into a separate
database.
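A minimal sketch of the per-day split, assuming one database per calendar day. The class name, the "docs" prefix and the naming scheme are hypothetical, and whether such names are valid database names should be checked against the BaseX documentation:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: derives a per-day database name so that each
// day's documents end up in their own, smaller database.
final class DailyDatabaseNames {

    /** Returns prefix + "_" + ISO date, e.g. "docs_2017-01-10". */
    static String forDay(String prefix, LocalDate day) {
        return prefix + "_" + day.format(DateTimeFormatter.ISO_LOCAL_DATE);
    }

    /** Names of all per-day databases from 'from' to 'to', both inclusive. */
    static List<String> forRange(String prefix, LocalDate from, LocalDate to) {
        List<String> names = new ArrayList<>();
        for (LocalDate d = from; !d.isAfter(to); d = d.plusDays(1)) {
            names.add(forDay(prefix, d));
        }
        return names;
    }

    public static void main(String[] args) {
        LocalDate day = LocalDate.of(2017, 1, 10);
        System.out.println(forDay("docs", day));
        System.out.println(forRange("docs", day, day.plusDays(2)));
    }
}
```

The range helper is there because a query over several days would then have to address several databases by name.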
Or you could have one "up-to-date" database which you always update, and transfer its entries into another database during low-load times. The other database could have proper indexes and whatever else you need.
Otherwise you will run into problems when querying your data. I
guess you don't want to just store your data, you want to do
something with it, don't you? Just storing data without using it
seems a bit useless... And for this you probably want to use some
indexes, and keeping an index up to date under constant updates is
quite costly.
To sum it up: I think you want to split up your data in some way into several databases.
However, I understand that you will still have something like 100,000 documents in a database (which should be fine), so your current performance issue will still exist. My comment is more towards your general architecture.
Cheers
Dirk
On 01/10/2017 08:24 PM, Bularca, Lucian wrote:
Hi Dirk,
thanks for your fast reply :)
Regarding the performance measurement, I forgot to mention that I based my statements on the entries in the BaseX log file (see attached basex.log). The System.out call in each iteration is only meant to log the sequence number of the added XML structure, not the duration of a persist operation. This System.out call does have an impact on the overall performance, but it cannot explain the monotonic increase in the duration of the insert operations (see attached basex.log file). After 24 hours of inserting XML test structures, only half of the 100,000 XML test structures had been added to the database, at a rate of at most 1 structure / 2 seconds.
All these tests were made against version 8.5.3 of the BaseX database.
In production, we expect peaks of 2.7 * 10^5 XML structures to persist per 24 hours (~ 31 XML structures per second). By "However, I would expect 100,000 documents added to be much of a problem.", do you mean that persisting 100,000 XML structures in the BaseX database is problematic?
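As a sanity check on the rates quoted above, here is a small sketch that converts between per-day and per-second figures (the class and method names are made up for illustration). Taken at face value, 2.7 * 10^5 structures per 24 hours is about 3.1 per second, while ~31 per second would correspond to roughly 2.7 * 10^6 per day, so one of the two figures may be off by a factor of ten. The observed rate of 1 structure / 2 seconds gives 43,200 per day, which roughly matches "only half of the 100,000" after 24 hours.

```java
import java.util.Locale;

// Hypothetical helper for converting between daily and per-second rates.
final class ThroughputCheck {
    static final long SECONDS_PER_DAY = 24L * 60 * 60; // 86,400

    /** Structures per second needed to reach 'perDay' within 24 hours. */
    static double perSecond(double perDay) {
        return perDay / SECONDS_PER_DAY;
    }

    /** Structures per day achieved at a constant 'perSecond' rate. */
    static double perDay(double perSecond) {
        return perSecond * SECONDS_PER_DAY;
    }

    public static void main(String[] args) {
        // Expected production peak from the mail: 2.7 * 10^5 per 24 hours.
        System.out.println(String.format(Locale.ROOT, "%.3f per second", perSecond(2.7e5)));
        // Observed rate from the log: at most 1 structure per 2 seconds.
        System.out.println(String.format(Locale.ROOT, "%.0f per day", perDay(0.5)));
    }
}
```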
Regards,
Lucian
From: basex-talk-bounces@mailman.uni-konstanz.de [basex-talk-bounces@mailman.uni-konstanz.de] on behalf of Dirk Kirsten [dk@basex.org]
Sent: Tuesday, 10 January 2017 12:52
To: basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] Severe performance degradation when persisting more than 5000 XML data structures of 160 KB each.
Hello Lucian,
please be aware that this is an English-speaking mailing list, as we have many users from all over the world and the mailing list is intended to help everyone. But as most of our team members are German (well, and Bavarians...) we of course understand it. Hence, I will answer in English (for all others: Lucian seems to have some performance issues when adding many documents).
First of all, are you sure your tests sufficiently test the add performance? Looking at your file TestBaseXClient.java, it does not seem to record the runtimes of the individual insertions, but just the overall runtime of, in this case, 100,000 insertions.
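A minimal sketch of what recording the individual insertion times could look like, assuming the actual insert call is wrapped in a callback. InsertTimer and its methods are hypothetical names, not part of the attached test classes or of the BaseX API:

```java
import java.util.Locale;
import java.util.function.IntConsumer;

// Hypothetical timing harness: measures every insertion separately and
// summarises the timings per window, so that a gradual slowdown shows up.
final class InsertTimer {

    /** Calls insert.accept(i) for i in [0, count) and returns each call's duration in ns. */
    static long[] timeEach(int count, IntConsumer insert) {
        long[] nanos = new long[count];
        for (int i = 0; i < count; i++) {
            long start = System.nanoTime();
            insert.accept(i);
            nanos[i] = System.nanoTime() - start;
        }
        return nanos;
    }

    /** Average duration in ms for each consecutive window of 'size' insertions. */
    static double[] windowAveragesMillis(long[] nanos, int size) {
        int windows = (nanos.length + size - 1) / size;
        double[] averages = new double[windows];
        for (int w = 0; w < windows; w++) {
            int from = w * size;
            int to = Math.min(from + size, nanos.length);
            long sum = 0;
            for (int i = from; i < to; i++) {
                sum += nanos[i];
            }
            averages[w] = sum / 1e6 / (to - from);
        }
        return averages;
    }

    public static void main(String[] args) {
        // The empty lambda is a stand-in; the real insertion call would go here.
        long[] nanos = timeEach(1000, i -> { });
        double[] averages = windowAveragesMillis(nanos, 100);
        for (int w = 0; w < averages.length; w++) {
            System.out.println(String.format(Locale.ROOT,
                "insertions %d-%d: %.4f ms on average", w * 100, w * 100 + 99, averages[w]));
        }
    }
}
```

Printing once per window rather than once per insertion also keeps the console output from distorting the measurement.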
Also, at least in the example you provided, you also do some other stuff (especially printing to sysout), which obviously also has a performance impact.
Optimizing or creating indexes in between a mass update should not increase the speed: it builds the indexes, which will be invalidated again by the next update, so I would not expect any speed-up here.
What version of BaseX did you use?
Did you set AUTOFLUSH (see http://docs.basex.org/wiki/Options#AUTOFLUSH) to false? This should benefit performance.
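A rough sketch of the AUTOFLUSH idea, assuming the inserts are sent as BaseX command strings: turn automatic flushing off once, then flush explicitly every so often. The class name, the database name, the document paths and the batch size are all made up, and the exact command syntax (SET, OPEN, ADD, FLUSH) should be checked against the BaseX command documentation for your version:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical planner: builds the sequence of command strings for a bulk
// insert with AUTOFLUSH disabled and an explicit FLUSH after each batch.
final class FlushPlan {

    static List<String> commands(String database, int documents, int flushEvery) {
        if (flushEvery <= 0) {
            throw new IllegalArgumentException("flushEvery must be positive");
        }
        List<String> commands = new ArrayList<>();
        commands.add("SET AUTOFLUSH false");   // do not flush after every update
        commands.add("OPEN " + database);
        for (int i = 1; i <= documents; i++) {
            // Placeholder document; real input would be the XML structure.
            commands.add("ADD TO doc" + i + ".xml <placeholder/>");
            if (i % flushEvery == 0) {
                commands.add("FLUSH");         // write buffers to disk per batch
            }
        }
        if (documents % flushEvery != 0) {
            commands.add("FLUSH");             // flush the incomplete last batch
        }
        return commands;
    }

    public static void main(String[] args) {
        commands("testdb", 5, 2).forEach(System.out::println);
    }
}
```

With AUTOFLUSH off, data that has not been flushed yet may be lost if the server crashes, so the batch size is a trade-off between speed and safety.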
In general it is also a good architectural approach to split up documents into many databases instead of having one large database. Given that you can access as many databases as you want in one query, you will not lose any query capabilities, whereas with one large database you might encounter certain limits at some point. However, I would expect 100,000 documents added to be much of a problem.
As a side note: as it seems you are evaluating BaseX, and I guess you are doing this for a reason, it might be faster/easier to talk to our BaseX team members, who of course can help you with evaluating your problem and identifying whether BaseX is the right choice for it. Take a look at http://basexgmbh.de/ for our commercial offerings.
Cheers
Dirk
On 01/10/2017 05:44 PM, Bularca, Lucian wrote:
Hello,
as part of a performance evaluation of persisting XML data structures in a BaseX database, we observed steadily decreasing persistence rates, inversely proportional to the database size.
This behaviour is explainable and would also be acceptable, if the time needed to persist a ~ 160 KB XML data structure did not go from ~ 10 ms at the start to ~ 2500 ms after ~ 50,000 persist operations.
We are trying to store 100,000 different XML data structures of roughly 160 KB each in a BaseX database via the Java API, measuring the total duration and the duration of the individual persist operations. The BaseX database was started in HTTP mode (basexhttp) with -Xmx 4048m.
The measurements above stayed the same regardless of whether all XML data structures were stored within a single session or the socket (DB connection) was closed and reopened every 500 persist operations. Indexing the database in between (via the GUI "Optimize All" or "Create Text Index") could not influence or improve the persistence rates.
An example of the test classes we used (illustrative only, not compilable!) can be found in the attachment BaseXClient.java.zip to this e-mail.
Are persistence rates of more than 160 KB / 2500 ms generally to be expected with more than 30,000 existing entries in BaseX, or can we drastically optimize these persistence times (and if so, how)?
Kind regards,
Lucian Bularca
-- Dirk Kirsten, BaseX GmbH, http://basexgmbh.de |-- Firmensitz: Blarerstrasse 56, 78462 Konstanz |-- Registergericht Freiburg, HRB: 708285, Geschäftsführer: | Dr. Christian Grün, Dr. Alexander Holupirek, Michael Seiferle `-- Phone: 0049 7531 91 68 276, Fax: 0049 7531 20 05 22