Hi, I'm new to BaseX and to XQuery; I already know XPath. I'm evaluating BaseX to store our XML files and run queries on them. We have to store about 1 million XML files per month. The files are small (~1 KB to 5 KB), so our case is: a high number of small files.
I've read that BaseX is scalable and has high performance, so it is probably a good tool for us. But in the tests I'm doing, I'm getting an "Out of Main Memory" error when loading a high number of XML files.
For example, if I create a new database ("testdb") and add 3 XML files, no problem occurs. The files are stored correctly, and I can run queries on them. But if I then try to add 18000 XML files to the same database ("testdb") (via GUI > Database > Properties > Add Resources), I see the coloured memory bar grow and grow... until an error appears:
Out of Main Memory. You can try to: - increase Java's heap size with the flag -Xmx<size> - deactivate the text and attribute indexes.
The text and attribute indexes are disabled, so that is not the cause. I also increased Java's heap size with the -Xmx<size> flag (by editing the basexgui.bat script), but the same error happens.
BaseX probably loads all files into main memory first and then dumps them to the database files. It shouldn't be done that way: each XML file should be loaded into main memory, processed, and then dumped to the db files, independently of the rest.
So I have two questions: 1. Do I have to use a special way to add a high number of XML files? 2. Is BaseX sufficiently stable to store and manage our data (about 1 million files will be added per month)?
Thank you for your help and for your great software, and excuse me if I am asking recurring questions.
Hi « kgfhjjgrn » ;-)
The size of your test should not cause any problem for BaseX (18,000 files of 1 to 5 KB).
1. Did you try to set the ADDCACHE option?
2. You should OPTIMIZE your collection after each batch of ADD commands, even if no index is set.
3. Did you try to unset the AUTOFLUSH option and explicitly FLUSH the updates at the batch's end?
4. The GUI may not be the best place to run updates; did you try the basex command line tools?
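Taken together, suggestions 1-4 could look like the following BaseX command script, run via the basex command line client. This is only a sketch: "testdb" and the input path are placeholders, and the exact behaviour of each option is described in the BaseX documentation.

```
SET ADDCACHE true
SET AUTOFLUSH false
OPEN testdb
ADD /data/xml/batch-01
FLUSH
OPTIMIZE
```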
In my experience, opening a collection containing a huge number of documents may take a long time. It seems to be related to the kind of in-memory data structure used to store the document names. A workaround could be to insert your documents under a common root XML element with XQuery Update.
The excellent BaseX team may give you better advice.
Best, Fabrice Etanchaud Questel-Orbit
From: basex-talk-bounces@mailman.uni-konstanz.de [mailto:basex-talk-bounces@mailman.uni-konstanz.de] On behalf of freesoft Sent: Monday, 15 April 2013 10:19 To: basex-talk@mailman.uni-konstanz.de Subject: [basex-talk] Adding millions of XML files
Hi kgfhjjgrn,
I believe that Fabrice already mentioned all details that should help you to build larger databases. The ADDCACHE option [1] (included in the latest stable snapshot [2]) may already be sufficient to add your documents via the GUI: simply run the "set addcache true" command via the input bar of the main window before opening the Properties dialog.
Note that you can access multiple databases with a single XQuery call, so if you know that you’ll exceed the limits of a single database at some time (see [3]), simply create new databases in certain intervals.
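As a sketch of what such a cross-database query could look like (the database names and the element/attribute names are placeholders, not from the thread):

```xquery
for $db in ("db2013-01", "db2013-02")
return collection($db)//record[@id = '42']
```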
Hope this helps, Christian
[1] http://docs.basex.org/wiki/Options#ADDCACHE
[2] http://files.basex.org/releases/latest/
[3] http://docs.basex.org/wiki/Statistics
BaseX-Talk mailing list BaseX-Talk@mailman.uni-konstanz.de https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk
Worked! :-)
I have uninstalled 7.6 and installed the 7.7 beta. Then I created the empty db, added the 3 files, ran the "set addcache true" command, and added the 17828 files... no "out of memory" error, just the processing info:
Path "everything" added in 462943.7 ms.
that is, ~8 minutes (on my development machine, not on our server).
Now I'm going to do some more tests (both for adding and for querying), and I'm going to try the "basex" command, in order to add XML files to the db automatically.
Anyway, I would like to ask some more questions:
1. Is the 7.7 beta sufficiently stable to be used on our production server? Should I wait for the final 7.7 release?
2. Is the "addcache" property value permanently saved in the db? Should I run the "set addcache true" command every time I add files?
3. Should I keep the Text & Attribute indexes disabled? Is the "addcache=on" option sufficient to allow the addition of XML files, so that I can enable those indexes? Will my queries be slow with those indexes disabled?
4. Should I run Optimize after every massive insertion (even with "addcache=on")?
Thank you for the information on limits, it is very useful. In particular, the following limits:
FileSize: 512 GiB #Files: 536,870,912
mean an average of exactly 1 KB/file. Since my files are bigger than 1 KB (on average), the size limit (512 GiB) will be reached first. So my Perl scripts will have to detect the size of the db, and if it is bigger than ~500 GB, they will create a new db and add new XML files to it.
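One way for such a script to check the current size is BaseX's INFO DB command, which prints the properties of the opened database, including its size; parsing that output would be up to the script. A sketch ("testdb" is a placeholder):

```
OPEN testdb
INFO DB
```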
Please show me an easy example of how to use several databases in the same query. Perhaps something like:
for $doc in (collection("db1"), collection("db2"))
for $node in $doc/$a_node_path
etc...
Well, thank you very much for your help. And excuse me for the huge amount of questions from a newbie like me :-)
freesoft
________________________________ From: Christian Grün christian.gruen@gmail.com To: Fabrice Etanchaud fetanchaud@questel.com CC: freesoft kqfhjjgrn@yahoo.es; "basex-talk@mailman.uni-konstanz.de" basex-talk@mailman.uni-konstanz.de Sent: Monday, 15 April 2013 12:12 Subject: Re: [basex-talk] Adding millions of XML files
Hi Freesoft,
I have uninstalled 7.6 and installed the 7.7 beta. Then I created the empty db, added the 3 files, ran the "set addcache true" command, and added the 17828 files... no "out of memory" error, just the processing info:
Good to hear. Please note that it’s always faster to specify initial documents along with the CREATE DB command instead of adding them in a second step (but I’m aware that you’re mainly interested in the time required to incrementally add new documents).
- Is the 7.7 beta sufficiently stable to be used on our production server? Should I wait for the final 7.7 release?
The current snapshot should be a safe bet, as there will be no critical updates until the official release.
- Is the "addcache" property value permanently saved in the db? Should I run the "set addcache true" command every time I add files?
The value of ADDCACHE is bound to the current BaseX instance and won't be stored in the database. This means that you’ll have to set it to true whenever you run a new BaseX instance.
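If retyping the command gets tedious, options can, as far as I know, also be predefined in the .basex configuration file in your home directory, so they apply to every new instance. A sketch of the relevant line:

```
# General Options
ADDCACHE = true
```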
But, as you stumbled upon an issue that has also been discussed before, I had yet another look at the ADD command, and I added some heuristics for directory inputs. If the documents to be added are expected to blow up main memory, they will be cached even if ADDCACHE is set to false. You are invited to check out the latest version [1] and give us some more feedback.
- Should I keep the Text & Attribute indexes disabled? Is the "addcache=on" option sufficient to allow the addition of XML files, so that I can enable those indexes? Will my queries be slow with those indexes disabled?
If text and attribute indexes are enabled, they will be invalidated with an update and restored with the next OPTIMIZE call, so it’s a good choice to keep the defaults. Not all queries will get slower without indexes. You can have a look at the query info (shown e.g. in the GUI’s InfoView) to see if the query plans with and without index structures differ.
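For instance, a path with a string comparison on text content is the classic candidate for the text index (the element name here is hypothetical): with an up-to-date text index the query info should show an index access in the plan, and without it a full scan.

```xquery
//entry[text() = 'some value']
```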
- Should I run Optimize after every massive insertion (even with "addcache=on")?
It’s generally advisable to run OPTIMIZE whenever you want to perform queries on your new data.
mean an average of exactly 1 KB/file. Since my files are bigger than 1 KB (on average), the size limit (512 GiB) will be reached first.
My assumption is that you will first hit the node id limit (#Nodes), but simply try and see what happens.
Please show me an easy example of how to use several databases in the same query. Perhaps something like:
for $doc in (collection("db1"), collection("db2"))
for $node in $doc/$a_node_path
Looks fine. This is one more alternative:
for $i in 1 to 100
let $db := "db" || $i
return db:open($db)/your/path
Well, thank you very much for your help. And excuse me for the huge amount of questions from a newbie like me :-)
Your questions are welcome. If you have some free time, you are invited to read our documentation [2]; many of its contents have been inspired by earlier discussions on this list.
Christian
[1] http://files.basex.org/releases/latest/
[2] http://docs.basex.org/wiki/Main_Page
Hi, yesterday I was busy, so I'm sorry for responding so late.
Good to hear. Please note that it’s always faster to specify initial documents along with the CREATE DB command instead of adding them in a second step (but I’m aware that you’re mainly interested in the time required to incrementally add new documents).
Yes, exactly. In fact, I'm seeing that, as the db grows, the addition of new XML files becomes slower and slower. When I add the first 2 GBytes of files, the insertion is fast. But when I add 2 more GBytes, it is terribly slow. In fact, I launched it about 1 h ago, and it is still adding files. What could be the cause? If BaseX is scalable, the degradation in speed (due to the size increase) shouldn't be too high.
But.. As you stumbled upon an issue that has also been discussed before, I had yet another look at the ADD command, and I added some heuristics for directory inputs. If the documents to be added are expected to blow up main memory, they will be cached even if ADDCACHE is set to false. You are invited to check out the latest version [1] and give us some more feedback.
Well, I prefer to be safe and enable the ADDCACHE option (because Murphy is always enabled :). If I get the time, I will try the heuristics.
If text and attribute indexes are enabled, they will be invalidated with an update and restored with the next OPTIMIZE call, so it’s a good choice to keep the defaults. Not all queries will get slower without indexes. You can have a look at the query info (shown e.g. in the GUI’s InfoView) to see if the query plans with and without index structures differ.
I tried to run OPTIMIZE after adding files, and the creation of indexes failed (specifically, the creation of the Text index, as far as I can remember) with an "out of memory" error. So I'm working without indexes and without OPTIMIZE. Queries are working fine, at least for now.
freesoft
________________________________ From: Christian Grün christian.gruen@gmail.com To: freesoft kqfhjjgrn@yahoo.es CC: "basex-talk@mailman.uni-konstanz.de" basex-talk@mailman.uni-konstanz.de Sent: Monday, 15 April 2013 23:51 Subject: Re: [basex-talk] Adding millions of XML files
Hi Freesoft,
Yes, exactly. In fact, I'm seeing that, as the db grows, the addition of new XML files becomes slower and slower. When I add the first 2 GBytes of files, the insertion is fast. But when I add 2 more GBytes, it is terribly slow.
As Fabrice mentioned, you could try to set AUTOFLUSH to false; does this improve things?
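A minimal sketch of that suggestion as a command script (database name and path are placeholders):

```
SET AUTOFLUSH false
OPEN testdb
ADD /data/xml/next-batch
FLUSH
```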
Christian
Hi Fabrice, thank you for your fast response.
I'm sorry, the file size info I wrote was incorrect; the correct figures for the test are:
number of files: 17828
total size: 955 MB
average size: ~55 KB/file
The exact steps I did were:
1. Open the BaseX GUI.
2. Create an empty database: Database > New, then I remove the "Input file or directory" content (so it gets empty), enter "testdb" in "Name of database", and click "OK".
3. Open the created database: Database > Open and manage > double click on "testdb".
4. Add 3 XML files: Database > Properties, then I write into "Input file or directory" a directory path that contains 3 XML files, write the name of the directory in "Target Path" (it will be a prefix path for the document nodes), and click "Add...". The files are added correctly, and I can run queries on them.
5. Add the 17828 XML files: Database > Properties, then I write into "Input file or directory" a directory path that contains the 17828 XML files, write the name of the directory in "Target Path", and click "Add...". The files are not added correctly: I get an out of memory message:
Out of Main Memory. You can try to: - increase Java's heap size with the flag -Xmx<size> - deactivate the text and attribute indexes.
Now I have tried to follow your suggestions:
* The ADDCACHE option seems to be available only in BaseX 7.7, as described at http://docs.basex.org/wiki/Options#ADDCACHE . I'm using 7.6, the version downloaded from the BaseX home page. In fact, I don't see any ADDCACHE option in the GUI (perhaps it is only available on the command line). I would like to run the tests only with a totally stable version. Do you recommend using 7.7?
* I have tried to optimize: after step 4, I clicked the "Optimize" button. But I still get an "Out of Main Memory" error when running step 5.
* I don't see any option in the GUI to enable/disable the AUTOFLUSH option. Perhaps it is only available on the command line?
* I'm going to try the command line ("basex" command). I will post the results.
* I still haven't learned XQuery Update. I think I should focus on the problem of adding our documents to the database for now, and then (once the problem is solved) I will try to optimize things (such as the time taken to open the collection).
Thank you for your suggestions. I'm going to try the "basex" command now ...
freesoft
________________________________ From: Fabrice Etanchaud fetanchaud@questel.com To: freesoft kqfhjjgrn@yahoo.es; "basex-talk@mailman.uni-konstanz.de" basex-talk@mailman.uni-konstanz.de Sent: Monday, 15 April 2013 10:59 Subject: RE: [basex-talk] Adding millions of XML files