Hi,
I'm doing some testing before migrating one of our customers to a new version of our platform, which uses BaseX to store documents. They have approx. 4M documents, and I'm running an import operation on a 1M-document collection on my laptop.
The way I'm inserting documents is by firing off one Add command per document, based on a stream of the document, at a different (unique) path for each document, and flushing every 10K Adds.
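The batching described above can be sketched roughly as follows. This is only an illustration of the pattern, not the actual import code: `DocumentSink`, `BatchingImporter` and `CountingSink` are hypothetical names, and the sink is an in-memory stand-in that counts calls rather than the BaseX client API.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

final class BatchingDemo {
  /** Hypothetical stand-in for a database session; not the BaseX client API. */
  interface DocumentSink {
    void add(String path, InputStream document);
    void flush();
  }

  /** Forwards every document to the sink and flushes after each full batch. */
  static final class BatchingImporter {
    private final DocumentSink sink;
    private final int batchSize;
    private int pending;

    BatchingImporter(DocumentSink sink, int batchSize) {
      if (batchSize <= 0) throw new IllegalArgumentException("batchSize must be positive");
      this.sink = sink;
      this.batchSize = batchSize;
    }

    void add(String path, InputStream document) {
      sink.add(path, document);
      if (++pending == batchSize) {
        sink.flush();
        pending = 0;
      }
    }

    /** Flushes whatever is left over after the last full batch. */
    void finish() {
      if (pending > 0) {
        sink.flush();
        pending = 0;
      }
    }
  }

  /** In-memory sink that only counts calls. */
  static final class CountingSink implements DocumentSink {
    int adds;
    int flushes;
    @Override public void add(String path, InputStream document) { adds++; }
    @Override public void flush() { flushes++; }
  }

  /** Imports {@code total} tiny documents, one unique path each, and returns the sink. */
  static CountingSink run(int total, int batchSize) {
    CountingSink sink = new CountingSink();
    BatchingImporter importer = new BatchingImporter(sink, batchSize);
    byte[] xml = "<record/>".getBytes(StandardCharsets.UTF_8);
    for (int i = 0; i < total; i++) {
      importer.add("records/" + i + ".xml", new ByteArrayInputStream(xml));
    }
    importer.finish();
    return sink;
  }

  public static void main(String[] args) {
    CountingSink sink = run(25_000, 10_000);
    System.out.println(sink.adds + " adds, " + sink.flushes + " flushes");
  }
}
```

With a real session in place of `CountingSink`, `finish()` matters: without it, the documents after the last full batch of 10K would never be flushed.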
Since most of the CPU usage (on one core, the others being untouched) comes from the BaseX server, I fired up YourKit out of curiosity to see where the CPU time was spent. My machine is a 2*4 core MacBook Pro with 8GB of RAM and an SSD, so I think hardware-wise it should do fine.
YourKit shows that what seems to use up most time is the Namespaces.update method:
Thread-12 [RUNNABLE] CPU time: 2h 7m 9s
org.basex.data.Namespaces.update(NSNode, int, int, boolean, Set)
org.basex.data.Namespaces.update(int, int, boolean, Set)
org.basex.data.Data.insert(int, int, Data)
org.basex.core.cmd.Add.run()
org.basex.core.Command.run(Context, OutputStream)
org.basex.core.Command.exec(Context, OutputStream)
org.basex.core.Command.execute(Context, OutputStream)
org.basex.core.Command.execute(Context)
org.basex.server.ClientListener.execute(Command)
org.basex.server.ClientListener.add()
org.basex.server.ClientListener.run()
I'm not really sure what that method does - it's a recursive function and seems to be triggered by Data.insert:
// NSNodes have to be checked for pre value shifts after insert
nspaces.update(ipre, dsize, true, newNodes);
The whole set of records should have no more than 5 different namespaces in total. Thus I'm wondering if there would perhaps be some potential for optimization here? Note that I'm completely ignorant as to what the method does and what its exact purpose is.
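A toy model can illustrate how a recursive check like this might come to dominate an import. It is not the BaseX implementation: the `Node` class, the tree layout and the `visitsFor` helper are hypothetical, and the model rests on two loudly stated assumptions, namely that every added document contributes one node to the namespace structure and that every insert walks the whole structure to look for pre values to shift.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only: not BaseX's Namespaces class or its real data layout. */
final class NamespaceShiftModel {
  /** A node of a hypothetical namespace tree, tagged with a pre value. */
  static final class Node {
    int pre;
    final List<Node> children = new ArrayList<>();
    Node(int pre) { this.pre = pre; }
  }

  long visits;                    // nodes touched by all updates so far
  final Node root = new Node(0);

  /** Recursively shifts every node at or after {@code pre} by {@code size}. */
  void update(Node node, int pre, int size) {
    visits++;
    if (node.pre >= pre) node.pre += size;
    for (Node child : node.children) update(child, pre, size);
  }

  /** Inserts one document: a full traversal first, then one new node (assumption). */
  void insertDocument(int pre, int size) {
    update(root, pre, size);
    root.children.add(new Node(pre));
  }

  /** Total visits after appending {@code docs} documents of {@code size} nodes each. */
  static long visitsFor(int docs, int size) {
    NamespaceShiftModel model = new NamespaceShiftModel();
    int next = 1;
    for (int i = 0; i < docs; i++) {
      model.insertDocument(next, size);
      next += size;
    }
    return model.visits;
  }

  public static void main(String[] args) {
    for (int docs : new int[] {1_000, 2_000, 4_000}) {
      System.out.println(docs + " documents -> " + visitsFor(docs, 10) + " node visits");
    }
  }
}
```

In this model the documents are appended at the end, so no pre value ever actually shifts, yet the traversal still touches every existing node. The total number of visits is n(n+1)/2 for n documents, so doubling the number of documents roughly quadruples the work even though the number of distinct namespaces stays small.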
Thanks,
Manuel
PS: the import is now finished: Storing 1001712 records into BaseX took 9285008 ms
Hi Manuel,
sorry for the delayed feedback, and thanks for pointing to the Namespaces.update() method, which in fact updates the hierarchical namespace structures in a database (well, you guessed that already…). As we first need to do some more research on potential optimizations, I have created a new GitHub issue to keep track of this bottleneck [1].
Thanks, Christian
[1] https://github.com/BaseXdb/basex/issues/523
On Sat, Jun 30, 2012 at 7:01 PM, Manuel Bernhardt bernhardt.manuel@gmail.com wrote:
[...]
_______________________________________________
BaseX-Talk mailing list
BaseX-Talk@mailman.uni-konstanz.de
https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk
Hi,
great, thanks! If there's anything I can do to help, let me know. Right now I think I'm going to abort the import because it probably will take somewhat longer.
Manuel
On Mon, Jul 2, 2012 at 3:11 AM, Christian Grün christian.gruen@gmail.com wrote:
[...]
Another note: if your initial database is empty, and if your documents to be added are stored on disk, the operation will be much faster if you specify this directory along with the create command.
Hi,
On Mon, Jul 2, 2012 at 10:42 AM, Christian Grün christian.gruen@gmail.com wrote:
Another note: if your initial database is empty, and if your documents to be added are stored on disk, the operation will be much faster if you specify this directory along with the create command.
I had considered looking at this, but in our situation the source is a stream that gets converted on the fly and then sent to the BaseX server (which runs on a different machine than the one doing the inserts). Btw, is there a reason why inserting from a file is faster than from a stream? I'd expect both to use the same insertion mechanism.
Thanks,
Manuel
Hi Manuel,
is there a reason why inserting from a file is faster than from a stream? I'd expect both to use the same insertion mechanism.
There are several reasons for that, e.g.:
– as each of the ADD operations is atomic, it must be guaranteed that a command will not lead to a corrupt database. In contrast, CREATE will either succeed or fail as a whole.
– if data is streamed, we first need to cache the result for the same reason (if the received data is invalid, the insert operation will fail)
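The second point can be illustrated with a small sketch. It is not how BaseX implements ADD: `CachedAdd` and its in-memory `store` list are hypothetical, and well-formedness is checked here with the JDK's SAX parser purely for demonstration. The idea is only that the stream is buffered and checked in full before the store is touched, so invalid input leaves the store unchanged.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Illustration only: a stand-in store, not BaseX's ADD implementation. */
final class CachedAdd {
  /** Returns true if the cached bytes parse as well-formed XML. */
  static boolean isWellFormed(byte[] xml) {
    try {
      SAXParserFactory factory = SAXParserFactory.newInstance();
      factory.setNamespaceAware(true);
      factory.newSAXParser().parse(new ByteArrayInputStream(xml), new DefaultHandler());
      return true;
    } catch (SAXException | IOException | ParserConfigurationException e) {
      return false;
    }
  }

  /**
   * Caches the whole stream, checks it, and only then touches the store,
   * so a broken document leaves the store exactly as it was.
   */
  static boolean add(InputStream in, List<byte[]> store) {
    byte[] cached;
    try {
      cached = in.readAllBytes();   // buffer first: the stream may fail midway
    } catch (IOException e) {
      return false;
    }
    if (!isWellFormed(cached)) return false;
    store.add(cached);
    return true;
  }

  /** Wraps a string as a UTF-8 input stream, for the demonstration below. */
  static InputStream stream(String xml) {
    return new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
  }

  public static void main(String[] args) {
    List<byte[]> store = new ArrayList<>();
    System.out.println(add(stream("<a xmlns='urn:x'><b/></a>"), store));
    System.out.println(add(stream("<a><b></a>"), store));
    System.out.println(store.size());
  }
}
```

The buffering step is the extra cost of a streamed add: the document has to be read completely before anything is inserted.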
Apart from that, your specific bottleneck seems to be related to the namespace method. Without that, the add operation should be very fast, too. As malamut2 suggested…
https://github.com/BaseXdb/basex/issues/523
an additional option, which strips all namespaces in a document, could be another solution (provided that you don't really need the namespaces). Anyway, we'll give you an update as soon as someone has time to look at this.
Christian
Hi Manuel,
while many XML purists will hate this feature, I bet that many users will love it: I have added a new STRIPNS option to remove namespaces from imported XML documents [1-3]. This option is also available via the GUI. After all, it depends on the particular use case if stripping namespaces makes things easier or is pretty much nuts.
I remember that you haven't actually asked for this feature, and it may well be that you absolutely want to retain namespaces in your database. The discussed performance bottleneck with namespaced documents is still on the list.
Christian
[1] http://docs.basex.org/wiki/Options#STRIPNS
[2] http://files.basex.org/releases/latest/
[3] https://github.com/BaseXdb/basex/issues/537
Hi,
a little update on this: I started the import of 3M documents last evening using this method, and after 9h it's not yet finished (at 2.29M documents atm.). So this operation looks a lot like it is in O(n^2) (the insertion of 1M records took somewhat above 2h).
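The two measurements reported in this thread allow a rough check of that impression. Assuming the running time grows like n^k, the exponent k can be estimated from two (documents, time) pairs as k = ln(t2/t1) / ln(n2/n1). The class and method names below are made up for the example, and the result is only as good as the inputs: the second import was still running, and two data points say nothing about constant overheads.

```java
final class GrowthEstimate {
  /** Exponent k such that time grows like n^k between two (n, time) measurements. */
  static double exponent(double n1, double t1, double n2, double t2) {
    return Math.log(t2 / t1) / Math.log(n2 / n1);
  }

  public static void main(String[] args) {
    double n1 = 1_001_712;            // records stored by the first import
    double t1 = 9_285_008 / 3.6e6;    // its duration in hours (converted from ms)
    double n2 = 2_290_000;            // records stored so far by the second import
    double t2 = 9.0;                  // hours elapsed so far
    System.out.printf("estimated exponent: %.2f%n", exponent(n1, t1, n2, t2));
  }
}
```

With these figures the estimate comes out at roughly 1.5, which is clearly superlinear although somewhat below a pure quadratic.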
Manuel
…the problem should now be fixed. I'd be glad if you could once more test the import you've been discussing in your report with the latest code base/snapshot.
Thanks in advance, Christian
Hi Christian,
thanks for the fix! I'll test it right away on a big import.
We don't have that many namespaces in those documents but the general idea is to keep them, so we won't be using the STRIPNS feature for the time being (though we might in the future, depending on the use-case)
Thanks,
Manuel
On Sat, Jul 7, 2012 at 4:45 PM, Christian Grün christian.gruen@gmail.com wrote:
[...]
Hi again,
inserting 3M records now seems to take a lot less time - I've been running an insertion for the past 40 minutes and it's close to finishing (2.8M records so far). I have the impression that it still gets slower as the database grows, but much less so - and I couldn't put a finger on any particular method call with YourKit (it started at a whopping 15K documents / second, and is now at 300 documents / second).
I'll leave the computer running and see tomorrow how much time it took in total (and give a detail on what calls took how long), but in any case this is a huge improvement over how it used to be, thanks a lot!
Manuel
On Mon, Jul 9, 2012 at 2:04 PM, Manuel Bernhardt bernhardt.manuel@gmail.com wrote:
[...]
Hi again,
the parsing isn't finished yet, and apparently there's something odd going on:
00:44:36.429 [127.0.0.1:61979]: kulturnett____arkivportalen ADD TO no-TKAT_arkiv000000201590 [...]
00:44:38.681 [127.0.0.1:61979]: kulturnett____arkivportalen OK 2252.36 ms
00:44:38.682 [127.0.0.1:61979]: kulturnett____arkivportalen ADD TO no-TKAT_arkiv000000201591 [...]
00:44:38.682 [127.0.0.1:61979]: kulturnett____arkivportalen OK 0.64 ms
00:44:38.683 [127.0.0.1:61979]: kulturnett____arkivportalen ADD TO no-TKAT_arkiv000000201592 [...]
00:44:41.049 [127.0.0.1:61979]: kulturnett____arkivportalen OK 2366.34 ms
00:44:41.050 [127.0.0.1:61979]: kulturnett____arkivportalen ADD TO no-TKAT_arkiv000000201593 [...]
00:44:43.410 [127.0.0.1:61979]: kulturnett____arkivportalen OK 2360.21 ms
00:44:43.410 [127.0.0.1:61979]: kulturnett____arkivportalen ADD TO no-TKAT_arkiv000000201594 [...]
00:44:43.411 [127.0.0.1:61979]: kulturnett____arkivportalen OK 0.75 ms
00:44:43.412 [127.0.0.1:61979]: kulturnett____arkivportalen ADD TO no-TKAT_arkiv000000201595 [...]
00:44:45.730 [127.0.0.1:61979]: kulturnett____arkivportalen OK 2318.13 ms
00:44:45.730 [127.0.0.1:61979]: kulturnett____arkivportalen ADD TO no-TKAT_arkiv000000201596 [...]
00:44:48.068 [127.0.0.1:61979]: kulturnett____arkivportalen OK 2337.68 ms
00:44:48.068 [127.0.0.1:61979]: kulturnett____arkivportalen ADD TO no-TKAT_arkiv000000201597 [...]
00:44:48.069 [127.0.0.1:61979]: kulturnett____arkivportalen OK 0.66 ms
00:44:48.070 [127.0.0.1:61979]: kulturnett____arkivportalen ADD TO no-TKAT_arkiv000000201598 [...]
00:44:50.401 [127.0.0.1:61979]: kulturnett____arkivportalen OK 2331.87 ms
00:44:50.402 [127.0.0.1:61979]: kulturnett____arkivportalen ADD TO no-TKAT_arkiv000000201599 [...]
00:44:50.403 [127.0.0.1:61979]: kulturnett____arkivportalen OK 0.7 ms
00:44:50.403 [127.0.0.1:61979]: kulturnett____arkivportalen ADD TO no-TKAT_arkiv000000201600 [...]
00:44:52.763 [127.0.0.1:61979]: kulturnett____arkivportalen OK 2360.16 ms
00:44:52.764 [127.0.0.1:61979]: kulturnett____arkivportalen ADD TO no-TKAT_arkiv000000201601 [...]
00:44:55.155 [127.0.0.1:61979]: kulturnett____arkivportalen OK 2391.05 ms
00:44:55.155 [127.0.0.1:61979]: kulturnett____arkivportalen ADD TO no-TKAT_arkiv000000201602 [...]
00:44:55.156 [127.0.0.1:61979]: kulturnett____arkivportalen OK 0.96 ms
00:44:55.157 [127.0.0.1:61979]: kulturnett____arkivportalen ADD TO no-TKAT_arkiv000000201603 [...]
00:44:57.504 [127.0.0.1:61979]: kulturnett____arkivportalen OK 2347.02 ms
00:44:57.504 [127.0.0.1:61979]: kulturnett____arkivportalen ADD TO no-TKAT_arkiv000000201604 [...]
00:44:59.846 [127.0.0.1:61979]: kulturnett____arkivportalen OK 2341.76 ms
00:44:59.846 [127.0.0.1:61979]: kulturnett____arkivportalen ADD TO no-TKAT_arkiv000000201605 [...]
00:44:59.848 [127.0.0.1:61979]: kulturnett____arkivportalen OK 1.33 ms
00:44:59.848 [127.0.0.1:61979]: kulturnett____arkivportalen ADD TO no-TKAT_arkiv000000201606 [...]
00:45:02.180 [127.0.0.1:61979]: kulturnett____arkivportalen OK 2332.35 ms
00:45:02.181 [127.0.0.1:61979]: kulturnett____arkivportalen ADD TO no-TKAT_arkiv000000201607 [...]
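The odd part of this excerpt is that the durations fall into two groups: of the 17 completed Adds, 11 take between roughly 2.2 and 2.4 seconds, while the other 6 finish in under 2 ms. To check whether this holds for the whole log rather than for one excerpt, something like the following could tally the "OK ... ms" entries. It is only a sketch: the class name, the method names and the 1000 ms threshold are arbitrary choices for illustration, and the line format is assumed from the excerpt above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AddTimings {

    // Matches the "OK <duration> ms" part of a server log line (format assumed from the excerpt).
    private static final Pattern OK = Pattern.compile("\\bOK (\\d+(?:\\.\\d+)?) ms");

    /** Extracts every reported duration (in ms) from the given log text. */
    public static List<Double> durations(String log) {
        List<Double> result = new ArrayList<>();
        Matcher m = OK.matcher(log);
        while (m.find()) {
            result.add(Double.parseDouble(m.group(1)));
        }
        return result;
    }

    /** Counts the durations at or above the given threshold. */
    public static long countSlow(List<Double> durations, double thresholdMs) {
        return durations.stream().filter(d -> d >= thresholdMs).count();
    }

    public static void main(String[] args) {
        // A few lines copied from the log above; a real run would read the log file instead.
        String log = String.join("\n",
            "00:44:38.681 [127.0.0.1:61979]: kulturnett____arkivportalen OK 2252.36 ms",
            "00:44:38.682 [127.0.0.1:61979]: kulturnett____arkivportalen OK 0.64 ms",
            "00:44:41.049 [127.0.0.1:61979]: kulturnett____arkivportalen OK 2366.34 ms",
            "00:44:43.410 [127.0.0.1:61979]: kulturnett____arkivportalen OK 2360.21 ms",
            "00:44:43.411 [127.0.0.1:61979]: kulturnett____arkivportalen OK 0.75 ms");

        List<Double> d = durations(log);
        System.out.println(countSlow(d, 1000) + " of " + d.size() + " Adds took a second or more");
    }
}
```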
Also, YourKit seems to have found potential deadlocks, or sluggish calls:
Thread-30 <--- Frozen for at least 2m 20s
  org.basex.index.value.MemValues.index(byte[], int)
  org.basex.data.MemData.index(int, int, byte[], int)
  org.basex.data.Data.text(int, int, byte[], int)
  org.basex.build.MemBuilder.addText(byte[], int, byte)
  org.basex.build.Builder.addText(byte[], byte)
  org.basex.build.Builder.text(byte[])
  org.basex.build.xml.XMLParser.parse()
  org.basex.build.SingleParser.parse(Builder)
  org.basex.build.DirParser.parseResource(Builder)
  org.basex.build.DirParser.parse(Builder, IO)
  org.basex.build.DirParser.parse(Builder)
  org.basex.build.Builder.parse()
  org.basex.build.MemBuilder.build() <2 recursive calls>
  org.basex.core.cmd.Add.run()
  org.basex.core.Command.run(Context, OutputStream)
  org.basex.core.Command.exec(Context, OutputStream)
  org.basex.core.Command.execute(Context, OutputStream)
  org.basex.core.Command.execute(Context)
  org.basex.server.ClientListener.execute(Command)
  org.basex.server.ClientListener.add()
  org.basex.server.ClientListener.run()
Thread-30 <--- Frozen for at least 44s
  org.basex.index.name.Names.index(byte[], byte[], boolean)
  org.basex.build.Builder.addElem(byte[], Atts)
  org.basex.build.Builder.startElem(byte[], Atts)
  org.basex.build.xml.XMLParser.parseTag()
  org.basex.build.xml.XMLParser.parse()
  org.basex.build.SingleParser.parse(Builder)
  org.basex.build.DirParser.parseResource(Builder)
  org.basex.build.DirParser.parse(Builder, IO)
  org.basex.build.DirParser.parse(Builder)
  org.basex.build.Builder.parse()
  org.basex.build.MemBuilder.build() <2 recursive calls>
  org.basex.core.cmd.Add.run()
  org.basex.core.Command.run(Context, OutputStream)
  org.basex.core.Command.exec(Context, OutputStream)
  org.basex.core.Command.execute(Context, OutputStream)
  org.basex.core.Command.execute(Context)
  org.basex.server.ClientListener.execute(Command)
  org.basex.server.ClientListener.add()
  org.basex.server.ClientListener.run()
Thread-30 <--- Frozen for at least 51s
  org.basex.index.resource.Docs.insert(int, Data)
  org.basex.index.resource.Resources.insert(int, Data)
  org.basex.data.Data.insert(int, int, Data)
  org.basex.core.cmd.Add.run()
  org.basex.core.Command.run(Context, OutputStream)
  org.basex.core.Command.exec(Context, OutputStream)
  org.basex.core.Command.execute(Context, OutputStream)
  org.basex.core.Command.execute(Context)
  org.basex.server.ClientListener.execute(Command)
  org.basex.server.ClientListener.add()
  org.basex.server.ClientListener.run()
I'm not exactly sure which of the above are relevant, but I thought I'd share them anyway. I'll try to get some better measurements tomorrow.
Manuel