So far I have not found any information on how BaseX can be told how to use computing resources. The use case is as follows: I get several megabytes of XML files each day, usually between 50 and 100 MB. These are organized in one database per day. Since most queries run on a daily basis, this works perfectly fine.
However, there are situations when I need to run a query over a larger time span, say three or six months. (Note that I'm speaking of read-only queries here, not of transactions.) Of course I can do this in a loop (for $db in $db-list), but since the data in each database is completely independent of the data in the other databases, it would make perfect sense to parallelize the query. Is there a way to achieve this using XQuery?
I'm aware of the possibility of splitting the sequence into several smaller ones and running them in different threads on different connections, using Java for instance. But even then I still don't know what the server does (my queries run in a client-server configuration): will it occupy just one processor, or will it distribute the workload?
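The "split the sequence into several ones" idea from the question can be sketched as follows. This is only an illustration under assumptions: the naming scheme `feed-yyyy-MM-dd` and the class `DailyDatabases` are hypothetical and do not come from BaseX or from the original post. The sketch lists one database name per day for a time span and deals the names into a fixed number of slices, one per worker.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

final class DailyDatabases {
    // Assumed naming scheme: one database per day, e.g. "feed-2015-04-22".
    private static final DateTimeFormatter DAY = DateTimeFormatter.ISO_LOCAL_DATE;

    /** Lists the database names for every day from first to last, inclusive. */
    static List<String> names(LocalDate first, LocalDate last) {
        List<String> result = new ArrayList<>();
        for (LocalDate d = first; !d.isAfter(last); d = d.plusDays(1)) {
            result.add("feed-" + DAY.format(d));
        }
        return result;
    }

    /** Deals the names round-robin into the given number of slices. */
    static List<List<String>> slices(List<String> all, int count) {
        if (count < 1) {
            throw new IllegalArgumentException("count must be >= 1");
        }
        List<List<String>> result = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            result.add(new ArrayList<>());
        }
        for (int i = 0; i < all.size(); i++) {
            result.get(i % count).add(all.get(i));
        }
        return result;
    }

    public static void main(String[] args) {
        // Three months of daily databases, split for four workers.
        List<String> all = names(LocalDate.of(2015, 1, 1), LocalDate.of(2015, 3, 31));
        List<List<String>> parts = slices(all, 4);
        System.out.println(all.size() + " databases in " + parts.size() + " slices");
    }
}
```

Each slice could then be handed to its own thread and connection. Round-robin dealing keeps the slices within one database of each other in size, which matters less here than the fact that the databases are independent.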
Hi Götz,
it would make perfect sense to parallelize the query. Is there a way to achieve this using xQuery?
Our initial attempts to integrate low-level support for parallelization in XQuery turned out not to be as successful as we hoped they would be. One reason for that is that you can basically do everything with XQuery, and it's pretty hard to detect patterns in the code that are simple enough to be parallelized. Next to that, Java does not give us enough facilities to control CPU caching behavior.
As you already indicated, you can simply run multiple queries in parallel by e.g. using Java threads or the BaseX client/server architecture (which by default allows 8 transactions in parallel [1]). If your queries do a lot of I/O, you will often get better performance by only allowing one transaction at a time, though. This is due to the random access patterns on your external drives (and in my experience, it also applies to SSDs). However, if you work with main-memory instances of databases, parallelization might give you some performance gains (albeit not as big as you might expect).
Hope this helps, Christian
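The advice above (run several queries in parallel from Java threads, but also try a single transaction at a time for I/O-heavy work) can be tried out with a small harness like the one below. It is a sketch, not BaseX code: `ParallelQueries`, `runAll` and the `queryOneDb` function are hypothetical names, and the stand-in query in `main` only returns the length of the database name. In a real setup, `queryOneDb` would presumably open its own client connection and run the read-only query against one database.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

final class ParallelQueries {
    /**
     * Applies queryOneDb to every database name using at most `threads`
     * worker threads and returns the results in the order of dbNames.
     */
    static <R> List<R> runAll(List<String> dbNames, int threads, Function<String, R> queryOneDb) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<R>> pending = new ArrayList<>();
            for (String db : dbNames) {
                Callable<R> task = () -> queryOneDb.apply(db);
                pending.add(pool.submit(task));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> future : pending) {
                results.add(future.get()); // blocks until this task has finished
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a query", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a query failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Stand-in for a real per-database query (assumption for illustration).
        Function<String, Integer> fakeQuery = String::length;
        List<String> dbs = Arrays.asList("feed-2015-04-20", "feed-2015-04-21", "feed-2015-04-22");
        // Compare one worker with eight workers, as suggested in the reply.
        for (int threads : new int[] {1, 8}) {
            long start = System.nanoTime();
            List<Integer> results = runAll(dbs, threads, fakeQuery);
            long micros = (System.nanoTime() - start) / 1_000;
            System.out.println(threads + " thread(s): " + results + " in " + micros + " us");
        }
    }
}
```

Timing the same workload with `threads` set to 1 and to a larger value is a simple way to check, on your own disks, whether parallel access helps or hurts.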
Christian,
I think we should be able to attach BaseX to Apache Spark, but the integration code needs to be written. Everybody is able to read from Hadoop, Solr, Elasticsearch etc. into Spark and process the data there. Why not from BaseX?
Erol Akarsu
On Wed, Apr 22, 2015 at 4:28 AM, Christian Grün christian.gruen@gmail.com wrote:
Hi Götz,
it would make perfect sense to parallelize the query. Is there a way to achieve this using xQuery?
Our initial attempts to integrate low-level support for parallelization in XQuery turned out not to be as successful as we hoped they would be. One reason for that is that you can basically do everything with XQuery, and it's pretty hard to detect patterns in the code that are simple enough to be parallelized. Next to that, Java does not give us enough facilities to control CPU caching behavior.
As you already indicated, you can simply run multiple queries in parallel by e.g. using Java threads or the BaseX client/server architecture (which by default allows 8 transactions in parallel [1]). If your queries do a lot of I/O, you will often get better performance by only allowing one transaction at a time, though. This is due to the random access patterns on your external drives (and in my experience, it also applies to SSDs). However, if you work with main-memory instances of databases, parallelization might give you some performance gains (albeit not as big as you might expect).
Hope this helps, Christian
Any volunteers out there? ;)
On Wed, Apr 22, 2015 at 11:05 AM, Erol Akarsu eakarsu@gmail.com wrote:
Christian,
I think we should be able to attach BaseX to Apache Spark, but the integration code needs to be written. Everybody is able to read from Hadoop, Solr, Elasticsearch etc. into Spark and process the data there. Why not from BaseX?
Erol Akarsu
Hi Erol,
I am not volunteering :-) but if somebody wants to take this route, this code might give some pointers [1]. It uses Apache Spark to run Saxon-HE. There is also an XQuery example [2] and more info [3].
/Andy
[1] https://github.com/elsevierlabs/spark-xml-utils
[2] https://github.com/elsevierlabs/spark-xml-utils/wiki/xquery
[3] http://mail-archives.apache.org/mod_mbox/spark-user/201408.mbox/%3C140793661...
basex-talk@mailman.uni-konstanz.de