Christian,
I think we should be able to attach BaseX to Apache Spark, but the integration code still needs to be written. Everybody can already read data from Hadoop, Solr, Elasticsearch, etc. into Spark and process it there. Why not from BaseX?
Erol Akarsu
On Wed, Apr 22, 2015 at 4:28 AM, Christian Grün <christian.gruen@gmail.com> wrote:
Hi Götz,
it would make perfect sense to parallelize the query. Is there a way to achieve this using xQuery?
Our initial attempts to integrate low-level support for parallelization in XQuery turned out not to be as successful as we had hoped. One reason is that you can do basically everything with XQuery, and it's pretty hard to detect patterns in the code that are simple enough to be parallelized. In addition, Java does not give us enough facilities to control CPU caching behavior.
As you already indicated, you can simply run multiple queries in parallel, e.g. by using Java threads or the BaseX client/server architecture (which by default allows 8 transactions in parallel [1]). If your queries do a lot of I/O, though, you will often get better performance by allowing only one transaction at a time. This is due to the random access patterns on your external drives (and in my experience, it also applies to SSDs). However, if you work with main-memory instances of databases, parallelization might give you some performance gains (albeit not as big as you might expect).
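A minimal sketch of the Java-threads approach, in case it is useful: it submits independent queries to a fixed-size thread pool and collects the results in input order. The class name `ParallelQueries`, the method `runAll`, and the `runQuery` function are hypothetical names made up for this illustration; `runQuery` is only a stand-in for whatever call actually sends a query to BaseX, and no BaseX API is used here. Setting `maxParallel` to 1 corresponds to the one-transaction-at-a-time case suggested for I/O-heavy queries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class ParallelQueries {

    // Runs each query on a pool of at most maxParallel threads and
    // returns the results in the same order as the input queries.
    // runQuery is a placeholder (assumption) for the real BaseX call.
    static List<String> runAll(List<String> queries, int maxParallel,
                               Function<String, String> runQuery) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(maxParallel);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String query : queries) {
                Callable<String> task = () -> runQuery.apply(query);
                futures.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                results.add(future.get()); // blocks until this query is done
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stub that only echoes the query text; replace with a real client call.
        Function<String, String> stub = q -> "result of " + q;
        List<String> queries = Arrays.asList("count(//a)", "count(//b)", "count(//c)");

        System.out.println(runAll(queries, 8, stub)); // up to 8 queries at once
        System.out.println(runAll(queries, 1, stub)); // strictly one at a time
    }
}
```

Timing both settings against the real workload is probably the only reliable way to see which one wins on a given disk setup.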
Hope this helps, Christian