Running BaseX on a parallel cluster

Kristian Kankainen

22 May 2017

11:26 a.m.

Hi all!

Is there any way to make BaseX run in parallel on a cluster? Through school I have access to several clusters, and I would like to find out whether BaseX can take advantage of parallel computing.

I know my question is vague. To be more specific: can I compile BaseX with some extra parameters? Has anything been written on this topic?

As background: I became interested in running XQuery in parallel because I read somewhere (I can no longer find it in the mailing list archive) that XQuery falls under the "dataflow" programming paradigm [1] and as such should be parallelizable out of the box. Being a functional language with no side effects should also make automatic parallelization easier. I am not a computer scientist, but I am aware that it is not as easy as I might make it sound.

Cheers Kristian K

[1] https://en.wikipedia.org/wiki/Dataflow_programming

Christian Grün

22 May 2017

7:59 p.m.

Hi Kristian,

If you have a single machine, and…

• if you access databases, the random access patterns of parallel queries slow down evaluation a lot, so you will usually get the best performance with single-threaded queries.

• if you don’t access any databases, the xquery:fork-join function [1] will allow you to run several threads in parallel within a single XQuery expression. Even then, the performance gain is not as thrilling as you might expect, but it works fine for operations that involve long delays caused by external services (as can be the case with http:send-request).
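
The fork-join idea from the second bullet can be sketched in a language-neutral way. The snippet below is Python, not XQuery, and does not call BaseX; `slow_request` and `fork_join` are hypothetical helpers, and the sleep is an assumed stand-in for a slow external service. It only shows why the pattern pays off when the tasks spend most of their time waiting.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_request(name, delay=0.2):
    """Hypothetical stand-in for a call that mostly waits on an external service."""
    time.sleep(delay)  # simulated latency; no real network I/O happens here
    return f"response from {name}"


def fork_join(tasks):
    """Run zero-argument callables in parallel threads; return results in input order."""
    if not tasks:
        return []
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = [pool.submit(task) for task in tasks]   # fork
        return [future.result() for future in futures]    # join


names = ["service-a", "service-b", "service-c"]
start = time.perf_counter()
results = fork_join([lambda n=n: slow_request(n) for n in names])
elapsed = time.perf_counter() - start

print(results)
print(f"elapsed: {elapsed:.2f}s")  # the waits overlap instead of adding up
```

Because the three waits overlap, the total time is governed by the slowest task rather than by the sum of all delays. For CPU-bound or disk-bound work the gain would be much smaller, which matches the caveat above.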

If you work with multiple machines, you can use the BaseX Client Module [2] to request data from several BaseX instances, or simulate map/reduce patterns.
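
As a rough sketch of what "simulating map/reduce" across several instances could look like, the Python snippet below models each instance as an in-memory shard. The node names, the shard contents and the functions `map_step` and `reduce_step` are assumptions made for illustration; none of them is part of the BaseX Client Module. In a real setup the map step would be a query sent to a remote instance.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for three separate instances, each holding its own shard.
INSTANCES = {
    "node-1": ["noun", "verb", "noun"],
    "node-2": ["verb", "adjective"],
    "node-3": ["noun"],
}


def map_step(instance):
    """Executed once per instance: count the values in that instance's shard only."""
    return Counter(INSTANCES[instance])


def reduce_step(partials):
    """Executed on the coordinator: merge the partial counts into one result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Scatter the map step to all instances in parallel, then gather and reduce.
with ThreadPoolExecutor(max_workers=len(INSTANCES)) as pool:
    partials = list(pool.map(map_step, INSTANCES))

totals = reduce_step(partials)
print(dict(totals))
```

The design point is that each instance only touches its own data, and only the small partial results travel back to the coordinator.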

If you use newer versions of Java, more and more code will be executed on multiple cores, but this is nothing you can really control.

As you see, my replies do not completely match your question, but feel free to give us more input on which types of queries and expressions you would like to parallelize.

Cheers, Christian

[1] http://docs.basex.org/wiki/XQuery_Module#xquery:fork-join
[2] http://docs.basex.org/wiki/Client_Module

Kristian Kankainen

23 May 2017

4:59 p.m.

Thank you for the enlightening answer. I have two further questions:

Question 1) So, if I want to treat BaseX as a magical genie that can make wonderfully parallelized execution plans, then I would need to provide:

* each used data set in my query as a separate BaseX database instance

so that the computations would thus run separately (in parallel) in each of the instances? (and each instance would have its own isolated access to its data, not creating the problem of the mentioned random access patterns)

Question 2) What exactly is referred to by you saying "Java"? Is it Java proper or is it something else that is run on the java virtual machine, like Scala?

Christian Grün

25 May 2017

4:48 p.m.

Hi Kristian,

> each used data set in my query as a separate BaseX database instance

…as long as the databases are on different disks/drives.

> so that the computations would thus run separately (in parallel) in each of the instances?

What kind of computations do you want to perform?

> What exactly is referred to by you saying "Java"? Is it Java proper or is it something else that is run on the java virtual machine, like Scala?

I must admit I don’t know very much about the internals of the JVM. I just noticed that the evaluation of some XQuery expressions leads to the utilization of multiple CPU cores with Java 7 or 8, and I didn’t encounter this behaviour with Java 6. It may be that this is due to the JIT compiler, but it could also be that some (very safe) computations are done in parallel.

Cheers, Christian

Kristian Kankainen

5:09 p.m.

Hi Christian,

> What kind of computations do you want to perform?

Mainly finding subsets of the stored XML data: retrieve a list of "something" that matches some criteria, then use the items in this list to retrieve further subsets of related data from other data sets. In my case, finding the related data for each of the items in the first list could be done in parallel.
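
The two-step access pattern described here can be sketched as follows. This is plain Python over in-memory stand-ins, not BaseX code; the data sets `PRIMARY` and `RELATED` and the functions `find_matching` and `lookup_related` are hypothetical and exist only to show where the independent, parallelizable part sits.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-ins for the stored data sets.
PRIMARY = [
    {"id": "w1", "pos": "noun"},
    {"id": "w2", "pos": "verb"},
    {"id": "w3", "pos": "noun"},
]
RELATED = {
    "examples": {"w1": ["ex-1", "ex-2"], "w3": ["ex-3"]},
    "translations": {"w1": ["tr-1"], "w2": ["tr-2"]},
}


def find_matching(criterion):
    """Step 1 (sequential): select the ids of primary items matching a criterion."""
    return [item["id"] for item in PRIMARY if item["pos"] == criterion]


def lookup_related(item_id):
    """Step 2 (independent per id): collect related data from every other data set."""
    return {name: data.get(item_id, []) for name, data in RELATED.items()}


ids = find_matching("noun")

# The lookups do not depend on each other, so they can be submitted in parallel.
with ThreadPoolExecutor(max_workers=4) as pool:
    related = dict(zip(ids, pool.map(lookup_related, ids)))

print(related)
```

Only the second step is parallel; the first list has to exist before any lookup can start. Whether the parallel lookups are actually faster depends on where the data lives, for example on whether the data sets sit on different disks.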

> I must admit I don’t know very much about the internals of the JVM. I just noticed that the evaluation of some XQuery expressions leads to the utilization of multiple CPU cores with Java 7 or 8, and I didn’t encounter this behaviour with Java 6. It may be that this is due to the JIT compiler, but it could also be that some (very safe) computations are done in parallel.

This is interesting, but I don't know too much about it either. To be clear, BaseX is not doing any parallelization effort of the XQuery execution plans? And is the main reason the random access patterns of (single) disk access?

Best regards Kristian

Christian Grün

5:28 p.m.

> In my case finding the related data for each of the items in the first list could be done in parallel.

So it's mostly sequential disk-based scans of your data that you would like to speed up?

> To be clear, BaseX is not doing any parallelization effort of the XQuery execution plans?

Right. We did various experiments over time, and it turned out that it was difficult to find query patterns that could be accelerated at all by parallelization. Things might look different today if we had designed BaseX to be parallelizable from the very beginning.

However, it could make sense to invest more energy in our disk buffer management. It’s pretty basic and straightforward at the moment (just a few lines of code), as it mostly relies on the operating system.

Interested volunteers are welcome; I’ll be happy to give more details.

Cheers, Christian

basex-talk@mailman.uni-konstanz.de
