I am working on crawling data, and it is taking a lot of time because XQuery pulls web pages one by one. Popular crawlers like Apache Nutch have multithreading support, so they can open multiple socket connections to multiple sites and crawl very fast. Can we add multithreading support to BaseX, so that it can run multiple XQuery functions in parallel and then merge the results? Initially, merging the results could be done manually. At the very least, it would be great if we could distribute the load across different XQuery functions.
Erol Akarsu
Hi Erol,
I can also imagine various use cases in which multithreading would be beneficial. Have you already thought about how multithreaded execution could work in XQuery?
Christian
2013/10/24 Erol Akarsu eakarsu@gmail.com:
Christian,
For example, I have a big load of URLs to crawl. The URLs can be split into multiple chunks. BaseX can start multiple function calls in parallel, each dealing with one chunk. After the functions have finished, we can merge the results. This is the basic approach.
I think we could benefit from the excellent concurrency features of the Clojure language. BaseX can call Clojure functions, but I don't know how.
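The chunk-and-merge idea above can be sketched in plain Java (the language BaseX itself is written in). This is only an illustration under stated assumptions, not BaseX code: the class `ChunkedCrawl`, its methods, and the `fakeFetch` worker are hypothetical names, and `fakeFetch` merely stands in for "fetch a page and run a query on it".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

final class ChunkedCrawl {

    // Split the input into consecutive chunks of at most chunkSize items.
    static <T> List<List<T>> chunk(List<T> items, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int i = 0; i < items.size(); i += chunkSize) {
            int end = Math.min(i + chunkSize, items.size());
            chunks.add(new ArrayList<>(items.subList(i, end)));
        }
        return chunks;
    }

    // Submit one task per chunk to a fixed-size thread pool, then merge
    // the per-chunk results in the original chunk order.
    static <T, R> List<R> processInParallel(List<T> items, int chunkSize, int threads,
                                            Function<List<T>, List<R>> worker) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<R>>> futures = new ArrayList<>();
            for (List<T> c : chunk(items, chunkSize)) {
                Callable<List<R>> task = () -> worker.apply(c);
                futures.add(pool.submit(task));
            }
            List<R> merged = new ArrayList<>();
            for (Future<List<R>> f : futures) {
                merged.addAll(f.get()); // blocks until this chunk is done
            }
            return merged;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a chunk", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a chunk failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    // Hypothetical stand-in for "fetch each URL and process the page".
    static List<String> fakeFetch(List<String> urls) {
        List<String> out = new ArrayList<>();
        for (String u : urls) {
            out.add("fetched:" + u);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> urls = List.of(
                "http://example.org/1", "http://example.org/2", "http://example.org/3",
                "http://example.org/4", "http://example.org/5");
        // Chunks of 2 URLs, processed by up to 3 worker threads.
        processInParallel(urls, 2, 3, ChunkedCrawl::fakeFetch).forEach(System.out::println);
    }
}
```

Because the futures are read back in submission order, the merged list keeps the order of the input URLs even though the chunks may finish in any order.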
On Thu, Oct 24, 2013 at 6:33 PM, Christian Grün cg@basex.org wrote:
> For example, I have a big load of URLs to crawl. The URLs can be split into multiple chunks. BaseX can start multiple function calls in parallel, each dealing with one chunk. After the functions have finished, we can merge the results. This is the basic approach.
Haskell provides the Strategies library for that:
http://hackage.haskell.org/package/parallel-3.1.0.1/docs/Control-Parallel-St...
It would be rather easy to implement some first quick hacks that provide simple multithreading, but it becomes a real challenge if you want to make the feature production-safe. This is different for languages like Clojure, because parallel programming is at the very core of the language.
Christian,
One option: suppose we have an XML document that contains repeated elements of the same kind, like <a> <b>..</b> <b>..</b> ... </a>
We can partition all the <b> blocks in Clojure and call Clojure functions that work on them in parallel, seamlessly. Each Clojure function will call XQuery through BaseX's Java bridge to process its XML block. Clojure will collect the results of all the parallel functions and write them to disk or call other XQuery functions.
Alternatively, Clojure calls BaseX to partition the XML file and return the parts. Clojure then calls the parallel functions with those parts.
In this way, whenever we need XML processing, Clojure will call BaseX. All threading will be handled by Clojure.
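As a rough illustration of the partition, process, and collect flow described above, here is a sketch in plain Java rather than Clojure, using only the JDK's XML parser and `CompletableFuture`. Everything in it is an assumption made for the example: the class `PartitionBlocks` and its methods are hypothetical, and the `processBlock` function only stands in for the call into BaseX, whose API is not shown here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;
import java.util.stream.Collectors;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class PartitionBlocks {

    // Collect the text of every <b> child of the root element.
    // This runs on the calling thread and hands out plain strings,
    // so no DOM node is shared between worker threads.
    static List<String> partition(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList children = doc.getDocumentElement().getChildNodes();
            List<String> parts = new ArrayList<>();
            for (int i = 0; i < children.getLength(); i++) {
                Node n = children.item(i);
                if (n.getNodeType() == Node.ELEMENT_NODE && "b".equals(n.getNodeName())) {
                    parts.add(n.getTextContent());
                }
            }
            return parts;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    // Start one asynchronous task per block, then wait for all of them
    // and collect the results in document order.
    static <R> List<R> processBlocks(String xml, Function<String, R> processBlock) {
        List<CompletableFuture<R>> futures = partition(xml).stream()
                .map(part -> CompletableFuture.supplyAsync(() -> processBlock.apply(part)))
                .collect(Collectors.toList());
        return futures.stream()
                .map(CompletableFuture::join)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String xml = "<a><b>first</b><b>second</b><b>third</b></a>";
        // Hypothetical per-block work: upper-case the block's text.
        System.out.println(processBlocks(xml, String::toUpperCase));
    }
}
```

The document is partitioned before any thread is started so that each worker only sees its own string. This mirrors the second variant above, where one side partitions the XML and the other side only runs the parallel functions.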
Erol Akarsu
On Thu, Oct 24, 2013 at 6:49 PM, Christian Grün cg@basex.org wrote:
basex-talk@mailman.uni-konstanz.de