Hi Peter,
In this case there are about 23k documents. My approach is given in the attached file. I have a BaseX client and server running, as well as an Apache Tika server; it is all on a 32-bit machine, so the heap size is restricted. I start with a JSON file containing the URLs of a set of Excel and PDF documents, PUT each document to Tika for text extraction, and then PUT the extracted output to BaseX.
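In outline, the flow looks roughly like the sketch below. This is not the attached code, just an illustration of the idea: the Tika endpoint on port 9998, the database name "extracted", the file name urls.json and the assumption that the JSON is a flat array of URL strings are all placeholders.

  declare namespace http = "http://expath.org/ns/http-client";

  (: assumed: urls.json is a flat JSON array of document URLs :)
  let $urls := json-doc("urls.json")?*
  for $url in $urls
  (: fetch the raw document and PUT it to the Tika server for extraction :)
  let $data := fetch:binary($url)
  let $response := http:send-request(
    <http:request method="put" href="http://localhost:9998/tika">
      <http:body media-type="application/octet-stream"/>
    </http:request>,
    (),
    $data
  )
  (: the second item of the result is the response body, i.e. the extracted text :)
  return db:add("extracted", <doc url="{ $url }">{ $response[2] }</doc>, $url)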
I have had to find a way of chunking the work (using subsequence), otherwise I get Out Of Memory errors. However, even with small subsequences the processing seems to take ages: I have had this program running for over three days and it did not complete.
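The chunking itself looks roughly like this (again only a sketch: the chunk size of 500 and the local:process stub, which would wrap the Tika PUT and db:add steps above, are illustrative):

  declare function local:process($url as xs:string) as xs:string {
    (: placeholder for the Tika PUT and db:add steps sketched earlier :)
    $url
  };

  let $urls := json-doc("urls.json")?*
  let $size := 500
  (: compute the start position of each chunk: 1, 501, 1001, ... :)
  for $start in (0 to (count($urls) - 1) idiv $size) ! (. * $size + 1)
  let $chunk := subsequence($urls, $start, $size)
  return $chunk ! local:process(.)

The idea is that only one chunk of URLs is processed at a time, but the memory use and runtime are still a problem.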
Hm, 23k documents does not sound like that much. I know too little about Tika to judge, though. Which part of the code is responsible for the error? Could you provide me with the stack trace of the OOM error (you may need to use -d to get it printed on screen)?
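For example, assuming you run the query from the command line (the script name is just a placeholder):

  basexclient -d pipeline.xq

or basex -d pipeline.xq for the standalone version.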
Christian
basex-talk@mailman.uni-konstanz.de