Hello everyone
Many thanks to Alexander, Bridger, Fabrice and Michael for getting back to me with such detailed responses; they have been really helpful.
A few notes:
1) The name is Athanasios :D. Sorry, just couldn’t help it, it seemed incredibly formal to be addressed via the surname in our communications. Our mail server advertises the “Surname. Initial” pattern, so I can see where the confusion came from.
2) I think there is scope for adding some sort of “logging” for all actions of the server in general. I may have hit a bug, but without such logs I cannot provide any more illuminating details. Here is what is happening:
a. During import, I get an error that file somethingsomething140.xml has an incredibly long element that cannot be imported at line (blahblah). The whole process just dies there.
b. I believe this is a bug, because if I import JUST the offending file on its own, a single-file database is created without any problems and I can query it. So the error may be caused by the previous file OR by the way the files are loaded. But I have absolutely no way of knowing the “load history” of the files, the exception that was caught, or anything else. In fact, once you press “OK” in the error dialog box, any database files that have been created are lost. In addition, the XML files to import are enumerated in a random order. So I had to run the import again and watch each file load, to see that the system “breaks” after 254 files (which is suspiciously close to 256). None of the files in the vicinity of the offending file caused any problems, so this may be a bug that is more difficult to catch (it is thrown with both the internal and external parsers). Following this, I created smaller databases of 250 XML files each and then got “predictable” errors about running out of memory and not creating indexes, which I can solve more easily.
3) It’s good to know that I don’t need the original files because that’s a lot of space I can get rid of. Thank you.
4) It seems that the ADDCACHE option would have saved me some trouble here, many thanks for that. But of course, if you don’t know the file enumeration order, you are still stuck not knowing which files have already been imported.
5) Michael, logging won’t help with the internal import procedure, except of course if you were implying writing a quick script to do the import “manually”?
6) Michael, the fork-join and “client connect” features are really interesting and worth a try before I start connecting things together via Hadoop. Are these modules already available in BaseX? Do I simply import their namespace, or is even that not needed?
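Regarding points 2, 4 and 5, a “manual” import could be sketched roughly as below. This is only a sketch under assumptions: it sorts the XML files so the load order is known in advance, splits them into batches of 250 (the size that worked for me above), and writes out one BaseX command script per batch using the CREATE DB, ADD, OPTIMIZE and SET ADDCACHE commands. The directory layout, the `part` database prefix and the function names are made up for illustration.

```python
from pathlib import Path
from typing import Iterable, List


def batches(items: List[Path], size: int) -> Iterable[List[Path]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_scripts(xml_dir: str, db_prefix: str = "part",
                  batch_size: int = 250) -> List[str]:
    """Return one BaseX command script (as text) per batch of XML files.

    `db_prefix` and `batch_size` are illustrative defaults, not BaseX
    settings. Paths are written unquoted, so paths containing spaces
    would need extra care.
    """
    # Sorting gives a deterministic, known load order.
    files = sorted(Path(xml_dir).glob("*.xml"))
    scripts = []
    for index, batch in enumerate(batches(files, batch_size), start=1):
        lines = ["SET ADDCACHE true", f"CREATE DB {db_prefix}{index:03d}"]
        lines += [f"ADD {path.as_posix()}" for path in batch]
        lines.append("OPTIMIZE")
        scripts.append("\n".join(lines))
    return scripts


if __name__ == "__main__":
    # Hypothetical input directory; prints the scripts for inspection.
    for script in build_scripts("xml_input"):
        print(script)
        print()
```

Each script could then be saved and run with the BaseX command-line client. Because the file order within each script is fixed, a failure can at least be matched to a position in the list; how an error partway through a script is reported would still need checking.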
Many thanks again.
All the best
From: Bridger Dyson-Smith [mailto:bdysonsmith@gmail.com]
Sent: 12 September 2017 16:53
To: Anastasiou A.
Cc: basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] A few general questions about BaseX
Hi Anastasiou,
Hopefully some of these answers are somewhat helpful.
On Tue, Sep 12, 2017 at 4:54 AM, Anastasiou A. <a.anastasiou@swansea.ac.uk> wrote:
Hello everyone
I am trying to load BaseX with a large number of XML files (~500), each one a few hundreds of MBs big. BaseX fails with a message along the lines “This is too big for one database”.
Can I please ask:
1) Are there any logs, beyond the DB logs? If yes, where can I find them?
a. The reason I am asking is because once basexgui gives the message, there is no indication about the error. Ideally, I would like to know if this is a limitation on memory amount or number of items (?).
I'm not sure how to enable more verbose logging with the GUI -- hopefully one of the devs or power users can weigh in on this.
2) The parser options include reading XML files from archives, which is very convenient, but once the file has been parsed, does BaseX require the “originals” for queries / returning results?
AFAIK, no it does not. BaseX will query and return results from the internal database(s).
3) Is it possible to do federation with BaseX? In other words, let’s say I split a database in two large parts (as per #1), is it possible to launch two BaseX servers and then have them talk to each other so that ultimately I just query one of them and get back unified results?
AFAIK, the preferred method is to split your files across many databases, then query multiple databases from a single expression[1]. Others will be able to speak to this better, but I don't think there's a straightforward way to run multiple BaseX servers in a single JVM.
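As a rough illustration of querying several databases from a single expression, the sketch below builds an XQuery string that loops over database names and calls `collection()` on each. The database names, the `//record` path and the helper name are hypothetical, and the name check is a deliberately conservative assumption rather than BaseX's actual naming rule.

```python
import re
from typing import Sequence

# Conservative assumption: accept only letters, digits, "_" and "-".
_SAFE_NAME = re.compile(r"[A-Za-z0-9_-]+")


def multi_db_query(db_names: Sequence[str], path: str) -> str:
    """Build one XQuery expression that evaluates `path` in every database."""
    for name in db_names:
        if not _SAFE_NAME.fullmatch(name):
            raise ValueError(f"unexpected database name: {name!r}")
    names = ", ".join(f'"{name}"' for name in db_names)
    return f"for $name in ({names})\nreturn collection($name){path}"


if __name__ == "__main__":
    # Hypothetical database names and element name.
    print(multi_db_query(["part001", "part002"], "//record"))
```

The resulting expression could be pasted into the GUI or sent through any client; results from the individual databases come back as one sequence.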
All the best
Best, Bridger