Hello everyone!

My name is Mathias. I'm using BaseX for a university project where we are creating a publication database. Right now we have 25 GB of XML data spread over 180k documents. Ultimately I want to be able to perform XQuery searches on this data, possibly even full-text ones.
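
To give an idea of the kind of queries we are aiming for: something along these lines, as typed in the BaseX console (the element name "publication" is invented; our actual schema differs):

    XQUERY //publication[title contains text "database"]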

I'd like to know whether you think BaseX is suitable for this amount of data at all. If yes, how would I best add these files to the database? If I use the BaseX GUI to add the folder, an OutOfMemoryError is thrown shortly after the process starts. Even providing more RAM (~7 GB via -Xmx7000M) only delays it. I haven't looked at the code, but it appears as though all file contents are kept in RAM and only written to disk at the end, which would at least explain the huge amount of memory BaseX consumes.
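
While searching the documentation I came across the ADDCACHE option, which, if I understand it correctly, caches parsed input on disk instead of keeping it in main memory. Would something like this be the recommended way to build the database? (The database name "pubs" and the path are placeholders; FTINDEX is there because we may want the full-text index.)

    # cache input on disk while parsing; build the full-text index
    SET ADDCACHE true
    SET FTINDEX true
    CREATE DB pubs /data/publications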

Since the GUI can't handle the files, I wrote an importer myself that adds the files one by one via the "ADD" command. This seems to work without excessive memory use. However, it is taking ages to add all 180,000 files this way (several hours; it hasn't completed yet), so maybe it is just postponing the overflow. Also, this might just be my subjective impression, but adding files seems to get slower as the database grows. Is there some kind of duplicate check going on that could be in the way? If so, is there a way to bulk insert all the data I have without such checks?
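
Related to that: the documentation also mentions an AUTOFLUSH option. Would disabling it during the import and flushing once at the end avoid some of the per-file overhead? Roughly like this (again, names and paths are placeholders, and I may be misreading the docs):

    # defer flushing to disk until the bulk import is done
    SET AUTOFLUSH false
    OPEN pubs
    ADD /data/publications
    FLUSH
    OPTIMIZE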


I'd be grateful for any thoughts on this!


Thanks in advance,

 Mathias