Hi Christian,
About option 4:
I agree with the options you laid out. I am currently diving deeper into option 4 in the list you wrote.
Regarding the partitioning strategy, I agree. I did manage, however, to partition the files to be imported into separate sets, with constraints on the maximum partition size (on disk) and the maximum partition file count (the number of XML documents in each partition).
A tool called fpart [5] made this possible (I can imagine more sophisticated bin-packing, involving pre-computed node counts and other variables, could be done with glpk [6], but that might be too much work).
Currently I am experimenting with a max partition size of 2.4GB and a max file count of 85k, and fpart split the file list into 11 partitions of roughly 33k files each, with each partition coming in at ~2.4GB.
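For reference, an fpart invocation with these constraints would look something like this (the input directory and the "part" output prefix are placeholders, not my actual paths):

    # split the corpus into partitions of at most ~2.4GB and 85k files each;
    # the resulting file lists are written to part.0, part.1, ...
    fpart -s $((2400*1024*1024)) -f 85000 -o part /path/to/xml-corpus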
I wrote a script for this; it's called sharded-import.sh and is attached here. I'm also noticing that the /dba/ BaseX web interface is no longer blocked while this script runs, as opposed to the previous import, where I ran
CREATE DB db_name /directory/
so now I can see the progress and run queries before the big import finishes.
Maybe the downside is that it's more verbose and prints out a ton of lines along the way, and maybe that's slower than before.
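To give an idea of the approach without the attachment, here is a simplified sketch (not the attached script verbatim; "mydb" and the part.* lists produced by fpart are placeholders, and calling the basex client once per document is certainly not the fastest variant):

    #!/bin/sh
    # simplified sketch of the sharded import: one pass per fpart partition,
    # adding each listed document to the database via the basex client
    basex -c "CREATE DB mydb"
    for list in part.*; do
        while IFS= read -r f; do
            basex -c "OPEN mydb; ADD $f"
        done < "$list"
        # optimize once per partition instead of once at the very end
        basex -c "OPEN mydb; OPTIMIZE"
    done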
About option 1:
Re: increasing memory, I am running these experiments on a low-memory, old network-attached storage box, a QNAP TS-451+ [7] [8], which I had to take apart with a screwdriver to add 2GB of RAM (it now has 4GB of memory). I can't seem to find any additional memory sticks around the house to take it up to 8GB (which is also the maximum it supports). And if I want to buy, say, 2 x 4GB sticks, the memory frequency has to match what the NAS supports, and I'm having trouble finding the exact one; Corsair says it has sticks that would work, but I'd have to wait weeks for them to ship to Bucharest, which is where I live.
Buying an Intel NUC that goes up to 64GB of memory seems a bit too expensive at $1639 [9], but people on Reddit [10] were discussing, some years back, a Supermicro server [11] that costs only $668 and would also take up to 64GB of memory.
Basically I would buy something cheap that I can pack with a lot of RAM, but a hands-off approach would be best here, so something that comes pre-equipped with all the memory would be nice (it would spare me the trouble of buying the memory separately and making sure it matches the motherboard specs, etc.).
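In the meantime, on the 4GB box, I still need to double-check how much of that RAM actually reaches the JVM; if I read the BaseX startup scripts correctly, the heap can be raised via the BASEX_JVM environment variable, something like:

    # assumption: the basex/basexhttp launch scripts pick up BASEX_JVM;
    # give the JVM most of the NAS's 4GB and leave the rest to the OS
    export BASEX_JVM="-Xmx3g"
    basexhttp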
About option 2:
In fact, that's a great idea, but it would require me to write something that figures out the XPath patterns where the actual content sits. I wanted to look for an algorithm designed to do that and try to implement it, but I had no time.
It would have to detect the repetitive, bloated nodes and build XPaths for the rest of the nodes, where the actual content sits. I think this would be equivalent to computing the "web template" of a website, given all its pages.
It would definitely decrease the size of the content that would have to be indexed.
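As a crude first step (before any real algorithm), a frequency histogram of element paths over an already-imported database should make the repetitive "template" paths stand out; a minimal sketch, with "mydb" as a placeholder database name:

    # count how often each element path occurs, most frequent first;
    # the repetitive "template" paths should dominate the top of the list
    basex -q '
      for $e in collection("mydb")//*
      let $p := "/" || string-join($e/ancestor-or-self::* ! name(), "/")
      group by $p
      order by count($e) descending
      return $p || "  " || count($e)
    '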
By the way, I'm describing a more general procedure here, because it's not just this dataset that I want to import: I want to import large, heavy datasets in general :)
These are my thoughts for now