Greetings,
It's a good idea... perhaps expand the concept into general recommendations on how to deal with large datasets.
I just had to write this fun script to unzip sets of XML files and split them into directories of 100 files each (could have been more), because the insertion would otherwise run out of memory on my cheap 4 GB DigitalOcean server even with -Xmx3512m. After splitting up the data I could run with -Xmx1512m. It also seems quite fast when split up.
Could include tips like this...
echo -e "\n\n 1. Insert 13f /mnt/DATASTORE/SEC/13F/ZIPXML/XMLMETA"
cd /mnt/DATASTORE/SEC/13F/ZIPXML/
for j in *.zip; do
    echo -e "\n\n Process ${j}"
    # Clear out the previous batch directories before unpacking the next archive
    rm -r /mnt/DATASTORE/SEC/13F/ZIPXML/XMLMETA/*
    unzip -q "${j}" -d /mnt/DATASTORE/SEC/13F/ZIPXML/
    cd /mnt/DATASTORE/SEC/13F/ZIPXML/XMLMETA/
    # Split the unpacked files into numbered subdirectories of 100 files each
    ls | parallel -n100 mkdir {#}\;mv {} {#}
    cd /mnt/appuappu-mexxon/COMPANY_XML_DB/13F/SCRIPTS/
    # Insert each batch directory in its own JVM run to keep memory bounded
    for D in $(find /mnt/DATASTORE/SEC/13F/ZIPXML/XMLMETA/*/ -type d); do
        echo -e "\n\n Process ${D}"
        java -Xmx1512m -Xss8096k -cp ../../../SEC_SERVER/LIB/saxon9ee.jar:../../../SEC_SERVER/LIB/BaseX85.jar org.basex.BaseX -bpath="$D" ../XQUERY/add13FDirectory.xq
    done
    cd /mnt/DATASTORE/SEC/13F/ZIPXML/
done
./optimizedb.sh
Hmm, I suppose I shouldn't be using ../../../SEC_SERVER/LIB/saxon9ee.jar in this case; I just copied that from my other scripts.
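For anyone without GNU parallel installed, the same batching trick can be sketched in plain POSIX shell. This is just a demo under made-up names: the `demo` directory, the 250 dummy files, and the batch size of 100 are placeholders, not anything from the script above.

```shell
#!/bin/sh
# Create a throwaway directory with 250 dummy XML files to split up.
mkdir -p demo && cd demo
for i in $(seq 1 250); do touch "file_$i.xml"; done

# Move the files into numbered subdirectories, 100 files per directory.
batch=0
count=0
for f in *.xml; do
    if [ "$count" -eq 0 ]; then
        batch=$((batch + 1))
        mkdir -p "$batch"
    fi
    mv "$f" "$batch/"
    count=$(( (count + 1) % 100 ))
done

ls 1 | wc -l   # each numbered directory now holds at most 100 files
```

Each numbered directory can then be fed to a separate BaseX run, just like the `-bpath=$D` loop above, so no single JVM invocation ever sees more than one batch.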