Greetings,

It's a good idea... perhaps expand the concept to include recommendations on how to deal with large datasets in general.

I just had to write this fun script to unzip sets of XML files and split them into directories of 100 files each (it could have been more), because the insertion would otherwise run out of memory on my cheap 4GB DigitalOcean server when using -Xmx3512m. By splitting up the data I could run with -Xmx1512m. It also seems quite fast when split up.

Could include tips like this...


# Insert the 13F filings in batches of 100 files so each BaseX run fits in a small heap.
echo -e "\n\n  1. Insert 13f /mnt/DATASTORE/SEC/13F/ZIPXML/XMLMETA"
cd /mnt/DATASTORE/SEC/13F/ZIPXML/
for j in *.zip; do
    echo -e "\n\n Process  ${j}"
    # Clear whatever was extracted from the previous archive.
    rm -rf /mnt/DATASTORE/SEC/13F/ZIPXML/XMLMETA/*
    unzip -q "${j}" -d /mnt/DATASTORE/SEC/13F/ZIPXML/
    cd /mnt/DATASTORE/SEC/13F/ZIPXML/XMLMETA/
    # Move the extracted files into numbered subdirectories of 100 files each.
    ls | parallel -n100 mkdir {#}\;mv {} {#}
    cd /mnt/appuappu-mexxon/COMPANY_XML_DB/13F/SCRIPTS/
    # Run the insert query once per batch directory.
    for D in $(find /mnt/DATASTORE/SEC/13F/ZIPXML/XMLMETA/*/ -type d); do
        echo -e "\n\n  Process ${D}"
        java -Xmx1512m -Xss8096k -cp ../../../SEC_SERVER/LIB/saxon9ee.jar:../../../SEC_SERVER/LIB/BaseX85.jar org.basex.BaseX -bpath="$D" ../XQUERY/add13FDirectory.xq
    done
    cd /mnt/DATASTORE/SEC/13F/ZIPXML/
done
./optimizedb.sh
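
Where GNU parallel is not installed, the batching step (the "ls | parallel -n100" line) could probably be done in plain shell instead. This is only a rough sketch: the function name split_into_batches and its arguments are made up for illustration and are not part of the script above.

```shell
# split_into_batches DIR [SIZE]  (hypothetical helper, not from the original script)
# Moves the regular files directly inside DIR into numbered subdirectories
# DIR/1, DIR/2, ... holding at most SIZE files each (default: 100).
split_into_batches() {
    local dir size count batch f
    dir=$1
    size=${2:-100}
    count=0
    batch=0
    # The glob is expanded once, before any batch directory is created.
    for f in "$dir"/*; do
        [ -f "$f" ] || continue          # skip subdirectories and an unmatched glob
        if [ $(( count % size )) -eq 0 ]; then
            batch=$(( batch + 1 ))       # start a new batch directory
            mkdir -p "$dir/$batch"
        fi
        mv -- "$f" "$dir/$batch/"
        count=$(( count + 1 ))
    done
}
```

It would be called as, for example, split_into_batches /mnt/DATASTORE/SEC/13F/ZIPXML/XMLMETA 100 right after the unzip step.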

Hmm, I suppose I shouldn't be using ../../../SEC_SERVER/LIB/saxon9ee.jar in this case. I just copied that from my other scripts.
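
For illustration, the call without the Saxon jar might look like the sketch below. It assumes that BaseX85.jar alone is enough for add13FDirectory.xq, which has not been verified here. The wrapper name insert_batch is hypothetical; the paths and JVM flags are the ones from the script above.

```shell
# insert_batch DIR  (hypothetical wrapper)
# Runs the insert query against one batch directory, with only the BaseX jar
# on the classpath.
# Assumption: add13FDirectory.xq does not need anything from saxon9ee.jar.
insert_batch() {
    java -Xmx1512m -Xss8096k \
        -cp ../../../SEC_SERVER/LIB/BaseX85.jar \
        org.basex.BaseX "-bpath=$1" ../XQUERY/add13FDirectory.xq
}
```

Inside the inner loop it would replace the long java line, as insert_batch "$D".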


On Tue, Jul 12, 2016 at 2:42 PM, Dirk Kirsten <dk@basex.org> wrote:
Hi Max,

I totally agree. By chance, I also ran into some issues with indexes
yesterday and found the current documentation, especially on index
configuration, not very exhaustive. From the point of view of
inexperienced BaseX users in particular, I find it rather inconvenient
to figure out how to properly create and maintain indexes.

Best regards from the other side of the lake, Dirk

On 07/12/2016 04:33 PM, Maximilian Gärber wrote:
> Hi,
>
> I will try to add the information I found most helpful, but I am sure
> it will not be exhaustive...
>
> Regards,
>
> Max
>
> 2016-07-12 9:36 GMT+02:00 Christian Grün <christian.gruen@gmail.com>:
>> Hi Max,
>>
>> Good idea. I think it would fit into the "Advanced User's Guide".
>> Personally, I would keep XQuery Update in the XQuery section, but
>> "Indexes" (c|sh)ould surely be moved. And "Index Configuration" still
>> needs to be written. Does it mean that you’d be interested in writing
>> such an article? :)
>>
>> Cheers
>> Christian
>>
>>
>>> I was just thinking there could be other users that stumble over the
>>> details of index updates and optimization.
>>>
>>> Maybe this deserves a top-level page in the wiki? Then we would have:
>>>
>>> * Indexes: Explains what is there
>>> * XQuery Update: How to use it
>>> * Index Configuration: How to configure indexes, and good practices
>>>
>>> Regards,
>>>
>>> Max

--
Dirk Kirsten, BaseX GmbH, http://basexgmbh.de
|-- Firmensitz: Blarerstrasse 56, 78462 Konstanz
|-- Registergericht Freiburg, HRB: 708285, Geschäftsführer:
|   Dr. Christian Grün, Dr. Alexander Holupirek, Michael Seiferle
`-- Phone: 0049 7531 91 68 276, Fax: 0049 7531 20 05 22