If your archives contain a mix of raw and XML files, have a look at the old ZIP module; it may avoid reading the entire archive.

 

Best regards,

Fabrice

 

From: basex-talk-bounces@mailman.uni-konstanz.de [mailto:basex-talk-bounces@mailman.uni-konstanz.de] On Behalf Of Fabrice Etanchaud
Sent: Monday, 4 May 2015 14:12
To: Hondros, Constantine (ELS-AMS); basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] Pulling files from multiple zips into one DB

 

Dear Constantine,

 

In my experience, commands are always faster than db:* calls.

Maybe someone from the BaseX team could confirm that, and also that commands do not use the Pending Update List?

 

Are you sure you disabled ADDRAW?

If there are many raw files alongside the XML files, you may get better results by first extracting the XML files and re-archiving only those.

I have the same problem with patent archives, where each XML file may come with many PDF and GIF files.
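
To illustrate the re-archiving idea, here is a minimal sketch in Python (not BaseX) that copies only the XML entries of a zip into a new, smaller zip. The function name repack_xml_only and the file paths are hypothetical, and I have not measured how much this helps on archives of your size.

```python
import zipfile


def repack_xml_only(src_zip, dst_zip):
    """Copy only the *.xml entries of src_zip into a new archive dst_zip.

    Hypothetical helper for illustration; returns the number of entries copied.
    """
    copied = 0
    with zipfile.ZipFile(src_zip) as src, \
            zipfile.ZipFile(dst_zip, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            # Skip directories and every non-XML entry (PDF, GIF, ...).
            # Skipped entries are never decompressed.
            if info.is_dir() or not info.filename.lower().endswith(".xml"):
                continue
            # Each XML entry is read into memory one at a time.
            with src.open(info) as entry:
                dst.writestr(info.filename, entry.read())
            copied += 1
    return copied
```

The resulting XML-only archives could then be fed to BaseX in place of the originals, for example repack_xml_only("patents_2015.zip", "patents_2015_xml.zip"), where both file names are made up.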

 

Best regards,

Fabrice

 

From: Hondros, Constantine (ELS-AMS) [mailto:C.Hondros@elsevier.com]
Sent: Monday, 4 May 2015 14:01
To: Fabrice Etanchaud
Subject: RE: Pulling files from multiple zips into one DB

 

Is that going to be any faster, do you think? I tried it and it took a looooong time to read through the zips, so I am hoping there might be a faster, more direct way of doing it.

 

From: Fabrice Etanchaud [mailto:fetanchaud@questel.com]
Sent: 04 May 2015 13:56
To: Hondros, Constantine (ELS-AMS)
Subject: RE: Pulling files from multiple zips into one DB

 

Hello Constantine,

 

Why don’t you simply create a new collection with ADDARCHIVES=true?
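
As a rough sketch, this could be written as a BaseX command script along the following lines. The directory C:\data\zips is a made-up placeholder, my_db is the database name from your query, and I believe the three option values shown are already the defaults, so the SET lines may be redundant.

```
# Parse XML files found inside archives (zip, etc.)
SET ADDARCHIVES true
# Do not store non-XML files as raw resources
SET ADDRAW false
# Only consider files matching this pattern
SET CREATEFILTER *.xml
# Hypothetical input directory containing the zips
CREATE DB my_db C:\data\zips
```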

 

Best regards,

Fabrice

 

From: basex-talk-bounces@mailman.uni-konstanz.de [mailto:basex-talk-bounces@mailman.uni-konstanz.de] On Behalf Of Hondros, Constantine (ELS-AMS)
Sent: Monday, 4 May 2015 13:50
To: basex-talk@mailman.uni-konstanz.de
Subject: [basex-talk] Pulling files from multiple zips into one DB

 

Hello all,

I need to merge any XML files located in 500 GB of zips into a single DB for further analysis. Is there any faster or more efficient way to do it in BaseX than this? TIA.

 

for $zip in file:list($src, false(), '*.zip')
  let $arch := file:read-binary(concat($src, '\', $zip))
  for $a in archive:entries($arch)[ends-with(., 'xml')]
  return db:add('my_db', archive:extract-text($arch, $a), $a)

 

 

TIA,

Constantine

 

 


Elsevier B.V. Registered Office: Radarweg 29, 1043 NX Amsterdam, The Netherlands, Registration No. 33156677, Registered in The Netherlands.

 

