Hi Maxime,
BaseX provides no streaming facilities for large XML instances.
However, if you have enough disk space left, you can create a database instance from your XML dump. We have already done this for Wiki dumps up to 420 GB [1]. You should disable the text and attribute indexes; database creation will then consume constant memory.
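In the standalone client, this could look as follows (a minimal sketch; the database name 'enwiki' and the dump path are placeholders to adjust to your setup):

  SET TEXTINDEX false
  SET ATTRINDEX false
  CREATE DB enwiki /path/to/enwiki-pages-articles.xml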
In the next step, you can write a query that writes out CSV entries for all page elements; the File Module and file:append can be helpful for that [2]. If this approach turns out not to be fast enough, you can use the FLWOR window clause for writing out chunks of CSV entries [3]. If your output is projected to be much smaller than your input, you don’t need any window clause, and you could use our CSV Module and the 'xquery' format for serialising your CSV result in one go [4].
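To illustrate the file:append variant, here is a minimal sketch. The database name 'enwiki', the output path and the extracted fields (title, id) are only assumptions for a MediaWiki dump, and the escaping is deliberately crude, so adjust it to your data:

  let $out := '/data/pages.csv'
  return (
    (: write the CSV header once :)
    file:write-text($out, 'title,id&#10;'),
    (: append one CSV line per page element; *: skips the MediaWiki namespace :)
    for $page in db:open('enwiki')//*:page
    let $fields := (
      replace($page/*:title, '[",\n]', ' '),
      $page/*:id
    )
    return file:append-text($out, string-join($fields, ',') || '&#10;')
  )

With the CSV Module route mentioned above, csv:serialize with the 'xquery' format option would instead produce the complete CSV string in a single call.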
Hope this helps,
Christian
[1] http://docs.basex.org/wiki/Statistics
[2] http://docs.basex.org/wiki/File_Module
[3] http://docs.basex.org/wiki/XQuery_3.0#window
[4] http://docs.basex.org/wiki/CSV_Module
On Mon, Feb 24, 2020 at 12:54 AM maxzor maxzor@maxzor.eu wrote:
Do you mean stream a single large XML file? A series of XML files? Or stream a file through a series of XQuery|XSLT|XPath transforms?
Possibly poor wording; I meant reading a large XML file and producing, e.g., a CSV file.
I don’t believe BaseX uses a streaming XML parser, so it probably can’t stream a single large XML file and produce output before it has parsed the complete file.
Do you know of a streaming XML library other than StAX (no Java here :<)?
But it looks like, from the link in your Stack Overflow post, that the data is already sharded into a collection of separate XML files that each contain multiple <page> elements.
That is the alternative: instead of processing the monolithic multistream file, I could crawl over the ~150 MB bz2-compressed chunks.
Regards,
Maxime