Hi Maxime,
BaseX provides no streaming facilities for large XML instances.
However, if you have enough disk space left, you can create a database instance from your XML dump. We have already done this for Wiki dumps up to 420 GB [1]. You should disable the text and attribute indexes; database creation will then consume constant memory.
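In the standalone client, this could look as follows (a minimal sketch; the database name 'enwiki' and the dump path are placeholders to adjust to your setup):

  SET TEXTINDEX false
  SET ATTRINDEX false
  CREATE DB enwiki /path/to/enwiki-pages-articles.xml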
In the next step, you can write a query that writes out CSV entries for all page elements; the File Module and file:append can be helpful for that [2]. If this approach turns out not to be fast enough, you can use the FLWOR window clause for writing out chunks of CSV entries [3]. If your output is projected to be much smaller than your input, you don’t need any window clause, and you could use our CSV Module and the 'xquery' format for serialising your CSV result in one go [4].
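To illustrate the file:append variant, here is a minimal sketch. The database name 'enwiki', the output path and the extracted fields (title, id) are only assumptions for a MediaWiki dump, and the escaping is deliberately crude, so adjust it to your data:

  let $out := '/data/pages.csv'
  return (
    (: write the CSV header once :)
    file:write-text($out, 'title,id&#10;'),
    (: append one CSV line per page element; *: skips the MediaWiki namespace :)
    for $page in db:open('enwiki')//*:page
    let $fields := (
      replace($page/*:title, '[",\n]', ' '),
      $page/*:id
    )
    return file:append-text($out, string-join($fields, ',') || '&#10;')
  )

With the CSV Module route mentioned above, csv:serialize with the 'xquery' format option would instead produce the complete CSV string in a single call.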
Hope this helps,
Christian
[1] http://docs.basex.org/wiki/Statistics
[2] http://docs.basex.org/wiki/File_Module
[3] http://docs.basex.org/wiki/XQuery_3.0#window
[4] http://docs.basex.org/wiki/CSV_Module
On Mon, Feb 24, 2020 at 12:54 AM maxzor maxzor@maxzor.eu wrote:
Do you mean stream a single large XML file? A series of XML files? Or stream a file through a series of XQuery|XSLT|XPath transforms?
Possibly poor wording; I meant reading a large XML file and producing, e.g., a CSV file.
I don’t believe BaseX uses a streaming XML parser, so it probably can’t stream a single large XML file and produce output before it has parsed the complete file.
Do you know of a streaming XML library other than StAX (no Java here :<)?
But it looks like, from the link in your Stack Overflow post, that the data is already sharded into a collection of separate XML files that each contain multiple <page> elements.
That is the alternative: instead of processing the monolithic multistream file, I could crawl over the ~150 MB bz2-compressed chunks.
Regards,
Maxime