As it so happens, I just received a 20.5 MB Excel file which I am loading into BaseX as CSV. To prepare the file, I opened it in Excel and saved it in CSV format.
The CSV file is 70 MB. Here is what I observe when loading this CSV file into BaseX a few different ways.
1. BaseX GUI – Using “Create Database” with input format CSV, the CSV was loaded and converted to XML in a few seconds.
2. Command script – The CSV was loaded and converted to XML in about 10 seconds.
SET PARSER csv
SET CSVPARSER encoding=windows-1252, header=true, separator=comma
SET CREATEFILTER *.csv
create database csvtest1 "path\to\file.csv"
3. XQuery – The CSV was loaded and converted to XML in about 20 seconds.
db:create('csvtest2', csv:parse(file:read-text('path\to\file.csv'), map{'encoding': 'windows-1252', 'header': true()}), 'file.csv')
4. XQuery (parsing only) – The CSV file was parsed in about 4 seconds.
csv:parse(file:read-text('path\to\file.csv'), map{'encoding': 'windows-1252', 'header': true()})
5. XQuery (parsing only) using map – The CSV file was parsed in about 6 seconds.
csv:parse(file:read-text('path\to\file.csv'), map{'encoding': 'windows-1252', 'header': true(), 'format': 'map'})
These alternative methods are, from what I can see, pretty much equivalent, except for the last one, which produces a map instead of XML. At what point, i.e. with how much CSV data, would using a map start to offer benefits beyond mere convenience?
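A rough way to compare the two parse calls directly is BaseX's prof:time. This is only a sketch: the path and options are the same as above, and count() is just there to force full evaluation without returning the huge results.

let $text := file:read-text('path\to\file.csv')
let $opts := map { 'encoding': 'windows-1252', 'header': true() }
return (
  (: XML result :)
  count(prof:time(csv:parse($text, $opts))),
  (: map result :)
  count(prof:time(csv:parse($text, map:put($opts, 'format', 'map'))))
)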
The documentation contains this example:
SET CSVPARSER encoding=utf-8, lines=true, header=false, separator=space
When trying this in BaseX 8.2.3, I get an error message:
Error: PARSER: csv Unknown option 'lines'.
I didn’t want to correct the example in the documentation without checking whether it is actually incorrect. Does this example need to be updated?
Vincent
From: basex-talk-bounces@mailman.uni-konstanz.de On Behalf Of Hans-Juergen Rennau
Sent: Thursday, September 08, 2016 10:02 AM
To: Marc van Grootel <marc.van.grootel@gmail.com>
Cc: BaseX <basex-talk@mailman.uni-konstanz.de>
Subject: Re: [basex-talk] csv:parse in the age of XQuery 3.1
As far as I am concerned, I definitely want the CSV as XML. But the performance problems certainly have nothing to do with XML versus CSV (I often deal with > 300 MB of XML, which is parsed very fast!). It is the parsing operation itself which, if I'm not mistaken, is handled by XQuery code and which must be shifted into the Java implementation.
I'm currently dealing with CSV a lot as well. I tend to use the format=map approach, but with nothing nearly as large as a 22 MB CSV yet. I'm wondering whether, and by how much, it is more efficient to deal with this type of data as arrays and maps rather than as XML. For most processing I can leave serializing to XML until the very end. And if the result is too large, I would probably also chunk it before storing the end result. Intuitively I would think that dealing with CSV as maps/arrays should be much faster and less memory-intensive.
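Roughly the kind of pipeline I have in mind looks like the sketch below. The file name, the 'status' column and the filter are made up, and I am assuming the format=map result keys rows by position and columns by header name; XML is only built at the very end.

let $rows := csv:parse(
  file:read-text('path\to\file.csv'),
  map { 'encoding': 'windows-1252', 'header': true(), 'format': 'map' }
)
(: keep only the rows of interest, working on the map representation :)
let $keep :=
  for $i in map:keys($rows)
  where $rows($i)('status') = 'active'
  order by $i
  return $rows($i)
(: serialize to XML only at the very end :)
return
  <csv>{
    for $row in $keep
    return
      <record>{
        for $name in map:keys($row)
        return element { $name } { $row($name) }
      }</record>
  }</csv>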
--Marc