As it happens, I just received a 20.5 MB Excel file which I am loading into BaseX as CSV. To prepare the file, I opened it in Excel and saved it in CSV format. The CSV file is 70 MB. Here is what I observed when loading this CSV file into BaseX a few different ways.
1. BaseX GUI – Using “Create Database” with input format CSV, the CSV was loaded and converted to XML in a few seconds.
2. Command script – The CSV was loaded and converted to XML in about 10 seconds.
SET PARSER csv
SET CSVPARSER encoding=windows-1252, header=true, separator=comma
SET CREATEFILTER *.csv
create database csvtest1 "path\to\file.csv"
3. XQuery – The CSV was loaded and converted to XML in about 20 seconds.
db:create(
  'csvtest2',
  csv:parse(
    file:read-text('path\to\file.csv'),
    map { 'encoding': 'windows-1252', 'header': true() }
  ),
  'file.csv'
)
4. XQuery (parsing only) – The CSV file was parsed in about 4 seconds.
csv:parse(file:read-text('path\to\file.csv'), map { 'encoding': 'windows-1252', 'header': true() })
5. XQuery (parsing only) using map – The CSV file was parsed in about 6 seconds.
csv:parse(file:read-text('path\to\file.csv'), map { 'encoding': 'windows-1252', 'header': true(), 'format': 'map' })
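For reference, the result of that last call can be inspected directly; a minimal sketch, assuming the map format's outer keys are row numbers and, with header=true(), its inner keys are the column names ('Name' is a made-up column):

(: Inspect the format=map result. Assumes outer keys are row numbers
   and inner keys are header names; 'Name' and the path are placeholders. :)
let $rows := csv:parse(
  file:read-text('path\to\file.csv'),
  map { 'encoding': 'windows-1252', 'header': true(), 'format': 'map' }
)
return (
  map:size($rows),      (: number of data rows :)
  $rows(1)('Name')      (: value of the 'Name' column in row 1 :)
)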
These alternative methods are, from what I can see, roughly equivalent, except for the last one, which produces a map instead of XML. At what point, i.e. with how much data in CSV format, would using a map start to offer benefits beyond mere convenience?
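One way to get a feel for the crossover would be to time a typical operation against both representations. A rough sketch, assuming a numeric column I'll call 'Amount' and using prof:time, which prints evaluation times to standard error:

(: Time one workload (summing a column) over both representations.
   'Amount' is a placeholder column, the path a placeholder path. :)
let $text := file:read-text('path\to\file.csv')
let $opts := map { 'encoding': 'windows-1252', 'header': true() }
let $xml  := csv:parse($text, $opts)
let $rows := csv:parse($text, map:put($opts, 'format', 'map'))
return (
  (: sum the column via the XML representation :)
  prof:time(sum($xml//record/Amount ! xs:decimal(.))),
  (: sum the same column via the map representation :)
  prof:time(sum(map:for-each($rows, function($n, $row) { xs:decimal($row('Amount')) })))
)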
I came across an example in the documentation that gives me an error message. The Command Line example at http://docs.basex.org/wiki/Parsers#CSV_Parser has:
SET CSVPARSER encoding=utf-8, lines=true, header=false, separator=space
When trying this in BaseX 8.2.3 I get an error message:
Error: PARSER: csv Unknown option 'lines'.
The “lines” option is not listed in the CSV Module parser documentation at http://docs.basex.org/wiki/CSV_Module#Options.
I didn’t want to correct the example in the documentation without checking whether it is actually incorrect. Does this example need to be updated?
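If the option has simply been removed since the example was written, presumably the corrected example would just drop it:

SET CSVPARSER encoding=utf-8, header=false, separator=space

though someone who knows the parser's history should confirm what lines=true used to do.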
Vincent
From: basex-talk-bounces@mailman.uni-konstanz.de [mailto:basex-talk-bounces@mailman.uni-konstanz.de] On Behalf Of Hans-Juergen Rennau
Sent: Thursday, September 08, 2016 10:02 AM
To: Marc van Grootel <marc.van.grootel@gmail.com>
Cc: BaseX <basex-talk@mailman.uni-konstanz.de>
Subject: Re: [basex-talk] csv:parse in the age of XQuery 3.1
As for me, I definitely want the CSV as XML. But the performance problems certainly have nothing to do with XML versus CSV (I often deal with XML larger than 300 MB, which is parsed very fast!) - it is the parsing operation itself which, if I'm not mistaken, is handled by XQuery code and which must be shifted into the Java implementation.
Kind regards, Hans-Jürgen
Marc van Grootel <marc.van.grootel@gmail.com> wrote on Thursday, 8 September 2016, at 15:55:
I'm currently dealing with CSV a lot as well. I tend to use the format=map approach, but with nothing nearly as large as a 22 MB CSV yet. I'm wondering whether, and by how much, it is more efficient to deal with this type of data as arrays and maps rather than as XML. For most processing I can leave serializing to XML until the very end, and if the result were too large I would probably also chunk it before storing the end result.
Intuitively, I would think that dealing with CSV as maps/arrays should be much faster and less memory-intensive.
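Something along these lines is what I have in mind for the chunking; a rough sketch with the map format, assuming the outer keys are row numbers (the path, chunk size, and 'Name' column are all placeholders):

(: Process format=map rows in fixed-size chunks, deferring XML
   construction to the end. Assumes row-numbered outer keys. :)
let $rows  := csv:parse(
  file:read-text('path\to\file.csv'),
  map { 'header': true(), 'format': 'map' }
)
let $size  := 10000                          (: rows per chunk; made-up value :)
let $count := map:size($rows)
for $chunk in 1 to xs:integer(ceiling($count div $size))
let $first := ($chunk - 1) * $size + 1
let $last  := min(($chunk * $size, $count))
return
  <chunk nr="{ $chunk }">{
    for $i in $first to $last
    return <record>{ $rows($i)('Name') }</record>
  }</chunk>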
--Marc