As it so happens, I just received a 20.5 MB Excel file which I am loading into BaseX as CSV. To prepare the file, I opened it in Excel and saved it in CSV format.
The CSV file is 70 MB. Here is what I observe when loading this CSV file into BaseX a few different ways.
1. BaseX GUI – Using “Create Database” with input format CSV, the CSV was loaded and converted to XML in a few seconds.
2. Command script – The CSV was loaded and converted to XML in about 10 seconds.
SET PARSER csv
SET CSVPARSER encoding=windows-1252, header=true, separator=comma
SET CREATEFILTER *.csv
create database csvtest1 "path\to\file.csv"
3. XQuery – The CSV was loaded and converted to XML in about 20 seconds.
db:create('csvtest2', csv:parse(file:read-text('path\to\file.csv'), map{'encoding': 'windows-1252', 'header': true()}), 'file.csv')
4. XQuery (parsing only) – The CSV file was parsed in about 4 seconds.
csv:parse(file:read-text('path\to\file.csv'), map{'encoding': 'windows-1252', 'header': true()})
5. XQuery (parsing only) using map – The CSV file was parsed in about 6 seconds.
csv:parse(file:read-text('path\to\file.csv'), map{'encoding': 'windows-1252', 'header': true(), 'format': 'map'})
These alternative methods are, from what I can see, pretty much equivalent, except for the last one, which produces a map instead of XML. At what point, i.e. with how much CSV data, would using a map start to offer benefits beyond mere convenience?
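A rough way to compare the two parse calls directly is BaseX's prof:time. This is only a sketch: the path and options are the same as above, and count() is just there to force full evaluation without returning the huge results.

let $text := file:read-text('path\to\file.csv')
let $opts := map { 'encoding': 'windows-1252', 'header': true() }
return (
  (: XML result :)
  count(prof:time(csv:parse($text, $opts))),
  (: map result :)
  count(prof:time(csv:parse($text, map:put($opts, 'format', 'map'))))
)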
The documentation contains this example:
SET CSVPARSER encoding=utf-8, lines=true, header=false, separator=space
When trying this in BaseX 8.2.3, I get an error message:
Error: PARSER: csv Unknown option 'lines'.
I didn’t want to correct the example in the documentation without checking whether it is actually incorrect. Does this example need to be updated?
Vincent
From: basex-talk-bounces@mailman.uni-konstanz.de On Behalf Of Hans-Juergen Rennau
Sent: Thursday, September 08, 2016 10:02 AM
To: Marc van Grootel <marc.van.grootel@gmail.com>
Cc: BaseX <basex-talk@mailman.uni-konstanz.de>
Subject: Re: [basex-talk] csv:parse in the age of XQuery 3.1
As far as I am concerned, I definitely want the CSV as XML. But the performance problems certainly have nothing to do with XML versus CSV (I often deal with > 300 MB of XML, which is parsed very fast!). It is the parsing operation itself which, if I'm not mistaken, is handled by XQuery code and which must be shifted into the Java implementation.
I'm currently dealing with CSV a lot as well. I tend to use the format=map approach, but with nothing nearly as large as a 22 MB CSV yet. I'm wondering whether, and by how much, it is more efficient to deal with this type of data as arrays and maps rather than as XML. For most processing I can leave serializing to XML until the very end. And if the result is too large, I would probably also chunk it before storing the end result. Intuitively I would think that dealing with CSV as maps/arrays should be much faster and less memory-intensive.
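Roughly the kind of pipeline I have in mind looks like the sketch below. The file name, the 'status' column and the filter are made up, and I am assuming the format=map result keys rows by position and columns by header name; XML is only built at the very end.

let $rows := csv:parse(
  file:read-text('path\to\file.csv'),
  map { 'encoding': 'windows-1252', 'header': true(), 'format': 'map' }
)
(: keep only the rows of interest, working on the map representation :)
let $keep :=
  for $i in map:keys($rows)
  where $rows($i)('status') = 'active'
  order by $i
  return $rows($i)
(: serialize to XML only at the very end :)
return
  <csv>{
    for $row in $keep
    return
      <record>{
        for $name in map:keys($row)
        return element { $name } { $row($name) }
      }</record>
  }</csv>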
--Marc