Re: [basex-talk] csv:parse in the age of XQuery 3.1

8 Sep 2016


Vincent, thank you for these measurements, which induced me to repeat my attempt to parse that 23 MB file. To my great surprise I got results similar to yours - parsing 23 MB took only six seconds!
My former experience (when I had to give up after 20 minutes or so) was gathered 16 months ago - so it seems that the BaseX team has done great work in the meantime - hurray!
Now I am very glad to know that BaseX masters CSV without constraints, which further enhances its value as a data integration engine.
Hans-Jürgen

"Lizzi, Vincent" Vincent.Lizzi@taylorandfrancis.com wrote on Thursday, 8 September 2016, at 18:53:
As it so happens, I just received a 20.5 MB Excel file which I am loading into BaseX as CSV.

To prepare the file, I opened it in Excel and saved it in CSV format. The CSV file is 70 MB. Here is what I observe loading this CSV file into BaseX a few different ways.

1. BaseX GUI – Using “Create Database” with input format CSV, the CSV was loaded and converted to XML in a few seconds.

2. Command script – The CSV was loaded and converted to XML in about 10 seconds.

    SET PARSER csv
    SET CSVPARSER encoding=windows-1252, header=true, separator=comma
    SET CREATEFILTER *.csv
    create database csvtest1 "path\to\file.csv"

3. XQuery – The CSV was loaded and converted to XML in about 20 seconds.

    db:create('csvtest2', csv:parse(file:read-text(' path\to\file.csv'), map{'encoding': 'windows-1252', 'header': true()}), 'file.csv' )

4. XQuery (parsing only) – The CSV file was parsed in about 4 seconds.

    csv:parse(file:read-text(' path\to\file.csv'), map{'encoding': 'windows-1252', 'header': true()})

5. XQuery (parsing only) using map – The CSV file was parsed in about 6 seconds.

    csv:parse(file:read-text(' path\to\file.csv'), map{'encoding': 'windows-1252', 'header': true(), 'format': 'map'})

These alternate methods are, from what I can see, pretty equivalent except for the last one, which produces a map instead of XML. At what point, i.e. how much data in CSV format, would using map start to offer benefits beyond mere convenience?

I came across an example in the documentation that gave me an error message. The Command Line example at http://docs.basex.org/wiki/Parsers#CSV_Parser has

    SET CSVPARSER encoding=utf-8, lines=true, header=false, separator=space

When trying this in BaseX 8.2.3 I get an error message:

    Error: PARSER: csv Unknown option 'lines'.

The “lines” option is not listed in the CSV Module parser documentation at http://docs.basex.org/wiki/CSV_Module#Options.

I didn’t want to correct the example in the documentation without checking whether it is actually incorrect. Does this example need to be updated?

Vincent

From: basex-talk-bounces@mailman.uni-konstanz.de [mailto:basex-talk-bounces@mailman.uni-konstanz.de] On Behalf Of Hans-Juergen Rennau
Sent: Thursday, September 08, 2016 10:02 AM
To: Marc van Grootel marc.van.grootel@gmail.com
Cc: BaseX basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] csv:parse in the age of XQuery 3.1

As for me, I definitely want the CSV as XML. But the performance problems certainly have nothing to do with XML versus CSV (I often deal with > 300 MB XML, which is parsed very fast!) - it is the parsing operation itself which, if I'm not mistaken, is handled by XQuery code and which must be shifted into the Java implementation.

Kind regards,
Hans-Jürgen

Marc van Grootel marc.van.grootel@gmail.com wrote on Thursday, 8 September 2016, at 15:55:

I'm currently dealing with CSV a lot as well. I tend to use the
format=map approach, though not yet with anything nearly as large as a 22 MB CSV. I'm
wondering whether, and by how much, it is more efficient to deal with this type
of data as arrays and map data structures versus XML. For most
processing I can leave serializing to XML to the very end. And if too
large I would probably also chunk it before storing the end result.
Intuitively I would think that dealing with CSV as maps/arrays should
be much faster and less memory intensive.
--Marc
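
For readers comparing the two representations discussed in this thread, here is a minimal sketch of working with the default XML result of csv:parse, the form Hans-Jürgen says he prefers. The inline sample data and the column names Name and City are made up for illustration; they are not taken from the files measured above. The 'format': 'map' variant is not shown here.

```xquery
(: Hypothetical two-record sample; the first line is the header. :)
let $text := "Name,City&#10;Jack,Chicago&#10;John,New York"
(: 'header': true() uses the first line as element names for each field. :)
let $doc := csv:parse($text, map { 'header': true() })
return (
  (: number of data records; the header line is not counted :)
  count($doc//record),
  (: select a field by header name with an ordinary path expression :)
  $doc//record[City = 'Chicago']/Name/string()
)
```

Because the result is an ordinary XML document, it can be passed directly to db:create, as in method 3 above, or filtered with path expressions before anything is stored.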
