Dear BaseX developers,
I noticed in example 3 under http://docs.basex.org/wiki/CSV_Module#Examples that csv:parse() with option { 'format': 'map' } returns a map of maps, with hardcoded row numbers:
map { 1: map { "City": "Newton", "Name": "John" }, 2: map { "City": "Oldtown", "Name": "Jack" } }
Because maps are unordered, using them to represent something ordered like the rows of a CSV requires hardcoded row numbers for reassembling the rows in document order. I assume this was a necessary approach when the module was developed in the map-only world of XQuery 3.0. Now that 3.1 supports arrays, might an array of maps be a closer fit for CSV parsing?
array { map { "City": "Newton", "Name": "John" }, map { "City": "Oldtown", "Name": "Jack" } }
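As a sketch of what consuming such a result could look like (using only standard XQuery 3.1 array and map lookups), row order would be preserved without any synthetic keys:

```xquery
let $rows := array {
  map { "City": "Newton",  "Name": "John" },
  map { "City": "Oldtown", "Name": "Jack" }
}
for $i in 1 to array:size($rows)
return $rows($i)?City
(: "Newton", then "Oldtown", in row order :)
```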
I'm also curious, do you know of any efforts to create an EXPath spec for CSV? Putting spec and CSV in the same sentence is dangerous, since CSV is a notoriously under-specified format: "The CSV file format is not standardized" (see https://en.wikipedia.org/wiki/Comma-separated_values). But perhaps there is a common enough need for CSV parsing that such a spec would benefit the community? I thought I'd start by asking here, since BaseX's CSV module seems to be the most developed (or only?) one in XQuery.
Then there's the question of how to approach implementations of such a spec. While XQuery is probably capable of parsing and serializing small enough CSV, CSVs do get large, and naive processing with XQuery would tend to run into memory issues (as I found with xqjson). This means implementations would tend to be written in a lower-level language. eXist, for example, uses Jackson for fn:parse-json(). I see Jackson has a CSV extension too: https://github.com/FasterXML/jackson-dataformat-csv. Any thoughts on the suitability of XQuery for the task?
Joe
Joe, just to back you: I believe that an EXPath spec for CSV processing would be *extremely* useful! (There is hardly a format as ubiquitous as CSV.) And I had a similar experience concerning the performance - concretely, a 22 MB file proved to be simply unprocessable! Which means that BaseX support for CSV is only partial.
So I ardently hope for the BaseX team to enable the parsing of large CSV, and I hope for an initiative pulling CSV into EXPath!

Kind regards,
Hans-Jürgen
I'm currently dealing with CSV a lot as well. I tend to use the format=map approach, though I haven't handled anything nearly as large as a 22 MB CSV yet. I'm wondering whether, or how much, it is more efficient to deal with this type of data as arrays and maps versus XML. For most processing I can leave serializing to XML to the very end. And if the result is too large I would probably also chunk it before storing it.
Intuitively I would think that dealing with CSV as maps/arrays should be much faster and less memory intensive.
--Marc
As for me, I definitely want the CSV as XML. But the performance problems have certainly nothing to do with XML versus CSV (I often deal with > 300 MB XML, which is parsed very fast!) - it is the parsing operation itself which, if I'm not mistaken, is handled by XQuery code and which must be shifted into the Java implementation.

Kind regards,
Hans-Jürgen
As it so happens, I just received a 20.5 MB Excel file which I am loading into BaseX as CSV. To prepare the file, I opened it in Excel and saved it in CSV format. The CSV file is 70 MB. Here is what I observe loading this CSV file into BaseX a few different ways.
1. BaseX GUI – Using “Create Database” with input format CSV, the CSV was loaded and converted to XML in a few seconds.

2. Command script – The CSV was loaded and converted to XML in about 10 seconds.

```
SET PARSER csv
SET CSVPARSER encoding=windows-1252, header=true, separator=comma
SET CREATEFILTER *.csv
CREATE DATABASE csvtest1 "path\to\file.csv"
```

3. XQuery – The CSV was loaded and converted to XML in about 20 seconds.

```xquery
db:create('csvtest2',
  csv:parse(file:read-text('path\to\file.csv'),
    map { 'encoding': 'windows-1252', 'header': true() }),
  'file.csv')
```

4. XQuery (parsing only) – The CSV file was parsed in about 4 seconds.

```xquery
csv:parse(file:read-text('path\to\file.csv'),
  map { 'encoding': 'windows-1252', 'header': true() })
```

5. XQuery (parsing only) using the map format – The CSV file was parsed in about 6 seconds.

```xquery
csv:parse(file:read-text('path\to\file.csv'),
  map { 'encoding': 'windows-1252', 'header': true(), 'format': 'map' })
```
These alternative methods are, from what I can see, pretty much equivalent, except for the last one, which produces a map instead of XML. At what point, i.e. at what volume of CSV data, would using a map start to offer benefits beyond mere convenience?
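For comparison's sake, here is a minimal sketch of consuming the map format once parsed (the column name 'City' is hypothetical; rows are maps keyed 1..n, and with 'header': true() each row map is keyed by header name):

```xquery
let $rows := csv:parse(file:read-text('path\to\file.csv'),
  map { 'encoding': 'windows-1252', 'header': true(), 'format': 'map' })
for $i in 1 to map:size($rows)
return $rows($i)('City')  (: direct key access, no XML navigation :)
```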
I came across an example in the documentation that gave me an error message. The Command Line example at http://docs.basex.org/wiki/Parsers#CSV_Parser has
SET CSVPARSER encoding=utf-8, lines=true, header=false, separator=space
When trying this in BaseX 8.2.3 I get an error message:
Error: PARSER: csv Unknown option 'lines'.
The “lines” option is not listed in the CSV Module parser documentation at http://docs.basex.org/wiki/CSV_Module#Options.
I didn’t want to correct the example in the documentation without checking whether it is actually incorrect. Does this example need to be updated?
Vincent
Vincent, thank you for these measurements, which induced me to repeat my attempt to parse that 23 MB file. To my great surprise I got results similar to yours - parsing 23 MB took only six seconds!
My former experience (when I had to give up after 20 minutes or so) was gathered 16 months ago - so it seems that the BaseX team has done great work in the meantime - hurray!
Now I am very glad to know that BaseX masters CSV without constraints, which further enhances its value as a data integration engine.
Hans-Jürgen
Hi Joe,
Thanks for your mail. You are completely right: using an array would be the natural choice for csv:parse. It’s mostly due to backward compatibility that we didn’t update the function.
@All: I’m pretty sure that all of us would like having an EXPath spec for parsing CSV data. We still need one volunteer to make it happen ;) Anyone out there?
Cheers,
Christian
Thanks for your replies and interest, Hans-Jürgen, Marc, Vincent, and Christian.
The other day, short of a comprehensive solution, I went in search of a regex that would handle quoted values that contain commas that shouldn't serve as delimiters. I found one that worked in eXist but not in BaseX.
Source for the regex: http://stackoverflow.com/a/13259681/659732
The query:

```xquery
xquery version "3.1";

let $csv := 'Author,Title,ISBN,Binding,Year Published
Jeannette Walls,The Glass Castle,074324754X,Paperback,2006
James Surowiecki,The Wisdom of Crowds,9780385503860,Paperback,2005
Lawrence Lessig,The Future of Ideas,9780375505782,Paperback,2002
"Larry Bossidy, Ram Charan, Charles Burck",Execution,9780609610572,Hardcover,2002
Kurt Vonnegut,Slaughterhouse-Five,9780791059258,Paperback,1999'
let $lines := tokenize($csv, '\n')
let $header-row := fn:head($lines)
let $body-rows := fn:tail($lines)
let $headers := fn:tokenize($header-row, ",") ! fn:replace(., " ", "")
for $row in $body-rows
let $cells := fn:analyze-string($row, '(?:\s*(?:"([^"]*)"|([^,]+))\s*,?|(?<=,)(),?)+?')//fn:group
return element Book {
  for $cell at $count in $cells
  return element {$headers[$count]} {$cell/string()}
}
```

It produces the desired results:

```xml
<Book><Author>Jeannette Walls</Author><Title>The Glass Castle</Title><ISBN>074324754X</ISBN><Binding>Paperback</Binding><YearPublished>2006</YearPublished></Book>
<Book><Author>James Surowiecki</Author><Title>The Wisdom of Crowds</Title><ISBN>9780385503860</ISBN><Binding>Paperback</Binding><YearPublished>2005</YearPublished></Book>
<Book><Author>Lawrence Lessig</Author><Title>The Future of Ideas</Title><ISBN>9780375505782</ISBN><Binding>Paperback</Binding><YearPublished>2002</YearPublished></Book>
<Book><Author>Larry Bossidy, Ram Charan, Charles Burck</Author><Title>Execution</Title><ISBN>9780609610572</ISBN><Binding>Hardcover</Binding><YearPublished>2002</YearPublished></Book>
<Book><Author>Kurt Vonnegut</Author><Title>Slaughterhouse-Five</Title><ISBN>9780791059258</ISBN><Binding>Paperback</Binding><YearPublished>1999</YearPublished></Book>
```
Unfortunately BaseX complains about the regex, with the following error:
Stopped at /Users/joe/file, 9/32: [FORX0002] Invalid regular expression: (?:\s*(?:"([^"]*)"|([^,]+))\s*,?|(?<=,)(),?)+?.
Without a column location, I'm unable to tell where the problem is. Is there something used in this expression that BaseX doesn't support?
On the topic of the potential memory pitfalls of a pure XQuery solution for our hypothetical EXPath library, I think the primary problem is that the entire CSV has to be loaded into memory. I wonder if implementations could use the new `fn:unparsed-text-lines()` function from XQuery 3.0 to stream the CSV through XQuery without requiring the entire thing to be in memory? Or are we basically setting ourselves up for the EXPath solution being a wrapper around an external library written in a lower-level language?
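To make the idea concrete, a naive sketch of such a streaming approach might look like the following - assuming the processor actually streams unparsed-text-lines(), and ignoring quoted commas and embedded newlines entirely:

```xquery
(: naive line-at-a-time CSV-to-XML; quoted commas/newlines are NOT handled :)
element csv {
  for $line in unparsed-text-lines('file.csv')
  return element record {
    for $field in tokenize($line, ',')
    return element entry { $field }
  }
}
```

Whether this avoids holding everything in memory would of course depend on the implementation, not just the function.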
Joe
Joe, concerning your regex, I would complain, too! Already the first two characters `(?` render the expression invalid:
(1) An unescaped ? is an occurrence indicator, making the preceding entity optional.
(2) An unescaped ( is used for grouping; it does not represent anything.
=> There is no entity preceding the ? which the ? could make optional => error.
Please keep in mind that the regex flavor supported by XPath is the regex flavor defined by the XSD spec. There are a few constructs used in Perl & Co which are not defined in XPath regex.
As for the CSV implementation, I came to realize my error: the BaseX implementation *is* Java code, not XQuery code - the .xqm module just contains the function signatures, marked "external".

Cheers,
Hans
Hans-Jürgen, I figured as much. I wonder if we can come up with an XSD-compliant regex for this purpose? It may not give us a full-featured CSV parser, but it would handle reasonably uniform cases.

Joe
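One possible starting point - strictly a sketch, restricted to constructs available in the XPath regex flavor (no lookbehind), and deliberately incomplete: empty fields and escaped quotes are not handled:

```xquery
let $row := '"Larry Bossidy, Ram Charan, Charles Burck",Execution,9780609610572,Hardcover,2002'
for $m in fn:analyze-string($row, '"([^"]*)"|([^,]+)')/fn:match
return string($m/fn:group)
(: one string per field; the quotes are stripped from the quoted field :)
```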
On Sun, Sep 11, 2016 at 3:39 PM -0400, "Hans-Juergen Rennau" hrennau@yahoo.de wrote:
Joe, concerning your regex, I would complain, too! Already the first two characters (?render the expression invalid:(1) An unescaped ? is an occurrence indicator, making the preceding entity optional(2) An unescaped ( is used for grouping, it does not repesent anything => there is no entity preceding the ? which the ? could make optional => error
Please keep in mind that the regex flavor supported by XPath is the regex flavor defined by the XSD spec. There are a few constructs used in Perl & Co which are not defined in XPath regex.
What concerns the CSV implementation, I came to realize my error: the BaseX implementation *is* Java code, not XQuery code - the xqm module just contains the function signature, marked "external". Cheers,Hans
Joe Wicentowski joewiz@gmail.com schrieb am 21:27 Sonntag, 11.September 2016:
Thanks for your replies and interest, Hans-Jürgen, Marc, Vincent, and Christian.
The other day, short of a comprehensive solution, I went in search of a regex that would handle quoted values that contain commas that shouldn't serve as delimiters. I found one that worked in eXist but not in BaseX.
Source for the regex: http://stackoverflow.com/a/13259681/659732
The query:
``` xquery version "3.1";
On Sun, Sep 11, 2016 at 4:53 AM, Christian Grün christian.gruen@gmail.com wrote:
Hi Joe,
Thanks for your mail. You are completely right, using an array would be the natural choice with csv:parse. It’s mostly due to backward compatibility that we didn’t update the function.
@All: I’m pretty sure that all of us would like having an EXPath spec for parsing CSV data. We still need one volunteer to make it happen ;) Anyone out there?
Cheers Christian
Joe, just in case it is of interest to you: the TopicTools framework, downloadable at https://github.com/hrennau/topictools, contains an XQuery-implemented, full-featured CSV parser (module _csvParser.xqm, 212 lines). When you write XQuery tools using the framework, the parser is automatically added to your application code, and you can declare command-line parameters of your tool to have (among many others) a CSV-based data type. Having declared a parameter to have such a type (e.g. csvURI or csvFOX), you can forget about CSV, as your application code sees nothing but XML. More offline, if this is interesting to you.

Cheers,
Hans

PS - illustrating ...

Tool invocations:

basex -b "request=csve?cox=joe.csv<header=true" /mytool/mytool.xq
basex -b "request=csve?cox=/foo//(*foobar*.csv except betternot*.csv)<header=true" /mytool/mytool.xq

Parameter declaration (within a module of the mytool app):

<param name="cox" type="csvFOX?" sep="WS" pgroup="input"/>

Application code (within a module of the mytool app):

let $docs := tt:getParams($request, 'cox')
(: here you go - $docs is bound to the XML documents obtained by parsing all files matching the FOXpath expression supplied by the caller into an XML representation; ...)
Joe Wicentowski joewiz@gmail.com schrieb am 21:44 Sonntag, 11.September 2016:
Hans-Jürgen,

I figured as much. I wonder if we can come up with an XSD-compliant regex for this purpose? It may not give us a full-featured CSV parser, but it would handle reasonably uniform cases.

Joe
Sent from my iPhone
On Sun, Sep 11, 2016 at 3:39 PM -0400, "Hans-Juergen Rennau" hrennau@yahoo.de wrote:
Joe, concerning your regex, I would complain, too! Already the first two characters (? render the expression invalid: (1) An unescaped ? is an occurrence indicator, making the preceding entity optional. (2) An unescaped ( is used for grouping; it does not represent anything => there is no entity preceding the ? which the ? could make optional => error.
Please keep in mind that the regex flavor supported by XPath is the regex flavor defined by the XSD spec. There are a few constructs used in Perl & Co which are not defined in XPath regex.
What concerns the CSV implementation, I came to realize my error: the BaseX implementation *is* Java code, not XQuery code - the xqm module just contains the function signature, marked "external".

Cheers,
Hans
Joe Wicentowski joewiz@gmail.com schrieb am 21:27 Sonntag, 11.September 2016:
Thanks for your replies and interest, Hans-Jürgen, Marc, Vincent, and Christian.
The other day, short of a comprehensive solution, I went in search of a regex that would handle quoted values that contain commas that shouldn't serve as delimiters. I found one that worked in eXist but not in BaseX.
Source for the regex: http://stackoverflow.com/a/13259681/659732
The query:
```xquery
xquery version "3.1";

let $csv := 'Author,Title,ISBN,Binding,Year Published
Jeannette Walls,The Glass Castle,074324754X,Paperback,2006
James Surowiecki,The Wisdom of Crowds,9780385503860,Paperback,2005
Lawrence Lessig,The Future of Ideas,9780375505782,Paperback,2002
"Larry Bossidy, Ram Charan, Charles Burck",Execution,9780609610572,Hardcover,2002
Kurt Vonnegut,Slaughterhouse-Five,9780791059258,Paperback,1999'
let $lines := tokenize($csv, '\n')
let $header-row := fn:head($lines)
let $body-rows := fn:tail($lines)
let $headers := fn:tokenize($header-row, ",") ! fn:replace(., " ", "")
for $row in $body-rows
let $cells := fn:analyze-string($row, '(?:\s*(?:"([^"]*)"|([^,]+))\s*,?|(?<=,)(),?)+?')//fn:group
return
  element Book {
    for $cell at $count in $cells
    return element {$headers[$count]} {$cell/string()}
  }
```

It produces the desired results:

```xml
<Book>
  <Author>Jeannette Walls</Author>
  <Title>The Glass Castle</Title>
  <ISBN>074324754X</ISBN>
  <Binding>Paperback</Binding>
  <YearPublished>2006</YearPublished>
</Book>
<Book>
  <Author>James Surowiecki</Author>
  <Title>The Wisdom of Crowds</Title>
  <ISBN>9780385503860</ISBN>
  <Binding>Paperback</Binding>
  <YearPublished>2005</YearPublished>
</Book>
<Book>
  <Author>Lawrence Lessig</Author>
  <Title>The Future of Ideas</Title>
  <ISBN>9780375505782</ISBN>
  <Binding>Paperback</Binding>
  <YearPublished>2002</YearPublished>
</Book>
<Book>
  <Author>Larry Bossidy, Ram Charan, Charles Burck</Author>
  <Title>Execution</Title>
  <ISBN>9780609610572</ISBN>
  <Binding>Hardcover</Binding>
  <YearPublished>2002</YearPublished>
</Book>
<Book>
  <Author>Kurt Vonnegut</Author>
  <Title>Slaughterhouse-Five</Title>
  <ISBN>9780791059258</ISBN>
  <Binding>Paperback</Binding>
  <YearPublished>1999</YearPublished>
</Book>
```
Unfortunately BaseX complains about the regex, with the following error:
Stopped at /Users/joe/file, 9/32: [FORX0002] Invalid regular expression: (?:\s(?:"([^"])"|([^,]+))\s*,?|(?<=,)(),?)+?.
Without a column location, I'm unable to tell where the problem is. Is there something used in this expression that BaseX doesn't support?
On the topic of the potential memory pitfalls of a pure XQuery solution for our hypothetical EXPath library, I think the primary problem is that the entire CSV has to be loaded into memory. I wonder if implementations could use the new `fn:unparsed-text-lines()` function from XQuery 3.0 to stream the CSV through XQuery without requiring the entire thing to be in memory? Or are we basically setting ourselves up for the EXPath solution being a wrapper around an external library written in a lower level language?
Joe
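The row-at-a-time idea raised above is easy to sketch in any language with lazy iteration. As an illustration (in Python rather than XQuery, purely so the shape is concrete), a generator that pairs each data row with the header holds only one row in memory at a time:

```python
import csv
import io

def stream_rows(lines):
    """Yield one dict per CSV row, pairing header names with cell values.

    Only one row is ever materialized at a time -- the analogue of feeding
    fn:unparsed-text-lines() through a FLWOR clause row by row.
    """
    reader = csv.reader(lines)
    headers = next(reader)
    for row in reader:
        yield dict(zip(headers, row))

# Simulate a file with an in-memory buffer; a real file object works the same.
data = io.StringIO('Name,City\nJohn,Newton\nJack,Oldtown\n')
rows = list(stream_rows(data))
```

An XQuery implementation would replace the generator with a FLWOR clause over fn:unparsed-text-lines() - whether memory stays flat then depends on the processor evaluating that function lazily.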
@Hans-Jürgen… Nice work, thanks for the hint!
Hi Joe,
My concern is that a single regex, no matter how complex, won’t do justice to parsing arbitrary CSV data. The CSV input we have received for testing so far was simply too diverse (I spent 10% of my time implementing a basic CSV parser in BaseX, and 90% examining special cases and implementing custom extensions). As an example, some comments regarding quotes:
* Quotes can serve as delimiters for a single cell, but they can also be escaped: "Hi "John""
* Quotes can also occur inside a string. In Excel, those quotes will be escaped via double quotes… Hi ""John"". Other parsers prefer backslashes.
* Real-life CSV data is regularly corrupt, so we’d also need to tolerate missing trailing quotes and other special cases.
On the other hand, parsing tabular texts is basically no big deal, and our current Java implementation [1] could surely be ported to XQuery.
Christian
[1] https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/ba...
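The two escaping conventions Christian mentions can be seen side by side; as an illustration, assuming Python's standard csv module (not BaseX), Excel-style doubled quotes are the default dialect, while backslash escaping needs an explicit escapechar:

```python
import csv

# Excel convention: a literal quote inside a quoted cell is doubled.
excel_row = '"Hi ""John""",Paperback'
excel_cells = next(csv.reader([excel_row]))

# Backslash convention: the same cell, escaped the way some other tools write it.
backslash_row = '"Hi \\"John\\"",Paperback'
backslash_cells = next(csv.reader([backslash_row], doublequote=False, escapechar='\\'))
```

Both readers recover the same two cells, but only because the dialect was declared up front - which is exactly why real-world CSV with no declared dialect is so painful.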
Hans-Jürgen wrote:

> ... I would complain, too! Already the first two characters (? render the expression invalid: (1) An unescaped ? is an occurrence indicator, making the preceding entity optional. (2) An unescaped ( is used for grouping; it does not represent anything => there is no entity preceding the ? which the ? could make optional => error.
Actually (?: .... ) is a non-capturing group, defined in XPath 3.0 and XQuery 3.0, based on the same syntax in other languages.
This extension, like a number of others, is useful because the expression syntax defined by XSD doesn't make use of capturing groups (there's no \1 or $1 or whatever), and so it doesn't need non-capturing groups, but in XPath and XQuery they are used.
See e.g. https://www.w3.org/TR/xpath-functions-30/#regex-syntax
Liam

--
Liam R. E. Quin, liam@w3.org
The World Wide Web Consortium (W3C)
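Liam's point is easy to see concretely in a flavor that has both forms; assuming Python's re module (used here only as an illustration), a (?:…) group allows repetition without shifting the numbering of the groups you actually want to capture:

```python
import re

# Non-capturing: (?:ab)+ repeats without creating a numbered group,
# so group 1 is the trailing (c).
m = re.match(r'(?:ab)+(c)', 'ababc')

# Capturing: (ab)+ occupies group 1, pushing (c) to group 2.
m2 = re.match(r'(ab)+(c)', 'ababc')
```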
Cordial thanks, Liam - I was not aware of that!

@Joe: Rule of life: when one is especially sure to be right, one is surely wrong, and so was I, and right were you(r first two characters).
I didn’t check the regex in general, but one reason why I think it fails is the escaped quote. For example, the following query is illegal in XQuery 3.1…

matches('a"b', 'a\"b')

…whereas the following one is OK:
matches('a"b', 'a"b')
Hi all,
Christian: I completely agree, CSV is a nightmare. One way to reduce the headaches (in, say, developing an EXPath CSV library) might be to require that CSV pass validation by a tool such as http://digital-preservation.github.io/csv-validator/. Adam Retter presented his work on CSV Schema and CSV Validator at http://slides.com/adamretter/csv-validation. This might require the user to fix issues in the CSV first, but would reduce the scope of variation considerably. I notice that the Jackson CSV parser leverages the notion of a schema in its imports: https://github.com/FasterXML/jackson-dataformat-csv.
Hans-Jürgen: Thanks for the pointer to your library - it looks fantastic. I look forward to trying it out.
Liam: Thanks for the info about XQuery's additional regex handling beyond XSD.
And, lastly, to keep this post still BaseX-related...
Christian: I tried removing the quote escaping but still get an error. Here's a small test to reproduce:
```xquery
xquery version "3.1";

let $row := '"Larry Bossidy, Ram Charan, Charles Burck",Execution,9780609610572,Hardcover,2002'
return fn:analyze-string($row, '(?:\s*(?:"([^"]*)"|([^,]+))\s*,?|(?<=,)(),?)+?')
```
Joe
On Mon, Sep 12, 2016 at 7:29 AM, Christian Grün christian.gruen@gmail.com wrote:
I didn’t check the regex in general, but one reason I think why it fails is the escaped quote. For example, the following query is illegal in XQuery 3.1…
matches('a"b', 'a"b')
…where as the following one is ok:
matches('a"b', 'a"b')
On Mon, Sep 12, 2016 at 1:15 PM, Hans-Juergen Rennau hrennau@yahoo.de wrote:
Cordial thanks, Liam - I was not aware of that!
@Joe: Rule of life: when one is especially sure to be right, one is surely wrong, and so was I, and right were you(r first two characters).
Liam R. E. Quin liam@w3.org schrieb am 5:54 Montag, 12.September 2016:
Hans-Jürgen, wrote:
! Already the first
two characters (?render the expression invalid:(1) An unescaped ? is an occurrence indicator, making the preceding entity optional(2) An unescaped ( is used for grouping, it does not repesent anything => there is no entity preceding the ? which the ? could make optional => error
Actually (?: .... ) is a non-capturing group, defined in XPath 3.0 and XQuery 3.0, based on the same syntax in other languages.
This extension, like a number of others, is useful because the expression syntax defined by XSD doesn't make use of capturing groups (there's no \1 or $1 or whatever), and so it doesn't need non-capturing groups, but in XPath and XQuery they are used.
See e.g. https://www.w3.org/TR/xpath-functions-30/#regex-syntax
Liam
-- Liam R. E. Quin liam@w3.org The World Wide Web Consortium (W3C)
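To make Liam's distinction concrete outside the XQuery world, here is a minimal sketch in Python's `re` module, whose group syntax matches the XPath 3.0 extension (the pattern strings are illustrative, not taken from the thread):

```python
import re

# With capturing groups, each (...) claims a group number you can
# refer to later (as \1 in the pattern or $1 in a replacement):
m1 = re.match(r'(foo|bar)(\d+)', 'bar42')
print(m1.group(1), m1.group(2))  # bar 42

# A non-capturing group (?:...) bundles alternatives without claiming
# a number, so here the digits become group 1 instead:
m2 = re.match(r'(?:foo|bar)(\d+)', 'bar42')
print(m2.group(1))  # 42
```

This is exactly why XSD, which has no back-references, never needed `(?:...)`, while XPath/XQuery, which do expose groups via fn:analyze-string and fn:replace, benefit from it.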
Christian: I tried removing the quote escaping but still get an error. Here's a small test to reproduce:
fn:analyze-string($row, '(?:\s*(?:"([^"]*)"|([^,]+))\s*,?|(?<=,)(),?)+?')
I assume it’s the lookbehind assertion that is not allowed in XQuery (but I should definitely spend more time on it to give you a better answer..).
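For anyone unfamiliar with the construct: `(?<=,)` is a zero-width lookbehind assertion. A minimal sketch in Python's `re` (whose engine, unlike the XSD-based XPath regex dialect, does support it) shows the effect; the sample string is illustrative:

```python
import re

# (?<=,) matches at any position immediately preceded by a comma,
# without consuming the comma itself; this is how the original
# pattern tried to emit an empty cell after each separator.
print(re.findall(r'(?<=,)[^,]*', 'a,,b'))  # ['', 'b']
```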
Hi Christian,
Yes, that sounds like the culprit. Searching back through my files, Adam Retter responded on exist-open (at http://markmail.org/message/3bxz55du3hl6arpr) to a call for help with the lack of lookahead support in XPath, by pointing to an XSLT he adapted for CSV parsing, https://github.com/digital-preservation/csv-tools/blob/master/csv-to-xml_v3..... I adapted this technique to XQuery, and it works on the sample case in my earlier email.
Joe
```xquery
xquery version "3.1";

declare function local:get-cells($row as xs:string) as xs:string* {
  (: workaround lack of lookahead support in XPath: end row with comma :)
  let $string-to-analyze := $row || ","
  let $analyze := fn:analyze-string($row, '(("[^"]*")+|[^,]*),')
  for $group in $analyze//fn:group[@nr="1"]
  return
    if (matches($group, '^".+"$'))
    then replace($group, '^"([^"]+)"$', '$1')
    else $group/string()
};

let $csv := 'Author,Title,ISBN,Binding,Year Published
Jeannette Walls,The Glass Castle,074324754X,Paperback,2006
James Surowiecki,The Wisdom of Crowds,9780385503860,Paperback,2005
Lawrence Lessig,The Future of Ideas,9780375505782,Paperback,2002
"Larry Bossidy, Ram Charan, Charles Burck",Execution,9780609610572,Hardcover,2002
Kurt Vonnegut,Slaughterhouse-Five,9780791059258,Paperback,1999'
let $lines := tokenize($csv, '\n')
let $header-row := fn:head($lines)
let $body-rows := fn:tail($lines)
let $headers := local:get-cells($header-row)
for $row in $body-rows
let $cells := local:get-cells($row)
return
  element row {
    for $cell at $count in $cells
    return element { $headers[$count] } { $cell }
  }
```
On Mon, Sep 12, 2016 at 10:11 AM, Christian Grün christian.gruen@gmail.com wrote:
I assume it’s the lookbehind assertion that is not allowed in XQuery (but I should definitely spend more time on it to give you a better answer..).
Sorry, a typo crept in. Here's the corrected function:
```xquery
declare function local:get-cells($row as xs:string) as xs:string* {
  (: workaround lack of lookahead support in XPath: end row with comma :)
  let $string-to-analyze := $row || ","
  let $analyze := fn:analyze-string($string-to-analyze, '(("[^"]*")+|[^,]*),')
  for $group in $analyze//fn:group[@nr="1"]
  return
    if (matches($group, '^".+"$'))
    then replace($group, '^"([^"]+)"$', '$1')
    else $group/string()
};
```
And corrected query body:
```xquery
let $csv := 'Author,Title,ISBN,Binding,Year Published
Jeannette Walls,The Glass Castle,074324754X,Paperback,2006
James Surowiecki,The Wisdom of Crowds,9780385503860,Paperback,2005
Lawrence Lessig,The Future of Ideas,9780375505782,Paperback,2002
"Larry Bossidy, Ram Charan, Charles Burck",Execution,9780609610572,Hardcover,2002
Kurt Vonnegut,Slaughterhouse-Five,9780791059258,Paperback,1999'
let $lines := tokenize($csv, '\n')
let $header-row := fn:head($lines)
let $body-rows := fn:tail($lines)
let $headers := local:get-cells($header-row) ! replace(., '\s+', '_')
for $row in $body-rows
let $cells := local:get-cells($row)
return
  element row {
    for $cell at $count in $cells
    return element { $headers[$count] } { $cell }
  }
```
Hi all,
Forgive me. Rather than post more code in this thread, I've created a gist with revised code that resolves some inconsistencies in what I posted here earlier.
https://gist.github.com/joewiz/7581205ab5be46eaa25fe223acda42c3
Again, this isn't a full-featured CSV parser by any means; it assumes fairly uniform CSV. Its contribution is a fairly concise XQuery implementation that works around the absence of lookahead/lookbehind regex support in XPath.
Joe
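For comparison across languages, the same trailing-comma trick can be sketched in Python's `re`; the `get_cells` helper below is a hypothetical mirror of the XQuery function, not code from the gist:

```python
import re

def get_cells(row):
    # Append a trailing comma so every cell, including the last, is
    # terminated by one; each match is then either a quoted run or a
    # comma-free run, followed by that terminating comma.
    cells = [m.group(1) for m in re.finditer(r'(("[^"]*")+|[^,]*),', row + ',')]
    # Strip the surrounding quotes from quoted cells, as the XQuery does.
    return [re.sub(r'^"(.*)"$', r'\1', c) for c in cells]

row = '"Larry Bossidy, Ram Charan, Charles Burck",Execution,9780609610572,Hardcover,2002'
print(get_cells(row))
```

On the sample row this yields the five cells, with the commas inside the quoted author cell preserved rather than treated as separators.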
basex-talk@mailman.uni-konstanz.de