csv:parse in the age of XQuery 3.1 - BaseX-Talk - mailman.uni-konstanz.de

8 Sep 2016

xquery version "3.1";

let $csv := 'Author,Title,ISBN,Binding,Year Published
Jeannette Walls,The Glass Castle,074324754X,Paperback,2006
James Surowiecki,The Wisdom of Crowds,9780385503860,Paperback,2005
Lawrence Lessig,The Future of Ideas,9780375505782,Paperback,2002
"Larry Bossidy, Ram Charan, Charles
Burck",Execution,9780609610572,Hardcover,2002
Kurt Vonnegut,Slaughterhouse-Five,9780791059258,Paperback,1999'
let $lines := tokenize($csv, '\n')
let $header-row := fn:head($lines)
let $body-rows := fn:tail($lines)
let $headers := fn:tokenize($header-row, ",") ! fn:replace(., " ", "")
for $row in $body-rows
let $cells := fn:analyze-string($row,
'(?:\s*(?:\"([^\"]*)\"|([^,]+))\s*,?|(?<=,)(),?)+?')//fn:group
return
    element Book {
      for $cell at $count in $cells
      return element {$headers[$count]} {$cell/string()}
    }
It produces the desired results:

<Book>
    <Author>Jeannette Walls</Author>
    <Title>The Glass Castle</Title>
    <ISBN>074324754X</ISBN>
    <Binding>Paperback</Binding>
    <YearPublished>2006</YearPublished>
</Book>
<Book>
    <Author>James Surowiecki</Author>
    <Title>The Wisdom of Crowds</Title>
    <ISBN>9780385503860</ISBN>
    <Binding>Paperback</Binding>
    <YearPublished>2005</YearPublished>
</Book>
<Book>
    <Author>Lawrence Lessig</Author>
    <Title>The Future of Ideas</Title>
    <ISBN>9780375505782</ISBN>
    <Binding>Paperback</Binding>
    <YearPublished>2002</YearPublished>
</Book>
<Book>
    <Author>Larry Bossidy, Ram Charan, Charles Burck</Author>
    <Title>Execution</Title>
    <ISBN>9780609610572</ISBN>
    <Binding>Hardcover</Binding>
    <YearPublished>2002</YearPublished>
</Book>
<Book>
    <Author>Kurt Vonnegut</Author>
    <Title>Slaughterhouse-Five</Title>
    <ISBN>9780791059258</ISBN>
    <Binding>Paperback</Binding>
    <YearPublished>1999</YearPublished>
</Book>

Unfortunately BaseX complains about the regex, with the following error:

Stopped at /Users/joe/file, 9/32: [FORX0002] Invalid regular
expression: (?:\s(?:\"([^\"])\"|([^,]+))\s*,?|(?<=,)(),?)+?.

Without a column location, I'm unable to tell where the problem is.
Is there something used in this expression that BaseX doesn't support?

On the topic of the potential memory pitfalls of a pure XQuery
solution for our hypothetical EXPath library, I think the primary
problem is that the entire CSV has to be loaded into memory.  I wonder
if implementations could use the new `fn:unparsed-text-lines()`
function from XQuery 3.0 to stream the CSV through XQuery without
requiring the entire thing to be in memory?  Or are we basically
setting ourselves up for the EXPath solution being a wrapper around an
external library written in a lower level language?

Joe

On Sun, Sep 11, 2016 at 4:53 AM, Christian Grün
<christian.gruen@gmail.com> wrote:
> Hi Joe,
>
> Thanks for your mail. You are completely right, using an array would
> be the natural choice with csv:parse. It’s mostly due to backward
> compatibility that we didn’t update the function.
>
> @All: I’m pretty sure that all of us would like having an EXPath spec
> for parsing CSV data. We still need one volunteer to make it happen ;)
> Anyone out there?
>
> Cheers
> Christian
>
>
> On Thu, Sep 8, 2016 at 6:13 AM, Joe Wicentowski <joewiz@gmail.com> wrote:
>> Dear BaseX developers,
>>
>> I noticed in example 3 under
>> http://docs.basex.org/wiki/CSV_Module#Examples that csv:parse() with
>> option { 'format': 'map' } returns a map of maps, with hardcoded row
>> numbers:
>>
>> map {
>>    1: map {
>>        "City": "Newton",
>>        "Name": "John"
>>    },
>>    2: map {
>>        "City": "Oldtown",
>>        "Name": "Jack"
>>    }
>> }
>>
>> Using maps, which are unordered, to represent something ordered like
>> rows in a CSV, hardcoded row numbers are necessary for reassembling
>> the map in document order.  I assume this was a necessary approach
>> when the module was developed in the map-only world of XQuery 3.0.
>> Now that 3.1 supports arrays, might an array of maps be a closer fit
>> for CSV parsing?
>>
>> array {
>>    map {
>>        "City": "Newton",
>>        "Name": "John"
>>    },
>>    map {
>>        "City": "Oldtown",
>>        "Name": "Jack"
>>    }
>> }
>>
>> I'm also curious, do you know of any efforts to create an EXPath spec
>> for CSV?  Putting spec and CSV in the same sentence is dangerous,
>> since CSV is a notoriously under-specified format: "The CSV file
>> format is not standardized" (see
>> https://en.wikipedia.org/wiki/Comma-separated_values).  But perhaps
>> there is a common enough need for CSV parsing that such a spec would
>> benefit the community?  I thought I'd start by asking here, since
>> BaseX's seems to be the most developed (or only?) CSV module in
>> XQuery.
>>
>> Then there's the question of how to approach implementations of such a
>> spec.  While XQuery is probably capable of parsing and serializing
>> small enough CSV, CSVs do get large and naive processing with XQuery
>> would tend to run into memory issues (as I found with xqjson).  This
>> means implementations would tend to write in a lower-level language.
>> eXist, for example, uses Jackson for fn:parse-json().  I see Jackson
>> has a CSV extension too:
>> https://github.com/FasterXML/jackson-dataformat-csv. Any thoughts on
>> the suitability of XQuery for the task?
>>
>> Joe