Joe, just to back you up: I believe that an EXPath spec for CSV processing would be *extremely* useful! (There is hardly a format as ubiquitous as CSV.)

And I have had a similar experience concerning performance: concretely, a 22 MB file proved simply unprocessable, which means that, in effect, BaseX's CSV support is only partial!

So I ardently hope the BaseX team will enable the parsing of large CSV files, and I hope for an initiative pulling CSV into EXPath!

Kind regards,
Hans-Jürgen


Joe Wicentowski <joewiz@gmail.com> wrote on Thursday, 8 September 2016 at 6:14:


Dear BaseX developers,

I noticed in example 3 under
http://docs.basex.org/wiki/CSV_Module#Examples that csv:parse() with
the option { 'format': 'map' } returns a map of maps, keyed by
hardcoded row numbers.
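
For instance, a call along these lines (a sketch from memory; I'm
assuming header parsing is switched on via the 'header' option):

csv:parse("Name,City
John,Newton
Jack,Oldtown", map { 'format': 'map', 'header': true() })

yields: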

map {
    1: map {
        "City": "Newton",
        "Name": "John"
    },
    2: map {
        "City": "Oldtown",
        "Name": "Jack"
    }
}

Because maps are unordered, using them to represent something
inherently ordered like the rows of a CSV means that hardcoded row
numbers are necessary for reassembling the rows in document order.  I
assume this was a necessary approach when the module was developed in
the map-only world of XQuery 3.0.  Now that 3.1 supports arrays, might
an array of maps be a closer fit for CSV parsing?

array {
    map {
        "City": "Newton",
        "Name": "John"
    },
    map {
        "City": "Oldtown",
        "Name": "Jack"
    }
}
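
To illustrate the difference, suppose the two results above were
bound to $by-row-number (the map of maps) and $rows (the array of
maps); the variable names are mine. Ordered processing then looks
quite different:

(: map of maps: row order must be restored by sorting the keys :)
for $i in sort(map:keys($by-row-number))
return $by-row-number($i)?Name

(: array of maps: order is inherent, so ?* suffices :)
for $row in $rows?*
return $row?Name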

I'm also curious: do you know of any efforts to create an EXPath spec
for CSV?  Putting "spec" and "CSV" in the same sentence is dangerous,
since CSV is a notoriously under-specified format: "The CSV file
format is not standardized" (see
https://en.wikipedia.org/wiki/Comma-separated_values).  But perhaps
the need for CSV parsing is common enough that such a spec would
benefit the community?  I thought I'd start by asking here, since
BaseX's appears to be the most fully developed (or only?) CSV module
in XQuery.
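
As a small, contrived illustration of that under-specification,
consider a record like:

"Smith, John","He said ""hi""",Newton

RFC 4180 doubles embedded quotes, but other dialects expect backslash
escapes, use semicolons as separators, or disagree about embedded
line breaks, so a spec would at least have to pin such options down.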

Then there's the question of how to approach implementations of such a
spec.  While XQuery is probably capable of parsing and serializing
small enough CSV files, CSVs do get large, and naive processing in
XQuery tends to run into memory issues (as I found with xqjson; see
the sketch below for the kind of approach I mean).  This suggests
implementations would tend to be written in a lower-level language.
eXist, for example, uses Jackson for fn:parse-json(), and I see
Jackson has a CSV extension too:
https://github.com/FasterXML/jackson-dataformat-csv.  Any thoughts on
the suitability of XQuery for the task?
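
To make the concern concrete, here is a minimal sketch of the naive
pure-XQuery approach I have in mind (local:parse-csv is just an
illustrative name; it ignores quoted fields entirely and materializes
the whole input in memory):

declare function local:parse-csv($text as xs:string) as map(*)* {
  (: split into non-empty lines; the first line supplies the headers :)
  let $lines   := tokenize($text, '\r?\n')[. ne '']
  let $headers := tokenize(head($lines), ',')
  for $line in tail($lines)
  let $fields := tokenize($line, ',')
  return map:merge(
    (: pair each header with the field in the same position :)
    for $i in 1 to count($headers)
    return map:entry($headers[$i], $fields[$i])
  )
};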

Joe