Hi all,
I'm investigating a way of analysing a massive set of more than 900,000 CSV files, for which the CSV parsing in BaseX seems very useful, producing a db nicely filled with documents such as:
<csv>
<record>
<ResourceID>00003a92-d10e-585e-84a7-29ad17c5799f</ResourceID>
<source.id>bbcy:vev:6860</source.id>
<card>AA</card>
<order>0</order>
<source_field/>
<source_code/>
<Annotation>some remarks</Annotation>
<Annotation_Language>en</Annotation_Language>
<Annotation_Type/>
<resource_model/>
<!-- ... -->
</record>
<record>
<ResourceID>00003a92-d10e-585e-84a7-29ad17c5799f</ResourceID>
<source.id>bbcy:vev:6860</source.id>
<card>BE</card>
<order>0</order>
<source_field/>
<source_code>concept</source_code>
<Annotation/>
<Annotation_Language/>
<Annotation_Type/>
<resource_model/>
<!-- ... -->
</record>
<!-- ... -->
</csv>
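For context, the databases are created roughly like this (the
database name and input path are placeholders for my actual
setup):

  SET PARSER csv
  SET CSVPARSER header=true
  CREATE DB mydb /data/csv/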
Yet, when querying those documents, I'm noticing that merely selecting non-empty elements is very slow. For example:
//source_code[normalize-space()]
...can take over 40 seconds.
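For reference, I'm measuring that with something along these
lines (prof:time is just one way to time it; 'mydb' is a
placeholder):

  (: count() forces full evaluation of the path expression :)
  prof:time(count(db:open('mydb')//source_code[normalize-space()]))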
Since I don't have control over the source data, it would be
really great if empty cells could be skipped when parsing CSV
files. Of course this could be a trivial post-processing step
via XSLT or XQuery Update, but that's infeasible for this
volume of data.
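To illustrate, the post-processing I have in mind would be
something like this ('mydb' again stands in for the actual
database name):

  (: drops every empty cell element from every record :)
  delete nodes db:open('mydb')//record/*[not(normalize-space())]

That is easy enough to write, but running such an update across
more than 900,000 documents is exactly the kind of mass
operation I'd like to avoid.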
Does BaseX provide a way of telling the CSV parser to skip empty
cells?
Best,
Ron