Hello,
I constructed a small XQuery script which contains the following specifications.
declare option output:method "csv"; declare option output:csv "header=yes, separator=|, quotes=yes"; …
The query execution succeeded (in principle) with the software “BaseX 9.7”.
I observed data processing results where displayed columns contained space characters. I would expect that such data should be handled as text strings then. Thus I would expect also that these data should be enclosed by double quotes (according to the safe handling of CSV files). But this did not happen for my software test.
Will any additional settings become relevant for this use case?
Regards, Markus
Hi Markus,
I observed data processing results where displayed columns contained space characters. I would expect that such data should be handled as text strings then.
Double quotes in CSV files do not give any information on the data type of a value. Instead, they merely ensure that the data can be correctly parsed. If the pipe symbol is used as separator, there is no need to enclosed whitespaced string by quotes.
Do you possibly have a processor that interprets the syntax of the incoming CSV data to detect and assign data types?
Best, Christian
Double quotes in CSV files do not give any information on the data type of a value.
I came along software applications which would need the usage of double quotes for the distinction that provided data should be handled as text strings.
Instead, they merely ensure that the data can be correctly parsed.
I suggest to reconsider this view a bit more for some data imports.
If the pipe symbol is used as separator, there is no need to enclosed whitespaced string by quotes.
How do you think about to avoid the interpretation of a column delimiter character within strings as a CSV separator?
Do you possibly have a processor that interprets the syntax of the incoming CSV data to detect and assign data types?
Occasionally, yes.
Regards, Markus
I came along software applications which would need the usage of double quotes for the distinction that provided data should be handled as text strings.
So you would probably require all non-numeric values to be enclosed by quotes, no matter if it contains spaces? What about other data types (booleans, etc.)?
How do you think about to avoid the interpretation of a column delimiter character within strings as a CSV separator?
Wrapping will take place if the column delimiter occurs in the value.
I came along software applications which would need the usage of double quotes for the distinction that provided data should be handled as text strings.
So you would probably require all non-numeric values to be enclosed by quotes, no matter if it contains spaces?
The answer depends on the configuration for data import tools.
Example: spreadsheet applications
What about other data types (booleans, etc.)?
Further data type indications can eventually be omitted, can't they?
Regards, Markus
What about other data types (booleans, etc.)?
Further data type indications can eventually be omitted, can't they?
A custom XQuery function may be the best option if you require a data-type specific output of CSV data.
For further general information on CSV serialization and the handling of double quotes, feel free to check out the RFC:
Further data type indications can eventually be omitted, can't they?
A custom XQuery function may be the best option if you require a data-type specific output of CSV data.
I hope that such code can be avoided if the BaseX CSV module could be adjusted accordingly. https://docs.basex.org/wiki/CSV_Module#Options
Regards, Markus
I hope that such code can be avoided if the BaseX CSV module could be adjusted accordingly. https://docs.basex.org/wiki/CSV_Module#Options
I wouldn’t completely factor out this option, but we definitely need to define more rules to nail this down. Just some examples:
1. Is there any specification (RFC, ISO, something else) that gives us more details on your exact requirements? 2. What’s the name of the software applications that you use that requires double quotes? 3. What would be the determining factor for using double quotes? Is it the existence of whitespaces (spaces, tabs, nl, cr, …?), or would all non-numeric values need to be quoted?
…
- What’s the name of the software applications that you use that requires double quotes?
Text qualifiers can accordingly be configured for data exports and imports.
Examples: * LibreOffice Calc * Microsoft Excel
- What would be the determining factor for using double quotes?
A specific data type indication should occasionally be applied for texts.
Is it the existence of whitespaces (spaces, tabs, nl, cr, …?),
This can happen.
I observed that content from different XML elements can be joined by using space characters.
or would all non-numeric values need to be quoted?
Probably, not.
I am unsure under which circumstances additional delimiters would be requested for more fine-grained data type distinctions.
Regards, Markus
Examples:
- LibreOffice Calc
- Microsoft Excel
Double quotes are only required for parsing, they are not used to judge if the input is a string or number. A little example:
1;"1"
If you save this one-liner as CSV file and open it with Excel, but values will be interpreted and formatted as numbers.
- What would be the determining factor for using double quotes?
A specific data type indication should occasionally be applied for texts.
I can’t follow, sorry. Examples might help.
Maybe the issue you want to solve is a non-issue? How can we reproduce the problem you encountered with Excel?
Double quotes are only required for parsing,
I would like to know a bit more about corresponding requirements.
they are not used to judge if the input is a string or number.
I got other impressions (or expectations).
A little example: 1;"1"
If you save this one-liner as CSV file and open it with Excel, but values will be interpreted and formatted as numbers.
I suggest to take another look at available information sources.
* Text Import Wizard by Microsoft Excel https://support.microsoft.com/en-gb/office/text-import-wizard-c5b02af6-fda1-... Step 2 of 3 (Delimited data) … Text qualifier …
* Text Import by LibreOffice https://help.libreoffice.org/7.3/en-GB/text/shared/00/00000208.html#hd_id314...
Regards, Markus
I would like to know a bit more about corresponding requirements.
Sorry again, maybe someone else can you help here. I simply don’t understand what you want to express.
- Text Import Wizard by Microsoft Excel https://support.microsoft.com/en-gb/office/text-import-wizard-c5b02af6-fda1-... Step 2 of 3 (Delimited data) … Text qualifier …
I’ll quote the text of your quoted reference:
Text qualifier: Select the character that encloses values in your text file. When Excel encounters the text qualifier character, all of the text that follows that character and precedes the next occurrence of that character is imported as one value, even if the text contains a delimiter character. For example, if the delimiter is a comma (,) and the text qualifier is a quotation mark ("), "Dallas, Texas" is imported into one cell as Dallas, Texas. If no character or the apostrophe (') is specified as the text qualifier, "Dallas, Texas" is imported into two adjacent cells as "Dallas and Texas".
That’s what already happens. The double quote is the text qualifier that’s used for serializing fields with spaces. There’s currently no feature to choose an alternative character (such as the apostrophe), but I think that’s not what you are looking for anyway.
Please help us if you want us to help you. Be more specific and spend more time in explaining your specific challenge. Give us a step-by-step description (e.g., a simple reproducible example) on
a) what you do b) what you get and c) what you expect to get.
That’s what already happens. The double quote is the text qualifier that’s used for serializing fields with spaces.
I did not observe text quotation during my software tests.
There’s currently no feature to choose an alternative character
This can be fine.
(such as the apostrophe),
Applications can occasionally need additional data type indicators, can't they?
but I think that’s not what you are looking for anyway.
I would like to enclose field data which correspond to text string data types by double quotes.
Regards, Markus
I did not observe text quotation during my software tests.
See [1] for some instruction on how to create a reproducible example. Thanks in advance.
I did not observe text quotation during my software tests.
See [1] for some instruction on how to create a reproducible example.
I created a BaseX database from a single XML file which contains the following data for another test approach.
<test_data> <row> <ID>01</ID> <T1>X Y</T1> <T2>Y, X</T2> </row> </test_data>
I tried the following XQuery script out.
declare option output:method "csv"; declare option output:csv "header=yes, quotes=yes, separator=|"; for $x in //test_data/row return <csv> <record> <ID>{$x/ID/data()}</ID> <T1>{$x/T1/data()}</T1> <T2>{$x/T2/data()}</T2> </record> </csv>
Will the corresponding test result need any further clarification?
ID|T1|T2 01|X Y|Y, X
How good does such a data display fit to expectations on desirable software behaviour? https://docs.basex.org/wiki/CSV_Module#Options
Regards, Markus
Thanks a lot for the example, Markus.
No quotes will be added, as your delimiter does not occur in the text value. The result can successfully be imported in spreadsheet applications without quotes.
The quotes will be added if the delimiter occurs in the texts:
declare option output:method 'csv'; declare option output:csv 'header=yes, quotes=yes, separator=|'; <csv> <record> <one>a b</one> <two>c|d</two> </record> </csv>
… yields …
one|two a b|"c|d"
Hope this helps, Christian
No quotes will be added, as your delimiter does not occur in the text value.
How does this feedback fit to previously provided information?
A) https://docs.basex.org/wiki/CSV_Module#Options “… Serialization: If the option is enabled, the value will be wrapped with quotes if it contains characters that might be treated as control characters. …”
B) https://mailman.uni-konstanz.de/pipermail/basex-talk/2022-April/017038.html “… The double quote is the text qualifier that’s used for serializing fields with spaces. …”
Another XQuery script example: declare option output:method "csv"; declare option output:csv "header=yes, quotes=yes, separator=|"; <csv> <record> <T3>line 1 line 2?</T3> <T4>line 3\nline 4?</T4> </record> </csv>
Test result: T3|T4 "line 1 line 2?"|line 3\nline 4?
Will any more clarification help here?
Regards, Markus
Hi Markus -
On Wed, Apr 6, 2022 at 9:25 AM Markus Elfring Markus.Elfring@web.de wrote:
No quotes will be added, as your delimiter does not occur in the text
value.
How does this feedback fit to previously provided information?
A) https://docs.basex.org/wiki/CSV_Module#Options “… Serialization: If the option is enabled, the value will be wrapped with quotes if it contains characters that might be treated as control characters. …”
B) https://mailman.uni-konstanz.de/pipermail/basex-talk/2022-April/017038.html “… The double quote is the text qualifier that’s used for serializing fields with spaces. …”
Another XQuery script example: declare option output:method "csv"; declare option output:csv "header=yes, quotes=yes, separator=|";
<csv> <record> <T3>line 1 line 2?</T3> <T4>line 3\nline 4?</T4> </record> </csv>
Test result: T3|T4 "line 1 line 2?"|line 3\nline 4?
Will any more clarification help here?
I confess that I'm not sure, but if you're asking "Why are there double
quotes around my `T3` field?", then the answer appears to be in the CSV specification[1]. So, the behavior you're seeing in your sample script matches the CSV RFC.
If that wasn't your question, it might help to restate, or reframe, your questions and concerns.
Regards, Markus
Best wishes, Bridger
[1] https://datatracker.ietf.org/doc/html/rfc4180#section-2, specifically part 6:
Fields containing line breaks (CRLF), double quotes, and commas
should be enclosed in double-quotes. For example: "aaa","b CRLF bb","ccc" CRLF zzz,yyy,xxx
Test result: T3|T4 "line 1 line 2?"|line 3\nline 4?
Will any more clarification help here?
…, but if you're asking "Why are there double quotes around my `T3` field?",
Partly, yes. (But not really.)
then the answer appears to be in the CSV specification[1].
Which impressions did you get from the other BaseX information sources which I pointed out?
So, the behavior you're seeing in your sample script matches the CSV RFC.
What is your opinion for the data processing of the fields “T1”, “T2” and “T4”?
Regards, Markus
Markus -
Another confession: I've never been very good at puzzles and riddles :( so...
On Wed, Apr 6, 2022 at 10:34 AM Markus Elfring Markus.Elfring@web.de wrote:
Test result: T3|T4 "line 1 line 2?"|line 3\nline 4?
Will any more clarification help here?
…, but if you're asking "Why are there double quotes around my `T3`
field?",
Partly, yes. (But not really.)
That's certainly a cryptic response! :)
then the answer appears to be in the CSV specification[1].
Which impressions did you get from the other BaseX information sources which I pointed out?
My impressions with the BaseX documentation, generally speaking, are very
positive, and when I have questions (or need clarification, or help, etc), my impressions of the community here are even more positive: what a welcoming and generous group of people. I'm lucky to participate.
So, the behavior you're seeing in your sample script matches the CSV RFC.
What is your opinion for the data processing of the fields “T1”, “T2” and “T4”?
Wrapping up the other examples you've shared in this mail thread, along with some changes: e.g. ``` xquery version "3.1"; declare option output:method "csv"; declare option output:csv "header=yes,quotes=yes,separator=|"; declare variable $input := <test_data> <row> <ID>01</ID> <T1>X Y</T1> <T2>Y, X</T2> <T3>line 4 or a line break in the T3 field.</T3> <T4>line 6\nor something - maybe this is line 7? It might not be: what interprets the '\n' character? BaseX? Something else? Maybe we want the { string-join(('&','#','10',';','')) } character (linefeed literal)?</T4> </row> <row> <ID>02</ID> <T1>A 2nd row</T1> </row> </test_data>;
<csv> { for $row in $input//row return ( <record> { for $field in $row/child::* return ( $field ) } </record> ) } </csv> ```
Provides the results that I would expect: ``` ID|T1|T2|T3|T4 01|X Y|Y, X|"line 4 or a line break in the T3 field."|line 6\nor something - maybe this is line 7? It might not be: what interprets the '\n' character? BaseX? Something else? Maybe we want the character (linefeed literal)? 02|A 2nd row||| ``` My opinion of these results is that they are correct - they meet my expectations. Do you feel like these results are missing something? If so, what specifically? Is there something in the documentation that could change to make things more explicit or provide better clarity?
Regards, Markus
Best, Bridger
My opinion of these results is that they are correct - they meet my expectations.
I find this view interesting.
Do you feel like these results are missing something?
Yes.
If so, what specifically?
I am missing text quoting for the fields “ID”, “T1”, “T2” and “T4”.
Is there something in the documentation that could change to make things more explicit or provide better clarity?
I got such an impression.
Regards, Markus
If any CSV parser complains about the perfectly valid data in this example, I would definitely say it is poorly implemented, because it does not comply to the RFC. I took a glance at the mail thread and I could not find a hint where exactly you are having trouble with the quotes missing apart from saying that they are missing - is this a piece of software that cannot handle CSV properly? If so, I would definitely switch my parser.
Daniel
-----Ursprüngliche Nachricht----- Von: BaseX-Talk basex-talk-bounces@mailman.uni-konstanz.de Im Auftrag von Markus Elfring Gesendet: Mittwoch, 6. April 2022 17:17 An: Bridger Dyson-Smith bdysonsmith@gmail.com Cc: BaseX basex-talk@mailman.uni-konstanz.de Betreff: Re: [basex-talk] Enclosing strings by double quotes for CSV data output
My opinion of these results is that they are correct - they meet my expectations.
I find this view interesting.
Do you feel like these results are missing something?
Yes.
If so, what specifically?
I am missing text quoting for the fields “ID”, “T1”, “T2” and “T4”.
Is there something in the documentation that could change to make things more explicit or provide better clarity?
I got such an impression.
Regards, Markus
Hello Markus,
In the Microsoft documentation you linked, the name "text qualifier" is misleading. It doesn't mean the value will always be formatted as text by the program. For example, I tried making this CSV:
ID|T1|T2 "01"|"April 6, 2022"|"01234567890123456789"
In a new spreadsheet in Excel, I imported the CSV with the custom delimiter pipe (|), and all three of the values were interpreted as formats other than text. "01" became the integer 1, "April 6, 2022" the date 4/6/2022, and the long number became 1.234567E+18. So adding double quotes around all values to designate a "text" datatype will not work for Excel; users still need to manually change the formats of the cells.
The "Format quoted field as text" is a LibreOffice-specific option that does format all of the values as text, not numbers or dates. To accomplish the same in Excel requires several complicated steps. Examples: https://www.winhelponline.com/blog/stop-excel-convert-text-to-number-date-fo...
Also, a user might want "April 6, 2022" to be interpreted as a date for calculations, so whitespace in a value doesn't necessarily mean it's supposed to be treated as text.
The current behavior of BaseX will output a CSV that can be correctly interpreted by both Calc and Excel with the correct separators/delimiters set, so I don't believe it should be changed, and even if it were it wouldn't accomplish what you want.
-Tamara
On Wed, Apr 6, 2022 at 8:17 AM Markus Elfring Markus.Elfring@web.de wrote:
My opinion of these results is that they are correct - they meet my
expectations.
I find this view interesting.
Do you feel like these results are missing something?
Yes.
If so, what specifically?
I am missing text quoting for the fields “ID”, “T1”, “T2” and “T4”.
Is there something in the documentation that could change to make things
more explicit or provide better clarity?
I got such an impression.
Regards, Markus
Dear Tamara,
Thanks for your detailed feedback.
In the Microsoft documentation you linked, the name "text qualifier" is misleading.
Can this view trigger any further consequences?
It doesn't mean the value will always be formatted as text by the program.
This aspect is interesting, isn't it?
"01" became the integer 1, …
Which data types can handle “numbers” so that leading zeros would be preserved?
So adding double quotes around all values to designate a "text" datatype will not work for Excel;
This can be fine in some use cases.
users still need to manually change the formats of the cells.
This can happen also.
The "Format quoted field as text" is a LibreOffice-specific option
Would you like to take related program variants better into account?
that does format all of the values as text,
Do you like such a software functionality?
not numbers or dates.
I find such a setting occasionally useful.
To accomplish the same in Excel requires several complicated steps.
Do any advanced users require special CSV data import configurations?
Examples: https://www.winhelponline.com/blog/stop-excel-convert-text-to-number-date-fo...
Will the attention grow for further clarification also according to information from the article “Stop Excel from Converting Text to Number or Date format when Opening a CSV file”?
…, and even if it were it wouldn't accomplish what you want.
I suggest to reconsider this view a bit more.
Do I eventually need to reimplement CSV export functionality by XQuery means of customised string concatenations?
Regards, Markus
basex-talk@mailman.uni-konstanz.de