Hi folks,
I am aware that the HTML module can guess a file's encoding on its own if the input is provided in binary form:
If the input encoding is unknown, the data to be processed can be passed on in its binary representation. The HTML parser will automatically try to detect the correct encoding:
Query
html:parse(fetch:binary("https://en.wikipedia.org"))
But is there a way to guess the encoding of CSV files? So far I have tried the Fetch and CSV modules without success. I have a huge batch of CSV files, and they are all in different encodings. Maybe it is possible to pipe the content of fetch:binary to a system command that guesses the encoding, and then use that to read in the CSV?
Best regards, Kristian Kankainen
On 24.05.2021 at 09:22, Kristian Kankainen wrote:
I think both HTML parsers and XML parsers rely on the presence of some encoding declaration (e.g., a meta charset in HTML or the XML declaration in XML) to "detect" an encoding; I am not sure CSV has anything like that.
But that is just my understanding of the parser world in general; I don't know exactly how things work in BaseX.
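For illustration, these are the kinds of in-band declarations an XML or HTML parser can look for; CSV has no comparable in-band marker:

```xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- HTML equivalents:
     <meta charset="utf-8">
     <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> -->
```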
Hi Kristian,
With HTML, there are various ways to specify the document encoding (e.g., the byte order mark, the XML declaration, or the Content-Type meta element). With text files, if fetch:text or file:read-text is used, only the byte order mark (e.g., EF BB BF for UTF-8) will be considered, as it is the only indicator that allows for a unique identification of the file encoding.
As you may know, it’s often impossible to guess the exact encoding of a text file. But you can always use external tools for that, such as chardetect, which performs statistical analysis on the input (it’s based on Mozilla’s charset detector [1]). The guessed encoding can then be passed on to fetch:text:
(: sample code, needs to be revised :)
let $file := '/path/to/file.csv'
let $encoding := proc:system('chardetect', $file)
let $string := fetch:text($file, $encoding)
return csv:parse($string)
Hope this helps, Christian
[1] https://www-archive.mozilla.org/projects/intl/chardet.html
On Mon, May 24, 2021 at 9:23 AM Kristian Kankainen kristian@keeleleek.ee wrote:
Thank you all. This is what I came up with:
file:write-binary($temp-file, fetch:binary($csv-link)),
let $encoding := proc:system("file", ("-Ik", $temp-file))
  => substring-after("charset=")
  => normalize-space()
  => replace("unknown-8bit", "ISO-8859-15")
  => replace("binary", "ISO-8859-15")
return csv:parse(
  file:read-text($temp-file, $encoding),
  map {
    'header': true(),
    'lax': 'no',
    'separator': 'semicolon',
    'format': 'attributes'
  }
)
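For reference, the parsing of the `file` output boils down to a few string operations. Here is a small Python mirror of that arrow chain, run on hypothetical sample lines (the actual wording printed by `file -Ik` varies by platform and version):

```python
def charset_of(file_output: str) -> str:
    """Mirror of the XQuery arrow chain applied to one line of `file -Ik` output."""
    enc = file_output.split("charset=", 1)[1]         # substring-after(., 'charset=')
    enc = " ".join(enc.split())                       # normalize-space()
    enc = enc.replace("unknown-8bit", "ISO-8859-15")  # fall back when file(1)
    enc = enc.replace("binary", "ISO-8859-15")        # cannot name the charset
    return enc

# Hypothetical sample outputs:
print(charset_of("some.csv: text/csv; charset=iso-8859-1\n"))     # iso-8859-1
print(charset_of("other.csv: text/csv; charset=unknown-8bit\n"))  # ISO-8859-15
```

The two replace() calls are the pragmatic part of the solution: when file(1) cannot identify a charset, a Latin-alphabet single-byte encoding is assumed as a fallback.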
On 24. May 2021, at 13:41, Christian Grün christian.gruen@gmail.com wrote:
basex-talk@mailman.uni-konstanz.de