That mangled string is the result of reading UTF-8 byte sequences as if each byte were one character in a single-byte encoding, e.g. ISO-8859-1 or some Windows code page.
How are you loading it into BaseX? It seems unlikely that BaseX-provided code would make this kind of basic mistake in reading text, but it’s possible that it applied the wrong encoding for some reason.
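To illustrate the kind of misreading described above, here is a minimal Python sketch. It assumes cp1252 as the single-byte encoding, which is only a guess (the encoding actually involved in this thread is not known), and the exact garbled characters you see may differ from what this produces.

```python
# Sketch only: cp1252 is an assumed stand-in for "some Windows code page".
original = "Cañelas"

# "ñ" (U+00F1) is stored as two bytes in UTF-8: 0xC3 0xB1.
utf8_bytes = original.encode("utf-8")

# Decoding those bytes with a single-byte encoding turns each byte
# into its own character, which is where the garbling comes from.
mangled = utf8_bytes.decode("cp1252")

# As long as no bytes were lost, the mistake is reversible:
# re-encode with the wrong encoding, then decode as UTF-8.
repaired = mangled.encode("cp1252").decode("utf-8")

print(mangled)
print(repaired)
```

If the round trip restores the name, the data itself is intact and only the decoding step picked the wrong encoding.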
Cheers,
Eliot
--
Eliot Kimber
From: basex-talk-bounces@mailman.uni-konstanz.de on behalf of BitRider001 bit.rider.001@pm.me
Reply-To: BitRider001 bit.rider.001@pm.me
Date: Thursday, May 17, 2018 at 8:34 PM
To: Bridger Dyson-Smith bdysonsmith@gmail.com
Cc: "basex-talk@mailman.uni-konstanz.de" basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] about special characters
Bridger,
Indeed, the file was exported from Excel in UTF-8 encoding. I've tried opening the CSV file in Notepad/WordPad on Windows and in vi in a Linux terminal, and in both cases it displays the correct special character.
It's only when I load it into a BaseX db and query it that it shows up, as you said, as "mangled". Saving the results into a text file also produces the "mangled" string.
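Since editors often guess the encoding when displaying a file, a byte-level check can be more reliable. A rough Python sketch of one follows. The function name `describe_encoding` and the file name in the usage comment are made up for illustration, and the sketch only tells you whether the bytes are valid UTF-8, not which other encoding they might be in.

```python
import codecs

def describe_encoding(raw: bytes) -> str:
    """Report whether raw bytes decode cleanly as UTF-8 and whether they start with a UTF-8 BOM."""
    has_bom = raw.startswith(codecs.BOM_UTF8)
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return "not valid UTF-8"
    if has_bom:
        return "valid UTF-8 with BOM"
    return "valid UTF-8 without BOM"

# Hypothetical usage, assuming the export is saved as names.csv:
#   with open("names.csv", "rb") as f:
#       print(describe_encoding(f.read()))
print(describe_encoding("Cañelas".encode("utf-8")))
print(describe_encoding("Cañelas".encode("cp1252")))
```

If this reports valid UTF-8, the file itself is fine and the garbling most likely happens at import or export time.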
Strange.
Bit
‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On May 18, 2018 9:21 AM, Bridger Dyson-Smith bdysonsmith@gmail.com wrote:
Bit -
that's odd; it looks like the characters are being decomposed (or whatever the term is) and mangled but I'm not sure, unfortunately. Was the CSV an export from Excel? If so, I suppose this could be a Windows character set problem (cp-1252 or iso-8859-1 or something?).
Bridger
On Thu, May 17, 2018 at 9:11 PM BitRider001 bit.rider.001@pm.me wrote:
Hi Bridger,
Yes, that is right. I'm on the latest (9.0.1). Attaching a screenshot here for anyone to take a look.
Bit
‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On May 18, 2018 8:41 AM, Bridger Dyson-Smith bdysonsmith@gmail.com wrote:
Hi Bit - are you using the latest version? There was a problem with 9.0 and some Unicode characters. Christian and co. have a fix in v9.0.1.
HTH,
Bridger
On Thu, May 17, 2018, 7:54 PM BitRider001 bit.rider.001@pm.me wrote:
Hi,
I just joined the mailing list due to a problem I'm having displaying and storing special characters.
I started with a CSV in UTF-8 and created a database from it. However, when I query it, the special characters become garbled. I'm using the GUI on Windows 10.
It starts with this in the CSV:
<name>Cañelas</name>
Then ends up with this when I export the query result into a text file:
<name>Ca�las</name>
Help please.
Bit