Our mission today is to use BaseX to remove tags injected right between the bytes of multibyte UTF-8 characters.
http://www.couchsurfing.org/group_read.html?gid=430&post=13986932
"CG" == Christian Grün christian.gruen@gmail.com writes:
CG> Have you tried method=raw, as mentioned in our documentation
CG> (http://docs.basex.org/wiki/Serialization)?
Sorry. Try it yourself:
echo '<A>你好</A>'|perl -pwle 's![^[:ascii:]]!$&<wbr/>!'|basex -q ' declare option db:parser "html"; declare option output:method "raw"; doc("/dev/stdin")//*:wbr/..'
There is no way to cleanly restore the shattered UTF-8.
I would also like to try
declare option output:encoding "RAW"; or "BYTES" or "NONE"
but http://docs.basex.org/wiki/Serialization just says "all encodings supported by Java". So one is supposed to look at http://www.google.com/search?q=all+encodings+supported+by+Java etc.
Why doesn't BaseX have a command that would output the current "all encodings supported by Java" that it is using?
Jidanni,
> echo '<A>你好</A>'|perl -pwle 's![^[:ascii:]]!$&<wbr/>!'|basex -q ' declare option db:parser "html"; declare option output:method "raw"; doc("/dev/stdin")//*:wbr/..'
If you want help, please try to help, too. Your example is not what I would call very helpful; give us at least:
a) a minimized example,
b) the returned output, and
c) the expected result
> declare option output:encoding "RAW"; or "BYTES" or "NONE"
I’m not sure if you will need any output declaration for your query at all; but we first need more details.
> http://docs.basex.org/wiki/Serialization it just says "all encodings supported by Java" So one is supposed to look at http://www.google.com/search?q=all+encodings+supported+by+Java
I've added a link. Note, however, that the list is also dependent on the Java VM you are using.
> Why doesn't basex have a command that would output the current "all encodings supported by Java" that it is using.
Try this:
basex "Q{java.nio.charset.Charset}availableCharsets()"
"CG" == Christian Grün christian.gruen@gmail.com writes:
CG> Jidanni,
> echo '<A>你好</A>'|perl -pwle 's![^[:ascii:]]!$&<wbr/>!'|basex -q ' declare option db:parser "html"; declare option output:method "raw"; doc("/dev/stdin")//*:wbr/..'
CG> If you want help, please try to help, too. Your example is not what I
CG> would call very helpful; give us at least:
CG> a) a minimized example,
That's what it is, totally contained. Just run it on your Linux etc. shell command line.
CG> b) the returned output, and
OK, here it is QP encoded: =EF=BF=BD=EF=BF=BD=EF=BF=BD=E5=A5=BD=
CG> c) the expected result
I'm just trying to find a way to remove the <wbr/> injected here,
$ echo '<A>你好</A>'|perl -pwle 's![^[:ascii:]]!$&<wbr/>!'|qprint -e
<A>=E4<wbr/>=BD=A0=E5=A5=BD</A>
So I can get
<A>=E4=BD=A0=E5=A5=BD</A>
I am guessing that is not possible with BaseX, and one needs byte-level tools like perl.
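The byte-level repair asked for here can be sketched in a few lines. This is only an illustration in Python (the thread itself uses Perl), assuming the damaged input is available as raw octets before anything has tried to decode it; the variable names are hypothetical. The standard `quopri` module stands in for `qprint -e`.

```python
import quopri

# The shattered input from the example above, as raw octets:
# E4 BD A0 is U+4F60 and E5 A5 BD is U+597D, with <wbr/> wedged
# in after the first octet of the first character.
shattered = b"<A>\xe4<wbr/>\xbd\xa0\xe5\xa5\xbd</A>"

# Remove the tag on the octet level, without decoding first.
repaired = shattered.replace(b"<wbr/>", b"")

# Show both versions quoted-printable encoded, as qprint -e would.
print(quopri.encodestring(shattered).decode("ascii"))
print(quopri.encodestring(repaired).decode("ascii"))

# The repaired octets decode cleanly as UTF-8 again.
text = repaired.decode("utf-8")
print(ascii(text))
```

The key point of the sketch is the order of operations: the tag is removed while the data is still bytes, and decoding happens only afterwards.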
> declare option output:encoding "RAW"; or "BYTES" or "NONE"
CG> I’m not sure if you will need any output declaration for your query at
CG> all; but we first need more details.
> http://docs.basex.org/wiki/Serialization it just says "all encodings supported by Java" So one is supposed to look at http://www.google.com/search?q=all+encodings+supported+by+Java
CG> I've added a link. Note, however, that the list is also dependent on
CG> the Java VM you are using.
OK, also do make a note of that fact there...
> Why doesn't basex have a command that would output the current "all encodings supported by Java" that it is using.
CG> Try this:
CG> basex "Q{java.nio.charset.Charset}availableCharsets()"
Gawd!
$ basex "Q{java.nio.charset.Charset}availableCharsets()"|wc
0 167 3593
One big line and everything is repeated twice!
$ basex "Q{java.nio.charset.Charset}availableCharsets()"| perl -nwle 'print for /([^\s{]+)=/g'|wc
167 167 1713
looks much nicer and has half the bytes.
Do make a note of it on the wiki there. Thanks.
On Tue, 2013-01-01 at 10:52 +0800, jidanni@jidanni.org wrote:
> I'm just trying to find a way to remove the <wbr/> injected here,
> $ echo '<A>你好</A>'|perl -pwle 's![^[:ascii:]]!$&<wbr/>!'|qprint -e
> <A>=E4<wbr/>=BD=A0=E5=A5=BD</A>
I don't have a qprint command on my system, so I'm not sure what's going on for you here. Your perl substitution is putting <wbr/> after the first non-ascii character on the line, and 你 is for sure not an ascii character, so you get <wbr/> after it.
Are you trying to do MIME octet-level encoding of UTF-8 here?
Liam
LREQ> Your perl substitution is putting <wbr/> after the first non-ascii
LREQ> character on the line, and 你 is for sure not an ascii character,
LREQ> so you get <wbr/> after it.
Not exactly after it. 1/3 of the way through it. I.e., shattered UTF-8. I was just curious if there was a way in basex if I could do s!<wbr/>!!g like I can do in perl, to restore the damaged UTF-8 characters.
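Why the tag lands "1/3 of the way through" rather than after the character can be illustrated with a small sketch. It is written in Python rather than Perl, and it assumes that the Perl one-liner reads its input as undecoded octets, so that the character class matches a single byte. The names are hypothetical.

```python
import re

# "\u4f60\u597d" is the two-character string from the example.
text = "<A>\u4f60\u597d</A>"
raw = text.encode("utf-8")

# Octet-level match: the pattern sees one byte at a time, so the tag
# is inserted after the first byte of the three-byte character.
# count=1 mirrors the substitution having no /g flag.
shattered = re.sub(rb"[^\x00-\x7f]", lambda m: m.group(0) + b"<wbr/>", raw, count=1)

# Character-level match: the pattern sees whole code points, so the
# tag is inserted after the complete character.
intact = re.sub(r"[^\x00-\x7f]", lambda m: m.group(0) + "<wbr/>", text, count=1)

print(shattered)
print(ascii(intact))
```

Under that assumption, the first result is the shattered sequence discussed in this thread, while the second is what Liam's reading of the substitution would produce.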
http://www.couchsurfing.org/group_read.html?gid=430&post=13998575
On Tue, 2013-01-01 at 11:47 +0800, jidanni@jidanni.org wrote:
> Not exactly after it. 1/3 of the way through it. I.e., shattered UTF-8.
Treating the individual UTF-8 octets individually?
Not in standard XQuery, but that doesn't preclude a BaseX extension...
> I was just curious if there was a way in basex if I could do s!<wbr/>!!g like I can do in perl, to restore the damaged UTF-8 characters.
Note that "damaged UTF-8 characters", if by that you mean not well-formed UTF-8, aren't going to come through email reliably, so I might not be seeing what you wrote - s!<wbr/>!!g can be done with replace() but getting at UTF-8-encoded characters one octet at a time is another matter. But, my goal in replying was to tease out enough information from you that someone else could answer :-)
It's probably best not to assume that people on an XQuery-list would be familiar with Unicode handling in other languages, such as Perl, by the way, although some of us are :-)
> http://www.couchsurfing.org/group_read.html?gid=430&post=13998575
This says, "this thread has been deleted" at me.
Best,
Liam
"LREQ" == Liam R E Quin liam@w3.org writes:
LREQ> Treating the individual UTF-8 octets individually?
Yes.
LREQ> Not in standard XQuery, but that doesn't preclude a BaseX extension...
Well no big deal, I was just curious.
> I was just curious if there was a way in basex if I could do s!<wbr/>!!g like I can do in perl, to restore the damaged UTF-8 characters.
LREQ> Note that "damaged UTF-8 characters", if by that you mean not
LREQ> well-formed UTF-8, aren't going to come through email reliably, so I
LREQ> might not be seeing what you wrote - s!<wbr/>!!g can be done with
Don't worry. I wouldn't put any illegal chars into mail.
LREQ> replace() but getting at UTF-8-encoded characters one octet at a time is
LREQ> another matter. But, my goal in replying was to tease out enough
LREQ> information from you that someone else could answer :-)
> http://www.couchsurfing.org/group_read.html?gid=430&post=13998575
LREQ> This says, "this thread has been deleted" at me.
In fact they deleted the entire group it turns out.
Anyway here's what I posted there
#!/usr/bin/perl
# Shows line where we remove couchsurfing.org's UTF-8 shattering effects.
# Must run this before the browser gets its hands on it and turns the
# shattered UTF-8 into U+FFFD REPLACEMENT CHARACTER.
# So that seems to count out greasemonkey, etc. solutions.
# I used wwwoffle -o URL|./this_program after first browsing the page logged in
# in a browser that used wwwoffle as a proxy
# Copyright : http://www.fsf.org/copyleft/gpl.html
# Author : Dan Jacobson -- http://jidanni.org/
# Created On : 12/31/2012
# Last Modified On: Mon Dec 31 13:12:57 2012
# Update Count : 27
use strict;
use warnings FATAL => 'all';
my $N = qr/[^[:ascii:]]/;
while (<>) {
    my $original_line = $_;
    ## needed on e.g., http://www.couchsurfing.org/couchmanager?read=18541584
    s!<wbr/>!!g;
    ## needed on e.g.,
    ## http://www.couchsurfing.org/couchrequest/show_couchoffer_form?city_couchrequ...
    s!($N) ($N)!$1$2!g;
    s!\t<span class="show_more_control">\s+<br />!! && chomp;
    m!^\s+...<a class="show_more_link" href="#"> (more) </a><br />! && next;
    s!\s*</span><span class="show_more_text" style="display: none;"> !!;
    print "$.: $_" if $_ ne $original_line;
}
As Liam indicated (thanks!), XQuery may not be the best choice for processing data at the byte level: XQuery was built to work with Unicode characters as its basic unit, which means that it will never be possible to create illegal UTF-8 sequences with pure XQuery. This also means that the language provides no support to "repair" invalid input.
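The consequence of working on characters rather than octets can be illustrated with a short sketch. It is in Python, not XQuery, and it only assumes that a character-based tool has to decode its input before it can touch the markup; Python's decoder with `errors="replace"` stands in for whatever the parser does, and BaseX's actual internals may differ.

```python
# The shattered input from earlier in the thread, as raw octets.
shattered = b"<A>\xe4<wbr/>\xbd\xa0\xe5\xa5\xbd</A>"

# A character-based tool must decode before it can see the markup.
# Each stray octet of the broken character becomes U+FFFD here.
decoded = shattered.decode("utf-8", errors="replace")

# Removing the tag afterwards is easy, but the original octets are
# already gone, so the character cannot be put back together.
cleaned = decoded.replace("<wbr/>", "")

print(ascii(decoded))
print(ascii(cleaned))
print(cleaned.encode("utf-8"))
```

In this sketch the cleaned result contains three U+FFFD characters followed by the intact second character, which is consistent with the QP-encoded output reported earlier in the thread (=EF=BF=BD three times, then =E5=A5=BD).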
I wonder if you have enough control over your input to avoid UTF-8 shattering? If there’s no choice, and if you still want to try XQuery/BaseX for byte processing, you can play around with the functions of the Conversion Module:
http://docs.basex.org/wiki/Conversion_Module
basex-talk@mailman.uni-konstanz.de