Our mission today is to use BaseX to remove tags that were injected right between the bytes of multibyte UTF-8 characters. http://www.couchsurfing.org/group_read.html?gid=430&post=13986932
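To make the damage concrete, here is a minimal Java sketch that works on the raw bytes, outside BaseX. It assumes a literal <wbr/> lands after every non-ASCII byte. The class and method names (UnshatterUtf8, shatter, stripWbr) are made up for illustration and are not part of BaseX. The trick is to view the bytes as ISO-8859-1, where every byte maps to exactly one char, delete the tag, and only then decode as UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class UnshatterUtf8 {
    // Hypothetical helper: simulates the damage by writing "<wbr/>"
    // after every non-ASCII byte of the UTF-8 encoding of text.
    static byte[] shatter(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] tag = "<wbr/>".getBytes(StandardCharsets.US_ASCII);
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            out.write(b);
            if (b < 0) {                 // high bit set: not an ASCII byte
                out.write(tag, 0, tag.length);
            }
        }
        return out.toByteArray();
    }

    // Hypothetical helper: removes every literal "<wbr/>" at the byte
    // level, then decodes what is left as UTF-8.
    static String stripWbr(byte[] damaged) {
        // ISO-8859-1 maps each byte to one char, so this round trip is lossless.
        String latin1 = new String(damaged, StandardCharsets.ISO_8859_1);
        String cleaned = latin1.replace("<wbr/>", "");
        return new String(cleaned.getBytes(StandardCharsets.ISO_8859_1),
                          StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // \u4f60\u597d is the two-character sample text, escaped so the
        // source file compiles regardless of its encoding.
        byte[] damaged = shatter("<A>\u4f60\u597d</A>");
        System.out.println(stripWbr(damaged));
    }
}
```

Note that this also deletes any <wbr/> that was legitimately in the document, so it is only safe when every such tag is known to be injected.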
"CG" == Christian Grün <christian.gruen@gmail.com> writes:

CG> Have you tried method=raw, as mentioned in our documentation
CG> (http://docs.basex.org/wiki/Serialization)?
Sorry. Try it yourself:

  echo '<A>你好</A>' |
  perl -pwle 's![^[:ascii:]]!$&<wbr/>!' |
  basex -q ' declare option db:parser "html"; declare option output:method "raw"; doc("/dev/stdin")//*:wbr/..'

There is no way to cleanly restore the shattered UTF-8.

I would also like to try

  declare option output:encoding "RAW";

or "BYTES" or "NONE", but http://docs.basex.org/wiki/Serialization just says "all encodings supported by Java". So one is supposed to look at http://www.google.com/search?q=all+encodings+supported+by+Java etc. etc.

Why doesn't BaseX have a command that would output the current "all encodings supported by Java" that it is using?
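In the meantime, the list can be obtained from the JVM itself. A minimal sketch, assuming BaseX resolves encoding names through the standard java.nio.charset machinery (I have not checked that it does) and that you run it on the same JVM as BaseX. The class name ListCharsets is made up for illustration.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

public class ListCharsets {
    // Returns the canonical names of every charset this JVM knows about,
    // in the sorted order of the map that Charset.availableCharsets() returns.
    static List<String> charsetNames() {
        return new ArrayList<>(Charset.availableCharsets().keySet());
    }

    public static void main(String[] args) {
        for (String name : charsetNames()) {
            System.out.println(name);
        }
    }
}
```

The output differs between Java versions and installations, so it has to be run on the machine in question. As far as I know a stock JVM lists nothing called "RAW", "BYTES" or "NONE"; the closest thing to a byte-transparent encoding is ISO-8859-1.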