When converting files from XML to HTML, I got a serialization error saying, in effect, that x84 is an illegal HTML character. The files were written using file:write with the parameter $params defined as:
let $params := <output:serialization-parameters><output:method value="html"/></output:serialization-parameters>
When sending the HTML directly to the browser (not writing to a file, and not using the above declaration), the browser (Chrome) appeared to be fine with it and displayed the full HTML.
Processing the text through normalize-unicode() didn't help. The error persisted. Is there a way to fix the text before submitting it to file:write()?
All the best Lars G Johnsen National Library of Norway
Hi Lars,
> When converting files from xml to html, there appeared a serialization error saying something to the effect that x84 was an illegal html character.
Do you have some idea how the x84 byte was stored into the database?
> Is there a way to fix the text before submitting it to file:write()?
There are probably several ways to do this, but one standard XQuery solution that comes to mind looks as follows:

  (: 132 = U+0084, the code point rejected by the HTML serializer :)
  let $invalid := 132
  let $valid := string-to-codepoints("?")
  for $text in db:open('db')//text()
  let $cps := string-to-codepoints($text) ! (if (. eq $invalid) then $valid else .)
  let $new := codepoints-to-string($cps)
  return $new

All strings are converted to their codepoints, and the invalid codes are replaced with an alternative (here: ?). The texts are then returned as the result.
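The same codepoint replacement can be tried without a database. The following is only a sketch: the function name local:replace-controls is made up for illustration, and widening the check from 132 to the whole range 127 to 159 is an assumption, based on the HTML output method rejecting the control characters #x7F to #x9F in general rather than x84 alone.

```xquery
xquery version "3.0";

(: Hypothetical helper: replaces every code point in the range 127-159
   (DEL and the C1 controls) with the given replacement string. :)
declare function local:replace-controls(
  $text as xs:string,
  $replacement as xs:string
) as xs:string {
  codepoints-to-string(
    for $cp in string-to-codepoints($text)
    return
      if ($cp ge 127 and $cp le 159)
      then string-to-codepoints($replacement)
      else $cp
  )
};

(: Example input: U+0084 sits between "a" and "b". :)
local:replace-controls("a" || codepoints-to-string(132) || "b", "?")
```

For this input the query should return a?b. If only x84 needs to go, the range test can be narrowed back to $cp eq 132.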
The following query will replace all texts in the database..

  let $invalid := 132
  let $valid := string-to-codepoints("?")
  for $text in db:open('db')//text()
  let $cps := string-to-codepoints($text) ! (if (. eq $invalid) then $valid else .)
  let $new := codepoints-to-string($cps)
  return replace value of node $text with $new
..and the last one replaces the texts in the main memory representation of the document:

  copy $db := db:open('db')
  modify (
    let $invalid := 132
    let $valid := string-to-codepoints("?")
    for $text in $db//text()
    let $cps := string-to-codepoints($text) ! (if (. eq $invalid) then $valid else .)
    let $new := codepoints-to-string($cps)
    return replace value of node $text with $new
  )
  return $db
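To connect this back to the original question, here is a rough end-to-end sketch that cleans an in-memory document and only then hands it to file:write. The sample document and the output path out.html are made up for illustration, and it uses a regular expression over the range #x7F to #x9F instead of the codepoint loop, which is a swapped-in shortcut rather than the approach shown above.

```xquery
declare namespace output = "http://www.w3.org/2010/xslt-xquery-serialization";

let $params :=
  <output:serialization-parameters>
    <output:method value="html"/>
  </output:serialization-parameters>

(: Made-up sample input containing U+0084 inside a text node. :)
let $html := <html><body><p>wor&#x84;d</p></body></html>

(: Work on a copy, so the source document stays untouched. :)
let $clean :=
  copy $c := $html
  modify (
    for $t in $c//text()
    return replace value of node $t
           with replace($t, '[&#x7F;-&#x9F;]', '?')
  )
  return $c

(: 'out.html' is a hypothetical target path. :)
return file:write('out.html', $clean, $params)
```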
Hope this helps, Christian
Hi Christian and thanks for the quick solution!
> Do you have some idea how the x84 byte was stored into the database?
Each file is a digitized book, where the bytes stem from the OCR process. The words themselves are stored as values of attributes, for example <STRING CONTENT="word">.
All the best Lars
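Since the words sit in attribute values, the //text() queries earlier in the thread would not reach them. A possible adaptation for attributes is sketched below. The function name local:clean-attributes and the sample element are hypothetical, and the regular expression over #x7F to #x9F is an assumption about which characters need replacing.

```xquery
(: Hypothetical helper: returns a copy of $root in which control
   characters #x7F-#x9F in any attribute value are replaced by "?". :)
declare function local:clean-attributes($root as element()) as element() {
  copy $copy := $root
  modify (
    for $attr in $copy//@*[matches(., '[&#x7F;-&#x9F;]')]
    return replace value of node $attr
           with replace($attr, '[&#x7F;-&#x9F;]', '?')
  )
  return $copy
};

(: Made-up sample in the shape described above. :)
local:clean-attributes(<page><STRING CONTENT="wor&#x84;d"/></page>)
```

If the stray bytes originally came from Windows-1252 text, x84 would correspond to the low double quotation mark U+201E, so replacing it with that character instead of ? may preserve more of the OCR output. Whether that applies here is not known from the thread.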