X-debbugs-Cc: basex-talk@mailman.uni-konstanz.de
Package: basex
Version: 7.1.1-2
Severity: wishlist
We read
  basex (7.1.1-2) unstable; urgency=low

    * Allow non well-formed HTML to be parsed if libtagsoup-java is installed.
    * Updated man page with an example on how to parse HTML.
But we find no such example on the man page.
Also, please add something to http://docs.basex.org/wiki/Parsers ... OK, I added a minimal section myself: http://docs.basex.org/wiki/Parsers#HTML_Parsers
By the way http://home.ccil.org/~cowan/XML/tagsoup/ says
  --files
      Output into individual files, with html extensions changed to xhtml. Otherwise, all output is sent to the standard output.
  --html
      Output is in clean HTML: the XML declaration is suppressed, as are end-tags for the known empty elements.
  --omit-xml-declaration
      The XML declaration is suppressed.
etc. Please mention how we can manipulate these via "declare option...".
Also mention how to manipulate the 'SAX features and properties' mentioned.
Allow us to attempt a round trip,
declare option db:parser "html";
declare option output:method "html";
declare option output:version "4.01";
declare option output:doctype-public "-//W3C//DTD HTML 4.01//EN";
declare option output:doctype-system "http://www.w3.org/TR/html4/strict.dtd";
doc("http://jidanni.org/index.html")
Alas, I need to somehow use --html. Also, who is putting those shape="rect" attributes into my <a> links? Maybe --html will fix that too; see http://www.xmlplease.com/shaperect .
I can accept the fact that comments are stripped, but there should be a way to adjust things so one can get a closer HTML round trip.
I did successfully manage to check my website for deficient IMG links:

$ tail -n 2 Makefile
xxqq:l.xq
	basex -bM="$$(find ~/jidanni.org -name '*.html' ! -name '*_en.html')" $?
$ cat l.xq
declare option db:parser "html";
declare variable $M external;
(: haven't learned collections / weeding around within :)
let $k := fn:tokenize($M, "\s+")
for $i in $k
(: Find img's on my website with no height and width etc. :)
return doc($i)//*:img[not(@width) or not(@height) (: or not (@title) :)]
(: To Do: spit out the filename and line number, so emacs can go to it. :)
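
For the To Do in l.xq, a possible sketch, untested against BaseX 7.1.1 and assuming the same external variable $M and the same db:parser option as l.xq. It prints the file name and the src of each deficient img, one per line. Line numbers are left out, since no way of obtaining them from the parsed document is assumed here.

```xquery
declare option db:parser "html";
declare variable $M external;
(: Sketch only: same inputs as l.xq; adds the file name to each hit. :)
(: Empty tokens (from leading or trailing whitespace in $M) are skipped. :)
for $i in fn:tokenize($M, "\s+")[. ne ""]
for $img in doc($i)//*:img[not(@width) or not(@height)]
(: One line per deficient img, in the form "file: src". :)
return concat($i, ": ", string($img/@src))
```

Putting the file name first on each line keeps the output close to the grep-style format, which should make it easier to jump to the file from emacs once line numbers can be added.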