Is it true that BaseX can output HTML (http://docs.basex.org/wiki/Serialization) but not read it back in (http://docs.basex.org/wiki/Parsers)?
Hi,
if TagSoup [1] is present in the classpath (it ships with our ZIP packages, for example), BaseX will accept HTML input (the "poor, nasty and brutish" kind [1]).
Hope this helps. Kind regards from Lake Constance
Michael
[1] http://ccil.org/~cowan/XML/tagsoup/

On 22.02.2012 at 04:27, jidanni@jidanni.org wrote:
> Is it true that basex can output HTML, http://docs.basex.org/wiki/Serialization but not read it back in? http://docs.basex.org/wiki/Parsers

_______________________________________________
BaseX-Talk mailing list
BaseX-Talk@mailman.uni-konstanz.de
https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk
"MS" == Michael Seiferle michael.seiferle@uni-konstanz.de writes:
MS> Hi,
MS> if Tagsoup [1] is present in the classpath (it comes with our Zip
MS> packages e.g.), BaseX will allow (the "poor, nasty and brutish" [1])
MS> HTML input.
Well, all I know is that http://docs.basex.org/wiki/Parsers should mention what to do to read HTML, and on my machine there is

$ apt-cache search tagsoup-java
libtagsoup-java - SAX-compliant parser for real-life HTML
libtagsoup-java-doc - API Documentation for TagSoup
Mainly it is tags like <img ...> without /> that throw basex off track.
TagSoup needs to be in your classpath (which is the case if BaseX is downloaded from our homepage). If you have installed BaseX via the Debian package manager, you'll have to add tagsoup.jar to the classpath manually in the BaseX start scripts.
Hope this helps, Christian
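
A minimal sketch of what adding tagsoup.jar to a start script could look like. The jar locations under /usr/share/java and the helper name build_classpath are assumptions made for illustration, not taken from the Debian package; check where your system actually installs the jars. The final java call is left commented out so the fragment can be tried without BaseX installed.

```shell
#!/bin/sh
# Sketch only: both jar paths are assumptions; adjust them to your system.
BASEX_JAR="${BASEX_JAR:-/usr/share/java/basex.jar}"
TAGSOUP_JAR="${TAGSOUP_JAR:-/usr/share/java/tagsoup.jar}"

# Hypothetical helper: join all jar files that exist with ':'.
build_classpath() {
  cp=""
  for jar in "$@"; do
    # Skip jars that are not installed, so a missing TagSoup is not fatal.
    [ -f "$jar" ] || continue
    cp="${cp:+$cp:}$jar"
  done
  printf '%s\n' "$cp"
}

CP=$(build_classpath "$BASEX_JAR" "$TAGSOUP_JAR")

# Start standalone BaseX with TagSoup on the classpath (uncomment to run):
# exec java -cp "$CP" org.basex.BaseX "$@"
```

If tagsoup.jar is missing, the classpath simply contains the BaseX jar alone, and BaseX starts without HTML parsing support.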
I'll have a look at this. If TagSoup is present on a Debian system, it should be detected automatically. If not, it's a fault of the package and I'll fix it.
-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
Alexander Holupirek
|-- http://www.informatik.uni-konstanz.de/~holupire
|-- Database & Information Systems Group, U Konstanz
`-- Room E 221, 0049 7531 88 2188 (phone) 3577 (fax)
I just prepared basex_7.1.1-2_all.deb which is available from:
deb http://files.basex.org/debian unstable/
deb-src http://files.basex.org/debian unstable/
* Problem: Want to parse non well-formed HTML
$ cat bad.html
<html>
<ul>
<li>A
<li>B
</ul>
</html>

$ basex -c 'create db html bad.html'
"/home/holu/bad.html" (Line 5): </ul> found, </li> expected.
The input may be correctly parsed after switching off the internal XML parser.
* Solution: Have tagsoup installed and set it as parser
$ sudo aptitude install libtagsoup-java
The following NEW packages will be installed:
  libtagsoup-java
0 packages upgraded, 1 newly installed, 0 to remove and 88 not upgraded.
Need to get 99.0 kB of archives. After unpacking 138 kB will be used.
Get: 1 ftp://ftp.debian.org/debian/ unstable/main libtagsoup-java all 1.2.1-1 [99.0 kB]
Fetched 99.0 kB in 0s (305 kB/s)
Selecting previously unselected package libtagsoup-java.
(Reading database ... 89487 files and directories currently installed.)
Unpacking libtagsoup-java (from .../libtagsoup-java_1.2.1-1_all.deb) ...
Processing triggers for man-db ...
Setting up libtagsoup-java (1.2.1-1) ...
$ basex -c 'set parser html; create db html bad.html'
$ basex -q "doc('html')"
<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <ul>
      <li>A</li>
      <li>B</li>
    </ul>
  </body>
</html>
Available in Debian package version 7.1.1-2
Cheers, Alex
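
The two commands above could be wrapped for repeated use roughly as follows. This is only a sketch: import_html and show_db are hypothetical helper names, and the basex invocations simply mirror the ones shown in this message for package version 7.1.1-2. Database names and file paths containing spaces are not handled.

```shell
#!/bin/sh
# Sketch only: import_html and show_db are hypothetical helper names.

# Create database $1 from HTML file $2, using the TagSoup-backed HTML parser.
import_html() {
  basex -c "set parser html; create db $1 $2"
}

# Print the document stored in database $1.
show_db() {
  basex -q "doc('$1')"
}
```

With those definitions, `import_html html bad.html && show_db html` would repeat the session shown above.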
Don't you want to mention SET PARSER HTML on http://docs.basex.org/wiki/Parsers ?
Also, there is nothing 'bad' about the HTML... It is a valid form of one of the HTML versions mentioned on http://docs.basex.org/wiki/Serialization .
Indeed, the HTML versions that SET PARSER HTML supports should be documented on http://docs.basex.org/wiki/Parsers .