can output HTML but not read it back in again? - BaseX-Talk - mailman.uni-konstanz.de


can output HTML but not read it back in again?


jidanni＠jidanni.org

22 Feb 2012

4:27 a.m.

Is it true that basex can output HTML, http://docs.basex.org/wiki/Serialization but not read it back in? http://docs.basex.org/wiki/Parsers


Michael Seiferle

22 Feb 2012

8:48 a.m.

Hi,

If TagSoup [1] is present in the classpath (it ships with our ZIP packages, for example), BaseX will accept (the "poor, nasty and brutish" [1]) HTML input.

Hope this helps. Kind regards from Lake Constance

Michael

[1] http://ccil.org/~cowan/XML/tagsoup/

On 22.02.2012 at 04:27, jidanni@jidanni.org wrote:

> Is it true that basex can output HTML, http://docs.basex.org/wiki/Serialization but not read it back in? http://docs.basex.org/wiki/Parsers

_______________________________________________
BaseX-Talk mailing list
BaseX-Talk@mailman.uni-konstanz.de
https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk


jidanni＠jidanni.org

10:35 a.m.

"MS" == Michael Seiferle <michael.seiferle@uni-konstanz.de> writes:

MS> Hi,

MS> if Tagsoup [1] is present in the classpath (it comes with our Zip
MS> packages e.g.), BaseX will allow (the "poor, nasty and brutish" [1])
MS> HTML input.

Well all I know is that http://docs.basex.org/wiki/Parsers should mention what to do to read HTML, and on my machine there is

$ apt-cache search tagsoup-java
libtagsoup-java - SAX-compliant parser for real-life HTML
libtagsoup-java-doc - API Documentation for TagSoup

Mainly it is tags like <img ...> without /> that throw basex off track.
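
A minimal sketch of the failure described above, for anyone who wants to reproduce it. The file name img.html and the database name imgtest are hypothetical, and the sketch assumes the `basex` launcher is on the PATH; if it is not, the load step is skipped.

```shell
# Write a small HTML file whose <img> element is not self-closed.
# (img.html is a hypothetical file name used only for this sketch.)
cat > img.html <<'EOF'
<html>
<body>
<img src="a.png">
</body>
</html>
EOF

# Try to load it with the default (XML) parser, but only if basex is installed.
if command -v basex >/dev/null 2>&1; then
  # With the default parser this is expected to report a well-formedness error.
  basex -c 'create db imgtest img.html' || echo "basex reported an error"
else
  echo "basex not found on PATH; skipping the load step"
fi
```

The unclosed <img> is legal HTML but not well-formed XML, which is why the default parser stops on it.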


Christian Grün

12:24 p.m.

TagSoup needs to be embedded in your classpath (which is the case if BaseX is downloaded from our homepage). If you have installed BaseX via the Debian package manager, you'll have to manually add tagsoup.jar to the classpath in the BaseX start scripts.

Hope this helps, Christian
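
As a rough illustration of what "embedding tagsoup.jar in the start script" could look like: the sketch below builds a classpath that contains both jars and prints the resulting Java command. The two jar locations are assumptions (typical Debian paths, overridable through the hypothetical BASEX_JAR and TAGSOUP_JAR variables), and the real start script shipped with a given package may be organised differently.

```shell
# Assumed jar locations; adjust to wherever your system installs them.
BASEX_JAR="${BASEX_JAR:-/usr/share/java/basex.jar}"
TAGSOUP_JAR="${TAGSOUP_JAR:-/usr/share/java/tagsoup.jar}"

# Put TagSoup on the classpath next to BaseX.
CP="$BASEX_JAR:$TAGSOUP_JAR"

# Print the command a start script would run.
# Remove the leading "echo" to actually launch the standalone client.
echo java -cp "$CP" org.basex.BaseX "$@"
```

Once TagSoup is on the classpath this way, the HTML parser becomes available to BaseX as described in the messages above.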


Alexander Holupirek

12:26 p.m.

I'll have a look at this. If TagSoup is present on a Debian system, it should be detected automatically. If not, it's a fault of the package and I'll fix it.


-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
Alexander Holupirek
|-- http://www.informatik.uni-konstanz.de/~holupire
|-- Database & Information Systems Group, U Konstanz
`-- Room E 221, 0049 7531 88 2188 (phone) 3577 (fax)


Alexander Holupirek

1:24 p.m.

I just prepared basex_7.1.1-2_all.deb which is available from:

deb http://files.basex.org/debian unstable/
deb-src http://files.basex.org/debian unstable/

* Problem: Want to parse non well-formed HTML

$ cat bad.html
<html>
<ul>
<li>A
<li>B
</ul>
</html>

$ basex -c 'create db html bad.html'
"/home/holu/bad.html" (Line 5): </ul> found, </li> expected.
The input may be correctly parsed after switching off the internal XML parser.

* Solution: Have tagsoup installed and set it as parser

$ sudo aptitude install libtagsoup-java
The following NEW packages will be installed:
  libtagsoup-java
0 packages upgraded, 1 newly installed, 0 to remove and 88 not upgraded.
Need to get 99.0 kB of archives. After unpacking 138 kB will be used.
Get: 1 ftp://ftp.debian.org/debian/ unstable/main libtagsoup-java all 1.2.1-1 [99.0 kB]
Fetched 99.0 kB in 0s (305 kB/s)
Selecting previously unselected package libtagsoup-java.
(Reading database ... 89487 files and directories currently installed.)
Unpacking libtagsoup-java (from .../libtagsoup-java_1.2.1-1_all.deb) ...
Processing triggers for man-db ...
Setting up libtagsoup-java (1.2.1-1) ...

$ basex -c 'set parser html; create db html bad.html'
$ basex -q "doc('html')"
<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <ul>
      <li>A</li>
      <li>B</li>
    </ul>
  </body>
</html>

Available in Debian package version 7.1.1-2

Cheers, Alex
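
One detail worth noting when querying the result: the output above shows that the parsed elements end up in the XHTML namespace, so an unprefixed path such as //li would not match them. A minimal sketch, assuming the database named html was created as shown above and that `basex` is on the PATH, uses a namespace wildcard instead. The variable name QUERY is only for illustration.

```shell
# Count the list items regardless of namespace (*:li matches li in any namespace).
QUERY="count(doc('html')//*:li)"

# Run the query only if basex is installed; the database 'html' must already exist.
if command -v basex >/dev/null 2>&1; then
  basex -q "$QUERY" || echo "query failed (does the database 'html' exist?)"
else
  echo "basex not found on PATH; skipping the query"
fi
```

Declaring the XHTML namespace in the query prolog would be the stricter alternative; the wildcard is just the shortest way to check that the list items were recovered.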


jidanni＠jidanni.org

1:43 p.m.

Don't you want to mention SET PARSER HTML on http://docs.basex.org/wiki/Parsers ?

Also, there is nothing 'bad' about the HTML... It is a valid form of one of the HTML versions mentioned on http://docs.basex.org/wiki/Serialization .

Indeed, please document the HTML versions that SET PARSER HTML supports on http://docs.basex.org/wiki/Parsers .
