Dear BaseX team,

I'm using BaseX for my studies with Xavier-Laurent Salvador from Paris13. I've got a puzzling issue with html:parse. I'm running the query below, which uses html:parse to get a list of the URLs from a web page, and I'm getting this error message: "Ligne 19: Invalid character found: '"' "

    for $x in (html:parse(http:send-request(
      <http:request method='get' override-media-type=' application/octet-stream' href= 'http://www.crealscience.fr/'/>
    ) [2])//@href[matches(.,"http")])
    return $x

The same query gives this kind of error with any URL: html:parse stops at any error in the page's HTML code. I added a header with "declare option output:method "text";" to the query, but it didn't solve the problem. If I insert the same query in a RestXQ file, it works perfectly.

Do you have any suggestions to solve that problem?

Best,
Sophie Petit (basex 7.8 on debian)
Dear Sophie,

I assume that TagSoup is missing from your BaseX classpath. TagSoup is responsible for converting HTML pages to XML (see [1] for more details). By calling html:parser(), you can find out if HTML can be correctly converted [2].

By the way, the following query is an alternative solution for parsing HTML to XML. It gives you more control over the individual steps (but, once again, TagSoup must be in the classpath to successfully import HTML):

    let $url := 'http://www.crealscience.fr/'
    let $text := fetch:text($url)
    let $xml := html:parse($text)
    return $xml

Hope this helps,
Christian

[1] http://docs.basex.org/wiki/Parsers#HTML_Parser
[2] http://docs.basex.org/wiki/HTML_Module#html:parser

On Tue, Feb 18, 2014 at 6:36 PM, Sophie Petit <sophiepetit@gmail.com> wrote:
Participants (2): Christian Grün, Sophie Petit
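
For readers who want the list of links that the original question asked for, the fetch-then-parse query from the reply can be combined with the @href filter from the question. This is only a sketch under the same assumption stated in the reply, namely that TagSoup is on the BaseX classpath; the variable name $href and the choice to return strings rather than attribute nodes are illustrative and not part of either message.

```xquery
(: Sketch only: assumes TagSoup is available on the BaseX classpath. :)
let $url  := 'http://www.crealscience.fr/'
(: Fetch the page as raw text, then convert the HTML to XML. :)
let $xml  := html:parse(fetch:text($url))
(: Keep only href attributes whose value contains "http". :)
for $href in $xml//@href[matches(., 'http')]
(: Return the attribute value as a string instead of the attribute node. :)
return string($href)
```

Returning string($href) keeps the result a plain sequence of strings, so it does not depend on how standalone attribute nodes are serialized.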