Hi,

If I use `fetch:xml($url, map{'parser':'html'})`, all is fine!

The next query gives a correct result (although, in contrast to the browser, without namespaces and doctype):

    let $url := "https://bartoszmilewski.com/2014/10/28/category-theory-for-programmers-the-p..."
    let $response := http:send-request(<http:request method='get'/>, $url)
    return $response[2]

This creates a mess:

    let $url := "https://bartoszmilewski.com/2014/10/28/category-theory-for-programmers-the-p..."
    let $response := http:send-request(<http:request method='get'/>, $url)
    return html:parse($response[2])

It gives (partial):

    <html>
      <body>Category Theory for Programmers: The Preface | Bartosz Milewski's Programming Cafe/* */ if ( 'function' === typeof WPRemoteLogin ) { document.cookie = "wordpress_test_cookie=test; path=/"; if ( document.cookie.match( /(;|^)\s*wordpress_test_cookie\=/ ) ) { WPRemoteLogin(); }

etc.

I also tried:

    let $url := "https://bartoszmilewski.com/2014/10/28/category-theory-for-programmers-the-p..."
    let $response := http:send-request(<http:request method='get'/>, $url)
    return html:parse($response[2], map{
      'html': false(),
      'lexical': true(),
      'nocdata': true(),
      'nodefaults': true(),
      'nons': false()
    })

This adds only the XHTML namespace, but the rest is the same.

    html:parser() -> "TagSoup"

I don't know enough about Java to test which TagSoup version is in use via the GUI. I am getting this result when using the GUI on Windows 10 with BaseX 9.0.1.

As I have seen, there was a thread on this list in 2016 about a possible replacement of TagSoup:
https://www.mail-archive.com/basex-talk@mailman.uni-konstanz.de/msg08928.htm...

What about https://about.validator.nu/htmlparser/ ?

Thank you.

--
Goody Bye, Minden jót, Mit freundlichen Grüßen,
Andreas Mixich
Hi Andreas,
> What about https://about.validator.nu/htmlparser/ ?
Thanks for the pointer; I will have a look at this parser.
> This creates a mess:
>
>     let $url := "https://bartoszmilewski.com/2014/10/28/category-theory-for-programmers-the-p..."
>     let $response := http:send-request(<http:request method='get'/>, $url)
>     return html:parse($response[2])
The reason is that the HTTP response is of type node(). html:parse takes strings as arguments, so when you call html:parse, your node is implicitly atomized to a string and its markup is lost. The following query should do the job:

    let $url := "https://bartoszmilewski.com/2014/10/28/category-theory-for-programmers-the-p..."
    let $response := http:send-request(
      <http:request method='get' override-media-type='text/plain' href='{ $url }'/>
    )[2]
    return html:parse(
      $response,
      map { 'nons': false() }
    )

By the way, while running your queries, I noticed that html:parse no longer accepted binary input. This has been fixed in the latest snapshot [1].

Apart from that, we are currently working on an enhanced version of our HTTP Client Module (see [2]). Maybe we'll drop the implicit response conversion in the new functions.

Cheers,
Christian

[1] http://files.basex.org/releases/latest/
[2] https://github.com/BaseXdb/basex/issues/914
PS: You can also supply HTML parsing options via fetch:xml:

    fetch:xml(
      'http://basex.org/',
      map {
        'parser': 'html',
        'htmlparser': map { 'nons': false() }
      }
    )

In future, if you want to use the HTTP Module, your request could look as simple as this:

    html:parse(
      http:get($url)?body,
      map { 'nons': false() }
    )

We are still working out if response parsing will be integrated in the http:get call:

    let $serializer := map {
      'parser': 'html',
      'htmlparser': map { 'nons': false() }
    }
    return http:get(
      $url,
      map { 'serializer': $serializer }
    )('body')
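As an illustration of the implicit conversion described in the reply above: when a node is passed where a string is expected, XQuery atomizes it, which keeps only the concatenated text of its descendants and discards the element markup. The following is a minimal sketch that is not taken from the thread; the inline document is a made-up stand-in for the parsed page that http:send-request returns as $response[2].

```xquery
(: Made-up stand-in for the document node returned as $response[2]. :)
let $doc := document {
  <html><head><title>Title</title><script>var x = 1;</script></head><body><p>Text</p></body></html>
}
return (
  (: The response body is a node, not a string. :)
  $doc instance of node(),
  (: Atomization concatenates all descendant text nodes; the tags are lost. :)
  data($doc)
)
```

Under these assumptions the query should return true followed by the value "Titlevar x = 1;Text". The title and the script content run together in the same way as in the output quoted at the start of the thread, which is why requesting the body as text (override-media-type='text/plain') before calling html:parse avoids the problem.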
Hi Christian,

thank you very much for your help. All is fine now :-)

You wrote:
> We are still working out if response parsing will be integrated in the http:get call:
In this very case (I am crawling parts of a website recursively, pulling the document content and the binary data (images) and converting it into ePub2, completely recomposing the HTML), I am very happy that the response header is available, since I can read out the media type and interpret the response code. I believe certain REST APIs also communicate additional information in the response header ('X-something: ' tags).

But as long as there is one function that comes with the full-featured response, I think that is enough. The user could easily write wrappers around that.

I have checked your reference to https://github.com/BaseXdb/basex/issues/914. I am pretty content with the response being presented as XML! In my opinion it keeps the spirit of the XQuery process alive: a query against an XML backend. Though I understand that, from a pure programming language point of view, having the result as a map() may be more appealing. In the request, however, I prefer a map with options.

Just my 2¢.

--
Goody Bye, Minden jót, Mit freundlichen Grüßen,
Andreas Mixich
participants (2)
- Andreas Mixich
- Christian Grün