Hi,
If I use `fetch:xml($url, map{'parser':'html'})`, everything works fine.
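Spelled out as a complete query, that call would look roughly like this (a sketch only; the URL is the same truncated one used in the queries below):

```xquery
(: Fetch the page and let BaseX convert the HTML to XML in one step. :)
let $url := "https://bartoszmilewski.com/2014/10/28/category-theory-for-programmers-the-p..."
return fetch:xml($url, map { 'parser': 'html' })
```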
The next query also gives a correct result (although, in contrast to the browser, without namespaces and doctype):
let $url := "https://bartoszmilewski.com/2014/10/28/category-theory-for-programmers-the-p..."
let $response := http:send-request(<http:request method='get'/>, $url)
return $response[2]
This creates a mess:
let $url := "https://bartoszmilewski.com/2014/10/28/category-theory-for-programmers-the-p..."
let $response := http:send-request(<http:request method='get'/>, $url)
return html:parse($response[2])
gives (partial):
<html> <body>Category Theory for Programmers: The Preface | Bartosz Milewski's Programming Cafe/* */ if ( 'function' === typeof WPRemoteLogin ) { document.cookie = "wordpress_test_cookie=test; path=/"; if ( document.cookie.match( /(;|^)\s*wordpress_test_cookie=/ ) ) { WPRemoteLogin(); }
etc.
I also tried:
let $url := "https://bartoszmilewski.com/2014/10/28/category-theory-for-programmers-the-p..."
let $response := http:send-request(<http:request method='get'/>, $url)
return html:parse($response[2], map {
  'html': false(),
  'lexical': true(),
  'nocdata': true(),
  'nodefaults': true(),
  'nons': false()
})
This only adds the XHTML namespace; the rest of the output is the same.
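One variation is not covered by the attempts above. If `http:send-request` already converts a `text/html` body to an XML node, then `html:parse` would receive a node rather than the raw markup, and only its text content would be parsed. That is an assumption, not something confirmed here. A minimal sketch that asks for the body as plain text first, assuming the `override-media-type` attribute of the EXPath HTTP Client is honored by this BaseX version:

```xquery
(: Assumption: override-media-type='text/plain' makes the body arrive as a string. :)
let $url := "https://bartoszmilewski.com/2014/10/28/category-theory-for-programmers-the-p..."
let $response := http:send-request(
  <http:request method='get' override-media-type='text/plain'/>, $url
)
(: $response[1] is the http:response element, $response[2] is the body. :)
return html:parse($response[2])
```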
html:parser() -> "TagSoup"
I don't know enough about Java to check via the GUI which TagSoup version is in use. I am getting this result when using the GUI on Windows 10 with BaseX 9.0.1.
As far as I can see, there was a thread on this list in 2016 about a possible replacement of TagSoup: https://www.mail-archive.com/basex-talk@mailman.uni-konstanz.de/msg08928.htm...
Would https://about.validator.nu/htmlparser/ be an option?
Thank you.