Hi Christian -
Alas, the data is a client's and confidential.
for $remote in $paths
let $name as xs:string := file:name($remote)
let $target as xs:string := file:resolve-path($name, $targetBase)
let $fetched as item() := http:send-request(
  <http:request method='get' username='{$id}' password='{$pass}'/>,
  $remote
)[2]
return
  if ($fetched instance of document-node()) then file:write($target, $fetched)
  else if ($fetched instance of xs:base64Binary) then file:write-binary($target, $fetched)
  else file:write-text($target, $fetched)
works -- the query completes and files are written to disk. I suspect that the server ignores override-media-type.
If I don't check the returned type and instead write everything out with file:write-binary(), so that I could pick the HTML files back off the disk later, I get an error that the content isn't binary but xs:untypedAtomic. (Which might imply that file:write-binary complains that way when fed a document node.)
Which leads to "is there a way to get the type of an item?" I don't think there is, but it seems like it would be extremely helpful for stuff like this where "figure out what the web server feels like doing" is a concern.
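For the question above, one portable option is a typeswitch that maps an item to a descriptive string. This is only a minimal sketch: local:kind is a hypothetical helper name, and the list of cases covers just the types that came up in this thread plus a few obvious neighbours.

```xquery
(: Hypothetical helper: names the first matching type of a single item. :)
declare function local:kind($item as item()) as xs:string {
  typeswitch ($item)
    case document-node()  return 'document-node()'
    case element()        return 'element()'
    case node()           return 'node()'
    case xs:base64Binary  return 'xs:base64Binary'
    case xs:hexBinary     return 'xs:hexBinary'
    case xs:untypedAtomic return 'xs:untypedAtomic'
    case xs:string        return 'xs:string'
    case xs:anyAtomicType return 'xs:anyAtomicType'
    case map(*)           return 'map(*)'
    case array(*)         return 'array(*)'
    default               return 'function(*)'
};

(: Sample items standing in for what http:send-request might hand back. :)
for $item in (document { <a/> }, xs:base64Binary('AAAA'), 'text', xs:untypedAtomic('x'))
return local:kind($item)
```

Cases are tested in order, so the more specific types have to come before the general ones. The BaseX Inspection Module also appears to offer inspect:type for this purpose; that is worth checking against the documentation for the version in use.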
Thank you! It was a helpful hint.
-- Graydon
On Fri, Apr 8, 2022 at 9:44 AM Christian Grün christian.gruen@gmail.com wrote:
Hi Graydon,
Maybe it’s TagSoup that has problems converting some specific HTML files to XML. Did you try to write the responses to disk and parse them in a second step?
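A second, separate parsing step could look roughly like the sketch below. It assumes the raw responses were already saved with file:write-binary; the two directory paths are hypothetical placeholders, and the per-file try/catch is there only to report which input fails to convert.

```xquery
(: Hypothetical directories; adjust to the local setup. :)
declare variable $rawDir as xs:string := '/tmp/fetched/';
declare variable $xmlDir as xs:string := '/tmp/parsed/';

file:create-dir($xmlDir),
for $path in file:children($rawDir)
where file:is-file($path)
let $target := $xmlDir || file:name($path) || '.xml'
return try {
  (: read the stored bytes and convert them to XML, one file at a time :)
  file:write($target, html:parse(file:read-binary($path)))
} catch * {
  (: report the failing file instead of aborting the whole run :)
  'failed: ' || $path || ' (' || $err:description || ')'
}
```

Splitting the work this way keeps the download and the conversion independent, so a problem file can be identified without fetching anything again.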
If your input data is not confidential, could you possibly provide us with an example that runs out of the box?
Best, Christian
I'm using the basexgui to run the following (minus some identifying actual values defined previously in the query):
(: for each path, retrieve the document :)
for $remote in $paths
let $name as xs:string := file:name($remote)
let $target as xs:string := file:resolve-path($name, $targetBase)
let $fetched := http:send-request(
  <http:request method='get'
                override-media-type='application/octet-stream'
                username='{$id}'
                password='{$pass}'/>,
  $remote
)[2]
let $use as item() := try { html:parse($fetched) } catch * { $fetched }
return
  if ($use instance of document-node()) then file:write($target, $use)
  else file:write-binary($target, $use)
It works, in that I get exactly 100 documents retrieved. (There are unfortunately 140+ documents in the list.)
However, the query fails with an "out of main memory" error when using a recent 10.0 beta or 9.7 with Xmx set to 2g. Setting Xmx to 16g with 9.7 produces the same "out of memory" error in the same length of time (about 5 minutes).
java -version says
20:27 test % java -version
openjdk version "11.0.14.1" 2022-02-08
OpenJDK Runtime Environment 18.9 (build 11.0.14.1+1)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.14.1+1, mixed mode, sharing)
It's entirely possible I'm going about fetching files off a web server the wrong way; it's possible there's something there that's rather large, but I doubt it's that large.
What should I be doing instead?
Thanks! Graydon