Hi all - I have the sneaking suspicion that the answer to this plea for help will be something like, "Use RESTXQ!!" or something similar, but let me describe the problem: I'm pulling data back from an OAI-PMH endpoint that is slow; i.e. response times are ~1/minute. Embarrassingly, I think I've spent several hours trying to figure out if my requests were buggy before I realized that the API endpoint was just *slow* (at least compared to others that I use regularly).
I typically use the IntelliJ plugin or the BaseX GUI for the vast majority of my XQuery work, but these requests effectively build up a sequence of responses (as files) and then serialize them all to disk when the requests have finished. When the requests are answered quickly, there's prompt feedback (my script finishes quickly, I can see serialized files, etc), but when the requests are answered slowly, I'm left waiting (and then thinking, "Oh no - I've mistyped the endpoint URL." or something similar).
Initially I think I have two requests for guidance:

1. Is there a better way, using the BaseX GUI (or the command line), to get feedback on a querying process like this? Something... asynchronous, or something clever with builtin functions in the `jobs` or `xquery` modules?

2. If this can be addressed relatively directly with RESTXQ, I'm game to (finally) get more comfortable with it, but does anyone have any examples, applications, scripts, etc. they would be willing to share? I want to say that several examples and ideas have been shared before here on the list, but I'm having a terrible time finding them.
I've attached a simple SSCCE, where the basic idea is: query an API for some data, and get a response like so:

<example>
  <stuff>...</stuff>
  <resumptionToken>abc123</resumptionToken>
</example>

then take the text() of the resumptionToken and resubmit a request to the API, which would return:

<example>
  <stuff>...</stuff>
  <resumptionToken>def456</resumptionToken>
</example>

The full responses are written to the temp directory on your system (file:temp-dir()), with a date-stamped name.
The endpoint in my attached example has a fairly small response from what I've selected, but the idea would be: how can I let myself (or another user) know that the query is active and running, not hung or failing? I understand that given the functional nature of XQuery these sorts of things can be a bit more complicated, so I would appreciate any thoughts, opinions, links, etc.
Many thanks for your time and trouble. Best, Bridger
PS the endpoint in the example doesn't necessarily exhibit the same slow behavior that is the basis for my woes -- for some list readers it may be very fast indeed -- but I felt like it might be apropos to use this particular endpoint for an illustration.
Hi Bridger,
I'm pulling data back from an OAI-PMH endpoint that is slow; i.e. response times are ~1/minute.
I’ve tried the example you have attached (thanks). It seems to be much faster. Do you think that’s just my geographic proximity to the Konstanz-based server, or did you use a different setting for your slow tests?
- Is there a better way, using the BaseX GUI (or the command line), to get feedback on a querying process like this?
If you use the BaseX GUI and if you restart a query or run a second one, the first one will be interrupted, so I guess you’ll have similar experiences with IntelliJ. But…
Something... asynchronous, or something clever with builtin functions in the `jobs` or `xquery` modules?
You could create multiple query jobs, which run in parallel, with the jobs:eval function. They will only be interrupted if the IDE is stopped, but your IDE won’t notify you when the queries terminate, whether normally or unexpectedly.
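For illustration, a rough sketch of that approach (the script name 'harvest.xq' and the job id are placeholders, not from this thread):

```xquery
(: Start a (hypothetical) harvesting script as a detached job.
   It keeps running after this query returns. :)
let $id := jobs:eval(
  xs:anyURI('harvest.xq'),  (: a URI argument is read as a query file :)
  map { },                  (: no variable bindings :)
  map { 'id': 'harvest', 'cache': true() }
)
return $id
```

You could then poll its state with jobs:list-details('harvest') and, since caching is enabled, pick up the result with jobs:result('harvest') once it has finished.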
A promising alternative for you could be xquery:fork-join [1]. In fact we mostly use it for running multiple slow HTTP requests in parallel:
xquery:fork-join(
  for $segment in 1 to 4
  let $url := 'http://url.com/path/' || $segment
  return function() { http:send-request((), $url) }
)
The function will terminate once all parallel requests have returned a response (and the results will be returned in the expected order).
Next, you could run a script multiple times on command line and e.g. assign different arguments:
basex -bvar=1 query.xq
basex -bvar=2 query.xq
...
query.xq:

declare variable $var external;
file:write($var || '.xml', ...)
- If this can be addressed relatively directly with RESTXQ
RESTXQ can be helpful if you write web applications, or if you want to define custom REST endpoints. Such endpoints can be called multiple times as well, and will run in parallel as long as the queries don’t write to the same databases [2]. Maybe it’s overkill if you only want to run scripts in parallel, though. The more basic client/server architecture could be an alternative [3]; it can be used similarly to the command-line solution.
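To make the idea of a custom endpoint concrete, here is a minimal sketch (the module URI and path are invented for this example):

```xquery
module namespace page = 'urn:example:app';

(: A custom endpoint: GET /hello returns a small XML response.
   Module URI and path are placeholders for this sketch. :)
declare
  %rest:path('/hello')
  %rest:GET
function page:hello() {
  <response>Hello from RESTXQ!</response>
};
```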
I've attached a simple SSCCE, where the basic idea is: query an API for some data, and get a response like so:
You indicated that you are sending two requests. Is it the first one that’s slow? Does the first response create all input elements for the second request, or do you have twice the number of requests in total?
Hope this helps, Christian
[1] https://docs.basex.org/wiki/XQuery_Module#xquery:fork-join
[2] https://docs.basex.org/wiki/Transaction_Management
[3] https://docs.basex.org/wiki/Database_Server
Hi Christian,
As always, thanks for your time and help.
On Wed, Nov 24, 2021 at 12:18 PM Christian Grün christian.gruen@gmail.com wrote:
Hi Bridger,
I'm pulling data back from an OAI-PMH endpoint that is slow; i.e.
response times are ~1/minute.
I’ve tried the example you have attached (thanks). It seems to be much faster. Do you think that’s just my geographic proximity to the Konstanz-based server, or did you use a different setting for your slow tests?
I think that it is partly geographic proximity and partly that the system that has given me trouble is just incredibly slow; I'm hesitant to share the particular URL.
- Is there a better way, using the BaseX GUI (or the command line), to
get feedback on a querying process like this?
If you use the BaseX GUI and if you restart a query or run a second one, the first one will be interrupted, so I guess you’ll have similar experiences with IntelliJ. But…
Something... asynchronous, or something clever with builtin functions in
the `jobs` or `xquery` modules?
You could create multiple query jobs, which run in parallel, with the jobs:eval function. They will only be interrupted if the IDE is stopped, but your IDE won’t notify you when the queries terminate, whether normally or unexpectedly.
A promising alternative for you could be xquery:fork-join [1]. In fact we mostly use it for running multiple slow HTTP requests in parallel:
xquery:fork-join(
  for $segment in 1 to 4
  let $url := 'http://url.com/path/' || $segment
  return function() { http:send-request((), $url) }
)
The function will terminate once all parallel requests have returned a response (and the results will be returned in the expected order).
I've used `xquery:fork-join()` for something else in the past, and it is truly fantastic; as you mention here and in the documentation, it makes slower HTTP requests much easier. Maybe I'm not thinking carefully about my particular issue, but I don't know if using fork-join will help in this case. The initial query to the API returns some data; e.g.
<example>
  <things>...</things>
  <token>abc123</token>
</example>
and the following queries rely on the existence (or absence) of the value in example/token/text(). Those values are, AFAIK, possibly randomized, or at least structured differently across the various endpoints that I use, so I wouldn't know the full URLs in advance to structure a fork-join.
I'm not sure I'm capturing my problem, but thanks for letting me talk it through here.
Next, you could run a script multiple times on command line and e.g. assign different arguments:
basex -bvar=1 query.xq
basex -bvar=2 query.xq
...
query.xq:

declare variable $var external;
file:write($var || '.xml', ...)
- If this can be addressed relatively directly with RESTXQ
RESTXQ can be helpful if you write web applications, or if you want to define custom REST endpoints. Such endpoints can be called multiple times as well, and will run in parallel as long as the queries don’t write to the same databases [2]. Maybe it’s overkill if you only want to run scripts in parallel, though. The more basic client/server architecture could be an alternative [3]; it can be used similarly to the command-line solution.
I guess my thinking with regard to RESTXQ was that maybe, assuming I have the proper functions in place, I could return a new webpage while the subsequent function calls were happening in the background; e.g.
step 1: start query to a given endpoint
step 2: when the first result is returned, redirect the user (me) to a new webpage with a message (and the token; e.g. 'abc123')
step 3: using the token, launch the following query (which relies on said token)
step 4: when the result is returned, redirect the user to a new webpage with an updated message (and both tokens, first and second; e.g. 'abc123' and 'def456')
step 5: etc. until the process finishes.
Again, that's the RESTXQ that was happening in my imagination, but I'm definitely just at the imagining phase with this, so please excuse me if I'm misconstruing or just thinking about things poorly! :)
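For what it's worth, that imagined flow could be sketched roughly like this in RESTXQ. This is only a sketch: the module URI, path, endpoint URL, and element names are all placeholders, not tested code.

```xquery
module namespace oai = 'urn:example:oai';

(: Each call performs one request; if a resumptionToken is present,
   redirect back to this same endpoint with the new token, so the
   browser shows visible progress between steps. :)
declare
  %rest:path('/oai/step')
  %rest:query-param('token', '{$token}')
function oai:step($token as xs:string?) {
  let $url := 'https://example.org/oai?verb=ListRecords' ||
    (if ($token)
     then '&amp;resumptionToken=' || $token
     else '&amp;metadataPrefix=oai_dc')
  let $response := http:send-request(<http:request method='get'/>, $url)[2]
  let $next := $response//*:resumptionToken/text()
  return
    if ($next)
    then web:redirect('/oai/step', map { 'token': $next })
    else <done>Harvest finished.</done>
};
```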
I've attached a simple SSCCE, where the basic idea is: query an API for some data, and get a response like so:
You indicated that you are sending two requests. Is it the first one that’s slow? Does the first response create all input elements for the second request, or do you have twice the number of requests in total?
In my real world case, which again I hesitate to share, *all* requests are slow. In the meantime, maybe this new URL/endpoint might help illustrate. Using the following for $url and $verb (and apologies, my shell seems to mislike "&", hence the "&amp;" - that may need to change depending on your environment), the initial response (ending with 500:7603::) is very quick to the terminal, but the subsequent responses are built up and returned all at the same time.

$ basex -burl="http://dpla.lib.utk.edu/repox/OAIHandler" -bverb="?verb=ListRecords&metadataPrefix=MODS&set=utk_roth" quick-example.xq
1637777229247:utk_roth:MODS:500:7603::   (this is returned very quickly)
1637777232215:utk_roth:MODS:1000:7603::
1637777235461:utk_roth:MODS:1500:7603::
1637777238529:utk_roth:MODS:2000:7603::
1637777241271:utk_roth:MODS:2500:7603::
1637777243814:utk_roth:MODS:3000:7603::
1637777246607:utk_roth:MODS:3500:7603::
1637777249193:utk_roth:MODS:4000:7603::
1637777251921:utk_roth:MODS:4500:7603::
1637777254893:utk_roth:MODS:5000:7603::
1637777257666:utk_roth:MODS:5500:7603::
1637777260401:utk_roth:MODS:6000:7603::
1637777263461:utk_roth:MODS:6500:7603::
1637777266368:utk_roth:MODS:7000:7603::
1637777268823:utk_roth:MODS:7500:7603::
This is also shown by the serialization times, maybe:

$ ls -l /tmp/*.xml
-rw-r--r-- 1 bridger bridger 1809111 Nov 24 13:09 /tmp/2021-11-24T18:07:08Z.xml
-rw-r--r-- 1 bridger bridger 1797940 Nov 24 13:10 /tmp/2021-11-24T18:07:11Z.xml
-rw-r--r-- 1 bridger bridger 1800314 Nov 24 13:10 /tmp/2021-11-24T18:07:14Z.xml
-rw-r--r-- 1 bridger bridger 1808724 Nov 24 13:10 /tmp/2021-11-24T18:07:17Z.xml
-rw-r--r-- 1 bridger bridger 1813505 Nov 24 13:10 /tmp/2021-11-24T18:07:20Z.xml
-rw-r--r-- 1 bridger bridger 1804882 Nov 24 13:10 /tmp/2021-11-24T18:07:22Z.xml
-rw-r--r-- 1 bridger bridger 1808811 Nov 24 13:10 /tmp/2021-11-24T18:07:25Z.xml
-rw-r--r-- 1 bridger bridger 1814575 Nov 24 13:10 /tmp/2021-11-24T18:07:28Z.xml
-rw-r--r-- 1 bridger bridger 1807538 Nov 24 13:10 /tmp/2021-11-24T18:07:31Z.xml
-rw-r--r-- 1 bridger bridger 1802458 Nov 24 13:10 /tmp/2021-11-24T18:07:34Z.xml
-rw-r--r-- 1 bridger bridger 1801862 Nov 24 13:10 /tmp/2021-11-24T18:07:36Z.xml
-rw-r--r-- 1 bridger bridger 1817766 Nov 24 13:10 /tmp/2021-11-24T18:07:39Z.xml
-rw-r--r-- 1 bridger bridger 1803580 Nov 24 13:10 /tmp/2021-11-24T18:07:42Z.xml
-rw-r--r-- 1 bridger bridger 1808175 Nov 24 13:10 /tmp/2021-11-24T18:07:45Z.xml
-rw-r--r-- 1 bridger bridger 1798792 Nov 24 13:10 /tmp/2021-11-24T18:07:47Z.xml
-rw-r--r-- 1 bridger bridger  371814 Nov 24 13:10 /tmp/2021-11-24T18:07:50Z.xml
It very well may be that I'm simply asking if there's a way to pull some "procedural-ness" out of a functional paradigm and the answer is naturally, "no, sorry."
Hope this helps, Christian
Always, yes. Thanks so much for the response and giving me a space to talk through this issue.
Best, Bridger
[1] https://docs.basex.org/wiki/XQuery_Module#xquery:fork-join
[2] https://docs.basex.org/wiki/Transaction_Management
[3] https://docs.basex.org/wiki/Database_Server
Hi Bridger,
As always, thanks for your time and help.
You are welcome!
In my real world case, which again I hesitate to share, *all* requests are slow. In the meantime, maybe this new URL/endpoint might help illustrate.
It does indeed; it takes around a minute for this query to be fully evaluated.
As you’ve already indicated, we cannot run the queries in parallel, as each request requires the token from the previous request. This means that we cannot really speed up the query, but we can at least write partial results to disk once they are available. Would this already help you?
I’ve attached an updated version of your query, feel free to check it out and share your experiences with us…
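For readers following along on the list, one possible shape for such a revision is sketched below. This is an assumption, not the actual attachment: the URL and element names are placeholders, and prof:current-ms() is used so that each file gets a distinct name within a single query (fn:current-dateTime() is stable throughout one query evaluation).

```xquery
declare function local:harvest($url as xs:string) {
  let $response := http:send-request(<http:request method='get'/>, $url)[2]
  return (
    (: write this page to disk immediately, before the next request :)
    file:write(file:temp-dir() || prof:current-ms() || '.xml', $response),
    (: then follow the resumption token, if there is one :)
    let $token := $response//*:resumptionToken/text()
    where $token
    return local:harvest(
      'https://example.org/oai?verb=ListRecords&amp;resumptionToken=' || $token
    )
  )
};

local:harvest('https://example.org/oai?verb=ListRecords&amp;metadataPrefix=oai_dc')
```

This way, partial results appear on disk as each response arrives, rather than all at once at the end.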
All the best, Christian