The following code generates the error "Stack Overflow: try tail recursion?"
The code reads in bibliographic data using OAI-PMH and updates a database for each chunk of data. With OAI-PMH, only part of the data is returned per request, so the server includes a resumption token whenever more data is available.
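For readers unfamiliar with the protocol, a ListRecords response looks roughly like this (element names follow the OAI-PMH spec; the token value and record content below are invented placeholders):

<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2016-05-11T12:00:00Z</responseDate>
  <request verb="ListRecords">http://example.org/oai</request>
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata><!-- one bibliographic record --></metadata>
    </record>
    <!-- more records ... -->
    <resumptionToken cursor="0" completeListSize="100000">abc123</resumptionToken>
  </ListRecords>
</OAI-PMH>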
The XQuery function making the requests is implemented recursively, with a database update (see the last two lines) preceding each recursive call. Is it db:add() that causes the stack overflow? The recursive call cannot be moved any further towards the end!
declare %updating function local:getResumption($token) {
  if (empty($token)) then ()
  else
    let $http-request := http:send-request($http-option, $URL || $token)
    let $result :=
      if ($http-request instance of node())
      then $http-request
      else <http-err>{$http-request}</http-err>
    let $resume := $result//oai:resumptionToken/text()
    return (
      db:add($database, element chunk { $result//oai:metadata }, $path),
      local:getResumption($resume)
    )
};
Best, Lars
Hello Lars,
If you have deep recursion, Java will at some point hit its stack size limit. Have you already tried simply increasing the Java stack size, e.g. by passing the parameter -Xss2m to the JVM?
Cheers Dirk
The basexgui startup file now contains:
BASEX_JVM="-Xmx8g -Xss4m $BASEX_JVM"
It got the script a long way, but eventually it gave up. It works fine on smaller datasets, though.
Maybe there is some other way to get the data over. I'll have a talk with the guys providing the OAI-endpoint.
Thanks for the pointer to -Xss!
Lars
Hi Lars
I have done some OAI-PMH fetches but never ran into stack-overflow issues. One workaround you could do on your part is to partition your query with date ranges, using the query parameters "from" and "until" on your initial call to the endpoint (see the sketch below).
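A rough sketch of this idea in XQuery; the endpoint URL, the date slices and the variable names are illustrative placeholders, not part of Johan's message:

let $base := "oai-URL?verb=ListRecords&amp;metadataPrefix=marc21"
let $slices := ("2015-01-01", "2015-07-01", "2016-01-01")
for $i in 1 to count($slices) - 1
let $request := $base || "&amp;from=" || $slices[$i] || "&amp;until=" || $slices[$i + 1]
return http:send-request(<http:request method='get'/>, $request)

Each slice may of course still hand out its own resumption tokens, so the per-slice harvesting logic stays the same; the date ranges just keep each individual harvest small.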
Regards, Johan Mörén
Hello,
If your case allows using external tools for harvesting, I can highly recommend metha (https://github.com/miku/metha), which is a fairly full-featured command-line OAI-PMH harvester.
Best regards,
Matti L.
Thanks Johan and Matti for useful suggestions.
Cutting down on the chunks seems to be a viable alternative.
It would have been nice, though, to have a robust harvester in XQuery that could take on anything, although the recursive version works fine as long as the dataset consists of only a couple of thousand entries.
Best, Lars
Hello Lars,
Just a thought (and really just a pointer; I am not much of a purely functional guy, and I feel like I am missing something obvious...): maybe you could rewrite the recursive approach using higher-order functions. Consider a query like the following:
hof:scan-left(
  1 to 100,
  map { "token": "starttoken" },
  function($result, $index) {
    let $req := http:send-request(
      <http:request method="get"/>,
      "http://google.com?q=" || $result("token")
    )
    return map {
      "result": $req,
      "token": $req//http:header[@name = "Date"]/@value/data()
    }
  }
)
It will issue 100 requests to Google, each using a specific token taken from the previous response (in this case I used the Date header). The output is a sequence of the map entries, and in a subsequent step you could return only the actual result values.
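That subsequent step could look something like this; $steps is an assumed variable holding the sequence of maps produced by hof:scan-left above, it is not part of Dirk's snippet:

(: keep only the actual HTTP responses from the scan-left output :)
for $step in $steps
return $step("result")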
Best regards, Dirk
Thanks for the pointer!

The code has been rewritten using hof:until() and tested against a particular set at our national provider of library data.

The script still accumulates data, so it will probably still run into memory trouble with larger datasets, but the stack overflow should be taken care of.

For anyone interested, the code is attached below, using hof:until() as the higher-order function. To make it work, fill in the URLs for a chosen OAI endpoint and maybe change some of the request parameters; this one fetches MARC21 records and uses sets. Some error checking could also be added.
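For anyone who has not used it: hof:until($predicate, $function, $start) keeps applying $function until $predicate is satisfied and returns only the final value. A tiny standalone example (the numbers are arbitrary):

hof:until(
  function($x) { $x >= 1000 },
  function($x) { $x * 2 },
  1
)
(: returns 1024 :)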
Cheers, Lars
declare namespace oai = "http://www.openarchives.org/OAI/2.0/";
(: URL for resumption tokens :)
declare variable $URL := "oai-URL?verb=ListRecords&resumptionToken=";

(: URL for initial request :)
declare variable $URL2 := "oai-URL?verb=ListRecords&metadataPrefix=marc21&set=";

(: Variable for OAI-set - if not used, remove "set=" in URL2 :)
declare variable $oai-set := "aset";

(: basex http :)
declare variable $http-option := <http:request method='get' />;
(: ------
Fetch data from the OAI endpoint using a start map containing the resumption token and the first set of data. The map has two keys, 'resume' and 'chunk', where 'chunk' is an accumulator holding the data from the current and all previous requests. hof:until() returns only the final value, not an aggregated list of maps, so the data has to be collected inside the map.
------:)
declare function local:getResumption($startmap) {
  let $token := map:get($startmap, 'resume')
  return
    if (empty($token)) then $startmap
    else
      let $http-request := http:send-request($http-option, $URL || $token)
      let $result :=
        if ($http-request instance of node())
        then $http-request
        else <http-err>{$http-request}</http-err>
      return map {
        'resume': $result//oai:resumptionToken/text(),
        'chunk': ( map:get($startmap, 'chunk'), $result//oai:metadata )
      }
};
(: Issue initial request :)
let $first := http:send-request($http-option, $URL2 || $oai-set)
(: Create startmap :)
let $init := map { 'chunk': $first//oai:metadata, 'resume': $first//oai:resumptionToken/text() }
let $oai := hof:until(
function($x) { empty(map:get($x, 'resume')) },
function($y) { local:getResumption($y) }, $init )
(: Amend with additional code like db:add() or file:write() here :)
return element oai {map:get($oai, 'chunk')}
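As a hedged illustration of the comment above, the final return could be replaced with a database update instead of constructing an element; $database and $path are assumed names that are not declared in this script:

(: hypothetical variant of the last line: store the harvested chunk instead of returning it :)
return db:add($database, element oai { map:get($oai, 'chunk') }, $path)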
Dirk, Johan, Matti and others
Just an update on the OAI harvester. Here is a rewritten harvester that should work well on archives of any size. The parameters need to be adjusted for the actual request (dates and the like; this one uses sets), but the logic with higher-order functions (the hof namespace in BaseX) works as it should. Instead of adding to a database it writes the results to a file, so it does not require much memory.
Best, Lars
declare namespace oai = "http://www.openarchives.org/OAI/2.0/";
(: *** URL for initial request - add suitable parameters for subquerying - this one uses set;
   if sets are not used, just delete references to it :)
declare variable $URL2 := "_OAI-URL_?verb=ListRecords&metadataPrefix=marc21&set=";

(: *** URL for resumption tokens: fill in a suitable OAI-URL - this URL need not be changed :)
declare variable $URL := "_OAI-URL_?verb=ListRecords&resumptionToken=";

(: *** basex http :)
declare variable $http-option := <http:request method='get' />;
(:**********************
Function that fetches data for a resumption token, appends the result to a file and returns the next token
**********************:)
declare function local:getResumption($file, $token) {
  if (empty($token)) then ()
  else
    let $http-request := http:send-request($http-option, $URL || $token)
    let $result :=
      if ($http-request instance of node())
      then $http-request
      else <http-err>{$http-request}</http-err>
    return (
      file:append($file, $result//oai:metadata),
      $result//oai:resumptionToken/text()
    )
};
(:************
Define oai set and file for storage
*************:)
let $file := 'file.xml'
let $oai-set := "aset"
(:*************
Get the first batch of data and retrieve the resumption token. If sets are not used, just remove the expression appending $oai-set. This is the place to build up a more complex OAI query if needed, by manipulating the variable $URL2 and the joining of parameters.
***************:)
let $first := http:send-request($http-option, $URL2 || $oai-set)
let $init := $first//oai:resumptionToken/text()
(:**************
write data to disk and call hof:until(), quitting on empty resumption token
****************:)
return (
  file:write-text($file, "<root>"),            (: insert start tag of root element :)
  file:append($file, $first//oai:metadata),    (: write initial sequence of elements :)
  hof:until(                                   (: call hof:until() :)
    function($x) { empty($x) },
    function($y) { local:getResumption($file, $y) },
    $init
  ),
  file:append-text($file, "</root>")           (: insert end tag of root element :)
)
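Since the harvest now ends up in a plain file, a natural follow-up step (not part of the script above, and the database name is just an example) would be to load that file into a database afterwards:

(: create a database from the harvested file in a separate, cheap step :)
db:create("harvest", 'file.xml')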