Hi all
Here is some code that gradually eats up memory, whether it is run in the GUI or as a command. All it does is create temporary collections from folders and write them to file.
Is there a simple way to keep this code from eating up memory? It runs out of memory (set to 12 GB for the command, 18 GB in the GUI) after 300 folders or so, and it has to process 20 000 of them.
Best,
Lars G Johnsen
Norwegian National Library
Here is the actual code:
(: process the list of folders :)
for $collections in file:list($digibooks)
let $html := $htmlfiles || substring-before($collections, "_ocr") || ".html"
return
  (: the code is rerun, so check whether the file already exists :)
  if (not(file:exists($html))) then
    try {
      (: create a temporary collection of the files and write the result to disk :)
      file:write($html, db:digibok-to-html(collection($digibooks || $collections)))
    } catch * { $err:code }
  else ()
Hi Lars,
Here is some background information on the reported behavior (apologies in advance if this is already known to you): the functional semantics of XQuery requires that repeated calls to fn:doc and fn:collection return the same documents. This can be demonstrated, for example, by the following query:
doc('x.xml') is doc('x.xml')
As it is difficult to predict which of the opened documents will be requested again in the same query, they are all kept in main memory until query evaluation is completed.
However, things are different with functions like fetch:xml [1]. You may need to tweak your query a little, because this function always returns single XML documents.
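To make the contrast concrete, here is a small illustrative sketch (the file name x.xml is just a placeholder): fn:doc is required to return the identical node on every call within a query, whereas fetch:xml parses the file anew each time, so its results need not be cached:

```xquery
(: fn:doc must return the same document node on repeated calls,
   so the parsed document stays in memory for the whole query: :)
doc('x.xml') is doc('x.xml')              (: true :)

(: fetch:xml parses the file afresh on each call and returns a new
   main-memory fragment, which can be garbage-collected afterwards: :)
fetch:xml('x.xml') is fetch:xml('x.xml')  (: typically false :)
```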
Does this help? Christian
[1] http://docs.basex.org/wiki/Fetch_Module#fetch:xml
On Fri, Mar 27, 2015 at 10:41 AM, Lars Johnsen yoonsen@gmail.com wrote:
Hi Christian, and thanks a lot for the pointer to fetch:xml - it seems to do the trick! Now, a little recoding, and it should be working.
Best, Lars
2015-03-27 10:48 GMT+01:00 Christian Grün christian.gruen@gmail.com:
... and thanks for the background on XQuery semantics!
The tweak was quite simple: the collection() function is emulated in the script, with the db: prefix bound to a private namespace:
declare function db:collection($folder) {
  for $file in file:list($folder)
  return fetch:xml($folder || $file)
};
Now memory stays below a couple of GB, fluctuating with folder size.
Cheers, Lars
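For completeness, a hypothetical sketch of how the full rewritten loop might look with the emulated collection function plugged into the original query ($digibooks, $htmlfiles, and db:digibok-to-html are assumed to be defined as in the original post; the if/else has been restructured into a where clause):

```xquery
declare function db:collection($folder) {
  for $file in file:list($folder)
  return fetch:xml($folder || $file)
};

for $collections in file:list($digibooks)
let $html := $htmlfiles || substring-before($collections, "_ocr") || ".html"
(: skip folders whose output already exists :)
where not(file:exists($html))
return try {
  file:write($html, db:digibok-to-html(db:collection($digibooks || $collections)))
} catch * { $err:code }
```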
2015-03-27 10:58 GMT+01:00 Lars Johnsen yoonsen@gmail.com:
Perfect!
On Fri, Mar 27, 2015 at 11:20 AM, Lars Johnsen yoonsen@gmail.com wrote:
basex-talk@mailman.uni-konstanz.de