Hello,
I am facing an issue while retrieving a large number of XML documents from a BaseX collection.
Each document (an XML file) is around 10 KB, and in the problematic case I must retrieve around 70,000 of them.
I am using Session#query(String query), then Query#more() and Query#next(), to iterate through the results of my query.
try (final Query query = l_Session.query("query")) {
    while (query.more()) {
        String xml = query.next();
    }
}
If there are more than a certain number of XML documents in the result of my query, I get an OutOfMemoryError (full stack trace in the attached file) when executing query.more().
I tested with BaseX 8.6.6 and 8.6.7, Java 8, and the VM argument -Xmx1024m.
Increasing the -Xmx value is not a solution, as I don't know the maximum amount of data I will have to retrieve in the future. What I need is a reliable way of executing such queries and iterating through the results without exhausting the heap.
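One common way to keep the heap bounded regardless of result size is to page through the results with an outer client-side loop, fetching one bounded page per query (in XQuery, e.g. via subsequence()). The sketch below shows only the paging loop; fetchPage and the in-memory list are hypothetical stand-ins for the real database call, not part of the BaseX API:

```java
import java.util.ArrayList;
import java.util.List;

public class PagedRetrieval {
    // Stand-in for the collection; in BaseX one page would come from a query like
    //   subsequence(collection('1234567')/*, $start, $pageSize)
    static final List<String> COLLECTION = new ArrayList<>();
    static {
        for (int i = 1; i <= 25; i++) COLLECTION.add("<notif n=\"" + i + "\"/>");
    }

    // Hypothetical helper: fetch one page of results (1-based start offset).
    static List<String> fetchPage(int start, int pageSize) {
        int from = Math.min(start - 1, COLLECTION.size());
        int to = Math.min(from + pageSize, COLLECTION.size());
        return COLLECTION.subList(from, to);
    }

    public static void main(String[] args) {
        int pageSize = 10;
        int start = 1;
        int processed = 0;
        while (true) {
            List<String> page = fetchPage(start, pageSize);
            for (String xml : page) {
                processed++; // process one document
            }
            if (page.size() < pageSize) break; // short page: no more results
            start += pageSize;
        }
        System.out.println(processed);
    }
}
```

In the real setting only one page of documents would cross the wire per round trip; the loop stops when a page comes back shorter than the requested size.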
I also tried using QueryProcessor and QueryProcessor#iter() instead of Session#query(String query). But is it safe to use, given that my application is multithreaded and each thread has its own session to query or add elements from/to multiple collections?
Moreover, for now all access to BaseX goes through a session, so my application can run with either an embedded BaseX or a BaseX server. If I start using QueryProcessor, it will work with embedded BaseX only, right?
I also attached a simple example showing the problem.
Any advice would be much appreciated
Thanks
Simon
Hello Simon,
I would send a query for each document, externalizing the loop in Java.
A question: could your process be written in XQuery? That way you might not run into memory overflow.
Best regards, Fabrice Etanchaud CERFrance Poitou-Charentes
From: Simon Chatelain; Sent: Friday, 22 September 2017 09:34; To: BaseX; Subject: [basex-talk] OutOfMemoryError at Query#more()
Hello Fabrice,
Thanks for the suggestion. I did try that (sending a query for each document), and it does work... sort of. Performance-wise it's really slow, even with the database fully optimized.
As for writing my process in XQuery, that's a good question. Honestly, I don't know; I am quite new to XQuery and lack the expertise.
I’ll try to give more detail about what I am trying to achieve.
In my database I have a series of XML documents which, greatly simplified, look like this:
<notif id="name1" ts="2016-01-01T08:01:05.000">
  <flag>0</flag>
</notif>
<notif id="name1" ts="2016-01-01T08:01:10.000">
  <flag>0</flag>
</notif>
<notif id="name1" ts="2016-01-01T08:01:15.000">
  <flag>0</flag>
</notif>
...
<notif id="name1" ts="2016-01-01T08:01:20.000">
  <flag>1</flag>
</notif>
<notif id="name1" ts="2016-01-01T08:01:25.000">
  <flag>0</flag>
</notif>
<notif id="name1" ts="2016-01-01T08:01:30.000">
  <flag>0</flag>
</notif>
<notif id="name1" ts="2016-01-01T08:01:35.000">
  <flag>0</flag>
</notif>
...
<notif id="name1" ts="2016-01-01T08:01:40.000">
  <flag>1</flag>
</notif>
What I need to get is:
- the first XML document (first as in smallest @ts value),
- then the next document with <flag>1</flag> (again, next in @ts order),
- then the next document with <flag>0</flag>,
- and so on…
That would be the documents highlighted in red in the example above.
Roughly only 1 out of 1000 documents has <flag>1</flag>.
I tried several approaches, but the fastest one I found is to iterate through all documents with a very simple XQuery and keep only the ones I need:
for $d in collection('1234567')/* where $d/@id = 'name1' return $d
Another approach was to first select all documents with <flag>1</flag>:
for $d in collection('1234567')/* where $d/@id = 'name1' and $d/flag = 1 return $d
then, for each of those, get the next document:
(for $d in collection('1234567')/* where $d/@id = 'name1' and $d/flag = 0 and $d/@ts > '[ts of previous document]' return $d)[1]
Or select the first document,
(for $d in collection('1234567')/* where $d/@id = 'name1' return $d)[1]
then query the next,
(for $d in collection('1234567')/* where $d/@id = 'name1' and $d/flag = 1 and $d/@ts > '[ts of previous document]' return $d)[1]
and the next,
(for $d in collection('1234567')/* where $d/@id = 'name1' and $d/flag = 0 and $d/@ts > '[ts of previous document]' return $d)[1]
and so on.
But none of those is as fast as the first one, and with the first one I hit this OutOfMemory issue.
So if there is a way to rewrite that whole process in XQuery, that could be an option worth trying; or if there is a more efficient way to write the query
(for $d in collection('1234567')/* where $d/@id = 'name1' and $d/flag = 0 and $d/@ts > '[ts of previous document]' return $d)[1]
that could also solve my problem.
Regards
Simon
Hello again, Simon,
I think that tumbling windows could be of great help in your use case.
Let's consider the following test database:
1. Creation:

db:create('test')

2. Document insertion (in descending @ts order, to check that the solution works regardless of the documents' physical order):

for $i in 1 to 100
let $ts := current-dateTime() + xs:dayTimeDuration('PT' || (100 - $i + 1) || 'S')
let $flag := random:integer(2)
return db:add(
  'test',
  <notif id="name1" ts="{$ts}"><flag>{$flag}</flag></notif>,
  'notif' || $i || '.xml')

Then the following query should do the job:

for tumbling window $i in sort(
  db:open('test'),
  (),
  function($doc) { $doc/notif/@ts/data() })
start $s when fn:true()
end $e next $n when $e/notif/flag != $n/notif/flag
return $i[1]

It iterates over the documents sorted by ascending @ts and outputs the first document of each run of identical flag values.
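For readers new to the window clause, the same grouping can be sketched imperatively: sort by timestamp, then keep the first item of every run of equal flag values. A minimal, self-contained Java sketch (plain value pairs stand in for the XML documents):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class FlagRuns {
    record Notif(String ts, int flag) {}

    // Keep the first notification of each run of equal flags, in ascending ts order.
    static List<Notif> firstOfEachRun(List<Notif> notifs) {
        List<Notif> sorted = new ArrayList<>(notifs);
        sorted.sort(Comparator.comparing(Notif::ts)); // ascending @ts, like sort() in the query
        List<Notif> result = new ArrayList<>();
        Integer prevFlag = null;
        for (Notif n : sorted) {
            if (prevFlag == null || n.flag != prevFlag) { // run boundary: flag changed
                result.add(n);
            }
            prevFlag = n.flag;
        }
        return result;
    }

    public static void main(String[] args) {
        List<Notif> notifs = List.of(
            new Notif("2016-01-01T08:01:20.000", 1), // stored out of ts order on purpose
            new Notif("2016-01-01T08:01:05.000", 0),
            new Notif("2016-01-01T08:01:10.000", 0),
            new Notif("2016-01-01T08:01:25.000", 0),
            new Notif("2016-01-01T08:01:15.000", 0));
        for (Notif n : firstOfEachRun(notifs)) {
            System.out.println(n.ts + " flag=" + n.flag);
        }
    }
}
```

On the small out-of-order sample, main prints the three run-leading notifications in ascending @ts order.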
Hoping I did it right. Best regards,
Fabrice CERFrance Poitou-Charentes
Hello,
Excellent, thank you very much. It does work, and quite fast, it seems.
Now I'll go and read some documentation on XQuery...
Thanks again, and have a good weekend.
Simon
Be warned: by using XQuery and BaseX, you are going to see your coworkers' fear of your new gain in productivity, and your management's fear of such a powerful and underrated technology! ;-)
Best regards, Fabrice