Hello,
I have a use case where I have to extract lots of information from each XML in each DB: something like the attribute values of most of the nodes in an XML. For such data, queries go Out Of Memory with the exception below. I am giving it ~12 GB of RAM on an i7 processor. I can't really complain here, since I am most definitely asking for loads of data, but is there any way I can retrieve this kind of data successfully?
mansi-veracode:BigData mansiadmin$ ~/Downloads/basex/bin/basexhttp
BaseX 8.0 beta b45c1e2 [Server]
Server was started (port: 1984)
HTTP Server was started (port: 8984)
Exception in thread "qtp2068921630-18" java.lang.OutOfMemoryError: Java heap space
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.addConditionWaiter(AbstractQueuedSynchronizer.java:1857)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2073)
    at org.eclipse.jetty.util.BlockingArrayQueue.poll(BlockingArrayQueue.java:342)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.idleJobPoll(QueuedThreadPool.java:526)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.access$600(QueuedThreadPool.java:44)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:572)
    at java.lang.Thread.run(Thread.java:744)
Hi Mansi,
I think we need more information on the queries that are causing the problems.
Best, Christian
This would need a lot of details, so bear with me below:
Briefly, my XML files look like:
<A name="">
  <B name=""> <C name=""> <D name="">
    <E name=""/>
An <A> can contain <B>, <C>, or <D>, and B, C, or D can contain E. We have thousands of such XML files (currently 3000 in my test data set), averaging 50 MB in size. It's tons of data! Currently, my database is ~18 GB.
Query: /A/*//E/@name/string()
This query was going OOM within a few minutes.
I tried a few ways of whitelisting with a contains clause to truncate the result set. That didn't help either. So now I am out of ideas. This is with the JVM given 10 GB of dedicated memory.
Once the above query works and doesn't go Out Of Memory, I will also need the corresponding file names:
XYZ.xml //E/@name
PQR.xml //E/@name
Let me know if you need more details to appreciate the issue. - Mansi
Hi Mansi,
Here you have a natural partition of your data: the files you ingested. So my first suggestion would be to query your data on a per-file basis:
for $doc in db:open('your_collection_name')
let $file-name := db:path($doc)
return file:write(
  $file-name,
  <names> {
    for $name in $doc//E/@name/data()
    return <name>{$name}</name>
  } </names>
)
Is it for indexing?
Hope it helps,
Best regards,
Fabrice Etanchaud Questel/Orbit
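A note if you try this: the file module resolves relative paths against the server's working directory, so an explicit output folder may be safer. A variant writing one plain-text file per document could look like this (a sketch; the /tmp/names/ output path and the flattened file names are assumptions):

let $out := '/tmp/names/'
return (
  file:create-dir($out),
  for $doc in db:open('your_collection_name')
  let $target := $out || replace(db:path($doc), '/', '_') || '.txt'
  return file:write-text-lines($target, $doc//E/@name ! string(.))
)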
Interesting idea. I thought of using DB partitioning, but didn't pursue it further, mainly due to the thought process below.
Currently, I am ingesting ~3000 XML files, storing ~50 XML files per DB, and this will grow quickly. So the approach below would lead to ~3000 more files (and increasing), which increases I/O operations considerably for further pre-processing.
However, I don't really care if the process takes a few minutes to a few hours (as long as it's not day(s) ;)). Given the situation and my options, I will surely try this.
The database is currently indexed at the attribute level, as that's what I will be querying the most. Do you think I should do anything differently?
Thanks, - Mansi
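As a quick sanity check on that index, BaseX's Index Module can list what the attribute index actually contains. A sketch, assuming a database named 'mydb':

(: list the first 10 attribute index entries starting with 'pqr' :)
index:attributes('mydb', 'pqr')[position() le 10]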
Hi Mansi --
Just out of habitual paranoia about the performance of *// in XPath, I might try replacing /A/*//E/@name/string() with E[ancestor::A[not(parent::*)]]/@name and not worry about stringifying the resulting sequence of attribute nodes until the next step, whatever that might be. It might not matter to the optimizer at all, but it might.
Also, from your description of the data, do you care where the tree is rooted or just that you've got an E? If it _is_ just an E, what you want might look like
for $x in E/@name return (string($x), tokenize(base-uri($x),'/')[last()])
Do you need to worry about cases where @name is empty?
-- Graydon
The solution depends on the usage you will have of your extraction. May I ask what your extraction is for?
Best regards, Fabrice
Briefly explaining: I am trying to extract these values per XML file (where the .xml file names are IDs), to map them to their corresponding values.
Imagine you have hundreds of customers, and each customer uses/needs thousands of different "@name" values. These @name values would be similar across customers, but some customers would be using some values and other customers others. I am trying to collect all this information and find which @name is used by the most customers, and so on and so forth. There are a few such use cases; this one is the most generic.
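For that kind of statistic, a grouped query may avoid materializing the full value list on the client. A sketch, assuming a database named 'mydb' and that each file stands for one customer:

for $name in db:open('mydb')//E/@name
let $value := string($name)
group by $value
let $files := distinct-values($name ! db:path(.))
order by count($files) descending
return $value || ' ' || count($files)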
Hi Mansi --
If you use
for $x in E/@name[starts-with(.,'pqr')] return (tokenize(base-uri($x),'/')[last()], string($x))
for each of the 150-odd values (you may want to generate the query :)), it will more likely work. It's not just the size of the database; it's the size of the result, too. Keeping the individual query results small gives the optimizer a chance to recognize it's done with some data and free up some memory. I've had to work pretty hard before at keeping the intermediate stages small enough to fit in memory, for queries where simple queries on a ~4 GB database were quite fast; it was the large intermediate data structures that would exhaust the available memory.
-- Graydon
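If the per-prefix queries are run one at a time, the prefix can also be passed in as an external variable instead of generating 150 query files. A sketch (the $prefix variable and its default are assumptions):

declare variable $prefix external := 'pqr';
for $name in //E/@name[starts-with(., $prefix)]
return tokenize(base-uri($name), '/')[last()] || ',' || string($name)

With BaseX's command-line client, the variable can then be bound per value, e.g. via the -b prefix=... flag.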
Hi Mansi,
> Once the above query works and doesn't go Out Of Memory, I will also need the corresponding file names:
Sorry, I skipped this one. Here is one way to do it:
declare option output:item-separator "&#10;";
for $db in db:open('....')
let $path := db:path($db)
for $name in $db//E/@name
return $path || out:tab() || $name
I was surprised to hear that you are getting OOM errors on the command line, because the query you mentioned should then be evaluated in a streaming fashion (i.e., it should require very little, constant memory).
Could you try the above query? If it fails, could you possibly send me the query plan? On the command line, it can be retrieved via the -x flag.
I just remembered that you have been using xquery:eval, right? My guess is that the problem occurs in combination with this function, because it may require all results to be cached before they are sent back to the client. Do you think you can alternatively put your queries into files, or do you need more flexibility?
Christian
> do you need more flexibility?
To partially answer my own question, it might be interesting for you to hear that you have various ways of specifying queries via REST [1]:
* You can store your query server-side and use the ?run=... argument to evaluate this query file.
* You can send a POST request which contains the query to be evaluated.
In both cases, intermediate results won't be cached, but directly streamed back to the client.
Hope this helps, Christian
[1] http://docs.basex.org/wiki/REST
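For reference, the POST variant wraps the query in a small XML envelope, along these lines (a sketch following the REST documentation; the database name 'mydb' is a placeholder):

<query xmlns="http://basex.org/rest">
  <text><![CDATA[
    for $db in db:open('mydb')
    for $name in $db//E/@name
    return db:path($name) || ' ' || string($name)
  ]]></text>
</query>

Saved as query.xml, it could be sent with something like: curl -s -H 'Content-Type: application/xml' -d @query.xml http://localhost:8984/rest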
I would be doing tons of post-processing. I never use the UI; I either use REST through cURL or the command line.
I would basically need data in the below format:
XML File Name, @name
I am trying to whitelist, picking up values only for starts-with(@name, "pqr"), where "pqr" is one of a list of 150-odd values.
My file names are essentially IDs/keys, which I need to map further to some values using sqlite, and maybe group by them, etc.
So basically I am trying to visualize some data based on which XML files it occurs in. So yes, count(<query>) would be fine, but it won't serve much purpose, since I still need the value "pqr".
- Mansi
On Thu, Nov 6, 2014 at 11:19 AM, Christian Grün christian.gruen@gmail.com wrote:
> > Query: /A/*//E/@name/string()
>
> In the GUI, all results will be cached, so you could think about switching to command line.
> Do you really need to output all results, or do you do some further processing with the intermediate results?
> For example, the query "count(/A/*//E/@name/string())" will probably run without getting stuck.
Hi Mansi,
From what I can see, for each pqr value you could use db:attribute-range to retrieve all the file names, then group by/count to obtain statistics. You could also create a new collection from an extraction of only the data you need, changing @name into an element, and use full-text fuzzy matching.
Hoping it helps
Regards,
Fabrice
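A sketch of that index lookup, assuming db:attribute-range is available in your version and a database named 'mydb' (the 'pqr' prefix is a placeholder):

for $name in db:attribute-range('mydb', 'pqr', 'pqs', 'name')
(: 'pqs' is just an upper bound above every 'pqr...' value :)
where starts-with($name, 'pqr')
group by $file := db:path($name)
return $file || ' ' || count($name)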
This email chain is extremely helpful. Thanks a ton, guys. Certainly some of the most helpful folks here :)
I have to try a lot of these suggestions, but currently I am being pulled into something else, so I have to pause for the time being.
Will get back to this email thread after trying a few things, with my relevant observations.
- Mansi
Hello,
Wanted to get back to this email chain and share my experience.
I got this running beautifully (including all post-processing of results) using the below command:
curl -ig 'http://localhost:8984/rest?run=get_query.xq&n=/Archives/*/descendant::D/...)' | cut -d: -f1 | cut -d. -f1-3 | sort | uniq -c | sort -n -r
I am using the BaseX 8.0 beta 763cc93 build. Running this on an i7 2.7 GHz MBP, giving 8 GB to the basexhttp process, it took around 34 minutes on 41 GB of data. I think a lot of the time went into post-processing (sorting) the result set rather than actually extracting the results from the BaseX DB.
When I tried a similar query on a much smaller database (3 GB) on a much more powerful Amazon instance, giving 20 GB of RAM to the basexhttp process, I got results including post-processing within 4 minutes.
Thanks for all your inputs guys,
Keep BaseXing... !!! - Mansi
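Since the shell-side sorting seemed to dominate the runtime, it might be worth pushing that aggregation into the query itself. A sketch that counts hits per document path and sorts inside BaseX (the database name 'mydb' is an assumption):

for $name in db:open('mydb')//E/@name
group by $path := db:path($name)
order by count($name) descending
return count($name) || ' ' || $path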
For my uses, "string()" seems to be extremely slow at processing big data; you should try without it.
Best regards
Florent
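In other words, return the attribute values directly and defer any stringification. A minimal sketch of the same extraction, just without the final string() step:

(: atomizes the attributes instead of calling string() on each :)
/A/*//E/@name/data()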
On Tue, Dec 30, 2014 at 2:38 PM, Mansi Sheth mansi.sheth@gmail.com wrote:
Hello,
Wanted to get back to this email chain and share my experience.
I got this running beautifully (including all post processing of results), using the below command:
curl -ig ' http://localhost:8984/rest?run=get_query.xq&n=/Archives/*/descendant::D/...)' | cut -d: -f1 | cut -d. -f1-3 | sort | uniq -c | sort -n -r
I am using the BaseX 8.0 beta 763cc93 build, running on an i7 2.7 GHz MBP and giving 8 GB to the basexhttp process. It took around 34 minutes on 41 GB of data. I think a lot of the time went into post-processing (sorting) the result set, rather than actually extracting the results from the BaseX DB.
When I tried a similar query on a much smaller database (3 GB) on a much more powerful Amazon instance, giving 20 GB of RAM to the basexhttp process, I got results, including post-processing, within 4 minutes.
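The stored query get_query.xq itself is not shown in this thread; a hypothetical sketch of such a script, which evaluates the path passed in $n and prefixes each value with its source file so that the cut/sort/uniq pipeline can count per file, might look like this:

    (: hypothetical get_query.xq: REST binds &n=... to $n; each output
       line is "file-path:value" for the shell pipeline to split on ':' :)
    declare variable $n external;
    for $v in xquery:eval($n, map { '': db:open('bigdata') })
    return db:path($v) || ':' || string($v)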
Thanks for all your inputs, guys.
Keep BaseXing... !!!
- Mansi
On Fri, Nov 7, 2014 at 12:25 PM, Mansi Sheth mansi.sheth@gmail.com wrote:
This email chain is extremely helpful. Thanks a ton, guys; certainly some of the most helpful folks here :)
I have to try a lot of these suggestions, but currently I am being pulled into something else, so I have to pause for the time being.
Will get back to this email thread after trying a few things, with my relevant observations.
- Mansi
On Fri, Nov 7, 2014 at 3:48 AM, Fabrice Etanchaud <fetanchaud@questel.com> wrote:
Hi Mansi,
From what I can see, for each pqr value you could use db:attribute-range to retrieve all the file names, then group by/count to obtain statistics.
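A minimal sketch of that idea, assuming a database named "bigdata" and the prefix "pqr"; the ['pqr', 'pqs'] range exploits the attribute index for a starts-with lookup:

    (: index-backed lookup of @name values starting with 'pqr',
       grouped and counted by source file :)
    for $a in db:attribute-range('bigdata', 'pqr', 'pqs', 'name')
                                [starts-with(., 'pqr')]
    let $file := db:path($a)
    group by $file
    return $file || ': ' || count($a)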
You could also create a new collection from an extraction of only the data you need, changing @name into an element, and use full-text fuzzy matching.
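Again just a sketch, under assumed names, in two separate queries (the first is updating, so it has to run on its own):

    (: step 1: extract the @name values into a new database :)
    db:create('names', <names>{
      for $a in db:open('bigdata')/A/*//E/@name
      return <name file="{ db:path($a) }">{ string($a) }</name>
    }</names>, 'names.xml')

    (: step 2: fuzzy full-text search on the extracted data :)
    db:open('names')//name[text() contains text 'pqr' using fuzzy]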
Hoping it helps
Best regards
Fabrice
From: basex-talk-bounces@mailman.uni-konstanz.de [mailto:basex-talk-bounces@mailman.uni-konstanz.de] On behalf of Mansi Sheth
Sent: Thursday, November 6, 2014, 20:55
To: Christian Grün
Cc: BaseX
Subject: Re: [basex-talk] Out Of Memory
I would be doing tons of post-processing. I never use the GUI; I use either REST through cURL or the command line.
I would basically need data in the below format:
XML File Name, @name
I am trying to whitelist, picking up values only for starts-with(@name, "pqr"), where "pqr" is one of a list of 150-odd values.
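A sketch of that kind of whitelist filter; the prefix list below is a placeholder for the real 150 values:

    (: keep only @name values starting with a whitelisted prefix
       and emit "file, value" lines :)
    let $prefixes := ('pqr', 'abc', 'xyz')
    for $a in /A/*//E/@name
    where some $p in $prefixes satisfies starts-with($a, $p)
    return db:path($a) || ', ' || string($a)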
Hi Mansi,
curl -ig 'http://localhost:8984/rest?run=get_query.xq&n=/Archives/*/descendant::D/...)' | cut -d: -f1 | cut -d. -f1-3 | sort | uniq -c | sort -n -r
I guess you will get your result much faster by avoiding the post processing steps and doing everything with XQuery instead:
(for $n in distinct-values(/Archives/descendant::D/@name) order by ..... descending group by ... return ...)[position() = 1 to ......]
Hope this helps, Christian
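For illustration only, one hypothetical way such a skeleton could be fleshed out; the grouping key, sort criterion, and top-10 cutoff are all assumptions:

    (: count occurrences per distinct @name and keep the most frequent :)
    (for $n in db:open('bigdata')/Archives/descendant::D/@name
     group by $v := string($n)
     order by count($n) descending
     return $v || ' ' || count($n)
    )[position() = 1 to 10]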