Efficient query for duplicates
I'm running out of memory (1.5 GB allocated) when querying for duplicate node values over a fairly flat XML database of approximately 450 MB. Can anyone suggest a more memory-efficient approach to framing this query than iterating over distinct-values as I do below? I'm hoping that there are some Basex tips and tricks to help out here. for $val in distinct-values(/dataset/item/pii) let $cnt := count(/dataset/item/pii[. = $val]) return if ($cnt > 1) then <duplicate>{$val}</duplicate> else null Thanks in advance, Constantine ________________________________ Elsevier B.V. Registered Office: Radarweg 29, 1043 NX Amsterdam, The Netherlands, Registration No. 33156677, Registered in The Netherlands.
Hello Constantine, your query very much looks like you want to group by each <pii/> element. Using XQuery 3.0 there even is a construct aimed at that and I would guess it should also preserve memory. It is called group by and you can find some more information at https://docs.basex.org/wiki/XQuery_3.0#Group_By or in the spec. It will look something like this: for $x in /dataset/item/pii let $val := $x/string() where count($x) > 1 return <duplicate>{$val}</duplicate> By the way, it looks you are using null like a NULL statement in other languages. Such a statement does not exist in XQuery. You might want to return an empty set () instead. Cheers, Dirk On 09/04/14 13:37, Hondros, Constantine (ELS-AMS) wrote:
I'm running out of memory (1.5 GB allocated) when querying for duplicate node values over a fairly flat XML database of approximately 450 MB.
Can anyone suggest a more memory-efficient approach to framing this query than iterating over distinct-values as I do below? I'm hoping that there are some Basex tips and tricks to help out here.
for $val in distinct-values(/dataset/item/pii) let $cnt := count(/dataset/item/pii[. = $val]) return if ($cnt > 1) then <duplicate>{$val}</duplicate> else null
Thanks in advance, Constantine
________________________________
Elsevier B.V. Registered Office: Radarweg 29, 1043 NX Amsterdam, The Netherlands, Registration No. 33156677, Registered in The Netherlands.
-- Dirk Kirsten, BaseX GmbH, http://basex.org |-- Firmensitz: Blarerstrasse 56, 78462 Konstanz |-- Registergericht Freiburg, HRB: 708285, Geschäftsführer: | Dr. Christian Grün, Dr. Alexander Holupirek, Michael Seiferle `-- Phone: 0049 7531 28 28 676, Fax: 0049 7531 20 05 22
Thanks Dirk. A little addition: 'let' should probably be replaced with 'group': for $x in /dataset/item/pii group by $val := $x/string() where count($x) > 1 return <duplicate>{ $val }</duplicate> Christian On Wed, Apr 9, 2014 at 1:59 PM, Dirk Kirsten <dk@basex.org> wrote:
Hello Constantine,
your query very much looks like you want to group by each <pii/> element. Using XQuery 3.0 there even is a construct aimed at that and I would guess it should also preserve memory. It is called group by and you can find some more information at https://docs.basex.org/wiki/XQuery_3.0#Group_By or in the spec. It will look something like this:
for $x in /dataset/item/pii let $val := $x/string() where count($x) > 1 return <duplicate>{$val}</duplicate>
By the way, it looks you are using null like a NULL statement in other languages. Such a statement does not exist in XQuery. You might want to return an empty set () instead.
Cheers, Dirk
On 09/04/14 13:37, Hondros, Constantine (ELS-AMS) wrote:
I'm running out of memory (1.5 GB allocated) when querying for duplicate node values over a fairly flat XML database of approximately 450 MB.
Can anyone suggest a more memory-efficient approach to framing this query than iterating over distinct-values as I do below? I'm hoping that there are some Basex tips and tricks to help out here.
for $val in distinct-values(/dataset/item/pii) let $cnt := count(/dataset/item/pii[. = $val]) return if ($cnt > 1) then <duplicate>{$val}</duplicate> else null
Thanks in advance, Constantine
________________________________
Elsevier B.V. Registered Office: Radarweg 29, 1043 NX Amsterdam, The Netherlands, Registration No. 33156677, Registered in The Netherlands.
-- Dirk Kirsten, BaseX GmbH, http://basex.org |-- Firmensitz: Blarerstrasse 56, 78462 Konstanz |-- Registergericht Freiburg, HRB: 708285, Geschäftsführer: | Dr. Christian Grün, Dr. Alexander Holupirek, Michael Seiferle `-- Phone: 0049 7531 28 28 676, Fax: 0049 7531 20 05 22
Thanks guys, using group by makes this query run beautifully. Just out of interest, if I was using a pre-XQuery 3.0 processor, would there have been any other approach than iterating over distinct-values? Thanks also for the pointer about using 'null'. The Basex query editor seemed to approve it when I tried it once, and the habit stuck ;-) Cheers, Constantine -----Original Message----- From: Christian Grün [mailto:christian.gruen@gmail.com] Sent: 09 April 2014 14:51 To: Dirk Kirsten Cc: Hondros, Constantine (ELS-AMS); basex-talk@mailman.uni-konstanz.de Subject: Re: [basex-talk] Efficient query for duplicates Thanks Dirk. A little addition: 'let' should probably be replaced with 'group': for $x in /dataset/item/pii group by $val := $x/string() where count($x) > 1 return <duplicate>{ $val }</duplicate> Christian On Wed, Apr 9, 2014 at 1:59 PM, Dirk Kirsten <dk@basex.org> wrote:
Hello Constantine,
your query very much looks like you want to group by each <pii/> element. Using XQuery 3.0 there even is a construct aimed at that and I would guess it should also preserve memory. It is called group by and you can find some more information at https://docs.basex.org/wiki/XQuery_3.0#Group_By or in the spec. It will look something like this:
for $x in /dataset/item/pii let $val := $x/string() where count($x) > 1 return <duplicate>{$val}</duplicate>
By the way, it looks you are using null like a NULL statement in other languages. Such a statement does not exist in XQuery. You might want to return an empty set () instead.
Cheers, Dirk
On 09/04/14 13:37, Hondros, Constantine (ELS-AMS) wrote:
I'm running out of memory (1.5 GB allocated) when querying for duplicate node values over a fairly flat XML database of approximately 450 MB.
Can anyone suggest a more memory-efficient approach to framing this query than iterating over distinct-values as I do below? I'm hoping that there are some Basex tips and tricks to help out here.
for $val in distinct-values(/dataset/item/pii) let $cnt := count(/dataset/item/pii[. = $val]) return if ($cnt > 1) then <duplicate>{$val}</duplicate> else null
Thanks in advance, Constantine
________________________________
Elsevier B.V. Registered Office: Radarweg 29, 1043 NX Amsterdam, The Netherlands, Registration No. 33156677, Registered in The Netherlands.
-- Dirk Kirsten, BaseX GmbH, http://basex.org |-- Firmensitz: Blarerstrasse 56, 78462 Konstanz |-- Registergericht Freiburg, HRB: 708285, Geschäftsführer: | Dr. Christian Grün, Dr. Alexander Holupirek, Michael Seiferle `-- Phone: 0049 7531 28 28 676, Fax: 0049 7531 20 05 22
________________________________ Elsevier B.V. Registered Office: Radarweg 29, 1043 NX Amsterdam, The Netherlands, Registration No. 33156677, Registered in The Netherlands.
Just out of interest, if I was using a pre-XQuery 3.0 processor, would there have been any other approach than iterating over distinct-values?
...nope. The query might be evaluated faster if you bin all target elements to a variable... let $pii := /dataset/item/pii/string() for $val in distinct-values($pii)[count($pii[. = $val]) > 1] return <duplicate>{$val}</duplicate> ...but this depends on the underlying query processor.
Thanks also for the pointer about using 'null'. The Basex query editor seemed to approve it when I tried it once, and the habit stuck ;-)
"null" is a valid XQuery expresion. If you have a database opened, this query will give you all documents in the database with the root element <null>...</null>.
participants (3)
-
Christian Grün -
Dirk Kirsten -
Hondros, Constantine (ELS-AMS)