re-sort database


Cerstin Elisabeth Mahlow

12 Mar 2013

12:46 p.m.

Hi,

after a lot of data had been gathered, I realized that my update function has a bug. Fixing it is not a big deal, but I don't know how to re-sort the existing data.

Essentially, I wanted to create this kind of data:

<collection>
  <entry>
    <node>123</node>
    <query>xyz</query>
    <secondquery>abc_1</secondquery>
    <secondquery>abc_2</secondquery>
  </entry>
  <entry>
    <node>456</node>
    <query>xyz</query>
    <secondquery>abc_1</secondquery>
    <secondquery>abc_3</secondquery>
    <secondquery>abc_4</secondquery>
  </entry>
</collection>

However, the data looks like this:

<collection>
  <entry>
    <node>123</node>
    <query>xyz</query>
  </entry>
  <secondquery>abc_1</secondquery>
  <secondquery>abc_2</secondquery>
  <entry>
    <node>456</node>
    <query>xyz</query>
  </entry>
  <secondquery>abc_1</secondquery>
  <secondquery>abc_3</secondquery>
  <secondquery>abc_4</secondquery>
</collection>

So, the secondquery elements are stored just after the entry they belong to. How can I move this data from "right after a particular node" to "just inside this particular node" using XQuery Update?

Thanks in advance and best regards

Cerstin

-- Dr. phil. Cerstin Mahlow
Universität Basel Departement Sprach- und Literaturwissenschaften Fachbereich Deutsche Sprach- und Literaturwissenschaft Nadelberg 4 4051 Basel Schweiz
Tel: +41 61 267 07 65 Fax: +41 61 267 34 40 Mail: cerstin.mahlow@unibas.ch Web: http://www.oldphras.net

Christian Grün

12 Mar 2013

2:33 p.m.

Hi Cerstin,

the following query may help:

  for $entry in $doc//entry
  let $next := $entry/following-sibling::entry[1]
  let $sc := $entry/following-sibling::secondquery[empty($next) or . << $next]
  return (
    insert nodes $sc into $entry,
    delete nodes $sc
  )

Cheers, Christian

Wendell Piez

13 Mar 2013

10:20 a.m.

Christian,

Alternatively, would this be a place one could use a 3.0 window clause?

This raises a related question. I have seen a big boost in performance when using 'group by' instead of the classic distinct-values-based grouping. I suppose this is not surprising. Cerstin's question, similarly, is a grouping question, although the grouping is based on proximity in document order, not on values. (In XSLT it would be addressed using xsl:for-each-group[@group-starting-with].)

When doing this (or any) sort of grouping, are we generally better off using the new 3.0 power features than doing it the old-fashioned way by hand? (I imagine that given the size of Cerstin's documents it may not be an issue for her, but what if the sequences were long?)

Cheers, Wendell


-- Wendell Piez | http://www.wendellpiez.com XML | XSLT | electronic publishing Eat Your Vegetables _____oo_________o_o___ooooo____ooooooo_^
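
For readers unfamiliar with the two value-based grouping styles Wendell compares, here is a minimal sketch of both. The $items sequence and its key attribute are made up for illustration and do not come from Cerstin's data; the sketch shows the difference in shape only and makes no claim about timings.

```xquery
xquery version "3.0";

(: Hypothetical sample data, not taken from the thread. :)
let $items := (
  <item key="a">1</item>,
  <item key="b">2</item>,
  <item key="a">3</item>
)
return (
  (: Classic style: collect the distinct keys, then filter the
     whole sequence once per key. :)
  for $k in distinct-values($items/@key)
  return <group key="{ $k }" style="distinct-values">{ $items[@key = $k] }</group>,

  (: XQuery 3.0 style: the processor forms the groups itself;
     after "group by", $i holds all items that share the key. :)
  for $i in $items
  group by $k := $i/@key
  return <group key="{ $k }" style="group-by">{ $i }</group>
)
```

The classic form rescans $items for every distinct key, which is one plausible reason for the speed-up Wendell observed, though how much it matters depends on the processor.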

Christian Grün

10:25 a.m.

Hi Wendell,

good point. I agree that there are various ways to answer Cerstin’s question. Window clauses should be a good fit here, and should most probably provide better performance than requesting the following and preceding axes of a node.

Christian


Wendell Piez

12:28 p.m.

Christian,

Thanks, but I was hoping I could lure you (or someone else) into suggesting what the syntax for a window clause might look like here! :-) Since they are something I haven't mastered yet.

(Any takers?)

Cheers, Wendell


-- Wendell Piez | http://www.wendellpiez.com XML | XSLT | electronic publishing Eat Your Vegetables _____oo_________o_o___ooooo____ooooooo_^
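
No window-clause version was posted in the thread. As a rough sketch of what one might look like, assuming the flat structure from Cerstin's sample (entry and secondquery elements as direct children of collection), a tumbling window can open at each entry and run up to the next one. This variant builds a corrected copy rather than updating in place, and the inline $doc is only there to make the example self-contained.

```xquery
xquery version "3.0";

(: Sample data from Cerstin's message; in practice $doc would be
   the document stored in the database. :)
let $doc := document {
  <collection>
    <entry><node>123</node><query>xyz</query></entry>
    <secondquery>abc_1</secondquery>
    <secondquery>abc_2</secondquery>
    <entry><node>456</node><query>xyz</query></entry>
    <secondquery>abc_1</secondquery>
    <secondquery>abc_3</secondquery>
    <secondquery>abc_4</secondquery>
  </collection>
}
return
  <collection>{
    (: Each window opens at an entry and, with no end condition,
       runs until just before the next entry. :)
    for tumbling window $w in $doc/collection/*
      start $entry when $entry instance of element(entry)
    return
      (: Rebuild the entry with its trailing secondquery siblings inside. :)
      <entry>{ $entry/node(), $w[self::secondquery] }</entry>
  }</collection>
```

An in-place variant would presumably return (insert nodes $w[self::secondquery] into $entry, delete nodes $w[self::secondquery]) instead of constructing new elements; whether that combination of window clause and update expressions runs on the BaseX version discussed here has not been checked.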

Cerstin Elisabeth Mahlow

1:25 p.m.

Hi Christian,

On 12 Mar 2013, at 19:33, Christian Grün christian.gruen@gmail.com wrote:

the following query may help:

  for $entry in $doc//entry
  let $next := $entry/following-sibling::entry[1]
  let $sc := $entry/following-sibling::secondquery[empty($next) or . << $next]
  return (
    insert nodes $sc into $entry,
    delete nodes $sc
  )

Thanks! It works perfectly for the example and also for a small sample of the real data.

However, my real data has about 140 000 such entries and about 30 000 such secondqueries, all in one database, which is probably too big.

After 3320855 ms of execution time (and 3355613 ms for a second attempt) I got the following error message. Any ideas?

I already set VM=-Xmx1024m and I use BaseX 7.6.1 Beta from February 14 on a MacBook Air with a 2 GHz processor and 8 GB RAM.

Error: Improper use? Potential bug? Your feedback is welcome:
Contact: basex-talk@mailman.uni-konstanz.de
Version: BaseX 7.6.1 beta
Java: Apple Inc., 1.6.0_43
OS: Mac OS X, x86_64

Stack Trace:
java.lang.ArrayIndexOutOfBoundsException: 2147483647
  org.basex.io.random.TableDiskAccess.cursor(TableDiskAccess.java:485)
  org.basex.io.random.TableDiskAccess.read5(TableDiskAccess.java:211)
  org.basex.data.Data.textOff(Data.java:422)
  org.basex.data.DiskData.text(DiskData.java:234)
  org.basex.index.value.DiskValues.readKeyAt(DiskValues.java:285)
  org.basex.index.value.DiskValues.get(DiskValues.java:441)
  org.basex.index.value.UpdatableDiskValues.index(UpdatableDiskValues.java:65)
  org.basex.data.DiskData.indexEnd(DiskData.java:355)
  org.basex.data.Data.insert(Data.java:841)
  org.basex.data.atomic.Insert.apply(Insert.java:31)
  org.basex.data.atomic.AtomicUpdateList.applyStructuralUpdates(AtomicUpdateList.java:297)
  org.basex.data.atomic.AtomicUpdateList.execute(AtomicUpdateList.java:285)
  org.basex.query.up.DatabaseUpdates.apply(DatabaseUpdates.java:183)
  org.basex.query.up.ContextModifier.apply(ContextModifier.java:90)
  org.basex.query.up.Updates.apply(Updates.java:120)
  org.basex.query.QueryContext.update(QueryContext.java:270)
  org.basex.query.QueryContext.value(QueryContext.java:255)
  org.basex.query.QueryContext.iter(QueryContext.java:240)
  org.basex.query.QueryContext.execute(QueryContext.java:498)
  org.basex.query.QueryProcessor.execute(QueryProcessor.java:96)
  org.basex.core.cmd.AQuery.query(AQuery.java:77)
  org.basex.core.cmd.XQuery.run(XQuery.java:22)
  org.basex.core.Command.run(Command.java:342)
  org.basex.core.Command.exec(Command.java:321)
  org.basex.core.Command.execute(Command.java:78)
  org.basex.gui.GUI.exec(GUI.java:397)
  org.basex.gui.GUI$7.run(GUI.java:349)

Compiling:
- simplifying descendant-or-self step(s)

Optimized Query:
  for $entry in document-node { "collection-ws-new.xml" }/descendant::entry
  let $next := $entry/following-sibling::entry[1]
  let $sc := $entry/following-sibling::secondquery[(fn:empty($next) or (. << $next))]
  return (insert node $sc into $entry, delete nodes $sc)

Christian Grün

5:29 p.m.

Hi Cerstin,

However, my real data has about 140 000 such entries and about 30 000 such secondqueries, all in one database, which is probably too big.

True; it may well be that the total number of update operations is too large to be processed in a single step. I would advise running the updates in several steps, triggering a separate query execution for each range of entries, along these lines:

  declare variable $start external := 1;
  declare variable $end external := 1000;

  for $entry in db:open("collection-ws-new.xml")/descendant::entry[position() = $start to $end]
  let $next := $entry/following-sibling::entry[1]
  let $sc := $entry/following-sibling::secondquery[empty($next) or . << $next]
  return (insert node $sc into $entry, delete nodes $sc)

After 3320855 ms of execution time (and 3355613 ms for a second attempt) I got the following error message. Any ideas?

Did you stop the update process, and do you still have the original data instance?

The error message indicates that the updatable index structure could be corrupt. You could try to export your data and create a new database without updatable index structures; this could also speed up your updates. Maybe it even allows you to update all nodes in a single run.

Christian


Liam R E Quin

7:02 p.m.

On Wed, 2013-03-13 at 22:29 +0100, Christian Grün wrote:

Hi Cerstin, [...]

You could try to export your data and create a new database without updatable index structures; this could also speed up your updates. Maybe it even allows you to update all nodes in a single run.

I already set VM=-Xmx1024m and I use BaseX 7.6.1 Beta from February 14 on a MacBook Air with a 2 GHz processor and 8 GB RAM.

I'd try using VM=-Xmx6000m if you have 8G of RAM.

Liam

-- Liam Quin - XML Activity Lead, W3C, http://www.w3.org/People/Quin/ Pictures from old books: http://fromoldbooks.org/ Ankh: irc.sorcery.net irc.gnome.org freenode/#xml

Cerstin Elisabeth Mahlow

14 Mar 2013

11:27 a.m.

Hi,

On 14 Mar 2013, at 00:02, Liam R E Quin liam@w3.org wrote:

On Wed, 2013-03-13 at 22:29 +0100, Christian Grün wrote:

You could try to export your data and create a new database without updatable index structures; this could also speed up your updates. Maybe it even allows you to update all nodes in a single run.

I already set VM=-Xmx1024m and I use BaseX 7.6.1 Beta from February 14 on a MacBook Air with a 2 GHz processor and 8 GB RAM.

I'd try using VM=-Xmx6000m if you have 8G of RAM.

OK, after combining both tips (using a database without updatable index and setting VM=-Xmx6000m) it worked in a single run. Thanks!

After 5'729'855 ms (95 minutes) it updated 35'344 nodes within the 165'000 entries in the database.

I don't know whether this is slow and could be improved, but I'm happy to have fixed the database :)

Best regards and thanks again

Cerstin
