Xquery for noobs : the HashSet problem

List overview All Threads
Download

newer

older

Multi threading doing slow process...

Replace node attribute value

jean-marc Mercier

12 Nov 2013 12 Nov '13

10:49 a.m.

Hi,

Following a recent discussion with BaseX team, here is a description of a problem that I don't know how to solve efficiently from within Xquery or BaseX Xquery extensions like map:module. Thus, any suggestion are welcomed !

The problem is the following : I want to write an XQUERY module called HashSet, implementing the XQUERY version of HashSet (I mean the JAVA object HashSet, or std::set in C++).

An HashSet is indeed a map specialization. Thus a suggestion is to use the BaseX map:module. However, I don't know how to use it efficiently. The reasons are the following : This HashSet module

1) must expose a method "hashset:new()" to allocate memory dynamically. The map:module provides the function map:new, this point is ok, I can wrap that.

2) This module must expose a method "hashset:clear($hashset)" to de-allocate memory dynamically. The map:module provides the function map:remove, and I could remove all elements. However this point is ko for me, as, AFAIU (as far as I understand), map:remove copy an the references from a map object into a smaller map object. It does not dealllocate memory, leading to poor overall performances.

3) must expose a method "hashset:add($hashset,$element)" to add memory dynamically. Through map:module, the only possibility that I see is to wrap it as map:new($hashset, map:entry($element,$dummyboolean)). Here too, a in-memory copy is done, leading to poor overall performances.

XQUERY is a very attractive langage to me. However, its very nature is to be a functional langage. The main inconvenients to working with XQUERY for me are : - first : its dynamic memory management (the in-memory printfoot of my XQUERY executables are usually tremendous). - second : it lacks powerful libraries, as complete math modules.

However, I might miss something ? Do not hesitate to teach me some XQUERY basic !

Cheers

JohnLeM

Attachments:

attachment.html (text/html — 2.2 KB)

Show replies by date

Christian Grün

12 Nov 12 Nov

2:48 p.m.

Dear JohnLeM,

thanks for your mail. As you already noted, XQuery is a functional language, and this is the reason why XQuery maps are not exactly comparable to maps and sets, as they are used in imperative languages:

All maps in XQuery are persisent (immutable): Once a map has been generated, it is not possible to change its contents. Instead, a new map is to be created by each insertion or deletion [1]. This sounds like a huge memory killer, but it’s not as bad as you might guess. Various efficient solutions exist for persistent map, such as the mapped trie that has been implemented in BaseX [2]. It will only create copies of parts of the data structure that are to be changed. The following query is an example for a query which creates 100.000 map with a single entry, and a large map containing all the entries; on my system, it requires 200 ms to run:

map:size(map:new( for $i in 1 to 100000 return map { $i := true() } ))

In short: persistent maps may not be as efficient as mutable maps, but they are usually not the bottleneck in XQuery applications, because deleted entries (or obsolete maps) will automatically be discarded by the garbage collector as soon as they are not in the query scope anymore. If you want to enforce this, you can put your map operations into FLWOR expresions or user-declared functions.

Back to your original questions:

...

This module must expose a method "hashset:clear($hashset)" to de-allocate

memory dynamically. The map:module provides the function map:remove, and I could remove all elements. [...] It does not deallocate memory, leading to poor overall performances.

It may be interesting to do some testing in order to find out what’s the actual bottleneck in your query. How man entries is your hash set supposed to contain?

...

must expose a method "hashset:add($hashset,$element)" to add memory

dynamically. Through map:module, the only possibility that I see is to wrap it as map:new($hashset, map:entry($element,$dummyboolean)).

Right, using true() is probably the best choice (booleans are only instantiated once in memory).

...

first : its dynamic memory management (the in-memory printfoot of my

XQUERY executables are usually tremendous).

This can in fact be a problem, and is mostly due to various decisions taken in the specification, and the complexity of XML nodes in general.

...

second : it lacks powerful libraries, as complete math modules.

What kind of functions would you be interested in?

Christian

[1] http://en.wikipedia.org/wiki/Persistent_data_structure [2] http://en.wikipedia.org/wiki/Hash_array_mapped_trie

jean-marc Mercier

5:48 p.m.

Dear Christian,

Thanks for your answer.

...

It may be interesting to do some testing in order to find out what’s the actual bottleneck in your query. How man entries is your hash set supposed to contain?

The HashSet contains 10M strings. That's not a big deal. The problems comes from filling the HashSet. Well, let us be more technacal. I am using hashset for tree traversal operations over a DataBase of 10M nodes, 1Go. This HashSet collects some informations during the traversal, mainly a string at each node. Actually, my code binds JAVA with something like

import module namespace set = "java.util.HashSet";

declare function init(){set:clear(), tree_traversal($my_big_DataBase), do_some_stuff_with_hash_set()} declare function tree_traversal($elements) { for $element in $elements return ( set:add(get_some_stuff_within($element) ), tree_traversal($element/element()) ) } and the whole traversal takes a minute.

I don't know how to achieve this with pure XQUERY. The only thing that I tried is something like :

declare function init(){let $my_hash_set := tree_traversal($my_big_DataBase, map:new()) return do_some_stuff_with($my_hash_set) }

declare function tree_traversal($elements, $hash_set) as function(*) (return a HashSet) { for $element in $elements return ( tree_traversal($nodes/element(), $hash_set), map:entry( get_some_stuff_within($element),boolean) ) } that performed very badly, compared to the first version (JAVA binding). I explained this thinking that the first version is O(N) operations, and the second is O(N^2) operations due to in-memory copy.

...

...

second : it lacks powerful libraries, as complete math modules.

...

What kind of functions would you be interested in?

I need a stochastic modulus, as http://www.boost.org/doc/libs/1_55_0/libs/math/doc/html/dist.html also a linear Algebra modulus as UBLAS http://www.boost.org/doc/libs/1_54_0/libs/numeric/ublas/doc/index.htm

I guess that I could find JAVA equivalent and bind them to BaseX. Such functionalities are available directly as XQUERY modules ?

Cheers,

Jean-Marc

2013/11/12 Christian Grün christian.gruen@gmail.com

...

Dear JohnLeM,

thanks for your mail. As you already noted, XQuery is a functional language, and this is the reason why XQuery maps are not exactly comparable to maps and sets, as they are used in imperative languages:

All maps in XQuery are persisent (immutable): Once a map has been generated, it is not possible to change its contents. Instead, a new map is to be created by each insertion or deletion [1]. This sounds like a huge memory killer, but it’s not as bad as you might guess. Various efficient solutions exist for persistent map, such as the mapped trie that has been implemented in BaseX [2]. It will only create copies of parts of the data structure that are to be changed. The following query is an example for a query which creates 100.000 map with a single entry, and a large map containing all the entries; on my system, it requires 200 ms to run:

map:size(map:new( for $i in 1 to 100000 return map { $i := true() } ))

In short: persistent maps may not be as efficient as mutable maps, but they are usually not the bottleneck in XQuery applications, because deleted entries (or obsolete maps) will automatically be discarded by the garbage collector as soon as they are not in the query scope anymore. If you want to enforce this, you can put your map operations into FLWOR expresions or user-declared functions.

Back to your original questions:

...

This module must expose a method "hashset:clear($hashset)" to

de-allocate

...
memory dynamically. The map:module provides the function map:remove, and

I

...
could remove all elements. [...] It does not deallocate memory, leading

to poor

...
overall performances.

It may be interesting to do some testing in order to find out what’s the actual bottleneck in your query. How man entries is your hash set supposed to contain?

...

must expose a method "hashset:add($hashset,$element)" to add memory

dynamically. Through map:module, the only possibility that I see is to

wrap

...
it as map:new($hashset, map:entry($element,$dummyboolean)).

Right, using true() is probably the best choice (booleans are only instantiated once in memory).

...

first : its dynamic memory management (the in-memory printfoot of my

XQUERY executables are usually tremendous).

This can in fact be a problem, and is mostly due to various decisions taken in the specification, and the complexity of XML nodes in general.

...

second : it lacks powerful libraries, as complete math modules.

What kind of functions would you be interested in?

Christian

[1] http://en.wikipedia.org/wiki/Persistent_data_structure [2] http://en.wikipedia.org/wiki/Hash_array_mapped_trie

Charles Duffy

5:59 p.m.

It's not necessarily true that immutable data structures perform an actual in-memory copy. See for instance Phil Bagwell's hash tries -- only minimal amounts of the data structure are actually copied during updates.

I don't know whether BaseX is making use of these optimizations (Christian could speak to that), but jumping directly to "it's immutible, so it must be inefficient" is not necessarily well-founded.

On Tue, Nov 12, 2013 at 10:48 AM, jean-marc Mercier < jeanmarc.mercier@gmail.com> wrote:

...

Dear Christian,

Thanks for your answer.

...
It may be interesting to do some testing in order to find out what’s the actual bottleneck in your query. How man entries is your hash set supposed to contain?

The HashSet contains 10M strings. That's not a big deal. The problems comes from filling the HashSet. Well, let us be more technacal. I am using hashset for tree traversal operations over a DataBase of 10M nodes, 1Go. This HashSet collects some informations during the traversal, mainly a string at each node. Actually, my code binds JAVA with something like

import module namespace set = "java.util.HashSet";

declare function init(){set:clear(), tree_traversal($my_big_DataBase), do_some_stuff_with_hash_set()} declare function tree_traversal($elements) { for $element in $elements return ( set:add(get_some_stuff_within($element) ), tree_traversal($element/element()) ) } and the whole traversal takes a minute.

I don't know how to achieve this with pure XQUERY. The only thing that I tried is something like :

declare function init(){let $my_hash_set := tree_traversal($my_big_DataBase, map:new()) return do_some_stuff_with($my_hash_set) }

declare function tree_traversal($elements, $hash_set) as function(*) (return a HashSet) { for $element in $elements return ( tree_traversal($nodes/element(), $hash_set), map:entry( get_some_stuff_within($element),boolean) ) } that performed very badly, compared to the first version (JAVA binding). I explained this thinking that the first version is O(N) operations, and the second is O(N^2) operations due to in-memory copy.

...
...

second : it lacks powerful libraries, as complete math modules.

...
What kind of functions would you be interested in?

I need a stochastic modulus, as http://www.boost.org/doc/libs/1_55_0/libs/math/doc/html/dist.html also a linear Algebra modulus as UBLAS http://www.boost.org/doc/libs/1_54_0/libs/numeric/ublas/doc/index.htm

I guess that I could find JAVA equivalent and bind them to BaseX. Such functionalities are available directly as XQUERY modules ?

Cheers,

Jean-Marc

2013/11/12 Christian Grün christian.gruen@gmail.com

...
Dear JohnLeM,

thanks for your mail. As you already noted, XQuery is a functional language, and this is the reason why XQuery maps are not exactly comparable to maps and sets, as they are used in imperative languages:

All maps in XQuery are persisent (immutable): Once a map has been generated, it is not possible to change its contents. Instead, a new map is to be created by each insertion or deletion [1]. This sounds like a huge memory killer, but it’s not as bad as you might guess. Various efficient solutions exist for persistent map, such as the mapped trie that has been implemented in BaseX [2]. It will only create copies of parts of the data structure that are to be changed. The following query is an example for a query which creates 100.000 map with a single entry, and a large map containing all the entries; on my system, it requires 200 ms to run:

map:size(map:new( for $i in 1 to 100000 return map { $i := true() } ))

In short: persistent maps may not be as efficient as mutable maps, but they are usually not the bottleneck in XQuery applications, because deleted entries (or obsolete maps) will automatically be discarded by the garbage collector as soon as they are not in the query scope anymore. If you want to enforce this, you can put your map operations into FLWOR expresions or user-declared functions.

Back to your original questions:

...

This module must expose a method "hashset:clear($hashset)" to

de-allocate

...
memory dynamically. The map:module provides the function map:remove,

and I

...
could remove all elements. [...] It does not deallocate memory,

leading to poor

...
overall performances.

It may be interesting to do some testing in order to find out what’s the actual bottleneck in your query. How man entries is your hash set supposed to contain?

...

must expose a method "hashset:add($hashset,$element)" to add memory

dynamically. Through map:module, the only possibility that I see is to

wrap

...
it as map:new($hashset, map:entry($element,$dummyboolean)).

Right, using true() is probably the best choice (booleans are only instantiated once in memory).

...

first : its dynamic memory management (the in-memory printfoot of my

XQUERY executables are usually tremendous).

This can in fact be a problem, and is mostly due to various decisions taken in the specification, and the complexity of XML nodes in general.

...

second : it lacks powerful libraries, as complete math modules.

What kind of functions would you be interested in?

Christian

[1] http://en.wikipedia.org/wiki/Persistent_data_structure [2] http://en.wikipedia.org/wiki/Hash_array_mapped_trie

BaseX-Talk mailing list BaseX-Talk@mailman.uni-konstanz.de https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk

Christian Grün

6:38 p.m.

...

It's not necessarily true that immutable data structures perform an actual in-memory copy. See for instance Phil Bagwell's hash tries -- only minimal amounts of the data structure are actually copied during updates.

Exactly. By the way, BaseX is indeed based on Phil Bagwell’s hash tries..

...

I don't know whether BaseX is making use of these optimizations (Christian could speak to that), but jumping directly to "it's immutible, so it must be inefficient" is not necessarily well-founded.

On Tue, Nov 12, 2013 at 10:48 AM, jean-marc Mercier jeanmarc.mercier@gmail.com wrote:

...
Dear Christian,

Thanks for your answer.

...
It may be interesting to do some testing in order to find out what’s the actual bottleneck in your query. How man entries is your hash set supposed to contain?

The HashSet contains 10M strings. That's not a big deal. The problems comes from filling the HashSet. Well, let us be more technacal. I am using hashset for tree traversal operations over a DataBase of 10M nodes, 1Go. This HashSet collects some informations during the traversal, mainly a string at each node. Actually, my code binds JAVA with something like

import module namespace set = "java.util.HashSet";

declare function init(){set:clear(), tree_traversal($my_big_DataBase), do_some_stuff_with_hash_set()} declare function tree_traversal($elements) { for $element in $elements return ( set:add(get_some_stuff_within($element) ), tree_traversal($element/element()) ) } and the whole traversal takes a minute.

I don't know how to achieve this with pure XQUERY. The only thing that I tried is something like :

declare function init(){let $my_hash_set := tree_traversal($my_big_DataBase, map:new()) return do_some_stuff_with($my_hash_set) }

declare function tree_traversal($elements, $hash_set) as function(*) (return a HashSet) { for $element in $elements return ( tree_traversal($nodes/element(), $hash_set), map:entry( get_some_stuff_within($element),boolean) ) } that performed very badly, compared to the first version (JAVA binding). I explained this thinking that the first version is O(N) operations, and the second is O(N^2) operations due to in-memory copy.

...
...

second : it lacks powerful libraries, as complete math modules.

...
What kind of functions would you be interested in?

I need a stochastic modulus, as http://www.boost.org/doc/libs/1_55_0/libs/math/doc/html/dist.html also a linear Algebra modulus as UBLAS http://www.boost.org/doc/libs/1_54_0/libs/numeric/ublas/doc/index.htm

I guess that I could find JAVA equivalent and bind them to BaseX. Such functionalities are available directly as XQUERY modules ?

Cheers,

Jean-Marc

2013/11/12 Christian Grün christian.gruen@gmail.com

...
Dear JohnLeM,

thanks for your mail. As you already noted, XQuery is a functional language, and this is the reason why XQuery maps are not exactly comparable to maps and sets, as they are used in imperative languages:

All maps in XQuery are persisent (immutable): Once a map has been generated, it is not possible to change its contents. Instead, a new map is to be created by each insertion or deletion [1]. This sounds like a huge memory killer, but it’s not as bad as you might guess. Various efficient solutions exist for persistent map, such as the mapped trie that has been implemented in BaseX [2]. It will only create copies of parts of the data structure that are to be changed. The following query is an example for a query which creates 100.000 map with a single entry, and a large map containing all the entries; on my system, it requires 200 ms to run:

map:size(map:new( for $i in 1 to 100000 return map { $i := true() } ))

In short: persistent maps may not be as efficient as mutable maps, but they are usually not the bottleneck in XQuery applications, because deleted entries (or obsolete maps) will automatically be discarded by the garbage collector as soon as they are not in the query scope anymore. If you want to enforce this, you can put your map operations into FLWOR expresions or user-declared functions.

Back to your original questions:

...

This module must expose a method "hashset:clear($hashset)" to

de-allocate memory dynamically. The map:module provides the function map:remove, and I could remove all elements. [...] It does not deallocate memory, leading to poor overall performances.

It may be interesting to do some testing in order to find out what’s the actual bottleneck in your query. How man entries is your hash set supposed to contain?

...

must expose a method "hashset:add($hashset,$element)" to add memory

dynamically. Through map:module, the only possibility that I see is to wrap it as map:new($hashset, map:entry($element,$dummyboolean)).

Right, using true() is probably the best choice (booleans are only instantiated once in memory).

...

first : its dynamic memory management (the in-memory printfoot of my

XQUERY executables are usually tremendous).

This can in fact be a problem, and is mostly due to various decisions taken in the specification, and the complexity of XML nodes in general.

...

second : it lacks powerful libraries, as complete math modules.

What kind of functions would you be interested in?

Christian

[1] http://en.wikipedia.org/wiki/Persistent_data_structure [2] http://en.wikipedia.org/wiki/Hash_array_mapped_trie

BaseX-Talk mailing list BaseX-Talk@mailman.uni-konstanz.de https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk

BaseX-Talk mailing list BaseX-Talk@mailman.uni-konstanz.de https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk

Christian Grün

6:45 p.m.

Hi Jean-Marc,

I’m not sure how your actual data looks like, but you could have a look the attached query, which uses XQuery maps, and which allowe d us to inser t 10 million strings in about 40 seconds . Most presumably, it’s not as memory efficient as a Java map, but my guess is that most memory is actually spent by the XQuery representation of string items, and not necessarily the map itself.

Hope this helps, Christian ___________________________ ___________________________

declare variable $my-big-DataBase := doc('db')/root/*;

declare function local:do-some-stuff-with-set($map) { count(map:keys($map)) };

declare function local:init() { local:do-some-stuff-with-set(local:tree-traversal($my-big-DataBase, map:new())) };

declare function local:tree-traversal($elements, $map) { fold-left($elements, $map, function($map2, $element) { let $map3 := map:new(($map2, { local:get-some-stuff-within($element) : true() })) return local:tree-traversal($element/element(), $map3) } ) };

declare function local:get-some-stuff-within($element) { $element/@name };

local:init() ___________________________ ___________________________

On Tue, Nov 12, 2013 at 5:48 PM, jean-marc Mercier < jeanmarc.mercier@gmail.com> wrote:

...

Dear Christian,

Thanks for your answer.

...
It may be interesting to do some testing in order to find out what’s the actual bottleneck in your query. How man entries is your hash set supposed to contain?

The HashSet contains 10M strings. That's not a big deal. The problems

comes

...

from filling the HashSet. Well, let us be more technacal. I am using hashset for tree traversal operations over a DataBase of 10M nodes, 1Go. This HashSet collects some informations during the traversal, mainly a string at each node. Actually, my code binds JAVA with something like

import module namespace set = "java.util.HashSet";

declare function init(){set:clear(), tree_traversal($my_big_DataBase), do_some_stuff_with_hash_set()} declare function tree_traversal($elements) { for $element in $elements return ( set:add(get_some_stuff_within($element) ), tree_traversal($element/element()) ) } and the whole traversal takes a minute.

I don't know how to achieve this with pure XQUERY. The only thing that I tried is something like :

declare function init(){let $my_hash_set :=

tree_traversal($my_big_DataBase,

...

map:new()) return do_some_stuff_with($my_hash_set) }

declare function tree_traversal($elements, $hash_set) as function(*)

(return

...

a HashSet) { for $element in $elements return ( tree_traversal($nodes/element(), $hash_set), map:entry( get_some_stuff_within($element),boolean) ) } that performed very badly, compared to the first version (JAVA binding). I explained this thinking that the first version is O(N) operations, and the second is O(N^2) operations due to in-memory copy.

...
...

second : it lacks powerful libraries, as complete math modules.

...
What kind of functions would you be interested in?

I need a stochastic modulus, as http://www.boost.org/doc/libs/1_55_0/libs/math/doc/html/dist.html also a linear Algebra modulus as UBLAS http://www.boost.org/doc/libs/1_54_0/libs/numeric/ublas/doc/index.htm

I guess that I could find JAVA equivalent and bind them to BaseX. Such functionalities are available directly as XQUERY modules ?

Cheers,

Jean-Marc

2013/11/12 Christian Grün christian.gruen@gmail.com

...
Dear JohnLeM,

thanks for your mail. As you already noted, XQuery is a functional language, and this is the reason why XQuery maps are not exactly comparable to maps and sets, as they are used in imperative languages:

All maps in XQuery are persisent (immutable): Once a map has been generated, it is not possible to change its contents. Instead, a new map is to be created by each insertion or deletion [1]. This sounds like a huge memory killer, but it’s not as bad as you might guess. Various efficient solutions exist for persistent map, such as the mapped trie that has been implemented in BaseX [2]. It will only create copies of parts of the data structure that are to be changed. The following query is an example for a query which creates 100.000 map with a single entry, and a large map containing all the entries; on my system, it requires 200 ms to run:

map:size(map:new( for $i in 1 to 100000 return map { $i := true() } ))

In short: persistent maps may not be as efficient as mutable maps, but they are usually not the bottleneck in XQuery applications, because deleted entries (or obsolete maps) will automatically be discarded by the garbage collector as soon as they are not in the query scope anymore. If you want to enforce this, you can put your map operations into FLWOR expresions or user-declared functions.

Back to your original questions:

...

This module must expose a method "hashset:clear($hashset)" to

de-allocate memory dynamically. The map:module provides the function map:remove,

and

...

...
...
I could remove all elements. [...] It does not deallocate memory,

leading

...

...
...
to poor overall performances.

It may be interesting to do some testing in order to find out what’s the actual bottleneck in your query. How man entries is your hash set supposed to contain?

...

must expose a method "hashset:add($hashset,$element)" to add memory

dynamically. Through map:module, the only possibility that I see is to wrap it as map:new($hashset, map:entry($element,$dummyboolean)).

Right, using true() is probably the best choice (booleans are only instantiated once in memory).

...

first : its dynamic memory management (the in-memory printfoot of my

XQUERY executables are usually tremendous).

This can in fact be a problem, and is mostly due to various decisions taken in the specification, and the complexity of XML nodes in general.

...

second : it lacks powerful libraries, as complete math modules.

What kind of functions would you be interested in?

Christian

[1] http://en.wikipedia.org/wiki/Persistent_data_structure [2] http://en.wikipedia.org/wiki/Hash_array_mapped_trie

jean-marc Mercier

10:05 p.m.

Christian,

Thx for your query. I have tested this code against a JAVA binding with HashSet, running a BaseX instance (i.e. declare variable $my-big-DataBase := db:open(..) ).

I've taken this function, more representative of my needs :

declare function local:get-some-stuff-within($element) { generate-id($element) };

Here are the result (I ran the test 4/5 times to be sure that my environment did not perturb the experiment) :

Java Hashset : start : 21:34:21.98 size HashSet : 17568843 end : 21:35:05.93

BaseX map based hashset

start : 21:37:56.59 size map : 17568843 end : 21:38:50.43

Some remarks :

- Without surprise, storing docs into BaseX accelerate the query. - Surpringly for me, this map code behave nicely (mine was terrible ! I don't know why). Is there something I should know about fold-left ? Well, clearly, I have a lot to learn about this langage. - HashSet seems to be 20 % faster, with a smaller memory footprint (10%, (4,3 against 4, unprecise measure, since I was looking to my Task Manager !) ).

Thus I could use maps to simulate HashSet, it not a very big overload. However, is there any incentive to trade off 20% performance ?

Cheers,

Jean-Marc

2013/11/12 Christian Grün christian.gruen@gmail.com

...

Hi Jean-Marc,

I’m not sure how your actual data looks like, but you could have a look the attached query, which uses XQuery maps, and which allowe d us to inser t 10 million strings in about 40 seconds . Most presumably, it’s not as memory efficient as a Java map, but my guess is that most memory is actually spent by the XQuery representation of string items, and not necessarily the map itself.

Hope this helps, Christian ___________________________ ___________________________

declare variable $my-big-DataBase := doc('db')/root/*;

declare function local:do-some-stuff-with-set($map) { count(map:keys($map)) };

declare function local:init() { local:do-some-stuff-with-set(local:tree-traversal($my-big-DataBase, map:new())) };

declare function local:tree-traversal($elements, $map) { fold-left($elements, $map, function($map2, $element) { let $map3 := map:new(($map2, { local:get-some-stuff-within($element) : true() })) return local:tree-traversal($element/element(), $map3) } ) };

declare function local:get-some-stuff-within($element) { $element/@name };

local:init() ___________________________ ___________________________

On Tue, Nov 12, 2013 at 5:48 PM, jean-marc Mercier < jeanmarc.mercier@gmail.com> wrote:

...
Dear Christian,

Thanks for your answer.

...
It may be interesting to do some testing in order to find out what’s the actual bottleneck in your query. How man entries is your hash set supposed to contain?

The HashSet contains 10M strings. That's not a big deal. The problems

comes

...
from filling the HashSet. Well, let us be more technacal. I am using hashset for tree traversal operations over a DataBase of 10M nodes, 1Go. This HashSet collects some informations during the traversal, mainly a string at each node.

Actually,

...
my code binds JAVA with something like

import module namespace set = "java.util.HashSet";

declare function init(){set:clear(), tree_traversal($my_big_DataBase), do_some_stuff_with_hash_set()} declare function tree_traversal($elements) { for $element in $elements return ( set:add(get_some_stuff_within($element) ), tree_traversal($element/element()) ) } and the whole traversal takes a minute.

I don't know how to achieve this with pure XQUERY. The only thing that I tried is something like :

declare function init(){let $my_hash_set :=

tree_traversal($my_big_DataBase,

...
map:new()) return do_some_stuff_with($my_hash_set) }

declare function tree_traversal($elements, $hash_set) as function(*)

(return

...
a HashSet) { for $element in $elements return ( tree_traversal($nodes/element(), $hash_set), map:entry( get_some_stuff_within($element),boolean) ) } that performed very badly, compared to the first version (JAVA binding).

I

...
explained this thinking that the first version is O(N) operations, and

the

...
second is O(N^2) operations due to in-memory copy.

...
...

second : it lacks powerful libraries, as complete math modules.

...
What kind of functions would you be interested in?

I need a stochastic modulus, as http://www.boost.org/doc/libs/1_55_0/libs/math/doc/html/dist.html also a linear Algebra modulus as UBLAS http://www.boost.org/doc/libs/1_54_0/libs/numeric/ublas/doc/index.htm

I guess that I could find JAVA equivalent and bind them to BaseX. Such functionalities are available directly as XQUERY modules ?

Cheers,

Jean-Marc

2013/11/12 Christian Grün christian.gruen@gmail.com

...
Dear JohnLeM,

thanks for your mail. As you already noted, XQuery is a functional language, and this is the reason why XQuery maps are not exactly comparable to maps and sets, as they are used in imperative languages:

All maps in XQuery are persisent (immutable): Once a map has been generated, it is not possible to change its contents. Instead, a new map is to be created by each insertion or deletion [1]. This sounds like a huge memory killer, but it’s not as bad as you might guess. Various efficient solutions exist for persistent map, such as the mapped trie that has been implemented in BaseX [2]. It will only create copies of parts of the data structure that are to be changed. The following query is an example for a query which creates 100.000 map with a single entry, and a large map containing all the entries; on my system, it requires 200 ms to run:

map:size(map:new( for $i in 1 to 100000 return map { $i := true() } ))

In short: persistent maps may not be as efficient as mutable maps, but they are usually not the bottleneck in XQuery applications, because deleted entries (or obsolete maps) will automatically be discarded by the garbage collector as soon as they are not in the query scope anymore. If you want to enforce this, you can put your map operations into FLWOR expresions or user-declared functions.

Back to your original questions:

...

This module must expose a method "hashset:clear($hashset)" to

de-allocate memory dynamically. The map:module provides the function map:remove,

and

...
...
...
I could remove all elements. [...] It does not deallocate memory,

leading

...
...
...
to poor overall performances.

It may be interesting to do some testing in order to find out what’s the actual bottleneck in your query. How man entries is your hash set supposed to contain?

...

must expose a method "hashset:add($hashset,$element)" to add memory

dynamically. Through map:module, the only possibility that I see is to wrap it as map:new($hashset, map:entry($element,$dummyboolean)).

Right, using true() is probably the best choice (booleans are only instantiated once in memory).

...

first : its dynamic memory management (the in-memory printfoot of my

XQUERY executables are usually tremendous).

This can in fact be a problem, and is mostly due to various decisions taken in the specification, and the complexity of XML nodes in general.

...

second : it lacks powerful libraries, as complete math modules.

What kind of functions would you be interested in?

Christian

[1] http://en.wikipedia.org/wiki/Persistent_data_structure [2] http://en.wikipedia.org/wiki/Hash_array_mapped_trie

Charles Duffy

10:29 p.m.

On Tue, Nov 12, 2013 at 3:05 PM, jean-marc Mercier < jeanmarc.mercier@gmail.com> wrote:

...

Thus I could use maps to simulate HashSet, it not a very big overload. However, is there any incentive to trade off 20% performance ?

Immutability is its own reward, or, rather, comes with its own set of them. :)

Working with immutable data makes lock-free thread-safety comparatively trivial, and enables a wide array of aggressive optimizations (including providing the compiler the option to make operations concurrent at-will; safe caching/memoization; etc).

Moreover, introducing things that *aren't* known to be immutible, thread-safe, deterministic operations into an environment where those guarantees otherwise exist means that a lot of optimizations are suddenly off-the-table / unsafe.

jean-marc Mercier

10:45 p.m.

point ! I'll try to refactor my code to include maps.

However, as mentionned above, I will have most probably to include JAVA librairies into my modules (maths), and might loose overall Immutability through these as well.

Thx everybody for the very helpful advises.

Cheers

Jean-Marc

2013/11/12 Charles Duffy charles@dyfis.net

...

On Tue, Nov 12, 2013 at 3:05 PM, jean-marc Mercier < jeanmarc.mercier@gmail.com> wrote:

...
Thus I could use maps to simulate HashSet, it not a very big overload. However, is there any incentive to trade off 20% performance ?

Immutability is its own reward, or, rather, comes with its own set of them. :)

Working with immutable data makes lock-free thread-safety comparatively trivial, and enables a wide array of aggressive optimizations (including providing the compiler the option to make operations concurrent at-will; safe caching/memoization; etc).

Moreover, introducing things that *aren't* known to be immutible, thread-safe, deterministic operations into an environment where those guarantees otherwise exist means that a lot of optimizations are suddenly off-the-table / unsafe.

Christian Grün

11:08 p.m.

Charles already pointed out well why it is beneficial to stick with the functional approach. It takes some time to comprehend the functional coding paradigm, but it really helps to write better code in general, even in imperative languages! Well, at least it helped me.

...

However, as mentionned above, I will have most probably to include JAVA librairies into my modules (maths), and might loose overall Immutability through these as well.

If you write your own Java Module, and if the result of a function is always the same for a given input, you can annotate this function as "deterministic" [1] (which should be the case for most mathematical operations, except creating random numbers). This way, it will be rewritten and optimized similar to native XQuery code.

Christian

[1] http://docs.basex.org/wiki/Java_Bindings#Context-Awareness ___________________________________

...

2013/11/12 Charles Duffy charles@dyfis.net

...
On Tue, Nov 12, 2013 at 3:05 PM, jean-marc Mercier jeanmarc.mercier@gmail.com wrote:

...
Thus I could use maps to simulate HashSet, it not a very big overload. However, is there any incentive to trade off 20% performance ?

Immutability is its own reward, or, rather, comes with its own set of them. :)

Working with immutable data makes lock-free thread-safety comparatively trivial, and enables a wide array of aggressive optimizations (including providing the compiler the option to make operations concurrent at-will; safe caching/memoization; etc).

Moreover, introducing things that *aren't* known to be immutible, thread-safe, deterministic operations into an environment where those guarantees otherwise exist means that a lot of optimizations are suddenly off-the-table / unsafe.

BaseX-Talk mailing list BaseX-Talk@mailman.uni-konstanz.de https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk

4418

Age (days ago)

4418

Last active (days ago)

basex-talk@mailman.uni-konstanz.de

9 comments

3 participants

tags (0)

participants (3)

Charles Duffy
Christian Grün
jean-marc Mercier