Christian,
Thx for your query. I have tested this code against a JAVA binding with HashSet, running a BaseX instance (i.e. declare variable $my-big-DataBase := db:open(..) ).
I've taken this function, more representative of my needs :
declare function local:get-some-stuff-within($element) { generate-id($element) };
Here are the result (I ran the test 4/5 times to be sure that my environment did not perturb the experiment) :
Java Hashset : start : 21:34:21.98 size HashSet : 17568843 end : 21:35:05.93
BaseX map based hashset
start : 21:37:56.59 size map : 17568843 end : 21:38:50.43
Some remarks :
- Without surprise, storing docs into BaseX accelerate the query. - Surpringly for me, this map code behave nicely (mine was terrible ! I don't know why). Is there something I should know about fold-left ? Well, clearly, I have a lot to learn about this langage. - HashSet seems to be 20 % faster, with a smaller memory footprint (10%, (4,3 against 4, unprecise measure, since I was looking to my Task Manager !) ).
Thus I could use maps to simulate HashSet, it not a very big overload. However, is there any incentive to trade off 20% performance ?
Cheers,
Jean-Marc
2013/11/12 Christian Grün christian.gruen@gmail.com
Hi Jean-Marc,
I’m not sure how your actual data looks like, but you could have a look the attached query, which uses XQuery maps, and which allowe d us to inser t 10 million strings in about 40 seconds . Most presumably, it’s not as memory efficient as a Java map, but my guess is that most memory is actually spent by the XQuery representation of string items, and not necessarily the map itself.
Hope this helps, Christian ___________________________ ___________________________
declare variable $my-big-DataBase := doc('db')/root/*;
declare function local:do-some-stuff-with-set($map) { count(map:keys($map)) };
declare function local:init() { local:do-some-stuff-with-set(local:tree-traversal($my-big-DataBase, map:new())) };
declare function local:tree-traversal($elements, $map) { fold-left($elements, $map, function($map2, $element) { let $map3 := map:new(($map2, { local:get-some-stuff-within($element) : true() })) return local:tree-traversal($element/element(), $map3) } ) };
declare function local:get-some-stuff-within($element) { $element/@name };
local:init() ___________________________ ___________________________
On Tue, Nov 12, 2013 at 5:48 PM, jean-marc Mercier < jeanmarc.mercier@gmail.com> wrote:
Dear Christian,
Thanks for your answer.
It may be interesting to do some testing in order to find out what’s the actual bottleneck in your query. How man entries is your hash set supposed to contain?
The HashSet contains 10M strings. That's not a big deal. The problems
comes
from filling the HashSet. Well, let us be more technacal. I am using hashset for tree traversal operations over a DataBase of 10M nodes, 1Go. This HashSet collects some informations during the traversal, mainly a string at each node.
Actually,
my code binds JAVA with something like
import module namespace set = "java.util.HashSet";
declare function init(){set:clear(), tree_traversal($my_big_DataBase), do_some_stuff_with_hash_set()} declare function tree_traversal($elements) { for $element in $elements return ( set:add(get_some_stuff_within($element) ), tree_traversal($element/element()) ) } and the whole traversal takes a minute.
I don't know how to achieve this with pure XQUERY. The only thing that I tried is something like :
declare function init(){let $my_hash_set :=
tree_traversal($my_big_DataBase,
map:new()) return do_some_stuff_with($my_hash_set) }
declare function tree_traversal($elements, $hash_set) as function(*)
(return
a HashSet) { for $element in $elements return ( tree_traversal($nodes/element(), $hash_set), map:entry( get_some_stuff_within($element),boolean) ) } that performed very badly, compared to the first version (JAVA binding).
I
explained this thinking that the first version is O(N) operations, and
the
second is O(N^2) operations due to in-memory copy.
- second : it lacks powerful libraries, as complete math modules.
What kind of functions would you be interested in?
I need a stochastic modulus, as http://www.boost.org/doc/libs/1_55_0/libs/math/doc/html/dist.html also a linear Algebra modulus as UBLAS http://www.boost.org/doc/libs/1_54_0/libs/numeric/ublas/doc/index.htm
I guess that I could find JAVA equivalent and bind them to BaseX. Such functionalities are available directly as XQUERY modules ?
Cheers,
Jean-Marc
2013/11/12 Christian Grün christian.gruen@gmail.com
Dear JohnLeM,
thanks for your mail. As you already noted, XQuery is a functional language, and this is the reason why XQuery maps are not exactly comparable to maps and sets, as they are used in imperative languages:
All maps in XQuery are persisent (immutable): Once a map has been generated, it is not possible to change its contents. Instead, a new map is to be created by each insertion or deletion [1]. This sounds like a huge memory killer, but it’s not as bad as you might guess. Various efficient solutions exist for persistent map, such as the mapped trie that has been implemented in BaseX [2]. It will only create copies of parts of the data structure that are to be changed. The following query is an example for a query which creates 100.000 map with a single entry, and a large map containing all the entries; on my system, it requires 200 ms to run:
map:size(map:new( for $i in 1 to 100000 return map { $i := true() } ))
In short: persistent maps may not be as efficient as mutable maps, but they are usually not the bottleneck in XQuery applications, because deleted entries (or obsolete maps) will automatically be discarded by the garbage collector as soon as they are not in the query scope anymore. If you want to enforce this, you can put your map operations into FLWOR expresions or user-declared functions.
Back to your original questions:
- This module must expose a method "hashset:clear($hashset)" to
de-allocate memory dynamically. The map:module provides the function map:remove,
and
I could remove all elements. [...] It does not deallocate memory,
leading
to poor overall performances.
It may be interesting to do some testing in order to find out what’s the actual bottleneck in your query. How man entries is your hash set supposed to contain?
- must expose a method "hashset:add($hashset,$element)" to add memory
dynamically. Through map:module, the only possibility that I see is to wrap it as map:new($hashset, map:entry($element,$dummyboolean)).
Right, using true() is probably the best choice (booleans are only instantiated once in memory).
- first : its dynamic memory management (the in-memory printfoot of my
XQUERY executables are usually tremendous).
This can in fact be a problem, and is mostly due to various decisions taken in the specification, and the complexity of XML nodes in general.
- second : it lacks powerful libraries, as complete math modules.
What kind of functions would you be interested in?
Christian
[1] http://en.wikipedia.org/wiki/Persistent_data_structure [2] http://en.wikipedia.org/wiki/Hash_array_mapped_trie