It's not necessarily true that immutable data structures perform an actual in-memory copy. See for instance Phil Bagwell's hash tries -- only minimal amounts of the data structure are actually copied during updates.
Exactly. By the way, BaseX is indeed based on Phil Bagwell’s hash tries..
I don't know whether BaseX is making use of these optimizations (Christian could speak to that), but jumping directly to "it's immutible, so it must be inefficient" is not necessarily well-founded.
On Tue, Nov 12, 2013 at 10:48 AM, jean-marc Mercier jeanmarc.mercier@gmail.com wrote:
Dear Christian,
Thanks for your answer.
It may be interesting to do some testing in order to find out what’s the actual bottleneck in your query. How man entries is your hash set supposed to contain?
The HashSet contains 10M strings. That's not a big deal. The problems comes from filling the HashSet. Well, let us be more technacal. I am using hashset for tree traversal operations over a DataBase of 10M nodes, 1Go. This HashSet collects some informations during the traversal, mainly a string at each node. Actually, my code binds JAVA with something like
import module namespace set = "java.util.HashSet";
declare function init(){set:clear(), tree_traversal($my_big_DataBase), do_some_stuff_with_hash_set()} declare function tree_traversal($elements) { for $element in $elements return ( set:add(get_some_stuff_within($element) ), tree_traversal($element/element()) ) } and the whole traversal takes a minute.
I don't know how to achieve this with pure XQUERY. The only thing that I tried is something like :
declare function init(){let $my_hash_set := tree_traversal($my_big_DataBase, map:new()) return do_some_stuff_with($my_hash_set) }
declare function tree_traversal($elements, $hash_set) as function(*) (return a HashSet) { for $element in $elements return ( tree_traversal($nodes/element(), $hash_set), map:entry( get_some_stuff_within($element),boolean) ) } that performed very badly, compared to the first version (JAVA binding). I explained this thinking that the first version is O(N) operations, and the second is O(N^2) operations due to in-memory copy.
- second : it lacks powerful libraries, as complete math modules.
What kind of functions would you be interested in?
I need a stochastic modulus, as http://www.boost.org/doc/libs/1_55_0/libs/math/doc/html/dist.html also a linear Algebra modulus as UBLAS http://www.boost.org/doc/libs/1_54_0/libs/numeric/ublas/doc/index.htm
I guess that I could find JAVA equivalent and bind them to BaseX. Such functionalities are available directly as XQUERY modules ?
Cheers,
Jean-Marc
2013/11/12 Christian Grün christian.gruen@gmail.com
Dear JohnLeM,
thanks for your mail. As you already noted, XQuery is a functional language, and this is the reason why XQuery maps are not exactly comparable to maps and sets, as they are used in imperative languages:
All maps in XQuery are persisent (immutable): Once a map has been generated, it is not possible to change its contents. Instead, a new map is to be created by each insertion or deletion [1]. This sounds like a huge memory killer, but it’s not as bad as you might guess. Various efficient solutions exist for persistent map, such as the mapped trie that has been implemented in BaseX [2]. It will only create copies of parts of the data structure that are to be changed. The following query is an example for a query which creates 100.000 map with a single entry, and a large map containing all the entries; on my system, it requires 200 ms to run:
map:size(map:new( for $i in 1 to 100000 return map { $i := true() } ))
In short: persistent maps may not be as efficient as mutable maps, but they are usually not the bottleneck in XQuery applications, because deleted entries (or obsolete maps) will automatically be discarded by the garbage collector as soon as they are not in the query scope anymore. If you want to enforce this, you can put your map operations into FLWOR expresions or user-declared functions.
Back to your original questions:
- This module must expose a method "hashset:clear($hashset)" to
de-allocate memory dynamically. The map:module provides the function map:remove, and I could remove all elements. [...] It does not deallocate memory, leading to poor overall performances.
It may be interesting to do some testing in order to find out what’s the actual bottleneck in your query. How man entries is your hash set supposed to contain?
- must expose a method "hashset:add($hashset,$element)" to add memory
dynamically. Through map:module, the only possibility that I see is to wrap it as map:new($hashset, map:entry($element,$dummyboolean)).
Right, using true() is probably the best choice (booleans are only instantiated once in memory).
- first : its dynamic memory management (the in-memory printfoot of my
XQUERY executables are usually tremendous).
This can in fact be a problem, and is mostly due to various decisions taken in the specification, and the complexity of XML nodes in general.
- second : it lacks powerful libraries, as complete math modules.
What kind of functions would you be interested in?
Christian
[1] http://en.wikipedia.org/wiki/Persistent_data_structure [2] http://en.wikipedia.org/wiki/Hash_array_mapped_trie
BaseX-Talk mailing list BaseX-Talk@mailman.uni-konstanz.de https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk
BaseX-Talk mailing list BaseX-Talk@mailman.uni-konstanz.de https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk