Hello all,

I’m using BaseX to cluster a set of millions of small XML fragments which look something like this:

<affiliation>

<organization>Institut für Organische Chemie der Universität Heidelberg</organization>

</affiliation>

I need to cluster based on fragment similarity – so taking into account elements, attributes and text nodes.

If I use the entire XML fragment as a grouping key, something like this:

for $a at $c in db:open('DB')/item/*/affiliation

group by $val := $a

… then will the grouping be equivalent to comparing the fragments with the deep-equal function? My first results seem to suggest so, but I want to make sure that grouping is not done on the text node values alone, or anything like that.
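
A small test along the following lines might settle the question. This is only a sketch: the two sample fragments are made up for illustration and carry the same text inside different elements. As far as I can tell from the XQuery 3.0 specification, grouping keys are atomized, so I would expect both fragments to land in a single group if only the string value is compared, and in two groups if the comparison behaved like deep-equal on the nodes.

```xquery
(: Hypothetical sample data: identical text content, different markup. :)
let $items := (
  <affiliation><organization>Heidelberg</organization></affiliation>,
  <affiliation><city>Heidelberg</city></affiliation>
)
return (
  (: Node-level comparison of the two fragments, for reference. :)
  <deep-equal>{ deep-equal($items[1], $items[2]) }</deep-equal>,

  (: Same grouping pattern as in the query above; after grouping,
     $a holds every fragment that shares the key $val. :)
  for $a in $items
  group by $val := $a
  return <group size="{ count($a) }">{ $val }</group>
)
```

If the grouping does turn out to be based on the string value, grouping on serialize($a) instead of $a might be one workaround, although I assume it would then be sensitive to differences in attribute order and whitespace.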

Incidentally, BaseX is simply unbelievably fast at executing this – a million fragments clustered and written out to another DB in 16 seconds on a laptop. My congratulations on an amazing product.

Regards,

Constantine
