Hello all,
I’m using BaseX to cluster a set of millions of small XML fragments which look something like this:
<affiliation>
  <organization>Institut für Organische Chemie der Universität Heidelberg</organization>
  <country iso-code="DEU"/>
</affiliation>
I need to cluster the fragments by similarity of the whole fragment, taking into account elements, attributes and text nodes.
If I use the entire XML fragment as a grouping key, something like this:
for $a at $c in db:open('DB')/item/*/affiliation
group by $val := $a
… will the grouping then be equivalent to comparing the fragments with the deep-equal function? My first results seem to suggest so, but I want to make sure that the grouping is not done on the text node values alone or anything like that.
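To make the question concrete, here is a minimal sketch of the check I have in mind. The two inline fragments are made up for illustration (they are not taken from my database) and differ only in the iso-code attribute, so deep-equal returns false for them. If the grouping behaved like deep-equal, the count would be 2; a count of 1 would mean the key is reduced to the text value.

```xquery
(: Two hypothetical fragments, identical except for the iso-code attribute. :)
let $frags := (
  <affiliation>
    <organization>Example Institute</organization>
    <country iso-code="DEU"/>
  </affiliation>,
  <affiliation>
    <organization>Example Institute</organization>
    <country iso-code="FRA"/>
  </affiliation>
)
return (
  (: false, because the attribute values differ :)
  deep-equal($frags[1], $frags[2]),
  (: number of groups produced when the whole fragment is the grouping key :)
  count(
    for $a in $frags
    group by $val := $a
    return $val
  )
)
```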
Incidentally, BaseX is simply unbelievably fast at executing this – a million fragments clustered and written out to another DB in 16 seconds on a laptop. My congratulations on an amazing product.
Regards,
Constantine