Hello all,

I’m using BaseX to cluster a set of millions of small XML fragments which look something like this:

 

<affiliation>
    <organization>Institut für Organische Chemie der Universität Heidelberg</organization>
    <country iso-code="DEU"/>
</affiliation>

I need to cluster based on the similarity of the whole fragment, that is, taking elements, attributes and text nodes into account.

If I use the entire XML fragment as a grouping key, something like this:

 

for $a at $c in db:open('DB')/item/*/affiliation
group by $val := $a

… then will the grouping be equivalent to the functionality of the deep-equal function? First results seem to suggest this, but I want to make sure that grouping is not done on text node value alone or anything like that.
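
To make the question concrete, here is a minimal sketch of the kind of check I have in mind. The two fragments and the organization name are made up for illustration and are not taken from my data; they have identical text content and differ only in the iso-code attribute.

```xquery
(: Two hypothetical fragments: same text content, different attribute value. :)
let $fragments := (
  <affiliation>
    <organization>Example Institute</organization>
    <country iso-code="DEU"/>
  </affiliation>,
  <affiliation>
    <organization>Example Institute</organization>
    <country iso-code="FRA"/>
  </affiliation>
)
return (
  (: What deep-equal says about the pair. :)
  deep-equal($fragments[1], $fragments[2]),
  (: Size of each group when the fragment itself is the grouping key. :)
  for $a in $fragments
  group by $val := $a
  return count($a)
)
```

If the grouping part returns a single group of size 2 for a pair that deep-equal reports as different, then the two are not equivalent and attributes are being ignored in the grouping key.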

 

Incidentally, BaseX is simply unbelievably fast at executing this – a million fragments clustered and written out to another DB in 16 seconds on a laptop. My congratulations on an amazing product.

 

Regards,

Constantine


