Hi Christian --
The content set of interest is a body of documentation that is being rewritten to improve it. The idea is to identify paragraphs that are similar enough that they should share the same standard wording once rewritten.
So, given input like:
<document>
<p>Under no circumstances should you rig an antenna during a thunderstorm.</p>
<p>It is important to dis-connect the device from all power.</p>
<p>You will need a number two phillips screwdriver.</p>
<p>It is important to disconnect the devices from all power.</p>
<p>You will need a #2 Phillips screwdriver.</p>
<p>It is important to disconnect the devices from ALL power.</p>
<p>Graphics card; do not eat.</p>
</document>
I'd want to be able to get output like:
<bucket>
<similar-paragraphs>
<p>It is important to dis-connect the device from all power.</p>
<p>It is important to disconnect the devices from all power.</p>
<p>It is important to disconnect the devices from ALL power.</p>
</similar-paragraphs>
<similar-paragraphs>
<p>You will need a number two phillips screwdriver.</p>
<p>You will need a #2 Phillips screwdriver.</p>
</similar-paragraphs>
<similar-paragraphs>
<p>Under no circumstances should you rig an antenna during a thunderstorm.</p>
</similar-paragraphs>
<similar-paragraphs>
<p>Graphics card; do not eat.</p>
</similar-paragraphs>
</bucket>
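To make the grouping concrete, here is a rough sketch in Python of one way it might be done. Everything in it is an assumption for illustration rather than a settled approach: the function names, the normalization rule, the use of difflib's SequenceMatcher as the similarity measure, the 0.8 threshold, and the shortcut of comparing each paragraph only against the first member of each existing bucket.

```python
import re
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher

SAMPLE = """<document>
<p>Under no circumstances should you rig an antenna during a thunderstorm.</p>
<p>It is important to dis-connect the device from all power.</p>
<p>You will need a number two phillips screwdriver.</p>
<p>It is important to disconnect the devices from all power.</p>
<p>You will need a #2 Phillips screwdriver.</p>
<p>It is important to disconnect the devices from ALL power.</p>
<p>Graphics card; do not eat.</p>
</document>"""


def normalize(text):
    """Lower-case, drop punctuation (so 'dis-connect' matches 'disconnect'),
    and collapse runs of whitespace. Hypothetical rule, tune to taste."""
    text = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return " ".join(text.split())


def bucket_paragraphs(xml_text, threshold=0.8):
    """Greedy grouping: each <p> joins the first bucket whose first member
    is at least `threshold` similar; otherwise it starts a new bucket."""
    buckets = []  # each bucket is a list of the original paragraph strings
    for p in ET.fromstring(xml_text).iter("p"):
        text = "".join(p.itertext())
        for bucket in buckets:
            matcher = SequenceMatcher(None, normalize(bucket[0]), normalize(text))
            if matcher.ratio() >= threshold:
                bucket.append(text)
                break
        else:
            buckets.append([text])
    # Largest buckets first; the sort is stable, so ties keep document order.
    return sorted(buckets, key=len, reverse=True)


def to_xml(buckets):
    """Serialize the buckets as <bucket>/<similar-paragraphs>/<p> (not pretty-printed)."""
    root = ET.Element("bucket")
    for bucket in buckets:
        group = ET.SubElement(root, "similar-paragraphs")
        for text in bucket:
            ET.SubElement(group, "p").text = text
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(to_xml(bucket_paragraphs(SAMPLE)))
```

On the sample input this puts the three "disconnect" paragraphs in one bucket, the two screwdriver paragraphs in another, and leaves the remaining two on their own. Comparing only against the first member of a bucket keeps the sketch short but makes the result depend on paragraph order, and the threshold would need tuning against real content.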
Thanks!
Graydon