Re: [basex-talk] differencing the string value of documents

12 Jun 2021


On Fri, Jun 11, 2021 at 09:15:02AM +0200, Martin Honnen scripsit:
...
> Of course contains(string(B), string(A)) would alone suffice to check
> for a partial match.

I think I managed to express the problem badly.
Document A is the original; it has some quantity of text.  Document B is
produced from Document A via a transformation process that is known to
add text by doing things like getting display text for an internal link
from the reference target, inserting boilerplate legal disclaimers,
inserting standard text ("Chapter 1", etc.), and so on.
The goal is to perform a test to ensure that B still has all of A's
text, in the same order, after this transformation has taken place.
So no single string comparison will work; it will certainly fail.
Because there are multiple points of insertion in the document, each of
arbitrary length, it's unclear what the appropriate scale of testing
is.  (Hashing each notional sentence and comparing the hash values
finds differences but doesn't provide a test for order, for example.)
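
To illustrate the parenthetical above, here is a minimal Python sketch of
a per-sentence hash comparison. The sentence splitter (break after ".",
"!" or "?" followed by whitespace) and the function names are assumptions
made for the example, not anything from the thread. It reports sentences
of A that are absent from B, and it stays silent when B merely reorders
them.

```python
import hashlib
import re


def sentence_hashes(text):
    """Hash each notional sentence of `text` and return the set of digests.

    Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return {hashlib.sha256(s.encode("utf-8")).hexdigest() for s in sentences}


def missing_sentences(a_text, b_text):
    """Return the hashes of A's sentences that occur nowhere in B."""
    return sentence_hashes(a_text) - sentence_hashes(b_text)


a = "First point. Second point."
b_reordered = "Second point. Chapter 1. First point."

# Every sentence of A is present in B, so nothing is reported,
# even though B has the sentences in the wrong order.
print(missing_sentences(a, b_reordered))
```

Because the comparison is between sets, position is discarded, which is
why this approach cannot serve as an order test on its own.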
I was hoping that there was a known algorithm for this kind of testing;
I have trouble believing I'm the first person who has wanted to do it.
It looks like ratcheting through B's text with A's text, one word at a
time, gives useful results.  It doesn't allow for locating the error
beyond "something, somewhere, is a problem" but that can be handled
through diff tools and the full text of each document.
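
The word-at-a-time ratchet described above amounts to checking that A's
words form a subsequence of B's words. A minimal sketch in Python follows
(Python rather than XQuery only so that it is easy to run anywhere). It
assumes whitespace tokenization, so an insertion that attaches punctuation
to a word or splits a word would register as a mismatch. The function
names are hypothetical.

```python
def words(text):
    """Split text into words on any run of whitespace."""
    return text.split()


def first_unmatched(a_text, b_text):
    """Ratchet through B's words with A's words, in order.

    Returns the index (within A's word list) of the first word that
    cannot be found in the remainder of B, or None if all of A's words
    occur in B in the same order.
    """
    b_iter = iter(words(b_text))
    for index, word in enumerate(words(a_text)):
        # `word in b_iter` advances b_iter up to and including the first
        # match, so B is only ever scanned forward.
        if word not in b_iter:
            return index
    return None


def b_preserves_a(a_text, b_text):
    """True if B contains all of A's words in A's order."""
    return first_unmatched(a_text, b_text) is None


a = "the quick brown fox"
print(b_preserves_a(a, "Chapter 1 the very quick brown fox jumps"))
print(b_preserves_a(a, "quick the brown fox"))
```

The index returned on failure is only a rough hint: an earlier word of A
may have been matched against inserted text, so the real problem can lie
before the reported position. That is consistent with the "something,
somewhere, is a problem" limitation noted above.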
Thanks!
Graydon
-- 
Graydon Saunders  | graydonish@gmail.com
Þæs oferéode, ðisses swá mæg.
-- Deor  ("That passed, so may this.")
