On Fri, Jun 11, 2021 at 09:15:02AM +0200, Martin Honnen scripsit:
> Of course contains(string(B), string(A)) would alone suffice to check for a partial match.
I think I managed to express the problem badly.
Document A is the original; it has some quantity of text. Document B is produced from Document A via a transformation process that is known to add text by doing things like getting display text for an internal link from the reference target, inserting boilerplate legal disclaimers, inserting standard text ("Chapter 1", etc.), and so on.
The goal is to perform a test to ensure that B still has all of A's text, in the same order, after this transformation has taken place.
So no single string comparison will work; it will certainly fail. Because text is inserted at multiple points in the document, and each insertion can be of arbitrary length, there's some question of the appropriate scale of testing. (Hashing notional sentences and comparing the hash values finds differences but doesn't provide a test for order, for example.)
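To illustrate the limitation of the hashing approach, here is a rough Python sketch. The function name `sentence_hashes` and the naive split on "." are assumptions made for the example, not part of any real pipeline; the point is only that comparing sets of hashes says nothing about order.

```python
import hashlib


def sentence_hashes(text):
    """Return the set of SHA-256 digests of the notional sentences in text.

    Assumption: a "sentence" is whatever lies between full stops, which is
    far cruder than real sentence detection.
    """
    return {
        hashlib.sha256(s.strip().encode("utf-8")).hexdigest()
        for s in text.split(".")
        if s.strip()
    }


def all_sentences_present(a_text, b_text):
    """True if every sentence hash from A also appears in B (order ignored)."""
    return sentence_hashes(a_text) <= sentence_hashes(b_text)
```

With A as "First point. Second point." and B as "Second point. Boilerplate. First point.", `all_sentences_present` returns True even though B has A's sentences in the wrong order. Text inserted inside a sentence also changes that sentence's hash, so a legitimate insertion is reported as a difference.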
I was hoping that there was a known algorithm for this kind of testing; I have trouble believing I'm the first person who has wanted to do it.
It looks like ratcheting through B's text with A's text, one word at a time, gives useful results. It doesn't allow for locating the error beyond "something, somewhere, is a problem" but that can be handled through diff tools and the full text of each document.
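A minimal sketch of that ratchet in Python, assuming both documents have already been reduced to plain text and that a "word" is any whitespace-delimited token. The function name `has_all_words_in_order` is hypothetical. It checks that A's words form an ordered subsequence of B's words, which is the same pass/fail answer described above.

```python
def has_all_words_in_order(a_text, b_text):
    """Return True if every word of a_text occurs in b_text, in the same order.

    Assumption: words are whitespace-delimited tokens, so inserted text that
    is glued onto an original word with no space will cause a failure.
    """
    b_words = iter(b_text.split())
    # "word in b_words" consumes the iterator up to and including the first
    # match, so each search resumes where the previous one stopped. That is
    # the ratchet: B's position only ever moves forward.
    return all(word in b_words for word in a_text.split())
```

For example, with A as "The quick fox jumps" and B as "Chapter 1 The quick brown fox jumps over", the function returns True; with A as "fox quick" against the same B, it returns False because the order differs. Taking the earliest possible match for each word is safe here, since matching earlier never rules out a later match.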
Thanks!
Graydon