[basex-talk] Re: aligning sequences of text?

12 Feb 2026

Another option is to string-join the sequences with a unique separator
character, then do a straightforward string diff. The XSLT function
tan:diff() is efficient and produces high-quality results. You would
then need to post-process the results to get your integer pairs. But
presumably you want those integer pairs not as an end in themselves but
as a means to some other task, and the output of tan:diff() may get you
there more quickly. You might also be able to skip what I presume is a
preprocessing step that turns two strings into two sequences of
tokenized strings.
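
To make the join-then-diff idea concrete, here is a minimal sketch in
Python. It is not tan:diff(): it uses difflib.SequenceMatcher from the
standard library as a stand-in string diff, and the names SEP,
token_spans and matching_pairs are made up for the illustration. It
assumes every token is non-empty and never contains the separator
character. Positions are 1-based, as they would be in XPath.

```python
from difflib import SequenceMatcher

SEP = "\x1f"  # assumption: this character never occurs inside a token


def token_spans(tokens):
    """Return (start, end) character offsets of each token in SEP.join(tokens)."""
    spans, pos = [], 0
    for tok in tokens:
        spans.append((pos, pos + len(tok)))
        pos += len(tok) + 1  # step over the separator
    return spans


def matching_pairs(a, b):
    """Return 1-based (i, j) pairs where token i of a was matched whole to token j of b."""
    joined_a, joined_b = SEP.join(a), SEP.join(b)
    spans_a, spans_b = token_spans(a), token_spans(b)
    # Look up a token of b by the character offset at which it starts.
    start_b = {start: j for j, (start, _) in enumerate(spans_b)}
    matcher = SequenceMatcher(None, joined_a, joined_b, autojunk=False)
    pairs = []
    for block in matcher.get_matching_blocks():
        shift = block.b - block.a
        block_end = block.a + block.size
        for i, (start, end) in enumerate(spans_a):
            # Keep a token only if it lies entirely inside the matched run
            # and lines up with exactly one whole token on the other side.
            if start >= block.a and end <= block_end:
                j = start_b.get(start + shift)
                if j is not None and spans_b[j][1] == end + shift:
                    pairs.append((i + 1, j + 1))
    return pairs
```

The post-processing step is the inner loop: a character-level diff can
match part of a word, so only tokens covered completely by one matching
run, and aligned with a whole token on the other side, are reported.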

tan:diff() is written in XSLT, not XQuery, but you should be able to
call it from XQuery via fn:transform().

Code:
https://textalign.net/

Background:
https://www.balisage.net/Proceedings/vol26/html/Kalvesmaki01/BalisageVol26-K...

Best wishes,

Joel

On 2026-02-12 07:21, Graydon Saunders via BaseX-Talk wrote:
...
Thank you! I can foresee some brain stretching in my future.
And yes, just two sequences of text, and what should be very similar
text. (I'm trying to write tests for a conversion process.)
-- Graydon

On Thu, Feb 12, 2026, at 07:12, David Birnbaum wrote:
...
With just two sequences you can use Needleman-Wunsch. It’s a
dynamic programming algorithm that provides an optimal alignment
(a good thing, although there may be more than one optimal alignment),
but it doesn’t scale well (not a good thing). I describe an XSLT 3.0
implementation in my 2020 XMLPrague paper at
https://archive.xmlprague.cz/2020/files/xmlprague-2020-proceedings.pdf
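
For readers who have not met Needleman-Wunsch, a minimal sketch in
Python follows. It is not the XSLT 3.0 implementation from the paper;
the function name and the scoring values (match=1, mismatch=-1, gap=-1)
are assumptions chosen for the illustration. The full
(len(a)+1) x (len(b)+1) score table it builds is the reason the method
scales poorly.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align token lists a and b.

    Returns one (i, j) pair per alignment column, using 1-based input
    positions, with None standing for a gap on that side.
    """
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for placing a[i-1] opposite b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # pair the two tokens
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal
    # alignment (there may be others with the same score).
    columns = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            columns.append((i, j))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            columns.append((i, None))
            i -= 1
        else:
            columns.append((None, j))
            j -= 1
    columns.reverse()
    return columns
```

A diagonal step can also pair two different tokens (a substitution), so
to get only the matching pairs, keep the columns where both positions
are present and the tokens are equal.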
...
Your question doesn’t clarify whether you’re looking for index
numbers in the alignment (where a word in one input might be matched
by a gap in the other) or in the inputs (where aligned words share a
position in the alignment but may have different positions in the
inputs). For either of those interpretations, though, a solution
will begin by finding an alignment.
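
The difference between the two kinds of index numbers can be shown with
a small Python sketch. It assumes an alignment has already been found
and is given as two equal-length rows in which None marks a gap; the
function name index_views is made up for the illustration.

```python
def index_views(row_a, row_b):
    """For each alignment column, report (column, position in a, position in b).

    row_a and row_b are equal-length aligned rows; None marks a gap.
    All numbers are 1-based; a gap has no input position, so it is None.
    """
    pos_a = pos_b = 0
    views = []
    for column, (x, y) in enumerate(zip(row_a, row_b), start=1):
        if x is not None:
            pos_a += 1
        if y is not None:
            pos_b += 1
        views.append((
            column,
            pos_a if x is not None else None,
            pos_b if y is not None else None,
        ))
    return views
```

With row_a = ['The', 'words', 'are', None, 'sequence', 'members'] and
row_b = ['The', 'words', 'are', 'not', 'sequence', 'members'], the word
'sequence' sits in alignment column 5 on both sides, but at position 4
in the first input and position 5 in the second.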
David J. Birnbaum
djbpitt@gmail.com
...
On Feb 11, 2026, at 9:41 PM, Graydon Saunders
<graydonish@fastmail.com> wrote:
...

Hello!
If I have two (fairly long) sequences of text, such as ('The', 'words',
'are', 'sequence', 'members'), and I want all the index numbers of
matching pairs even though the sequences only mostly match (so a
word, or several words, can be missing from sequence A or sequence
B), is there an established algorithm for doing this?
(If I search on "aligning sequences" I get bioinformatics about
gene sequences; if I search on "aligning text" I get typography.)
Thanks!
Graydon
-- 
Joel Kalvesmaki
Director, Text Alignment Network
http://textalign.net