aligning sequences of text?
Hello! If I have two (fairly long) sequences of text, ('The', 'words', 'are', 'sequence', 'members') and I want all the index numbers of matching pairs despite the sequences only mostly matching (so a word, or several words, can be missing from sequence A or sequence B), is there an established algorithm for doing this? (If I search on "aligning sequences" I get bioinformatics about gene sequences; if I search on "aligning text" I get typography.) Thanks! Graydon
On Wed, 11 Feb 2026 21:41:25 -0500 Graydon Saunders via BaseX-Talk <basex-talk@mailman.uni-konstanz.de> wrote:
If I have two (fairly long) sequences of text, ('The', 'words', 'are', 'sequence', 'members') and I want all the index numbers of matching pairs despite the sequences only mostly matching (so a word, or several words, can be missing from sequence A or sequence B), is there an established algorithm for doing this?
There are several - Myers, MacIlroy (of Unix fame) and others, have published papers on the longest matching subsequence problem that is at the heart e.g. of the Unix diff program. liam -- Liam Quin: Delightful Computing - Training and Consultancy in XSLT / XML Markup / Typography / CSS / Accessibility / and more... Outreach for the GNU Image Manipulation Program Vintage art digital files - fromoldbooks.org
With just two sequences you can use Needleman-Wunsch. It’s a dynamic programming algorithm that provides an optimal alignment (good thing, although there may be more than one optimal alignment), but it doesn’t scale well (not good thing). I describe an XSLT 3.0 implementation in my 2020 XMLPrague paper at https://archive.xmlprague.cz/2020/files/xmlprague-2020-proceedings.pdf Your question doesn’t clarify whether you’re looking for index numbers in the alignment (where a word in one input might be matched by a gap in the other) or in the inputs (where aligned words share a position in the alignment but may have different positions in the inputs). For either of those interpretations, though, a solution will begin by finding an alignment. David J. Birnbaum djbpitt@gmail.com
On Feb 11, 2026, at 9:41 PM, Graydon Saunders <graydonish@fastmail.com> wrote:
Hello!
If I have two (fairly long) sequences of text, ('The', 'words', 'are', 'sequence', 'members') and I want all the index numbers of matching pairs despite the sequences only mostly matching (so a word, or several words, can be missing from sequence A or sequence B), is there an established algorithm for doing this?
(If I search on "aligning sequences" I get bioinformatics about gene sequences; if I search on "aligning text" I get typography.)
Thanks! Graydon
Thank you! I can foresee some brain stretching in my future. And yes, just two sequences of text, and what should be very similar text. (I'm trying to write tests for a conversion process.) -- Graydon On Thu, Feb 12, 2026, at 07:12, David Birnbaum wrote:
With just two sequences you can use Needleman-Wunsch. It’s a dynamic programming algorithm that provides an optimal alignment (good thing, although there may be more than one optimal alignment), but it doesn’t scale well (not good thing). I describe an XSLT 3.0 implementation in my 2020 XMLPrague paper at https://archive.xmlprague.cz/2020/files/xmlprague-2020-proceedings.pdf
Your question doesn’t clarify whether you’re looking for index numbers in the alignment (where a word in one input might be matched by a gap in the other) or in the inputs (where aligned words share a position in the alignment but may have different positions in the inputs). For either of those interpretations, though, a solution will begin by finding an alignment.
David J. Birnbaum djbpitt@gmail.com
On Feb 11, 2026, at 9:41 PM, Graydon Saunders <graydonish@fastmail.com> wrote: Hello!
If I have two (fairly long) sequences of text, ('The', 'words', 'are', 'sequence', 'members') and I want all the index numbers of matching pairs despite the sequences only mostly matching (so a word, or several words, can be missing from sequence A or sequence B), is there an established algorithm for doing this?
(If I search on "aligning sequences" I get bioinformatics about gene sequences; if I search on "aligning text" I get typography.)
Thanks! Graydon
Another option is to string-join the sequences via a unique character, then do a straightforward string diff. The XSLT function tan:diff() is efficient and results in high quality. You would then need to do some post-processing on the results to get your integer pairs. But presumably you're getting those integer pairs not as an end in itself but as a means to some other task, and the output of tan:diff() may get you there quicker. You might also be able to skip what I presume is a preprocess of turning two strings into two sequences of tokenized strings. tan:diff() is written in XSLT, not XQuery, but you should be able to use fn:transform(). Code: https://textalign.net/ Background: https://www.balisage.net/Proceedings/vol26/html/Kalvesmaki01/BalisageVol26-K... Best wishes, Joel On 2026-02-12 07:21, Graydon Saunders via BaseX-Talk wrote:
Thank you! I can foresee some brain stretching in my future.
And yes, just two sequences of text, and what should be very similar text. (I'm trying to write tests for a conversion process.)
-- Graydon
On Thu, Feb 12, 2026, at 07:12, David Birnbaum wrote:
With just two sequences you can use Needleman-Wunsch. It’s a dynamic programming algorithm that provides an optimal alignment (good thing, although there may be more than one optimal alignment), but it doesn’t scale well (not good thing). I describe an XSLT 3.0 implementation in my 2020 XMLPrague paper at
https://archive.xmlprague.cz/2020/files/xmlprague-2020-proceedings.pdf
Your question doesn’t clarify whether you’re looking for index numbers in the alignment (where a word in one input might be matched by a gap in the other) or in the inputs (where aligned words share a position in the alignment but may have different positions in the inputs). For either of those interpretations, though, a solution will begin by finding an alignment.
David J. Birnbaum djbpitt@gmail.com
On Feb 11, 2026, at 9:41 PM, Graydon Saunders <graydonish@fastmail.com> wrote:
Hello!
If I have two (fairly long) sequences of text, ('The', 'words', 'are', 'sequence', 'members') and I want all the index numbers of matching pairs despite the sequences only mostly matching (so a word, or several words, can be missing from sequence A or sequence B), is there an established algorithm for doing this?
(If I search on "aligning sequences" I get bioinformatics about gene sequences; if I search on "aligning text" I get typography.)
Thanks! Graydon
-- Joel Kalvesmaki Director, Text Alignment Network http://textalign.net
What I'm trying to do is to test the result of an external conversion process. BaseX's implementation of full text makes it easy and fast to get a count of words; the next part of the problem is locating the "of" responsible for before having 1182 instances and after having 1181 instances. My idea was to try to find the sequence position numbers which don't have matches and to use that to get the words-before and words-after. (Doing a relate-elements-before-and-after word-for-word compare has been judged impractical; there's too much change in the "this is section 2" identifiers for doing that to be less effort that doing the conversion in-house which has already been decided against.) Stuffing the full text into tan:diff() might be a better approach; I will certainly look into it. Thank you! Graydon On Thu, Feb 12, 2026, at 10:17, Joel Kalvesmaki wrote:
Another option is to string-join the sequences via a unique character, then do a straightforward string diff. The XSLT function tan:diff() is efficient and results in high quality. You would then need to do some post-processing on the results to get your integer pairs. But presumably you're getting those integer pairs not as an end in itself but as a means to some other task, and the output of tan:diff() may get you there quicker. You might also be able to skip what I presume is a preprocess of turning two strings into two sequences of tokenized strings.
tan:diff() is written in XSLT, not XQuery, but you should be able to use fn:transform().
Code: https://textalign.net/
Background: https://www.balisage.net/Proceedings/vol26/html/Kalvesmaki01/BalisageVol26-K...
Best wishes,
Joel
On 2026-02-12 07:21, Graydon Saunders via BaseX-Talk wrote:
Thank you! I can foresee some brain stretching in my future.
And yes, just two sequences of text, and what should be very similar text. (I'm trying to write tests for a conversion process.)
-- Graydon
On Thu, Feb 12, 2026, at 07:12, David Birnbaum wrote:
With just two sequences you can use Needleman-Wunsch. It’s a dynamic programming algorithm that provides an optimal alignment (good thing, although there may be more than one optimal alignment), but it doesn’t scale well (not good thing). I describe an XSLT 3.0 implementation in my 2020 XMLPrague paper at
https://archive.xmlprague.cz/2020/files/xmlprague-2020-proceedings.pdf
Your question doesn’t clarify whether you’re looking for index numbers in the alignment (where a word in one input might be matched by a gap in the other) or in the inputs (where aligned words share a position in the alignment but may have different positions in the inputs). For either of those interpretations, though, a solution will begin by finding an alignment.
David J. Birnbaum djbpitt@gmail.com
On Feb 11, 2026, at 9:41 PM, Graydon Saunders <graydonish@fastmail.com> wrote:
Hello!
If I have two (fairly long) sequences of text, ('The', 'words', 'are', 'sequence', 'members') and I want all the index numbers of matching pairs despite the sequences only mostly matching (so a word, or several words, can be missing from sequence A or sequence B), is there an established algorithm for doing this?
(If I search on "aligning sequences" I get bioinformatics about gene sequences; if I search on "aligning text" I get typography.)
Thanks! Graydon
-- Joel Kalvesmaki Director, Text Alignment Network http://textalign.net
participants (4)
-
David Birnbaum -
Graydon Saunders -
Joel Kalvesmaki -
Liam R. E. Quin