What I'm trying to do is to test the result of an external conversion process.
BaseX's implementation of full text makes it easy and fast to get a count of words; the next part of the problem is locating the "of" responsible for before having 1182 instances and after having 1181 instances. My idea was to try to find the sequence position numbers which don't have matches and to use that to get the words-before and words-after. (Doing a relate-elements-before-and-after word-for-word compare has been judged impractical; there's too much change in the "this is section 2" identifiers for doing that to be less effort that doing the conversion in-house which has already been decided against.)
Stuffing the full text into tan:diff() might be a better approach; I will certainly look into it.
Another option is to string-join the sequences via a unique character,
then do a straightforward string diff. The XSLT function tan:diff() is
efficient and results in high quality. You would then need to do some
post-processing on the results to get your integer pairs. But presumably
you're getting those integer pairs not as an end in itself but as a
means to some other task, and the output of tan:diff() may get you there
quicker. You might also be able to skip what I presume is a preprocess
of turning two strings into two sequences of tokenized strings.
tan:diff() is written in XSLT, not XQuery, but you should be able to use
fn:transform().
Code:
Background:
Best wishes,
Joel
On 2026-02-12 07:21, Graydon Saunders via BaseX-Talk wrote:
> Thank you! I can foresee some brain stretching in my future.
>
> And yes, just two sequences of text, and what should be very similar
> text. (I'm trying to write tests for a conversion process.)
>
> -- Graydon
>
> On Thu, Feb 12, 2026, at 07:12, David Birnbaum wrote:
>
>> With just two sequences you can use Needleman-Wunsch. It’s a
>> dynamic programming algorithm that provides an optimal alignment
>> (good thing, although there may be more than one optimal alignment),
>> but it doesn’t scale well (not good thing). I describe an XSLT 3.0
>> implementation in my 2020 XMLPrague paper at
>>
>>
>> Your question doesn’t clarify whether you’re looking for index
>> numbers in the alignment (where a word in one input might be matched
>> by a gap in the other) or in the inputs (where aligned words share a
>> position in the alignment but may have different positions in the
>> inputs). For either of those interpretations, though, a solution
>> will begin by finding an alignment.
>>
>> David J. Birnbaum
>>
>>> On Feb 11, 2026, at 9:41 PM, Graydon Saunders
>>
>>>
>>> Hello!
>>>
>>> If I have two (fairly long) sequences of text, ('The', 'words',
>>> 'are', 'sequence', 'members') and I want all the index numbers of
>>> matching pairs despite the sequences only mostly matching (so a
>>> word, or several words, can be missing from sequence A or sequence
>>> B), is there an established algorithm for doing this?
>>>
>>> (If I search on "aligning sequences" I get bioinformatics about
>>> gene sequences; if I search on "aligning text" I get typography.)
>>>
>>> Thanks!
>>> Graydon
--
Joel Kalvesmaki
Director, Text Alignment Network