Thanks both Andy and Jeff for your suggestions regarding diff / longest

common subsequence. It got my creative juices flowing and I rolled my

own, which I didn't see in the literature. There may be some overlap

with Ukkonen (see

http://www.csse.monash.edu.au/~lloyd.../Dynamic/Edit/) since

the complexity appears to have similar overtones. The running time for

sequences which are highly similar should be linear in the length of

the strings. The one thing that is missing is an efficient

dissimilarity test so that one may bail out of the routine early.

Step 1: For each word in the first string, have an array of where it

is in the first string

Step 2: From this it is possible to construct what is known as a

permutation graph:

Start $ctr at 0; for each word in string 2 find the array of indeces

in string 1 and (iterating backwards through that array), unshift

++$ctr onto an array at that index point.

The array of arrays built up in this step should now be collapsed,

forming a permuation of the integers from 1 ... $ctr. The

corresponding indeces are assumed to be numbered consecutively starting

at 1 for string 2. This constitutes a permutation graph (a permutation

graph is one where the corresponding integers are connected, which

connections correspond to vertices; and vertices are adjacent iff their

corresponding segments intersect. This happens iff the order of the

two integers corresponding to the two segments differs above vs.

below).

The relevance to the problem is that any segment from the graph

indicates a common word in the original sequences. Two segments may be

in a common subsequence as long as they don't intersect.

Step 3:

Thus, the problem is reduced to finding the longest increasing

monotonic subsequence of the integers constructed in Step 2. I do this

by dynamic programming, working backwards through the sequence of

integers. For each position, I keep track of the longest monotonic

increasing subsequence (rightwards) from that point. I also keep track

of the maximum longest monotonic increasing subsequence (rightwards)

from that point as a speedup convenience.

Step 4:

Working from the left, reconstruct the sequence (longest increasing

monotonic subsequence).

Step 5:

Recover the mapping that the sequence from Step 4 represents (along

with the indexes into the original strings for each word). This is the

longest common subsequence between the two original strings.

Step 6.

Using Step 5, we can also determine which words from string 1 didn't

make it and which words from string 2 were not included, so we can

easily construct the diff where the words from string one that didn't

make it will have a strikethrough and the words from string 2 will be

bolded.

Csaba Gabor from Vienna