473,383 Members | 1,891 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,383 software developers and data experts.

Diff for words in string

I'm comparing the text of (snippets of) web pages which I expect to be
quite different or quite similar. In the case where they are similar,
I would like to display the more recent one and say something like:
Word 2 added [before word 2 in original]: "Jack be nimble"

Words 10-11 changed to: "the quick brown fox"
[from words 9-11 in original]: "the brown fast quick fox"

Words [20-22 in original] before word 20 removed: "sat in a corner on"
One way to do this is to replace all spacing chars with \n, write the
strings to two files, and then run diff (FC on my Win XP Pro = file
compare), collecting the output. Does anyone happen to have PHP code
for this where I don't have to write files? Note that the diff is on
words of the strings and not characters.

In particular, the normal algorithms for this (longest common
subsequence) only produce a number, and don't note the differences.

Also, I wanted to give a threshhold (about 10, say) and if the longest
common subsequence differs from the shorter string (strings are on the
order of 100 words) by at least this amount, then simply fail (since
the difference between the two strings would be deemed to great). This
should make the algorithm far more efficient. The corresponding
argument in FC would be /LB10.

Thanks,
Csaba Gabor from Vienna

Ref: http://www.ics.uci.edu/~dan/class/16...6/Dynamic.html
http://en.wikipedia.org/wiki/Longest...quence_problem
Note that this problem is distinct from the longest common substring
problem.

Jun 21 '06 #1
3 2940
On Wed, 21 Jun 2006 02:56:37 -0700, Csaba Gabor wrote:
One way to do this is to replace all spacing chars with \n, write the
strings to two files, and then run diff (FC on my Win XP Pro = file
compare), collecting the output. Does anyone happen to have PHP code for
this where I don't have to write files? Note that the diff is on words of
the strings and not characters.


http://pear.php.net/package/Text_Diff/

The code in that PEAR library may give you a starting point to
implementing it in PHP rather than calling fc.

Cheers,
Andy

--
Andy Jeffries MBCS CITP ZCE | gPHPEdit Lead Developer
http://www.gphpedit.org | PHP editor for Gnome 2
http://www.andyjeffries.co.uk | Personal site and photos

Jun 21 '06 #2
Thanks both Andy and Jeff for your suggestions regarding diff / longest
common subsequence. It got my creative juices flowing and I rolled my
own, which I didn't see in the literature. There may be some overlap
with Ukkonen (see
http://www.csse.monash.edu.au/~lloyd.../Dynamic/Edit/) since
the complexity appears to have similar overtones. The running time for
sequences which are highly similar should be linear in the length of
the strings. The one thing that is missing is an efficient
dissimilarity test so that one may bail out of the routine early.
Step 1: For each word in the first string, have an array of where it
is in the first string

Step 2: From this it is possible to construct what is known as a
permutation graph:
Start $ctr at 0; for each word in string 2 find the array of indeces
in string 1 and (iterating backwards through that array), unshift
++$ctr onto an array at that index point.

The array of arrays built up in this step should now be collapsed,
forming a permuation of the integers from 1 ... $ctr. The
corresponding indeces are assumed to be numbered consecutively starting
at 1 for string 2. This constitutes a permutation graph (a permutation
graph is one where the corresponding integers are connected, which
connections correspond to vertices; and vertices are adjacent iff their
corresponding segments intersect. This happens iff the order of the
two integers corresponding to the two segments differs above vs.
below).

The relevance to the problem is that any segment from the graph
indicates a common word in the original sequences. Two segments may be
in a common subsequence as long as they don't intersect.

Step 3:
Thus, the problem is reduced to finding the longest increasing
monotonic subsequence of the integers constructed in Step 2. I do this
by dynamic programming, working backwards through the sequence of
integers. For each position, I keep track of the longest monotonic
increasing subsequence (rightwards) from that point. I also keep track
of the maximum longest monotonic increasing subsequence (rightwards)
from that point as a speedup convenience.

Step 4:
Working from the left, reconstruct the sequence (longest increasing
monotonic subsequence).

Step 5:
Recover the mapping that the sequence from Step 4 represents (along
with the indexes into the original strings for each word). This is the
longest common subsequence between the two original strings.

Step 6.
Using Step 5, we can also determine which words from string 1 didn't
make it and which words from string 2 were not included, so we can
easily construct the diff where the words from string one that didn't
make it will have a strikethrough and the words from string 2 will be
bolded.

Csaba Gabor from Vienna

Jun 22 '06 #4

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

2
by: superprad | last post by:
X-No-Archive: yes what I am looking for is 1. To create a list of different words of various lengths(1-15) using A-Z,a-z,0-9 and punctuations.Basically anything that could be found on a text...
4
by: Eric Boutin | last post by:
Hi all ! I'm currently writing a function that evaluate if there is a diffence beetween 2 streams (istringstream and ifstream). The problem is, it doesn't work. It *always* finds a difference...
4
by: Mark Reed | last post by:
Hi all, I have a query (query1) which shows scan date, scan time & operator. One scan = 1 record. What I want to do is create a report based on query 2 from query1 which shows all the scans AND...
9
by: Ching-Lung | last post by:
Hi all, I try to create a tool to check the delta (diff) of 2 binaries and create the delta binary. I use binary formatter (serialization) to create the delta binary. It works fine but the...
6
by: Igor Shevchenko | last post by:
Hi! Suppose I have "pg_dump -s" of two pg installs, one is "dev", another is "production". Their schemas don't differ too much, and I want to get a "diff -u"-like schema diff so I can quickly...
4
by: Andreas Kasparek | last post by:
Hola! I'm preparing my master thesis about a XML Merge Tool implementation and was wondering if there is any open standard for XML diff regarding topics like: - is a diff result computed on...
1
by: duncan.welch | last post by:
Does anyone know of a good string diff algorithm they could point me in the direction of? I just need to display the differences between two reasonably large strings in a pretty format on a web...
0
by: kdsutaia | last post by:
hi! I hv to read two files from two diff directories and need to count pairs of words from both the files. as well I hv to keep track of no of documents where this terms present. All files...
6
by: Aaron Gray | last post by:
Hi, I am working on an HTML WYSISYG Wiki and need to display a diff page like WikiPedia does if two people edit a file at the same time to give the second user the diff. Basically with additions...
0
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 3 Apr 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome former...
0
by: taylorcarr | last post by:
A Canon printer is a smart device known for being advanced, efficient, and reliable. It is designed for home, office, and hybrid workspace use and can also be used for a variety of purposes. However,...
0
by: Charles Arthur | last post by:
How do i turn on java script on a villaon, callus and itel keypad mobile phone
0
by: ryjfgjl | last post by:
If we have dozens or hundreds of excel to import into the database, if we use the excel import function provided by database editors such as navicat, it will be extremely tedious and time-consuming...
0
by: emmanuelkatto | last post by:
Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel
0
BarryA
by: BarryA | last post by:
What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...
1
by: nemocccc | last post by:
hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?
1
by: Sonnysonu | last post by:
This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...
0
by: Hystou | last post by:
There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.