Percentage matching of text

Bruce Eckel

Background: for the 4th edition of Thinking in Java, I'm trying to
once again improve the testing scheme for the examples in the book. I
want to verify that the output I show in the book is "reasonably
correct." I say "Reasonably" because a number of examples produce
random numbers or text or the time of day or in general things that do
not repeat themselves from one execution to the next. So, much of the
text will be the same between the "control sample" and the "test
sample," but some of it will be different.

I will be using Python or Jython for the test framework.

What I'd like to do is find an algorithm that produces the results of
a text comparison as a percentage-match. Thus I would be able to
assert that my test samples must match the control sample by at least
(for example) 83% for the test to pass. Clearly, this wouldn't be a
perfect test but it would help flag problems, which is primarily what
I need.

Does anyone know of an algorithm or library that would do this? Thanks
in advance.

Bruce Eckel
Br***@EckelObjects.com

Jul 18 '05 #1

Helmut Jarausch

Bruce Eckel wrote:

Background: for the 4th edition of Thinking in Java, I'm trying to
once again improve the testing scheme for the examples in the book. I
want to verify that the output I show in the book is "reasonably
correct." I say "Reasonably" because a number of examples produce
random numbers or text or the time of day or in general things that do
not repeat themselves from one execution to the next. So, much of the
text will be the same between the "control sample" and the "test
sample," but some of it will be different.

I will be using Python or Jython for the test framework.

What I'd like to do is find an algorithm that produces the results of
a text comparison as a percentage-match. Thus I would be able to
assert that my test samples must match the control sample by at least
(for example) 83% for the test to pass. Clearly, this wouldn't be a
perfect test but it would help flag problems, which is primarily what
I need.

Does anyone know of an algorithm or library that would do this? Thanks
in advance.

Sorry, not in Python, only in Perl. I think
ftp://ftp.funet.fi/pub/languages/per...ox-3.23.tar.gz
can be tweaked to do that.
--
Helmut Jarausch

Lehrstuhl fuer Numerische Mathematik
RWTH - Aachen University
D 52056 Aachen, Germany

Jul 18 '05 #2

Dan Bishop

Bruce Eckel <Br********@MailBlocks.com> wrote in message news:<ma*************************************@python.org>...

Background: for the 4th edition of Thinking in Java, I'm trying to
once again improve the testing scheme for the examples in the book. I
want to verify that the output I show in the book is "reasonably
correct." I say "Reasonably" because a number of examples produce
random numbers or text or the time of day or in general things that do
not repeat themselves from one execution to the next. So, much of the
text will be the same between the "control sample" and the "test
sample," but some of it will be different.

I will be using Python or Jython for the test framework.

What I'd like to do is find an algorithm that produces the results of
a text comparison as a percentage-match. Thus I would be able to
assert that my test samples must match the control sample by at least
(for example) 83% for the test to pass. Clearly, this wouldn't be a
perfect test but it would help flag problems, which is primarily what
I need.

Does anyone know of an algorithm or library that would do this? Thanks
in advance.

One of the simpler ones is to calculate the length of the longest
common subsequence of the test output and the control output.

def lcsLength(seqA, seqB):
    # lenTable[i][j] holds the LCS length of seqA[:i] and seqB[:j].
    # The extra row and column of zeros cover the empty prefixes, so
    # no index ever wraps around to the end of the table.
    lenTable = [[0] * (len(seqB) + 1) for _ in range(len(seqA) + 1)]
    for i, a in enumerate(seqA):
        for j, b in enumerate(seqB):
            if a == b:
                lenTable[i+1][j+1] = lenTable[i][j] + 1
            else:
                lenTable[i+1][j+1] = max(lenTable[i][j+1], lenTable[i+1][j])
    return lenTable[-1][-1]

To convert this to a percentage value, divide by the length of the
control output and multiply by 100.
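
The LCS approach described above can be sketched end to end as follows. This is only an illustration, not code from the thread: the names lcs_length and percent_match and the sample strings are made up, and it keeps one table row at a time instead of the full table.

```python
def lcs_length(seq_a, seq_b):
    """Length of the longest common subsequence, one table row at a time."""
    prev = [0] * (len(seq_b) + 1)          # row for the empty prefix of seq_a
    for a in seq_a:
        curr = [0]                         # column for the empty prefix of seq_b
        for j, b in enumerate(seq_b):
            if a == b:
                curr.append(prev[j] + 1)
            else:
                curr.append(max(prev[j + 1], curr[j]))
        prev = curr
    return prev[-1]


def percent_match(control, test):
    """LCS length as a percentage of the control length (hypothetical helper)."""
    if not control:
        # Assumption: an empty control only matches an empty test.
        return 100.0 if not test else 0.0
    return 100.0 * lcs_length(control, test) / len(control)


control = "Elapsed: 1234 ms"
test = "Elapsed: 1301 ms"
assert percent_match(control, test) >= 83.0
```

Because the result is divided by the control length only, extra text in the test output is not penalized; dividing by the longer of the two lengths would be stricter. The running time grows with the product of the two lengths, so for long outputs it may be more practical to compare lists of lines rather than characters.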

Btw, thank you for those footnotes in Thinking in Java that encouraged
me to try Python :-)

Jul 18 '05 #4

Oleg Paraschenko

Hello Bruce,

Bruce Eckel <Br********@MailBlocks.com> wrote in message
news:<ma*************************************@python.org>

...
What I'd like to do is find an algorithm that produces the results
of a text comparison as a percentage-match.
...
Does anyone know of an algorithm or library that would do this?
Thanks in advance.

I suggest you look at my software, GetReuse, and its SDK:

http://getreuse.com/
http://getreuse.com/sdk/

The formula for calculating the similarity is based on scientific
research. Any other "good" method of calculation should produce
results that are equivalent in some terms to the GetReuse results. I
have not written a paper yet; the formula is an improvement on the
formula from http://www.cs.ucsb.edu/~mli/sid.ps . Unfortunately, I
froze the project, but the current code is tested and should work well.

Regards, Oleg

Jul 18 '05 #5

Mark 'Kamikaze' Hughes

Bruce Eckel <Br********@MailBlocks.com>
wrote on Fri, 30 Jul 2004 07:52:39 -0600:

Background: for the 4th edition of Thinking in Java, I'm trying to
once again improve the testing scheme for the examples in the book. I
want to verify that the output I show in the book is "reasonably
correct." I say "Reasonably" because a number of examples produce
random numbers or text or the time of day or in general things that do
not repeat themselves from one execution to the next. So, much of the
text will be the same between the "control sample" and the "test
sample," but some of it will be different.

I will be using Python or Jython for the test framework.

What I'd like to do is find an algorithm that produces the results of
a text comparison as a percentage-match. Thus I would be able to
assert that my test samples must match the control sample by at least
(for example) 83% for the test to pass. Clearly, this wouldn't be a
perfect test but it would help flag problems, which is primarily what
I need.

Does anyone know of an algorithm or library that would do this? Thanks
in advance.

Here's an outside-the-box solution: set the random number seed and use
a fixed date in your tests. Now you can test fixed values, even though
the application is "random".
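
A rough sketch of that idea in Python, assuming the code under test can be handed its random number generator and its notion of "now" instead of reaching for global state. The function sample_output, the seed 47, and the timestamp are made up for illustration.

```python
import random
from datetime import datetime

# Arbitrary fixed timestamp used in place of the real clock during tests.
FIXED_NOW = datetime(2004, 7, 30, 7, 52, 39)


def sample_output(rng, now):
    """Stand-in for a program whose output depends on random numbers and the clock."""
    rolls = [rng.randint(1, 6) for _ in range(5)]
    return "%s rolls=%s" % (now.isoformat(), rolls)


# Two generators built from the same seed yield the same sequence,
# so the two runs can be compared for exact equality.
first = sample_output(random.Random(47), FIXED_NOW)
second = sample_output(random.Random(47), FIXED_NOW)
assert first == second
```

The same approach carries over to the Java examples themselves: java.util.Random accepts a seed in its constructor, and the date can be passed in as a fixed value.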

--
<a href="http://kuoi.asui.uidaho.edu/~kamikaze/"> Mark Hughes </a>
"Virtues foster one another; so too, vices.
Bad English kills trees, consumes energy, and befouls the Earth.
Good English renews it." -The Underground Grammarian, v1n2

Jul 18 '05 #6

Steve Christensen

In article <ma*************************************@python.org>, Bruce
Eckel wrote:

What I'd like to do is find an algorithm that produces the results of
a text comparison as a percentage-match. Thus I would be able to
assert that my test samples must match the control sample by at least
(for example) 83% for the test to pass. Clearly, this wouldn't be a
perfect test but it would help flag problems, which is primarily what
I need.

Does anyone know of an algorithm or library that would do this? Thanks
in advance.

Have you come across the following yet?

Levenshtein C extension module for Python:
http://trific.ath.cx/resources/python/levenshtein/
And/or:
http://hetland.org/python/distance.py
-Steve
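
For readers who want to see what an edit-distance comparison looks like before installing anything, here is a plain-Python sketch. It is not the API of either module linked above; the names levenshtein and similarity_percent are hypothetical, and normalizing by the longer string is just one possible way to turn a distance into a percentage.

```python
def levenshtein(a, b):
    """Minimum number of single-item insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))         # distances from the empty prefix of a
    for i, item_a in enumerate(a, 1):
        curr = [i]                         # distance to the empty prefix of b
        for j, item_b in enumerate(b, 1):
            cost = 0 if item_a == item_b else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def similarity_percent(a, b):
    """100 for identical inputs, 0 when every position must change (assumed scaling)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / float(longest))


assert similarity_percent("Done in 12 ms", "Done in 15 ms") >= 83.0
```

Unlike a measure divided by the control length alone, this one also drops when the test output contains extra text, since every inserted character counts as an edit.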

Jul 18 '05 #7
