newb: comapring two strings

manstey

Hi,

Is there a clever way to see if two strings of the same length vary by
only one character, and what the character is in both strings.

E.g. str1=yaqtil str2=yaqtel

they differ at str1[4] and the difference is ('i','e')

But if there was str1=yiqtol and str2=yaqtel, I am not interested.

can anyone suggest a simple way to do this?

My next problem is, I have a list of 300,000+ words and I want to find
every pair of such strings. I thought I would first sort on length of
string, but how do I iterate through the following:

str1
str2
str3
str4
str5

so that I compare str1 & str2, str1 & str3, str 1 & str4, str1 & str5,
str2 & str3, str3 & str4, str3 & str5, str4 & str5.

Thanks in advance,
Matthew

May 18 '06 #1

Subscribe Post Reply

1476

Diez B. Roggisch

> Is there a clever way to see if two strings of the same length vary by

only one character, and what the character is in both strings.

E.g. str1=yaqtil str2=yaqtel

they differ at str1[4] and the difference is ('i','e')

But if there was str1=yiqtol and str2=yaqtel, I am not interested.

can anyone suggest a simple way to do this?
Use the levenshtein distance.
http://en.wikisource.org/wiki/Levenshtein_distance

My next problem is, I have a list of 300,000+ words and I want to find
every pair of such strings. I thought I would first sort on length of
string, but how do I iterate through the following:

str1
str2
str3
str4
str5

so that I compare str1 & str2, str1 & str3, str 1 & str4, str1 & str5,
str2 & str3, str3 & str4, str3 & str5, str4 & str5.

decorate-sort-undecorate is the idion for this

l = <list of strings>

l = [(len(w), w) for w in l]
l.sort()
l = [w for _, w in l]
Diez

May 18 '06 #2

Justin Azoff

manstey wrote:

Hi,

Is there a clever way to see if two strings of the same length vary by
only one character, and what the character is in both strings.

E.g. str1=yaqtil str2=yaqtel

they differ at str1[4] and the difference is ('i','e')

something like this maybe?

str1='yaqtil'
str2='yaqtel'
set(enumerate(str1)) ^ set(enumerate(str2)) set([(4, 'e'), (4, 'i')])

--
- Justin

May 19 '06 #3

Serge Orlov

manstey wrote:

Hi,

Is there a clever way to see if two strings of the same length vary by
only one character, and what the character is in both strings.

E.g. str1=yaqtil str2=yaqtel

they differ at str1[4] and the difference is ('i','e')

But if there was str1=yiqtol and str2=yaqtel, I am not interested.

can anyone suggest a simple way to do this?

My next problem is, I have a list of 300,000+ words and I want to find
every pair of such strings. I thought I would first sort on length of
string, but how do I iterate through the following:

str1
str2
str3
str4
str5

so that I compare str1 & str2, str1 & str3, str 1 & str4, str1 & str5,
str2 & str3, str3 & str4, str3 & str5, str4 & str5.

If your strings are pretty short you can do it like this even without
sorting by length first:

def fuzzy_keys(s):
for pos in range(len(s)):
yield s[0:pos]+chr(0)+s[pos+1:]

def fuzzy_insert(d, s):
for fuzzy_key in fuzzy_keys(s):
if fuzzy_key in d:
strings = d[fuzzy_key]
if type(strings) is list:
strings += s
else:
d[fuzzy_key] = [strings, s]
else:
d[fuzzy_key] = s

def gather_fuzzy_matches(d):
for strings in d.itervalues():
if type(strings) is list:
yield strings

acc = {}
fuzzy_insert(acc, "yaqtel")
fuzzy_insert(acc, "yaqtil")
fuzzy_insert(acc, "oaqtil")
print list(gather_fuzzy_matches(acc))

prints

[['yaqtil', 'oaqtil'], ['yaqtel', 'yaqtil']]

May 19 '06 #4

johnzenger

manstey wrote:

Hi,

Is there a clever way to see if two strings of the same length vary by
only one character, and what the character is in both strings.
You want zip.

def diffbyonlyone(string1, string2):
diffcount = 0
for c1, c2 in zip(string1, string2):
if c1 != c2:
diffcount += 1
if diffcount > 1:
return False
return diffcount == 1

print diffbyonlyone("yaqtil","yaqtel") # True
print diffbyonlyone("yiqtol","yaqtel") # False

If your strings are long, it might be faster/more memory efficient to
use itertools.izip instead.
My next problem is, I have a list of 300,000+ words and I want to find
every pair of such strings. I thought I would first sort on length of
string, but how do I iterate through the following:

str1
str2
str3
str4
str5

so that I compare str1 & str2, str1 & str3, str 1 & str4, str1 & str5,
str2 & str3, str3 & str4, str3 & str5, str4 & str5.

for index1 in xrange(len(words)):
for index2 in xrange(index1+1,len(words)):
if diffbyonlyone(words[index1], words[index2]):
print words[index1] + " -- " + words[index2]

....but by all means run that only on sets of words that you have
already identified, pursuant to some criteria like word length, to be
likely matches. Do the math; that's a lot of comparisons!

May 19 '06 #5

Paul McGuire

"manstey" <ma*****@csu.edu.au> wrote in message
news:11**********************@j55g2000cwa.googlegr oups.com...

Hi,

Is there a clever way to see if two strings of the same length vary by
only one character, and what the character is in both strings.

E.g. str1=yaqtil str2=yaqtel

they differ at str1[4] and the difference is ('i','e')

def offByNoMoreThanOneCharacter(a,b): .... return len(a)==len(b) and sum(map(lambda (x,y): x==y, zip(a,b))) >=
len(a)-1
.... offByNoMoreThanOneCharacter("abc","abd") True offByNoMoreThanOneCharacter("abc","axd") False offByNoMoreThanOneCharacter("yaqtil","yaqtel") True offByNoMoreThanOneCharacter("yiqtol","yaqtel")

False

This takes advantage of the Python convention that True evaluates to 1 and
False evaluates to 0.

-- Paul

May 19 '06 #6

Paul Rubin

"Paul McGuire" <pt***@austin.rr._bogus_.com> writes:

def offByNoMoreThanOneCharacter(a,b):

... return len(a)==len(b) and sum(map(lambda (x,y): x==y, zip(a,b))) >=
len(a)-1

Yikes! How about (untested):

def offByNoMoreThanOneCharacter(a,b):
return len(a)==len(b) and \
len([i for i in xrange(len(a)) if a[i] != b[i]]) <= 1

which seems more direct. I'm not sure it's correct since I'd
consider "apple" and "apples" to be one character apart even though
their lengths are unequal.

But I think this type of function is useless for the problem at hand,
since there are around N=300,000 words, comparing every pair of them
is O(N**2) which is a bit painful.

I'd use a decorate-sort-undecorate approach like:

# generate all the ways to
def generate_all_variants(word):
wlower = word.lower()
yield (wlower, word)
for i in xrange(len(word)):
yield (wlower[:i] + wlower[i+1:], word)

Use this to generate all the variants of all the words in the
dictionary, and write those out into a file, each line containing a
variant plus the original word. Then use a sorting utility like the
unix "sort" program to sort the file. Those programs work efficiently
even if the file is too large to fit in memory. Then read the sorted
file to find equivalence classes on the variants.

May 19 '06 #7

Peter Otten

manstey wrote:

Is there a clever way to see if two strings of the same length vary by
only one character, and what the character is in both strings.

E.g. str1=yaqtil str2=yaqtel

they differ at str1[4] and the difference is ('i','e')

But if there was str1=yiqtol and str2=yaqtel, I am not interested.

can anyone suggest a simple way to do this?

My next problem is, I have a list of 300,000+ words and I want to find
every pair of such strings. I thought I would first sort on length of
string, but how do I iterate through the following:

Not sure if it can handle 300000 words, but here is something to play with.

import sys

def find_similars(words, lookup=None, dupes=None):
if lookup is None:
lookup = {}
if dupes is None:
dupes = set()
for word in words:
low_word = word.lower()
if low_word not in dupes:
dupes.add(low_word)
last = None
for i, c in enumerate(low_word):
if c == last: continue
key = low_word[:i], low_word[i+1:]
if key in lookup:
lookup[key].append(word)
else:
lookup[key] = [word]
last = c
return (group for group in lookup.itervalues() if len(group) > 1)

def main():
import optparse
parser = optparse.OptionParser()
parser.usage += " infile[s]"
parser.add_option("-n", "--limit", type="int", help="process at most
LIMIT words")
options, args = parser.parse_args()
if args:
words = (w.strip() for fn in args for w in open(fn))
else:
words = (w.strip() for w in sys.stdin)
if options.limit is not None:
from itertools import islice
words = islice(words, max_len)

for group in find_similars(words):
print " ".join(sorted(group))

if __name__ == "__main__":
main()

Peter

May 19 '06 #8

John Machin

> Use the levenshtein distance.

Given the constraint that the two strings are the same length, I'm
assuming (as other posters appear to have done) that "vary by only one
character" precludes insertion and deletion operations.

In that case, the problem can be solved in O(n) time by a simple loop
which counts the number of differences and notes the position of the
first (if any) difference. Any full-distance Levenshtein method that
does no pre-processing must take O(n**2) time. It is possible to take
advantage of the fact that the OP doesn't care about what the distance
is exactly if it is not 1; 0 is just as bad as 2, 3, 4, etc. In this
case it is possible to achieve O(n) time [ref. Ukkonen]. However (a)
one still needs to track where the difference arose (b) we are talking
about extreme overkill for the OP's problem.

May 19 '06 #9

Paul McGuire

"John Machin" <sj******@lexicon.net> wrote in message
news:11**********************@j73g2000cwa.googlegr oups.com...

In that case, the problem can be solved in O(n) time by a simple loop
which counts the number of differences and notes the position of the
first (if any) difference. Any full-distance Levenshtein method that
does no pre-processing must take O(n**2) time. It is possible to take
advantage of the fact that the OP doesn't care about what the distance
is exactly if it is not 1; 0 is just as bad as 2, 3, 4, etc.

Here is a class that implements 4 different approaches to this comparison
problem, with a configurable number of maximum mismatches (where a and b are
the strings being compared):

1. use sum(map(lambda (x,y): x!=y, itertools.izip(a,b))) to count
mismatches - no short-circuiting
2. use sum(map(abs, itertools.starmap(cmp, itertools.izip(a,b)))) to count
mismatches - no short-circuiting
3. use explicit for loop to compare each tuple returned from
itertools.izip(a,b) - short-circuits when number of mismatches exceeds
allowable number
4. use for loop over itertools.takewhile to count mismatches -
short-circuits when number of mismatches exceeds allowable number

Also note the short-circuiting in the initializer - if the strings being
compared are shorter in length then the number of allowed mismatches, than
they will always match, no comparisons required. (In the OP's specific
case, any two one-character strings will match within one character).

Of course this does nothing to address the larger issue of the program
design - I just wanted to learn a little more about itertools. :)

-- Paul
from itertools import izip,starmap,takewhile

class offByNoMoreThanOneCharacter(object):
def __init__(self,a,b,maxMismatches=1):
len_a = len(a)
self.val = None
if len_a != len(b):
self.val = False
elif len_a <= maxMismatches:
self.val = True
else:
self.a = a
self.b = b
self.maxMismatches = maxMismatches

def eval1(self):
# one-liner using map with lambda - sum counts mismatches
self.val = sum(map(lambda (x,y): x!=y, izip(self.a,self.b))) <=
self.maxMismatches

def eval2(self):
# one-liner using map with abs and cmp - sum counts mismatches
self.val = sum(map(abs, starmap(cmp, izip(self.a,self.b)))) <=
self.maxMismatches

def eval3(self):
# explicit traversal, with short-circuit when too many mismatches
found
mismatches = 0
for (ac,bc) in izip(self.a,self.b):
if ac != bc:
mismatches += 1
if mismatches > self.maxMismatches:
self.val = False
break
else:
self.val = True

def eval4(self):
# traversal using takewhile, also short-circuits when too many
mismatches found
numMismatches = 0
maxMismatches = self.maxMismatches
stillOk = lambda (s,t)=(None,None) : numMismatches <= maxMismatches
for (ac,bc) in takewhile(stillOk, izip(self.a,self.b)):
numMismatches += (ac != bc)
self.val = stillOk()
# special instance method to return "boolean-ness" of this object
def __nonzero__(self):
if self.val is None:
# change this line to use whichever eval method you want to test
self.eval1()
return self.val
def test(a,b):
print a,b
print bool(offByNoMoreThanOneCharacter(a,b))
print

test("abc","abd")
test("abc","axd")
test("yaqtil","yaqtel")
test("yiqtol","yaqtel")

May 19 '06 #10

bearophileHUGS

I think a way to solve the problem may be:
1) create a little Python script to separate the original words in many
files, each one containing only words of the same length. Every
filename can contain the relative word length.
2) write a little C program with two nested loops, that scans all the
pairs of the words of a single file, looking for single char
differences using a scanning of the chars. Probably there are many
possible tricks to speed up this search (like separating the words in
subgroups, of using some assembly-derived tricks, etc) but maybe you
don't need them. As C data structure you may simply use a single block
(because you know the length of a single word, so the files produced by
Python can be without spaces and returns).

I have suggested C because if the words are all of the same length then
you have 30000^2 = 90 000 000 000 pairs to test.
If you want, before creating the C program you can create a little
Python+Psyco prototype that uses array.array of chars (because
sometimes Psyco uses them quite quickly).

Bye,
bearophile

May 19 '06 #11

bearophileHUGS

> I have suggested C because if the words are all of the same length then

you have 30000^2 = 90 000 000 000 pairs to test.

Sorry, you have (n*(n-1))/2 pairs to test (~ 45 000 000 000).

bearophile

May 19 '06 #12

Paul Rubin

be************@lycos.com writes:

I have suggested C because if the words are all of the same length then
you have 30000^2 = 90 000 000 000 pairs to test.

Sorry, you have (n*(n-1))/2 pairs to test (~ 45 000 000 000).

Still terrible. Use a better algorithm!

May 19 '06 #13

John Machin

Paul Rubin wrote:

bearophileH...@lycos.com writes:
I have suggested C because if the words are all of the same length then
you have 30000^2 = 90 000 000 000 pairs to test.
Sorry, you have (n*(n-1))/2 pairs to test (~ 45 000 000 000).
Still terrible. Use a better algorithm!

To put all this in perspective, here's a very similar real-world
problem:

You have a customer database with 300,000 records. Some of them are
duplicated, because there are differences in recording the customers'
names, like a one-keystroke typo. The task is to locate pairs of rows
which are likely to be duplicates. 45 billion comaprisons[1] may be
ecxessive.

And here's a clue for a better algorithm: Knuth's TAOCP vol 3 [1973
edition] chapter 6 (searching) third page "as if the words were spelled
backwards". If you find yourself reading about soundex, you've
definitely gone too far :-)

[1]: "comapring" in the subject: is this a game of 'Spot the deliberate
mistake"? Is the OP counting a transposition of adjacent characters as
"varying by one character"?

Cheers

May 19 '06 #14

bearophileHUGS

Paul Rubin>Still terrible. Use a better algorithm!<

I agree, it's O(n^2), but if you need to run this program just 1 time,
and the program is in C, and you don't want to use much time to think
and code a better algorithm (or you aren't able to do it) then maybe
that naive solution can be enough, (if the running time is around 30-90
minutes).
Total time to solve a one-shot problem is often the sum of thinking
time + programming time + running time :-)
Paul Rubin>Use this to generate all the variants of all the words in
the dictionary, and write those out into a file, each line containing a
variant plus the original word. Then use a sorting utility like the
unix "sort" program to sort the file. Those programs work efficiently
even if the file is too large to fit in memory. Then read the sorted
file to find equivalence classes on the variants.<

If the words are all 5 character long, then I think you are creating:
(len(lowercase)-1) * 5 * 30000 = 3750000
strings of len 5+5 (plus enter if you save it into a file), this means
a file of about 37 MB. All those trings may even fit in memory if you
have lot of it.
Your solution looks nice :-) (It reminds me a little the word signature
trick used to find anagrams).

Bye,
bearophile

May 19 '06 #15

Similar topics

Java Newb

by: Travis Ray | last post by:

Hi, I just started a java class and I have this assignment and I'm stumped... I realize the code is messy... but I just started... so go easy on me :) Java Problem I am working on this...

Java

Simple question from Python newb... What am I doing wrong? For x, y, z in aTuple:

by: Amy G | last post by:

I have a whole bunch of tuples that look something like this, aTuple = ('1074545869.6580.msg', 't_bryan_pw@gmcc.ab.ca', 'Your one stop prescriptions') now that I have this I try for x, y, z...

Python

newb question

by: Mitch | last post by:

when using CIN How do i make it so user input can be 5 or 6 letters and how do i put input into an if else statement ? Example, i want the user to be able to input a word then program to decide...

C / C++

Some Newb Problem with "int", please help.

by: Apotheosis | last post by:

The problem professor gave us is: Write a program which reads two integer values. If the first is less than the second, print the message "up". If the second is less than the first, print the...

C / C++

newb question about strings

by: Blue Man | last post by:

hello I am writing an error handling within different classes in asp.net. the error class reports the Exception E's string to the database. but it cuts big strings, if i just display the error it...

C# / C Sharp

NewB question on text manipulation

by: ProvoWallis | last post by:

I'm totally stumped by this problem so I'm hoping someone can give me a little advice or point me in the right direction. I have a file that looks like this: <SC>APPEAL<XC>40-24; 40-46; 42-46;...

Python

Newb looking for data binding help

by: kevinwolfe | last post by:

Hi all. I'd like any suggestions on how I can get my data set (not a DataSet) bound to a couple of controls on a form. Let me start by describing what my data looks like. Each entry correlates...

Visual Basic .NET

newb: Join two string variables

by: johnny | last post by:

How do I join two string variables? I want to do: download_dir + filename. download_dir=r'c:/download/' filename =r'log.txt' I want to get something like this: c:/download/log.txt

Python

SWIG and char* newb questions :)

by: code_berzerker | last post by:

Hi i'm relatively new to Python and my C/C++ knowledge is near to None. Having said that I feel justified to ask stupid questions :) Ok now more seriously. I have question refering to char* used...

Python

How to turn on java script in a villaon keypad mobile phone

by: Charles Arthur | last post by:

How do i turn on java script on a villaon, callus and itel keypad mobile phone

Java

Looking to do Android software development, any suggestions? Is flutter better?

by: nemocccc | last post by:

hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?

General

Changing the language in Windows 10

by: Hystou | last post by:

Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...

Windows Server

Problem With Comparison Operator <=> in G++

by: Oralloy | last post by:

Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...

C / C++

Maximizing Business Potential: The Nexus of Website Design and Digital Marketing

by: jinu1996 | last post by:

In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...

Online Marketing

The easy way to turn off automatic updates for Windows 10/11

by: Hystou | last post by:

Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows...

Windows Server

Discussion: How does Zigbee compare with other wireless protocols in smart home applications?

by: tracyyun | last post by:

Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each...

General

AI Job Threat for Devs

by: agi2029 | last post by:

Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing,...

Career Advice

Access Europe - Using VBA to create a class based on a table - Wed 1 May

by: isladogs | last post by:

The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new...

Microsoft Access / VBA