473,587 Members | 2,466 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

newb: comapring two strings

Hi,

Is there a clever way to see if two strings of the same length vary by
only one character, and what the character is in both strings.

E.g. str1=yaqtil str2=yaqtel

they differ at str1[4] and the difference is ('i','e')

But if there was str1=yiqtol and str2=yaqtel, I am not interested.

can anyone suggest a simple way to do this?

My next problem is, I have a list of 300,000+ words and I want to find
every pair of such strings. I thought I would first sort on length of
string, but how do I iterate through the following:

str1
str2
str3
str4
str5

so that I compare str1 & str2, str1 & str3, str 1 & str4, str1 & str5,
str2 & str3, str3 & str4, str3 & str5, str4 & str5.

Thanks in advance,
Matthew

May 18 '06 #1
14 1493
> Is there a clever way to see if two strings of the same length vary by
only one character, and what the character is in both strings.

E.g. str1=yaqtil str2=yaqtel

they differ at str1[4] and the difference is ('i','e')

But if there was str1=yiqtol and str2=yaqtel, I am not interested.

can anyone suggest a simple way to do this?
Use the levenshtein distance.
http://en.wikisource.org/wiki/Levenshtein_distance

My next problem is, I have a list of 300,000+ words and I want to find
every pair of such strings. I thought I would first sort on length of
string, but how do I iterate through the following:

str1
str2
str3
str4
str5

so that I compare str1 & str2, str1 & str3, str 1 & str4, str1 & str5,
str2 & str3, str3 & str4, str3 & str5, str4 & str5.


decorate-sort-undecorate is the idion for this

l = <list of strings>

l = [(len(w), w) for w in l]
l.sort()
l = [w for _, w in l]
Diez
May 18 '06 #2
manstey wrote:
Hi,

Is there a clever way to see if two strings of the same length vary by
only one character, and what the character is in both strings.

E.g. str1=yaqtil str2=yaqtel

they differ at str1[4] and the difference is ('i','e')


something like this maybe?
str1='yaqtil'
str2='yaqtel'
set(enumerate(s tr1)) ^ set(enumerate(s tr2)) set([(4, 'e'), (4, 'i')])


--
- Justin

May 19 '06 #3
manstey wrote:
Hi,

Is there a clever way to see if two strings of the same length vary by
only one character, and what the character is in both strings.

E.g. str1=yaqtil str2=yaqtel

they differ at str1[4] and the difference is ('i','e')

But if there was str1=yiqtol and str2=yaqtel, I am not interested.

can anyone suggest a simple way to do this?

My next problem is, I have a list of 300,000+ words and I want to find
every pair of such strings. I thought I would first sort on length of
string, but how do I iterate through the following:

str1
str2
str3
str4
str5

so that I compare str1 & str2, str1 & str3, str 1 & str4, str1 & str5,
str2 & str3, str3 & str4, str3 & str5, str4 & str5.


If your strings are pretty short you can do it like this even without
sorting by length first:

def fuzzy_keys(s):
for pos in range(len(s)):
yield s[0:pos]+chr(0)+s[pos+1:]

def fuzzy_insert(d, s):
for fuzzy_key in fuzzy_keys(s):
if fuzzy_key in d:
strings = d[fuzzy_key]
if type(strings) is list:
strings += s
else:
d[fuzzy_key] = [strings, s]
else:
d[fuzzy_key] = s

def gather_fuzzy_ma tches(d):
for strings in d.itervalues():
if type(strings) is list:
yield strings

acc = {}
fuzzy_insert(ac c, "yaqtel")
fuzzy_insert(ac c, "yaqtil")
fuzzy_insert(ac c, "oaqtil")
print list(gather_fuz zy_matches(acc) )

prints

[['yaqtil', 'oaqtil'], ['yaqtel', 'yaqtil']]

May 19 '06 #4

manstey wrote:
Hi,

Is there a clever way to see if two strings of the same length vary by
only one character, and what the character is in both strings.
You want zip.

def diffbyonlyone(s tring1, string2):
diffcount = 0
for c1, c2 in zip(string1, string2):
if c1 != c2:
diffcount += 1
if diffcount > 1:
return False
return diffcount == 1

print diffbyonlyone(" yaqtil","yaqtel ") # True
print diffbyonlyone(" yiqtol","yaqtel ") # False

If your strings are long, it might be faster/more memory efficient to
use itertools.izip instead.
My next problem is, I have a list of 300,000+ words and I want to find
every pair of such strings. I thought I would first sort on length of
string, but how do I iterate through the following:

str1
str2
str3
str4
str5

so that I compare str1 & str2, str1 & str3, str 1 & str4, str1 & str5,
str2 & str3, str3 & str4, str3 & str5, str4 & str5.


for index1 in xrange(len(word s)):
for index2 in xrange(index1+1 ,len(words)):
if diffbyonlyone(w ords[index1], words[index2]):
print words[index1] + " -- " + words[index2]

....but by all means run that only on sets of words that you have
already identified, pursuant to some criteria like word length, to be
likely matches. Do the math; that's a lot of comparisons!

May 19 '06 #5

"manstey" <ma*****@csu.ed u.au> wrote in message
news:11******** **************@ j55g2000cwa.goo glegroups.com.. .
Hi,

Is there a clever way to see if two strings of the same length vary by
only one character, and what the character is in both strings.

E.g. str1=yaqtil str2=yaqtel

they differ at str1[4] and the difference is ('i','e')

def offByNoMoreThan OneCharacter(a, b): .... return len(a)==len(b) and sum(map(lambda (x,y): x==y, zip(a,b))) >=
len(a)-1
.... offByNoMoreThan OneCharacter("a bc","abd") True offByNoMoreThan OneCharacter("a bc","axd") False offByNoMoreThan OneCharacter("y aqtil","yaqtel" ) True offByNoMoreThan OneCharacter("y iqtol","yaqtel" )

False

This takes advantage of the Python convention that True evaluates to 1 and
False evaluates to 0.

-- Paul

May 19 '06 #6
"Paul McGuire" <pt***@austin.r r._bogus_.com> writes:
def offByNoMoreThan OneCharacter(a, b):

... return len(a)==len(b) and sum(map(lambda (x,y): x==y, zip(a,b))) >=
len(a)-1


Yikes! How about (untested):

def offByNoMoreThan OneCharacter(a, b):
return len(a)==len(b) and \
len([i for i in xrange(len(a)) if a[i] != b[i]]) <= 1

which seems more direct. I'm not sure it's correct since I'd
consider "apple" and "apples" to be one character apart even though
their lengths are unequal.

But I think this type of function is useless for the problem at hand,
since there are around N=300,000 words, comparing every pair of them
is O(N**2) which is a bit painful.

I'd use a decorate-sort-undecorate approach like:

# generate all the ways to
def generate_all_va riants(word):
wlower = word.lower()
yield (wlower, word)
for i in xrange(len(word )):
yield (wlower[:i] + wlower[i+1:], word)

Use this to generate all the variants of all the words in the
dictionary, and write those out into a file, each line containing a
variant plus the original word. Then use a sorting utility like the
unix "sort" program to sort the file. Those programs work efficiently
even if the file is too large to fit in memory. Then read the sorted
file to find equivalence classes on the variants.
May 19 '06 #7
manstey wrote:
Is there a clever way to see if two strings of the same length vary by
only one character, and what the character is in both strings.

E.g. str1=yaqtil str2=yaqtel

they differ at str1[4] and the difference is ('i','e')

But if there was str1=yiqtol and str2=yaqtel, I am not interested.

can anyone suggest a simple way to do this?

My next problem is, I have a list of 300,000+ words and I want to find
every pair of such strings. I thought I would first sort on length of
string, but how do I iterate through the following:


Not sure if it can handle 300000 words, but here is something to play with.

import sys

def find_similars(w ords, lookup=None, dupes=None):
if lookup is None:
lookup = {}
if dupes is None:
dupes = set()
for word in words:
low_word = word.lower()
if low_word not in dupes:
dupes.add(low_w ord)
last = None
for i, c in enumerate(low_w ord):
if c == last: continue
key = low_word[:i], low_word[i+1:]
if key in lookup:
lookup[key].append(word)
else:
lookup[key] = [word]
last = c
return (group for group in lookup.itervalu es() if len(group) > 1)

def main():
import optparse
parser = optparse.Option Parser()
parser.usage += " infile[s]"
parser.add_opti on("-n", "--limit", type="int", help="process at most
LIMIT words")
options, args = parser.parse_ar gs()
if args:
words = (w.strip() for fn in args for w in open(fn))
else:
words = (w.strip() for w in sys.stdin)
if options.limit is not None:
from itertools import islice
words = islice(words, max_len)

for group in find_similars(w ords):
print " ".join(sorted(g roup))

if __name__ == "__main__":
main()

Peter
May 19 '06 #8
> Use the levenshtein distance.

Given the constraint that the two strings are the same length, I'm
assuming (as other posters appear to have done) that "vary by only one
character" precludes insertion and deletion operations.

In that case, the problem can be solved in O(n) time by a simple loop
which counts the number of differences and notes the position of the
first (if any) difference. Any full-distance Levenshtein method that
does no pre-processing must take O(n**2) time. It is possible to take
advantage of the fact that the OP doesn't care about what the distance
is exactly if it is not 1; 0 is just as bad as 2, 3, 4, etc. In this
case it is possible to achieve O(n) time [ref. Ukkonen]. However (a)
one still needs to track where the difference arose (b) we are talking
about extreme overkill for the OP's problem.

May 19 '06 #9
"John Machin" <sj******@lexic on.net> wrote in message
news:11******** **************@ j73g2000cwa.goo glegroups.com.. .
In that case, the problem can be solved in O(n) time by a simple loop
which counts the number of differences and notes the position of the
first (if any) difference. Any full-distance Levenshtein method that
does no pre-processing must take O(n**2) time. It is possible to take
advantage of the fact that the OP doesn't care about what the distance
is exactly if it is not 1; 0 is just as bad as 2, 3, 4, etc.


Here is a class that implements 4 different approaches to this comparison
problem, with a configurable number of maximum mismatches (where a and b are
the strings being compared):

1. use sum(map(lambda (x,y): x!=y, itertools.izip( a,b))) to count
mismatches - no short-circuiting
2. use sum(map(abs, itertools.starm ap(cmp, itertools.izip( a,b)))) to count
mismatches - no short-circuiting
3. use explicit for loop to compare each tuple returned from
itertools.izip( a,b) - short-circuits when number of mismatches exceeds
allowable number
4. use for loop over itertools.takew hile to count mismatches -
short-circuits when number of mismatches exceeds allowable number

Also note the short-circuiting in the initializer - if the strings being
compared are shorter in length then the number of allowed mismatches, than
they will always match, no comparisons required. (In the OP's specific
case, any two one-character strings will match within one character).

Of course this does nothing to address the larger issue of the program
design - I just wanted to learn a little more about itertools. :)

-- Paul
from itertools import izip,starmap,ta kewhile

class offByNoMoreThan OneCharacter(ob ject):
def __init__(self,a ,b,maxMismatche s=1):
len_a = len(a)
self.val = None
if len_a != len(b):
self.val = False
elif len_a <= maxMismatches:
self.val = True
else:
self.a = a
self.b = b
self.maxMismatc hes = maxMismatches

def eval1(self):
# one-liner using map with lambda - sum counts mismatches
self.val = sum(map(lambda (x,y): x!=y, izip(self.a,sel f.b))) <=
self.maxMismatc hes

def eval2(self):
# one-liner using map with abs and cmp - sum counts mismatches
self.val = sum(map(abs, starmap(cmp, izip(self.a,sel f.b)))) <=
self.maxMismatc hes

def eval3(self):
# explicit traversal, with short-circuit when too many mismatches
found
mismatches = 0
for (ac,bc) in izip(self.a,sel f.b):
if ac != bc:
mismatches += 1
if mismatches > self.maxMismatc hes:
self.val = False
break
else:
self.val = True

def eval4(self):
# traversal using takewhile, also short-circuits when too many
mismatches found
numMismatches = 0
maxMismatches = self.maxMismatc hes
stillOk = lambda (s,t)=(None,Non e) : numMismatches <= maxMismatches
for (ac,bc) in takewhile(still Ok, izip(self.a,sel f.b)):
numMismatches += (ac != bc)
self.val = stillOk()
# special instance method to return "boolean-ness" of this object
def __nonzero__(sel f):
if self.val is None:
# change this line to use whichever eval method you want to test
self.eval1()
return self.val
def test(a,b):
print a,b
print bool(offByNoMor eThanOneCharact er(a,b))
print

test("abc","abd ")
test("abc","axd ")
test("yaqtil"," yaqtel")
test("yiqtol"," yaqtel")
May 19 '06 #10

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

8
2718
by: Travis Ray | last post by:
Hi, I just started a java class and I have this assignment and I'm stumped... I realize the code is messy... but I just started... so go easy on me :) Java Problem I am working on this assignment. The user enters a loan amount, number of years to repay, and annual percentage rate. The project is supposed to output a table for each...
3
1552
by: Amy G | last post by:
I have a whole bunch of tuples that look something like this, aTuple = ('1074545869.6580.msg', 't_bryan_pw@gmcc.ab.ca', 'Your one stop prescriptions') now that I have this I try for x, y, z in aTuple: do something with x do something with y
2
1169
by: Mitch | last post by:
when using CIN How do i make it so user input can be 5 or 6 letters and how do i put input into an if else statement ? Example, i want the user to be able to input a word then program to decide what word to spit out. char input; cout<<"Hi "; cin>>input
24
2338
by: Apotheosis | last post by:
The problem professor gave us is: Write a program which reads two integer values. If the first is less than the second, print the message "up". If the second is less than the first, print the message "down" If the numbers are equal, print the message "equal" If there is an error reading the data, print a message containing the word "Error"...
3
1187
by: Blue Man | last post by:
hello I am writing an error handling within different classes in asp.net. the error class reports the Exception E's string to the database. but it cuts big strings, if i just display the error it is ok , but when i send it to function to store it in database just a little part of error string appears. is that because of size of string?cuz...
4
1255
by: ProvoWallis | last post by:
I'm totally stumped by this problem so I'm hoping someone can give me a little advice or point me in the right direction. I have a file that looks like this: <SC>APPEAL<XC>40-24; 40-46; 42-46; 42-48; 42-62; 42-63 <SC>PROC GUIDE<XC>92<LT>1(b)(1) (i.e., <<SC><XC><SC><XC><LT>
3
1365
by: kevinwolfe | last post by:
Hi all. I'd like any suggestions on how I can get my data set (not a DataSet) bound to a couple of controls on a form. Let me start by describing what my data looks like. Each entry correlates to a piece of equipment and has the following information: Two strings, one for a system name and one for equipment name An integer An...
6
11297
by: johnny | last post by:
How do I join two string variables? I want to do: download_dir + filename. download_dir=r'c:/download/' filename =r'log.txt' I want to get something like this: c:/download/log.txt
3
2116
by: code_berzerker | last post by:
Hi i'm relatively new to Python and my C/C++ knowledge is near to None. Having said that I feel justified to ask stupid questions :) Ok now more seriously. I have question refering to char* used as function parameters to return values. I have read SWIG manual to find best way to overcome that, but there are many warnings about memory leaks...
0
7849
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can effortlessly switch the default language on Windows 10 without reinstalling. I'll walk you through it. First, let's disable language...
0
8215
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers, it seems that the internal comparison operator "<=>" tries to promote arguments from unsigned to signed. This is as boiled down as I can make it. ...
0
8347
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven tapestry of website design and digital marketing. It's not merely about having a website; it's about crafting an immersive digital experience that...
1
7973
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For...
0
8220
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each protocol has its own unique characteristics and advantages, but as a user who is planning to build a smart home system, I am a bit confused by the...
0
6626
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then...
1
5718
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new presenter, Adolph Dupré who will be discussing some powerful techniques for using class modules. He will explain when you may want to use classes...
0
3844
by: TSSRALBI | last post by:
Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The last exercise I practiced was to create a LAN-to-LAN VPN between two Pfsense firewalls, by using IPSEC protocols. I succeeded, with both firewalls in...
0
1189
bsmnconsultancy
by: bsmnconsultancy | last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.