Hi,
Is there a clever way to see if two strings of the same length vary by
only one character, and what the character is in both strings.
E.g. str1=yaqtil str2=yaqtel
they differ at str1[4] and the difference is ('i','e')
But if there was str1=yiqtol and str2=yaqtel, I am not interested.
can anyone suggest a simple way to do this?
My next problem is, I have a list of 300,000+ words and I want to find
every pair of such strings. I thought I would first sort on length of
string, but how do I iterate through the following:
str1
str2
str3
str4
str5
so that I compare str1 & str2, str1 & str3, str 1 & str4, str1 & str5,
str2 & str3, str3 & str4, str3 & str5, str4 & str5.
Thanks in advance,
Matthew 14 1493
> Is there a clever way to see if two strings of the same length vary by only one character, and what the character is in both strings.
E.g. str1=yaqtil str2=yaqtel
they differ at str1[4] and the difference is ('i','e')
But if there was str1=yiqtol and str2=yaqtel, I am not interested.
can anyone suggest a simple way to do this?
Use the levenshtein distance. http://en.wikisource.org/wiki/Levenshtein_distance
My next problem is, I have a list of 300,000+ words and I want to find every pair of such strings. I thought I would first sort on length of string, but how do I iterate through the following:
str1 str2 str3 str4 str5
so that I compare str1 & str2, str1 & str3, str 1 & str4, str1 & str5, str2 & str3, str3 & str4, str3 & str5, str4 & str5.
decorate-sort-undecorate is the idion for this
l = <list of strings>
l = [(len(w), w) for w in l]
l.sort()
l = [w for _, w in l]
Diez
manstey wrote: Hi,
Is there a clever way to see if two strings of the same length vary by only one character, and what the character is in both strings.
E.g. str1=yaqtil str2=yaqtel
they differ at str1[4] and the difference is ('i','e')
something like this maybe? str1='yaqtil' str2='yaqtel' set(enumerate(s tr1)) ^ set(enumerate(s tr2))
set([(4, 'e'), (4, 'i')])
--
- Justin
manstey wrote: Hi,
Is there a clever way to see if two strings of the same length vary by only one character, and what the character is in both strings.
E.g. str1=yaqtil str2=yaqtel
they differ at str1[4] and the difference is ('i','e')
But if there was str1=yiqtol and str2=yaqtel, I am not interested.
can anyone suggest a simple way to do this?
My next problem is, I have a list of 300,000+ words and I want to find every pair of such strings. I thought I would first sort on length of string, but how do I iterate through the following:
str1 str2 str3 str4 str5
so that I compare str1 & str2, str1 & str3, str 1 & str4, str1 & str5, str2 & str3, str3 & str4, str3 & str5, str4 & str5.
If your strings are pretty short you can do it like this even without
sorting by length first:
def fuzzy_keys(s):
for pos in range(len(s)):
yield s[0:pos]+chr(0)+s[pos+1:]
def fuzzy_insert(d, s):
for fuzzy_key in fuzzy_keys(s):
if fuzzy_key in d:
strings = d[fuzzy_key]
if type(strings) is list:
strings += s
else:
d[fuzzy_key] = [strings, s]
else:
d[fuzzy_key] = s
def gather_fuzzy_ma tches(d):
for strings in d.itervalues():
if type(strings) is list:
yield strings
acc = {}
fuzzy_insert(ac c, "yaqtel")
fuzzy_insert(ac c, "yaqtil")
fuzzy_insert(ac c, "oaqtil")
print list(gather_fuz zy_matches(acc) )
prints
[['yaqtil', 'oaqtil'], ['yaqtel', 'yaqtil']]
manstey wrote: Hi,
Is there a clever way to see if two strings of the same length vary by only one character, and what the character is in both strings.
You want zip.
def diffbyonlyone(s tring1, string2):
diffcount = 0
for c1, c2 in zip(string1, string2):
if c1 != c2:
diffcount += 1
if diffcount > 1:
return False
return diffcount == 1
print diffbyonlyone(" yaqtil","yaqtel ") # True
print diffbyonlyone(" yiqtol","yaqtel ") # False
If your strings are long, it might be faster/more memory efficient to
use itertools.izip instead.
My next problem is, I have a list of 300,000+ words and I want to find every pair of such strings. I thought I would first sort on length of string, but how do I iterate through the following:
str1 str2 str3 str4 str5
so that I compare str1 & str2, str1 & str3, str 1 & str4, str1 & str5, str2 & str3, str3 & str4, str3 & str5, str4 & str5.
for index1 in xrange(len(word s)):
for index2 in xrange(index1+1 ,len(words)):
if diffbyonlyone(w ords[index1], words[index2]):
print words[index1] + " -- " + words[index2]
....but by all means run that only on sets of words that you have
already identified, pursuant to some criteria like word length, to be
likely matches. Do the math; that's a lot of comparisons!
"manstey" <ma*****@csu.ed u.au> wrote in message
news:11******** **************@ j55g2000cwa.goo glegroups.com.. . Hi,
Is there a clever way to see if two strings of the same length vary by only one character, and what the character is in both strings.
E.g. str1=yaqtil str2=yaqtel
they differ at str1[4] and the difference is ('i','e') def offByNoMoreThan OneCharacter(a, b):
.... return len(a)==len(b) and sum(map(lambda (x,y): x==y, zip(a,b))) >=
len(a)-1
.... offByNoMoreThan OneCharacter("a bc","abd")
True offByNoMoreThan OneCharacter("a bc","axd")
False offByNoMoreThan OneCharacter("y aqtil","yaqtel" )
True offByNoMoreThan OneCharacter("y iqtol","yaqtel" )
False
This takes advantage of the Python convention that True evaluates to 1 and
False evaluates to 0.
-- Paul
"Paul McGuire" <pt***@austin.r r._bogus_.com> writes: def offByNoMoreThan OneCharacter(a, b):
... return len(a)==len(b) and sum(map(lambda (x,y): x==y, zip(a,b))) >= len(a)-1
Yikes! How about (untested):
def offByNoMoreThan OneCharacter(a, b):
return len(a)==len(b) and \
len([i for i in xrange(len(a)) if a[i] != b[i]]) <= 1
which seems more direct. I'm not sure it's correct since I'd
consider "apple" and "apples" to be one character apart even though
their lengths are unequal.
But I think this type of function is useless for the problem at hand,
since there are around N=300,000 words, comparing every pair of them
is O(N**2) which is a bit painful.
I'd use a decorate-sort-undecorate approach like:
# generate all the ways to
def generate_all_va riants(word):
wlower = word.lower()
yield (wlower, word)
for i in xrange(len(word )):
yield (wlower[:i] + wlower[i+1:], word)
Use this to generate all the variants of all the words in the
dictionary, and write those out into a file, each line containing a
variant plus the original word. Then use a sorting utility like the
unix "sort" program to sort the file. Those programs work efficiently
even if the file is too large to fit in memory. Then read the sorted
file to find equivalence classes on the variants.
manstey wrote: Is there a clever way to see if two strings of the same length vary by only one character, and what the character is in both strings.
E.g. str1=yaqtil str2=yaqtel
they differ at str1[4] and the difference is ('i','e')
But if there was str1=yiqtol and str2=yaqtel, I am not interested.
can anyone suggest a simple way to do this?
My next problem is, I have a list of 300,000+ words and I want to find every pair of such strings. I thought I would first sort on length of string, but how do I iterate through the following:
Not sure if it can handle 300000 words, but here is something to play with.
import sys
def find_similars(w ords, lookup=None, dupes=None):
if lookup is None:
lookup = {}
if dupes is None:
dupes = set()
for word in words:
low_word = word.lower()
if low_word not in dupes:
dupes.add(low_w ord)
last = None
for i, c in enumerate(low_w ord):
if c == last: continue
key = low_word[:i], low_word[i+1:]
if key in lookup:
lookup[key].append(word)
else:
lookup[key] = [word]
last = c
return (group for group in lookup.itervalu es() if len(group) > 1)
def main():
import optparse
parser = optparse.Option Parser()
parser.usage += " infile[s]"
parser.add_opti on("-n", "--limit", type="int", help="process at most
LIMIT words")
options, args = parser.parse_ar gs()
if args:
words = (w.strip() for fn in args for w in open(fn))
else:
words = (w.strip() for w in sys.stdin)
if options.limit is not None:
from itertools import islice
words = islice(words, max_len)
for group in find_similars(w ords):
print " ".join(sorted(g roup))
if __name__ == "__main__":
main()
Peter
> Use the levenshtein distance.
Given the constraint that the two strings are the same length, I'm
assuming (as other posters appear to have done) that "vary by only one
character" precludes insertion and deletion operations.
In that case, the problem can be solved in O(n) time by a simple loop
which counts the number of differences and notes the position of the
first (if any) difference. Any full-distance Levenshtein method that
does no pre-processing must take O(n**2) time. It is possible to take
advantage of the fact that the OP doesn't care about what the distance
is exactly if it is not 1; 0 is just as bad as 2, 3, 4, etc. In this
case it is possible to achieve O(n) time [ref. Ukkonen]. However (a)
one still needs to track where the difference arose (b) we are talking
about extreme overkill for the OP's problem.
"John Machin" <sj******@lexic on.net> wrote in message
news:11******** **************@ j73g2000cwa.goo glegroups.com.. . In that case, the problem can be solved in O(n) time by a simple loop which counts the number of differences and notes the position of the first (if any) difference. Any full-distance Levenshtein method that does no pre-processing must take O(n**2) time. It is possible to take advantage of the fact that the OP doesn't care about what the distance is exactly if it is not 1; 0 is just as bad as 2, 3, 4, etc.
Here is a class that implements 4 different approaches to this comparison
problem, with a configurable number of maximum mismatches (where a and b are
the strings being compared):
1. use sum(map(lambda (x,y): x!=y, itertools.izip( a,b))) to count
mismatches - no short-circuiting
2. use sum(map(abs, itertools.starm ap(cmp, itertools.izip( a,b)))) to count
mismatches - no short-circuiting
3. use explicit for loop to compare each tuple returned from
itertools.izip( a,b) - short-circuits when number of mismatches exceeds
allowable number
4. use for loop over itertools.takew hile to count mismatches -
short-circuits when number of mismatches exceeds allowable number
Also note the short-circuiting in the initializer - if the strings being
compared are shorter in length then the number of allowed mismatches, than
they will always match, no comparisons required. (In the OP's specific
case, any two one-character strings will match within one character).
Of course this does nothing to address the larger issue of the program
design - I just wanted to learn a little more about itertools. :)
-- Paul
from itertools import izip,starmap,ta kewhile
class offByNoMoreThan OneCharacter(ob ject):
def __init__(self,a ,b,maxMismatche s=1):
len_a = len(a)
self.val = None
if len_a != len(b):
self.val = False
elif len_a <= maxMismatches:
self.val = True
else:
self.a = a
self.b = b
self.maxMismatc hes = maxMismatches
def eval1(self):
# one-liner using map with lambda - sum counts mismatches
self.val = sum(map(lambda (x,y): x!=y, izip(self.a,sel f.b))) <=
self.maxMismatc hes
def eval2(self):
# one-liner using map with abs and cmp - sum counts mismatches
self.val = sum(map(abs, starmap(cmp, izip(self.a,sel f.b)))) <=
self.maxMismatc hes
def eval3(self):
# explicit traversal, with short-circuit when too many mismatches
found
mismatches = 0
for (ac,bc) in izip(self.a,sel f.b):
if ac != bc:
mismatches += 1
if mismatches > self.maxMismatc hes:
self.val = False
break
else:
self.val = True
def eval4(self):
# traversal using takewhile, also short-circuits when too many
mismatches found
numMismatches = 0
maxMismatches = self.maxMismatc hes
stillOk = lambda (s,t)=(None,Non e) : numMismatches <= maxMismatches
for (ac,bc) in takewhile(still Ok, izip(self.a,sel f.b)):
numMismatches += (ac != bc)
self.val = stillOk()
# special instance method to return "boolean-ness" of this object
def __nonzero__(sel f):
if self.val is None:
# change this line to use whichever eval method you want to test
self.eval1()
return self.val
def test(a,b):
print a,b
print bool(offByNoMor eThanOneCharact er(a,b))
print
test("abc","abd ")
test("abc","axd ")
test("yaqtil"," yaqtel")
test("yiqtol"," yaqtel") This thread has been closed and replies have been disabled. Please start a new discussion. Similar topics |
by: Travis Ray |
last post by:
Hi, I just started a java class and I have this assignment and I'm
stumped... I realize the code is messy... but I just started... so go easy
on me :)
Java Problem
I am working on this assignment. The user enters a loan amount, number of
years to repay, and annual percentage rate. The project is supposed to
output a table for each...
|
by: Amy G |
last post by:
I have a whole bunch of tuples that look something like this,
aTuple = ('1074545869.6580.msg', 't_bryan_pw@gmcc.ab.ca', 'Your one stop
prescriptions')
now that I have this I try
for x, y, z in aTuple:
do something with x
do something with y
|
by: Mitch |
last post by:
when using CIN
How do i make it so user input can be 5 or 6 letters
and
how do i put input into an if else statement ?
Example, i want the user to be able to input a word then program to decide
what word to spit out.
char input;
cout<<"Hi ";
cin>>input
|
by: Apotheosis |
last post by:
The problem professor gave us is:
Write a program which reads two integer values. If the first is less
than the second, print the message "up". If the second is less than
the first, print the message "down" If the numbers are equal, print
the message "equal" If there is an error reading the data, print a
message containing the word "Error"...
|
by: Blue Man |
last post by:
hello
I am writing an error handling within different classes in asp.net. the
error class reports the Exception E's string to the database.
but it cuts big strings, if i just display the error it is ok , but when i
send it to function to store it in database just a little part of error
string appears.
is that because of size of string?cuz...
| |
by: ProvoWallis |
last post by:
I'm totally stumped by this problem so I'm hoping someone can give me a
little advice or point me in the right direction.
I have a file that looks like this:
<SC>APPEAL<XC>40-24; 40-46; 42-46; 42-48; 42-62; 42-63 <SC>PROC
GUIDE<XC>92<LT>1(b)(1)
(i.e., <<SC><XC><SC><XC><LT>
|
by: kevinwolfe |
last post by:
Hi all.
I'd like any suggestions on how I can get my data set (not a DataSet)
bound to a couple of controls on a form. Let me start by describing
what my data looks like. Each entry correlates to a piece of equipment
and has the following information:
Two strings, one for a system name and one for equipment name
An integer
An...
|
by: johnny |
last post by:
How do I join two string variables?
I want to do: download_dir + filename.
download_dir=r'c:/download/'
filename =r'log.txt'
I want to get something like this:
c:/download/log.txt
|
by: code_berzerker |
last post by:
Hi i'm relatively new to Python and my C/C++ knowledge is near to
None. Having said that I feel justified to ask stupid questions :)
Ok now more seriously. I have question refering to char* used as
function parameters to return values. I have read SWIG manual to find
best way to overcome that, but there are many warnings about memory
leaks...
|
by: Hystou |
last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can effortlessly switch the default language on Windows 10 without reinstalling. I'll walk you through it.
First, let's disable language...
|
by: Oralloy |
last post by:
Hello folks,
I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>".
The problem is that using the GNU compilers, it seems that the internal comparison operator "<=>" tries to promote arguments from unsigned to signed.
This is as boiled down as I can make it. ...
| |
by: jinu1996 |
last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven tapestry of website design and digital marketing. It's not merely about having a website; it's about crafting an immersive digital experience that...
|
by: Hystou |
last post by:
Overview:
Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For...
|
by: tracyyun |
last post by:
Dear forum friends,
With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each protocol has its own unique characteristics and advantages, but as a user who is planning to build a smart home system, I am a bit confused by the...
|
by: agi2029 |
last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then...
|
by: isladogs |
last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM).
In this session, we are pleased to welcome a new presenter, Adolph Dupré who will be discussing some powerful techniques for using class modules.
He will explain when you may want to use classes...
|
by: TSSRALBI |
last post by:
Hello
I'm a network technician in training and I need your help.
I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs.
The last exercise I practiced was to create a LAN-to-LAN VPN between two Pfsense firewalls, by using IPSEC protocols.
I succeeded, with both firewalls in...
| |
by: bsmnconsultancy |
last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating...
| |