469,957 Members | 2,006 Online
Bytes | Developer Community
New Post

Home Posts Topics Members FAQ

Post your question to a community of 469,957 developers. It's quick & easy.

A bug in difflib module? (find_longest_match)

from random import randint

s1 = ''
s2 = ''

for i in xrange(1000):
s1 += chr(randint(97,122))
s2 += chr(randint(97,122))

print s1[:25]
print s2[:25]

import difflib

s = difflib.SequenceMatcher(None, s1, s2)

print s.find_longest_match(0, len(s1), 0, len(s2))
>>============== RESTART ====================
yymgzldocfaafcborxbpqyade
urvwtnkwfmcduybjqmrleflqx
(0, 0, 0)
>>>
I think it's line #314 in difflib "who's to blame" --
Jun 27 '08 #1
3 1636
En Thu, 01 May 2008 04:35:17 -0300, n00m <n0**@narod.ruescribió:
from random import randint

s1 = ''
s2 = ''

for i in xrange(1000):
s1 += chr(randint(97,122))
s2 += chr(randint(97,122))

print s1[:25]
print s2[:25]

import difflib

s = difflib.SequenceMatcher(None, s1, s2)

print s.find_longest_match(0, len(s1), 0, len(s2))
>>>============== RESTART ====================
yymgzldocfaafcborxbpqyade
urvwtnkwfmcduybjqmrleflqx
(0, 0, 0)
>>>>

I think it's line #314 in difflib "who's to blame" --
Me too. Could you think of some alternative? Simply disabling that
"popularity check" would slow down the algorithm, according to the
comments.

--
Gabriel Genellina

Jun 27 '08 #2


Gabriel Genellina:
En Thu, 01 May 2008 04:35:17 -0300, n00m <n0**@narod.ruescribi�:
from random import randint

s1 = ''
s2 = ''

for i in xrange(1000):
s1 += chr(randint(97,122))
s2 += chr(randint(97,122))

print s1[:25]
print s2[:25]

import difflib

s = difflib.SequenceMatcher(None, s1, s2)

print s.find_longest_match(0, len(s1), 0, len(s2))
>>============== RESTART ====================
yymgzldocfaafcborxbpqyade
urvwtnkwfmcduybjqmrleflqx
(0, 0, 0)
>>>
I think it's line #314 in difflib "who's to blame" --

Me too. Could you think of some alternative? Simply disabling that
"popularity check" would slow down the algorithm, according to the
comments.

--
Gabriel Genellina
No idea :)
Jun 27 '08 #3
En Thu, 01 May 2008 06:21:22 -0300, n00m <n0**@narod.ruescribió:
import difflib
s = difflib.SequenceMatcher(None, s1, s2)
print s.find_longest_match(0, len(s1), 0, len(s2))
(0, 0, 0)

I think it's line #314 in difflib "who's to blame" --

Me too. Could you think of some alternative? Simply disabling that
"popularity check" would slow down the algorithm, according to the
comments.

No idea :)
The "ignore popular elements" is only an optmization, and it should not be
applied in your case because it forces the algorithm to yield an invalid
result.
I can think of two alternatives:
- tune up the conditions when the optimization is used, or
- make it user configurable

SequenceMatcher is a public class, and it is also internally used by
Differ and others to compare both sequences of lines *and* pairs of
similar lines (considered as sequences of characters). In this last usage
the "ignore popular elements" has no much sense, as shown in your example
feeding directly two dissimilar strings.
In principle one should disable the "populardict" stuff when dealing with
strings. Below is a simple attempt to detect that case:

(around line 311 in difflib.py)

b_is_string = isinstance(b, basestring) # add this line
for i, elt in enumerate(b):
if elt in b2j:
indices = b2j[elt]
if not b_is_string and n >= 200 and len(indices) * 100 >
n: # change this line
populardict[elt] = 1
del indices[:]
else:
indices.append(i)
else:
b2j[elt] = [i]
--
Gabriel Genellina

Jun 27 '08 #4

This discussion thread is closed

Replies have been disabled for this discussion.

Similar topics

reply views Thread by shuhsien | last post: by
3 posts views Thread by Humpdydum | last post: by
5 posts views Thread by koara | last post: by
11 posts views Thread by John Henry | last post: by
reply views Thread by stefaan | last post: by
7 posts views Thread by whitewave | last post: by
2 posts views Thread by krishnakant Mane | last post: by
1 post views Thread by erikcw | last post: by
reply views Thread by rainxy | last post: by
By using this site, you agree to our Privacy Policy and Terms of Use.