By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
455,548 Members | 1,495 Online
Bytes IT Community
+ Ask a Question
Need help? Post your question and get tips & solutions from a community of 455,548 IT Pros & Developers. It's quick & easy.

find_longest_match in SequenceMatcher

P: n/a
Hello, it might be too late or too hot, but i cannot work out this
behaviour of find_longest_match() in difflib.SequenceMatcher:

string1:
releasenotesforwildmagicversion01thiscdromcontains theinitialreleaseofthesourcecodethataccompaniesthe book"3dgameenginedesign:apracticalapproachtorealti mecomputergraphics"thereareanumberofknownissuesabo utthecodeastheseissuesareaddressedtheupdatedcodewi llbeavailableatthewebsitehttp://wwwmagicsoftwarecom/3dgameenginedesignhtmlbugssuggestionsforimprovemen tsandothercorrespondencecanbesenttosupport@magicso ftwarecomthecurrentknownissuesare1meshalgorithmfor continuouslevelofdetailappearsnottobeworkingbase

string2:
releasenotesforwildmagicversion02updatefromversion 01toversion02ifyourcopyofthebookhasversion01andify oudownloadedversion02fromthewebsitethenapplythefol lowingdirectionsforinstallingtheupdateforalinuxins tallationseethesectionattheendofthisdocumentupdate directionsassumingthatthetopleveldirectoryiscalled magicreplacebyyourtoplevelnameyoushouldhavethevers ion01contentsinthislocation1deletethecontentsofmag ic\include2deletethesubdirectorymagic\source\mgcap plication3deletetheobsoletefiles:amagic\source\mgc

find_longest_match(0,500,0,500)=(24,43,10)="versio n01t"

What? O_o Clearly there is a longer match, right at the beginning!
And then, after removal of the last character from each string (i found
the limit of 500 by trial and error -- and it looks suspiciously
rounded):

find_longest_match(0,499,0,499)=(0,0,32)="releasen otesforwildmagicversion0"
Is this the expected behaviour? What's going on?
Thank you for any ideas

Jul 23 '06 #1
Share this Question
Share on Google+
5 Replies


P: n/a

koara wrote:
Hello, it might be too late or too hot, but i cannot work out this
behaviour of find_longest_match() in difflib.SequenceMatcher:

string1:
[snipped 500-byte string]
>
string2:
[snipped 500-byte string]
>
find_longest_match(0,500,0,500)=(24,43,10)="versio n01t"

What? O_o Clearly there is a longer match, right at the beginning!
And then, after removal of the last character from each string (i found
the limit of 500 by trial and error -- and it looks suspiciously
rounded):
What limit? (a) My results [see below] (b) my inspection of the Python
version 2.4 source for the difflib module (c) what I know of the author
-- all tend to indicate that there is no hidden undocumented length
limit.
>
find_longest_match(0,499,0,499)=(0,0,32)="releasen otesforwildmagicversion0"
Is this the expected behaviour? What's going on?
My code: (koara.py)
8<---
strg1 =
r"""releasenotesforwildmagicversion01thiscdromcont ainstheinitialreleaseofthesourcecodethataccompanie sthebook"3dgameenginedesign:apracticalapproachtore altimecomputergraphics"thereareanumberofknownissue saboutthecodeastheseissuesareaddressedtheupdatedco dewillbeavailableatthewebsitehttp://wwwmagicsoftwarecom/3dgameenginedesignhtmlbugssuggestionsforimprovemen tsandothercorrespondencecanbesenttosupport@magicso ftwarecomthecurrentknownissuesare1meshalgorithmfor continuouslevelofdetailappearsnottobeworkingbase"" "
strg2 =
r"""releasenotesforwildmagicversion02updatefromver sion01toversion02ifyourcopyofthebookhasversion01an difyoudownloadedversion02fromthewebsitethenapplyth efollowingdirectionsforinstallingtheupdateforalinu xinstallationseethesectionattheendofthisdocumentup datedirectionsassumingthatthetopleveldirectoryisca lledmagicreplacebyyourtoplevelnameyoushouldhavethe version01contentsinthislocation1deletethecontentso fmagic\include2deletethesubdirectorymagic\source\m gcapplication3deletetheobsoletefiles:amagic\source \mgc"""
import sys
print sys.version
from difflib import SequenceMatcher as SM
smo = SM(None, strg1, strg2)
print len(strg1), len(strg2)
print smo.find_longest_match(0, 500, 0, 500)
print smo.find_longest_match(0, 499, 0, 499)
print smo.find_longest_match(0, 100, 0, 100)
print smo.find_longest_match(1, 101, 1, 101)
print smo.find_longest_match(2, 102, 2, 102)
8<---

The results on 4 python versions:

C:\junk>c:\python24\python koara.py
2.4.3 (#69, Mar 29 2006, 17:35:34) [MSC v.1310 32 bit (Intel)]
500 500
(24, 43, 10)
(24, 43, 10)
(24, 43, 10)
(24, 43, 10)
(24, 43, 10)

C:\junk>c:\python23\python koara.py
2.3.5 (#62, Feb 8 2005, 16:23:02) [MSC v.1200 32 bit (Intel)]
500 500
(24, 43, 10)
(24, 43, 10)
(24, 43, 10)
(24, 43, 10)
(24, 43, 10)

C:\junk>c:\python22\python koara.py
2.2.3 (#42, May 30 2003, 18:12:08) [MSC 32 bit (Intel)]
500 500
(0, 0, 32)
(0, 0, 32)
(0, 0, 32)
(1, 1, 31)
(2, 2, 30)

C:\junk>c:\python21\python koara.py
2.1.3 (#35, Apr 8 2002, 17:47:50) [MSC 32 bit (Intel)]
500 500
(0, 0, 32)
(0, 0, 32)
(0, 0, 32)
(1, 1, 31)
(2, 2, 30)

Looks to me like the problem has nothing at all to do with the length
of the searched strings, but a bug appeared in 2.3. What version(s)
were you using? Can you reproduce your results (500 & 499 giving
different answers) with the same version?

Anyway, as they say in the classics, "Take a number; the timbot will be
with you shortly."

Cheers,
John

Jul 24 '06 #2

P: n/a
John Machin wrote:
--test results snip---
Looks to me like the problem has nothing at all to do with the length
of the searched strings, but a bug appeared in 2.3. What version(s)
were you using? Can you reproduce your results (500 & 499 giving
different answers) with the same version?
Hello John, thank you for investigating and responding!

Yes, I can reproduce the behaviour with different results within the
same version -- which is 2.4.3 (#69, Mar 29 2006, 17:35:34) [MSC v.1310
32 bit (Intel)]

The catch is to remove the last character, as i described in my
original post, as opposed to passing reduced length parameters to
find_longest_match, which is what you did.

It is morning now, but i still fail to see the mistake i am making --
if it is indeed a bug, where do i report it?

Cheers!

Jul 24 '06 #3

P: n/a
koara wrote:
John Machin wrote:
--test results snip---
Looks to me like the problem has nothing at all to do with the length
of the searched strings, but a bug appeared in 2.3. What version(s)
were you using? Can you reproduce your results (500 & 499 giving
different answers) with the same version?

Hello John, thank you for investigating and responding!

Yes, I can reproduce the behaviour with different results within the
same version -- which is 2.4.3 (#69, Mar 29 2006, 17:35:34) [MSC v.1310
32 bit (Intel)]

The catch is to remove the last character, as i described in my
original post, as opposed to passing reduced length parameters to
find_longest_match, which is what you did.

It is morning now, but i still fail to see the mistake i am making --
if it is indeed a bug, where do i report it?
Further sniffing shows that we were both partially right/wrong:
There is a bug.
It is length-dependant.
It was introduced in Python 2.3.

If you want to, you can hack your copy of
<Python-2.4_installation_directory>/Lib/difflib.py
so that line 316 reads:
if 0:
instead of:
if n >= 200 and len(indices) * 100 n:

I'll report it.

Cheers,
John

Jul 24 '06 #4

P: n/a

John Machin wrote:
koara wrote:
John Machin wrote:
There is a bug.
I'll report it.
Reported.
http://sourceforge.net/tracker/index...70&atid=105470

Jul 24 '06 #5

P: n/a
Hello again John -- your hack/fix seems to work. Thanks a lot, now
let's hope timbot will indeed be here shortly with a proper fix =)

Jul 25 '06 #6

This discussion thread is closed

Replies have been disabled for this discussion.