By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
444,100 Members | 2,846 Online
Bytes IT Community
+ Ask a Question
Need help? Post your question and get tips & solutions from a community of 444,100 IT Pros & Developers. It's quick & easy.

How fuzzy is get_close_matches() in difflib?

P: n/a
I am just wondering what's with get_close_matches() in difflib. What's
the magic? How fuzzy do I need to get in order to get a match?

Nov 17 '06 #1
Share this Question
Share on Google+
11 Replies


P: n/a
On Thu, 16 Nov 2006 16:40:49 -0800, John Henry wrote:
I am just wondering what's with get_close_matches() in difflib. What's
the magic? How fuzzy do I need to get in order to get a match?

Why don't you try it and see?
>>from difflib import get_close_matches
get_close_matches("appel", ["ape", "apple", "peach", "puppy"])
['apple', 'ape']
>>import keyword as _keyword
get_close_matches("wheel", _keyword.kwlist)
['while']
>>get_close_matches("apple", _keyword.kwlist)
[]
>>get_close_matches("accept", _keyword.kwlist)
['except']
Those example, by the way, come from here:
>>help(get_close_matches)

--
Steven

Nov 17 '06 #2

P: n/a
I did try them and I am impressed. It helped me found a lot of useful
info. I just want to get a feel as to what constitutes a "match".
Steven D'Aprano wrote:
On Thu, 16 Nov 2006 16:40:49 -0800, John Henry wrote:
I am just wondering what's with get_close_matches() in difflib. What's
the magic? How fuzzy do I need to get in order to get a match?


Why don't you try it and see?
>from difflib import get_close_matches
get_close_matches("appel", ["ape", "apple", "peach", "puppy"])
['apple', 'ape']
>import keyword as _keyword
get_close_matches("wheel", _keyword.kwlist)
['while']
>get_close_matches("apple", _keyword.kwlist)
[]
>get_close_matches("accept", _keyword.kwlist)
['except']
Those example, by the way, come from here:
>help(get_close_matches)


--
Steven
Nov 17 '06 #3

P: n/a
On Thu, 16 Nov 2006 20:19:50 -0800, John Henry wrote:
I did try them and I am impressed. It helped me found a lot of useful
info. I just want to get a feel as to what constitutes a "match".
The source code has lots of comments, but they don't explain the basic
algorithm (at least not in the difflib.py supplied with Python 2.3).

There is no single diff algorithm, but I believe that the basic idea is to
look for insertions and/or deletions of strings. If you want more
detail, google "diff". Once you have a list of differences, the closest
match is the search string with the fewest differences.

As for getting a feel of what constitutes a match, I really can't make any
better suggestion than just try lots of examples with the interactive
Python shell.

--
Steven D'Aprano

Nov 17 '06 #4

P: n/a
I encountered a case where I am trying to match "HIDESST1" and
"HIDESCT1" against ["HIDEDST1", "HIDEDCT1", "HIDEDCT2", "HIDEDCT3"]

Well, they both hit "HIDEDST1" as the first match which is not exactly
the result I was looking for. I don't understand why "HIDESCT1" would
not hit "HIDEDCT1" as a first choice.

Steven D'Aprano wrote:
On Thu, 16 Nov 2006 20:19:50 -0800, John Henry wrote:
I did try them and I am impressed. It helped me found a lot of useful
info. I just want to get a feel as to what constitutes a "match".

The source code has lots of comments, but they don't explain the basic
algorithm (at least not in the difflib.py supplied with Python 2.3).

There is no single diff algorithm, but I believe that the basic idea is to
look for insertions and/or deletions of strings. If you want more
detail, google "diff". Once you have a list of differences, the closest
match is the search string with the fewest differences.

As for getting a feel of what constitutes a match, I really can't make any
better suggestion than just try lots of examples with the interactive
Python shell.

--
Steven D'Aprano
Nov 17 '06 #5

P: n/a
On Thu, 16 Nov 2006 22:59:41 -0800, John Henry wrote:
I encountered a case where I am trying to match "HIDESST1" and
"HIDESCT1" against ["HIDEDST1", "HIDEDCT1", "HIDEDCT2", "HIDEDCT3"]

Well, they both hit "HIDEDST1" as the first match which is not exactly
the result I was looking for. I don't understand why "HIDESCT1" would
not hit "HIDEDCT1" as a first choice.
All those strings are so similar that I think the exact order of "best
match" will depend on the precise details of the algorithm. I note that
the difflib specifically says:

[quote]
This does not yield minimal edit sequences, but does tend to yield matches
that "look right" to people.
[end quote]

There are many ways that one can calculate the difference between two
strings, e.g. is the string "acab" an "ab" with "ac" inserted at the
front, or "ab" with "ca" inserted in the middle? The Unix diff utility
may return minimal diffs, but difflib does not make that claim.

You want to see "HIDEDCT1" match closer to "HIDESCT1" than "HIDEDST1":

HIDEDCT1 -- John's "best match" target string
HIDEDST1 -- difflib's "best match" target string
HIDESCT1 -- source string

John's best match matches in seven of eight positions, compared to six
of eight for the difflib best match. Disregarding order, both have seven
matching characters. That's a pretty slim difference between the two:
>>"".join(difflib.Differ().compare("HIDEDCT1", "HIDESCT1"))
' H I D E- D+ S C T 1'
>>"".join(difflib.Differ().compare("HIDEDST1", "HIDESCT1"))
' H I D E- D S+ C T 1'

I honestly don't know how to interpret those results :(

>>s = difflib.SequenceMatcher(None, "HIDEDCT1", "HIDESCT1")
t = difflib.SequenceMatcher(None, "HIDEDST1", "HIDESCT1")

for block in s.get_matching_blocks():
.... print "a[%d] and b[%d] match for %d elements" % block
....
a[0] and b[0] match for 4 elements
a[5] and b[5] match for 3 elements
a[8] and b[8] match for 0 elements
>>>
for block in t.get_matching_blocks():
.... print "a[%d] and b[%d] match for %d elements" % block
....
a[0] and b[0] match for 4 elements
a[5] and b[4] match for 1 elements
a[6] and b[6] match for 2 elements
a[8] and b[8] match for 0 elements
>>>
I think what you are seeing is an artifact of the precise details of the
pattern matcher. Feel free to subclass the SequenceMatcher or Differ
classes to get the results you expect :-)
--
Steven D'Aprano

Nov 17 '06 #6

P: n/a

John Henry wrote:
I am just wondering what's with get_close_matches() in difflib. What's
the magic? How fuzzy do I need to get in order to get a match?
Are you desperate to understand the inner workings of difflib, or do
you want merely to do some fuzzy matching of strings using a well-known
somewhat-more-understandable zillions-of-implementations metric?

If the latter, google "Levenshtein distance" for the metric, and
"Python Levenshtein" -- first hit for me is an implementation in a
Python C-extension. If you don't have the ability to compile a C
extension, or don't need the speed, there should be a few pure-Python
versions around; I'm specifically aware of one by Magnus Lie Hetland.
It's less than a screenful of code; good idea to grab it anyway for
educational purposes -- a quite Pythonic implementation of the
traditional algorithm.

HTH,
John

Nov 17 '06 #7

P: n/a
On 2006-11-17, John Henry <jo**********@hotmail.comwrote:
I encountered a case where I am trying to match "HIDESST1" and
"HIDESCT1" against ["HIDEDST1", "HIDEDCT1", "HIDEDCT2", "HIDEDCT3"]

Well, they both hit "HIDEDST1" as the first match which is not exactly
the result I was looking for. I don't understand why "HIDESCT1" would
not hit "HIDEDCT1" as a first choice.
H I D E D S T 1 H I D E D C T 1

H . .
I . .
D . .
E . .
S .
C .
T . .
1 . .

As far as I can see the distance of HIDEDCT1 to HIDESCT1 is
the same as the distance of HIDEDCT1 to HIDEDST1. In both
cases you have to remove one character from the target as well
as one character from the candidate in order to get the
same substring.

--
Antoon Pardon
Nov 17 '06 #8

P: n/a
I suppose you are right. I guess I ended up with an odd case.

I was thinking that:

To change "HIDE*S*ST1" to "HIDE*D*ST1", all you do is remove the "*S*"
from the source and the "*D*" from the target.

In order to change "HIDE*SC*T1" to "HIDE*DS*T1", I thought you have to
remove 2 characters *SC* from the source. Then I realize that it's
not true. If you remove the "C" from the source, and the "D" from the
*DS* of the destination, it's a match (!)

So, yes, they have the same distance!
Antoon Pardon wrote:
On 2006-11-17, John Henry <jo**********@hotmail.comwrote:
I encountered a case where I am trying to match "HIDESST1" and
"HIDESCT1" against ["HIDEDST1", "HIDEDCT1", "HIDEDCT2", "HIDEDCT3"]

Well, they both hit "HIDEDST1" as the first match which is not exactly
the result I was looking for. I don't understand why "HIDESCT1" would
not hit "HIDEDCT1" as a first choice.

H I D E D S T 1 H I D E D C T 1

H . .
I . .
D . .
E . .
S .
C .
T . .
1 . .

As far as I can see the distance of HIDEDCT1 to HIDESCT1 is
the same as the distance of HIDEDCT1 to HIDEDST1. In both
cases you have to remove one character from the target as well
as one character from the candidate in order to get the
same substring.

--
Antoon Pardon
Nov 17 '06 #9

P: n/a
Learn something new everyday. I always wondered how spell checkers are
done. Thanks.

John Machin wrote:
John Henry wrote:
I am just wondering what's with get_close_matches() in difflib. What's
the magic? How fuzzy do I need to get in order to get a match?

Are you desperate to understand the inner workings of difflib, or do
you want merely to do some fuzzy matching of strings using a well-known
somewhat-more-understandable zillions-of-implementations metric?

If the latter, google "Levenshtein distance" for the metric, and
"Python Levenshtein" -- first hit for me is an implementation in a
Python C-extension. If you don't have the ability to compile a C
extension, or don't need the speed, there should be a few pure-Python
versions around; I'm specifically aware of one by Magnus Lie Hetland.
It's less than a screenful of code; good idea to grab it anyway for
educational purposes -- a quite Pythonic implementation of the
traditional algorithm.

HTH,
John
Nov 17 '06 #10

P: n/a
Steven D'Aprano wrote:

<snip>
>
>s = difflib.SequenceMatcher(None, "HIDEDCT1", "HIDESCT1")
t = difflib.SequenceMatcher(None, "HIDEDST1", "HIDESCT1")

for block in s.get_matching_blocks():
... print "a[%d] and b[%d] match for %d elements" % block
...
a[0] and b[0] match for 4 elements
a[5] and b[5] match for 3 elements
a[8] and b[8] match for 0 elements
>>
for block in t.get_matching_blocks():
... print "a[%d] and b[%d] match for %d elements" % block
...
a[0] and b[0] match for 4 elements
a[5] and b[4] match for 1 elements
a[6] and b[6] match for 2 elements
a[8] and b[8] match for 0 elements
>>

I think what you are seeing is an artifact of the precise details of the
pattern matcher. Feel free to subclass the SequenceMatcher or Differ
classes to get the results you expect :-)
Looks like for this particular case, looking at number of matched
groups worked better.

Nov 17 '06 #11

P: n/a


On Nov 17, 7:19 pm, Steven D'Aprano <s...@REMOVEME.cybersource.com.au>
wrote:
[snip]
You want to see "HIDEDCT1" match closer to "HIDESCT1" than "HIDEDST1":

HIDEDCT1 -- John's "best match" target string
HIDEDST1 -- difflib's "best match" target string
HIDESCT1 -- source string

John's best match matches in seven of eight positions, compared to six
of eight for the difflib best match. Disregarding order, both have seven
matching characters. That's a pretty slim difference between the two:
>"".join(difflib.Differ().compare("HIDEDCT1", "HIDESCT1"))
' H I D E- D+ S C T 1'
>>"".join(difflib.Differ().compare("HIDEDST1", "HIDESCT1"))
' H I D E- D S+ C T 1'
>
I honestly don't know how to interpret those results :(
Take 1, firing from the hip:

The formatting of that output is suboptimal. Better would be:
' H I D E -D +S C T 1'
interpreted as "delete D, insert S"
' H I D E -D S +C T 1'
interpreted as "delete D, insert C"

After reflection that the author of difflib is known not to be so
silly, take 2:

| >>list(difflib.Differ().compare("HIDEDCT1", "HIDESCT1"))
[' H', ' I', ' D', ' E', '- D', '+ S', ' C', ' T', ' 1']

And the docs shed some light on what is going on:

| >>help(difflib.Differ().compare)
Help on method compare in module difflib:

compare(self, a, b) method of difflib.Differ instance
Compare two sequences of lines; generate the resulting delta.

Each sequence must contain individual single-line strings ending
with
newlines. Such sequences can be obtained from the `readlines()`
method
of file-like objects. The delta generated also consists of
newline-
terminated strings, ready to be printed as-is via the writeline()
method of a file-like object.

Example:
>>print
''.join(Differ().compare('one\ntwo\nthree\n'.split lines(1),
...
'ore\ntree\nemu\n'.splitlines(1))),
- one
? ^
+ ore
? ^
- two
- three
? -
+ tree
+ emu

HTH,
John

Nov 17 '06 #12

This discussion thread is closed

Replies have been disabled for this discussion.