473,320 Members | 1,861 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,320 software developers and data experts.

How fuzzy is get_close_matches() in difflib?

I am just wondering what's with get_close_matches() in difflib. What's
the magic? How fuzzy do I need to get in order to get a match?

Nov 17 '06 #1
11 8656
On Thu, 16 Nov 2006 16:40:49 -0800, John Henry wrote:
I am just wondering what's with get_close_matches() in difflib. What's
the magic? How fuzzy do I need to get in order to get a match?

Why don't you try it and see?
>>from difflib import get_close_matches
get_close_matches("appel", ["ape", "apple", "peach", "puppy"])
['apple', 'ape']
>>import keyword as _keyword
get_close_matches("wheel", _keyword.kwlist)
['while']
>>get_close_matches("apple", _keyword.kwlist)
[]
>>get_close_matches("accept", _keyword.kwlist)
['except']
Those example, by the way, come from here:
>>help(get_close_matches)

--
Steven

Nov 17 '06 #2
I did try them and I am impressed. It helped me found a lot of useful
info. I just want to get a feel as to what constitutes a "match".
Steven D'Aprano wrote:
On Thu, 16 Nov 2006 16:40:49 -0800, John Henry wrote:
I am just wondering what's with get_close_matches() in difflib. What's
the magic? How fuzzy do I need to get in order to get a match?


Why don't you try it and see?
>from difflib import get_close_matches
get_close_matches("appel", ["ape", "apple", "peach", "puppy"])
['apple', 'ape']
>import keyword as _keyword
get_close_matches("wheel", _keyword.kwlist)
['while']
>get_close_matches("apple", _keyword.kwlist)
[]
>get_close_matches("accept", _keyword.kwlist)
['except']
Those example, by the way, come from here:
>help(get_close_matches)


--
Steven
Nov 17 '06 #3
On Thu, 16 Nov 2006 20:19:50 -0800, John Henry wrote:
I did try them and I am impressed. It helped me found a lot of useful
info. I just want to get a feel as to what constitutes a "match".
The source code has lots of comments, but they don't explain the basic
algorithm (at least not in the difflib.py supplied with Python 2.3).

There is no single diff algorithm, but I believe that the basic idea is to
look for insertions and/or deletions of strings. If you want more
detail, google "diff". Once you have a list of differences, the closest
match is the search string with the fewest differences.

As for getting a feel of what constitutes a match, I really can't make any
better suggestion than just try lots of examples with the interactive
Python shell.

--
Steven D'Aprano

Nov 17 '06 #4
I encountered a case where I am trying to match "HIDESST1" and
"HIDESCT1" against ["HIDEDST1", "HIDEDCT1", "HIDEDCT2", "HIDEDCT3"]

Well, they both hit "HIDEDST1" as the first match which is not exactly
the result I was looking for. I don't understand why "HIDESCT1" would
not hit "HIDEDCT1" as a first choice.

Steven D'Aprano wrote:
On Thu, 16 Nov 2006 20:19:50 -0800, John Henry wrote:
I did try them and I am impressed. It helped me found a lot of useful
info. I just want to get a feel as to what constitutes a "match".

The source code has lots of comments, but they don't explain the basic
algorithm (at least not in the difflib.py supplied with Python 2.3).

There is no single diff algorithm, but I believe that the basic idea is to
look for insertions and/or deletions of strings. If you want more
detail, google "diff". Once you have a list of differences, the closest
match is the search string with the fewest differences.

As for getting a feel of what constitutes a match, I really can't make any
better suggestion than just try lots of examples with the interactive
Python shell.

--
Steven D'Aprano
Nov 17 '06 #5
On Thu, 16 Nov 2006 22:59:41 -0800, John Henry wrote:
I encountered a case where I am trying to match "HIDESST1" and
"HIDESCT1" against ["HIDEDST1", "HIDEDCT1", "HIDEDCT2", "HIDEDCT3"]

Well, they both hit "HIDEDST1" as the first match which is not exactly
the result I was looking for. I don't understand why "HIDESCT1" would
not hit "HIDEDCT1" as a first choice.
All those strings are so similar that I think the exact order of "best
match" will depend on the precise details of the algorithm. I note that
the difflib specifically says:

[quote]
This does not yield minimal edit sequences, but does tend to yield matches
that "look right" to people.
[end quote]

There are many ways that one can calculate the difference between two
strings, e.g. is the string "acab" an "ab" with "ac" inserted at the
front, or "ab" with "ca" inserted in the middle? The Unix diff utility
may return minimal diffs, but difflib does not make that claim.

You want to see "HIDEDCT1" match closer to "HIDESCT1" than "HIDEDST1":

HIDEDCT1 -- John's "best match" target string
HIDEDST1 -- difflib's "best match" target string
HIDESCT1 -- source string

John's best match matches in seven of eight positions, compared to six
of eight for the difflib best match. Disregarding order, both have seven
matching characters. That's a pretty slim difference between the two:
>>"".join(difflib.Differ().compare("HIDEDCT1", "HIDESCT1"))
' H I D E- D+ S C T 1'
>>"".join(difflib.Differ().compare("HIDEDST1", "HIDESCT1"))
' H I D E- D S+ C T 1'

I honestly don't know how to interpret those results :(

>>s = difflib.SequenceMatcher(None, "HIDEDCT1", "HIDESCT1")
t = difflib.SequenceMatcher(None, "HIDEDST1", "HIDESCT1")

for block in s.get_matching_blocks():
.... print "a[%d] and b[%d] match for %d elements" % block
....
a[0] and b[0] match for 4 elements
a[5] and b[5] match for 3 elements
a[8] and b[8] match for 0 elements
>>>
for block in t.get_matching_blocks():
.... print "a[%d] and b[%d] match for %d elements" % block
....
a[0] and b[0] match for 4 elements
a[5] and b[4] match for 1 elements
a[6] and b[6] match for 2 elements
a[8] and b[8] match for 0 elements
>>>
I think what you are seeing is an artifact of the precise details of the
pattern matcher. Feel free to subclass the SequenceMatcher or Differ
classes to get the results you expect :-)
--
Steven D'Aprano

Nov 17 '06 #6

John Henry wrote:
I am just wondering what's with get_close_matches() in difflib. What's
the magic? How fuzzy do I need to get in order to get a match?
Are you desperate to understand the inner workings of difflib, or do
you want merely to do some fuzzy matching of strings using a well-known
somewhat-more-understandable zillions-of-implementations metric?

If the latter, google "Levenshtein distance" for the metric, and
"Python Levenshtein" -- first hit for me is an implementation in a
Python C-extension. If you don't have the ability to compile a C
extension, or don't need the speed, there should be a few pure-Python
versions around; I'm specifically aware of one by Magnus Lie Hetland.
It's less than a screenful of code; good idea to grab it anyway for
educational purposes -- a quite Pythonic implementation of the
traditional algorithm.

HTH,
John

Nov 17 '06 #7
On 2006-11-17, John Henry <jo**********@hotmail.comwrote:
I encountered a case where I am trying to match "HIDESST1" and
"HIDESCT1" against ["HIDEDST1", "HIDEDCT1", "HIDEDCT2", "HIDEDCT3"]

Well, they both hit "HIDEDST1" as the first match which is not exactly
the result I was looking for. I don't understand why "HIDESCT1" would
not hit "HIDEDCT1" as a first choice.
H I D E D S T 1 H I D E D C T 1

H . .
I . .
D . .
E . .
S .
C .
T . .
1 . .

As far as I can see the distance of HIDEDCT1 to HIDESCT1 is
the same as the distance of HIDEDCT1 to HIDEDST1. In both
cases you have to remove one character from the target as well
as one character from the candidate in order to get the
same substring.

--
Antoon Pardon
Nov 17 '06 #8
I suppose you are right. I guess I ended up with an odd case.

I was thinking that:

To change "HIDE*S*ST1" to "HIDE*D*ST1", all you do is remove the "*S*"
from the source and the "*D*" from the target.

In order to change "HIDE*SC*T1" to "HIDE*DS*T1", I thought you have to
remove 2 characters *SC* from the source. Then I realize that it's
not true. If you remove the "C" from the source, and the "D" from the
*DS* of the destination, it's a match (!)

So, yes, they have the same distance!
Antoon Pardon wrote:
On 2006-11-17, John Henry <jo**********@hotmail.comwrote:
I encountered a case where I am trying to match "HIDESST1" and
"HIDESCT1" against ["HIDEDST1", "HIDEDCT1", "HIDEDCT2", "HIDEDCT3"]

Well, they both hit "HIDEDST1" as the first match which is not exactly
the result I was looking for. I don't understand why "HIDESCT1" would
not hit "HIDEDCT1" as a first choice.

H I D E D S T 1 H I D E D C T 1

H . .
I . .
D . .
E . .
S .
C .
T . .
1 . .

As far as I can see the distance of HIDEDCT1 to HIDESCT1 is
the same as the distance of HIDEDCT1 to HIDEDST1. In both
cases you have to remove one character from the target as well
as one character from the candidate in order to get the
same substring.

--
Antoon Pardon
Nov 17 '06 #9
Learn something new everyday. I always wondered how spell checkers are
done. Thanks.

John Machin wrote:
John Henry wrote:
I am just wondering what's with get_close_matches() in difflib. What's
the magic? How fuzzy do I need to get in order to get a match?

Are you desperate to understand the inner workings of difflib, or do
you want merely to do some fuzzy matching of strings using a well-known
somewhat-more-understandable zillions-of-implementations metric?

If the latter, google "Levenshtein distance" for the metric, and
"Python Levenshtein" -- first hit for me is an implementation in a
Python C-extension. If you don't have the ability to compile a C
extension, or don't need the speed, there should be a few pure-Python
versions around; I'm specifically aware of one by Magnus Lie Hetland.
It's less than a screenful of code; good idea to grab it anyway for
educational purposes -- a quite Pythonic implementation of the
traditional algorithm.

HTH,
John
Nov 17 '06 #10
Steven D'Aprano wrote:

<snip>
>
>s = difflib.SequenceMatcher(None, "HIDEDCT1", "HIDESCT1")
t = difflib.SequenceMatcher(None, "HIDEDST1", "HIDESCT1")

for block in s.get_matching_blocks():
... print "a[%d] and b[%d] match for %d elements" % block
...
a[0] and b[0] match for 4 elements
a[5] and b[5] match for 3 elements
a[8] and b[8] match for 0 elements
>>
for block in t.get_matching_blocks():
... print "a[%d] and b[%d] match for %d elements" % block
...
a[0] and b[0] match for 4 elements
a[5] and b[4] match for 1 elements
a[6] and b[6] match for 2 elements
a[8] and b[8] match for 0 elements
>>

I think what you are seeing is an artifact of the precise details of the
pattern matcher. Feel free to subclass the SequenceMatcher or Differ
classes to get the results you expect :-)
Looks like for this particular case, looking at number of matched
groups worked better.

Nov 17 '06 #11


On Nov 17, 7:19 pm, Steven D'Aprano <s...@REMOVEME.cybersource.com.au>
wrote:
[snip]
You want to see "HIDEDCT1" match closer to "HIDESCT1" than "HIDEDST1":

HIDEDCT1 -- John's "best match" target string
HIDEDST1 -- difflib's "best match" target string
HIDESCT1 -- source string

John's best match matches in seven of eight positions, compared to six
of eight for the difflib best match. Disregarding order, both have seven
matching characters. That's a pretty slim difference between the two:
>"".join(difflib.Differ().compare("HIDEDCT1", "HIDESCT1"))
' H I D E- D+ S C T 1'
>>"".join(difflib.Differ().compare("HIDEDST1", "HIDESCT1"))
' H I D E- D S+ C T 1'
>
I honestly don't know how to interpret those results :(
Take 1, firing from the hip:

The formatting of that output is suboptimal. Better would be:
' H I D E -D +S C T 1'
interpreted as "delete D, insert S"
' H I D E -D S +C T 1'
interpreted as "delete D, insert C"

After reflection that the author of difflib is known not to be so
silly, take 2:

| >>list(difflib.Differ().compare("HIDEDCT1", "HIDESCT1"))
[' H', ' I', ' D', ' E', '- D', '+ S', ' C', ' T', ' 1']

And the docs shed some light on what is going on:

| >>help(difflib.Differ().compare)
Help on method compare in module difflib:

compare(self, a, b) method of difflib.Differ instance
Compare two sequences of lines; generate the resulting delta.

Each sequence must contain individual single-line strings ending
with
newlines. Such sequences can be obtained from the `readlines()`
method
of file-like objects. The delta generated also consists of
newline-
terminated strings, ready to be printed as-is via the writeline()
method of a file-like object.

Example:
>>print
''.join(Differ().compare('one\ntwo\nthree\n'.split lines(1),
...
'ore\ntree\nemu\n'.splitlines(1))),
- one
? ^
+ ore
? ^
- two
- three
? -
+ tree
+ emu

HTH,
John

Nov 17 '06 #12

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

3
by: Humpdydum | last post by:
Can anyone try the following in their python interpreter? These give correct output: >>> print list(ndiff(,)) >>> print list(ndiff(,)) >>> print list(ndiff(,))
24
by: BBands | last post by:
I have some CDs and have been archiving them on a PC. I wrote a Python script that spans the archive and returns a list of its contents: ...]. I wanted to add a search function to locate all the...
24
by: cassetti | last post by:
Here's the issue: I have roughly 20 MS excel spreadsheets, each row contains a record. These records were hand entered by people in call centers. The problem is, there can and are duplicate...
1
by: Neilen Marais | last post by:
Hi I'm trying to compare some text to find differences other than whitespace. I seem to be misunderstanding something, since I can't even get a basic example to work: In : d =...
14
by: Steve Bergman | last post by:
I'm looking for a module to do fuzzy comparison of strings. I have 2 item master files which are supposed to be identical, but they have thousands of records where the item numbers don't match in...
7
by: whitewave | last post by:
Hi Guys, I'm a bit confused in difflib. In most cases, the differences found using difflib works well but when I have come across the following set of text: .... problem even for the simple...
2
by: krishnakant Mane | last post by:
hello all, I have a bit of a confusing question. firstly I wanted a library which can do an svn like diff with two files. let's say I have file1 and file2 where file2 contains some thing which...
1
by: erikcw | last post by:
Hi, I'm trying to create an undo/redo feature for a webapp I'm working on (django based). I'd like to have an undo/redo function. My first thought was to use the difflib to generate a diff to...
3
by: n00m | last post by:
from random import randint s1 = '' s2 = '' for i in xrange(1000): s1 += chr(randint(97,122)) s2 += chr(randint(97,122)) print s1
0
by: DolphinDB | last post by:
The formulas of 101 quantitative trading alphas used by WorldQuant were presented in the paper 101 Formulaic Alphas. However, some formulas are complex, leading to challenges in calculation. Take...
0
isladogs
by: isladogs | last post by:
The next Access Europe meeting will be on Wednesday 6 Mar 2024 starting at 18:00 UK time (6PM UTC) and finishing at about 19:15 (7.15PM). In this month's session, we are pleased to welcome back...
0
by: jfyes | last post by:
As a hardware engineer, after seeing that CEIWEI recently released a new tool for Modbus RTU Over TCP/UDP filtering and monitoring, I actively went to its official website to take a look. It turned...
0
by: ArrayDB | last post by:
The error message I've encountered is; ERROR:root:Error generating model response: exception: access violation writing 0x0000000000005140, which seems to be indicative of an access violation...
0
by: CloudSolutions | last post by:
Introduction: For many beginners and individual users, requiring a credit card and email registration may pose a barrier when starting to use cloud servers. However, some cloud server providers now...
0
by: Defcon1945 | last post by:
I'm trying to learn Python using Pycharm but import shutil doesn't work
0
by: Shællîpôpï 09 | last post by:
If u are using a keypad phone, how do u turn on JavaScript, to access features like WhatsApp, Facebook, Instagram....
0
by: af34tf | last post by:
Hi Guys, I have a domain whose name is BytesLimited.com, and I want to sell it. Does anyone know about platforms that allow me to list my domain in auction for free. Thank you
0
by: Faith0G | last post by:
I am starting a new it consulting business and it's been a while since I setup a new website. Is wordpress still the best web based software for hosting a 5 page website? The webpages will be...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.