I am just wondering what's with get_close_matches() in difflib. What's
the magic? How fuzzy do I need to get in order to get a match? 11 8536
On Thu, 16 Nov 2006 16:40:49 -0800, John Henry wrote:
I am just wondering what's with get_close_matches() in difflib. What's
the magic? How fuzzy do I need to get in order to get a match?
Why don't you try it and see?
>>from difflib import get_close_matches get_close_matches("appel", ["ape", "apple", "peach", "puppy"])
['apple', 'ape']
>>import keyword as _keyword get_close_matches("wheel", _keyword.kwlist)
['while']
>>get_close_matches("apple", _keyword.kwlist)
[]
>>get_close_matches("accept", _keyword.kwlist)
['except']
Those example, by the way, come from here:
>>help(get_close_matches)
--
Steven
I did try them and I am impressed. It helped me found a lot of useful
info. I just want to get a feel as to what constitutes a "match".
Steven D'Aprano wrote:
On Thu, 16 Nov 2006 16:40:49 -0800, John Henry wrote:
I am just wondering what's with get_close_matches() in difflib. What's
the magic? How fuzzy do I need to get in order to get a match?
Why don't you try it and see?
>from difflib import get_close_matches get_close_matches("appel", ["ape", "apple", "peach", "puppy"])
['apple', 'ape']
>import keyword as _keyword get_close_matches("wheel", _keyword.kwlist)
['while']
>get_close_matches("apple", _keyword.kwlist)
[]
>get_close_matches("accept", _keyword.kwlist)
['except']
Those example, by the way, come from here:
>help(get_close_matches)
--
Steven
On Thu, 16 Nov 2006 20:19:50 -0800, John Henry wrote:
I did try them and I am impressed. It helped me found a lot of useful
info. I just want to get a feel as to what constitutes a "match".
The source code has lots of comments, but they don't explain the basic
algorithm (at least not in the difflib.py supplied with Python 2.3).
There is no single diff algorithm, but I believe that the basic idea is to
look for insertions and/or deletions of strings. If you want more
detail, google "diff". Once you have a list of differences, the closest
match is the search string with the fewest differences.
As for getting a feel of what constitutes a match, I really can't make any
better suggestion than just try lots of examples with the interactive
Python shell.
--
Steven D'Aprano
I encountered a case where I am trying to match "HIDESST1" and
"HIDESCT1" against ["HIDEDST1", "HIDEDCT1", "HIDEDCT2", "HIDEDCT3"]
Well, they both hit "HIDEDST1" as the first match which is not exactly
the result I was looking for. I don't understand why "HIDESCT1" would
not hit "HIDEDCT1" as a first choice.
Steven D'Aprano wrote:
On Thu, 16 Nov 2006 20:19:50 -0800, John Henry wrote:
I did try them and I am impressed. It helped me found a lot of useful
info. I just want to get a feel as to what constitutes a "match".
The source code has lots of comments, but they don't explain the basic
algorithm (at least not in the difflib.py supplied with Python 2.3).
There is no single diff algorithm, but I believe that the basic idea is to
look for insertions and/or deletions of strings. If you want more
detail, google "diff". Once you have a list of differences, the closest
match is the search string with the fewest differences.
As for getting a feel of what constitutes a match, I really can't make any
better suggestion than just try lots of examples with the interactive
Python shell.
--
Steven D'Aprano
On Thu, 16 Nov 2006 22:59:41 -0800, John Henry wrote:
I encountered a case where I am trying to match "HIDESST1" and
"HIDESCT1" against ["HIDEDST1", "HIDEDCT1", "HIDEDCT2", "HIDEDCT3"]
Well, they both hit "HIDEDST1" as the first match which is not exactly
the result I was looking for. I don't understand why "HIDESCT1" would
not hit "HIDEDCT1" as a first choice.
All those strings are so similar that I think the exact order of "best
match" will depend on the precise details of the algorithm. I note that
the difflib specifically says:
[quote]
This does not yield minimal edit sequences, but does tend to yield matches
that "look right" to people.
[end quote]
There are many ways that one can calculate the difference between two
strings, e.g. is the string "acab" an "ab" with "ac" inserted at the
front, or "ab" with "ca" inserted in the middle? The Unix diff utility
may return minimal diffs, but difflib does not make that claim.
You want to see "HIDEDCT1" match closer to "HIDESCT1" than "HIDEDST1":
HIDEDCT1 -- John's "best match" target string
HIDEDST1 -- difflib's "best match" target string
HIDESCT1 -- source string
John's best match matches in seven of eight positions, compared to six
of eight for the difflib best match. Disregarding order, both have seven
matching characters. That's a pretty slim difference between the two:
>>"".join(difflib.Differ().compare("HIDEDCT1", "HIDESCT1"))
' H I D E- D+ S C T 1'
>>"".join(difflib.Differ().compare("HIDEDST1", "HIDESCT1"))
' H I D E- D S+ C T 1'
I honestly don't know how to interpret those results :(
>>s = difflib.SequenceMatcher(None, "HIDEDCT1", "HIDESCT1") t = difflib.SequenceMatcher(None, "HIDEDST1", "HIDESCT1")
for block in s.get_matching_blocks():
.... print "a[%d] and b[%d] match for %d elements" % block
....
a[0] and b[0] match for 4 elements
a[5] and b[5] match for 3 elements
a[8] and b[8] match for 0 elements
>>> for block in t.get_matching_blocks():
.... print "a[%d] and b[%d] match for %d elements" % block
....
a[0] and b[0] match for 4 elements
a[5] and b[4] match for 1 elements
a[6] and b[6] match for 2 elements
a[8] and b[8] match for 0 elements
>>>
I think what you are seeing is an artifact of the precise details of the
pattern matcher. Feel free to subclass the SequenceMatcher or Differ
classes to get the results you expect :-)
--
Steven D'Aprano
John Henry wrote:
I am just wondering what's with get_close_matches() in difflib. What's
the magic? How fuzzy do I need to get in order to get a match?
Are you desperate to understand the inner workings of difflib, or do
you want merely to do some fuzzy matching of strings using a well-known
somewhat-more-understandable zillions-of-implementations metric?
If the latter, google "Levenshtein distance" for the metric, and
"Python Levenshtein" -- first hit for me is an implementation in a
Python C-extension. If you don't have the ability to compile a C
extension, or don't need the speed, there should be a few pure-Python
versions around; I'm specifically aware of one by Magnus Lie Hetland.
It's less than a screenful of code; good idea to grab it anyway for
educational purposes -- a quite Pythonic implementation of the
traditional algorithm.
HTH,
John
On 2006-11-17, John Henry <jo**********@hotmail.comwrote:
I encountered a case where I am trying to match "HIDESST1" and
"HIDESCT1" against ["HIDEDST1", "HIDEDCT1", "HIDEDCT2", "HIDEDCT3"]
Well, they both hit "HIDEDST1" as the first match which is not exactly
the result I was looking for. I don't understand why "HIDESCT1" would
not hit "HIDEDCT1" as a first choice.
H I D E D S T 1 H I D E D C T 1
H . .
I . .
D . .
E . .
S .
C .
T . .
1 . .
As far as I can see the distance of HIDEDCT1 to HIDESCT1 is
the same as the distance of HIDEDCT1 to HIDEDST1. In both
cases you have to remove one character from the target as well
as one character from the candidate in order to get the
same substring.
--
Antoon Pardon
I suppose you are right. I guess I ended up with an odd case.
I was thinking that:
To change "HIDE*S*ST1" to "HIDE*D*ST1", all you do is remove the "*S*"
from the source and the "*D*" from the target.
In order to change "HIDE*SC*T1" to "HIDE*DS*T1", I thought you have to
remove 2 characters *SC* from the source. Then I realize that it's
not true. If you remove the "C" from the source, and the "D" from the
*DS* of the destination, it's a match (!)
So, yes, they have the same distance!
Antoon Pardon wrote:
On 2006-11-17, John Henry <jo**********@hotmail.comwrote:
I encountered a case where I am trying to match "HIDESST1" and
"HIDESCT1" against ["HIDEDST1", "HIDEDCT1", "HIDEDCT2", "HIDEDCT3"]
Well, they both hit "HIDEDST1" as the first match which is not exactly
the result I was looking for. I don't understand why "HIDESCT1" would
not hit "HIDEDCT1" as a first choice.
H I D E D S T 1 H I D E D C T 1
H . .
I . .
D . .
E . .
S .
C .
T . .
1 . .
As far as I can see the distance of HIDEDCT1 to HIDESCT1 is
the same as the distance of HIDEDCT1 to HIDEDST1. In both
cases you have to remove one character from the target as well
as one character from the candidate in order to get the
same substring.
--
Antoon Pardon
Learn something new everyday. I always wondered how spell checkers are
done. Thanks.
John Machin wrote:
John Henry wrote:
I am just wondering what's with get_close_matches() in difflib. What's
the magic? How fuzzy do I need to get in order to get a match?
Are you desperate to understand the inner workings of difflib, or do
you want merely to do some fuzzy matching of strings using a well-known
somewhat-more-understandable zillions-of-implementations metric?
If the latter, google "Levenshtein distance" for the metric, and
"Python Levenshtein" -- first hit for me is an implementation in a
Python C-extension. If you don't have the ability to compile a C
extension, or don't need the speed, there should be a few pure-Python
versions around; I'm specifically aware of one by Magnus Lie Hetland.
It's less than a screenful of code; good idea to grab it anyway for
educational purposes -- a quite Pythonic implementation of the
traditional algorithm.
HTH,
John
Steven D'Aprano wrote:
<snip>
>
>s = difflib.SequenceMatcher(None, "HIDEDCT1", "HIDESCT1") t = difflib.SequenceMatcher(None, "HIDEDST1", "HIDESCT1")
for block in s.get_matching_blocks():
... print "a[%d] and b[%d] match for %d elements" % block
...
a[0] and b[0] match for 4 elements
a[5] and b[5] match for 3 elements
a[8] and b[8] match for 0 elements
>> for block in t.get_matching_blocks():
... print "a[%d] and b[%d] match for %d elements" % block
...
a[0] and b[0] match for 4 elements
a[5] and b[4] match for 1 elements
a[6] and b[6] match for 2 elements
a[8] and b[8] match for 0 elements
>>
I think what you are seeing is an artifact of the precise details of the
pattern matcher. Feel free to subclass the SequenceMatcher or Differ
classes to get the results you expect :-)
Looks like for this particular case, looking at number of matched
groups worked better.
On Nov 17, 7:19 pm, Steven D'Aprano <s...@REMOVEME.cybersource.com.au>
wrote:
[snip]
You want to see "HIDEDCT1" match closer to "HIDESCT1" than "HIDEDST1":
HIDEDCT1 -- John's "best match" target string
HIDEDST1 -- difflib's "best match" target string
HIDESCT1 -- source string
John's best match matches in seven of eight positions, compared to six
of eight for the difflib best match. Disregarding order, both have seven
matching characters. That's a pretty slim difference between the two:
>"".join(difflib.Differ().compare("HIDEDCT1", "HIDESCT1"))
' H I D E- D+ S C T 1'
>>"".join(difflib.Differ().compare("HIDEDST1", "HIDESCT1"))
' H I D E- D S+ C T 1'
>
I honestly don't know how to interpret those results :(
Take 1, firing from the hip:
The formatting of that output is suboptimal. Better would be:
' H I D E -D +S C T 1'
interpreted as "delete D, insert S"
' H I D E -D S +C T 1'
interpreted as "delete D, insert C"
After reflection that the author of difflib is known not to be so
silly, take 2:
| >>list(difflib.Differ().compare("HIDEDCT1", "HIDESCT1"))
[' H', ' I', ' D', ' E', '- D', '+ S', ' C', ' T', ' 1']
And the docs shed some light on what is going on:
| >>help(difflib.Differ().compare)
Help on method compare in module difflib:
compare(self, a, b) method of difflib.Differ instance
Compare two sequences of lines; generate the resulting delta.
Each sequence must contain individual single-line strings ending
with
newlines. Such sequences can be obtained from the `readlines()`
method
of file-like objects. The delta generated also consists of
newline-
terminated strings, ready to be printed as-is via the writeline()
method of a file-like object.
Example:
>>print
''.join(Differ().compare('one\ntwo\nthree\n'.split lines(1),
...
'ore\ntree\nemu\n'.splitlines(1))),
- one
? ^
+ ore
? ^
- two
- three
? -
+ tree
+ emu
HTH,
John This thread has been closed and replies have been disabled. Please start a new discussion. Similar topics
by: Humpdydum |
last post by:
Can anyone try the following in their python interpreter?
These give correct output:
>>> print list(ndiff(,))
>>> print list(ndiff(,))
>>> print list(ndiff(,))
|
by: BBands |
last post by:
I have some CDs and have been archiving them on a PC. I wrote a Python
script that spans the archive and returns a list of its contents:
...]. I wanted to add a search function to
locate all the...
|
by: cassetti |
last post by:
Here's the issue:
I have roughly 20 MS excel spreadsheets, each row contains a record.
These records were hand entered by people in call centers.
The problem is, there can and are duplicate...
|
by: Neilen Marais |
last post by:
Hi
I'm trying to compare some text to find differences other than whitespace.
I seem to be misunderstanding something, since I can't even get a basic
example to work:
In : d =...
|
by: Steve Bergman |
last post by:
I'm looking for a module to do fuzzy comparison of strings. I have 2
item master files which are supposed to be identical, but they have
thousands of records where the item numbers don't match in...
|
by: whitewave |
last post by:
Hi Guys,
I'm a bit confused in difflib. In most cases, the differences
found using difflib works well but when I have come across the
following set of text:
.... problem even for the simple...
|
by: krishnakant Mane |
last post by:
hello all,
I have a bit of a confusing question.
firstly I wanted a library which can do an svn like diff with two files.
let's say I have file1 and file2 where file2 contains some thing which...
|
by: erikcw |
last post by:
Hi,
I'm trying to create an undo/redo feature for a webapp I'm working on
(django based). I'd like to have an undo/redo function.
My first thought was to use the difflib to generate a diff to...
|
by: n00m |
last post by:
from random import randint
s1 = ''
s2 = ''
for i in xrange(1000):
s1 += chr(randint(97,122))
s2 += chr(randint(97,122))
print s1
|
by: Kemmylinns12 |
last post by:
Blockchain technology has emerged as a transformative force in the business world, offering unprecedented opportunities for innovation and efficiency. While initially associated with cryptocurrencies...
|
by: Matthew3360 |
last post by:
Hi, I have a python app that i want to be able to get variables from a php page on my webserver. My python app is on my computer. How would I make it so the python app could use a http request to get...
|
by: AndyPSV |
last post by:
HOW CAN I CREATE AN AI with an .executable file that would suck all files in the folder and on my computerHOW CAN I CREATE AN AI with an .executable file that would suck all files in the folder and...
|
by: Arjunsri |
last post by:
I have a Redshift database that I need to use as an import data source. I have configured the DSN connection using the server, port, database, and credentials and received a successful connection...
|
by: WisdomUfot |
last post by:
It's an interesting question you've got about how Gmail hides the HTTP referrer when a link in an email is clicked. While I don't have the specific technical details, Gmail likely implements measures...
|
by: Matthew3360 |
last post by:
Hi,
I have been trying to connect to a local host using php curl. But I am finding it hard to do this. I am doing the curl get request from my web server and have made sure to enable curl. I get a...
|
by: Oralloy |
last post by:
Hello Folks,
I am trying to hook up a CPU which I designed using SystemC to I/O pins on an FPGA.
My problem (spelled failure) is with the synthesis of my design into a bitstream, not the C++...
|
by: Carina712 |
last post by:
Setting background colors for Excel documents can help to improve the visual appeal of the document and make it easier to read and understand. Background colors can be used to highlight important...
|
by: Rahul1995seven |
last post by:
Introduction:
In the realm of programming languages, Python has emerged as a powerhouse. With its simplicity, versatility, and robustness, Python has gained popularity among beginners and experts...
| |