473,887 Members | 2,388 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

Fuzzy string comparison

I'm looking for a module to do fuzzy comparison of strings. I have 2
item master files which are supposed to be identical, but they have
thousands of records where the item numbers don't match in various
ways. One might include a '-' or have leading zeros, or have a single
character missing, or a zero that is typed as a letter 'O'. That kind
of thing. These tables currently reside in a mysql database. I was
wondering if there is a good package to let me compare strings and
return a value that is a measure of their similarity. Kind of like
soundex but for strings that aren't words.

Thanks,
Steve Bergman

Dec 26 '06 #1
14 13510
Steve Bergman wrote:
I'm looking for a module to do fuzzy comparison of strings. [...]
Check module difflib, it returns difference between two sequences.
Dec 26 '06 #2
Wojciech Mula wrote:
Steve Bergman wrote:
I'm looking for a module to do fuzzy comparison of strings. [...]

Check module difflib, it returns difference between two sequences.
and it's intended for comparing text files, and is relatively slow.

Google "python levenshtein". You'll probably find this a better fit for
typoed keys in a database.

What I would do for a quick start on this exercise (as described) is:

To compare two strings, take copies, and:
1. strip out all spaces (including \xA0 i.e.   if the data has
been anywhere near the web) from each string; also all "-" (in general
strip frequently occurring meaningless punctuation)
2. remove leading zeroes from each string
3. d = levenshtein_dis tance(string_a, string_b) # string_a etc is the
reduced string, not the original
4. error_metric = float(d) / max(len(string_ a), len(string_b))

The error_metric will be 0.0 if the strings are the same (after
removing spaces, leading zeroes, etc) and 1.0 if they are completely
different (no characters in common).

.... and you don't want anything "kind of like soundex". That's a bit
like saying you'd like to travel in an aeroplane "kind of like the
Wright brothers' " ;-)

Cheers,
John

Dec 26 '06 #3
On Tue, 2006-12-26 at 13:08 -0800, John Machin wrote:
Wojciech Mula wrote:
Steve Bergman wrote:
I'm looking for a module to do fuzzy comparison of strings. [...]
Check module difflib, it returns difference between two sequences.

and it's intended for comparing text files, and is relatively slow.

Google "python levenshtein". You'll probably find this a better fit for
typoed keys in a database.
[...]
Using the Levenshtein distance in combination with stripping "noise"
characters is a good start, but the OP might want to take it a step
further. One of the OP's requirements is to recognize visually similar
strings, but 241O (Letter O at the end) and 241X have the same
Levenshtein distance from 2410 (digit zero at the end) while the former
is visually much closer to 2410 than the latter.

It seems to me that this could be achieved by starting with a standard
Levenshtein implementation such as http://hetland.org/python/distance.py
and altering the line "change = change + 1" to something like "change =
change + visual_distance (a[j-1], b[i-1])". visual_distance () would be a
function that embodies the OP's idea of which character replacements are
tolerable by returning a number between 0 (the two characters are
visually identical) and 1 (the two characters are completely different).

Hope this helps,

-Carsten
Dec 26 '06 #4
At Tuesday 26/12/2006 18:08, John Machin wrote:
>Wojciech Mula wrote:
Steve Bergman wrote:
I'm looking for a module to do fuzzy comparison of strings. [...]
Check module difflib, it returns difference between two sequences.

and it's intended for comparing text files, and is relatively slow.

Google "python levenshtein". You'll probably find this a better fit for
typoed keys in a database.
Other alternatives: trigram, n-gram, Jaro's distance. There are some
Python implem. available.
--
Gabriel Genellina
Softlab SRL


_______________ _______________ _______________ _____
Preguntá. Respondé. Descubrí.
Todo lo que querías saber, y lo que ni imaginabas,
está en Yahoo! Respuestas (Beta).
¡Probalo ya!
http://www.yahoo.com.ar/respuestas

Dec 26 '06 #5

Carsten Haese wrote:
On Tue, 2006-12-26 at 13:08 -0800, John Machin wrote:
Wojciech Mula wrote:
Steve Bergman wrote:
I'm looking for a module to do fuzzy comparison of strings. [...]
>
Check module difflib, it returns difference between two sequences.
and it's intended for comparing text files, and is relatively slow.

Google "python levenshtein". You'll probably find this a better fit for
typoed keys in a database.
[...]

Using the Levenshtein distance in combination with stripping "noise"
characters is a good start, but the OP might want to take it a step
further. One of the OP's requirements is to recognize visually similar
strings, but 241O (Letter O at the end) and 241X have the same
Levenshtein distance from 2410 (digit zero at the end) while the former
is visually much closer to 2410 than the latter.

It seems to me that this could be achieved by starting with a standard
Levenshtein implementation such as http://hetland.org/python/distance.py
and altering the line "change = change + 1" to something like "change =
change + visual_distance (a[j-1], b[i-1])". visual_distance () would be a
function that embodies the OP's idea of which character replacements are
tolerable by returning a number between 0 (the two characters are
visually identical) and 1 (the two characters are completely different).
Ya ya ya, I could have told him a whole lot more -- please consider
that what I did tell him was IMHO over-generous in response to an OT
question asking for assistance with performing a Google search.

.... and given his keys are described as "numbers", a better example
might be 241O or 241o false-matching with 2416.

.... and it might be a good idea if he ran the simplistic approach first
and saw what near-misses he actually came up with before complicating
it and slowing down what is already an O(N**2*L**2) exercise in the
traditional/novice implementation where N is the number of keys and L
is their average length.

The OP needs to think about 123456789 compared with 123426789; are they
the same account or not? What other information does he have?

HTH,
John

Dec 26 '06 #6
"John Machin" <sj******@lexic on.netwrote:
To compare two strings, take copies, and:
Taking a copy of a string seems kind of superfluous in Python.
Dec 27 '06 #7
"Steve Bergman" <st***@rueb.com writes:
I'm looking for a module to do fuzzy comparison of strings. I have 2
item master files which are supposed to be identical, but they have
thousands of records where the item numbers don't match in various
ways. One might include a '-' or have leading zeros, or have a single
character missing, or a zero that is typed as a letter 'O'. That kind
of thing. These tables currently reside in a mysql database. I was
wondering if there is a good package to let me compare strings and
return a value that is a measure of their similarity. Kind of like
soundex but for strings that aren't words.
If you were using PostgreSQL there's a contrib package (pg_trgm) that could
help a lot with that. It can show you the distance between two strings based
on a trigram comparison.

You can see how it works on the README
(http://www.sai.msu.su/~megera/postgr...README.pg_trgm) and maybe
port it for your needs.

But it probably won't be a one operation only search, you'll have to
post process results to decide what to do on multiple matches.

--
Jorge Godoy <jg****@gmail.c om>
Dec 27 '06 #8

Duncan Booth wrote:
"John Machin" <sj******@lexic on.netwrote:
To compare two strings, take copies, and:

Taking a copy of a string seems kind of superfluous in Python.
You are right, I really meant don't do:
original = original.strip( ).replace(....) .replace(....)
(a strange way of doing things which I see occasionally in other folks'
code)

Cheers,
John

Dec 27 '06 #9
On Wed, 27 Dec 2006 02:52:42 -0800, John Machin wrote:
>
Duncan Booth wrote:
>"John Machin" <sj******@lexic on.netwrote:
To compare two strings, take copies, and:

Taking a copy of a string seems kind of superfluous in Python.

You are right, I really meant don't do:
original = original.strip( ).replace(....) .replace(....)
(a strange way of doing things which I see occasionally in other folks'
code)
Well, sure it is strange if you call the variable "original". But if you
don't care about the original, what's so strange about writing this?

astring = astring.strip() .replace(....). replace(....)

Why keep the original value of astring around if you no longer need it?
--
Steven.

Dec 27 '06 #10

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

0
1496
by: ramin | last post by:
Hi, I would be greately thankful if somebody can give me some information about how string comparison is implemented in mysql. (Both in Physical and Application layer) How queries on string (for example '>' or like or strcmp) are implemented. Any idea where can I find the open source for string comparison/ ordering/ implementation? Does it use the sme structre as strcmp in c? Thanks a lot, Ramin
46
5202
by: yadurajj | last post by:
Hello i am newbie trying to learn C..I need to know about string comparisons in C, without using a library function,...recently I was asked this in an interview..I can write a small program but I was told that wouldn't it be wise to first get the length of the strings..if it doesn't match then they are not the same..I agreed...then he said..but again that would be an overhead first measuring the length...and then doing a character by...
4
7195
by: Peter Kirk | last post by:
Hi I am looking at some code which in many places performs string comparison using == instead of Equals. Am I right in assuming that this will in fact work "as expected" when it is strings involved? By "as expected" I mean that as long as the strings are instantiated using string literals, then == and Equals are the same. string s1 = "xxx";
1
1577
by: bjjnova | last post by:
I have the following string comparison that is throwing an error I cannot find ( I will include my attempts to trace the error) In the line following the asterisks, written as I have below, a true is always returned and the printed msg appears that the 2 variables are not equal. This happens even when the print statement writes out, "Change Doc Types TASK and TASK are not equal". (So, even when I test each variable separately and it...
9
3772
by: Usman Jamil | last post by:
Hi I'm having a strange error while comparing two strings. Please check the code below. This is a simple string comparison code and works just fine on all of my machines. While debugging an issue on a client's machine, who had turkish windows installed on his system, I found out that this simple piece of code does'nt work. The messages boxes that are displayed are in this sequence. 1. to upper works with szWINDOWS
3
4107
by: itmfl | last post by:
We are writing a program that multiplies two matrices of size n x m and m x n together. The matrices are stored in a file. The user provides the filename in the command line prompt. The file is formatted like so: /beginning/ matrix1 3 2 1 6 2 4 3 5 matrix2
10
8580
by: lilly07 | last post by:
Hi, I have one column of strings in 1st file file and another file which consists of 5 clumns in each line and my basic objective is to find each item/line of 1st file is available in 3rd column of 2 nd file. And I tried the following logic. It might be bit round about way but as a beginner am trying as follows. The column in the 1st file is having data as example NS008_456_R0030_3008 The 2nd file data is as follows: + test ...
0
9957
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However, people are often confused as to whether an ONU can Work As a Router. In this blog post, we’ll explore What is ONU, What Is Router, ONU & Router’s main usage, and What is the difference between ONU and Router. Let’s take a closer look ! Part I. Meaning of...
0
9799
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can effortlessly switch the default language on Windows 10 without reinstalling. I'll walk you through it. First, let's disable language synchronization. With a Microsoft account, language settings sync across devices. To prevent any complications,...
0
11173
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers, it seems that the internal comparison operator "<=>" tries to promote arguments from unsigned to signed. This is as boiled down as I can make it. Here is my compilation command: g++-12 -std=c++20 -Wnarrowing bit_field.cpp Here is the code in...
1
10875
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For most users, this new feature is actually very convenient. If you want to control the update process,...
0
10432
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each protocol has its own unique characteristics and advantages, but as a user who is planning to build a smart home system, I am a bit confused by the choice of these technologies. I'm particularly interested in Zigbee because I've heard it does some...
0
9593
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then launch it, all on its own.... Now, this would greatly impact the work of software developers. The idea...
1
7988
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new presenter, Adolph Dupré who will be discussing some powerful techniques for using class modules. He will explain when you may want to use classes instead of User Defined Types (UDT). For example, to manage the data in unbound forms. Adolph will...
0
6011
by: adsilva | last post by:
A Windows Forms form does not have the event Unload, like VB6. What one acts like?
3
3245
bsmnconsultancy
by: bsmnconsultancy | last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating effective websites that not only look great but also perform exceptionally well. In this comprehensive...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.