473,378 Members | 1,555 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,378 software developers and data experts.

Fuzzy string comparison

I'm looking for a module to do fuzzy comparison of strings. I have 2
item master files which are supposed to be identical, but they have
thousands of records where the item numbers don't match in various
ways. One might include a '-' or have leading zeros, or have a single
character missing, or a zero that is typed as a letter 'O'. That kind
of thing. These tables currently reside in a mysql database. I was
wondering if there is a good package to let me compare strings and
return a value that is a measure of their similarity. Kind of like
soundex but for strings that aren't words.

Thanks,
Steve Bergman

Dec 26 '06 #1
14 13439
Steve Bergman wrote:
I'm looking for a module to do fuzzy comparison of strings. [...]
Check module difflib, it returns difference between two sequences.
Dec 26 '06 #2
Wojciech Mula wrote:
Steve Bergman wrote:
I'm looking for a module to do fuzzy comparison of strings. [...]

Check module difflib, it returns difference between two sequences.
and it's intended for comparing text files, and is relatively slow.

Google "python levenshtein". You'll probably find this a better fit for
typoed keys in a database.

What I would do for a quick start on this exercise (as described) is:

To compare two strings, take copies, and:
1. strip out all spaces (including \xA0 i.e.   if the data has
been anywhere near the web) from each string; also all "-" (in general
strip frequently occurring meaningless punctuation)
2. remove leading zeroes from each string
3. d = levenshtein_distance(string_a, string_b) # string_a etc is the
reduced string, not the original
4. error_metric = float(d) / max(len(string_a), len(string_b))

The error_metric will be 0.0 if the strings are the same (after
removing spaces, leading zeroes, etc) and 1.0 if they are completely
different (no characters in common).

.... and you don't want anything "kind of like soundex". That's a bit
like saying you'd like to travel in an aeroplane "kind of like the
Wright brothers' " ;-)

Cheers,
John

Dec 26 '06 #3
On Tue, 2006-12-26 at 13:08 -0800, John Machin wrote:
Wojciech Mula wrote:
Steve Bergman wrote:
I'm looking for a module to do fuzzy comparison of strings. [...]
Check module difflib, it returns difference between two sequences.

and it's intended for comparing text files, and is relatively slow.

Google "python levenshtein". You'll probably find this a better fit for
typoed keys in a database.
[...]
Using the Levenshtein distance in combination with stripping "noise"
characters is a good start, but the OP might want to take it a step
further. One of the OP's requirements is to recognize visually similar
strings, but 241O (Letter O at the end) and 241X have the same
Levenshtein distance from 2410 (digit zero at the end) while the former
is visually much closer to 2410 than the latter.

It seems to me that this could be achieved by starting with a standard
Levenshtein implementation such as http://hetland.org/python/distance.py
and altering the line "change = change + 1" to something like "change =
change + visual_distance(a[j-1], b[i-1])". visual_distance() would be a
function that embodies the OP's idea of which character replacements are
tolerable by returning a number between 0 (the two characters are
visually identical) and 1 (the two characters are completely different).

Hope this helps,

-Carsten
Dec 26 '06 #4
At Tuesday 26/12/2006 18:08, John Machin wrote:
>Wojciech Mula wrote:
Steve Bergman wrote:
I'm looking for a module to do fuzzy comparison of strings. [...]
Check module difflib, it returns difference between two sequences.

and it's intended for comparing text files, and is relatively slow.

Google "python levenshtein". You'll probably find this a better fit for
typoed keys in a database.
Other alternatives: trigram, n-gram, Jaro's distance. There are some
Python implem. available.
--
Gabriel Genellina
Softlab SRL


__________________________________________________
Preguntá. Respondé. Descubrí.
Todo lo que querías saber, y lo que ni imaginabas,
está en Yahoo! Respuestas (Beta).
¡Probalo ya!
http://www.yahoo.com.ar/respuestas

Dec 26 '06 #5

Carsten Haese wrote:
On Tue, 2006-12-26 at 13:08 -0800, John Machin wrote:
Wojciech Mula wrote:
Steve Bergman wrote:
I'm looking for a module to do fuzzy comparison of strings. [...]
>
Check module difflib, it returns difference between two sequences.
and it's intended for comparing text files, and is relatively slow.

Google "python levenshtein". You'll probably find this a better fit for
typoed keys in a database.
[...]

Using the Levenshtein distance in combination with stripping "noise"
characters is a good start, but the OP might want to take it a step
further. One of the OP's requirements is to recognize visually similar
strings, but 241O (Letter O at the end) and 241X have the same
Levenshtein distance from 2410 (digit zero at the end) while the former
is visually much closer to 2410 than the latter.

It seems to me that this could be achieved by starting with a standard
Levenshtein implementation such as http://hetland.org/python/distance.py
and altering the line "change = change + 1" to something like "change =
change + visual_distance(a[j-1], b[i-1])". visual_distance() would be a
function that embodies the OP's idea of which character replacements are
tolerable by returning a number between 0 (the two characters are
visually identical) and 1 (the two characters are completely different).
Ya ya ya, I could have told him a whole lot more -- please consider
that what I did tell him was IMHO over-generous in response to an OT
question asking for assistance with performing a Google search.

.... and given his keys are described as "numbers", a better example
might be 241O or 241o false-matching with 2416.

.... and it might be a good idea if he ran the simplistic approach first
and saw what near-misses he actually came up with before complicating
it and slowing down what is already an O(N**2*L**2) exercise in the
traditional/novice implementation where N is the number of keys and L
is their average length.

The OP needs to think about 123456789 compared with 123426789; are they
the same account or not? What other information does he have?

HTH,
John

Dec 26 '06 #6
"John Machin" <sj******@lexicon.netwrote:
To compare two strings, take copies, and:
Taking a copy of a string seems kind of superfluous in Python.
Dec 27 '06 #7
"Steve Bergman" <st***@rueb.comwrites:
I'm looking for a module to do fuzzy comparison of strings. I have 2
item master files which are supposed to be identical, but they have
thousands of records where the item numbers don't match in various
ways. One might include a '-' or have leading zeros, or have a single
character missing, or a zero that is typed as a letter 'O'. That kind
of thing. These tables currently reside in a mysql database. I was
wondering if there is a good package to let me compare strings and
return a value that is a measure of their similarity. Kind of like
soundex but for strings that aren't words.
If you were using PostgreSQL there's a contrib package (pg_trgm) that could
help a lot with that. It can show you the distance between two strings based
on a trigram comparison.

You can see how it works on the README
(http://www.sai.msu.su/~megera/postgr...README.pg_trgm) and maybe
port it for your needs.

But it probably won't be a one operation only search, you'll have to
post process results to decide what to do on multiple matches.

--
Jorge Godoy <jg****@gmail.com>
Dec 27 '06 #8

Duncan Booth wrote:
"John Machin" <sj******@lexicon.netwrote:
To compare two strings, take copies, and:

Taking a copy of a string seems kind of superfluous in Python.
You are right, I really meant don't do:
original = original.strip().replace(....).replace(....)
(a strange way of doing things which I see occasionally in other folks'
code)

Cheers,
John

Dec 27 '06 #9
On Wed, 27 Dec 2006 02:52:42 -0800, John Machin wrote:
>
Duncan Booth wrote:
>"John Machin" <sj******@lexicon.netwrote:
To compare two strings, take copies, and:

Taking a copy of a string seems kind of superfluous in Python.

You are right, I really meant don't do:
original = original.strip().replace(....).replace(....)
(a strange way of doing things which I see occasionally in other folks'
code)
Well, sure it is strange if you call the variable "original". But if you
don't care about the original, what's so strange about writing this?

astring = astring.strip().replace(....).replace(....)

Why keep the original value of astring around if you no longer need it?
--
Steven.

Dec 27 '06 #10
Thanks, all. Yes, Levenshtein seems to be the magic word I was looking
for. (It's blazingly fast, too.)

I suspect that if I strip out all the punctuation, etc. from both the
itemnumber and description columns, as suggested, and concatenate them,
pairing the record with its closest match in the other file, it ought
to be pretty accurate. Obviously, the final decision will be up to a
human being, but this should help them quite a bit.

BTW, excluding all the items that match exactly, I only have 8000 items
in one file to compare to 2600 in the other. As fast as
python-levenshtein seems to be, this should finish in well under a
minute.

Thanks again.

-Steve

Dec 27 '06 #11
Steve Bergman wrote:
Thanks, all. Yes, Levenshtein seems to be the magic word I was looking
for. (It's blazingly fast, too.)

I suspect that if I strip out all the punctuation, etc. from both the
itemnumber and description columns, as suggested, and concatenate them,
pairing the record with its closest match in the other file, it ought
to be pretty accurate. Obviously, the final decision will be up to a
human being, but this should help them quite a bit.

BTW, excluding all the items that match exactly, I only have 8000 items
in one file to compare to 2600 in the other. As fast as
python-levenshtein seems to be, this should finish in well under a
minute.
The above suggests that you plan to do a preliminary pass using exact
comparison, and remove exact-matching pairs from further consideration.
If that is the case, here are a few questions for you to ponder:

What about 789o123 in file A and 789o123 in file B? Are you concerned
about standardising your item-numbers?

What about cases like 7890123 and 789o123 in file A? Are you concerned
about duplicated records within a file?

What about cases like 7890123 and 789o123 in file A and 7890123 and
789o123 and 78-901-23 in file B? Are you concerned about grouping all
instances of the same item?
If you are, the magic phrase you are looking for is "union find".

HTH,
John

Dec 27 '06 #12
At Wednesday 27/12/2006 18:59, John Machin wrote:
Thanks, all. Yes, Levenshtein seems to be the magic word I was looking
for. (It's blazingly fast, too.)
In case you need something more, this article is a good starting point:
Record Linkage: A Machine Learning Approach, A Toolbox, and A Digital
Government Web Service (2003)
Mohamed G. Elfeky, Vassilios S. Verykios, Ahmed K. Elmagarmid, Thanaa
M. Ghanem, Ahmed R. Huwait.
http://citeseer.ist.psu.edu/elfeky03record.html
--
Gabriel Genellina
Softlab SRL


__________________________________________________
Preguntá. Respondé. Descubrí.
Todo lo que querías saber, y lo que ni imaginabas,
está en Yahoo! Respuestas (Beta).
¡Probalo ya!
http://www.yahoo.com.ar/respuestas

Dec 28 '06 #13
jmw

Gabriel Genellina wrote:
At Tuesday 26/12/2006 18:08, John Machin wrote:
I'm looking for a module to do fuzzy comparison of strings. [...]
Other alternatives: trigram, n-gram, Jaro's distance. There are some
Python implem. available.
Quick question, you mentioned the data you need to run comparisons on
is stored in a database. Is this string comparison a one-time
processing kind of thing to clean up the data, or are you going to have
to continually do fuzzy string comparison on the data in the database?
There are some papers out there on implementing n-gram string
comparisons completely in SQL so that you don't have to pull back all
the data in your tables in order to do fuzzy comparisons. I can drum
up some code I did a while ago and post it (in java).

Dec 28 '06 #14
jmw

Gabriel Genellina wrote:
At Tuesday 26/12/2006 18:08, John Machin wrote:
I'm looking for a module to do fuzzy comparison of strings. [...]
Other alternatives: trigram, n-gram, Jaro's distance. There are some
Python implem. available.
Quick question, you mentioned the data you need to run comparisons on
is stored in a database. Is this string comparison a one-time
processing kind of thing to clean up the data, or are you going to have
to continually do fuzzy string comparison on the data in the database?
There are some papers out there on implementing n-gram string
comparisons completely in SQL so that you don't have to pull back all
the data in your tables in order to do fuzzy comparisons. I can drum
up some code I did a while ago and post it (in java).

Dec 28 '06 #15

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

0
by: ramin | last post by:
Hi, I would be greately thankful if somebody can give me some information about how string comparison is implemented in mysql. (Both in Physical and Application layer) How queries on string (for...
46
by: yadurajj | last post by:
Hello i am newbie trying to learn C..I need to know about string comparisons in C, without using a library function,...recently I was asked this in an interview..I can write a small program but I...
4
by: Peter Kirk | last post by:
Hi I am looking at some code which in many places performs string comparison using == instead of Equals. Am I right in assuming that this will in fact work "as expected" when it is strings...
1
by: bjjnova | last post by:
I have the following string comparison that is throwing an error I cannot find ( I will include my attempts to trace the error) In the line following the asterisks, written as I have below, a...
9
by: Usman Jamil | last post by:
Hi I'm having a strange error while comparing two strings. Please check the code below. This is a simple string comparison code and works just fine on all of my machines. While debugging an...
3
by: itmfl | last post by:
We are writing a program that multiplies two matrices of size n x m and m x n together. The matrices are stored in a file. The user provides the filename in the command line prompt. The file is...
10
by: lilly07 | last post by:
Hi, I have one column of strings in 1st file file and another file which consists of 5 clumns in each line and my basic objective is to find each item/line of 1st file is available in 3rd column of...
1
by: CloudSolutions | last post by:
Introduction: For many beginners and individual users, requiring a credit card and email registration may pose a barrier when starting to use cloud servers. However, some cloud server providers now...
0
by: Faith0G | last post by:
I am starting a new it consulting business and it's been a while since I setup a new website. Is wordpress still the best web based software for hosting a 5 page website? The webpages will be...
0
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 3 Apr 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome former...
0
by: ryjfgjl | last post by:
In our work, we often need to import Excel data into databases (such as MySQL, SQL Server, Oracle) for data analysis and processing. Usually, we use database tools like Navicat or the Excel import...
0
by: aa123db | last post by:
Variable and constants Use var or let for variables and const fror constants. Var foo ='bar'; Let foo ='bar';const baz ='bar'; Functions function $name$ ($parameters$) { } ...
0
by: ryjfgjl | last post by:
If we have dozens or hundreds of excel to import into the database, if we use the excel import function provided by database editors such as navicat, it will be extremely tedious and time-consuming...
1
by: nemocccc | last post by:
hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?
1
by: Sonnysonu | last post by:
This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...
0
by: Hystou | last post by:
There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.