473,809 Members | 2,931 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

Fuzzy string comparison

I'm looking for a module to do fuzzy comparison of strings. I have 2
item master files which are supposed to be identical, but they have
thousands of records where the item numbers don't match in various
ways. One might include a '-' or have leading zeros, or have a single
character missing, or a zero that is typed as a letter 'O'. That kind
of thing. These tables currently reside in a mysql database. I was
wondering if there is a good package to let me compare strings and
return a value that is a measure of their similarity. Kind of like
soundex but for strings that aren't words.

Thanks,
Steve Bergman

Dec 26 '06
14 13500
Thanks, all. Yes, Levenshtein seems to be the magic word I was looking
for. (It's blazingly fast, too.)

I suspect that if I strip out all the punctuation, etc. from both the
itemnumber and description columns, as suggested, and concatenate them,
pairing the record with its closest match in the other file, it ought
to be pretty accurate. Obviously, the final decision will be up to a
human being, but this should help them quite a bit.

BTW, excluding all the items that match exactly, I only have 8000 items
in one file to compare to 2600 in the other. As fast as
python-levenshtein seems to be, this should finish in well under a
minute.

Thanks again.

-Steve

Dec 27 '06 #11
Steve Bergman wrote:
Thanks, all. Yes, Levenshtein seems to be the magic word I was looking
for. (It's blazingly fast, too.)

I suspect that if I strip out all the punctuation, etc. from both the
itemnumber and description columns, as suggested, and concatenate them,
pairing the record with its closest match in the other file, it ought
to be pretty accurate. Obviously, the final decision will be up to a
human being, but this should help them quite a bit.

BTW, excluding all the items that match exactly, I only have 8000 items
in one file to compare to 2600 in the other. As fast as
python-levenshtein seems to be, this should finish in well under a
minute.
The above suggests that you plan to do a preliminary pass using exact
comparison, and remove exact-matching pairs from further consideration.
If that is the case, here are a few questions for you to ponder:

What about 789o123 in file A and 789o123 in file B? Are you concerned
about standardising your item-numbers?

What about cases like 7890123 and 789o123 in file A? Are you concerned
about duplicated records within a file?

What about cases like 7890123 and 789o123 in file A and 7890123 and
789o123 and 78-901-23 in file B? Are you concerned about grouping all
instances of the same item?
If you are, the magic phrase you are looking for is "union find".

HTH,
John

Dec 27 '06 #12
At Wednesday 27/12/2006 18:59, John Machin wrote:
Thanks, all. Yes, Levenshtein seems to be the magic word I was looking
for. (It's blazingly fast, too.)
In case you need something more, this article is a good starting point:
Record Linkage: A Machine Learning Approach, A Toolbox, and A Digital
Government Web Service (2003)
Mohamed G. Elfeky, Vassilios S. Verykios, Ahmed K. Elmagarmid, Thanaa
M. Ghanem, Ahmed R. Huwait.
http://citeseer.ist.psu.edu/elfeky03record.html
--
Gabriel Genellina
Softlab SRL


_______________ _______________ _______________ _____
Preguntá. Respondé. Descubrí.
Todo lo que querías saber, y lo que ni imaginabas,
está en Yahoo! Respuestas (Beta).
¡Probalo ya!
http://www.yahoo.com.ar/respuestas

Dec 28 '06 #13
jmw

Gabriel Genellina wrote:
At Tuesday 26/12/2006 18:08, John Machin wrote:
I'm looking for a module to do fuzzy comparison of strings. [...]
Other alternatives: trigram, n-gram, Jaro's distance. There are some
Python implem. available.
Quick question, you mentioned the data you need to run comparisons on
is stored in a database. Is this string comparison a one-time
processing kind of thing to clean up the data, or are you going to have
to continually do fuzzy string comparison on the data in the database?
There are some papers out there on implementing n-gram string
comparisons completely in SQL so that you don't have to pull back all
the data in your tables in order to do fuzzy comparisons. I can drum
up some code I did a while ago and post it (in java).

Dec 28 '06 #14
jmw

Gabriel Genellina wrote:
At Tuesday 26/12/2006 18:08, John Machin wrote:
I'm looking for a module to do fuzzy comparison of strings. [...]
Other alternatives: trigram, n-gram, Jaro's distance. There are some
Python implem. available.
Quick question, you mentioned the data you need to run comparisons on
is stored in a database. Is this string comparison a one-time
processing kind of thing to clean up the data, or are you going to have
to continually do fuzzy string comparison on the data in the database?
There are some papers out there on implementing n-gram string
comparisons completely in SQL so that you don't have to pull back all
the data in your tables in order to do fuzzy comparisons. I can drum
up some code I did a while ago and post it (in java).

Dec 28 '06 #15

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

0
1492
by: ramin | last post by:
Hi, I would be greately thankful if somebody can give me some information about how string comparison is implemented in mysql. (Both in Physical and Application layer) How queries on string (for example '>' or like or strcmp) are implemented. Any idea where can I find the open source for string comparison/ ordering/ implementation? Does it use the sme structre as strcmp in c? Thanks a lot, Ramin
46
5181
by: yadurajj | last post by:
Hello i am newbie trying to learn C..I need to know about string comparisons in C, without using a library function,...recently I was asked this in an interview..I can write a small program but I was told that wouldn't it be wise to first get the length of the strings..if it doesn't match then they are not the same..I agreed...then he said..but again that would be an overhead first measuring the length...and then doing a character by...
4
7190
by: Peter Kirk | last post by:
Hi I am looking at some code which in many places performs string comparison using == instead of Equals. Am I right in assuming that this will in fact work "as expected" when it is strings involved? By "as expected" I mean that as long as the strings are instantiated using string literals, then == and Equals are the same. string s1 = "xxx";
1
1573
by: bjjnova | last post by:
I have the following string comparison that is throwing an error I cannot find ( I will include my attempts to trace the error) In the line following the asterisks, written as I have below, a true is always returned and the printed msg appears that the 2 variables are not equal. This happens even when the print statement writes out, "Change Doc Types TASK and TASK are not equal". (So, even when I test each variable separately and it...
9
3770
by: Usman Jamil | last post by:
Hi I'm having a strange error while comparing two strings. Please check the code below. This is a simple string comparison code and works just fine on all of my machines. While debugging an issue on a client's machine, who had turkish windows installed on his system, I found out that this simple piece of code does'nt work. The messages boxes that are displayed are in this sequence. 1. to upper works with szWINDOWS
3
4104
by: itmfl | last post by:
We are writing a program that multiplies two matrices of size n x m and m x n together. The matrices are stored in a file. The user provides the filename in the command line prompt. The file is formatted like so: /beginning/ matrix1 3 2 1 6 2 4 3 5 matrix2
10
8545
by: lilly07 | last post by:
Hi, I have one column of strings in 1st file file and another file which consists of 5 clumns in each line and my basic objective is to find each item/line of 1st file is available in 3rd column of 2 nd file. And I tried the following logic. It might be bit round about way but as a beginner am trying as follows. The column in the 1st file is having data as example NS008_456_R0030_3008 The 2nd file data is as follows: + test ...
0
9721
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However, people are often confused as to whether an ONU can Work As a Router. In this blog post, we’ll explore What is ONU, What Is Router, ONU & Router’s main usage, and What is the difference between ONU and Router. Let’s take a closer look ! Part I. Meaning of...
0
10376
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven tapestry of website design and digital marketing. It's not merely about having a website; it's about crafting an immersive digital experience that captivates audiences and drives business growth. The Art of Business Website Design Your website is...
1
10375
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For most users, this new feature is actually very convenient. If you want to control the update process,...
1
7651
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new presenter, Adolph Dupré who will be discussing some powerful techniques for using class modules. He will explain when you may want to use classes instead of User Defined Types (UDT). For example, to manage the data in unbound forms. Adolph will...
0
6880
by: conductexam | last post by:
I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and then checking html paragraph one by one. At the time of converting from word file to html my equations which are in the word document file was convert into image. Globals.ThisAddIn.Application.ActiveDocument.Select();...
0
5548
by: TSSRALBI | last post by:
Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The last exercise I practiced was to create a LAN-to-LAN VPN between two Pfsense firewalls, by using IPSEC protocols. I succeeded, with both firewalls in the same network. But I'm wondering if it's possible to do the same thing, with 2 Pfsense firewalls...
0
5686
by: adsilva | last post by:
A Windows Forms form does not have the event Unload, like VB6. What one acts like?
1
4331
by: 6302768590 | last post by:
Hai team i want code for transfer the data from one system to another through IP address by using C# our system has to for every 5mins then we have to update the data what the data is updated we have to send another system
2
3860
muto222
by: muto222 | last post by:
How can i add a mobile payment intergratation into php mysql website.

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.