473,387 Members | 1,650 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,387 software developers and data experts.

Quick fuzzy search

Hi!

I have a database with approx. 2500 records. When entering a new record
I'd like to avoid double entries that may differ slightly in writing.

Currently I am calculating the Levenshtein Distance across all records
and offer the top ten matches. Unfortunately this is very slow on long
strings.

Is there a faster method to do this?

Best regards,
Martin
Apr 6 '06 #1
8 6336
Martin Schneider wrote:
I have a database with approx. 2500 records. When entering a new
record I'd like to avoid double entries that may differ slightly in
writing.
Currently I am calculating the Levenshtein Distance across all records
and offer the top ten matches. Unfortunately this is very slow on long
strings.

Is there a faster method to do this?


With such a small dataset it can be feasable to keep the
keys in memory, also calculating in static arrays
could save a cycle or two.
Of course you could use a less demanding metric, tinker
with upper and lower bounds on the way, or filter
candidate matches with a quick and dirty test before
applying the full metric on a selection.

--
Paul
Apr 6 '06 #2
"Martin Schneider" <ma**************@illusion-factory.de> wrote in message
news:49************@individual.net...
Hi!

I have a database with approx. 2500 records. When entering a new record
I'd like to avoid double entries that may differ slightly in writing.

Currently I am calculating the Levenshtein Distance across all records and
offer the top ten matches. Unfortunately this is very slow on long
strings.

Is there a faster method to do this?

Best regards,
Martin

If you want to do such a detailed comparison against each and every record
in the database, then of course it will be slow. What can we say? Get a
faster PC or buy more RAM?
If we knew what was being stored in the database, we might be able to make
more helpful suggestions. So if the database stores contact details and you
wanted to add a new contact:
Jane Smith, 117 Waterloo Road, London, SE1 8UL, United Kingdom
Would I really want to go through and evaluate how closely Jane's name
matches someone called Jan Smitt from Norway before I was sure this was not
a duplicate? However, we know nothing about your application.
Apr 6 '06 #3
Anthony England schrieb:
So if the database stores contact details and you
wanted to add a new contact:
Jane Smith, 117 Waterloo Road, London, SE1 8UL, United Kingdom
Would I really want to go through and evaluate how closely Jane's name
matches someone called Jan Smitt from Norway before I was sure this was not
a duplicate?


That's exactly the purpose.

Best regards,
Martin
Apr 6 '06 #4
"Martin Schneider" <ma**************@illusion-factory.de> wrote in message
news:49************@individual.net...
Anthony England schrieb:
So if the database stores contact details and you wanted to add a new
contact:
Jane Smith, 117 Waterloo Road, London, SE1 8UL, United Kingdom
Would I really want to go through and evaluate how closely Jane's name
matches someone called Jan Smitt from Norway before I was sure this was
not a duplicate?


That's exactly the purpose.

Best regards,
Martin

What I meant was that you could look at only the people in postcode which
starts with SE1 or perhaps only those in London or perhaps only those in the
United Kingdom. I would not compare the new record against every single
existing one.

When I have done this sort of thing before, I give the user some degree of
flexibilty so you can adjust how rigourously you will search for duplicates.
You could keep some soundex representation of the surname in an indexed
field and start the process by entering a surname.
Apr 6 '06 #5
Anthony England schrieb:
What I meant was that you could look at only the people in postcode which
starts with SE1 or perhaps only those in London or perhaps only those in the
United Kingdom. I would not compare the new record against every single
existing one.


I would :-) The search is pretty local in a geographical way and similar
names in almost every case identify the same company, even if the ZIP
code is different (the company may have moved it's location).

Best regards,
Martin
Apr 6 '06 #6
On Thu, 06 Apr 2006 09:15:21 +0200, Martin Schneider
<ma**************@illusion-factory.de> wrote:

I'm using Ratcliff/Obershelp Pattern Recognition Algorithm on about
25000 records, and it is not too slow. I'm using the version in C and
built a DLL, for optimum speed.

-Tom.

Hi!

I have a database with approx. 2500 records. When entering a new record
I'd like to avoid double entries that may differ slightly in writing.

Currently I am calculating the Levenshtein Distance across all records
and offer the top ten matches. Unfortunately this is very slow on long
strings.

Is there a faster method to do this?

Best regards,
Martin


Apr 6 '06 #7
There's a SOUNDEX algorithm, done in VBA, published in various places. If
you GOOGLE this newsgroup, chances are good that you'll find a link. In
fact, it may be in the Microsoft Knowledge Base at
http://support.microsoft.com. Caveat: AFAIK, it applies to English
pronunciations only.

Larry Linson
Microsoft Access MVP

"Martin Schneider" <ma**************@illusion-factory.de> wrote in message
news:49************@individual.net...
Anthony England schrieb:
What I meant was that you could look at only the people in postcode which
starts with SE1 or perhaps only those in London or perhaps only those in
the United Kingdom. I would not compare the new record against every
single existing one.


I would :-) The search is pretty local in a geographical way and similar
names in almost every case identify the same company, even if the ZIP code
is different (the company may have moved it's location).

Best regards,
Martin

Apr 6 '06 #8
"Larry Linson" <bo*****@localhost.not> wrote in
news:ImeZf.25403$fQ6.1287@trnddc03:
There's a SOUNDEX algorithm, done in VBA, published in various
places. If you GOOGLE this newsgroup, chances are good that you'll
find a link. In fact, it may be in the Microsoft Knowledge Base at
http://support.microsoft.com. Caveat: AFAIK, it applies to English
pronunciations only.


The Soundex2 algorithm is also freely available on the Web and is a
much better implementation for use in de-duping (far fewer false
positives), though it has the same language limitation.

--
David W. Fenton http://www.dfenton.com/
usenet at dfenton dot com http://www.dfenton.com/DFA/
Apr 6 '06 #9

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

2
by: Christos KK Loverdos | last post by:
Hi all! Is there any fuzzy logic package?
2
by: William Morris | last post by:
Apologies in advance for the cross-post. Not sure if this is better handled in ASP code or TransactSQL. Windows2000 Server SQL 2000 tblPeople contactid int firstname varchar(25) lastname...
1
by: Evaluating Fuzzy SW | last post by:
Hi All We are using soundex (and later tried Nysiis) for fuzzy name search software. But we faced a lot of problems the search accuracy was not very good also we saw a lot of misses of...
0
by: DNKMCA | last post by:
Hi, i was building a internet based search engine and heard of fuzzy search algorithms. Can anybody give fuzzy logic search algorithm for string search -dnk
24
by: BBands | last post by:
I have some CDs and have been archiving them on a PC. I wrote a Python script that spans the archive and returns a list of its contents: ...]. I wanted to add a search function to locate all the...
24
by: cassetti | last post by:
Here's the issue: I have roughly 20 MS excel spreadsheets, each row contains a record. These records were hand entered by people in call centers. The problem is, there can and are duplicate...
1
by: Ben | last post by:
hey, anyone knows any fuzzy search technology (approximate search) in python? thanks Ben
14
by: Steve Bergman | last post by:
I'm looking for a module to do fuzzy comparison of strings. I have 2 item master files which are supposed to be identical, but they have thousands of records where the item numbers don't match in...
0
by: jojo41300000 | last post by:
Hi all, Is there any ideas how to implement the fuzzy search using soundex in SQL server 2005? Could you please post some suggestions, ideas, or source codes if you have already...
0
by: Charles Arthur | last post by:
How do i turn on java script on a villaon, callus and itel keypad mobile phone
0
by: aa123db | last post by:
Variable and constants Use var or let for variables and const fror constants. Var foo ='bar'; Let foo ='bar';const baz ='bar'; Functions function $name$ ($parameters$) { } ...
0
by: ryjfgjl | last post by:
In our work, we often receive Excel tables with data in the same format. If we want to analyze these data, it can be difficult to analyze them because the data is spread across multiple Excel files...
0
by: emmanuelkatto | last post by:
Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel
0
BarryA
by: BarryA | last post by:
What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...
1
by: nemocccc | last post by:
hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?
1
by: Sonnysonu | last post by:
This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...
0
by: Hystou | last post by:
There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...
0
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.