Quick fuzzy search - Microsoft Access / VBA

Martin Schneider

Hi!

I have a database with approx. 2500 records. When entering a new record
I'd like to avoid double entries that may differ slightly in writing.

Currently I am calculating the Levenshtein Distance across all records
and offer the top ten matches. Unfortunately this is very slow on long
strings.

Is there a faster method to do this?

Best regards,
Martin

Apr 6 '06 #1

Subscribe Post Reply

6336

kaniest

Martin Schneider wrote:

I have a database with approx. 2500 records. When entering a new
record I'd like to avoid double entries that may differ slightly in
writing.
Currently I am calculating the Levenshtein Distance across all records
and offer the top ten matches. Unfortunately this is very slow on long
strings.

Is there a faster method to do this?

With such a small dataset it can be feasable to keep the
keys in memory, also calculating in static arrays
could save a cycle or two.
Of course you could use a less demanding metric, tinker
with upper and lower bounds on the way, or filter
candidate matches with a quick and dirty test before
applying the full metric on a selection.

--
Paul

Apr 6 '06 #2

Anthony England

"Martin Schneider" <ma**************@illusion-factory.de> wrote in message
news:49************@individual.net...

Hi!

I have a database with approx. 2500 records. When entering a new record
I'd like to avoid double entries that may differ slightly in writing.

Currently I am calculating the Levenshtein Distance across all records and
offer the top ten matches. Unfortunately this is very slow on long
strings.

Is there a faster method to do this?

Best regards,
Martin

If you want to do such a detailed comparison against each and every record
in the database, then of course it will be slow. What can we say? Get a
faster PC or buy more RAM?
If we knew what was being stored in the database, we might be able to make
more helpful suggestions. So if the database stores contact details and you
wanted to add a new contact:
Jane Smith, 117 Waterloo Road, London, SE1 8UL, United Kingdom
Would I really want to go through and evaluate how closely Jane's name
matches someone called Jan Smitt from Norway before I was sure this was not
a duplicate? However, we know nothing about your application.

Apr 6 '06 #3

Martin Schneider

Anthony England schrieb:

So if the database stores contact details and you
wanted to add a new contact:
Jane Smith, 117 Waterloo Road, London, SE1 8UL, United Kingdom
Would I really want to go through and evaluate how closely Jane's name
matches someone called Jan Smitt from Norway before I was sure this was not
a duplicate?

That's exactly the purpose.

Best regards,
Martin

Apr 6 '06 #4

Anthony England

"Martin Schneider" <ma**************@illusion-factory.de> wrote in message
news:49************@individual.net...

Anthony England schrieb:
So if the database stores contact details and you wanted to add a new
contact:
Jane Smith, 117 Waterloo Road, London, SE1 8UL, United Kingdom
Would I really want to go through and evaluate how closely Jane's name
matches someone called Jan Smitt from Norway before I was sure this was
not a duplicate?

That's exactly the purpose.

Best regards,
Martin

What I meant was that you could look at only the people in postcode which
starts with SE1 or perhaps only those in London or perhaps only those in the
United Kingdom. I would not compare the new record against every single
existing one.

When I have done this sort of thing before, I give the user some degree of
flexibilty so you can adjust how rigourously you will search for duplicates.
You could keep some soundex representation of the surname in an indexed
field and start the process by entering a surname.

Apr 6 '06 #5

Martin Schneider

Anthony England schrieb:

What I meant was that you could look at only the people in postcode which
starts with SE1 or perhaps only those in London or perhaps only those in the
United Kingdom. I would not compare the new record against every single
existing one.

I would :-) The search is pretty local in a geographical way and similar
names in almost every case identify the same company, even if the ZIP
code is different (the company may have moved it's location).

Best regards,
Martin

Apr 6 '06 #6

Tom van Stiphout

On Thu, 06 Apr 2006 09:15:21 +0200, Martin Schneider
<ma**************@illusion-factory.de> wrote:

I'm using Ratcliff/Obershelp Pattern Recognition Algorithm on about
25000 records, and it is not too slow. I'm using the version in C and
built a DLL, for optimum speed.

-Tom.

Hi!

I have a database with approx. 2500 records. When entering a new record
I'd like to avoid double entries that may differ slightly in writing.

Currently I am calculating the Levenshtein Distance across all records
and offer the top ten matches. Unfortunately this is very slow on long
strings.

Is there a faster method to do this?

Best regards,
Martin

Apr 6 '06 #7

Larry Linson

There's a SOUNDEX algorithm, done in VBA, published in various places. If
you GOOGLE this newsgroup, chances are good that you'll find a link. In
fact, it may be in the Microsoft Knowledge Base at
http://support.microsoft.com. Caveat: AFAIK, it applies to English
pronunciations only.

Larry Linson
Microsoft Access MVP

"Martin Schneider" <ma**************@illusion-factory.de> wrote in message
news:49************@individual.net...

Anthony England schrieb:
What I meant was that you could look at only the people in postcode which
starts with SE1 or perhaps only those in London or perhaps only those in
the United Kingdom. I would not compare the new record against every
single existing one.

I would :-) The search is pretty local in a geographical way and similar
names in almost every case identify the same company, even if the ZIP code
is different (the company may have moved it's location).

Best regards,
Martin

Apr 6 '06 #8

David W. Fenton

"Larry Linson" <bo*****@localhost.not> wrote in
news:ImeZf.25403$fQ6.1287@trnddc03:

There's a SOUNDEX algorithm, done in VBA, published in various
places. If you GOOGLE this newsgroup, chances are good that you'll
find a link. In fact, it may be in the Microsoft Knowledge Base at
http://support.microsoft.com. Caveat: AFAIK, it applies to English
pronunciations only.

The Soundex2 algorithm is also freely available on the Web and is a
much better implementation for use in de-duping (far fewer false
positives), though it has the same language limitation.

--
David W. Fenton http://www.dfenton.com/
usenet at dfenton dot com http://www.dfenton.com/DFA/

Apr 6 '06 #9

Similar topics

Fuzzy Logic

by: Christos KK Loverdos | last post by:

Hi all! Is there any fuzzy logic package?

Python

Search methods, "fuzzy" logic, duplicate checking, oh my!

by: William Morris | last post by:

Apologies in advance for the cross-post. Not sure if this is better handled in ASP code or TransactSQL. Windows2000 Server SQL 2000 tblPeople contactid int firstname varchar(25) lastname...

ASP / Active Server Pages

Any Fuzzy Software ( Other than SOUNDEX / NYSIIS )?

by: Evaluating Fuzzy SW | last post by:

Hi All We are using soundex (and later tried Nysiis) for fuzzy name search software. But we faced a lot of problems the search accuracy was not very good also we saw a lot of misses of...

Microsoft SQL Server

Fuzzy Logic for String search

by: DNKMCA | last post by:

Hi, i was building a internet based search engine and heard of fuzzy search algorithms. Can anybody give fuzzy logic search algorithm for string search -dnk

ASP / Active Server Pages

Fuzzy Lookups

by: BBands | last post by:

I have some CDs and have been archiving them on a PC. I wrote a Python script that spans the archive and returns a list of its contents: ...]. I wanted to add a search function to locate all the...

Python

Can Access use Fuzzy Logic

by: cassetti | last post by:

Here's the issue: I have roughly 20 MS excel spreadsheets, each row contains a record. These records were hand entered by people in call centers. The problem is, there can and are duplicate...

Microsoft Access / VBA

fuzzy search technology in python

by: Ben | last post by:

hey, anyone knows any fuzzy search technology (approximate search) in python? thanks Ben

Python

Fuzzy string comparison

by: Steve Bergman | last post by:

I'm looking for a module to do fuzzy comparison of strings. I have 2 item master files which are supposed to be identical, but they have thousands of records where the item numbers don't match in...

Python

how to implement fuzzy search using soundex in SQL SERVER 2005!

by: jojo41300000 | last post by:

Hi all, Is there any ideas how to implement the fuzzy search using soundex in SQL server 2005? Could you please post some suggestions, ideas, or source codes if you have already...

Microsoft SQL Server

How to turn on java script in a villaon keypad mobile phone

by: Charles Arthur | last post by:

How do i turn on java script on a villaon, callus and itel keypad mobile phone

Java

Basic Javascript concepts

by: aa123db | last post by:

Variable and constants Use var or let for variables and const fror constants. Var foo ='bar'; Let foo ='bar';const baz ='bar'; Functions function $name$ ($parameters$) { } ...

Javascript

Merging data from multiple Excel files

by: ryjfgjl | last post by:

In our work, we often receive Excel tables with data in the same format. If we want to analyze these data, it can be difficult to analyze them because the data is spread across multiple Excel files...

Data Management

Migrating Website to Cloud - Emmanuel Katto

by: emmanuelkatto | last post by:

Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel

General

Navigating the Data Structures and Algorithms (DSA)

by: BarryA | last post by:

What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...

Algorithms / Advanced Math

Looking to do Android software development, any suggestions? Is flutter better?

by: nemocccc | last post by:

hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?

General

Is that possible of reading the .csv file in column wise and the column have different lengths ?

by: Sonnysonu | last post by:

This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...

C / C++

How to build RAID in BIOS?

by: Hystou | last post by:

There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...

Computer Hardware

Problem With Comparison Operator <=> in G++

by: Oralloy | last post by:

Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...

C / C++