473,386 Members | 1,710 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,386 software developers and data experts.

Search methods, "fuzzy" logic, duplicate checking, oh my!

Apologies in advance for the cross-post. Not sure if this is better handled
in ASP code or TransactSQL.

Windows2000 Server
SQL 2000

tblPeople
contactid int
firstname varchar(25)
lastname varchar(25)
address varchar(255)
city varchar(100)
state int
zip varchar(10)
homephone varchar(25)
workphone varchar(25)
mobilephone varchar(25)

Our contact search screen allows the users to search for a contact by phone,
lastname, firstname, city, state, zip. If no matches are found, then the
user is given the option of adding the information into the database as a
new contact. When searching, all whitespace and punctuation is removed for
the comparison.

The database is populated two ways:

1. by hand by our users, where the format is controlled through the
interface - i.e. phone numbers always have an area code, and the numbers are
stored with dashes. Numbers, dashes, spaces, and parentheses are the only
allow characters.
2. Client data as CSV files from a variety of proprietary sources, and
the data could have any ol' crap in it. We scrub as best we can, but some
garbage gets through.

So, what we find is when doing the phone number search, someone with
816-555-1234 won't be found if the user enters 555-1234, and vice versa.
Likewise, if any of the names are mis-spelled, the search will fail.

We'd like to make the query "smart". We've explored SOUNDEX, but the
results that come back often confuse the end-user: "Mercer" comes back with
"Mercer", "Mercier" (not too bad, yet), "Marker" (ok, but...), "Markwardt"
(um...), "Margarucci" (now we're getting blank stares...), and against
numeric values (like phone) it just dies. DIFFERENCE might get us there
somehow, though it returns a 4 on every variation in the preceeding list so
we can't sort that way. I've seen some searches where they're sorted by
"relevance", but I'm not sure how to make that happen.

So I'm looking for ideas on making a "smart search". Thanks just for
reading this far, and thanks for any help you can give.

- Wm
--
William Morris
Product Development, Seritas LLC
Kansas City, Missouri
Jul 19 '05 #1
2 2245
On Wed, 25 Feb 2004 10:27:08 -0600, William Morris
<ne***************************@seamlyne.com> wrote:
tblPeople
contactid int
firstname varchar(25)
lastname varchar(25)
address varchar(255)
city varchar(100)
state int
zip varchar(10)
homephone varchar(25)
workphone varchar(25)
mobilephone varchar(25)

So I'm looking for ideas on making a "smart search". Thanks just for
reading this far, and thanks for any help you can give.


You have clearly goteen into some more advanced areas than i can speak to,
but I wanted to share a couple simple things things I have used in the
past:

-- Use a "LIKE" clause in SQL to search for phone numbers without area
codes and to search for words or names that begin with the entered text
(searching for "Mer" to get "Mercer"). For example:

SELECT firstname, lastname FROM tblPeople WHERE homephone LIKE '%5551212'
-- will give you all the people with phone numbers that are the same but
include an area code

-- You might be able to reduce some of your search complexity depending on
the interface by using a list that dynamically reduces itself as the user
enters their search string (at least in the case of names or cities),
showing those people whose names are alphabetically similar. Some exciting
javascript and ASP trickery is needed to make this work, but by showing a
longer list of possibilities a user can look ahead at the whole list and
often find what they are looking for faster than a search. Also neatly
combined with an autocomplete script to fill in the rest of the selected
name automatically.

-- Encourage or even only offer a zip code search. Zip codes are far more
likely to be entered correctly than cities and states when dealing with
lots of user-created data. To reduce errors in cities and states, you
could even consider regenerating cities and states on import from a
commercially-available zip code database.
HTH,
Jenny
Jul 19 '05 #2
William,
Depending upon the SQL Server version (7.0 or 2000) as well as the
OS-platform (Win2K or Win2003) that you are using -- SELECT @@version --
Full Text Search (FTS) can be used to satisfy most of your search
requirements. Additionally, SQL FTS queries (CONTAINS and FREETEXT) can be
combined with SOUNDEX &/or DIFFERENCE, if necessary. If you're truly looking
for a "smart search", there are other methods using pure T-SQL that can get
you to the next level of search as I'm writing a book on this subject.

Note, you can also post FTS related questions to the newsgroup:
microsoft.public.sqlserver.fulltext (cc'ed)
Regards,
John
"William Morris" <ne***************************@seamlyne.com> wrote in
message news:c1*************@ID-205671.news.uni-berlin.de...
Apologies in advance for the cross-post. Not sure if this is better handled in ASP code or TransactSQL.

Windows2000 Server
SQL 2000

tblPeople
contactid int
firstname varchar(25)
lastname varchar(25)
address varchar(255)
city varchar(100)
state int
zip varchar(10)
homephone varchar(25)
workphone varchar(25)
mobilephone varchar(25)

Our contact search screen allows the users to search for a contact by phone, lastname, firstname, city, state, zip. If no matches are found, then the
user is given the option of adding the information into the database as a
new contact. When searching, all whitespace and punctuation is removed for the comparison.

The database is populated two ways:

1. by hand by our users, where the format is controlled through the
interface - i.e. phone numbers always have an area code, and the numbers are stored with dashes. Numbers, dashes, spaces, and parentheses are the only
allow characters.
2. Client data as CSV files from a variety of proprietary sources, and
the data could have any ol' crap in it. We scrub as best we can, but some
garbage gets through.

So, what we find is when doing the phone number search, someone with
816-555-1234 won't be found if the user enters 555-1234, and vice versa.
Likewise, if any of the names are mis-spelled, the search will fail.

We'd like to make the query "smart". We've explored SOUNDEX, but the
results that come back often confuse the end-user: "Mercer" comes back with "Mercer", "Mercier" (not too bad, yet), "Marker" (ok, but...), "Markwardt"
(um...), "Margarucci" (now we're getting blank stares...), and against
numeric values (like phone) it just dies. DIFFERENCE might get us there
somehow, though it returns a 4 on every variation in the preceeding list so we can't sort that way. I've seen some searches where they're sorted by
"relevance", but I'm not sure how to make that happen.

So I'm looking for ideas on making a "smart search". Thanks just for
reading this far, and thanks for any help you can give.

- Wm
--
William Morris
Product Development, Seritas LLC
Kansas City, Missouri

Jul 19 '05 #3

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

2
by: Christos KK Loverdos | last post by:
Hi all! Is there any fuzzy logic package?
4
by: JP SIngh | last post by:
Hi There I am creating a search page and need help writing the code to build the sql string. I have 3 fields on the main page which a user can choose to enter search terms in any of the 3...
1
by: metsymani | last post by:
In my web application, I have a search screen coded in ASP.Net. The Search process takes lot of time. So, I need to show a wait page informing the user that "Search is in progress. Please wait" along...
0
by: DNKMCA | last post by:
Hi, i was building a internet based search engine and heard of fuzzy search algorithms. Can anybody give fuzzy logic search algorithm for string search -dnk
24
by: cassetti | last post by:
Here's the issue: I have roughly 20 MS excel spreadsheets, each row contains a record. These records were hand entered by people in call centers. The problem is, there can and are duplicate...
0
by: adi4u | last post by:
Explain the role of fuzzy logic in search engine algorithm.
8
by: Astan Chee | last post by:
Hi, Thanks for all the help from the previous problem. Turns out I didnt have to use wxSizers, just change the size of my wxWidgets everytime a EVT_SIZE is called. Anyway, Im trying to find a...
318
by: King Raz | last post by:
The shootout site has benchmarks comparing different languages. It includes C# Mono vs Java but not C# .NET vs Java. So I went through all the benchmark on the site ... ...
4
by: Robert Cramer | last post by:
I am needing to create a sort of "plug-in architecture" whereby different Web sites I maintain on the same server have slightly different functionality. Some Web sites will need to use the...
0
by: taylorcarr | last post by:
A Canon printer is a smart device known for being advanced, efficient, and reliable. It is designed for home, office, and hybrid workspace use and can also be used for a variety of purposes. However,...
0
by: aa123db | last post by:
Variable and constants Use var or let for variables and const fror constants. Var foo ='bar'; Let foo ='bar';const baz ='bar'; Functions function $name$ ($parameters$) { } ...
0
by: ryjfgjl | last post by:
In our work, we often receive Excel tables with data in the same format. If we want to analyze these data, it can be difficult to analyze them because the data is spread across multiple Excel files...
0
by: emmanuelkatto | last post by:
Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel
1
by: Sonnysonu | last post by:
This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...
0
by: Hystou | last post by:
There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...
0
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
0
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...
0
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.