473,395 Members | 1,458 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,395 software developers and data experts.

String Matching with Mistakes

jkmyoung
2,057 Expert 2GB
Hey, I'm trying to find an algorithm to approximately match different substrings within a master list, with some forgiveness on spelling.

Say you are allowed 3 errors in the string match:
Defining error: having to remove or add a character.
If you have to replace a character, that counts as 2 errors.
Also the search is case-insensitive for now.

So abcd and abc have 1 error
bcd and abcde have 2 errors
abcd and bcde have 2 errors
ace and abde have 3 errors

eg: (*contrived example) say the master list contains
Expand|Select|Wrap|Line Numbers
  1. Alpha
  2. Beta
  3. Gamma
  4. Delta
  5. Epsilon
  6. Zeta
  7. Eta
  8. Theta
  9. Iota
  10. Kappa
  11. Lambda
  12. Mu
  13. Nu 
  14. Xi 
  15. Omicron 
  16. Pi 
  17. Rho 
  18. Sigma 
  19. Tau 
  20. Upsilon 
  21. Phi 
  22. Chi 
  23. Psi 
  24. Omega 
  25.  
If someone types in "elta" they get:
eta with 1 error
delta with 1 error
beta with 2 errors
zeta with 2 errors
theta with 3 errors

The problem I have is, how do I determine when I should skip characters, add characters etc? This is particularly a problem when matching words with many of the same character, eg maybe
banana - banna
construction - constution
construction - constitution (shouldn't match)

Is there an easy way to determine on a mismatch whether I should:
1. Ignore a character from source?
2. Ignore a character from search substring?
3. Backtrack in some other way?

Or is there some way to 'hash' the string such that there are particular properties that we can use on the hash to determine if they are in fact similar?
Apr 14 '10 #1
1 2865
RedSon
5,000 Expert 4TB
It sounds like you are looking for some kind of predictive text and or partial/full matching library.

Entire companies are created to tackle this issue. This is a common problem with phones; hence T9. The T9 problem is not trivial.

I would suggest looking at open source libraries to see how they match text to a dictionary based on input: http://sourceforge.net/projects/lib378/

This project has been dormant for almost 2 years but you might be able to take it over and improve on it.
Apr 14 '10 #2

Sign in to post your reply or Sign up for a free account.

Similar topics

8
by: Eric Linders | last post by:
Hi, I'm trying to figure out the most efficient method for taking the first character in a string (which will be a number), and use it as a variable to check to see if the other numbers in the...
9
by: Xah Lee | last post by:
# -*- coding: utf-8 -*- # Python # Matching string patterns # # Sometimes you want to know if a string is of # particular pattern. Let's say in your website # you have converted all images...
19
by: Paul | last post by:
hi, there, for example, char *mystr="##this is##a examp#le"; I want to replace all the "##" in mystr with "****". How can I do this? I checked all the string functions in C, but did not...
10
by: javuchi | last post by:
I'm searching for a library which makes aproximative string matching, for example, searching in a dictionary the word "motorcycle", but returns similar strings like "motorcicle". Is there such a...
5
by: olaufr | last post by:
Hi, I'd need to perform simple pattern matching within a string using a list of possible patterns. For example, I want to know if the substring starting at position n matches any of the string I...
9
by: c_beginner | last post by:
Dear group, I want to implement a solution to the following link: http://acmicpc-live-archive.uva.es/nuevoportal/data/p2006.pdf As a beginning I am trying to implement this little sample...
9
by: | last post by:
I am interested in scanning web pages for content of interest, and then auto-classifying that content. I have tables of metadata that I can use for the classification, e.g. : "John P. Jones" "Jane...
0
NeoPa
by: NeoPa | last post by:
ANSI-89 v ANSI-92 Before we get into all the various types of pattern matching that can be used, there are two ANSI standards used for the main types of wildcard matching (matching zero or more...
11
by: tech | last post by:
Hi, I need a function to specify a match pattern including using wildcard characters as below to find chars in a std::string. The match pattern can contain the wildcard characters "*" and "?",...
1
by: Ben | last post by:
I apologize in advance for the newbie question. I'm trying to figure out a way to find all of the occurrences of a regular expression in a string including the overlapping ones. For example,...
0
by: Charles Arthur | last post by:
How do i turn on java script on a villaon, callus and itel keypad mobile phone
0
by: ryjfgjl | last post by:
In our work, we often receive Excel tables with data in the same format. If we want to analyze these data, it can be difficult to analyze them because the data is spread across multiple Excel files...
0
by: emmanuelkatto | last post by:
Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel
1
by: nemocccc | last post by:
hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?
1
by: Sonnysonu | last post by:
This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...
0
by: Hystou | last post by:
There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...
0
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
0
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...
0
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.