Approximate string matching

I'm looking for a library that does approximate string matching:
for example, searching a dictionary for the word "motorcycle" and
also getting back similar strings like "motorcicle".

Is there such a library?

Nov 22 '05 #1
This algorithm is called soundex. Here is one implementation example.

http://aspn.activestate.com/ASPN/Coo...n/Recipe/52213

here is another:
http://effbot.org/librarybook/soundex.htm

Nov 22 '05 #2
el*******@hotmail.com wrote:
This algorithm is called soundex. Here is one implementation example.

http://aspn.activestate.com/ASPN/Coo...n/Recipe/52213

here is another:
http://effbot.org/librarybook/soundex.htm


Soundex is *one* particular algorithm for approximate
string matching. It is optimised for matching
Anglo-American names (like Smith/Smythe), and is
considered to be quite old and obsolete for all but the
most trivial applications -- or so I'm told.

Soundex will not match arbitrary changes -- it will
match both cat and cet, but it won't match cat and mat.

A more sophisticated approximate string matching
algorithm will use the Levenshtein distance. You can
find a Useless implementation here:

http://www.uselesspython.com/download.php?script_id=108

Given a function levenshtein(s1, s2) that returns the
distance between two strings, you could use it for
approximate matching like this:

def approx_matching(strlist, target, dist=1):
    """Return the strings in strlist that approximately
    match a target string.

    Returns a list of strings, where each string
    matched is no further than an edit distance of
    dist from the target.
    """
    found = []
    for s in strlist:
        # Keep every candidate within the allowed edit distance.
        if levenshtein(s, target) <= dist:
            found.append(s)
    return found
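
The levenshtein(s1, s2) helper assumed above is not defined in this post. As a minimal sketch only (not the implementation behind the link), the standard dynamic-programming edit distance could look like this; the word list at the end is made up for illustration:

```python
def levenshtein(s1, s2):
    """Return the edit distance between s1 and s2.

    Counts single-character insertions, deletions and substitutions.
    Only two rows of the usual distance table are kept in memory.
    """
    if len(s1) < len(s2):
        s1, s2 = s2, s1  # keep the shorter string on the inner loop

    # Distance from the empty prefix of s1 to each prefix of s2.
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]  # distance from s1[:i] to the empty string
        for j, c2 in enumerate(s2, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            replace_cost = previous[j - 1] + (c1 != c2)
            current.append(min(insert_cost, delete_cost, replace_cost))
        previous = current
    return previous[-1]


# Hypothetical word list, filtered the same way approx_matching does.
words = ["motorcycle", "motorcar", "bicycle"]
close = [w for w in words if levenshtein(w, "motorcicle") <= 1]
```

Each call costs time proportional to len(s1) * len(s2), so scanning a large dictionary this way is slow; it is meant to show the idea, not to be fast.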

--
Steven.

Nov 22 '05 #4
"javuchi" <ja*****@gmail. com> wrote:

I'm looking for a library that does approximate string matching:
for example, searching a dictionary for the word "motorcycle" and
also getting back similar strings like "motorcicle".

Is there such a library?


There is an algorithm called Soundex that replaces each word by a
4-character string, such that all words that are pronounced similarly
encode to the same string.

The algorithm is easy to implement; you can probably find one by Googling.
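
As a rough illustration of the encoding described above, here is a minimal sketch of the common American Soundex rules in Python. The names soundex and CODES are my own and are not taken from any of the implementations linked in this thread, and Soundex variants differ in details such as how h and w are handled:

```python
# Map each consonant to its Soundex digit; vowels, h, w and y have no code.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return a 4-character Soundex code, or "" if word has no ASCII letters."""
    letters = [ch for ch in word.lower() if "a" <= ch <= "z"]
    if not letters:
        return ""

    first = letters[0]
    encoded = first.upper()          # the first letter is kept as-is
    prev = CODES.get(first, "")      # its code still suppresses a repeat
    for ch in letters[1:]:
        if ch in "hw":
            continue                 # h and w do not separate equal codes
        code = CODES.get(ch, "")
        if code and code != prev:
            encoded += code
        prev = code                  # a vowel resets prev to ""
    return (encoded + "000")[:4]     # pad with zeros, truncate to 4


print(soundex("cat"), soundex("cet"), soundex("mat"))  # C300 C300 M300
print(soundex("motorcycle") == soundex("motorcicle"))  # True
```

Because the first letter is kept verbatim, "cat" and "mat" get different codes even though they differ by a single letter.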
--
- Tim Roberts, ti**@probo.com
Providenza & Boekelheide, Inc.
Nov 22 '05 #6
javuchi wrote:
I'm looking for a library that does approximate string matching:
for example, searching a dictionary for the word "motorcycle" and
also getting back similar strings like "motorcicle".

Is there such a library?


agrep (approximate grep) tolerates a certain number of errors, and
Python bindings exist (http://www.bio.cam.ac.uk/~mw263/pyagrep.html).

Or google for "agrep python".

Daniel

Nov 22 '05 #8
Tim Roberts wrote:
I'm looking for a library that does approximate string matching:
for example, searching a dictionary for the word "motorcycle" and
also getting back similar strings like "motorcicle".

Is there such a library?


There is an algorithm called Soundex that replaces each word by a
4-character string, such that all words that are pronounced similarly
encode to the same string.

The algorithm is easy to implement; you can probably find one by Googling.


Python used to ship with a soundex module, but it was removed
in 1.6, for various reasons. Here's a replacement:

http://orca.mojam.com/~skip/python/soundex.py

</F>

Nov 22 '05 #10
