Using Soundex (OT?)

Hi,

I'm curious about Soundex. All I know is that it's a way of doing
spelling-error-tolerant word matching. What I want to know is whether the
Soundex algorithm was made exclusively for English, or whether it can be
used for any arbitrary language with satisfactory performance (by
'satisfactory performance' I mean that it detects at least 80% of spelling
errors). What about PHP's Soundex support?

TIA
Jul 17 '05 #1
On 05 Feb 2005 19:09:04 GMT, Ricky Romaya <so*******@somewhere.com> wrote:
> I'm curious about Soundex. All I know is that it's a way of doing
> spelling-error-tolerant word matching. What I want to know is whether the
> Soundex algorithm was made exclusively for English, or whether it can be
> used for any arbitrary language with satisfactory performance (by
> 'satisfactory performance' I mean that it detects at least 80% of spelling
> errors). What about PHP's Soundex support?


Soundex is for English words, based on English pronunciation rules. See:
http://en.wikipedia.org/wiki/Soundex

There's also a reference there to Metaphone, which is supposedly better, but
also English-based.
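
On the PHP side of the question: soundex() and metaphone() are both built-in
string functions, so no extension is needed. A minimal sketch (the sample
names are only illustrative):

```php
<?php
// Both functions are part of PHP's standard string library.
$a = 'Robert';
$b = 'Rupert';

// soundex() returns a letter plus three digits; equal keys mean "sounds alike".
var_dump(soundex($a), soundex($b), soundex($a) === soundex($b));

// metaphone() returns a variable-length key built from English pronunciation rules.
var_dump(metaphone($a), metaphone($b));
```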

--
Andy Hassall / <an**@andyh.co.uk> / <http://www.andyh.co.uk>
<http://www.andyhsoftware.co.uk/space> Space: disk usage analysis tool
Jul 17 '05 #2
Andy Hassall wrote:
> Soundex is for English words, based on English pronunciation rules.
> See: http://en.wikipedia.org/wiki/Soundex


You *can* of course cook up your own Soundex functions with values
based on other languages; the algorithm is very simple. For some
languages that might be rather easy, but possibly not worth the effort:
though the original algorithm is for English, it will work "quite
well" for many other languages too.
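
A rough sketch of what such a home-made variant could look like. The helper
name soundex_with_map() and the tweaked table are hypothetical, not anything
from PHP itself; the only change from the English grouping is that W is
coded like V, which is an illustrative guess rather than a vetted table for
any particular language. Unlike the built-in soundex(), this version gives H
and W no special treatment: any letter missing from the table simply
separates the coded letters, the way a vowel does.

```php
<?php
// Hypothetical helper: Soundex-style encoder with a caller-supplied
// letter-to-digit table. Letters not in the table act as separators.
function soundex_with_map(string $word, array $codes, int $length = 4): string
{
    // ASCII letters only; a real non-English variant would need
    // multibyte handling or transliteration first.
    $word = strtoupper(preg_replace('/[^A-Za-z]/', '', $word));
    if ($word === '') {
        return '';
    }
    $result = $word[0];                  // the first letter is kept as-is
    $last = $codes[$word[0]] ?? '';
    for ($i = 1; $i < strlen($word) && strlen($result) < $length; $i++) {
        $code = $codes[$word[$i]] ?? '';
        if ($code !== '' && $code !== $last) {
            $result .= $code;            // new sound group: emit its digit
        }
        $last = $code;                   // uncoded letters reset the run
    }
    return str_pad($result, $length, '0');
}

// The classic English grouping, written out so it can be swapped.
$english = [];
$groups = ['BFPV' => '1', 'CGJKQSXZ' => '2', 'DT' => '3',
           'L' => '4', 'MN' => '5', 'R' => '6'];
foreach ($groups as $letters => $digit) {
    foreach (str_split($letters) as $letter) {
        $english[$letter] = $digit;
    }
}

// Illustrative tweak only: code W like V.
$tweaked = $english;
$tweaked['W'] = '1';

var_dump(soundex_with_map('Schwarz', $english), soundex_with_map('Schvarz', $english));
var_dump(soundex_with_map('Schwarz', $tweaked), soundex_with_map('Schvarz', $tweaked));
```

With the English table the two spellings get different keys; with the
tweaked table they collapse into one.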

It's worth noting that soundex (and similar functions) only work
on individual words, and that it isn't really meant for
detecting spelling errors. The best use for soundex is when you're
searching for names, addresses or the like and don't know how something is
actually written, but know what it sounds like. You can store the
soundex values in the database alongside the other data, and when you do
a search, you first look for the exact string the user entered. If
that doesn't return enough results, you compute the soundex value of
the user input and search on that. This way you get results that "sound
the same", so they're probably close to what you were really looking
for. I think a similar approach is used by the search engine at
www.php.net (I can't be certain, but it seems like it; see
http://fi.php.net/manual-lookup.php?pattern=sundeks for example :)
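
The exact-match-then-soundex lookup described above could be sketched like
this. An in-memory array stands in for the database table, find_names() is a
hypothetical helper name, and "not enough results" is simplified to "no
results at all":

```php
<?php
// Hypothetical stand-in for a table that stores each name together
// with its precomputed soundex() value.
$names = ['Schneider', 'Snyder', 'Smith', 'Schmidt', 'Leong'];
$rows = [];
foreach ($names as $name) {
    $rows[] = ['name' => $name, 'sx' => soundex($name)];
}

// Exact match first; fall back to the soundex key only when that finds nothing.
function find_names(array $rows, string $query): array
{
    $exact = [];
    $similar = [];
    $key = soundex($query);
    foreach ($rows as $row) {
        if (strcasecmp($row['name'], $query) === 0) {
            $exact[] = $row['name'];
        } elseif ($row['sx'] === $key) {
            $similar[] = $row['name'];
        }
    }
    return $exact !== [] ? $exact : $similar;
}

print_r(find_names($rows, 'Smith'));    // exact hit only
print_r(find_names($rows, 'Sznyder'));  // no exact hit, so sound-alikes
```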

--
Markku Uttula

Jul 17 '05 #3
Markku Uttula wrote:
> http://fi.php.net/manual-lookup.php?pattern=sundeks for example:)


I hate to comment on my own postings, but I need to add that the php.net
manual page for soundex() is well worth reading. It also has links to
some other functions (metaphone and levenshtein) that might prove
useful.

--
Markku Uttula

Jul 17 '05 #4
"Ricky Romaya" <so*******@somewhere.com> wrote in message
news:Xn********************************@66.250.146 .159...
> Hi,
>
> I'm curious about Soundex. All I know is that it's a way of doing
> spelling-error-tolerant word matching. What I want to know is whether the
> Soundex algorithm was made exclusively for English, or whether it can be
> used for any arbitrary language with satisfactory performance (by
> 'satisfactory performance' I mean that it detects at least 80% of spelling
> errors). What about PHP's Soundex support?
>
> TIA


Soundex is really only good for surnames. You can't use it for general text
search since it would yield too many irrelevant results. It was designed for
grouping similar surnames, not for handling typos. Names that are
spelled very differently can end up with the same value. For example,
Sznyder, Schneider, and Snyder are all given S536, while Smith, Smit, and
Schmidt get S530.

Soundex can handle surnames of foreign origin. For example, the variants of
my own--Leong, Leung, Liang, Long--all have the same soundex value.
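
The groupings above can be reproduced with PHP's built-in soundex(). A small
sketch that buckets the surnames mentioned in this post by their key:

```php
<?php
// Group surnames by their soundex() key; names sharing a key are
// treated as variants of one another.
$surnames = ['Sznyder', 'Schneider', 'Snyder', 'Smith', 'Smit', 'Schmidt',
             'Leong', 'Leung', 'Liang', 'Long'];

$groups = [];
foreach ($surnames as $surname) {
    $groups[soundex($surname)][] = $surname;
}

foreach ($groups as $key => $members) {
    echo $key, ': ', implode(', ', $members), "\n";
}
// S536: Sznyder, Schneider, Snyder
// S530: Smith, Smit, Schmidt
// L520: Leong, Leung, Liang, Long
```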
Jul 17 '05 #5
Chung Leong wrote:
> Soundex is really only good for surnames. You can't use it for general text
> search since it would yield too many irrelevant results. It was designed for
> grouping similar surnames, not for handling typos. Names that are
> spelled very differently can end up with the same value. For example,
> Sznyder, Schneider, and Snyder are all given S536, while Smith, Smit, and
> Schmidt get S530.
>
> Soundex can handle surnames of foreign origin. For example, the variants of
> my own--Leong, Leung, Liang, Long--all have the same soundex value.


I found that a combination of the metaphone and Levenshtein functions works
better for first names -- I'm using it to suggest alternatives in a
dictionary here:

<http://www.japanesetranslator.co.uk/your-name-in-japanese/>

It's supposed to be a dictionary of English names, but a lot of them are
actually of foreign origin (like most "English" names, I guess).

If I remember correctly, the Soundex function was a bit too clumsy and threw
out hundreds of alternatives for some unrecognized spellings, and none for
others.

Instead I use the metaphone function to search for possible alternatives,
and then sort them based on their Levenshtein distance from the search term.
It works pretty well.
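
A minimal sketch of that approach, using only PHP's built-in metaphone() and
levenshtein(). The helper name suggest_names() and the tiny dictionary are
made up for illustration; this is not the code behind the site above:

```php
<?php
// Hypothetical helper: keep dictionary entries whose metaphone() key
// matches the query's, then order them by levenshtein() distance.
function suggest_names(string $query, array $dictionary, int $limit = 5): array
{
    $key = metaphone($query);
    $distances = [];
    foreach ($dictionary as $name) {
        if (metaphone($name) === $key) {
            $distances[$name] = levenshtein(strtolower($query), strtolower($name));
        }
    }
    asort($distances);  // smallest edit distance first
    return array_slice(array_keys($distances), 0, $limit);
}

$dictionary = ['Catherine', 'Katherine', 'Kathryn', 'Stephen', 'Steven'];
print_r(suggest_names('Kathrine', $dictionary));
```

Matching on the metaphone key keeps the candidate list short, and the
Levenshtein sort puts the spellings closest to the input at the top.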

--
phil [dot] ronan @ virgin [dot] net
http://vzone.virgin.net/phil.ronan/
Jul 17 '05 #6
"Markku Uttula" <ma***********@disconova.com> wrote in news:rVfNd.1805
$U*******@reader1.news.jippii.net:
> Markku Uttula wrote:
> > http://fi.php.net/manual-lookup.php?pattern=sundeks for example:)
>
> I hate to comment on my own postings, but I need to add that the php.net
> manual page for soundex() is well worth reading. It also has links to
> some other functions (metaphone and levenshtein) that might prove
> useful.

Well, could someone suggest a way to mimic Google's 'suggested
keyword' functionality that works across different languages? I've done
some reading about soundex, metaphone, and levenshtein, which IMHO are
designed exclusively for English.

Also, I've read about aspell & pspell in the PHP manual. Sadly, it doesn't
work on the win32 platform (not to mention it's an additional module,
which I don't have the authority to install). Any way to simulate them in
pure PHP?

TIA
Jul 17 '05 #7
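
One pure-PHP approximation of a 'suggested keyword' lookup, offered only as
a sketch: compare the input against a list of known keywords with the
built-in levenshtein() and return the closest one. Unlike soundex and
metaphone, Levenshtein distance only counts character edits and has no
pronunciation rules built in. The helper name did_you_mean(), the keyword
list, and the distance cutoff are assumptions for illustration. PHP's
levenshtein() works on bytes, so multibyte text would count some single
characters as several edits.

```php
<?php
// Hypothetical helper: return the known keyword closest to $input by
// levenshtein() distance, or null when nothing is within $maxDistance.
function did_you_mean(string $input, array $keywords, int $maxDistance = 3): ?string
{
    $best = null;
    $bestDistance = $maxDistance + 1;
    foreach ($keywords as $keyword) {
        $distance = levenshtein(strtolower($input), strtolower($keyword));
        if ($distance < $bestDistance) {
            $best = $keyword;
            $bestDistance = $distance;
        }
    }
    return $best;
}

$keywords = ['soundex', 'metaphone', 'levenshtein', 'similar_text'];
var_dump(did_you_mean('sundeks', $keywords));  // query borrowed from this thread
```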
