
Spell checking algorithms

I am trying to write a spelling checker, specifically the part that
suggests words. From what I've read online, it seems like the
preferred way to do this is to find the Double Metaphone equivalent
of a misspelled word, then to compute the misspelled word's distance
from other words which have the same, or a similar, metaphone
equivalent.
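
A rough sketch of the lookup side of that idea (illustrative only; it
assumes the Apache Commons Codec DoubleMetaphone class is on the
classpath, and the candidates it returns would still have to be ranked
by edit distance, as in the brute-force version further down):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.commons.codec.language.DoubleMetaphone;

public class PhoneticIndex {
    private final DoubleMetaphone encoder = new DoubleMetaphone();
    private final Map<String, List<String>> byKey = new HashMap<>();

    // Index every dictionary word under its primary Double Metaphone key.
    public void add(String word) {
        String key = encoder.doubleMetaphone(word);
        byKey.computeIfAbsent(key, k -> new ArrayList<>()).add(word);
    }

    // Candidate corrections: dictionary words that share the misspelling's
    // phonetic key. These still need to be ranked by edit distance.
    public List<String> candidates(String misspelled) {
        String key = encoder.doubleMetaphone(misspelled);
        return byKey.getOrDefault(key, new ArrayList<>());
    }
}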

I've run into some problems with this approach. For example,
"accommodation" has a metaphone code of AKMT, while "acocmmodation"
has a metaphone code of AKKM. These are different enough that my
implementation doesn't even try to see if the two words are close
together, although it's clear that they are.

Frustrated by this, I tried to compute the Levenshtein distance
between "acocmmodation" and every other word in my dictionary file.
This was surprisingly fast, and the correct suggestion appeared. This
makes me wonder: what is the point of doing all of the Double Metaphone
stuff, as opposed to just a brute-force search? Why do all of the spell
checking algorithms first convert a misspelled word into a phonetic
representation before comparing it to other words? Is it because speed
used to be an issue on slower machines, or am I missing something? For
reference, when I say that the brute-force search was 'surprisingly
fast', it took about 75 milliseconds to compute the Levenshtein distance
from "acocmmodation" to every word in a dictionary of 115K words.
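
For the curious, the brute-force check is nothing more than a scan like
the following (a sketch; the dictionary path and class name are just
placeholders):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class BruteForceSuggest {

    // Standard dynamic-programming Levenshtein distance, two rows at a time.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) throws IOException {
        // One dictionary word per line; "words.txt" is just an example path.
        List<String> dictionary = Files.readAllLines(Paths.get("words.txt"));
        String misspelled = "acocmmodation";

        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String word : dictionary) {
            int d = levenshtein(misspelled, word.toLowerCase());
            if (d < bestDistance) {
                bestDistance = d;
                best = word;
            }
        }
        System.out.println(best + " (distance " + bestDistance + ")");
    }
}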

Leo
Jul 17 '05 #1
4 Replies


Back to the good ol' days of algorithm design. Today, everything seems to
be focused on web design and J2EE.

My question is: why are you writing your own spelling checker, if it's not
to improve what's already available?

- perry

Jul 17 '05 #2

Leo P.

Metaphone is designed to find words that sound similar (with a fuzzy
definition of similarity) but are spelled differently. This allows the
programmer to match Kageonne to Cajun but not Cajun to Cigna (Soundex
matched the last two, which is one of the reasons that Metaphone was
developed). When people attempt to write a word whose spelling they are
not sure of, they tend to try to "sound it out". Your example,
"acocmmodation" for "accommodation", is clearly a typing error, and not
something that Metaphone is designed to find. Levenshtein distance will
find this error, but not errors where the user heard one thing and
spelled it very differently while still attempting to get the typed word
to match what they heard. You need both types of matches, edit distance
as well as phonetic similarity, to cover the two most common forms of
misspellings.
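
In outline, a suggester that covers both cases might look something like
this (only a sketch; DoubleMetaphone is the Apache Commons Codec
implementation, and the distance threshold is an arbitrary choice):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.commons.codec.language.DoubleMetaphone;

public class CombinedSuggester {
    private final DoubleMetaphone encoder = new DoubleMetaphone();
    private final List<String> dictionary;

    public CombinedSuggester(List<String> dictionary) {
        this.dictionary = dictionary;
    }

    // Keep words that either sound like the input (same Double Metaphone key,
    // catching "sounded it out" errors) or sit within a small edit distance
    // of it (catching typing errors), then rank everything by edit distance.
    public List<String> suggest(String input, int maxDistance) {
        String key = encoder.doubleMetaphone(input);
        Map<String, Integer> scored = new HashMap<>();
        for (String word : dictionary) {
            int d = levenshtein(input, word);
            boolean soundsAlike = encoder.doubleMetaphone(word).equals(key);
            if (soundsAlike || d <= maxDistance) {
                scored.put(word, d);
            }
        }
        List<String> ranked = new ArrayList<>(scored.keySet());
        ranked.sort(Comparator.comparingInt(scored::get));
        return ranked;
    }

    // Plain dynamic-programming edit distance.
    static int levenshtein(String a, String b) {
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) dp[i][0] = i;
        for (int j = 0; j <= b.length(); j++) dp[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                dp[i][j] = Math.min(Math.min(dp[i - 1][j] + 1, dp[i][j - 1] + 1),
                                    dp[i - 1][j - 1] + cost);
            }
        }
        return dp[a.length()][b.length()];
    }
}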

-Lawrence Philips


Jul 17 '05 #3

I'm writing a Java program that needs a spell checker. I looked
around, and while Jazzy looks pretty good, I didn't want to deal with
the licensing restrictions. Plus, I've always enjoyed algorithm design
and implementation, and figured implementing a spell checker would be
interesting (it is).

Leo


Jul 17 '05 #4

Thank you for the great example (Cajun). I now understand why both
types of spelling checks are necessary.

Thanks,
Leo


Jul 17 '05 #5
