regex-strategy for finding *similar* words?

Hi all,

an interesting problem for regex nerds.
I've got a thesaurus of some hundred words and a moderately large
dataset of about 1 million words in some thousand small texts. Words
from the thesaurus appear at many places in my texts, but they are
often misspelled, just slightly different from the thesaurus.

Now I'm looking for the best strategy to match the appearance of my
thesaurus items in the texts. Do I have to build patterns for the
expected misspellings of every thesaurus item, or is there a more
general approach? I tried appending '?.?' to each letter of a
thesaurus item (sketched below), but this is much too weak: I get a
lot of false positives, because such an expression matches practically
any substring. Adding word boundaries helps a little, but sometimes a
concept spans two words, and that can be spelled with a space char
*or* a dash. In this case, I'm back to matching almost everything.
Any ideas?
BTW, efficiency is not absolutely required; this isn't meant to serve
realtime requests.
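
To make that concrete, the pattern I build looks roughly like this
(a simplified sketch; the real code differs in details):

    import re

    def loose_pattern(term):
        # Every letter becomes optional and may be followed by any optional
        # character - which is exactly why the pattern over-matches.
        return ''.join(re.escape(ch) + '?.?' for ch in term)

    print(loose_pattern('city'))          # c?.?i?.?t?.?y?.?
    print(bool(re.search(loose_pattern('city'), 'completely unrelated text')))  # True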

TIA,
best regards,
Christoph
Jul 18 '05 #1
Christoph Pingel schrieb:
Hi all,

an interesting problem for regex nerds.
I've got a thesaurus of some hundred words and a moderately large
dataset of about 1 million words in some thousand small texts. Words
from the thesaurus appear at many places in my texts, but they are often
misspelled, just slightly different from the thesaurus.


You could set up a list of misspelling cases, scan a word for them (e.g.
"citti") and turn it into a regex by applying the suitable misspelling cases.
But this is cumbersome. It is probably better to use a string distance,
defined as the least number of operations (add, delete, replace, exchange)
needed to map one string onto another.

Search for '"Levenshtein distance" python' and find e.g.

http://trific.ath.cx/resources/python/levenshtein/
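
If you want to avoid an extra dependency, a minimal sketch of the classic
dynamic-programming Levenshtein distance (insert, delete, replace only, no
transpositions) looks roughly like this:

    def levenshtein(a, b):
        # Classic dynamic programming, keeping one row at a time.
        if len(a) < len(b):
            a, b = b, a
        previous = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            current = [i]
            for j, cb in enumerate(b, 1):
                current.append(min(previous[j] + 1,                 # deletion
                                   current[j - 1] + 1,              # insertion
                                   previous[j - 1] + (ca != cb)))   # substitution
            previous = current
        return previous[-1]

    print(levenshtein("citti", "city"))   # 2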

--
-------------------------------------------------------------------
Peter Maas, M+R Infosysteme, D-52070 Aachen, Tel +49-241-93878-0
E-mail 'cGV0ZXIubWFhc0BtcGx1c3IuZGU=\n'.decode('base64')
-------------------------------------------------------------------
Jul 18 '05 #2
Christoph Pingel wrote:
an interesting problem for regex nerds.
I've got a thesaurus of some hundred words and a moderately large
dataset of about 1 million words in some thousand small texts. Words
from the thesaurus appear at many places in my texts, but they are often
misspelled, just slightly different from the thesaurus.


There exists the agrep project (http://www.tgries.de/agrep/), for which
Python bindings exist. agrep (=approximate grep) allows you to specify
the number of allowed errors.

Daniel
Jul 18 '05 #3
Am Thu, 18 Nov 2004 13:20:08 +0100 schrieb Christoph Pingel:
Hi all,

an interesting problem for regex nerds.
I've got a thesaurus of some hundred words and a moderately large
dataset of about 1 million words in some thousand small texts. Words
from the thesaurus appear at many places in my texts, but they are
often misspelled, just slightly different from the thesaurus.


Hi,

You can write a method which takes a single word
and returns a normalized version.

normalize("...ies") --> "...y"

normalize("running") --> "run"

Build a big dictionary which maps each normalized word
to a list of files where it occurs. Only
add normalized words to the dictionary (or database).

bigdict={"foo": ["file1.txt", "file2.txt", ...]}
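
A rough sketch (the normalization rules here are only placeholders; a real
stemmer, e.g. a Porter stemmer, would handle cases like "running" -> "run"
properly):

    import re
    from collections import defaultdict

    def normalize(word):
        # Placeholder rules only - swap in a real stemmer for serious use.
        word = word.lower()
        if word.endswith("ies"):
            return word[:-3] + "y"
        return word

    def build_index(filenames):
        # Map each normalized word to the list of files it occurs in.
        bigdict = defaultdict(list)
        for name in filenames:
            with open(name) as f:
                for word in re.findall(r"[A-Za-z]+", f.read()):
                    key = normalize(word)
                    if name not in bigdict[key]:
                        bigdict[key].append(name)
        return bigdict

    # bigdict = build_index(["file1.txt", "file2.txt"])
    # bigdict["city"] -> ["file1.txt", ...]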

HTH,
Thomas
Jul 18 '05 #4
Thank you, Levenshtein distance and agrep both sound interesting;
I think I'll try both.

best,
Christoph
Jul 18 '05 #5
Peter Maas <pe***@somewhere.com> wrote in message news:<cn**********@swifty.westend.com>...
Christoph Pingel schrieb:
Hi all,

an interesting problem for regex nerds.
I've got a thesaurus of some hundred words and a moderately large
dataset of about 1 million words in some thousand small texts. Words
from the thesaurus appear at many places in my texts, but they are often
misspelled, just slightly different from the thesaurus.


You could set up a list of misspelling cases, scan a word for them (e.g.
"citti") and turn it into a regex by applying the suitable misspelling cases.
But this is cumbersome. It is probably better to use a string distance,
defined as the least number of operations (add, delete, replace, exchange)
needed to map one string onto another.

Search for '"Levenshtein distance" python' and find e.g.

http://trific.ath.cx/resources/python/levenshtein/


Forget regexes. A string distance function is a helpful start. Firstly
you need a method of parsing your input text into words or phrases.
Secondly you need a fuzzy dictionary into which you load your
thesaurus. This fuzzy dictionary will have a method like
fuzzy_dict.approx_lookup('mispeled', 2) which would return you a list
of words in the dictionary that had a (say) Levenshtein distance of 2
or less from your input word. Algorithms for fuzzy dictionaries:
google for "Burkhard Keller tree" and "permuted lexicon".
Alternatively you could look for a spelling checker with publicly
available source and grab some ideas or even code. You might be
interested in string distance measures other than Levenshtein distance,
e.g. the so-called Damerau distance, which counts transpositions. There
are faster algorithms available than the dynamic-programming one --
google "Heikki Hyyro". These allow you to preprocess your input word
and then compare it with multiple dictionary candidates in O(kn) instead
of O(kmn) time, where the input word has length m and you compare with
k dictionary words of average length n.
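
To make the fuzzy-dictionary idea concrete, here is a minimal
Burkhard-Keller tree sketch. It assumes a true metric such as the
levenshtein() function sketched earlier in this thread; everything else
(names, structure) is just illustrative:

    class BKTree:
        # Minimal Burkhard-Keller tree for approximate lookup in a metric space.
        def __init__(self, distance, words=()):
            self.distance = distance
            self.root = None              # node = [word, {edge_distance: child}]
            for w in words:
                self.add(w)

        def add(self, word):
            if self.root is None:
                self.root = [word, {}]
                return
            node = self.root
            while True:
                d = self.distance(word, node[0])
                child = node[1].get(d)
                if child is None:
                    node[1][d] = [word, {}]
                    return
                node = child

        def approx_lookup(self, word, max_dist):
            # Return all stored words within max_dist of word.
            results = []
            stack = [self.root] if self.root is not None else []
            while stack:
                node = stack.pop()
                d = self.distance(word, node[0])
                if d <= max_dist:
                    results.append(node[0])
                # Triangle inequality: only subtrees whose edge label lies in
                # [d - max_dist, d + max_dist] can contain candidates.
                for edge, child in node[1].items():
                    if d - max_dist <= edge <= d + max_dist:
                        stack.append(child)
            return results

    # fuzzy_dict = BKTree(levenshtein, ["city", "cities", "citizen"])
    # fuzzy_dict.approx_lookup("citti", 2)   # -> ['city']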

HTH,
John
Jul 18 '05 #6
There
are faster algorithms available than the dynamic-programming one --
google "Heikki Hyyro". These allow you to preprocess your input word
and then compare with multiple dictionary candidates in O(kn) instead
of O(kmn) time where the input word has length m, and you compare with
k dictionary words each of average length n.


John,

thanks for the rich input!
I found the Hyyro/Navarro algorithm, and it looks interesting.
First I will give Python's own difflib.SequenceMatcher class a try -
the project here doesn't necessarily need the fastest algorithm, but
coding time is an issue. I found that a match ratio slightly above
0.85 captures most of my misspelling cases without producing errors.
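
Roughly what I'm doing now (simplified sketch, names are placeholders):

    import difflib

    THRESHOLD = 0.85   # ratio that worked reasonably well for my data

    def find_matches(thesaurus, words):
        # Yield (word_from_text, thesaurus_term, ratio) for similar pairs.
        for word in words:
            for term in thesaurus:
                ratio = difflib.SequenceMatcher(None, word.lower(), term.lower()).ratio()
                if ratio > THRESHOLD:
                    yield word, term, ratio

    words = "the assesment of riskmanagement practices".split()
    thesaurus = ["assessment", "risk management"]
    for match in find_matches(thesaurus, words):
        print(match)
    # ('assesment', 'assessment', 0.947...)
    # ('riskmanagement', 'risk management', 0.965...)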

best,
Christoph

Jul 18 '05 #7
