473,695 Members | 2,187 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

regex-strategy for finding *similar* words?

Hi all,

an interesting problem for regex nerds.
I've got a thesaurus of some hundred words and a moderately large
dataset of about 1 million words in some thousand small texts. Words
from the thesaurus appear at many places in my texts, but they are
often misspelled, just slightly different from the thesaurus.

Now I'm looking for the best strategy to match the appearence of my
thesaurus items in the texts. Do I have to build patterns from all my
thesaurus items for the expected misspellings, or is there a more
general approach possible? I tried to add '?.?' to each letter in a
thesaurus item, but this is much too weak (I get a lot of false
positives because this expression for example matches any possible
substring). Adding word boundries helps a little, but sometimes a
concept spans two words, and this could be spelled with a space char
*or* a dash. In this case, I'm back to matching almose everything.
Any ideas?
BTW, efficiency is not absolutely required, it's not meant to work
for realtime requests.

TIA,
best regards,
Christoph
Jul 18 '05 #1
6 4047
Christoph Pingel schrieb:
Hi all,

an interesting problem for regex nerds.
I've got a thesaurus of some hundred words and a moderately large
dataset of about 1 million words in some thousand small texts. Words
from the thesaurus appear at many places in my texts, but they are often
misspelled, just slightly different from the thesaurus.


You could set up a list of misspelling cases, scan a word for it e.g.
citti and turn it into a regex by applying suitable misspelling cases
But this is cumbersome. It is probably better to use a string distance
defined by the least number of operations (add,delete, replace, exchange)
to map one string onto another.

Search for '"Levenshtei n distance" python' and find e.g.

http://trific.ath.cx/resources/python/levenshtein/

--
-------------------------------------------------------------------
Peter Maas, M+R Infosysteme, D-52070 Aachen, Tel +49-241-93878-0
E-mail 'cGV0ZXIubWFhc0 BtcGx1c3IuZGU=\ n'.decode('base 64')
-------------------------------------------------------------------
Jul 18 '05 #2
Christoph Pingel wrote:
an interesting problem for regex nerds.
I've got a thesaurus of some hundred words and a moderately large
dataset of about 1 million words in some thousand small texts. Words
from the thesaurus appear at many places in my texts, but they are often
misspelled, just slightly different from the thesaurus.


There exists the agrep project (http://www.tgries.de/agrep/), for which
Python bindings exist. agrep (=approximate grep) allows you to specify
the number of allowed errors.

Daniel
Jul 18 '05 #3
Am Thu, 18 Nov 2004 13:20:08 +0100 schrieb Christoph Pingel:
Hi all,

an interesting problem for regex nerds.
I've got a thesaurus of some hundred words and a moderately large
dataset of about 1 million words in some thousand small texts. Words
from the thesaurus appear at many places in my texts, but they are
often misspelled, just slightly different from the thesaurus.


Hi,

You can write a method which takes a single word,
and returns a normalized version.

normalize("...i es") --> "...y"

normalize("runn ing") --> "run"

Build a big dictionary which maps each word
to a list of files where they occur. Only
add normalized words to the dictionary (or database).

bigdict={"foo": ["file1.txt" , "file2.txt" , ...]}

HTH,
Thomas
Jul 18 '05 #4
Thank you, Levensthein distance and agrep sound both interesting,
I'll try both I think.

best,
Christoph
Jul 18 '05 #5
Peter Maas <pe***@somewher e.com> wrote in message news:<cn******* ***@swifty.west end.com>...
Christoph Pingel schrieb:
Hi all,

an interesting problem for regex nerds.
I've got a thesaurus of some hundred words and a moderately large
dataset of about 1 million words in some thousand small texts. Words
from the thesaurus appear at many places in my texts, but they are often
misspelled, just slightly different from the thesaurus.


You could set up a list of misspelling cases, scan a word for it e.g.
citti and turn it into a regex by applying suitable misspelling cases
But this is cumbersome. It is probably better to use a string distance
defined by the least number of operations (add,delete, replace, exchange)
to map one string onto another.

Search for '"Levenshtei n distance" python' and find e.g.

http://trific.ath.cx/resources/python/levenshtein/


Forget regexes. A string distance function is a helpful start. Firstly
you need a method of parsing your input text into words or phrases.
Secondly you need a fuzzy dictionary into which you load your
thesaurus. This fuzzy dictionary will have a method like
fuzzy_dict.appr ox_lookup('misp eled', 2) which would return you a list
of words in the dictionary that had a (say) Levenshtein distance of 2
or less from your input word. Algorithms for fuzzy dictionaries:
google for "Burkhard Keller tree" and "permuted lexicon".
Alternatively you could look for a spelling checker with publicly
available source and grab some ideas or even code. You might be
interested in other string distance measures than Levenshtein distance
e.g. the so-called Damerau distance which counts transpositions. There
are faster algorithms available than the dynamic-programming one --
google "Heikki Hyyro". These allow you to preprocess your input word
and then compare with multiple dictionary candidates in O(kn) instead
of O(kmn) time where the input word has length m, and you compare with
k dictionary words each of average length n.

HTH,
John
Jul 18 '05 #6
>There
are faster algorithms available than the dynamic-programming one --
google "Heikki Hyyro". These allow you to preprocess your input word
and then compare with multiple dictionary candidates in O(kn) instead
of O(kmn) time where the input word has length m, and you compare with
k dictionary words each of average length n.


John,

thanks for the rich input!
I found the Hyyro/Navarro algorithm, and this looks interesting.
First I will give python's own difflib.Sequenc eMatcher class a try -
the project here doesn't necessarily need the fastest algorithm,
coding time is an issue however. I found that a match ratio slightly
above 0.85 captures most of my misspelling cases without producing
errors.

best,
Christoph

Jul 18 '05 #7

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

3
2082
by: Jon Maz | last post by:
Hi All, Am getting frustrated trying to port the following (pretty simple) function to CSharp. The problem is that I'm lousy at Regular Expressions.... //from http://support.microsoft.com/default.aspx?scid=kb;EN-US;246800 function fxnParseIt() { var sInputString = 'asp and database';
4
9751
by: aevans1108 | last post by:
expanding this message to microsoft.public.dotnet.xml Greetings Please direct me to the right group if this is an inappropriate place to post this question. Thanks. I want to format a numeric value according to an arbitrary regular expression.
6
1772
by: tshad | last post by:
Is there a way to use Regex inside of a tag, such as asp:label? I tried something like this but can't make it work: <asp:label id="Phone" text=Regex.Replace('<%# Container.DataItem("Phone") %>',"(\d{3})(\d{3})(\d{4})","($1) $2-$3") runat="server"/> I have this inside my Repeater and want it to filter the field during bind. I can do it afterwards by just looping through the repeater items, but that is extra work and time.
17
3967
by: clintonG | last post by:
I'm using an .aspx tool I found at but as nice as the interface is I think I need to consider using others. Some can generate C# I understand. Your preferences please... <%= Clinton Gallagher http://forta.com/books/0672325667/
5
5095
by: Chris | last post by:
How Do I use the following auto-generated code from The Regulator? '------------------------------------------------------------------------------ ' <autogenerated> ' This code was generated by a tool. ' Runtime Version: 1.1.4322.2032 ' ' Changes to this file may cause incorrect behavior and will be lost if ' the code is regenerated. ' </autogenerated>
6
2499
by: Extremest | last post by:
I have a huge regex setup going on. If I don't do each one by itself instead of all in one it won't work for. Also would like to know if there is a faster way tried to use string.replace with all the right parts in there in one big line and for some reason that did not work either. Here is my regex's. static Regex rar = new Regex("\\.part.*", RegexOptions.IgnoreCase); static Regex par = new Regex("\\.vol.*", RegexOptions.IgnoreCase);
7
2582
by: Extremest | last post by:
I am using this regex. static Regex paranthesis = new Regex("(\\d*/\\d*)", RegexOptions.IgnoreCase); it should find everything between parenthesis that have some numbers onyl then a forward slash then some numbers. For some reason I am not getting that. It won't work at all in 2.0
3
2700
by: aspineux | last post by:
My goal is to write a parser for these imaginary string from the SMTP protocol, regarding RFC 821 and 1869. I'm a little flexible with the BNF from these RFC :-) Any comment ? tests= def RN(name, regex): """protect using () and give an optional name to a regex""" if name:
4
2669
by: CJ | last post by:
Is this the format to parse a string and return the value between the item? Regex pRE = new Regex("<File_Name>.*>(?<insideText>.*)</File_Name>"); I am trying to parse this string. <File_Name>Services</File_Name> Thanks
0
1730
by: Karch | last post by:
I have these two methods that are chewing up a ton of CPU time in my application. Does anyone have any suggestions on how to optimize them or rewrite them without Regex? The most time-consuming operation by a long-shot is the regex.Replace. Basically the only purpose of it is to remove spaces between opening/closing tags and the element name. Surely there is a better way. private string FixupJavascript(string htmlCode) { string result...
0
9112
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers, it seems that the internal comparison operator "<=>" tries to promote arguments from unsigned to signed. This is as boiled down as I can make it. Here is my compilation command: g++-12 -std=c++20 -Wnarrowing bit_field.cpp Here is the code in...
0
8971
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven tapestry of website design and digital marketing. It's not merely about having a website; it's about crafting an immersive digital experience that captivates audiences and drives business growth. The Art of Business Website Design Your website is...
1
8822
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For most users, this new feature is actually very convenient. If you want to control the update process,...
0
8815
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each protocol has its own unique characteristics and advantages, but as a user who is planning to build a smart home system, I am a bit confused by the choice of these technologies. I'm particularly interested in Zigbee because I've heard it does some...
1
6483
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new presenter, Adolph Dupré who will be discussing some powerful techniques for using class modules. He will explain when you may want to use classes instead of User Defined Types (UDT). For example, to manage the data in unbound forms. Adolph will...
0
4332
by: TSSRALBI | last post by:
Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The last exercise I practiced was to create a LAN-to-LAN VPN between two Pfsense firewalls, by using IPSEC protocols. I succeeded, with both firewalls in the same network. But I'm wondering if it's possible to do the same thing, with 2 Pfsense firewalls...
0
4570
by: adsilva | last post by:
A Windows Forms form does not have the event Unload, like VB6. What one acts like?
1
2994
by: 6302768590 | last post by:
Hai team i want code for transfer the data from one system to another through IP address by using C# our system has to for every 5mins then we have to update the data what the data is updated we have to send another system
3
1970
bsmnconsultancy
by: bsmnconsultancy | last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating effective websites that not only look great but also perform exceptionally well. In this comprehensive...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.