fastest way for humongous regexp search?

Hi,
I've got a list of 1000 common misspellings, and I'd like to check a set of
text files for those misspellings.
I'm trying to figure out the fastest way to do it; here's what I'm doing now
(below).

I'm still learning Python, love it, and I'm pretty sure that what I'm doing
is naive.

Thanks for taking the time to look at this,
--Tim
----------------------------------------------------------------------------
(1) Create one humongous regexp, compile it and cPickle it. The regexp is
like this:

import re, cPickle

misspelled = (
'\\bjudgement\\b|' +
'\\bjudgemental\\b|' +

<snip><snip><snip>

'\\bYorksire\\b|' +
'\\bYoyages\\b')

p = re.compile(misspelled, re.I)
f = open('misspell.pat', 'w')
cPickle.dump(p,f)
f.close()
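
A minimal sketch of how the pattern in step (1) could be built from a plain word list instead of being typed out by hand. It is written for Python 3 rather than the Python 2.2 mentioned in the post, and `MISSPELLINGS` and `build_pattern` are hypothetical names with a four-word sample standing in for the full list. It wraps the whole alternation in a single pair of word-boundary anchors rather than repeating them for each word.

```python
import re

# Hypothetical sample; the real list would hold all ~1000 misspellings.
MISSPELLINGS = ["judgement", "judgemental", "Yorksire", "Yoyages"]

def build_pattern(words):
    """Compile one case-insensitive alternation of the given words."""
    # re.escape guards against words containing regex metacharacters.
    alternation = "|".join(re.escape(w) for w in words)
    # One pair of word-boundary anchors around a non-capturing group.
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)

pattern = build_pattern(MISSPELLINGS)
print(pattern.findall("His Judgement was judgemental in Yorksire."))
```

As far as I know, pickling a compiled pattern stores only the pattern string and flags, so loading it recompiles the expression anyway. Keeping the word list in a text file and compiling at startup is likely just as fast.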
----------------------------------------------------------------------------
(2) Check the file(s), report the misspelling, the line number and the
actual line of text.
- only warns on multiple identical misspellings
- using 'EtaOinShrdlu' as a nonsense line-marker; tried \n but that
didn't give correct results.
- running on HP Unix, Python 2.2

f = open('misspell.pat', 'r')
p = cPickle.load(f)

a = open('myfile.txt').readlines()
s = 'EtaOinShrdlu'.join(a)

mistake = {}
for mMatch in p.findall(s):
    # has_key() instead of get(): a mistake first seen on line index 0
    # is stored as 0, which get() would treat as "not seen yet"
    if mistake.has_key(mMatch):
        print 'Warning: multiple occurrences of mistake "%s" ' % mMatch
    else:
        mistake[mMatch] = s.count('EtaOinShrdlu', 0, s.index(mMatch))

for k, v in mistake.items():
    print 'Misspelling: "%s" on line number %d' % (k, v+1)
    print '%s \n' % a[v]
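
One possible way to get line numbers without the 'EtaOinShrdlu' marker is to run the pattern over each line in turn and let `enumerate` do the counting. This is only a sketch in Python 3, not the code from the post: `find_misspellings` is a hypothetical helper, and the two-word pattern and three sample lines stand in for misspell.pat and myfile.txt.

```python
import re
from collections import defaultdict

def find_misspellings(lines, pattern):
    """Map each matched misspelling (lowercased) to its 1-based line numbers."""
    hits = defaultdict(list)
    for lineno, line in enumerate(lines, start=1):
        for match in pattern.finditer(line):
            hits[match.group(0).lower()].append(lineno)
    return dict(hits)

# Hypothetical pattern and text, standing in for misspell.pat and myfile.txt.
pattern = re.compile(r"\b(?:judgement|Yorksire)\b", re.IGNORECASE)
lines = ["A fine judgement.\n", "Nothing here.\n", "Judgement day in Yorksire.\n"]

hits = find_misspellings(lines, pattern)
for word, linenos in sorted(hits.items()):
    print('Misspelling: "%s" on line(s) %s' % (word, ", ".join(map(str, linenos))))
```

Because every occurrence is recorded, repeated misspellings show up as extra line numbers instead of a separate warning.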

Jul 18 '05 #1
Tim Arnold wrote:
I've got a list of 1000 common misspellings, and I'd like to check a set of
text files for those misspellings.


A much simpler way would be to just store these misspellings as a dictionary
(or set), read and split each line into words, then check whether each
word is in the set.

Istvan
Jul 18 '05 #2
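
The set-based suggestion above might look roughly like this in Python 3. It is a sketch under a few assumptions: `MISSPELLINGS`, `WORD`, and `check_lines` are hypothetical names, the set holds a lowercased sample of the list, and words are pulled out with `\w+` rather than a plain `split()` so that trailing punctuation does not stick to them.

```python
import re

# Hypothetical sample; the real set would hold the full list, lowercased.
MISSPELLINGS = {"judgement", "judgemental", "yorksire", "yoyages"}

WORD = re.compile(r"\w+")

def check_lines(lines, misspellings):
    """Return (line_number, word) pairs for every word found in the set."""
    found = []
    for lineno, line in enumerate(lines, start=1):
        for word in WORD.findall(line):
            # Lowercase before the lookup to mimic the re.I flag.
            if word.lower() in misspellings:
                found.append((lineno, word))
    return found

sample = ["Good judgment, bad Judgement.\n", "Yoyages to Yorkshire.\n"]
results = check_lines(sample, MISSPELLINGS)
for lineno, word in results:
    print('Misspelling: "%s" on line number %d' % (word, lineno))
```

Each lookup is a hash test, so the cost per word does not grow with the size of the list. The trade-off is that multi-word misspellings, or ones containing characters outside `\w`, would need separate handling.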
