Bytes | Software Development & Data Engineering Community

Regular expressions: a large number of ORs


Hi!

Given a string, I want to find all occurrences of
certain predefined words in that string. The problem is that the list of
words to be detected can be on the order of thousands.

With the re module, this can be solved something like this:

import re

r = re.compile("word1|word2|word3|.......|wordN")
r.findall(some_string)

Unfortunately, with more than about 10,000 words in
the regexp, I get a regular expression runtime error when
executing the findall function (compile works, but is slow).

I don't know if using the re module is the right solution here; any
suggestions on alternative solutions or data structures that could
be used to solve the problem?

André

Jul 18 '05 #1
[André Søreng]
Given a string, I want to find all occurrences of
certain predefined words in that string; the list of words to be
detected can be on the order of thousands. [...rest of the original
question snipped...]


Put the words you're looking for into a set (or as the keys of a dict
in older Pythons; the values in the dict are irrelevant).

I don't know what you mean by "word", so write something that breaks
your string into what you mean by words. Then:

for word in something_that_produces_words(the_string):
    if word in set_of_words:
        pass  # found one

This takes expected-case time proportional to the number of words in
the string, + setup time proportional to the number of "interesting"
words (the time needed to create a set or dict from them). 10,000
interesting words won't even start to strain it.
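As a concrete, runnable sketch of this approach (with a hypothetical word list, and `re.findall` standing in for something_that_produces_words):

```python
import re

# Hypothetical set of interesting words; in practice it may hold 10,000+.
set_of_words = {"word1", "word2", "word3"}

some_string = "word2 and word1 appear here, wordX does not"

# Split the string into words (here: runs of \w characters), then test
# each word against the set -- an O(1) expected-time lookup per word.
found = [w for w in re.findall(r"\w+", some_string) if w in set_of_words]
print(found)  # -> ['word2', 'word1']
```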
Jul 18 '05 #2
André Søreng wrote:
Given a string, I want to find all occurrences of certain predefined
words in that string; the list can be on the order of thousands. [...]
If you can split some_string into individual words, you could look them up in a set of known words:

known_words = set("word1 word2 word3 ....... wordN".split())
found_words = [ word for word in some_string.split() if word in known_words ]

Kent

André

Jul 18 '05 #3
This does not sound like a job for a single regex.

Using a list and listcomp (say your words are in a list called "mywordlist")
you can make this quite terse. Of course I have a way of writing algorithms
that have very large exp when people tell me the O(N^exp).

try this:
myregexlist = [re.compile(aword) for aword in mywordlist]
myoccurrences = [argx.findall(some_string) for argx in myregexlist]

Now you should have a 1:1 mapping of mywordlist and myoccurrences. Of
course you can fill mywordlist with real regular expressions instead of
just words. If you want to count the words, you may just want to use the
string count method:

myoccurrences = [some_string.count(aword) for aword in mywordlist]

This may make more sense if you are not using true regexes.
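For instance (hypothetical word list and text), the count variant looks like this:

```python
# Hypothetical word list and input text.
mywordlist = ["spam", "eggs"]
some_string = "spam and eggs and more spam"

# str.count counts non-overlapping substring occurrences of each word.
myoccurrences = [some_string.count(aword) for aword in mywordlist]
print(dict(zip(mywordlist, myoccurrences)))  # -> {'spam': 2, 'eggs': 1}
```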

James

On Tuesday 01 March 2005 11:46 am, André Søreng wrote:
Given a string, I want to find all occurrences of certain predefined
words in that string. [...]


--
James Stroud, Ph.D.
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095
Jul 18 '05 #4
Kent Johnson wrote:
If you can split some_string into individual words, you could look them
up in a set of known words:

known_words = set("word1 word2 word3 ....... wordN".split())
found_words = [ word for word in some_string.split() if word in known_words ]

That is not exactly what I want. It should discover whether any of
the predefined words appear as substrings, not only as whole words.
For instance, when matching "word2sgjoisejfisaword1yguyg", word2
and word1 should both be detected.
Jul 18 '05 #5
On Tuesday 1 March 2005 22:04, André Søreng wrote:
That is not exactly what I want. It should discover if some of
the predefined words appear as substrings, not only as equal
words. For instance, after matching "word2sgjoisejfisaword1yguyg", word2
and word1 should be detected.


Hi,

A lexer producing a DFA like the one in pyggy (see
http://www.lava.net/~newsham/pyggy/) might be what you're looking for.

Regards,

Francis Girard

Jul 18 '05 #6
On Tue, 01 Mar 2005 22:04:15 +0100, André Søreng <ws******@tiscali.no> wrote:
That is not exactly what I want. It should discover if some of
the predefined words appear as substrings, not only as equal
words. For instance, after matching "word2sgjoisejfisaword1yguyg", word2
and word1 should be detected.


Show some initiative, man!
known_words = set(["word1", "word2"])
found_words = [word for word in known_words if word in "word2sgjoisejfisaword1yguyg"]
found_words

['word1', 'word2']

Peace
Bill Mill
bill.mill at gmail.com
Jul 18 '05 #7
André Søreng wrote:
Unfortunately, when having more than about 10 000 words in
the regexp, I get a regular expression runtime error when
trying to execute the findall function (compile works fine, but slow).


What error do you get? What version of Python are you using? re was changed in Python 2.4 to avoid
recursion, so if you are getting a stack overflow in Python 2.3 you should try 2.4.

Kent
Jul 18 '05 #8
André Søreng <ws******@tiscali.no> wrote:
[...original question snipped...]

I wrote a regexp optimiser for exactly this case.

E.g. a regexp for all 5-letter words starting with "re":

$ grep -c '^re' /usr/share/dict/words
2727

$ grep '^re' /usr/share/dict/words | ./words-to-regexp.pl 5

re|re's|reac[ht]|rea(?:d|d[sy]|l|lm|m|ms|p|ps|r|r[ms])|reb(?:el|u[st])|rec(?:ap|ta|ur)|red|red's|red(?:id|o|s)|ree(?:d|ds|dy|f|fs|k|ks|l|ls|ve)|ref|ref's|refe[dr]|ref(?:it|s)|re(?:gal|hab|(?:ig|i)n|ins|lax|lay|lic|ly|mit|nal|nd|nds|new|nt|nts|p)|rep's|rep(?:ay|el|ly|s)|rer(?:an|un)|res(?:et|in|t|ts)|ret(?:ch|ry)|re(?:use|v)|rev's|rev(?:el|s|ue)

As you can see it's not perfect.

Find it in http://www.craig-wood.com/nick/pub/words-to-regexp.pl

Yes, it's Perl and rather kludgy, but it may give you ideas!
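The same prefix-sharing idea can be sketched in Python (this is not Nick's script, just a rough illustration): build a trie from the word list, then emit one alternation per trie node, putting the empty "a word may end here" alternative last so longer matches are tried first.

```python
import re

def trie_regex(words):
    # Build a trie; the "" key marks "a word may end here".
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}

    # Emit a regex for a trie node: one alternative per child,
    # with the end-of-word (empty) alternative sorted last.
    def emit(node):
        if not node:
            return ""
        ordered = sorted(node.items(), key=lambda kv: (kv[0] == "", kv[0]))
        parts = [re.escape(ch) + emit(child) if ch else ""
                 for ch, child in ordered]
        return parts[0] if len(parts) == 1 else "(?:" + "|".join(parts) + ")"

    return emit(trie)

pattern = trie_regex(["read", "reads", "ready", "real", "realm"])
print(pattern)  # -> rea(?:d(?:s|y|)|l(?:m|))
print(re.findall(pattern, "ready or not, the realm is real"))
# -> ['ready', 'realm', 'real']
```

Sharing prefixes this way keeps the compiled pattern far smaller than a flat `word1|word2|...|wordN` alternation.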

--
Nick Craig-Wood <ni**@craig-wood.com> -- http://www.craig-wood.com/nick
Jul 18 '05 #9
Kent Johnson <ke****@tds.net> wrote:

:> Given a string, I want to find all occurrences of
:> certain predefined words in that string. [...]

The internal data structure that encodes that set of keywords is
probably humongous. An alternative approach to this problem is to
tokenize your string into words, and then check to see if each word is
in a defined list of "keywords". This works if your keywords are
single words:

###
keywords = set(["word1", "word2", ...])
matchingWords = set(re.findall(r'\w+', some_string)).intersection(keywords)
###

Would this approach work for you?

Otherwise, you may want to look at a specialized data structure for
doing multiple keyword matching; I had an older module that wrapped
around a suffix tree:

http://hkn.eecs.berkeley.edu/~dyoo/python/suffix_trees/

It looks like other folks, thankfully, have written other
implementations of suffix trees:

http://cs.haifa.ac.il/~shlomo/suffix_tree/

Another approach is something called the Aho-Corasick algorithm:

http://portal.acm.org/citation.cfm?doid=360825.360855

though I haven't been able to find a nice Python module for this yet.
Best of wishes to you!
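To make the idea concrete, here is a small pure-Python sketch of Aho-Corasick (an illustration with invented names, not the module mentioned above): build a trie over the keywords, compute failure links with a breadth-first pass, then scan the text once, reporting every occurrence -- overlapping ones included -- as a (start, word) pair.

```python
from collections import deque

def ac_build(words):
    """Build the automaton: trie + failure links + per-state outputs."""
    goto = [{}]   # goto[state]: char -> next state
    fail = [0]    # fail[state]: state for the longest proper suffix
    out = [[]]    # out[state]: keywords ending at this state
    for word in words:
        s = 0
        for ch in word:
            if ch not in goto[s]:
                goto.append({}); fail.append(0); out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(word)
    # Breadth-first pass; depth-1 states fail to the root (already 0).
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] += out[fail[t]]  # inherit matches ending at the suffix state
    return goto, fail, out

def ac_search(text, words):
    """Yield (start_index, word) for every keyword occurrence in text."""
    goto, fail, out = ac_build(words)
    s = 0
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for word in out[s]:
            yield i - len(word) + 1, word

print(list(ac_search("word2sgjoisejfisaword1yguyg", ["word1", "word2"])))
# -> [(0, 'word2'), (17, 'word1')]
```

On the example string from earlier in the thread this finds both embedded words in a single pass, in time proportional to the text length plus the number of matches.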
Jul 18 '05 #10
Bill Mill wrote:
known_words = set(["word1", "word2"])
found_words = [word for word in known_words if word in "word2sgjoisejfisaword1yguyg"]
found_words

['word1', 'word2']


Yes, but I was looking for a solution that would scale. Searching
through the same string 10,000+ times does not seem like a suitable
solution.

André
Jul 18 '05 #11
Daniel Yoo wrote:
Another approach is something called the Aho-Corasick algorithm:

http://portal.acm.org/citation.cfm?doid=360825.360855

though I haven't been able to find a nice Python module for this yet.


Thanks, it seems the Aho-Corasick algorithm is along the lines of
what I was looking for, but I have not read the article completely yet.

Also:
http://alexandria.tue.nl/extra1/wskr...tml/200407.pdf

provided several alternative algorithms.

André

Jul 18 '05 #12
André Søreng wrote:


Yes, but I was looking for a solution which would scale. Searching
through the same string 10000+++ times does not seem like a suitable
solution.

André


Just out of curiosity, what would a regexp do here? Perhaps how
regexps are executed holds a clue to how you could do this.

ola

--
--------------------------------------
Ola Natvig <ol********@infosense.no>
infoSense AS / development
Jul 18 '05 #13
Ola Natvig wrote:
Just out of curiosity, what would a regexp do here? Perhaps how
regexps are executed holds a clue to how you could do this.


I think this article provides me with what I was looking for:

http://alexandria.tue.nl/extra1/wskr...tml/200407.pdf

Enough info there to keep me going for some while.
Jul 18 '05 #14
You could divide the regex on the basis of the letters the words start
with, or iterate over the list.

Regards,
Garry

http://garrythegambler.blogspot.com/
On Wed, 02 Mar 2005 12:50:01 +0100, André Søreng <an*****@stud.cs.uit.no> wrote:
I think this article provides me with what I was looking for:

http://alexandria.tue.nl/extra1/wskr...tml/200407.pdf

Enough info there to keep me going for some while.

--
Thanks and Regards,
GSS
Jul 18 '05 #15
On Tue, 1 Mar 2005 15:03:50 -0500, Tim Peters <ti********@gmail.com>
wrote:
[André Søreng]
[...original question snipped...]
Put the words you're looking for into a set (or as the keys of a dict
in older Pythons; the values in the dict are irrelevant).

I don't know what you mean by "word", so write something that breaks
your string into what you mean by words. Then:

for word in something_that_produces_words(the_string):
    if word in set_of_words:
        pass  # found one

I have the same problem.
Unfortunately the meaning of a "word" depends on the word.
As an example I would like to count the number of occurrences of
movies titles in some text.

Maybe lex is more optimized?
Unfortunately it seems that there are no lex versions that generate
Python (or Pyrex) code.

Thanks and regards Manlio Perillo

Jul 18 '05 #16
Hi.

Python allows subclassing builtin classes, but the Python interpreter
itself uses the builtin types. For example, keyword arguments are
collected into a dict, but I would like to use a user-defined SortedDict.

Are there plans to add such a feature in a future version?
Thanks and regards Manlio Perillo
Jul 18 '05 #17
: Another approach is something called the Aho-Corasick algorithm:

: http://portal.acm.org/citation.cfm?doid=360825.360855

: though I haven't been able to find a nice Python module for this yet.
Followup on this: I haven't been able to find one, so I took someone
else's implementation and adapted it. *grin*

Here you go:

http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/

This provides an 'ahocorasick' Python C extension module for doing
matching on a set of keywords. I'll start writing out the package
announcements tomorrow.
I hope this helps!
Jul 18 '05 #19

Daniel Yoo wrote:

Here you go:

http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/

This provides an 'ahocorasick' Python C extension module for doing
matching on a set of keywords. I'll start writing out the package
announcements tomorrow.


Looks good.

However:

tree.search("I went to alpha beta the other day to pick up some spam")

could use a startpos (default=0) argument for efficiently restarting
the search after finding the first match

Jul 18 '05 #20
Daniel Yoo <dy**@hkn.eecs.berkeley.edu> wrote:
: John Machin <sj******@lexicon.net> wrote:
: : tree.search("I went to alpha beta the other day to pick up some spam")

: : could use a startpos (default=0) argument for efficiently restarting
: : the search after finding the first match

: Ok, that's easy to fix. I'll do that tonight.

Done. 'startpos' and other bug fixes are in Release 0.7:

http://hkn.eecs.berkeley.edu/~dyoo/p...ick-0.7.tar.gz

But I think I'd better hold off adding the ahocorasick package to PyPI
until it stabilizes for longer than a day... *grin*
Jul 18 '05 #21
Scott David Daniels <Sc***********@acm.org> wrote:

: I have a (very high speed) modified Aho-Corasick machine that I sell.
: The calling model that I found works well is:

: def chases(self, sourcestream, ...):
: '''A generator taking a generator of source blocks,
: yielding (matches, position) pairs where position is an
: offset within the "current" block.
: '''

: You might consider taking a look at providing that form.
Hi Scott,

No problem, I'll be happy to do this.

I need some clarification on the calling model though. Would this be
an accurate test case?

######
def testChasesInterface(self):
    self.tree.add("python")
    self.tree.add("is")
    self.tree.make()
    sourceBlocks = ("python programming is fun",
                    "how much is that python in the window")
    self.assertEqual([
        (sourceBlocks[0], (0, 6)),
        (sourceBlocks[0], (19, 21)),
        (sourceBlocks[1], (9, 11)),
        (sourceBlocks[1], (17, 23)),
    ], list(self.tree.chases(iter(sourceBlocks))))
######

Here, I'm assuming that chases() takes in a 'sourceStream', which is
an iterator of text blocks, and that the return value is itself an
iterator.
Best of wishes!
Jul 18 '05 #22
: Done. 'startpos' and other bug fixes are in Release 0.7:

: http://hkn.eecs.berkeley.edu/~dyoo/p...ick-0.7.tar.gz

Ok, I stopped working on the Aho-Corasick module for a while, so I've
just bumped the version number to 0.8 and posted it up on PyPI.

I did add some preliminary code to use graphviz to emit DOT files, but
it's very untested code. I also added an undocumented api for
inspecting the states and their transitions.

I hope that the original poster finds it useful, even though it's
probably a bit late.
Hope this helps!
Jul 18 '05 #23

This thread has been closed and replies have been disabled. Please start a new discussion.
