Is there a maximum length of a regular expression in python?

I have a regular expression that is approximately 100k bytes. (It is
basically a list of all known Norwegian postal numbers and the
corresponding place, with | in between.) I know this is not the intended
use for regular expressions, but it should nonetheless work.

The pattern is
ur'(N-|NO-)?(5259 HJELLESTAD|4026 STAVANGER|4027 STAVANGER........|8305 SVOLVÆR)'

The error message I get is:
RuntimeError: internal error in regular expression engine

Jan 18 '06 #1
ol****************@gmail.com wrote:
I have a regular expression that is approximately 100k bytes. (It is
basically a list of all known Norwegian postal numbers and the
corresponding place, with | in between.) I know this is not the intended
use for regular expressions, but it should nonetheless work.

The pattern is
ur'(N-|NO-)?(5259 HJELLESTAD|4026 STAVANGER|4027 STAVANGER........|8305 SVOLVÆR)'

The error message I get is:
RuntimeError: internal error in regular expression engine

And I'm not the least bit surprised. Your code is brittle (i.e. likely
to break) and cannot, for example, cope with multiple spaces between the
number and the word(s). Quite apart from breaking the interpreter :-)

I'd say your test was the clearest possible demonstration that there
*is* a limit.

Wouldn't it be better to have a dict keyed on the number and containing
the word (which you can construct from the same source you constructed
your horrendously long regexp)?

Then if you find something matching the pattern (untested)

ur'(N-|NO-)?((\d\d\d\d)\s*([A-Za-z ]+))'

or something like it that actually works (I invariably get regexps wrong
at least three times before I get them right) you can use the dict to
validate the number and name.

Quite apart from anything else, if the text line you are examining
doesn't have the right syntactic form then you are going to test
hundreds of options, none of which can possibly match. So matching the
syntax and then validating the data identified seems like a much more
sensible option (to me, at least).

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC www.holdenweb.com
PyCon TX 2006 www.python.org/pycon/
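
A minimal sketch of the approach Steve describes above (match the syntax first, then validate against a dict), written for Python 3 rather than the Python 2 of this thread. The names POSTCODES, ADDRESS_RE and find_places are made up for illustration, the four-entry table stands in for the real data, and the character class is widened beyond [A-Za-z ] so that names such as SVOLVÆR can match; treat it as untested against real input.

```python
import re

# Hypothetical sample data; the real table would be built from the same
# source that was used to generate the giant regular expression.
POSTCODES = {
    "5259": "HJELLESTAD",
    "4026": "STAVANGER",
    "4027": "STAVANGER",
    "8305": "SVOLVÆR",
}

# Syntax only: optional prefix, four digits, whitespace, then one or more
# words. [^\W\d_] matches any Unicode letter, so Æ, Ø and Å are accepted.
ADDRESS_RE = re.compile(r"(?:N-|NO-)?(\d{4})\s+([^\W\d_]+(?: [^\W\d_]+)*)")


def find_places(text):
    """Return (number, place) pairs whose number and name agree with POSTCODES."""
    found = []
    for m in ADDRESS_RE.finditer(text):
        number, name = m.groups()
        expected = POSTCODES.get(number)
        if expected is None:
            continue
        # The name group may swallow following words, so compare by prefix.
        if name == expected or name.startswith(expected + " "):
            found.append((number, expected))
    return found
```

Because the small pattern only checks the shape of the text, a line that has the wrong form is rejected after one cheap attempt instead of after thousands of failed alternatives.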

Jan 18 '06 #2
ol****************@gmail.com wrote:
I have a regular expression that is approximately 100k bytes. (It is
basically a list of all known Norwegian postal numbers and the
corresponding place, with | in between.) I know this is not the intended
use for regular expressions, but it should nonetheless work.

The pattern is
ur'(N-|NO-)?(5259 HJELLESTAD|4026 STAVANGER|4027 STAVANGER........|8305 SVOLVÆR)'

The error message I get is:
RuntimeError: internal error in regular expression engine


you're most likely exceeding the allowed code size (usually 64k).

however, putting all postal numbers in a single RE is a horrid abuse of the RE
engine. why not just scan for "(N-|NO-)?(\d+)" and use a dictionary to check
if you have a valid match?

import re

postcodes = {
    "5259": "HJELLESTAD",
    # ...
    "9999": "ØSTRE FJORDVIDDA",
}

# scan for an optional prefix and a number, then look the number up
for m in re.finditer(r"(N-|NO-)?(\d+) ", text):
    prefix, number = m.groups()
    try:
        place = postcodes[number]
    except KeyError:
        continue
    # the place name must follow directly after the number
    if not text.startswith(place, m.end()):
        continue
    # got a match!
    print prefix, number, place

</F>
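
To see where a particular interpreter draws the line on code size, one could generate a synthetic alternation and try to compile it. The Python 3 sketch below does that; the helper name try_compile_alternation and the fake "NNNN PLACEn" entries are assumptions for illustration, not real postal data, and which exception is raised (if any) depends on the Python version in use.

```python
import re


def try_compile_alternation(n_alternatives):
    """Build a pattern with n_alternatives literal branches and try to compile it.

    Returns (ok, detail): detail is the pattern length on success, or the
    name of the exception type on failure.
    """
    # Synthetic stand-ins for "NNNN PLACE" entries.
    branches = ["%04d PLACE%d" % (i % 10000, i) for i in range(n_alternatives)]
    pattern = "(N-|NO-)?(" + "|".join(branches) + ")"
    try:
        re.compile(pattern)
    except (re.error, RuntimeError, OverflowError, MemoryError) as exc:
        return False, type(exc).__name__
    return True, len(pattern)
```

Calling it with growing values of n_alternatives (say 100, 1000, 10000) shows whether, and at roughly what pattern size, compilation starts to fail on the interpreter at hand.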

Jan 18 '06 #3
In article <11**********************@z14g2000cwz.googlegroups.com>,
ol****************@gmail.com wrote:
I have a regular expression that is approximately 100k bytes. (It is
basically a list of all known Norwegian postal numbers and the
corresponding place, with | in between.) I know this is not the intended
use for regular expressions, but it should nonetheless work.

The pattern is
ur'(N-|NO-)?(5259 HJELLESTAD|4026 STAVANGER|4027 STAVANGER........|8305 SVOLVÆR)'

The error message I get is:
RuntimeError: internal error in regular expression engine


I don't know of any stated maximum length, but I'm not at all surprised
this causes the regex compiler to blow up. This is clearly a case of regex
being the wrong tool for the job.

I'm guessing a dictionary, with the numeric codes as keys and the city
names as values (or perhaps the other way around) is what you want.
Jan 18 '06 #4

<ol****************@gmail.com> wrote in message
news:11**********************@z14g2000cwz.googlegroups.com...
I have a regular expression that is approximately 100k bytes. (It is
basically a list of all known Norwegian postal numbers and the
corresponding place, with | in between.) I know this is not the intended
use for regular expressions, but it should nonetheless work.


Err. No.

A while back it was established in this forum that regular expressions can,
by design, have a maximum of 99 match groups ... I suspect that every "|"
silently consumes one match group.
Jan 19 '06 #5
Frithiof Andreas Jensen wrote:
I have a regular expression that is approximately 100k bytes. (It is
basically a list of all known Norwegian postal numbers and the
corresponding place, with | in between.) I know this is not the intended
use for regular expressions, but it should nonetheless work.


Err. No.

A while back it was established in this forum that regular expressions can,
by design, have a maximum of 99 match groups ... I suspect that every "|"
silently consumes one match group.


nope. this is a code size limit, not a group count limit.

</F>
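
The group-count theory can be checked directly, since a compiled pattern reports its number of capturing groups in the groups attribute. A small Python 3 sketch, using a shortened stand-in for the original pattern:

```python
import re

# Alternation alone does not create capturing groups; only parentheses do.
no_groups = re.compile(r"5259 HJELLESTAD|4026 STAVANGER|8305 SVOLVÆR")
two_groups = re.compile(r"(N-|NO-)?(5259 HJELLESTAD|4026 STAVANGER|8305 SVOLVÆR)")

print(no_groups.groups)   # 0
print(two_groups.groups)  # 2
```

However many alternatives are added inside the second pair of parentheses, the count stays at two, so the pattern in question is nowhere near a group limit.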

Jan 19 '06 #6
Roy Smith wrote:
ol****************@gmail.com wrote:

I have a regular expression that is approximately 100k bytes. (It is
basically a list of all known Norwegian postal numbers and the
corresponding place, with | in between.) I know this is not the intended
use for regular expressions, but it should nonetheless work.

The pattern is
ur'(N-|NO-)?(5259 HJELLESTAD|4026 STAVANGER|4027 STAVANGER........|8305 SVOLVÆR)'

The error message I get is:
RuntimeError: internal error in regular expression engine

I don't know of any stated maximum length, but I'm not at all surprised
this causes the regex compiler to blow up. This is clearly a case of regex
being the wrong tool for the job.


Does no one care about an internal error in the regular expression
engine?
--
--Bryan
Jan 20 '06 #7
In article <iu*****************@newssvr27.news.prodigy.net>,
Bryan Olson <fa*********@nowhere.org> wrote:
Roy Smith wrote:
ol****************@gmail.com wrote:

I have a regular expression that is approximately 100k bytes. (It is
basically a list of all known Norwegian postal numbers and the
corresponding place, with | in between.) I know this is not the intended
use for regular expressions, but it should nonetheless work.

The pattern is
ur'(N-|NO-)?(5259 HJELLESTAD|4026 STAVANGER|4027 STAVANGER........|8305 SVOLVÆR)'

The error message I get is:
RuntimeError: internal error in regular expression engine

I don't know of any stated maximum length, but I'm not at all surprised
this causes the regex compiler to blow up. This is clearly a case of regex
being the wrong tool for the job.


Does no one care about an internal error in the regular expression
engine?


I think the most that could be said here is that it should probably produce
a better error message.
Jan 20 '06 #8
Bryan Olson wrote:
Roy Smith wrote:
ol****************@gmail.com wrote:
I have a regular expression that is approximately 100k bytes. (It is
basically a list of all known Norwegian postal numbers and the
corresponding place, with | in between.) I know this is not the intended
use for regular expressions, but it should nonetheless work.

The pattern is
ur'(N-|NO-)?(5259 HJELLESTAD|4026 STAVANGER|4027 STAVANGER........|8305 SVOLVÆR)'

The error message I get is:
RuntimeError: internal error in regular expression engine

I don't know of any stated maximum length, but I'm not at all surprised
this causes the regex compiler to blow up. This is clearly a case of regex
being the wrong tool for the job.

Does no one care about an internal error in the regular expression
engine?

Not one that requires parsing a 100 kilobyte re that should be replaced
by something more sensible, no.

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC www.holdenweb.com
PyCon TX 2006 www.python.org/pycon/

Jan 21 '06 #9
Steve Holden <st***@holdenweb.com> writes:
Does no one care about an internal error in the regular expression
engine?

Not one that requires parsing a 100 kilobyte re that should be
replaced by something more sensible, no.


If the internal error means the re engine bumped into some internal
limit and gracefully raised an exception, then fine. If "internal
error" means the re engine unexpectedly got into some inconsistent
internal state, then threw up its hands and barfed after discovering
the error sometime later, that's bad. Does nobody care which it is?
Jan 21 '06 #10
In article <7x************@ruckus.brouhaha.com>,
Paul Rubin <http://ph****@NOSPAM.invalid> wrote:
Steve Holden <st***@holdenweb.com> writes:
Does no one care about an internal error in the regular expression
engine?

Not one that requires parsing a 100 kilobyte re that should be
replaced by something more sensible, no.


If the internal error means the re engine bumped into some internal
limit and gracefully raised an exception, then fine. If "internal
error" means the re engine unexpectedly got into some inconsistent
internal state, then threw up its hands and barfed after discovering
the error sometime later, that's bad. Does nobody care which it is?


The nice thing about an open source project is that if nobody else gets
excited about some particular issue which is bothering you, you can take a
look yourself.

(from Python-2.3.4/Modules/_sre.c):

static void
pattern_error(int status)
{
    switch (status) {
    case SRE_ERROR_RECURSION_LIMIT:
        PyErr_SetString(
            PyExc_RuntimeError,
            "maximum recursion limit exceeded"
            );
        break;
    case SRE_ERROR_MEMORY:
        PyErr_NoMemory();
        break;
    default:
        /* other error codes indicate compiler/engine bugs */
        PyErr_SetString(
            PyExc_RuntimeError,
            "internal error in regular expression engine"
            );
    }
}

I suppose one man's graceful exit is another man's barf.
Jan 21 '06 #11
[Bryan Olson]
Does no one care about an internal error in the regular expression
engine?

[Steve Holden]
Not one that requires parsing a 100 kilobyte re that should be replaced
by something more sensible, no.


I care: this is a case of not detecting information loss due to
unchecked downcasting in C, and it was pure luck that it resulted in
an internal re error rather than, say, a wrong result. God only knows
what other pathologies the re engine could be tricked into exhibiting
this way. Python 2.5 will raise an exception instead, during regexp
compilation (I just checked in code for this on the trunk; with some
luck, someone will backport that to 2.4 too).
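
The failure mode described above, a value silently losing its high bits when it is stored in a narrower C integer, can be imitated in Python 3. This sketch does not reproduce the actual _sre code: the function names, the 16-bit width (suggested by the "usually 64k" remark earlier in the thread) and the value 70000 are illustrative assumptions.

```python
def unchecked_downcast(value, bits=16):
    """Mimic a C cast to an unsigned integer of the given width: high bits are dropped."""
    return value & ((1 << bits) - 1)


def checked_downcast(value, bits=16):
    """Same narrowing, but refuse values that do not fit instead of truncating."""
    limit = 1 << bits
    if not 0 <= value < limit:
        raise OverflowError("value %d does not fit in %d bits" % (value, bits))
    return value


offset = 70000                     # hypothetical offset inside compiled pattern code
print(unchecked_downcast(offset))  # 4464: silently wrong
```

With the unchecked version the engine would carry on with 4464 as if nothing had happened; the checked version turns the same situation into an exception at the point where the information would be lost.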
Jan 21 '06 #12
Tim Peters wrote:
[Bryan Olson]
Does no one care about an internal error in the regular expression
engine?

[Steve Holden]
Not one that requires parsing a 100 kilobyte re that should be replaced
by something more sensible, no.

I care: this is a case of not detecting information loss due to
unchecked downcasting in C, and it was pure luck that it resulted in
an internal re error rather than, say, a wrong result. God only knows
what other pathologies the re engine could be tricked into exhibiting
this way. Python 2.5 will raise an exception instead, during regexp
compilation (I just checked in code for this on the trunk; with some
luck, someone will backport that to 2.4 too).


Just goes to show you, ignorance is bliss.
What would we do without you, Tim?

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC www.holdenweb.com
PyCon TX 2006 www.python.org/pycon/

Jan 21 '06 #13

"Bryan Olson" <fa*********@nowhere.org> wrote in message
news:iu*****************@newssvr27.news.prodigy.net...
Does no one care about an internal error in the regular expression
engine?


Yes, but - given the example - In about the same way that I care about an
internal error in my car engine after dropping a spanner into it ;-)
Jan 24 '06 #14
This should really be posted to http://www.thedailywtf.com/. I wonder
if they have a German version of TheDailyWTF.com?

Jan 25 '06 #15
