Bytes IT Community

more than 100 capturing groups in a regex

Hello,

Python regular expressions cannot have more than 100 capturing
groups. The source code responsible for this reads as follows:

    # XXX: <fl> get rid of this limitation!
    if p.pattern.groups > 100:
        raise AssertionError(
            "sorry, but this version only supports 100 named groups"
            )
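For reference, the cap is easy to reproduce with a generated pattern. (On later Python versions the limit has since been lifted, so this compiles cleanly; on the 2.4-era interpreter discussed here it raised the AssertionError quoted above.)

```python
import re

# A pattern with 150 capturing groups, built programmatically --
# the kind of generated regex under discussion in this thread.
pattern = "".join("(a)" for _ in range(150))

try:
    p = re.compile(pattern)
    print("compiled,", p.groups, "groups")  # later Pythons: no cap
except AssertionError as err:
    print("hit the limit:", err)            # behavior described above
```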

I have been waiting a long time now for Python to get rid of this
limitation.
I could make a program of mine a lot faster with an easy hack if Python
did not have it.

My question is: Does anyone know if the problem is going to be fixed in
the next few months or so? Or is there a way to circumvent it?
Jörg Schuster

Oct 24 '05 #1
33 Replies


Joerg> Or is there a way to circumvent [capturing groups limitation]?

Sure, submit a patch to SourceForge that removes the restriction.

I've never come anywhere close to creating regular expressions that need to
capture 100 groups even though I generate regular expressions from a
higher-level representation. I suspect few will have hit that limit.
Perhaps explain what motivates you to want to capture that many groups.
Other people may be able to suggest alternatives. And remember:

Some people, when confronted with a problem, think "I know, I'll use
regular expressions." Now they have two problems. --Jamie Zawinski

Skip
Oct 24 '05 #2

> Some people, when confronted with a problem, think "I know,
I'll use regular expressions." Now they have two problems.
--Jamie Zawinski


Thanks for the citation.

If my goal had been to redesign my program, I would not ask questions
about regular expressions. I do not have the time to redesign my
program. And knowing that my situation would be better if I had
written other code in the past, does not help me at all.

I just want to use more than 100 capturing groups. If someone told me
that it is very unlikely for Python to get rid of the said limitation,
I would recode part of my program in C++ using pcre. But I would prefer
to be able to do everything in Python. That is why I asked.

Jörg

Oct 25 '05 #3

Joerg Schuster wrote:
I just want to use more than 100 capturing groups. If someone told me
that it is very unlikely for Python to get rid of the said limitation,
I would recode part of my program in C++ using pcre.


It is very unlikely for Python to get rid of the said limitation.

-Peter
Oct 25 '05 #4


"Joerg Schuster" <jo************@gmail.com> wrote in message
news:11**********************@z14g2000cwz.googlegroups.com...
Hello,

Python regular expressions must not have more than 100 capturing
groups.

Really ??

I have been waiting a long time now for Python to get rid of this
limitation.

Ahh - The "dark side" of Open Source:

If nobody cares, then you will have to do it yourself (and often people do
not care because nobody had the need to go there - for good reasons).

My question is: Does anyone know if the problem is going to be fixed in
the next few months or so? Or is there a way to circumvent it?

After a quick glance across the source code for the sre module, it appears
that the only place the code mentions a limit of 100 groups is in fact the
place that you quote.

I suspect it is there for some historic reason - the people to ask are of
course "pythonware.com" who wrote it; there may well be a good reason for
the limitation.

What happens if you up the limit to whatever you need?
Oct 25 '05 #5


"Joerg Schuster" <jo************@gmail.com> wrote in message
news:11**********************@z14g2000cwz.googlegroups.com...
Some people, when confronted with a problem, think "I know,
I'll use regular expressions." Now they have two problems.
--Jamie Zawinski


Thanks for the citation.

If my goal had been to redesign my program, I would not ask questions
about regular expressions. I do not have the time to redesign my
program. And knowing that my situation would be better if I had
written other code in the past, does not help me at all.

Experience shows that, in any project, there is always time to redo what was
not done properly the first time a deadline was dreamed up.

I just want to use more than 100 capturing groups. If someone told me
that it is very unlikely for Python to get rid of the said limitation,
I would recode part of my program in C++ using pcre.

See!? There *is* Time.

Oct 25 '05 #6

> What happens if you up the limit to whatever you need?

Good idea. I just tried this. Nothing evil seems to happen. This seems
to be a solution. Thanks.

Jörg

Oct 25 '05 #7

Joerg Schuster wrote:
I just want to use more than 100 capturing groups.


define "more" (101, 200, 1000, 100000, ... ?)

</F>

Oct 25 '05 #8

No limitation at all would be best. If a limitation is necessary, then
the more capturing groups, the better. For the time being, I would be
really happy to be able to use 10000 capturing groups.

Jörg

Oct 25 '05 #9

"Joerg Schuster" <jo************@gmail.com> writes:
No limitation at all would be best. If a limitation is necessary, then
the more capturing groups, the better. For the time being, I would be
really happy to be able to use 10000 capturing groups.


I'm sorry, I missed the beginning of this thread and it has already expired on
my news server, but what is the reason for so many capturing groups? I
imagine that coding this and keeping the code maintainable is a huge effort. Oh,
and I came from Perl, where I used to think in regexps... In Python I almost
never use them.

--
Jorge Godoy <go***@ieee.org>
Oct 25 '05 #10

Joerg Schuster wrote:
What happens if you up the limit to whatever you need?

Good idea. I just tried this. Nothing evil seems to happen. This seems
to be a solution. Thanks.

Jörg

The joys of open source. Just remember you have now made your program
non-portable. Hope this isn't an issue.

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC www.holdenweb.com
PyCon TX 2006 www.python.org/pycon/

Oct 25 '05 #11

> but what is the reason for so many capturing groups? I
imagine that coding this and keeping the code maintainable is a huge effort.


User input is compiled to regular expressions. The user does not have
to worry about those groups.

Oct 25 '05 #12

On Tue, 25 Oct 2005 03:55:17 -0700, Joerg Schuster wrote:
No limitation at all would be best. If a limitation is necessary, then
the more capturing groups, the better. For the time being, I would be
really happy to be able to use 10000 capturing groups.


Do you really think that the regular expression needed to do that would be
maintainable?

I'm also curious, what sort of usage case would need ten thousand
capturing groups? I'd love to see the performance, especially if all ten
thousand of them do backtracking.

--
Steven.

Oct 25 '05 #13

> The joys of open source. Just remember you have now made your program
> non-portable. Hope this isn't an issue.


Of course portability is an issue -- in the long run. But in the short
run I am really glad to be able to do a 1-second demo run on my
notebook instead of a 20-second demo run. And I am especially glad to
get this 1-second demo by doing a 30-minute hack. (Hopefully ...)

Oct 25 '05 #14


Fredrik Lundh wrote:
Joerg Schuster wrote:
I just want to use more than 100 capturing groups.


define "more" (101, 200, 1000, 100000, ... ?)

</F>


The Zero-One-Infinity Rule:

http://www.catb.org/~esr/jargon/html...nity-Rule.html

Iain

Oct 25 '05 #15

"Joerg Schuster" <jo************@gmail.com> writes:
but what is the reason for so many capturing groups? I
imagine that coding this and keeping the code maintainable is a huge effort.


User input is compiled to regular expressions. The user does not have
to worry about those groups.


And what is the problem with something like getopt, optparse, etc.?

And 10K groups on a single input line?

--
Jorge Godoy <go***@ieee.org>
Oct 25 '05 #16

On Tue, 25 Oct 2005 05:17:52 -0700, Iain King wrote:

Fredrik Lundh wrote:
Joerg Schuster wrote:
> I just want to use more than 100 capturing groups.


define "more" (101, 200, 1000, 100000, ... ?)

</F>


The Zero-One-Infinity Rule:

http://www.catb.org/~esr/jargon/html...nity-Rule.html

Nice in principle, not always practical. Sometimes the choice is, "do you
want it today with arbitrary limits, or six months from now with bugs
but no limits?"

If assigning arbitrary limits prevents worse problems, well, then go for
the limit. For instance, anyone who has fought browser pop-ups ("close
one window, and ten more open") may have wished that the browser
implemented an arbitrary limit of, say, ten pop ups. Or even zero :-)

Of course, having said that, in those cases, the limit isn't really
arbitrary. In cases of genuinely arbitrary limits, I agree they are
pointless and annoying (as opposed to having a point but still being
annoying).
--
Steven.

Oct 25 '05 #17

You did not quite understand me. I will give you some details:

My program is a compiler for a certain type of linguistic grammars.
I.e. the user gives *grammar files* to my program. When the grammar
files have been compiled, they can be applied to strings (of a certain
language, e.g. English).

In the grammar files, the user does not have to deal with "capturing
groups". He does not even have to deal with regular expressions. He
just writes a grammar of a certain type. My program then compiles the
grammar into a cascade of transducers. Each of the transducers is
internally represented as a pair (REGEX, ACTION), where REGEX is a
Python regular expression and ACTION a Python function. I.e.: The
meaning of the grammar is: For each line of the input string: if REGEX
matches the line, then apply ACTION to it.

On various levels, the user may produce *disjunctions*. At the time
being, these disjunctions are internally represented by something like:

if regex1: action1()
elif regex2: action2()
elif ...
elif regexn: actionn()

It would be nicer (and faster) to have just one regex and run that
action A such that the *capturing group* with name A ("?P<A>...")
matched.

Now, I could of course internally create my very own transducers. But
the re module generates FSAs, and FSAs do part of the work that a
transducer does. So why reinvent the wheel?
Jörg
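The dispatch described above can be sketched in a few lines. The rule names, patterns, and actions here are invented for illustration: several (REGEX, ACTION) pairs are merged into one alternation of named groups, and match.lastgroup selects the action that fired.

```python
import re

# Hypothetical (regex, action) pairs of the kind the compiler emits.
rules = {
    "num":  (r"\d+",       lambda s: ("number", int(s))),
    "word": (r"[A-Za-z]+", lambda s: ("word", s.lower())),
}

# One combined regex: each alternative is a named capturing group.
combined = re.compile("|".join(
    "(?P<%s>%s)" % (name, rx) for name, (rx, _) in rules.items()
))

def apply_rules(line):
    m = combined.match(line)
    if m is None:
        return None
    # m.lastgroup names the alternative that matched.
    return rules[m.lastgroup][1](m.group())

print(apply_rules("123"))    # ('number', 123)
print(apply_rules("Hello"))  # ('word', 'hello')
```

With many disjuncts this replaces the if/elif chain with a single pass of the regex engine, which is the speed-up the poster is after.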

Oct 25 '05 #18


Steven D'Aprano wrote:
On Tue, 25 Oct 2005 05:17:52 -0700, Iain King wrote:

Fredrik Lundh wrote:
Joerg Schuster wrote:

> I just want to use more than 100 capturing groups.

define "more" (101, 200, 1000, 100000, ... ?)

</F>


The Zero-One-Infinity Rule:

http://www.catb.org/~esr/jargon/html...nity-Rule.html

Nice in principle, not always practical. Sometimes the choice is, "do you
want it today with arbitrary limits, or six months from now with bugs
but no limits?"

If assigning arbitrary limits prevents worse problems, well, then go for
the limit. For instance, anyone who has fought browser pop-ups ("close
one window, and ten more open") may have wished that the browser
implemented an arbitrary limit of, say, ten pop ups. Or even zero :-)


Well, exactly. Why limit to ten? The user is either going to want to
see pop-ups, or not. So either limit to 0, or to infinity (and indeed,
this is what most browsers do).
Note the jargon entry defines infinity in this case to mean the largest
possible amount given whatever ram/disk space/processing power you have
available.

Also: These arbitrary limits tend to stem from languages which
predominantly use fixed size arrays - DIM in basic, or malloc in C.
The native python component is the list, which doesn't have a max size,
so these problems should be encountered less in python code; by its
nature, python steers you away from this mistake.

Iain

Oct 25 '05 #19

On Tue, 25 Oct 2005 06:30:35 -0700, Iain King wrote:

Steven D'Aprano wrote:
On Tue, 25 Oct 2005 05:17:52 -0700, Iain King wrote:
>
> Fredrik Lundh wrote:
>> Joerg Schuster wrote:
>>
>> > I just want to use more than 100 capturing groups.
>>
>> define "more" (101, 200, 1000, 100000, ... ?)
>>
>> </F>
>
> The Zero-One-Infinity Rule:
>
> http://www.catb.org/~esr/jargon/html...nity-Rule.html

Nice in principle, not always practical. Sometimes the choice is, "do you
want it today with arbitrary limits, or six months from now with bugs
but no limits?"

If assigning arbitrary limits prevents worse problems, well, then go for
the limit. For instance, anyone who has fought browser pop-ups ("close
one window, and ten more open") may have wished that the browser
implemented an arbitrary limit of, say, ten pop ups. Or even zero :-)


Well, exactly. Why limit to ten? The user is either going to want to
see pop-ups, or not. So either limit to 0, or to infinity (and indeed,
this is what most browsers do).


I haven't been troubled by exponentially increasing numbers of pop up
windows for a long, long time. But consider your question "why limit to
ten?" in a wider context.

Elevators always have a weight limit: the lift will operate up to N
kilograms, and stop at N+1. This limit is, in a sense, quite arbitrary,
since that value of N is well below the breaking point of the elevator
cables. Lift engineers, I'm told, use a safety factor of 10 (if the
cable will carry X kg without breaking, set N = X/10). This safety
factor is obviously arbitrary: a more cautious engineer might use a
factor of 100, or even 1000, while another might choose a factor of 5 or
2 or even 1.1. If engineers followed your advice, they would build lifts
that either carried nothing at all, or accepted as much weight until the
cable stretched and snapped.

Perhaps computer programmers would have fewer buffer overflow security
exploits if they took a leaf out of engineers' book and built in a few
more arbitrary safety factors into their data-handling routines. We can
argue whether 256 bytes is long enough for a URL or not, but I think we
can all agree that 3 MB for a URL is more than any person needs.

When you are creating an iterative solution to a problem, the ending
condition is not always well-specified. Trig functions such as sine and
cosine are an easy case: although they theoretically require an infinite
number of terms to generate an exact answer, the terms will eventually
underflow to zero allowing us to stop the calculation.

But unfortunately that isn't the case for all mathematical calculations.
Often, the terms of our sequence do not converge to zero, due to round-off
error. Our answer cycles backwards and forwards between two or more
floating point approximations, e.g. 1.276805 <-> 1.276804. The developer
must make an arbitrary choice to stop after N iterations, if the answer
has not converged. Zero iterations is clearly pointless. One is useless.
And infinite iterations will simply never return an answer. So an
arbitrary choice for N is the only sensible way out.
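That stopping rule can be sketched with a toy fixed-point iteration (the function and cap here are invented for illustration):

```python
import math

def fixed_point_cos(x0=1.0, tol=1e-15, max_iter=100):
    """Iterate x -> cos(x); stop on convergence *or* after an
    arbitrary-but-necessary cap of max_iter rounds."""
    x = x0
    for _ in range(max_iter):
        nxt = math.cos(x)
        if abs(nxt - x) <= tol:
            return nxt
        x = nxt
    return x  # best effort after max_iter iterations

root = fixed_point_cos()
# The answer satisfies cos(root) == root to within round-off.
assert abs(math.cos(root) - root) < 1e-9
```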

In a database, we might like to associate (say) multiple phone numbers
with a single account. Most good databases will allow you to do that, but
there is still the question of how to collect that information: you have
to provide some sort of user interface. Now, perhaps you are willing to
build some sort of web-based front-end that allows the user to add new
fields, put their phone number in the new field, with no limit. But
perhaps you are also collecting data using paper forms. So you make an
arbitrary choice: leave two (or three, or ten) boxes for phone numbers.

There are many other reasons why you might decide rationally to impose an
arbitrary limit on some process -- arbitrary does not necessarily mean
"for no good reason". Just make sure that the reason is a good one.
--
Steven.

Oct 25 '05 #20


Steven D'Aprano wrote:
[snip]

I think we are arguing at cross-purposes, mainly because the term
'arbitrary' has snuck in. The actual rule:

"Allow none of foo, one of foo, or any number of foo." A rule of
thumb for software design, which instructs one to not place random
limits on the number of instances of a given entity.

Firstly, 'for software design'. Not for field engineers servicing
elevators :)

Second, it's [random], not [arbitrary]. I took your use of arbitrary
to mean much the same thing - a number picked without any real
judgement involved, simply because it was deemed larger than some
assumed maximum size. The rule does not apply to a number selected for
good reason.

I don't think I get your phone record example: Surely you'd have the
client record in a one-to-many relationship with the phone number
records, so there would be (theoretically) no limit?

Your web interface rang a bell though - in GMails contacts info page,
each contact has info stored in sections. Each of these sections
stores a heading, an address, and some fields. It defaults to two
fields, with an add field button. Hitting it a lot, I found this maxed
out at 20 fields per section. You can also add more sections though -
I got bored hitting the add section button once I got to 51 sections
with the button still active. I assume there is some limit to the
number of sections, but I don't know what it is :) GMail is awesome.

Anyway, back to the OP: in this specific case, the cap of 100 groups in
a RE seems random to me, so I think the rule applies.

Also, see "C Programmer's Disease":
http://www.catb.org/~esr/jargon/html...s-Disease.html

Iain

Oct 25 '05 #21

Iain King wrote:
Anyway, back to the OP: in this specific case, the cap of 100 groups in
a RE seems random to me, so I think the rule applies.


perhaps in the "indistinguishable from magic" sense.

if you want to know why 100 is a reasonable and non-random choice, I
suggest checking the RE documentation for "99 groups" and the special
meaning of group 0.

</F>

Oct 26 '05 #22

> if you want to know why 100 is a reasonable and non-random choice, I
suggest checking the RE documentation for "99 groups" and the special
meaning of group 0.


I have read everything I found about Python regular expressions. But I
am not able to understand what you mean. What is so special about 99?

Oct 26 '05 #23

Joerg Schuster wrote:
if you want to know why 100 is a reasonable and non-random choice, I
suggest checking the RE documentation for "99 groups" and the special
meaning of group 0.


I have read everything I found about Python regular expressions. But I
am not able to understand what you mean. What is so special about 99?


it's the largest number that can be written with two decimal digits.

</F>

Oct 26 '05 #24

So what?

Oct 26 '05 #25


Fredrik Lundh wrote:
Iain King wrote:
Anyway, back to the OP: in this specific case, the cap of 100 groups in
a RE seems random to me, so I think the rule applies.


perhaps in the "indistinguishable from magic" sense.

if you want to know why 100 is a reasonable and non-random choice, I
suggest checking the RE documentation for "99 groups" and the special
meaning of group 0.

</F>


Ah, doh! Of course. Oh well then... still, doesn't python's RE
engine support named groups? That would be cumbersome, but would allow
you to go above 100...

Oct 26 '05 #26

* "Iain King" <ia******@gmail.com> wrote:
Ah, doh! Of course. Oh well then... still, doesn't python's RE
engine support named groups? That would be cumbersome, but would allow
you to go above 100...


The named groups are built on top of numbered captures. They are mapped by the
parser and the match instance's group method. The regex matcher itself never
sees these names.
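That mapping is visible on any compiled pattern object (the pattern here is invented for illustration):

```python
import re

p = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")

# The parser records name -> group number; the matcher only sees numbers.
assert dict(p.groupindex) == {"year": 1, "month": 2}

m = p.match("2005-10")
# Looking up by name and by number hits the same capture.
assert m.group("year") == m.group(1) == "2005"
```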

nd
Oct 26 '05 #27

My first test program was far too naive. Evil things do happen. Simply
removing the code that restricts the number of capturing groups to 100
is not a solution.

Oct 26 '05 #28

Joerg Schuster wrote:
So what?


Search in http://docs.python.org/lib/re-syntax.html for "99" and read
the following sentence carefully.

-Peter
Oct 26 '05 #30

D H
Fredrik Lundh wrote:
Joerg Schuster wrote:

if you want to know why 100 is a reasonable and non-random choice, I
suggest checking the RE documentation for "99 groups" and the special
meaning of group 0.


I have read everything I found about Python regular expressions. But I
am not able to understand what you mean. What is so special about 99?

it's the largest number that can be written with two decimal digits.

It's a conflict between python's syntax for regex back references and
octal number literals. Probably wasn't noticed until way too late, and
now it will never change.
Oct 27 '05 #31

> It's a conflict between python's syntax for regex back references and
> octal number literals. Probably wasn't noticed until way too late, and
> now it will never change.


So "reasonable choice" is not a really good description of the
phenomenon.

Oct 27 '05 #32

DH> It's a conflict between python's syntax for regex back references
DH> and octal number literals. Probably wasn't noticed until way too
DH> late, and now it will never change.

I suspect it comes from Perl, since Python's regular expression engine tries
pretty hard to be compatible with Perl's, at least for the basics.

Skip
Oct 27 '05 #33

[DH]
It's a conflict between python's syntax for regex back references
and octal number literals. Probably wasn't noticed until way too
late, and now it will never change.

[sk**@pobox.com]
I suspect it comes from Perl, since Python's regular expression engine tries
pretty hard to be compatible with Perl's, at least for the basics.


"No" to all the above <wink>. The limitation to 99 in backreference
notation was thoroughly discussed on the Python String-SIG at the
time, and it was deliberately not bug-compatible with the Perl of that
time.

In the Perl of that time (no idea what's true now), e.g., \123 in a
regexp was an octal escape if it appeared before or within the 123rd
capturing group, but was a backreference to the 123rd capturing group
if it appeared after the 123rd capturing group. So, yes, two
different instances of "\123" in a single regexp could have different
meanings (meaning chr(83) in one place, and a backreference to group
123 in another, and there's no way to tell the difference without
counting the number of preceding capturing groups).

That's so horridly un-Pythonic that we drew the line there. Nobody
had a sane use case for more than 99 backreferences, so "who cares?"
won.

Note that this isn't a reason for limiting the number of capturing
groups. It only accounts for why we didn't care that you couldn't
write a _backreference_ to a capturing group higher than number 99
using "\nnn" notation.
Oct 27 '05 #34
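The rule Python settled on, as described in the post above, can be checked directly in today's re module:

```python
import re

# A backslash followed by one or two digits is a backreference ...
assert re.match(r"(a)\1", "aa")

# ... while three octal digits always denote a character:
# \101 is chr(0o101) == "A", never a reference to group 101.
assert re.match(r"\101", "A")
```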

This discussion thread is closed

Replies have been disabled for this discussion.