Freeze problem with Regular Expression

Hi All,
the following regular expression match seems to enter an infinite
loop:

################
import re
text = ' MSX INTERNATIONAL HOLDINGS ITALIA srl (di seguito MSX ITALIA) una '
re.findall('[^A-Z|0-9]*((?:[0-9]*[A-Z]+[0-9|a-z|\-]*)+\s*[a-z]*\s*(?:[0-9]*[A-Z]+[0-9|a-z|\-]*\s*)*)([^A-Z]*)$', text)
#################

No problem with perl with the same expression:

#################
$s = ' MSX INTERNATIONAL HOLDINGS ITALIA srl (di seguito MSX ITALIA) una ';
$s =~ /[^A-Z|0-9]*((?:[0-9]*[A-Z]+[0-9|a-z|\-]*)+\s*[a-z]*\s*(?:[0-9]*[A-Z]+[0-9|a-z|\-]*\s*)*)([^A-Z]*)$/;
print $1;
#################

I have Python 2.5.2 on Ubuntu 8.04.
Any ideas?
Thanks!

--
Kirk
Jun 27 '08 #1
9 Replies

what are you trying to do?
Jun 27 '08 #2

It locks up on 2.5.2 on Windows also. Probably too much recursion going on.
What's with the |'s in [0-9|a-z|\-]? Inside a character class, '|' is a
literal character, not an 'or' operator. I think you meant to say either
'[0-9a-z\-]' or '[0-9a-z\-|]'.
Jun 27 '08 #3

On Wednesday 25 June 2008 18:40:08, cirfu wrote:
> what are you trying to do?

That is indeed the right question.

Whatever the implementation or language, something like that may happen to
work, but I doubt you'll find anyone who can tell you whether it *should*
work or not; my brain-embedded parser goes into an infinite loop on it too...

That said, "[0-9|a-z|\-]" is strange in itself: a pipe (|) between square
brackets is the literal character '|', so there is no reason for it to
appear twice.

Very complicated regexps are always evil, and two- or three-stage filtering
is likely to do the job with good, or at least better, readability.

But once more, what are you trying to do? It is not even clear that regexp
matching is the best tool for it.

--
_____________

Maric Michaud
Jun 27 '08 #4

On Jun 26, 1:20 am, Kirk <nore...@yahoo.com> wrote:
> Hi All,
> the following regular expression matching seems to enter in a infinite
> loop:
>
> ################
> import re
> text = ' MSX INTERNATIONAL HOLDINGS ITALIA srl (di seguito MSX ITALIA) una '
> re.findall('[^A-Z|0-9]*((?:[0-9]*[A-Z]+[0-9|a-z|\-]*)+\s*[a-z]*\s*(?:[0-9]*[A-Z]+[0-9|a-z|\-]*\s*)*)([^A-Z]*)$', text)
> #################

[expletives deleted]

> I've python 2.5.2 on Ubuntu 8.04.
> any idea?

Several problems:
(1) lose the vertical bars (as advised by others)
(2) ALWAYS use a raw string for regexes; your \s* will match on
lower-case 's', not on spaces
(3) why are you using findall on a pattern that ends in "$"?
(4) using non-verbose regexes of that length means you haven't got a
petrol drum's hope in hell of understanding what's going on
(5) too many variable-length patterns, will take a finite (but very
long) time to evaluate
(6) as remarked by others, you haven't said what you are trying to do;
what it actually is doing doesn't look sensible (see below).
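
As a side note on point (3), here is a minimal sketch of why findall adds little when the pattern is anchored with "$": every match has to end at the end of the string, so a single search gives the same information. The pattern and sample string below are made up for illustration and are much simpler than the one in the post.

```python
import re

# Hypothetical, much simpler pattern: last run of capitals before the end.
pattern = re.compile(r'([A-Z]+)\s*$')
sample = 'MSX ITALIA  '

all_matches = pattern.findall(sample)
first_match = pattern.search(sample).group(1)

print(all_matches)  # ['ITALIA']
print(first_match)  # ITALIA
```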

The following code is the result of fixing problems 1, 2, 3 and 4:

C:\junk>type infinitere.py
import re
text = ' MSX INTERNATIONAL HOLDINGS ITALIA srl (di seguito MSX ITALIA) una '
regex0 = r"""
    [^A-Z0-9]*          # match leading space
    (
        (?:
            [0-9]*      # match nothing
            [A-Z]+      # match "MSX"
            [0-9a-z\-]* # match nothing
        )+              # match "MSX"
        \s*             # match " "
        [a-z]*          # match nothing
        \s*             # match nothing
        (?:
            [0-9]*
            [A-Z]+
            [0-9a-z\-]*
            \s*
        )*              # match "INTERNATIONAL HOLDINGS ITALIA "
    )
    ([^A-Z]*)           # match "srl (di seguito "
    """
regex1 = regex0 + "$"
for rxno, rx in enumerate([regex0, regex1]):
    mobj = re.compile(rx, re.VERBOSE).match(text)
    if mobj:
        print rxno, mobj.groups()
    else:
        print rxno, "failed"

C:\junk>infinitere.py
0 ('MSX INTERNATIONAL HOLDINGS ITALIA ', 'srl (di seguito ')
### taking a long time, interrupted

HTH,
John
Jun 27 '08 #5

On Jun 26, 8:29 am, John Machin <sjmac...@lexicon.net> wrote:
> (2) ALWAYS use a raw string for regexes; your \s* will match on
> lower-case 's', not on spaces

and should have written:

(2) ALWAYS use a raw string for regexes. <<<=== Big fat full stop
aka period.

but he was at the time only half-way through the first cup of coffee
for the day :-)
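
A short illustration of why the raw-string habit matters, offered as a sketch rather than anything from the original posts. In a normal string literal, \s happens to survive because Python leaves unrecognised escapes alone, but \b does not: it becomes a backspace character. The sample text 'a cat sat' is made up.

```python
import re

# '\b' in a normal string literal is the backspace character (one char);
# r'\b' is a backslash plus 'b', which the regex engine reads as a word boundary.
plain = '\bcat\b'
raw = r'\bcat\b'

print(len(plain), len(raw))            # 5 7
print(re.findall(plain, 'a cat sat'))  # []
print(re.findall(raw, 'a cat sat'))    # ['cat']
```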
Jun 27 '08 #6

If it will help some smarter person identify the problem, it can
be simplified to this:

re.findall('[^X]*((?:0*X+0*)+\s*a*\s*(?:0*X+0*\s*)*)([^X]*)$',
"XXXXXXXXXXXXXXXXX (X" )

This doesn't actually hang, it just takes a long time. The
time taken increases quickly as the chain of X's gets longer.
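
To watch the growth for yourself, something along these lines should do. It is only a sketch: time_findall is a made-up helper name, the chain lengths are arbitrary, and the timings depend on your machine and Python version, so none are quoted here.

```python
import re
import time

# The simplified pattern from above, written as a raw string.
PATTERN = re.compile(r'[^X]*((?:0*X+0*)+\s*a*\s*(?:0*X+0*\s*)*)([^X]*)$')


def time_findall(n):
    """Run findall on n X's followed by ' (X'; return (seconds, result)."""
    subject = 'X' * n + ' (X'
    start = time.perf_counter()
    result = PATTERN.findall(subject)
    return time.perf_counter() - start, result


if __name__ == '__main__':
    # Keep n small: each extra X makes the failed attempts much more expensive.
    for n in (8, 10, 12, 14):
        seconds, result = time_findall(n)
        print(n, round(seconds, 4), result)
```

The only match starts at the space after the chain of X's, so every attempt that starts on one of the X's has to fail first, and the engine tries the many ways of splitting the chain between the repeated groups before giving up.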

HTH

--
To email me, substitute nowhere->spamcop, invalid->net.
Jun 27 '08 #7

On Wed, 25 Jun 2008 15:29:38 -0700, John Machin wrote:
> Several problems:

Ciao John (and all participating in this thread),
first of all, I'm sorry for the delay, but I was away on business.

> (1) lose the vertical bars (as advised by others)
> (2) ALWAYS use a raw string for regexes; your \s* will match on
> lower-case 's', not on spaces

Right! Thanks!

> (3) why are you using findall on a pattern that ends in "$"?

Yes, you are right. I started with a different need, and it changed
over time...

> (6) as remarked by others, you haven't said what you are trying to do;

I reply here to all of you on this point: that's not important,
although I very much appreciate your suggestions!
My point was 'something that works in Perl has problems in Python'.
In this respect, I thank Peter for his analysis.
Perl probably uses a different pattern-matching algorithm.

Thanks again to all of you!

Bye!

--
Kirk
Jun 30 '08 #8

On Jul 1, 12:45 am, Kirk <nore...@yahoo.com> wrote:
> > (6) as remarked by others, you haven't said what you are trying to do;
>
> I reply here to all of you about such point: that's not important,
> although I appreciate very much your suggestions!
> My point was 'something that works in Perl, has problems in Python'.

It *is* important; our point was 'you didn't define "works", and it
was near-impossible (without transcribing your regex into verbose
mode) to guess at what you suppose it might do sometimes'.
Jun 30 '08 #9

On Mon, 30 Jun 2008 13:43:22 -0700, John Machin wrote:
>> I reply here to all of you about such point: that's not important,
>> although I appreciate very much your suggestions! My point was
>> 'something that works in Perl, has problems in Python'.
>
> It *is* important; our point was 'you didn't define "works", and it was

ok...

> near-impossible (without transcribing your regex into verbose mode) to
> guess at what you suppose it might do sometimes'.

Fine: it's supposed to terminate! :-)

Do you think that hanging is *admissible* behavior? Couldn't we learn
something from the Perl implementation?

This is my point.

Bye

--
Kirk
Jul 1 '08 #10
