Python regular expressions just ain't PCRE

I'm kind of disappointed with the re regular expressions module. In
particular, the lack of support for recursion ( (?R) or (?n) ) is a
major drawback to me. There are so many great things that can be
accomplished with regular expressions this way, such as validating a
mathematical expression or parsing a language with nested parens,
quoting or expressions.
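
As a rough illustration of the nested-parens case: without `(?R)`, the check has to be done in ordinary Python. The sketch below is only one way to do it; `is_balanced` and `supports_recursion` are hypothetical helper names, not part of `re`. The second function just probes whether the installed `re` accepts the `(?R)` syntax.

```python
import re

def is_balanced(text):
    """Return True if every '(' in text has a matching ')' (hypothetical helper)."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:      # a ')' appeared before its '('
                return False
    return depth == 0

def supports_recursion():
    """Probe whether the re module accepts the PCRE-style (?R) construct."""
    try:
        re.compile(r"\((?:[^()]|(?R))*\)")
    except re.error:
        return False
    return True
```

With PCRE-style recursion the whole check would fit in the single pattern passed to `re.compile` above; here the depth counter does the same job in a few more lines.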

Another feature I'm missing is once-only subpatterns and possessive
quantifiers ( (?>...) and ?+ *+ ++ {...}+ ) which are great to avoid
deep recursion and inefficiency in some complex patterns with nested
quantifiers. Even java.util.regex supports them.
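
For readers unfamiliar with once-only subpatterns, a commonly used workaround in engines that lack `(?>...)` is a capturing lookahead followed by a backreference. The sketch below assumes that lookaheads are not re-entered on backtracking, which is how `re` behaves; the names `backtracking`, `once_only` and `matches` are made up for the example.

```python
import re

# Plain greedy quantifier: a+ can give back one 'a' so the final 'a' matches.
backtracking = re.compile(r"a+a")

# Emulated once-only subpattern: the lookahead captures a+ greedily and the
# backreference \1 consumes exactly that text. The engine does not retry the
# lookahead with a shorter capture, so nothing is left for the final 'a'.
once_only = re.compile(r"(?=(a+))\1a")

def matches(pattern, text):
    """Return True if pattern matches the whole of text."""
    return pattern.fullmatch(text) is not None
```

Here `matches(backtracking, "aaa")` succeeds while `matches(once_only, "aaa")` fails, which is the behaviour `(?>a+)a` or `a++a` would give. Newer Python releases (3.11 and later) accept `(?>...)` and possessive quantifiers directly.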

Are there any plans to support these features in re? These would be
great features for Python 2.6, they wouldn't clutter anything, and
they'd mean one less reason left to use Perl instead of Python.

Note: I know there are LALR parser generators/parsers for Python, but
the very reason why re exists is to provide a much simpler, more
productive way to parse or validate simple languages and process text.
(The pyparse/yappy/yapps/<insert your favourite Python parser
generator here> argument could have been used to skip regular
expression support in the language, or to deprecate re. Would you want
that? And following the same rule, why would we have Python when
there's C?)

May 5 '07 #1

"Wiseman" <Wi*********@gmail.com> wrote in message
news:11*********************@e65g2000hsc.googlegroups.com...
| I'm kind of disappointed with the re regular expressions module.

I believe the current Python re module was written to replace the Python
wrapping of pcre in order to support unicode.

| In particular, the lack of support for recursion ( (?R) or (?n) ) is a
| major drawback to me.

I don't remember those being in the pcre Python once had. Perhaps they are
new.

|Are there any plans to support these features in re?

I have not seen any. You would have to ask the author. But I suspect that
this would be a non-trivial project outside his needs.

tjr

May 5 '07 #2
In <11*********************@e65g2000hsc.googlegroups.com>, Wiseman wrote:
> Note: I know there are LALR parser generators/parsers for Python, but
> the very reason why re exists is to provide a much simpler, more
> productive way to parse or validate simple languages and process text.
> (The pyparse/yappy/yapps/<insert your favourite Python parser
> generator here> argument could have been used to skip regular
> expression support in the language, or to deprecate re. Would you want
> that? And following the same rule, why would we have Python when
> there's C?)

I don't follow your reasoning here. `re` is useful for matching tokens
for a higher level parser and C is useful for writing parts that need
hardware access or "raw speed" where pure Python is too slow.

Regular expressions can become very unreadable compared to Python source
code or EBNF grammars but modeling the tokens in EBNF or Python objects
isn't as compact and readable as simple regular expressions. So both `re`
and higher level parsers are useful together and don't supersede each
other.
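
A minimal sketch of that division of labour, assuming a toy arithmetic language: one regular expression recognises the tokens and a small hand-written recursive-descent parser handles the nesting. All names here (`TOKEN`, `tokenize`, `evaluate`) are invented for the example.

```python
import re

# One regex describes the tokens; the nesting is handled by ordinary Python.
TOKEN = re.compile(r"\s*(\d+|[-+*/()])")

def tokenize(text):
    """Split text into number and operator tokens, rejecting anything else."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError("unexpected character at position %d" % pos)
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

def evaluate(text):
    """Evaluate +, -, *, / and parentheses over integers (illustrative only)."""
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = peek()
        pos += 1
        return tok

    def atom():
        tok = take()
        if tok == "(":
            value = expr()
            if take() != ")":
                raise ValueError("missing ')'")
            return value
        if tok is None or not tok.isdigit():
            raise ValueError("expected a number, got %r" % (tok,))
        return int(tok)

    def term():
        value = atom()
        while peek() in ("*", "/"):
            if take() == "*":
                value *= atom()
            else:
                value /= atom()
        return value

    def expr():
        value = term()
        while peek() in ("+", "-"):
            if take() == "+":
                value += term()
            else:
                value -= term()
        return value

    result = expr()
    if peek() is not None:
        raise ValueError("unexpected token %r" % (peek(),))
    return result
```

For example, `evaluate("2 * (3 + 4)")` walks the parenthesised part recursively, while the regex never has to know about nesting at all.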

The same holds for C and Python. IMHO.

Ciao,
Marc 'BlackJack' Rintsch
May 5 '07 #3
On May 5, 5:12 am, "Terry Reedy" <tjre...@udel.edu> wrote:
> I believe the current Python re module was written to replace the Python
> wrapping of pcre in order to support unicode.

I don't know how PCRE was back then, but right now it supports UTF-8
Unicode patterns and strings, and Unicode character properties. Maybe
it could be reintroduced into Python?

> I don't remember those being in the pcre Python once had. Perhaps they are
> new.

At least today, PCRE supports recursion and recursion check,
possessive quantifiers and once-only subpatterns (disables
backtracking in a subpattern), callouts (user functions to call at
given points), and other interesting, powerful features.

May 5 '07 #4
On May 5, 7:19 am, Marc 'BlackJack' Rintsch <bj_...@gmx.net> wrote:
> In <1178323901.381993.47...@e65g2000hsc.googlegroups.com>, Wiseman wrote:
>> Note: I know there are LALR parser generators/parsers for Python, but
>> the very reason why re exists is to provide a much simpler, more
>> productive way to parse or validate simple languages and process text.
>> (The pyparse/yappy/yapps/<insert your favourite Python parser
>> generator here> argument could have been used to skip regular
>> expression support in the language, or to deprecate re. Would you want
>> that? And following the same rule, why would we have Python when
>> there's C?)
>
> I don't follow your reasoning here. `re` is useful for matching tokens
> for a higher level parser and C is useful for writing parts that need
> hardware access or "raw speed" where pure Python is too slow.
>
> Regular expressions can become very unreadable compared to Python source
> code or EBNF grammars but modeling the tokens in EBNF or Python objects
> isn't as compact and readable as simple regular expressions. So both `re`
> and higher level parsers are useful together and don't supersede each
> other.
>
> The same holds for C and Python. IMHO.
>
> Ciao,
> Marc 'BlackJack' Rintsch

Sure, they don't supersede each other and they don't need to. My point
was that the more things you can do with regexes (not really regular
expressions anymore), the better, as long as they are powerful enough
for what you need to accomplish and they don't become a giant
Perl-style hack, of course. Regular expressions are a built-in,
standard feature of Python, they are much faster to use and write
than Python code or some LALR parser definition, and they are more
generally known and understood. You aren't going to parse a
programming language with a regex, but you can save a lot of time if
you can parse simple, but not so simple, languages with them. Regular
expressions offer a productive alternative to full-fledged parsers for
the cases where you don't need them. So saying that if you want feature X
or feature Y in regular expressions you should use a Bison-like parser
sounds a bit like an excuse, because the very reason why regular
expressions like these exist is to avoid using big, complex parsers
for simple cases. As an analogy, I mentioned Python vs. C: you want to
develop in high-level languages because they are simpler and more
productive than working with C, even if you can do anything with the
latter.

May 5 '07 #5
On Sat, May 05, 2007 at 08:52:15AM -0700, Wiseman wrote:
>> I believe the current Python re module was written to replace the Python
>> wrapping of pcre in order to support unicode.
>
> I don't know how PCRE was back then, but right now it supports UTF-8
> Unicode patterns and strings, and Unicode character properties. Maybe
> it could be reintroduced into Python?

I would say this is a case for "rough consensus and working code". With
something as big and ugly[1] as a regexp library, I think the "working
code" part will be the hard part.

So, if you have a patch, there's a decent chance such a thing would be
adopted.

I'm not sure what your skill level is, but I would suggest studying the
code, starting in on a patch for one or more of these features, and then
corresponding with the module's maintainers to improve your patch to the
point where it can be accepted.

Dustin
May 5 '07 #6
> Are there any plans to support these features in re?

This question is impossible to answer. I don't have such
plans, and I don't know of any, but how could I speak for
the hundreds of contributors to Python world-wide, including
those future contributors who haven't contributed *yet*?

Do you have plans for such features in re?

Regards,
Martin
May 5 '07 #7
Wiseman wrote:
> I'm kind of disappointed with the re regular expressions module. In
> particular, the lack of support for recursion ( (?R) or (?n) ) is a
> major drawback to me. There are so many great things that can be
> accomplished with regular expressions this way, such as validating a
> mathematical expression or parsing a language with nested parens,
> quoting or expressions.

-1 on this from me. In the past 10 years as a professional
programmer, I've used the weird extended "regex" features maybe 5
times total, whether it be in Perl or Python. In contrast, I've had
to work around the slowness of PCRE-style engines by forking off a
grep() or something similar practically every other month. I think
it'd be far more valuable for most programmers if Python moved toward
dropping the extended semantics so that something like one of the efficient
regex libraries (linked in a recent thread here on comp.lang.python)
could be used, and then added a parsing library to the standard
library for more complex jobs. Alternatively, if the additional
memory used isn't huge we could consider having more intelligence in
the re compiler and having it choose between a smarter PCRE engine or
a faster regex engine based on the input. The latter is something I'm
playing with a patch for that I hope to get into a useful state for
discussion soon.

But regexes are one area where speed very often makes the difference
between whether they're usable or not, and that's far more often been
a limitation for me--and I'd think for most programmers--than any lack
in their current Python semantics. So I'd rather see that attacked
first.

May 5 '07 #8
On May 6, 1:52 am, Wiseman <Wiseman1...@gmail.com> wrote:
> On May 5, 5:12 am, "Terry Reedy" <tjre...@udel.edu> wrote:
>> I believe the current Python re module was written to replace the Python
>> wrapping of pcre in order to support unicode.
>
> I don't know how PCRE was back then, but right now it supports UTF-8
> Unicode patterns and strings, and Unicode character properties. Maybe
> it could be reintroduced into Python?

"UTF-8 Unicode" is meaningless. Python has internal unicode string
objects, with comprehensive support for converting to/from str (8-bit)
string objects. The re module supports unicode patterns and strings.
PCRE "supports" patterns and strings which are encoded in UTF-8. This
is quite different, a kludge, incomparable. Operations which inspect/
modify UTF-8-encoded data are of interest only to folk who are
constrained to use a language which has nothing resembling a proper
unicode datatype.
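
To make the distinction concrete, here is a small sketch of how `re` treats a Unicode string versus the same text encoded as UTF-8 bytes. It assumes Python 3, where `str` is the Unicode type and `bytes` plays the role of the 8-bit string; it says nothing about how PCRE's UTF-8 mode behaves, only about the two data types in `re`. The variable names are made up for the example.

```python
import re

word = "na\u00efve"                 # "naive" with a diaeresis, as Unicode text
encoded = word.encode("utf-8")      # the same text as UTF-8 bytes

# On a Unicode string, '.' steps over whole characters.
chars = re.findall(r".", word)

# On the encoded bytes, '.' steps over single bytes, so the accented
# letter is seen as two separate units.
units = re.findall(rb".", encoded)

# \w knows about non-ASCII letters only when matching Unicode strings.
unicode_word = re.fullmatch(r"\w+", word)
bytes_word = re.fullmatch(rb"\w+", encoded)
```

After running this, `chars` has five items and `units` has six, and only `unicode_word` is a match object.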

> At least today, PCRE supports recursion and recursion check,
> possessive quantifiers and once-only subpatterns (disables
> backtracking in a subpattern), callouts (user functions to call at
> given points), and other interesting, powerful features.

The more features are put into a regular expression module, the more
difficult it is to maintain and the more the patterns look like line
noise.

There's also the YAGNI factor; most folk would restrict using regular
expressions to simple grep-like functionality and data validation --
e.g. re.match("[A-Z][A-Z]?[0-9]{6}[0-9A]$", idno). The few who want to
recognise yet another little language tend to reach for parsers, using
regular expressions only in the lexing phase.
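
The validation case quoted above can be wrapped up as follows. This is only a sketch: the pattern is the one from the post, while `ID_PATTERN`, `is_valid_id` and the sample ID values are invented for illustration.

```python
import re

# Pattern quoted from the post: one or two capital letters, six digits,
# then a final digit or the letter 'A'.
ID_PATTERN = re.compile(r"[A-Z][A-Z]?[0-9]{6}[0-9A]$")

def is_valid_id(idno):
    """Return True if idno matches the pattern from its first character."""
    return ID_PATTERN.match(idno) is not None
```

For instance, `is_valid_id("AB1234567")` and `is_valid_id("A123456A")` are accepted, while `is_valid_id("ABC1234567")` is rejected because a third letter appears where a digit is expected.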

If you really want to have PCRE functionality in Python, you have a
few options:
(1) create a wrapper for PCRE using e.g. SWIG or pyrex or hand-
crafting
(2) write a PEP, get it agreed, and add the functionality to the re
module
(3) wait until someone does (1) or (2) for free
(4) fund someone to do (1) or (2)

HTH,
John

May 5 '07 #9
On May 5, 6:28 pm, dus...@v.igoro.us wrote:
> I'm not sure what your skill level is, but I would suggest studying the
> code, starting in on a patch for one or more of these features, and then
> corresponding with the module's maintainers to improve your patch to the
> point where it can be accepted.

I'll consider creating a new PCRE module for Python that uses the
latest version of the PCRE library. It'll depend on my time availability,
but I can write Python extensions. I haven't used PCRE in a long time,
and I recall it was a bit of a hassle, but I could get it done.

May 6 '07 #10
