Python regular expressions just ain't PCRE

Wiseman

I'm kind of disappointed with the re regular expressions module. In
particular, the lack of support for recursion ( (?R) or (?n) ) is a
major drawback to me. There are so many great things that can be
accomplished with regular expressions this way, such as validating a
mathematical expression or parsing a language with nested parens,
quoting or expressions.
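
For comparison, here is a minimal sketch of the nested-parens check done in plain Python, since re has no (?R). The function name is_balanced and its pairs parameter are made up for this illustration. In PCRE the usual recursive pattern for the same job is along the lines of \((?:[^()]++|(?R))*\).

```python
def is_balanced(text, pairs=("()", "[]", "{}")):
    """Return True if every bracket in text is closed in the right order."""
    closers = {p[1]: p[0] for p in pairs}   # maps ')' -> '(' and so on
    openers = set(closers.values())
    stack = []
    for ch in text:
        if ch in openers:
            stack.append(ch)
        elif ch in closers:
            # A closer must match the most recent unclosed opener.
            if not stack or stack.pop() != closers[ch]:
                return False
    # Anything left on the stack was opened but never closed.
    return not stack


print(is_balanced("(a + (b * c))"))  # True
print(is_balanced("(a + b))"))       # False
print(is_balanced("([)]"))           # False
```

This only checks nesting; it does not validate what sits between the brackets, which is where a recursive pattern or a real parser earns its keep.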

Another feature I'm missing is once-only subpatterns and possessive
quantifiers ( (?>...) and ?+ *+ ++ {...}+ ) which are great to avoid
deep recursion and inefficiency in some complex patterns with nested
quantifiers. Even java.util.regex supports them.
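
A once-only subpattern can be approximated in re with a lookahead plus a backreference, because re does not backtrack into a lookahead once it has succeeded. The sketch below assumes nothing beyond the stdlib; the names greedy and atomic are only labels for this example. (Much later, Python 3.11 added (?>...) and possessive quantifiers to re itself.)

```python
import re

# Plain greedy quantifier: \d+ can give back a digit so the final \d matches.
greedy = re.compile(r"\d+\d")

# Emulated once-only subpattern: the lookahead captures the longest run of
# digits, and the backreference \1 then consumes exactly that run. The
# lookahead is not re-entered on failure, so nothing can be given back.
atomic = re.compile(r"(?=(\d+))\1\d")

print(greedy.match("1234") is not None)  # True
print(atomic.match("1234") is not None)  # False
```

The second pattern fails fast instead of trying every shorter split of the digit run, which is the property that makes once-only subpatterns attractive for nested quantifiers.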

Are there any plans to support these features in re? These would be
great features for Python 2.6, they wouldn't clutter anything, and
they'd mean one less reason left to use Perl instead of Python.

Note: I know there are LALR parser generators/parsers for Python, but
the very reason why re exists is to provide a much simpler, more
productive way to parse or validate simple languages and process text.
(The pyparse/yappy/yapps/<insert your favourite Python parser
generator here> argument could have been used to skip regular
expression support in the language, or to deprecate re. Would you want
that? And following the same rule, why would we have Python when
there's C?)

May 5 '07 #1


Terry Reedy

"Wiseman" <Wi*********@gmail.com> wrote in message
news:11*********************@e65g2000hsc.googlegroups.com...
| I'm kind of disappointed with the re regular expressions module.

I believe the current Python re module was written to replace the Python
wrapping of pcre in order to support unicode.

| In particular, the lack of support for recursion ( (?R) or (?n) ) is a
| major drawback to me.

I don't remember those being in the pcre Python once had. Perhaps they are
new.

| Are there any plans to support these features in re?

I have not seen any. You would have to ask the author. But I suspect that
this would be a non-trivial project outside his needs.

tjr

May 5 '07 #2

Marc 'BlackJack' Rintsch

In <11*********************@e65g2000hsc.googlegroups.com>, Wiseman wrote:

> Note: I know there are LALR parser generators/parsers for Python, but
> the very reason why re exists is to provide a much simpler, more
> productive way to parse or validate simple languages and process text.
> (The pyparse/yappy/yapps/<insert your favourite Python parser
> generator here> argument could have been used to skip regular
> expression support in the language, or to deprecate re. Would you want
> that? And following the same rule, why would we have Python when
> there's C?)

I don't follow your reasoning here. `re` is useful for matching tokens
for a higher level parser and C is useful for writing parts that need
hardware access or "raw speed" where pure Python is too slow.

Regular expressions can become very unreadable compared to Python source
code or EBNF grammars but modeling the tokens in EBNF or Python objects
isn't as compact and readable as simple regular expressions. So both `re`
and higher level parsers are useful together and don't supersede each
other.
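
A minimal sketch of that division of labour, assuming a toy arithmetic language: re matches the tokens and a small hand-written recursive-descent parser handles the nesting. All names here (TOKEN_RE, tokenize, evaluate) are invented for the example, and unary minus is deliberately left out to keep it short.

```python
import re

# One regular expression describes every token; the named groups label them.
TOKEN_RE = re.compile(r"\s*(?:(?P<num>\d+(?:\.\d+)?)|(?P<op>[-+*/()]))")


def tokenize(text):
    """Split text into (kind, value) pairs, raising ValueError on junk."""
    text = text.rstrip()
    pos = 0
    tokens = []
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError("unexpected character at offset %d" % pos)
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return tokens


def evaluate(text):
    """Parse and evaluate +, -, *, / with parentheses by recursive descent."""
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else (None, None)

    def advance():
        nonlocal pos
        pos += 1

    def expr():                      # expr := term (('+' | '-') term)*
        value = term()
        while peek() in (("op", "+"), ("op", "-")):
            op = peek()[1]
            advance()
            rhs = term()
            value = value + rhs if op == "+" else value - rhs
        return value

    def term():                      # term := atom (('*' | '/') atom)*
        value = atom()
        while peek() in (("op", "*"), ("op", "/")):
            op = peek()[1]
            advance()
            rhs = atom()
            value = value * rhs if op == "*" else value / rhs
        return value

    def atom():                      # atom := number | '(' expr ')'
        kind, val = peek()
        if kind == "num":
            advance()
            return float(val)
        if (kind, val) == ("op", "("):
            advance()
            value = expr()
            if peek() != ("op", ")"):
                raise ValueError("expected ')'")
            advance()
            return value
        raise ValueError("expected a number or '('")

    result = expr()
    if pos != len(tokens):
        raise ValueError("unexpected trailing input")
    return result


print(evaluate("2 * (3 + 4) - 1"))  # 13.0
```

The regular expression stays short and readable because it only has to describe single tokens; the nesting lives entirely in the three small functions.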

The same holds for C and Python. IMHO.

Ciao,
Marc 'BlackJack' Rintsch

May 5 '07 #3

Wiseman

On May 5, 5:12 am, "Terry Reedy" <tjre...@udel.edu> wrote:

> I believe the current Python re module was written to replace the Python
> wrapping of pcre in order to support unicode.

I don't know how PCRE was back then, but right now it supports UTF-8
Unicode patterns and strings, and Unicode character properties. Maybe
it could be reintroduced into Python?

> I don't remember those being in the pcre Python once had. Perhaps they are
> new.

At least today, PCRE supports recursion and recursion check,
possessive quantifiers and once-only subpatterns (disables
backtracking in a subpattern), callouts (user functions to call at
given points), and other interesting, powerful features.

May 5 '07 #4

Wiseman

On May 5, 7:19 am, Marc 'BlackJack' Rintsch <bj_...@gmx.net> wrote:

> In <1178323901.381993.47...@e65g2000hsc.googlegroups.com>, Wiseman wrote:
> > Note: I know there are LALR parser generators/parsers for Python, but
> > the very reason why re exists is to provide a much simpler, more
> > productive way to parse or validate simple languages and process text.
> > (The pyparse/yappy/yapps/<insert your favourite Python parser
> > generator here> argument could have been used to skip regular
> > expression support in the language, or to deprecate re. Would you want
> > that? And following the same rule, why would we have Python when
> > there's C?)
>
> I don't follow your reasoning here. `re` is useful for matching tokens
> for a higher level parser and C is useful for writing parts that need
> hardware access or "raw speed" where pure Python is too slow.
>
> Regular expressions can become very unreadable compared to Python source
> code or EBNF grammars but modeling the tokens in EBNF or Python objects
> isn't as compact and readable as simple regular expressions. So both `re`
> and higher level parsers are useful together and don't supersede each
> other.
>
> The same holds for C and Python. IMHO.
>
> Ciao,
> Marc 'BlackJack' Rintsch

Sure, they don't supersede each other and they don't need to. My point
was that the more things you can do with regexes (not really regular
expressions anymore), the better, as long as they are powerful enough
for what you need to accomplish and they don't become a giant
Perl-style hack, of course. Regular expressions are a built-in,
standard feature of Python, they are much faster to use and write
than Python code or some LALR parser definition, and they are more
generally known and understood. You aren't going to parse a
programming language with a regex, but you can save a lot of time if
you can parse simple, but not so simple, languages with them. Regular
expressions offer a productive alternative to full-fledged parsers for
the cases where you don't need them. So saying that if you want feature X
or feature Y in regular expressions you should use a Bison-like parser
sounds a bit like an excuse, because the very reason why regular
expressions like these exist is to avoid using big, complex parsers
for simple cases. As an analogy, I mentioned Python vs. C: you want to
develop in high-level languages because they are simpler and more
productive than working with C, even if you can do anything with the
latter.

May 5 '07 #5

dustin

On Sat, May 05, 2007 at 08:52:15AM -0700, Wiseman wrote:

> > I believe the current Python re module was written to replace the Python
> > wrapping of pcre in order to support unicode.
>
> I don't know how PCRE was back then, but right now it supports UTF-8
> Unicode patterns and strings, and Unicode character properties. Maybe
> it could be reintroduced into Python?

I would say this is a case for "rough consensus and working code". With
something as big and ugly[1] as a regexp library, I think the "working
code" part will be the hard part.

So, if you have a patch, there's a decent chance such a thing would be
adopted.

I'm not sure what your skill level is, but I would suggest studying the
code, starting in on a patch for one or more of these features, and then
corresponding with the module's maintainers to improve your patch to the
point where it can be accepted.

Dustin

May 5 '07 #6

Martin v. Löwis

> Are there any plans to support these features in re?

This question is impossible to answer. I don't have such
plans, and I don't know of any, but how could I speak for
the hundreds of contributors to Python world-wide, including
those future contributors who haven't contributed *yet*?

Do you have plans for such features in re?

Regards,
Martin

May 5 '07 #7

sjdevnull

Wiseman wrote:

> I'm kind of disappointed with the re regular expressions module. In
> particular, the lack of support for recursion ( (?R) or (?n) ) is a
> major drawback to me. There are so many great things that can be
> accomplished with regular expressions this way, such as validating a
> mathematical expression or parsing a language with nested parens,
> quoting or expressions.

-1 on this from me. In the past 10 years as a professional
programmer, I've used the weird extended "regex" features maybe 5
times total, whether it be in Perl or Python. In contrast, I've had
to work around the slowness of PCRE-style engines by forking off a
grep() or something similar practically every other month. I think
it'd be far more valuable for most programmers if Python moved toward
dropping the extended semantics so that one of the efficient
regex libraries (linked in a recent thread here on comp.lang.python)
could be used, and then added a parsing library to the standard
library for more complex jobs. Alternatively, if the additional
memory used isn't huge, we could consider having more intelligence in
the re compiler and having it choose between a smarter PCRE engine or
a faster regex engine based on the input. The latter is something I'm
playing with a patch for that I hope to get into a useful state for
discussion soon.

But regexes are one area where speed very often makes the difference
between whether they're usable or not, and that's far more often been
a limitation for me--and I'd think for most programmers--than any lack
in their current Python semantics. So I'd rather see that attacked
first.

May 5 '07 #8

John Machin

On May 6, 1:52 am, Wiseman <Wiseman1...@gmail.com> wrote:

> On May 5, 5:12 am, "Terry Reedy" <tjre...@udel.edu> wrote:
>
> > I believe the current Python re module was written to replace the Python
> > wrapping of pcre in order to support unicode.
>
> I don't know how PCRE was back then, but right now it supports UTF-8
> Unicode patterns and strings, and Unicode character properties. Maybe
> it could be reintroduced into Python?

"UTF-8 Unicode" is meaningless. Python has internal unicode string
objects, with comprehensive support for converting to/from str (8-bit)
string objects. The re module supports unicode patterns and strings.
PCRE "supports" patterns and strings which are encoded in UTF-8. This
is quite different, a kludge, incomparable. Operations which inspect/
modify UTF-8-encoded data are of interest only to folk who are
constrained to use a language which has nothing resembling a proper
unicode datatype.
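
A small sketch of the difference, written against present-day Python 3 (where str is the unicode type and bytes plays the role of the old 8-bit str), so it is an illustration and not what the 2007 interpreter would have printed:

```python
import re

text = "na\u00efve caf\u00e9"   # 'naïve café'

# Pattern and subject are both str: \w follows Unicode character classes.
words = re.findall(r"\w+", text)
print(words)
# ['naïve', 'café']

# Pattern and subject are both UTF-8 bytes: \w only knows ASCII, so the
# two-byte sequences for 'ï' and 'é' break the words apart.
pieces = re.findall(rb"\w+", text.encode("utf-8"))
print(pieces)
# [b'na', b've', b'caf']
```

An engine that is told the bytes are UTF-8 can do better than this, but match offsets still come back in bytes and have to be translated by the caller.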

> At least today, PCRE supports recursion and recursion check,
> possessive quantifiers and once-only subpatterns (disables
> backtracking in a subpattern), callouts (user functions to call at
> given points), and other interesting, powerful features.

The more features are put into a regular expression module, the more
difficult it is to maintain and the more the patterns look like line
noise.

There's also the YAGNI factor; most folk would restrict using regular
expressions to simple grep-like functionality and data validation --
e.g. re.match("[A-Z][A-Z]?[0-9]{6}[0-9A]$", idno). The few who want to
recognise yet another little language tend to reach for parsers, using
regular expressions only in the lexing phase.
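
As a sketch of that kind of validation (the helper name is_valid_id is made up here, and re.fullmatch is a later addition to re than this thread): using fullmatch instead of match plus a trailing $ avoids one small surprise, namely that $ also matches just before a final newline.

```python
import re

# The pattern from the post: one or two capitals, six digits, then a digit or 'A'.
ID_RE = re.compile(r"[A-Z][A-Z]?[0-9]{6}[0-9A]")


def is_valid_id(idno):
    """Return True only if the whole string is a well-formed ID."""
    return ID_RE.fullmatch(idno) is not None


print(is_valid_id("AB1234567"))    # True
print(is_valid_id("X123456A"))     # True
print(is_valid_id("AB1234567\n"))  # False

# With re.match and a trailing '$', the newline-terminated string is accepted,
# because '$' also matches just before a final newline.
print(re.match(r"[A-Z][A-Z]?[0-9]{6}[0-9A]$", "AB1234567\n") is not None)  # True
```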

If you really want to have PCRE functionality in Python, you have a
few options:
(1) create a wrapper for PCRE using e.g. SWIG or pyrex or hand-crafting
(2) write a PEP, get it agreed, and add the functionality to the re
module
(3) wait until someone does (1) or (2) for free
(4) fund someone to do (1) or (2)

HTH,
John

May 5 '07 #9

Wiseman

On May 5, 6:28 pm, dus...@v.igoro.us wrote:

> I'm not sure what your skill level is, but I would suggest studying the
> code, starting in on a patch for one or more of these features, and then
> corresponding with the module's maintainers to improve your patch to the
> point where it can be accepted.

I'll consider creating a new PCRE module for Python that uses the
latest version of the PCRE library. It'll depend on my time availability,
but I can write Python extensions. I haven't used PCRE in a long time,
and I recall it was a bit of a hassle, but I could get it done.

May 6 '07 #10

Wiseman

On May 5, 10:06 pm, "sjdevn...@yahoo.com" <sjdevn...@yahoo.com> wrote:

> -1 on this from me. In the past 10 years as a professional
> programmer, I've used the weird extended "regex" features maybe 5
> times total, whether it be in Perl or Python. In contrast, I've had
> to work around the slowness of PCRE-style engines by forking off a
> grep() or something similar practically every other month.

I use these complex features every month on my job, and performance is
rarely an issue, at least for our particular application of PCRE.

By the way, if you're concerned about performance, you should be
interested in once-only subpatterns.

May 6 '07 #11

Wiseman

On May 5, 10:44 pm, John Machin <sjmac...@lexicon.net> wrote:

> "UTF-8 Unicode" is meaningless. Python has internal unicode string
> objects, with comprehensive support for converting to/from str (8-bit)
> string objects. The re module supports unicode patterns and strings.
> PCRE "supports" patterns and strings which are encoded in UTF-8. This
> is quite different, a kludge, incomparable. Operations which inspect/
> modify UTF-8-encoded data are of interest only to folk who are
> constrained to use a language which has nothing resembling a proper
> unicode datatype.

Sure, I know it's a mediocre support for Unicode for an application,
but we're not talking an application here. If I get the PCRE module
done, I'll just PyArg_ParseTuple(args, "et#", "utf-8", &str, &len),
which will be fine for Python's Unicode support and what PCRE does,
and I won't have to deal with this string at all so I couldn't care
less how it's encoded and if I have proper Unicode support in C or
not. (I'm unsure of how Pyrex or SWIG would treat this so I'll just
hand-craft it. It's not like it would be complex; most of the magic
will be pure C, dealing with PCRE's API.)

> There's also the YAGNI factor; most folk would restrict using regular
> expressions to simple grep-like functionality and data validation --
> e.g. re.match("[A-Z][A-Z]?[0-9]{6}[0-9A]$", idno). The few who want to
> recognise yet another little language tend to reach for parsers, using
> regular expressions only in the lexing phase.

Well, I find these features very useful. I've used a complex, LALR
parser to parse complex grammars, but I've solved many problems with
just the PCRE lib. Either way, seeing that nobody's interested in these
features, I'll see if I can expose PCRE to Python myself; it sounds
like the fairest solution because it doesn't even deal with the re
module - you can do whatever you want with it (though I'd rather have
it stay as it is or enhance it), and I'll still have PCRE. That's if I
find the time to do it though, even having no life.

May 6 '07 #12

Klaas

On May 5, 6:57 pm, Wiseman <Wiseman1...@gmail.com> wrote:

> > There's also the YAGNI factor; most folk would restrict using regular
> > expressions to simple grep-like functionality and data validation --
> > e.g. re.match("[A-Z][A-Z]?[0-9]{6}[0-9A]$", idno). The few who want to
> > recognise yet another little language tend to reach for parsers, using
> > regular expressions only in the lexing phase.
>
> Well, I find these features very useful. I've used a complex, LALR
> parser to parse complex grammars, but I've solved many problems with
> just the PCRE lib. Either way, seeing that nobody's interested in these
> features, I'll see if I can expose PCRE to Python myself; it sounds
> like the fairest solution because it doesn't even deal with the re
> module - you can do whatever you want with it (though I'd rather have
> it stay as it is or enhance it), and I'll still have PCRE. That's if I
> find the time to do it though, even having no life.

A polished wrapper for PCRE would be a great contribution to the
python community. If it becomes popular, then the argument for
replacing the existing re engine becomes much stronger.

-Mike

May 8 '07 #13

John Machin

On May 9, 7:34 am, Klaas <mike.kl...@gmail.com> wrote:

> On May 5, 6:57 pm, Wiseman <Wiseman1...@gmail.com> wrote:
>
> > > There's also the YAGNI factor; most folk would restrict using regular
> > > expressions to simple grep-like functionality and data validation --
> > > e.g. re.match("[A-Z][A-Z]?[0-9]{6}[0-9A]$", idno). The few who want to
> > > recognise yet another little language tend to reach for parsers, using
> > > regular expressions only in the lexing phase.
> >
> > Well, I find these features very useful. I've used a complex, LALR
> > parser to parse complex grammars, but I've solved many problems with
> > just the PCRE lib. Either way, seeing that nobody's interested in these
> > features, I'll see if I can expose PCRE to Python myself; it sounds
> > like the fairest solution because it doesn't even deal with the re
> > module - you can do whatever you want with it (though I'd rather have
> > it stay as it is or enhance it), and I'll still have PCRE. That's if I
> > find the time to do it though, even having no life.
>
> A polished wrapper for PCRE would be a great contribution to the
> python community. If it becomes popular, then the argument for
> replacing the existing re engine becomes much stronger.
>
> -Mike

You seem to be overlooking my point that PCRE's unicode support isn't,
just like the Holy Roman Empire wasn't.

May 8 '07 #14
