Regular Expressions

Geoff Hill

What's the way to go about learning Python's regular expressions? I feel
like such an idiot - being so strong in a programming language but knowing
nothing about RE.

Feb 10 '07 #1

Paul Rubin

"Geoff Hill" <th*************@gmail.com> writes:

What's the way to go about learning Python's regular expressions? I feel
like such an idiot - being so strong in a programming language but knowing
nothing about RE.

Read the documentation?

Feb 10 '07 #2

John Machin

On Feb 11, 10:26 am, "Geoff Hill" <thegeoffmeis...@gmail.com> wrote:

What's the way to go about learning Python's regular expressions? I feel
like such an idiot - being so strong in a programming language but knowing
nothing about RE.

I suggest that you work through the re HOWTO
http://www.amk.ca/python/howto/regex/
and by work through, I don't mean "read". I mean as each new concept
is introduced:
1. try the given example(s) yourself at the interactive prompt
2. try variations on the examples
3. read the relevant part of the Library Reference Manual

Also I'd suggest reading threads in this newsgroup where people are
asking for help with re.

HTH,
John

Feb 11 '07 #3

Paul Rubin

"John Machin" <sj******@lexicon.net> writes:

What's the way to go about learning Python's regular expressions? I feel
like such an idiot - being so strong in a programming language but knowing
nothing about RE.

I suggest that you work through the re HOWTO
http://www.amk.ca/python/howto/regex/

Also remember Zawinski's law:
http://fishbowl.pastiche.org/2003/08...ar_expressions

Feb 11 '07 #4

gregarican

On Feb 10, 6:26 pm, "Geoff Hill" <thegeoffmeis...@gmail.com> wrote:

What's the way to go about learning Python's regular expressions? I feel
like such an idiot - being so strong in a programming language but knowing
nothing about RE.

I highly recommend reading the book "Mastering Regular Expressions,"
which I believe is published by O'Reilly. It's a great reference and
helps peel the onion in terms of working through RE. They are a
language unto themselves. A fun brain exercise.

Feb 11 '07 #5

Shawn Milo

On 10 Feb 2007 18:58:51 -0800, gregarican <gr*********@gmail.com> wrote:

On Feb 10, 6:26 pm, "Geoff Hill" <thegeoffmeis...@gmail.com> wrote:
What's the way to go about learning Python's regular expressions? I feel
like such an idiot - being so strong in a programming language but knowing
nothing about RE.

I highly recommend reading the book "Mastering Regular Expressions,"
which I believe is published by O'Reilly. It's a great reference and
helps peel the onion in terms of working through RE. They are a
language unto themselves. A fun brain exercise.

Absolutely: Get "Mastering Regular Expressions" by Jeffrey Friedl. Not
only is it easy to read, but you'll get a lot of mileage out of
regexes in general. Grep, Perl one-liners, Python, and other tools use
regexes, and you'll find that they are really clever little creatures
once you befriend a few of them.

Shawn

Feb 11 '07 #6

Geoff Hill

Thanks. O'Reilly is the way I learned Python, and I'm surprised that I didn't
think of a book by them earlier.

Feb 11 '07 #7

Steve Holden

Geoff Hill wrote:

What's the way to go about learning Python's regular expressions? I feel
like such an idiot - being so strong in a programming language but knowing
nothing about RE.

In fact that's a pretty smart stance. A quote attributed variously to
Tim Peters and Jamie Zawinski says "Some people, when confronted with a
problem, think 'I know, I'll use regular expressions.' Now they have two
problems."

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden
Blog of Note: http://holdenweb.blogspot.com
See you at PyCon? http://us.pycon.org/TX2007

Feb 11 '07 #8

Steven D'Aprano

On Sun, 11 Feb 2007 07:05:30 +0000, Steve Holden wrote:

Geoff Hill wrote:
What's the way to go about learning Python's regular expressions? I feel
like such an idiot - being so strong in a programming language but knowing
nothing about RE.

In fact that's a pretty smart stance.

That's a little harsh -- regexes have their place, together with pointer
arithmetic, bit manipulations, reverse polish notation and goto. The
problem is when people use them inappropriately e.g. using a regex when a
simple string.find will do.
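
A minimal sketch of that comparison, assuming Python 3, where the old string.find function is spelled as the str.find method. The sample text and variable names are made up for illustration:

```python
import re

text = "Some people, when confronted with a problem, reach for a regex."

# Plain substring search: no pattern syntax to escape or get wrong.
index_plain = text.find("problem")

# The regex equivalent. re.escape() is needed whenever the needle might
# contain metacharacters such as '.', '(' or '*'.
match = re.search(re.escape("problem"), text)
index_regex = match.start() if match else -1

# Both approaches report the same position; find() returns -1 on a miss.
print(index_plain == index_regex)
print(text.find("no such substring"))
```

Both searches locate the same offset, but the string method needs no escaping and no match object.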

A quote attributed variously to
Tim Peters and Jamie Zawinski says "Some people, when confronted with a
problem, think 'I know, I'll use regular expressions.' Now they have two
problems."

I believe that is correctly attributed to Jamie Zawinski.
--
Steven

Feb 11 '07 #9

John Machin

On Feb 11, 9:25 pm, Steven D'Aprano
<s...@REMOVE.THIS.cybersource.com.au> wrote:

On Sun, 11 Feb 2007 07:05:30 +0000, Steve Holden wrote:
Geoff Hill wrote:
What's the way to go about learning Python's regular expressions? I feel
like such an idiot - being so strong in a programming language but knowing
nothing about RE.

In fact that's a pretty smart stance.

That's a little harsh -- regexes have their place, together with pointer
arithmetic, bit manipulations, reverse polish notation and goto. The
problem is when people use them inappropriately e.g. using a regex when a
simple string.find will do.

Thanks for the tip-off, Steve and Steven. Looks like I'll have to
start hiding my 12C (datecode 2214) with its "GTO" button under the
loose floor-board whenever I hear a knock at the door ;-) Looks like
Agner Fog's gone a million, and there'll be a special place in hell
for people who combine regexes with bit manipulation, like Navarro &
Raffinot. And we won't even mention Heikki Hy,*7g^54d3j+__=

Feb 11 '07 #10

James Stroud

gregarican wrote:

On Feb 10, 6:26 pm, "Geoff Hill" <thegeoffmeis...@gmail.com> wrote:
What's the way to go about learning Python's regular expressions? I feel
like such an idiot - being so strong in a programming language but knowing
nothing about RE.

I highly recommend reading the book "Mastering Regular Expressions,"
which I believe is published by O'Reilly. It's a great reference and
helps peel the onion in terms of working through RE. They are a
language unto themselves. A fun brain exercise.

There is no real mention of python in this book, but the first edition
is probably the best programming book I've ever read (excepting, perhaps
Text Processing in Python by Mertz.) Well, come to think of it, check
the latter book out. It has a great chapter on Python Regex. And it's
free to download.

James

Feb 11 '07 #11

deviantbunnylord

That's a little harsh -- regexes have their place, together with pointer
arithmetic, bit manipulations, reverse polish notation and goto. The
problem is when people use them inappropriately e.g. using a regex when a
simple string.find will do.

A quote attributed variously to
Tim Peters and Jamie Zawinski says "Some people, when confronted with a
problem, think 'I know, I'll use regular expressions.' Now they have two
problems."

I believe that is correctly attributed to Jamie Zawinski.

--
Steven

So as a newbie, I have to ask. I've played with the re module now for
a while, I think regular expressions are super fun and useful. As far
as them being a problem I found they can be tricky and sometimes the
regex's I've devised do unexpected things...(which I can think of two
instances where that unexpected thing was something that I had hoped
to get into further down the line, yay for me!). So I guess I don't
really understand why they are a "bad idea" to use. I don't know of
any other way yet to parse specific data out of a text, html, or xml
file without resorting to regular expressions.
What other ways are there?

Feb 11 '07 #12

skip

jwz> Some people, when confronted with a problem, think 'I know, I'll
jwz> use regular expressions.' Now they have two problems.

dbl> So as a newbie, I have to ask.... So I guess I don't really
dbl> understand why they are a "bad idea" to use.

Regular expressions are fine in their place; however, you can get carried
away. For example:

http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html

Skip

Feb 11 '07 #13

Gabriel Genellina

On Sun, 11 Feb 2007 13:35:26 -0300, de**************@gmail.com
<de**************@gmail.com> wrote:

(Steven?)
That's a little harsh -- regexes have their place, together with pointer
arithmetic, bit manipulations, reverse polish notation and goto. The
problem is when people use them inappropriately e.g. using a regex when a
simple string.find will do.

So as a newbie, I have to ask. I've played with the re module now for
a while, I think regular expressions are super fun and useful. As far
as them being a problem I found they can be tricky and sometimes the
regex's I've devised do unexpected things...(which I can think of two
instances where that unexpected thing was something that I had hoped
to get into further down the line, yay for me!). So I guess I don't
really understand why they are a "bad idea" to use. I don't know of
any other way yet to parse specific data out of a text, html, or xml
file without resorting to regular expressions.
What other ways are there?

For very simple things, it's easier/faster to use string methods like find
or split. For example, splitting "2007-02-11" into y,m,d parts:
y,m,d = date.split("-")
is a lot faster than matching "(\d+)-(\d+)-(\d+)"
On the other hand, complex tasks like parsing an HTML/XML document,
*can't* be done with a regexp alone; but people insist anyway, and then
complain when it doesn't work as expected, and ask how to "fix" the
regexp...
Good usage of regexps maybe goes in the middle.
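
The date example can be written out both ways. This is only a sketch, assuming Python 3; the variable names are illustrative, and the speed claim is not measured here (the standard timeit module could be used to check it on a given machine):

```python
import re

date = "2007-02-11"

# String method: split on the literal separator.
y, m, d = date.split("-")

# Regex version of the same extraction, using the pattern quoted above.
match = re.match(r"(\d+)-(\d+)-(\d+)", date)
ry, rm, rd = match.groups()

# Both produce the same three strings.
print((y, m, d) == (ry, rm, rd))
```

The regex version earns its keep only when the input needs validating or the separators vary; for a fixed format, split says the same thing more directly.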

--
Gabriel Genellina

Feb 11 '07 #14

John Machin

On Feb 12, 3:35 am, "deviantbunnyl...@gmail.com"
<deviantbunnyl...@gmail.com> wrote:

That's a little harsh -- regexes have their place, together with pointer
arithmetic, bit manipulations, reverse polish notation and goto. The
problem is when people use them inappropriately e.g. using a regex when a
simple string.find will do.

A quote attributed variously to
Tim Peters and Jamie Zawinski says "Some people, when confronted with a
problem, think 'I know, I'll use regular expressions.' Now they have two
problems."

I believe that is correctly attributed to Jamie Zawinski.

--
Steven

So as a newbie, I have to ask. I've played with the re module now for
a while, I think regular expressions are super fun and useful. As far
as them being a problem I found they can be tricky and sometimes the
regex's I've devised do unexpected things...(which I can think of two
instances where that unexpected thing was something that I had hoped
to get into further down the line, yay for me!). So I guess I don't
really understand why they are a "bad idea" to use.

Regexes are not "bad". However people tend to overuse them, whether
they are overkill (like Gabriel's date-splitting example) or underkill
-- see your next sentence :-)

I don't know of
any other way yet to parse specific data out of a text, html, or xml
file without resorting to regular expressions.
What other ways are there?

Text: Paul McGuire's pyparsing module (Google is your friend); read
David Mertz's book on text processing with Python (free download, I
believe); modules for specific data formats e.g. csv

HTML: htmllib and HTMLParser (both in the Python library),
BeautifulSoup (again GIYF)

XML: xml.* in the Python library. ElementTree (recommended) is
included in Python 2.5; use xml.etree.cElementTree.
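
As a rough illustration of two of these options, here is a sketch assuming Python 3, where the module is imported as xml.etree.ElementTree (the separate cElementTree name mentioned above belongs to older releases). The sample data is invented for the example:

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data, made up for illustration.
xml_text = "<menu><item price='3'>spam</item><item price='5'>eggs</item></menu>"
csv_text = 'name,comment\nspam,"lovely, wonderful"\n'

# XML: let the parser build a tree, then navigate it.
root = ET.fromstring(xml_text)
items = [(item.text, item.get("price")) for item in root.findall("item")]

# CSV: the csv module copes with quoting and embedded commas.
rows = list(csv.DictReader(io.StringIO(csv_text)))

print(items)
print(rows[0]["comment"])
```

Neither extraction needs a hand-written pattern, and the quoted comma in the CSV field is handled by the module rather than by the caller.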

HTH,
John

Feb 11 '07 #15

Steve Holden

de**************@gmail.com wrote:

That's a little harsh -- regexes have their place, together with pointer
arithmetic, bit manipulations, reverse polish notation and goto. The
problem is when people use them inappropriately e.g. using a regex when a
simple string.find will do.

A quote attributed variously to
Tim Peters and Jamie Zawinski says "Some people, when confronted with a
problem, think 'I know, I'll use regular expressions.' Now they have two
problems."

I believe that is correctly attributed to Jamie Zawinski.

--
Steven

So as a newbie, I have to ask. I've played with the re module now for
a while, I think regular expressions are super fun and useful. As far
as them being a problem I found they can be tricky and sometimes the
regex's I've devised do unexpected things...(which I can think of two
instances where that unexpected thing was something that I had hoped
to get into further down the line, yay for me!). So I guess I don't
really understand why they are a "bad idea" to use. I don't know of
any other way yet to parse specific data out of a text, html, or xml
file without resorting to regular expressions.
What other ways are there?

Re's aren't inherently bad. Just avoid using them as a hammer to the
extent that all your problems look like nails.

They wouldn't exist if there weren't problems it was appropriate to use
them on. Just try to use simpler techniques first.

For example, don't use re's to find out if a string starts with a
specific substring when you could instead use the .startswith() string
method.
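
A small sketch of that advice, assuming Python 3; the sample line and prefixes are hypothetical:

```python
import re

line = "Subject: Regular Expressions"

# Regex approach: works, but the reader has to decode a pattern.
via_regex = re.match(r"Subject:", line) is not None

# String method: says exactly what it does.
via_method = line.startswith("Subject:")

# startswith() also accepts a tuple of candidate prefixes.
is_header = line.startswith(("Subject:", "From:"))

print(via_regex, via_method, is_header)
```

All three checks agree here; the string method simply states the intent without any pattern syntax.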

regards
Steve

Feb 11 '07 #16

deviantbunnylord

HTML: htmllib and HTMLParser (both in the Python library),
BeautifulSoup (again GIYF)

XML: xml.* in the Python library. ElementTree (recommended) is
included in Python 2.5; use xml.etree.cElementTree.

The source of HTMLParser and xmllib use regular expressions for
parsing out the data. htmllib calls sgmllib at the beginning of its
code--sgmllib starts off with a bunch of regular expressions used to
parse data. So the only real difference there I see is that someone
saved me the work of writing them ;0). I haven't looked at the source
for Beautiful Soup, though I have the sneaking suspicion that most
processing of html/xml is all based on regex's.

Feb 12 '07 #17

John Machin

On Feb 12, 9:20 pm, "deviantbunnyl...@gmail.com"
<deviantbunnyl...@gmail.com> wrote:

HTML: htmllib and HTMLParser (both in the Python library),
BeautifulSoup (again GIYF)

XML: xml.* in the Python library. ElementTree (recommended) is
included in Python 2.5; use xml.etree.cElementTree.

The source of HTMLParser and xmllib use regular expressions for
parsing out the data. htmllib calls sgmllib at the beginning of its
code--sgmllib starts off with a bunch of regular expressions used to
parse data. So the only real difference there I see is that someone
saved me the work of writing them ;0). I haven't looked at the source
for Beautiful Soup, though I have the sneaking suspicion that most
processing of html/xml is all based on regex's.

That's right. Those modules use regexes. You don't. You call functions
& classes in the modules.

Someone has written those modules and tested them and documented them
and they've had a fair old thrashing by quite a few people over the
years -- it may be the only difference in your way of thinking but
it's quite a large difference from you opening up the re docs and
getting stuck in single-handedly :-)

Feb 12 '07 #18

Neil Cerutti

On 2007-02-10, Geoff Hill <th*************@gmail.com> wrote:

What's the way to go about learning Python's regular
expressions? I feel like such an idiot - being so strong in a
programming language but knowing nothing about RE.

A great way to learn regular expressions is to implement them.

--
Neil Cerutti

Feb 12 '07 #19

skip

dbl> The source of HTMLParser and xmllib use regular expressions for
dbl> parsing out the data. htmllib calls sgmllib at the beginning of its
dbl> code--sgmllib starts off with a bunch of regular expressions used
dbl> to parse data.

I am almost certain those modules use regular expressions for lexical
analysis (splitting the input byte stream into "words"), not for parsing
(extracting the structure of the "sentences").

If I have a simple expression:

(7 + 3.14) * CONST

that's just a stream of bytes, "(", "7", " ", "+", ... Lexical analysis
chunks that stream of bytes into the "words" of the language:

LPAREN (NUMBER, 7) PLUS (NUMBER, 3.14) RPAREN TIMES (IDENT, "CONST")

Parsing then constructs a higher level representation of that stream of
"words" (more commonly called tokens or lexemes). That representation is
application-dependent.

Regular expressions are ideal for lexical analysis. They are not-so-hot for
parsing unless the grammar of the language being parsed is *extremely*
simple.
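
The lexical-analysis step can be sketched with the re module alone. This is one possible tokenizer, assuming Python 3; the token names follow the ones used above, while SKIP, MISMATCH, and the tokenize function are hypothetical additions for the example:

```python
import re

# One named group per token kind; order matters for the alternation.
TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("PLUS", r"\+"),
    ("TIMES", r"\*"),
    ("SKIP", r"\s+"),      # whitespace, discarded
    ("MISMATCH", r"."),    # anything else is an error
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)

def tokenize(text):
    """Chunk text into (kind, value) tokens; no structure is inferred."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        kind = match.lastgroup
        value = match.group()
        if kind == "SKIP":
            continue
        if kind == "MISMATCH":
            raise ValueError(f"unexpected character {value!r}")
        if kind == "NUMBER":
            value = float(value) if "." in value else int(value)
        tokens.append((kind, value))
    return tokens

print(tokenize("(7 + 3.14) * CONST"))
```

The result is a flat list of tokens. Deciding that the parenthesised sum is the left operand of the multiplication is the parser's job, and nothing in the pattern above attempts it.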

Here are a couple much better expositions on the topics:

http://en.wikipedia.org/wiki/Lexical_analysis
http://en.wikipedia.org/wiki/Parsing

Skip

Feb 12 '07 #20

Gabriel Genellina

On Mon, 12 Feb 2007 07:20:11 -0300, de**************@gmail.com
<de**************@gmail.com> wrote:

The source of HTMLParser and xmllib use regular expressions for
parsing out the data. htmllib calls sgmllib at the beginning of its
code--sgmllib starts off with a bunch of regular expressions used to
parse data. So the only real difference there I see is that someone
saved me the work of writing them ;0). I haven't looked at the source
for Beautiful Soup, though I have the sneaking suspicion that most
processing of html/xml is all based on regex's.

You can build a parser for SGML/HTML/XML documents using regexps AND
python code. You can't do that with regexps only.
For example, suppose you work hard to build a correct regexp for matching
an opening <a> tag. You extract this from the document: "<a href='xxx'>".
Is it actually an <a> tag? Maybe. But the text could be inside a comment.
Or in a CDATA section. Or inside javascript code. Or...
A regexp is good for recognizing tokens, and this can be used to build a
parser. But regular expressions alone can't parse this kind of document,
because its grammar is not regular.
(Python's re engine is stronger than "mathematical" regular expressions, in
the sense that it can handle things like backreferences (?P=...) and
lookahead (?=...) but anyway it can't handle HTML)
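
The comment case can be demonstrated in a few lines. This is a sketch assuming Python 3, where the HTMLParser class lives in html.parser; the sample document, the pattern, and the LinkCollector class are made up for the example:

```python
import re
from html.parser import HTMLParser

# Hypothetical document: one link is commented out, one is live.
html_doc = """
<!-- <a href='old.html'>retired link</a> -->
<p><a href='new.html'>current link</a></p>
"""

# A regex sees both, because it has no notion of "inside a comment".
regex_links = re.findall(r"<a\s+href='([^']*)'", html_doc)

class LinkCollector(HTMLParser):
    """Collects href values from real <a> start tags only."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")

collector = LinkCollector()
collector.feed(html_doc)
collector.close()

print(regex_links)      # includes the commented-out link
print(collector.links)  # only the live link
```

The parser tracks where comments begin and end, so the commented-out tag never reaches handle_starttag, while the bare pattern reports it as if it were live markup.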

--
Gabriel Genellina

Feb 12 '07 #21
