regular expression: perl ==> python

Hi,
I am so used to Perl's regular expressions that I find it hard
to memorize the functions in Python, so I would appreciate it if
people could tell me some equivalents.

1) In perl:
$line = "The food is under the bar in the barn.";
if ( $line =~ /foo(.*)bar/ ) { print "got <$1>\n"; }

In Python, I don't know how to do this.
How does one capture the $1? (I know it is \1, but it is still not clear
how I can simply print it.)
Thanks

Jul 18 '05 #1
17 Replies


le*******@yahoo.com wrote:

1) In perl:
$line = "The food is under the bar in the barn.";
if ( $line =~ /foo(.*)bar/ ) { print "got <$1>\n"; }

in python, I don't know how I can do this?


I don't know Perl very well, but I believe this is more or less the
equivalent:
import re
line = "The food is under the bar in the barn."
matcher = re.compile(r'foo(.*)bar')
match = matcher.search(line)
print 'got <%s>' % match.group(1)

got <d is under the bar in the >

Of course, you can do this in fewer lines if you like:
print 'got <%s>' % re.search(r'foo(.*bar)', line).group(1)

got <d is under the bar in the bar>

Steve
Jul 18 '05 #2

<le*******@yahoo.com> wrote:
i am so use to perl's regular expression that i find it hard
to memorize the functions in python; so i would appreciate if
people can tell me some equivalents.

1) In perl:
$line = "The food is under the bar in the barn.";
if ( $line =~ /foo(.*)bar/ ) { print "got <$1>\n"; }

in python, I don't know how I can do this?
How does one capture the $1? (I know it is \1 but it is still not clear
how I can simply print it.


in Python, the RE machinery returns match objects, which have methods
that let you dig out more information about the match. "captured groups"
are available via the "group" method:

m = re.search(..., line)
if m:
    print "got", m.group(1)

see the regex howto (or the RE chapter in the library reference) for more
information:

http://www.amk.ca/python/howto/regex/
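
For anyone reading this on a current interpreter, here is a minimal sketch of the same idea in Python 3 syntax (the thread itself uses Python 2 print statements). It assumes the pattern and sample line from the original question; the variable names are only illustrative.

```python
import re

line = "The food is under the bar in the barn."

# re.search returns a match object on success and None on failure,
# so the result is bound to a name and tested before it is used.
m = re.search(r"foo(.*)bar", line)
if m:
    whole = m.group(0)       # the entire matched text
    captured = m.group(1)    # what Perl would expose as $1
    all_groups = m.groups()  # tuple holding every captured group
    print("got <%s>" % captured)
```

Because `.*` is greedy, the capture runs up to the last "bar" in the line (the one inside "barn"), which is why the captured text still contains the word "bar".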

</F>

Jul 18 '05 #3

JZ
On 21 Dec 2004 21:12:09 -0800, le*******@yahoo.com wrote:
1) In perl:
$line = "The food is under the bar in the barn.";
if ( $line =~ /foo(.*)bar/ ) { print "got <$1>\n"; }

in python, I don't know how I can do this?
How does one capture the $1? (I know it is \1 but it is still not clear
how I can simply print it.
thanks


import re
line = "The food is under the bar in the barn."
if re.search(r'foo(.*)bar',line):
    print 'got %s\n' % _.group(1)

--
JZ ICQ:6712522
http://zabiello.com
Jul 18 '05 #4

"JZ" <wn******@mnovryyb.pbz> wrote:
import re
line = "The food is under the bar in the barn."
if re.search(r'foo(.*)bar',line):
    print 'got %s\n' % _.group(1)


Traceback (most recent call last):
  File "jz.py", line 4, in ?
    print 'got %s\n' % _.group(1)
NameError: name '_' is not defined

</F>

Jul 18 '05 #5

Fredrik Lundh wrote:
"JZ" <wn******@mnovryyb.pbz> wrote:

import re
line = "The food is under the bar in the barn."
if re.search(r'foo(.*)bar',line):
    print 'got %s\n' % _.group(1)

Traceback (most recent call last):
  File "jz.py", line 4, in ?
    print 'got %s\n' % _.group(1)
NameError: name '_' is not defined


He was using the python interactive prompt, which I suspect you already
knew.
Jul 18 '05 #6

JZ
On Wed, 22 Dec 2004 10:27:39 +0100, Fredrik Lundh wrote:
import re
line = "The food is under the bar in the barn."
if re.search(r'foo(.*)bar',line):
    print 'got %s\n' % _.group(1)


Traceback (most recent call last):
  File "jz.py", line 4, in ?
    print 'got %s\n' % _.group(1)
NameError: name '_' is not defined


I forgot to add: I am using Python 2.3.4/Win32 (from ActiveState.com). The
code works in my interpreter.

--
JZ
Jul 18 '05 #7

"JZ" wrote:
import re
line = "The food is under the bar in the barn."
if re.search(r'foo(.*)bar',line):
    print 'got %s\n' % _.group(1)


Traceback (most recent call last):
  File "jz.py", line 4, in ?
    print 'got %s\n' % _.group(1)
NameError: name '_' is not defined


I forgot to add: I am using Python 2.3.4/Win32 (from ActiveState.com). The
code works in my interpreter.


only if you type it into the interactive prompt. see:

http://www.python.org/doc/2.4/tut/no...00000000000000

"In interactive mode, the last printed expression is assigned to the variable _.
This means that when you are using Python as a desk calculator, it is somewhat
easier to continue calculations /.../"

the "_" symbol has no special meaning when you run a Python program, so the
"if re.search" construct won't work.

</F>

Jul 18 '05 #8

JZ
On Wed, 22 Dec 2004 16:55:55 +0100, Fredrik Lundh wrote:
the "_" symbol has no special meaning when you run a Python program,


That's right. So the final code will be:

import re
line = "The food is under the bar in the barn."
found = re.search('foo(.*)bar',line)
if found: print 'got %s\n' % found.group(1)

--
JZ ICQ:6712522
http://zabiello.com
Jul 18 '05 #9

> 1) In perl:
$line = "The food is under the bar in the barn.";
if ( $line =~ /foo(.*)bar/ ) { print "got <$1>\n"; }

in python, I don't know how I can do this?
How does one capture the $1? (I know it is \1 but it is still not clear
how I can simply print it.
thanks

Fredrik Lundh <fr*****@pythonware.com> wrote:
"JZ" <wn******@mnovryyb.pbz> wrote:
import re
line = "The food is under the bar in the barn."
if re.search(r'foo(.*)bar',line):
    print 'got %s\n' % _.group(1)


Traceback (most recent call last):
  File "jz.py", line 4, in ?
    print 'got %s\n' % _.group(1)
NameError: name '_' is not defined


I've found that a slight irritation in python compared to perl - the
fact that you need to create a match object (rather than relying on
the silver thread of $_ (etc) running through your program ;-)

import re
line = "The food is under the bar in the barn."
m = re.search(r'foo(.*)bar',line)
if m:
    print 'got %s\n' % m.group(1)

This becomes particularly irritating when using if, elif etc, to
match a series of regexps, eg

line = "123123"
m = re.search(r'^(\d+)$', line)
if m:
    print "int",int(m.group(1))
else:
    m = re.search(r'^(\d*\.\d*)$', line)
    if m:
        print "float",float(m.group(1))
    else:
        print "unknown thing", line

The indentation keeps growing which looks rather untidy compared to
the perl

$line = "123123";
if ($line =~ /^(\d+)$/) {
    print "int $1\n";
}
elsif ($line =~ /^(\d*\.\d*)$/) {
    print "float $1\n";
}
else {
    print "unknown thing $line\n";
}

Is there an easy way round this? AFAIK you can't assign a variable in
a compound statement, so you can't use elif at all here and hence the
problem?

I suppose you could use a monstrosity like this, which relies on the
fact that list.append() returns None...

line = "123123"
m = []
if m.append(re.search(r'^(\d+)$', line)) or m[-1]:
    print "int",int(m[-1].group(1))
elif m.append(re.search(r'^(\d*\.\d*)$', line)) or m[-1]:
    print "float",float(m[-1].group(1))
else:
    print "unknown thing", line
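
A note for readers on a much later interpreter than the one discussed here: Python 3.8 added assignment expressions (the `:=` operator), which let a match be bound inside the `if`/`elif` test itself. The following is only a sketch under that assumption. `classify` is a hypothetical helper name, and the float pattern is tightened to require a leading digit so that a lone "." is never handed to float().

```python
import re

def classify(line):
    """Return a (label, value) pair, mirroring the if/elsif/else chain in Perl."""
    # Each test both runs the search and binds the match object to m,
    # so the chain stays flat instead of nesting deeper with every pattern.
    if m := re.search(r"^(\d+)$", line):
        return ("int", int(m.group(1)))
    elif m := re.search(r"^(\d+\.\d*)$", line):
        return ("float", float(m.group(1)))
    else:
        return ("unknown thing", line)

print(classify("123123"))
print(classify("123.5"))
print(classify("abc"))
```

On interpreters older than 3.8 this is a syntax error, so the nested form or a loop over patterns is still needed there.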

--
Nick Craig-Wood <ni**@craig-wood.com> -- http://www.craig-wood.com/nick
Jul 18 '05 #10

Nick Craig-Wood wrote:
I've found that a slight irritation in python compared to perl - the
fact that you need to create a match object (rather than relying on
the silver thread of $_ (etc) running through your program ;-)

the old "regex" engine associated the match with the pattern, but that
approach isn't thread safe...

line = "123123"
m = re.search(r'^(\d+)$', line)
if m:
    print "int",int(m.group(1))
else:
    m = re.search(r'^(\d*\.\d*)$', line)
    if m:
        print "float",float(m.group(1))
    else:
        print "unknown thing", line


that's not a very efficient way to match multiple patterns, though. a
much better way is to combine the patterns into a single one, and use
the "lastindex" attribute to figure out which one that matched. see

http://effbot.org/zone/xml-scanner.htm

for more on this topic.
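
To make the "lastindex" suggestion concrete, here is a small sketch in Python 3 syntax that applies it to the int/float example from earlier in the thread. The names `NUMBER`, `ACTIONS`, and `classify` are made up for illustration, and the float pattern assumes at least one leading digit.

```python
import re

# One combined pattern: each alternative is its own capturing group,
# so m.lastindex reports which alternative actually matched.
NUMBER = re.compile(r"^(?:(\d+)|(\d+\.\d*))$")

# Hypothetical lookup table: group number -> (label, converter).
ACTIONS = {1: ("int", int), 2: ("float", float)}

def classify(line):
    m = NUMBER.match(line)
    if m is None:
        return ("unknown thing", line)
    label, convert = ACTIONS[m.lastindex]
    return (label, convert(m.group(m.lastindex)))

print(classify("123123"))
print(classify("123.5"))
print(classify("abc"))
```

This relies on the groups being flat (no capturing groups nested inside the alternatives); with nested groups, lastindex refers to the group that closed last, so the table would need adjusting.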

</F>

Jul 18 '05 #11


Fredrik Lundh wrote:
"JZ" wrote:
> import re
> line = "The food is under the bar in the barn."
> if re.search(r'foo(.*)bar',line):
>     print 'got %s\n' % _.group(1)

Traceback (most recent call last):
  File "jz.py", line 4, in ?
    print 'got %s\n' % _.group(1)
NameError: name '_' is not defined

I forgot to add: I am using Python 2.3.4/Win32 (from ActiveState.com). The
code works in my interpreter.


only if you type it into the interactive prompt. see:


No, it doesn't work at all, anywhere. Did you actually try this?

http://www.python.org/doc/2.4/tut/no...00000000000000
"In interactive mode, the last printed expression is assigned to the variable _. This means that when you are using Python as a desk calculator, it is somewhat easier to continue calculations /.../"


In the 3 lines that are executed before the exception, there are *no*
printed expressions.

Python 2.4 (#60, Nov 30 2004, 11:49:19) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> line = "The food is under the bar in the barn."
>>> if re.search(r'foo(.*)bar',line):
...     print 'got %s\n' % _.group(1)
...
Traceback (most recent call last):
  File "<stdin>", line 2, in ?
NameError: name '_' is not defined
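
The mechanism behind this can be sketched in a script, using Python 3 names (`builtins` rather than Python 2's `__builtin__`). The interactive prompt passes the value of a bare expression statement to `sys.displayhook`, which prints its repr and stores it in `builtins._`; the test of an `if` statement never goes through that hook. Calling `sys.displayhook` by hand below is only a way of simulating what the prompt does.

```python
import builtins
import re
import sys

line = "The food is under the bar in the barn."

# Start from a clean state so the demonstration is repeatable.
if hasattr(builtins, "_"):
    del builtins._

# The test of an `if` statement is not an expression statement, so it is
# never passed to sys.displayhook and `_` stays undefined.
if re.search(r"foo(.*)bar", line):
    underscore_set_by_if = hasattr(builtins, "_")

# A bare expression typed at the prompt is routed through sys.displayhook,
# which prints the repr and saves the value as builtins._ .
sys.displayhook(re.search(r"foo(.*)bar", line))
captured = builtins._.group(1)
print("got <%s>" % captured)
```

So the quoted snippet could only appear to work if an earlier bare `re.search(...)` expression had already left a match object in `_`.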


Jul 18 '05 #12

John Machin wrote:
> I forgot to add: I am using Python 2.3.4/Win32 (from ActiveState.com). The
> code works in my interpreter.


only if you type it into the interactive prompt. see:


No, it doesn't work at all, anywhere. Did you actually try this?


the OP claims that it works in his ActiveState install (PythonWin?). maybe he
played with re.search before typing in the commands he quoted; maybe
PythonWin contains some extra hacks?

as I've illustrated earlier, it definitely doesn't work in a script executed by a standard
Python...

</F>

Jul 18 '05 #13


Fredrik Lundh wrote:
John Machin wrote:

> I forgot to add: I am using Python 2.3.4/Win32 (from ActiveState.com). The
> code works in my interpreter.

only if you type it into the interactive prompt. see:

No, it doesn't work at all, anywhere. Did you actually try this?


the OP claims that it works in his ActiveState install (PythonWin?).

maybe he played with re.search before typing in the commands he quoted; maybe PythonWin contains some extra hacks?

as I've illustrated earlier, it definitely doesn't work in a script executed by a standard Python...

</F>


It is quite possible that the OP played with re.search before
typing in the commands he quoted; however, *you* claimed that it [his
quoted commands] worked "only if you type it into the interactive
prompt". It doesn't work, in the unqualified sense that I understood.

Anyway, enough of punch-ups about how many dunces can angle on the hat
of a pun -- I did appreciate your other posting about multiple patterns
and "lastindex"; thanks.

Jul 18 '05 #14

On 22 Dec 2004 17:30:04 GMT, Nick Craig-Wood <ni**@craig-wood.com> wrote:
Is there an easy way round this? AFAIK you can't assign a variable in
a compound statement, so you can't use elif at all here and hence the
problem?

I suppose you could use a monstrosity like this, which relies on the
fact that list.append() returns None...

line = "123123"
m = []
if m.append(re.search(r'^(\d+)$', line)) or m[-1]:
    print "int",int(m[-1].group(1))
elif m.append(re.search(r'^(\d*\.\d*)$', line)) or m[-1]:
    print "float",float(m[-1].group(1))
else:
    print "unknown thing", line


I wrote a scanner for a recursive descent parser a while back. This is
the pattern I used for handling multiple regexps, instead of using an
if/elif/else chain.

import re
patterns = [
    (re.compile('^(\d+)$'),int),
    (re.compile('^(\d+\.\d*)$'),float),
]

def convert(s):
    for regexp, action in patterns:
        m = regexp.match(s)
        if not m:
            continue
        return action(m.group(1))
    raise ValueError, "Invalid input %r, was not a numeric string" % (s,)

if __name__ == '__main__':
    tests = [ ("123123",123123), ("123.123",123.123), ("123.",123.) ]
    for input, expected in tests:
        assert convert(input) == expected

    try:
        convert('')
        convert('abc')
    except:
        pass
    else:
        assert None,"Should Raise on invalid input"

Of course, I wrote the tests first. I used your regexps, but I was
confused as to why you were always using .group(1); I decided to
leave it. I would probably actually send the entire match object to
the action, using something like:
(re.compile('^(\d+)$'), lambda m: int(m.group(1))),
and
return action(m)

but lambdas are going out of fashion. :(
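
Spelled out in full, the variant that hands the whole match object to the action might look like the sketch below. It is written in Python 3 syntax (note the call form of `raise`), uses raw strings for the patterns, and keeps the names from the code above apart from the upper-cased `PATTERNS`, which is just a naming choice here.

```python
import re

# Each entry pairs a compiled pattern with an action that receives the
# whole match object, so the action decides which groups it cares about.
PATTERNS = [
    (re.compile(r"^(\d+)$"), lambda m: int(m.group(1))),
    (re.compile(r"^(\d+\.\d*)$"), lambda m: float(m.group(1))),
]

def convert(s):
    for regexp, action in PATTERNS:
        m = regexp.match(s)
        if m:
            return action(m)
    raise ValueError("Invalid input %r, was not a numeric string" % (s,))

print(convert("123123"))
print(convert("123."))
```

Invalid inputs such as '' and 'abc' each raise ValueError, and checking them in separate try blocks avoids the first failure hiding the second.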

Stephen Thorne
Jul 18 '05 #15

Fredrik Lundh <fr*****@pythonware.com> wrote:
that's not a very efficient way to match multiple patterns, though. a
much better way is to combine the patterns into a single one, and use
the "lastindex" attribute to figure out which one that matched.

lastindex is useful, yes.

see

http://effbot.org/zone/xml-scanner.htm

for more on this topic.


I take your point. However I don't find the below very readable -
making 5 small regexps into 1 big one, plus a game of count the
brackets doesn't strike me as a huge win...

xml = re.compile(r"""
<([/?!]?\w+) # 1. tags
|&(\#?\w+); # 2. entities
|([^<>&'\"=\s]+) # 3. text strings (no special characters)
|(\s+) # 4. whitespace
|(.) # 5. special characters
""", re.VERBOSE)

It's probably faster though, so I give in gracelessly ;-)

--
Nick Craig-Wood <ni**@craig-wood.com> -- http://www.craig-wood.com/nick
Jul 18 '05 #16

Nick Craig-Wood wrote:
I take your point. However I don't find the below very readable -
making 5 small regexps into 1 big one, plus a game of count the
brackets doesn't strike me as a huge win...


if you're doing that a lot, you might wish to create a helper function.

the undocumented sre.Scanner provides a ready-made mechanism for this
kind of RE matching; see

http://aspn.activestate.com/ASPN/Mai...on-dev/1614344

for some discussion.

here's (a slight variation of) the code example they're talking about:

import sre

def s_ident(scanner, token): return token
def s_operator(scanner, token): return "op%s" % token
def s_float(scanner, token): return float(token)
def s_int(scanner, token): return int(token)

scanner = sre.Scanner([
(r"[a-zA-Z_]\w*", s_ident),
(r"\d+\.\d*", s_float),
(r"\d+", s_int),
(r"=|\+|-|\*|/", s_operator),
(r"\s+", None),
])
print scanner.scan("sum = 3*foo + 312.50 + bar")

(['sum', 'op=', 3, 'op*', 'foo', 'op+', 312.5, 'op+', 'bar'], '')
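
For readers on Python 3, where the `sre` module no longer exists, the same class is reachable as `re.Scanner`. It is still undocumented there, so the sketch below should be read as relying on an implementation detail that could change, not on a supported API.

```python
import re

# Callbacks receive the scanner and the matched text; the return value
# becomes the token. A callback of None means "skip this match".
def s_ident(scanner, token): return token
def s_operator(scanner, token): return "op%s" % token
def s_float(scanner, token): return float(token)
def s_int(scanner, token): return int(token)

scanner = re.Scanner([
    (r"[a-zA-Z_]\w*", s_ident),
    (r"\d+\.\d*", s_float),
    (r"\d+", s_int),
    (r"=|\+|-|\*|/", s_operator),
    (r"\s+", None),
])

# scan() returns the token list plus whatever input it could not match.
tokens, remainder = scanner.scan("sum = 3*foo + 312.50 + bar")
print(tokens, repr(remainder))
```

The float rule is listed before the int rule on purpose: patterns are tried in order, so the reverse order would tokenize "312.50" as the int 312 and then stop at the ".".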

</F>

Jul 18 '05 #17

Fredrik Lundh <fr*****@pythonware.com> wrote:
the undocumented sre.Scanner provides a ready-made mechanism for this
kind of RE matching; see

http://aspn.activestate.com/ASPN/Mai...on-dev/1614344

for some discussion.

here's (a slight variation of) the code example they're talking about:

def s_ident(scanner, token): return token
def s_operator(scanner, token): return "op%s" % token
def s_float(scanner, token): return float(token)
def s_int(scanner, token): return int(token)

scanner = sre.Scanner([
(r"[a-zA-Z_]\w*", s_ident),
(r"\d+\.\d*", s_float),
(r"\d+", s_int),
(r"=|\+|-|\*|/", s_operator),
(r"\s+", None),
])
>>> print scanner.scan("sum = 3*foo + 312.50 + bar")

(['sum', 'op=', 3, 'op*', 'foo', 'op+', 312.5, 'op+', 'bar'],
'')


That is very cool - exactly the kind of problem I come across quite
often!

I've found the online documentation (using pydoc) for re / sre in
general to be a bit lacking.

For instance nowhere in

pydoc sre

does it tell you what methods a match object has (or even what type it
is). To find this out you have to look at the HTML documentation.
This is probably what Windows people look at by default but Unix
hackers like me expect everything (or at least a hint) to be in the
man/pydoc pages.
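
One way to get that hint without leaving the interpreter is to introspect a live match object. A minimal sketch in Python 3 syntax, using only the built-ins `type` and `dir`; the variable names are illustrative:

```python
import re

m = re.search(r"foo(.*)bar", "The food is under the bar in the barn.")

# type() names the class of the match object; dir() lists its attributes
# and methods. Names starting with an underscore are filtered out.
match_type = type(m)
public_names = [name for name in dir(m) if not name.startswith("_")]

print(match_type)
print(public_names)
```

Calling `help(m)` at the prompt pages through the docstrings for those same methods.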

Just noticed in pydoc2.4 a new section

MODULE DOCS
http://www.python.org/doc/current/lib/module-sre.html

Which is at least a hint that you are looking in the wrong place!
....however that page doesn't exist ;-)

--
Nick Craig-Wood <ni**@craig-wood.com> -- http://www.craig-wood.com/nick
Jul 18 '05 #18
