Help needed: cryptic perl regular expression in python syntax

Hi there,

I have a Perl script that uses a dynamically
constructed regular expression in this way:

------perl code starts ----
$result = "";
$key = 'AAA\?01';
$key = quotemeta $key;
$line = ' s^\?AAA\?01^BBB^g; #Comment ';
if ($line =~ /(^\s*)(s|tr)(.)(\\?\??$key\??)\3(.*?)\3(.*)/) {
$result = $5;
}

# $result should be "BBB"
# \3 gets the same value as returned by (.),
# which in this example is ^. So we are searching for the
# parameter delimited by the first two ^ signs
# and returning the one delimited by the second
# and third ^ signs. Note that using \3 in the regular
# expression allows separators other than the ^ sign.

------perl code stops ----

How can I construct an equivalent Python regular expression?

I have tested with a constant regular expression like this:
line = ' s^\\?AAA\\?01^BBB^g; #Comment '
r1 = "(^\s*)(s|tr)(.)(\\\\\?\\\??AAA\\\\\?01)"
re.compile(r1).findall(line)

[(' ', 's', '^', '\\?AAA\\?01')]

Which is fine, but is there a way to join 3 raw strings
together into another raw string? Like:

r1 = r'''(^\s*)(s|tr)(.)(\\?\??'''
r2 = r'''\\?\??)\3(.*?)\3(.*)'''
p1 = r1 + key + r2 # p1 should remain raw string too

-pekka-
Jul 18 '05 #1
>>>>> "pekka" == pekka niiranen <pe************@wlanmail.com> writes:

pekka> Which is fine, but is there a way to join 3 raw strings
pekka> together into another raw strings? like:

pekka> r1 = r'''(^\s*)(s|tr)(.)(\\?\??'''
pekka> r2 = r'''\\?\??)\3(.*?)\3(.*)'''
pekka> p1 = r1 + key + r2 # p1 should remain raw string too

The term "raw string" only has significance with string literals -
every string object is a "raw string". Backslashes are only
interpreted when converting string literals to in-memory string
objects.
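
To make that point concrete, here is a minimal sketch in Python 3 (the thread itself predates it). The names `plain`, `raw`, `prefix`, `suffix` and `pattern` are made up for illustration. It shows that a raw literal and an ordinary literal spelling the same characters give equal `str` objects, and that concatenation only joins characters without re-reading backslashes:

```python
# A raw literal and an ordinary literal that spell the same characters
# produce equal str objects; nothing about the object is "raw".
plain = "\\d+\\s"
raw = r"\d+\s"
assert plain == raw
assert type(plain) is type(raw) is str

# Concatenation never re-interprets backslashes; it only joins characters.
key = r"AAA\?01"        # 7 characters: A A A \ ? 0 1
prefix = r"(\\?\??"
suffix = r"\??)"
pattern = prefix + key + suffix

print(len(key))
print(pattern)
```

So `r1 + key + r2` needs no special treatment: the result is the same string whether the parts were written as raw literals or not.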

--
Ville Vainio http://tinyurl.com/2prnb
Jul 18 '05 #2
On 2004-10-19, pekka niiranen <pe************@wlanmail.com> wrote:
Which is fine, but is there a way to join 3 raw strings
together into another raw strings? like:

r1 = r'''(^\s*)(s|tr)(.)(\\?\??'''
r2 = r'''\\?\??)\3(.*?)\3(.*)'''
p1 = r1 + key + r2 # p1 should remain raw string too


If I understand correctly, there are no raw strings, just raw string
literals. re.compile just takes a normal string.

Raw string literals just make it easier to write the strings that are
typically used for regular expressions, but the strings themselves
are just ordinary strings.
>>> s1="\\b"
>>> s2=r"\b"
>>> s1==s2
1
>>> s1
'\\b'
>>> s2
'\\b'
>>> print s1
\b
>>> print s2
\b


--
Antoon Pardon
Jul 18 '05 #3
Thanks,

I managed to solve my problem with code like this:
line = ' s^\\?AAA\\?01^BBB^g; #Comment '
r1 = '(^\\s*)(s|tr)(.)(\\\\\\?\\\\??'
key = "AAA\?01"
r2 = '\\\\??)\\3(.*?)\\3(.*)'
r = r1 + re.escape(key) + r2
re.compile(r).findall(line)
[(' ', 's', '^', '\\?AAA\\?01', 'BBB', 'g; #Comment ')]

but what an ugly piece of code...

I was hoping to do without the excess backslashes by using re.escape(),
but to no avail, since the group reference '\3' gets misquoted (among other things):
r2 = "\??)\3(.*?)\3(.*)/)"
re.escape(r2)
'\\\\\\?\\?\\)\\\x03\\(\\.\\*\\?\\)\\\x03\\(\\.\\*\\)\\/\\)'
-pekka-
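
The `\x03` in that output comes from the string literal, not from re.escape(): in an ordinary literal, `"\3"` is an octal escape for a single control character, so the backreference is already gone before any regex code sees it. A minimal sketch in Python 3, assuming re.escape() is applied only to the literal key and the rest of the pattern is written as raw literals (variable names are illustrative):

```python
import re

# In an ordinary literal, "\3" is an octal escape for the single character chr(3).
assert "\3" == "\x03" and len("\3") == 1
# In a raw literal, r"\3" is a backslash followed by "3", which re reads
# as a backreference to group 3.
assert r"\3" == "\\3" and len(r"\3") == 2

key = r"AAA\?01"
escaped_key = re.escape(key)            # escape only the literal text
assert re.fullmatch(escaped_key, key)   # the escaped key matches itself literally

# Regex syntax stays in raw literals; only the key goes through re.escape().
pattern = r"(^\s*)(s|tr)(.)(\\?\??" + escaped_key + r"\??)\3(.*?)\3(.*)"
line = r" s^\?AAA\?01^BBB^g; #Comment "
match = re.match(pattern, line)
print(match.group(5))
```

re.escape() is meant for text that should match literally, so passing it a fragment that contains regex syntax (groups, quantifiers, backreferences) will always neutralize that syntax.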

Jul 18 '05 #4
"Steven Bethard" <st************@gmail.com> wrote in message
news:ma**************************************@python.org...
Could you do something like:
line = ' s^\\?AAA\\?01^BBB^g; #Comment '
expr = r'(^\s*)(s|tr)(.)(\\\?%s)\3(.*?)\3(.*)'
matcher = re.compile(expr % re.escape("AAA\?01"))
matcher.findall(line)
[(' ', 's', '^', '\\?AAA\\?01', 'BBB', 'g; #Comment ')]

Basically, I still use the r'' string so that I don't have to write so
many backslashes, but then I use a %s to insert the "AAA\?01" into the
middle of the expression. Looks at least a little cleaner to me.

Steve


Here's a more verbose version of Steve Bethard's suggestion. By building
up the regexp from individual parts, it is possible to give each part some
semi-meaningful name, or to attach comments to individual pieces. It also
makes it easier to maintain later. What if you had to support an additional
command besides s and tr, like 'rep'? Just change replaceCmd to read
replaceCmd = r'(s|tr|rep)'. What if you needed to support leading tabs
in addition to leading spaces? Change leadingWhite as needed. For
that matter, just giving the finished regexp the name 'replaceCmdExpr'
gives the reader more of a clue as to what the regexp's purpose is,
as the original code did with extra comments.

I find nearly *all* regexps to be cryptic, and when I need them, I
usually assemble them in some fashion such as this. David Mertz
proposes a similar style in his very good book, "Text Processing
in Python."

(Some quibble with the practice of aligning '=' signs, but I find it to be a
helpful guide to the eye when declaring a set of related strings such as
these, assuming of course that one edits using a fixed-width font.)

So why does the key get prepended with the backslashes and
question marks?

-- Paul
(I'll bet you thought I'd post a pyparsing version. :) Well, in a
certain way, I did.)
import re

line = ' s^\\?AAA\\?01^BBB^g; #Comment '

r1 = r'(^\s*)(s|tr)(.)(\\\?\\??'
key = r'AAA\?01'
r2 = r'\\??)\3(.*?)\3(.*)'
r = r1 + re.escape(key) + r2
print(re.compile(r).findall(line))

# desired regexp, from Steve Bethard's post
# r'(^\s*)(s|tr)(.)(\\\?%s)\3(.*?)\3(.*)'

# build up regexp by parts
key            = r'AAA\?01'
leadingWhite   = r'(^\s*)'
replaceCmd     = r'(s|tr)'
sepChar        = r'(.)'
# prepend \'s and ?'s, only the OP knows why...
findString     = r'(\\\?\\??%s)' % re.escape(key)
# sepCharRef references the char read by sepChar,
# to support separators other than '^'
sepCharRef     = r'\3'
replString     = r'(.*?)'
restOfLine     = r'(.*)'
replaceCmdExpr = leadingWhite + replaceCmd + \
                 sepChar + findString + sepCharRef + \
                 replString + sepCharRef + restOfLine

matcher = re.compile(replaceCmdExpr)
print(matcher.findall(line))
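
A variation on the same idea, offered only as a sketch: Python's re.VERBOSE flag and named groups let the comments live inside the pattern, and `(?P=sep)` replaces the numbered `\3` backreference. The helper name `compile_replace_cmd` and the group names are hypothetical, and the prefix `\\?\??` follows the original Perl pattern rather than the translation used above. It assumes Python 3.7 or later, where re.escape() also escapes spaces and `#`, which keeps the key safe inside a verbose pattern.

```python
import re

def compile_replace_cmd(key):
    """Hypothetical helper: compile the s/tr line pattern for one literal key."""
    escaped = re.escape(key)
    return re.compile(
        rf"""
        ^(?P<indent>\s*)                # leading whitespace
        (?P<cmd>s|tr)                   # replace command
        (?P<sep>.)                      # separator character, '^' in the example
        (?P<find>\\?\??{escaped}\??)    # optional backslash and '?', then the key
        (?P=sep)                        # the same separator again
        (?P<repl>.*?)                   # replacement text
        (?P=sep)
        (?P<rest>.*)                    # flags and trailing comment
        """,
        re.VERBOSE,
    )

line = r" s^\?AAA\?01^BBB^g; #Comment "
match = compile_replace_cmd(r"AAA\?01").match(line)
print(match.group("repl"))
```

Named groups also mean that adding or removing a group later does not silently change which number the separator backreference points at.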

Jul 18 '05 #5
