'\\' in regex affects the following parenthesis?

Could someone tell me why:
>>> import re
>>> p = re.compile('\\.*\\(.*)')

Fails with message:

Traceback (most recent call last):
  File "<pyshell#12>", line 1, in <module>
    re.compile('\\dir\\(file)')
  File "C:\Python25\lib\re.py", line 180, in compile
    return _compile(pattern, flags)
  File "C:\Python25\lib\re.py", line 233, in _compile
    raise error, v # invalid expression
error: unbalanced parenthesis

I thought '\\' should just be interpreted as a single '\' and not
affect anything afterwards...

The script 'redemo.py' shipped with Python by default is just fine
about this regex however.

Apr 22 '07 #1
You are getting overlap between Python's string-literal backslash
escaping and re's backslash escaping. In a Python string literal, '\\'
is collapsed down to a single '\' before re ever sees it, so to get your
desired result you would need to double-double every '\', as in:

p = re.compile('\\\\.*\\\\(.*)')

Ugly, no? Fortunately, Python has a special form for string literals,
called "raw" which suppresses Python's processing of \'s for escaping
- I think this was done expressly to help simplify entering re
strings. To use raw format for a string literal, just precede the
opening quotation mark with an r. Here is your original string, using
a raw literal:

p = re.compile(r'\\.*\\(.*)')

This will compile ok.
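
The two layers of escaping described above can be seen by printing the strings that each literal actually produces. This is only a minimal sketch, written in Python 3 syntax (the thread itself uses Python 2.5); the variable names are illustrative and not from the original post.

```python
import re

plain = '\\.*\\(.*)'          # the literal from the original post
doubled = '\\\\.*\\\\(.*)'    # every backslash "double-doubled"
raw = r'\\.*\\(.*)'           # raw literal: backslashes kept as typed

# The plain literal collapses each '\\' to one backslash before re sees it.
print(plain)      # \.*\(.*)
print(doubled)    # \\.*\\(.*)
print(raw)        # \\.*\\(.*)

# The doubled literal and the raw literal are the same string,
# so they compile to the same pattern.
assert doubled == raw
pattern = re.compile(raw)
```

The raw form is usually easier to read because what is typed is what re receives.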

(Sometimes these literals are referred to as "raw strings" - I think
this is confusing because new users think this is a special type of
string, different from str. A raw literal creates the EXACT SAME type of
str; the r just tells the compiler/interpreter to handle the quoted
literal a little differently. So I prefer to call them "raw
literals".)
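
A short check of that point, assuming Python 3 (the names below are made up for illustration): a raw literal and an ordinary literal that spell the same characters give equal objects of the same type.

```python
raw_literal = r'\d+'      # backslash kept as typed
plain_literal = '\\d+'    # backslash escaped by hand

print(type(raw_literal) is str)       # True
print(raw_literal == plain_literal)   # True
```

Nothing about the resulting str records which kind of literal produced it.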

-- Paul

Apr 22 '07 #2
On Apr 22, 9:56 am, vox...@gmail.com wrote:
> Could someone tell me why:
> >>> import re
> >>> p = re.compile('\\.*\\(.*)')

Short answer: *ALWAYS* use raw strings for regexes in Python source
files.

Long answer:

'\\.*\\(.*)' is equivalent to
r'\.*\(.*)'

So what re.compile is seeing is:

\.  -- a literal dot or period or full stop (not a metacharacter)
*   -- meaning 0 or more occurrences of the dot
\(  -- a literal left parenthesis
.   -- dot metacharacter meaning any character bar a newline
*   -- meaning 0 or more occurrences of almost anything
)   -- a right parenthesis grouping metacharacter; a bit lonely, hence
       the exception.

What you probably want is:

\\   -- literal backslash
.*   -- any stuff
\\   -- literal backslash
(.*) -- grouped (any stuff)
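
Both readings can be tried directly. The sketch below assumes Python 3, where the exception class is exposed as re.error; the sample input '\dir\file' is a hypothetical path chosen to resemble the one in the traceback.

```python
import re

# The string re.compile actually received in the original post.
seen_by_re = r'\.*\(.*)'
try:
    re.compile(seen_by_re)
    compiled_ok = True
except re.error as exc:
    compiled_ok = False
    print('compile failed:', exc)

# The pattern that was probably intended, written as a raw literal.
intended = re.compile(r'\\.*\\(.*)')
match = intended.match(r'\dir\file')   # hypothetical sample input
print(match.group(1))                  # file
```

Because the first .* is greedy, the group captures whatever follows the last backslash in the input.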

> Fails with message:
>
> Traceback (most recent call last):
>   File "<pyshell#12>", line 1, in <module>
>     re.compile('\\dir\\(file)')
>   File "C:\Python25\lib\re.py", line 180, in compile
>     return _compile(pattern, flags)
>   File "C:\Python25\lib\re.py", line 233, in _compile
>     raise error, v # invalid expression
> error: unbalanced parenthesis
>
> I thought '\\' should just be interpreted as a single '\' and not
> affect anything afterwards...

The second and third paragraphs of the re docs
(http://docs.python.org/lib/module-re.html) cover this:
"""
Regular expressions use the backslash character ("\") to indicate
special forms or to allow special characters to be used without
invoking their special meaning. This collides with Python's usage of
the same character for the same purpose in string literals; for
example, to match a literal backslash, one might have to write '\\\\'
as the pattern string, because the regular expression must be "\\",
and each backslash must be expressed as "\\" inside a regular Python
string literal.

The solution is to use Python's raw string notation for regular
expression patterns; backslashes are not handled in any special way in
a string literal prefixed with "r". So r"\n" is a two-character string
containing "\" and "n", while "\n" is a one-character string
containing a newline. Usually patterns will be expressed in Python
code using this raw string notation.
"""

Recommended reading: http://www.amk.ca/python/howto/regex...00000000000000

> The script 'redemo.py' shipped with Python by default is just fine
> about this regex however.

That's because you are typing the regex into a Tkinter app. The same
would hold if you were reading the regex from (say) a config file or
typing it at a raw_input prompt. The common factor is that the text is
never a Python string literal, so you are not passing it through an
extra level of backslash processing.
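
As a rough illustration, assuming Python 3, the sketch below uses io.StringIO as a stand-in for a config file (the file and its content are hypothetical). Text read at run time reaches re.compile exactly as it was typed.

```python
import io
import re

# Stand-in for a config file whose single line is the ten characters
# \\.*\\(.*)  exactly as a user would type them.
config_file = io.StringIO(r'\\.*\\(.*)' + '\n')

pattern_text = config_file.readline().rstrip('\n')
print(pattern_text)                            # \\.*\\(.*)

# No string-literal processing happened, so re sees both backslashes.
pattern = re.compile(pattern_text)
print(pattern.match(r'\dir\file').group(1))    # file
```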

HTH,
John

Apr 22 '07 #3
