
using re module to find " but not " alone ... is this a BUG in re?

Hi,

I want to replace all occurrences of " by \" in a string.

But I want to leave all occurrences of \" as they are.

The following should happen:

this I want " while I dont want this \"

should be transformed to:

this I want \" while I dont want this \"

and NOT:

this I want \" while I dont want this \\"

I even tried the (?<=...) construct, but then I get an unbalanced parenthesis
error.

It seems that re is not able to do the job because of parsing/compiling problems
with this sort of string.
Do you have any idea?

Anton
Example: --------------------

import re

re.findall("[^\\]\"","this I want \" while I dont want this \\\" ")

Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
  File "C:\Python25\lib\re.py", line 175, in findall
    return _compile(pattern, flags).findall(string)
  File "C:\Python25\lib\re.py", line 241, in _compile
    raise error, v # invalid expression
error: unexpected end of regular expression

Jun 27 '08 #1
6 Replies


On Jun 12, 7:11 pm, anton <anto...@gmx.de> wrote:
> Hi,
>
> I want to replace all occurrences of " by \" in a string.
>
> But I want to leave all occurrences of \" as they are.
>
> The following should happen:
>
> this I want " while I dont want this \"
>
> should be transformed to:
>
> this I want \" while I dont want this \"
>
> and NOT:
>
> this I want \" while I dont want this \\"
>
> I tried even the (?<=...) construction but here I get an unbalanced parenthesis
> error.

Sounds like a deficit of backslashes causing re to regard \) as plain
text and not the magic closing parenthesis in (?<=...) -- and don't
you want (?<!...) ?

> It seems that re is not able to do the job due to parsing/compiling problems
> for this sort of strings.

Nothing is ever as it seems.

> Have you any idea??

For a start, *ALWAYS* use a raw string for an re pattern -- halves the
backslash pollution!

> re.findall("[^\\]\"","this I want \" while I dont want this \\\" ")

and if you have " in the pattern, use '...' to enclose the pattern so
that you don't have to use \"

> Traceback (most recent call last):
>   File "<interactive input>", line 1, in <module>
>   File "C:\Python25\lib\re.py", line 175, in findall
>     return _compile(pattern, flags).findall(string)
>   File "C:\Python25\lib\re.py", line 241, in _compile
>     raise error, v # invalid expression
> error: unexpected end of regular expression

As expected.

What you want is:

>>> import re
>>> text = r'frob this " avoid this \", OK?'
>>> text
'frob this " avoid this \\", OK?'
>>> re.sub(r'(?<!\\)"', r'\"', text)
'frob this \\" avoid this \\", OK?'

HTH,
John
Jun 27 '08 #2
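
Why the original pattern fails and the lookbehind version works can be checked directly. The following is a minimal sketch, not part of the original posts: the names `broken`, `UNESCAPED_QUOTE` and `escape_quotes` are made up for illustration, and a callable is used as the replacement so that no escape processing is applied to the replacement text.

```python
import re

# The original pattern, written as a normal (non-raw) string literal.
# After Python processes the escapes, the regex engine sees: [^\]"
# Inside a character class, \] is a literal ']', so the class never closes.
broken = "[^\\]\""

try:
    re.compile(broken)
    compiled_ok = True
except re.error:
    compiled_ok = False

# A raw string keeps the backslashes as typed, and single quotes avoid \".
# (?<!\\) is a negative lookbehind: "not preceded by a backslash".
UNESCAPED_QUOTE = re.compile(r'(?<!\\)"')


def escape_quotes(text):
    """Prefix each double quote with a backslash unless one is already there."""
    return UNESCAPED_QUOTE.sub(lambda m: '\\"', text)


sample = r'this I want " while I dont want this \"'
result = escape_quotes(sample)
print(compiled_ok)
print(result)
```

The lookbehind only inspects the single character before the quote, so it cannot tell an escaping backslash from one that is itself escaped.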

John Machin <sj******@lexicon.net> wrote:
> What you want is:
>
> >>> import re
> >>> text = r'frob this " avoid this \", OK?'
> >>> text
> 'frob this " avoid this \\", OK?'
> >>> re.sub(r'(?<!\\)"', r'\"', text)
> 'frob this \\" avoid this \\", OK?'

Or you can do it without using regular expressions at all. Just replace
them all and then fix up the result:

>>> text = r'frob this " avoid this \", OK?'
>>> text.replace('"', r'\"').replace(r'\\"', r'\"')
'frob this \\" avoid this \\", OK?'
--
Duncan Booth http://kupuguy.blogspot.com
Jun 27 '08 #3

anton wrote:
> I want to replace all occurrences of " by \" in a string.
>
> But I want to leave all occurrences of \" as they are.
>
> The following should happen:
>
> this I want " while I dont want this \"
>
> should be transformed to:
>
> this I want \" while I dont want this \"
>
> and NOT:
>
> this I want \" while I dont want this \\"
>
> I tried even the (?<=...) construction but here I get an unbalanced
> parenthesis error.
>
> It seems that re is not able to do the job due to parsing/compiling
> problems for this sort of strings.
> Have you any idea??

The problem is underspecified. Should r'\\"' become r'\\\"' or remain
unchanged? If the backslash is supposed to escape the following character,
including another backslash -- that can't be done with regular expressions
alone:

# John's proposal:
>>> print(re.sub(r'(?<!\\)"', r'\"', 'no " one \\", two \\\\"'))
no \" one \", two \\"

One possible fix:

>>> parts = re.compile("(\\\\.)").split('no " one \\", two \\\\"')
>>> parts[::2] = [p.replace('"', '\\"') for p in parts[::2]]
>>> print("".join(parts))
no \" one \", two \\\"

Peter

Jun 27 '08 #4
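
The split-based fix above can be wrapped in a small function. This is a sketch under the same assumption Peter makes, namely that a backslash escapes whatever single character follows it; the names `ESCAPE_PAIR` and `escape_unescaped_quotes` are illustrative only.

```python
import re

# A backslash followed by any one character is treated as an escape pair.
# Assumption: '.' does not match a newline, so a backslash directly before
# a newline is not treated as a pair here.
ESCAPE_PAIR = re.compile(r'(\\.)')


def escape_unescaped_quotes(text):
    # With one capturing group, split() returns
    # [plain, pair, plain, pair, ..., plain]: even indexes are plain text.
    parts = ESCAPE_PAIR.split(text)
    # Only the plain-text pieces can contain bare quotes; escape those.
    parts[::2] = [p.replace('"', '\\"') for p in parts[::2]]
    return ''.join(parts)


converted = escape_unescaped_quotes(r'no " one \", two \\"')
print(converted)
```

Because the escape pairs are split out first, the quote after an escaped backslash ends up in a plain-text piece and gets its own backslash.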

John Machin <sjmachin <at> lexicon.net> writes:
>
> On Jun 12, 7:11 pm, anton <anto...@gmx.de> wrote:
> > Hi,
> >
> > I want to replace all occurrences of " by \" in a string.
> >
> > But I want to leave all occurrences of \" as they are.
> >
> > The following should happen:
> >
> > this I want " while I dont want this \"

.... cut text off

> What you want is:
>
> >>> import re
> >>> text = r'frob this " avoid this \", OK?'
> >>> text
> 'frob this " avoid this \\", OK?'
> >>> re.sub(r'(?<!\\)"', r'\"', text)
> 'frob this \\" avoid this \\", OK?'
>
> HTH,
> John
> --
> http://mail.python.org/mailman/listinfo/python-list


First.. thanks John.

The whole problem is discussed in

http://docs.python.org/dev/howto/reg...ckslash-plague

in the section "The Backslash Plague"

Unfortunately this is *NOT* mentioned in the standard
python documentation of the re module.

Another thing which will always remain strange to me is that,
even though the Python docs on raw strings

http://docs.python.org/ref/strings.html

say:
"Specifically, a raw string cannot end in a single backslash"

s=r"\\" # works fine
s=r"\" # does not work (as stated)

But both END IN A SINGLE BACKSLASH!

The main thing which is hard to understand is:

If a raw string is a string which ignores backslashes,
then it should ignore them in all circumstances,

or where could the problem be here (in the Python parser somewhere??).

Bye

Anton
Jun 27 '08 #5

On Jun 13, 6:23 pm, anton <anto...@gmx.de> wrote:
> John Machin <sjmachin <at> lexicon.net> writes:
> > On Jun 12, 7:11 pm, anton <anto...@gmx.de> wrote:
> > > Hi,
> > > I want to replace all occurrences of " by \" in a string.
> > > But I want to leave all occurrences of \" as they are.
> > > The following should happen:
> > > this I want " while I dont want this \"
>
> ... cut text off
>
> > What you want is:
> > >>> import re
> > >>> text = r'frob this " avoid this \", OK?'
> > >>> text
> > 'frob this " avoid this \\", OK?'
> > >>> re.sub(r'(?<!\\)"', r'\"', text)
> > 'frob this \\" avoid this \\", OK?'
> > HTH,
> > John
> > --
> > http://mail.python.org/mailman/listinfo/python-list

> First.. thanks John.
>
> The whole problem is discussed in
>
> http://docs.python.org/dev/howto/reg...ckslash-plague
>
> in the section "The Backslash Plague"
>
> Unfortunately this is *NOT* mentioned in the standard
> python documentation of the re module.

Yes, and there's more to driving a car in heavy traffic than you will
find in the manufacturer's manual.

> Another thing which will always remain strange to me, is that
> even if in the python doc of raw string:
>
> http://docs.python.org/ref/strings.html
>
> it's written:
> "Specifically, a raw string cannot end in a single backslash"
>
> s=r"\\" # works fine
> s=r"\" # works not (as stated)
>
> But both END IN A SINGLE BACKSLASH !

Apply the interpretation that the first case ends in a double
backslash, and move on.

> The main thing which is hard to understand is:
>
> If a raw string is a string which ignores backslashes,
> then it should ignore them in all circumstances,

Nobody defines a raw string to be a "string that ignores backslashes",
so your premise is invalid.

> or where could be the problem here (python parser somewhere??).

Why r"\" is not a valid string token has been done to death IIRC at
least twice in this newsgroup ...

Cheers,
John
Jun 27 '08 #6
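
The raw-string behaviour discussed above can be verified with a short script. This is only a sketch with made-up variable names; the invalid literal is handed to `compile()` as source text so that the script itself still parses.

```python
# r"\\" is two characters: in a raw string both backslashes are kept.
two = r"\\"

# In a raw string a backslash still stops the next quote from ending the
# literal, but the backslash itself stays in the value.
quote_kept = r'\"'

# Ways to spell a string holding exactly one backslash.
one_escaped = "\\"        # ordinary literal with an escaped backslash
one_chr = chr(92)         # by code point
one_sliced = r"\ "[:-1]   # raw literal with a padding character removed

# r"\" never becomes a complete string token: the tokenizer reads \" as
# "backslash, then a quote that does not end the literal".
source = 's = r"\\"'      # the characters: s = r"\"
try:
    compile(source, "<example>", "exec")
    raw_single_ok = True
except SyntaxError:
    raw_single_ok = False

print(len(two), len(quote_kept), len(one_escaped), raw_single_ok)
```

So a raw string does not ignore backslashes while the literal is being tokenized; it only skips turning escape sequences into other characters afterwards.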

On Jun 12, 4:11 am, anton <anto...@gmx.de> wrote:
> Hi,
>
> I want to replace all occurrences of " by \" in a string.
>
> But I want to leave all occurrences of \" as they are.
>
> The following should happen:
>
> this I want " while I dont want this \"
>
> should be transformed to:
>
> this I want \" while I dont want this \"
>
> and NOT:
>
> this I want \" while I dont want this \\"

A pyparsing version is not as terse as an re, and certainly not as
fast, but it is easy enough to read. Here is my first brute-force
approach to your problem:

from pyparsing import Literal, replaceWith

escQuote = Literal(r'\"')
unescQuote = Literal(r'"')
unescQuote.setParseAction(replaceWith(r'\"'))

test1 = r'this I want " while I dont want this \"'
test2 = r'frob this " avoid this \", OK?'

for test in (test1, test2):
    print((escQuote | unescQuote).transformString(test))

And it prints out the desired:

this I want \" while I dont want this \"
frob this \" avoid this \", OK?

This works by defining both of the patterns escQuote and unescQuote,
and only defines a transforming parse action for the unescQuote. By
listing escQuote first in the list of patterns to match, properly
escaped quotes are skipped over.

Then I looked at your problem slightly differently - why not find both
'\"' and '"', and replace either one with '\"'. In some cases, I'm
"replacing" '\"' with '\"', but so what? Here is the simplified
transformer:

from pyparsing import Optional, replaceWith

# '\\' is one backslash; r'\\' would be two, and the \ in \" would not match
quotes = Optional('\\') + '"'
quotes.setParseAction(replaceWith(r'\"'))
for test in (test1, test2):
    print(quotes.transformString(test))
Again, this prints out the desired output.

Now let's retrofit this altered logic back onto John Machin's
solution:

import re
for test in (test1, test2):
    print(re.sub(r'\\?"', r'\"', test))
Pretty short and sweet, and pretty readable for an re.

To address Peter Otten's question about what to do with an escaped
backslash, I can't compose this with an re, but I can by adjusting the
first pyparsing version to include an escaped backslash as a "match
but don't do anything with it" expression, just like we did with
escQuote:

from pyparsing import Literal, replaceWith

escQuote = Literal(r'\"')
unescQuote = Literal(r'"')
unescQuote.setParseAction(replaceWith(r'\"'))
backslash = chr(92)
escBackslash = Literal(backslash+backslash)

test3 = r'no " one \", two \\"'
for test in (test1, test2, test3):
    print((escBackslash | escQuote | unescQuote).transformString(test))

Prints:
this I want \" while I dont want this \"
frob this \" avoid this \", OK?
no \" one \", two \\\"

At first I thought the last transform was an error, but on closer
inspection, I see that the input line ends with an escaped backslash,
followed by a lone '"', which must be replaced with '\"'. So in the
transformed version we see '\\\"', the original escaped backslash,
followed by the replacement '\"' string.

Cheers,
-- Paul
Jun 27 '08 #7
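
The "match but don't do anything with it" idea from the last pyparsing version can also be approximated with re.sub and a replacement function, for readers who would rather stay within the standard library. This is a sketch and not something posted in the thread: `TOKEN` and `escape_bare_quotes` are hypothetical names, and it assumes a backslash escapes whatever single character follows it.

```python
import re

# Either a backslash plus the character it escapes (captured as group 1),
# or a bare double quote. Pairs are tried first at each position.
TOKEN = re.compile(r'(\\.)|"')


def escape_bare_quotes(text):
    def repl(match):
        pair = match.group(1)
        # Keep existing escape pairs untouched; escape a bare quote.
        return pair if pair is not None else '\\"'
    return TOKEN.sub(repl, text)


tests = [
    r'this I want " while I dont want this \"',
    r'frob this " avoid this \", OK?',
    r'no " one \", two \\"',
]
results = [escape_bare_quotes(t) for t in tests]
for line in results:
    print(line)
```

Listing the escape pair first in the alternation plays the same role as listing escBackslash and escQuote ahead of unescQuote: anything already escaped is consumed before a bare quote can match.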
