Bytes IT Community

Re: re

Actually using regular expressions for the first
time. Is there something that allows you to take the
union of two character sets, or append a character to
a character set?

Say I want to replace 'disc' with 'disk', but only
when 'disc' is a complete word (don't want to change
'discuss' to 'diskuss'.) The following seems almost
right:

[^a-zA-Z]disc[^a-zA-Z]

The problem is that that doesn't match if 'disc' is at
the start or end of the string. Of course I could just
combine a few re's with |, but it seems like there should
(or might?) be a way to simply append a \A to the first
[^a-zA-Z] and a \Z to the second.

--
David C. Ullrich
Jun 27 '08 #1
6 Replies


David C. Ullrich wrote:
> Actually using regular expressions for the first
> time. Is there something that allows you to take the
> union of two character sets, or append a character to
> a character set?
>
> Say I want to replace 'disc' with 'disk', but only
> when 'disc' is a complete word (don't want to change
> 'discuss' to 'diskuss'.) The following seems almost
> right:
>
> [^a-zA-Z]disc[^a-zA-Z]
>
> The problem is that that doesn't match if 'disc' is at
> the start or end of the string. Of course I could just
> combine a few re's with |, but it seems like there should
> (or might?) be a way to simply append a \A to the first
> [^a-zA-Z] and a \Z to the second.

Why not

($|[\w])disc(^|[^\w])

I hope \w is really the literal for whitespace - might be something
different, see the docs.

Diez
Jun 27 '08 #2

"Diez B. Roggisch" <de***@nospam.web.de> wrote in message
news:6a*************@mid.uni-berlin.de...
> David C. Ullrich wrote:
>> Say I want to replace 'disc' with 'disk', but only
>> when 'disc' is a complete word (don't want to change
>> 'discuss' to 'diskuss'.) The following seems almost
>> right:
>>
>> [^a-zA-Z]disc[^a-zA-Z]
>>
>> The problem is that that doesn't match if 'disc' is at
>> the start or end of the string. Of course I could just
>> combine a few re's with |, but it seems like there should
>> (or might?) be a way to simply append a \A to the first
>> [^a-zA-Z] and a \Z to the second.
>
> Why not
>
> ($|[\w])disc(^|[^\w])
>
> I hope \w is really the literal for whitespace - might be something
> different, see the docs.

No, \s is the literal for whitespace.
http://www.python.org/doc/current/lib/re-syntax.html

But how about:

text = re.sub(r"\bdisc\b", "disk", text_to_be_changed)

\b is the "word break" character, it matches at the beginning or end of any
"word" (where a word is any sequence of \w characters, and \w is any
alphanumeric character or _).

Note that this solution still doesn't catch "Disc" if it is capitalized.
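
A quick way to check the \b suggestion is a small script like the following. This is only a sketch for Python 3; the helper name fix_spelling and the sample strings are made up for illustration and are not from the original posts.

```python
import re

def fix_spelling(text):
    """Replace the whole word 'disc' with 'disk' (lowercase only)."""
    # \b is zero-width, so it also matches at the start and end of the string.
    return re.sub(r"\bdisc\b", "disk", text)

samples = ["disc", "(disc)", "foo disc bar", "discuss", "Disc"]
for s in samples:
    print(repr(s), "->", repr(fix_spelling(s)))
```

Running it should show that "discuss" and the capitalized "Disc" are left alone, while the other samples are rewritten.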

Russ

Jun 27 '08 #3

In article <6a*************@mid.uni-berlin.de>,
"Diez B. Roggisch" <de***@nospam.web.de> wrote:
> David C. Ullrich wrote:
>> Actually using regular expressions for the first
>> time. Is there something that allows you to take the
>> union of two character sets, or append a character to
>> a character set?
>>
>> Say I want to replace 'disc' with 'disk', but only
>> when 'disc' is a complete word (don't want to change
>> 'discuss' to 'diskuss'.) The following seems almost
>> right:
>>
>> [^a-zA-Z]disc[^a-zA-Z]
>>
>> The problem is that that doesn't match if 'disc' is at
>> the start or end of the string. Of course I could just
>> combine a few re's with |, but it seems like there should
>> (or might?) be a way to simply append a \A to the first
>> [^a-zA-Z] and a \Z to the second.
>
> Why not
>
> ($|[\w])disc(^|[^\w])
>
> I hope \w is really the literal for whitespace - might be something
> different, see the docs.

Thanks, but I don't follow that at all.

Whitespace is actually \s. But [\s]disc[whatever]
doesn't do the job - then it won't match "(disc)",
which counts as "disc" appearing as a full word.

Also I think you have ^ and $ backwards, and there's
a ^ I don't understand. I _think_ that a correct version
of what you're suggesting would be

(^|[^a-zA-Z])disc($|[^a-zA-Z])

But as far as I can see that simply doesn't work.
I haven't been able to use | that way, combining
_parts_ of a re. That was the first thing I tried.
The original works right except for not matching
at the start or end of a string, the thing with
the | doesn't work at all:
>>> from re import compile
>>> test = compile(r'(^|[^a-zA-Z])disc($|[^a-zA-Z])')
>>> test.findall('')
[]
>>> test.findall('disc')
[('', '')]
>>> test.findall(' disc ')
[(' ', ' ')]
>>> disc = compile(r'[^a-zA-Z]disc[^a-zA-Z]')
>>> disc.findall(' disc disc disc')
[' disc ']
>>> disc.findall(' disc  disc disc')
[' disc ', ' disc ']
>>> test.findall(' disc  disc disc')
[(' ', ' '), (' ', ' ')]
>>> disc.findall(' disc  disc disc')
[' disc ', ' disc ']
>>> disc.findall(' disc  disc  disc ')
[' disc ', ' disc ', ' disc ']

> Diez
--
David C. Ullrich
Jun 27 '08 #4

> Whitespace is actually \s. But [\s]disc[whatever]
> doesn't do the job - then it won't match "(disc)",
> which counts as "disc" appearing as a full word.

Ok, then this works:

import re

test = """
disc
(disc)
foo disc bar
discuss
""".split("\n")

for t in test:
    # "disc" must be delimited by a non-word character or a string boundary
    if re.search(r"(^|[^\w])(disc)($|[^\w])", t):
        print("success:", t)
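
One thing to watch if this alternation pattern is used for the actual replacement rather than just for searching: the delimiter characters are part of the match. The sketch below (Python 3, with made-up names sub_consuming and sub_boundary) puts them back with backreferences and compares the result with the \b version suggested elsewhere in the thread. It is an illustration under those assumptions, not a recommendation from the original posters.

```python
import re

# Alternation pattern from the post above, with "disc" itself left uncaptured.
consuming = re.compile(r"(^|[^\w])disc($|[^\w])")
# Zero-width word-boundary version.
boundary = re.compile(r"\bdisc\b")

def sub_consuming(text):
    # The neighbouring characters were matched too, so restore them.
    return consuming.sub(r"\1disk\2", text)

def sub_boundary(text):
    return boundary.sub("disk", text)

print(sub_consuming("(disc)"))
print(sub_consuming("disc disc"))
print(sub_boundary("disc disc"))
```

With two occurrences separated by a single space, the first match swallows the space, so the second "disc" has no delimiter left in front of it and is skipped. The \b version does not have this problem because it consumes nothing.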

> Also I think you have ^ and $ backwards, and there's
> a ^ I don't understand. I _think_ that a correct version

Yep, sorry for the confusion.

Diez
Jun 27 '08 #5

In article <ma************************************@python.org>,
"Russell Blau" <ru******@hotmail.com> wrote:
> "Diez B. Roggisch" <de***@nospam.web.de> wrote in message
> news:6a*************@mid.uni-berlin.de...
>> David C. Ullrich wrote:
>>> Say I want to replace 'disc' with 'disk', but only
>>> when 'disc' is a complete word (don't want to change
>>> 'discuss' to 'diskuss'.) The following seems almost
>>> right:
>>>
>>> [^a-zA-Z]disc[^a-zA-Z]
>>>
>>> The problem is that that doesn't match if 'disc' is at
>>> the start or end of the string. Of course I could just
>>> combine a few re's with |, but it seems like there should
>>> (or might?) be a way to simply append a \A to the first
>>> [^a-zA-Z] and a \Z to the second.
>>
>> Why not
>>
>> ($|[\w])disc(^|[^\w])
>>
>> I hope \w is really the literal for whitespace - might be something
>> different, see the docs.
>
> No, \s is the literal for whitespace.
> http://www.python.org/doc/current/lib/re-syntax.html
>
> But how about:
>
> text = re.sub(r"\bdisc\b", "disk", text_to_be_changed)
>
> \b is the "word break" character,

Lovely - that's exactly right, thanks. I swear I looked at the
docs... I'm just blind or stupid. No wait, I'm blind _and_
stupid. No, blind and stupid and slow...

Doesn't precisely fit the _spec_ because of digits and underscores,
but it's close enough to solve the problem exactly. Thanks.
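
For anyone who does need the letters-only definition of a word, negative lookarounds are one way to get it. The following is a sketch for Python 3; the names word_boundary and letters_only and the sample strings are invented for illustration. Like \b, lookarounds are zero-width, so they also match at the start and end of the string.

```python
import re

word_boundary = re.compile(r"\bdisc\b")
# "not preceded by an ASCII letter" and "not followed by an ASCII letter"
letters_only = re.compile(r"(?<![a-zA-Z])disc(?![a-zA-Z])")

for s in ["disc", "disc2", "_disc_", "discuss"]:
    print(s, bool(word_boundary.search(s)), bool(letters_only.search(s)))
```

The two patterns agree on "disc" and "discuss" but differ on "disc2" and "_disc_", where the digit or underscore counts as a word character for \b.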

> it matches at the beginning or end of any
> "word" (where a word is any sequence of \w characters, and \w is any
> alphanumeric character or _).
>
> Note that this solution still doesn't catch "Disc" if it is capitalized.

Thanks. I didn't mention I wanted to catch both cases because I
already knew how to take care of that:

r"\b[dD]isc\b"
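
To keep the original capitalization in the replacement, the first letter can be captured and reused. A minimal sketch for Python 3, where the function name disc_to_disk is hypothetical:

```python
import re

def disc_to_disk(text):
    # Capture the first letter so its case survives the substitution.
    return re.sub(r"\b([dD])isc\b", r"\1isk", text)

print(disc_to_disk("Disc one, disc two, and a discussion."))
```

Only the standalone words change; "discussion" is untouched because there is no word boundary after "disc".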

> Russ
--
David C. Ullrich
Jun 27 '08 #6

On Wed, 04 Jun 2008 20:07:41 +0200, "Diez B. Roggisch"
<de***@nospam.web.de> wrote:
>> Whitespace is actually \s. But [\s]disc[whatever]
>> doesn't do the job - then it won't match "(disc)",
>> which counts as "disc" appearing as a full word.
>
> Ok, then this works:

Yes it does.

My real question was why doesn't a construction like

(A|B)C

work as expected. The code below shows that it does.
That puzzled me because I couldn't see any real
difference between your solution here and things
I'd tried that didn't work. But those things also
work in the code below - when I saw this just
now I was even more confused...

Oh. Turns out the actual reason for the confusion wasn't
regex syntax, it was the fact that findall doesn't
return what I thought it did - looking at the result
of findall() it seemed as though the re was matching
empty strings and whitespace... Looking more
carefully at what findall is supposed to do everything
makes sense.

Sorry to be dense. Remind me to read more than the
first sentence next time:

"findall (pattern, string)
Return a list of all non-overlapping matches of pattern in string.
If one or more groups are present in the pattern, return a list of
groups;..."
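
To make that findall behaviour concrete, here is a small sketch for Python 3. The sample string and the names grouped and ungrouped are invented for illustration. The (?:...) non-capturing group and finditer are standard features of the re module, not something proposed in the thread.

```python
import re

text = "disc (disc)"

# Capturing groups: findall returns a tuple of the groups for each match.
grouped = re.compile(r"(^|[^a-zA-Z])disc($|[^a-zA-Z])")
# Non-capturing groups: findall returns the whole matched substring.
ungrouped = re.compile(r"(?:^|[^a-zA-Z])disc(?:$|[^a-zA-Z])")

print(grouped.findall(text))
print(ungrouped.findall(text))
# finditer gives match objects, so the full match is available either way.
print([m.group(0) for m in grouped.finditer(text)])
```

The first line of output contains only the delimiters captured by the two groups, which is why the results can look like empty strings and whitespace even though the pattern matched "disc" each time.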

> import re
>
> test = """
> disc
> (disc)
> foo disc bar
> discuss
> """.split("\n")
>
> for t in test:
>     if re.search(r"(^|[^\w])(disc)($|[^\w])", t):
>         print("success:", t)
>
>> Also I think you have ^ and $ backwards, and there's
>> a ^ I don't understand. I _think_ that a correct version
>
> Yep, sorry for the confusion.
>
> Diez

David C. Ullrich
Jun 27 '08 #7
