
re.sub does not replace all occurrences

Hello everybody,

I wanted to use re.sub to strip all HTML tags out of a given string. I
learned that there are better ways to do this without the re module,
but I would like to know why my code is not working. I use the
following:

def stripHtml(source):
    source = re.sub("[\n\r\f]", " ", source)
    source = re.sub("<.*?>", "", source, re.S | re.I | re.M)
    source = re.sub("&(#[0-9]{1,3}|[a-z]{3,6});", "", source, re.I)
    return source

But the result still has some tags in it. When I call the second line
multiple times, all tags disappear, but since HTML tags cannot be
overlapping, I do not understand this behavior. There is even a
difference when I omit the re.I (IGNORECASE) option. Without this
option, some tags containing only capital letters (like </FONT>) were
kept in the string when doing one processing run but removed when
doing multiple runs.

Perhaps someone can tell me why this regex is behaving like this.

Thanks and regards,
Christoph

Aug 7 '07 #1
3 Replies


On Tue, 07 Aug 2007 10:28:24 -0700, Christoph Krammer wrote:
> Hello everybody,
>
> I wanted to use re.sub to strip all HTML tags out of a given string. I
> learned that there are better ways to do this without the re module,
> but I would like to know why my code is not working. I use the
> following:
>
> def stripHtml(source):
>     source = re.sub("[\n\r\f]", " ", source)
>     source = re.sub("<.*?>", "", source, re.S | re.I | re.M)
>     source = re.sub("&(#[0-9]{1,3}|[a-z]{3,6});", "", source, re.I)
>     return source
>
> But the result still has some tags in it. When I call the second line
> multiple times, all tags disappear, but since HTML tags cannot be
> overlapping, I do not understand this behavior. There is even a
> difference when I omit the re.I (IGNORECASE) option. Without this
> option, some tags containing only capital letters (like </FONT>) were
> kept in the string when doing one processing run but removed when
> doing multiple runs.

Can you give some example HTML where it fails?

Ciao,
Marc 'BlackJack' Rintsch
Aug 7 '07 #2

On 2007-08-07, Christoph Krammer <re********@googlemail.com> wrote:
> Hello everybody,
>
> I wanted to use re.sub to strip all HTML tags out of a given string. I
> learned that there are better ways to do this without the re module,
> but I would like to know why my code is not working. I use the
> following:
>
> def stripHtml(source):
>     source = re.sub("[\n\r\f]", " ", source)
>     source = re.sub("<.*?>", "", source, re.S | re.I | re.M)
>     source = re.sub("&(#[0-9]{1,3}|[a-z]{3,6});", "", source, re.I)
>     return source
>
> But the result still has some tags in it. When I call the
> second line multiple times, all tags disappear, but since HTML
> tags cannot be overlapping, I do not understand this behavior.
> There is even a difference when I omit the re.I (IGNORECASE)
> option. Without this option, some tags containing only capital
> letters (like </FONT>) were kept in the string when doing one
> processing run but removed when doing multiple runs.
>
> Perhaps anyone can tell me why this regex is behaving like
> this.

>>> import re
>>> help(re.sub)
Help on function sub in module re:

sub(pattern, repl, string, count=0)
    Return the string obtained by replacing the leftmost
    non-overlapping occurrences of the pattern in string by the
    replacement repl. repl can be either a string or a callable;
    if a callable, it's passed the match object and must return
    a replacement string to be used.

And from the Python Library Reference for re.sub:

    The pattern may be a string or an RE object; if you need to
    specify regular expression flags, you must use a RE object,
    or use embedded modifiers in a pattern; for example,
    sub("(?i)b+", "x", "bbbb BBBB") returns 'x x'.

    The optional argument count is the maximum number of pattern
    occurrences to be replaced; count must be a non-negative
    integer. If omitted or zero, all occurrences will be
    replaced. Empty matches for the pattern are replaced only
    when not adjacent to a previous match, so
    sub('x*', '-', 'abc') returns '-a-b-c-'.

In other words, the fourth argument to sub is count, not a set of
re flags.

--
Neil Cerutti
Aug 7 '07 #3
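
The diagnosis above also explains the odd symptoms in the question. The flag constants are plain integers, so re.S | re.I | re.M evaluates to 26 and is taken as count: each call removes at most 26 tags, and repeated calls eventually remove the rest. Dropping re.I changes the value to 24, which is why the result differed. Below is a minimal sketch of the effect and one possible fix using precompiled patterns. The names strip_html, TAG_RE, ENTITY_RE and the sample string are illustrative and not from the thread; the flags= keyword mentioned in the comment exists only in Python 2.7 / 3.1 and later, so it was not available when this thread was written.

```python
import re

# Flag constants are small integers: re.S == 16, re.I == 2, re.M == 8.
# OR-ed together they give 26, which re.sub accepts as `count`.
accidental_count = int(re.S | re.I | re.M)

html = "<b>x</b>" * 20  # illustrative input containing 40 tags

# What the call in the question amounts to: replace at most 26 matches.
buggy = re.sub("<.*?>", "", html, count=accidental_count)

# Possible fix: attach the flags to compiled pattern objects, as the
# library reference quoted above suggests. On Python 2.7 / 3.1 or later,
# re.sub(pattern, repl, string, flags=...) is an alternative.
TAG_RE = re.compile("<.*?>", re.S | re.I | re.M)
ENTITY_RE = re.compile("&(#[0-9]{1,3}|[a-z]{3,6});", re.I)


def strip_html(source):
    """Remove tags and simple character entities from source."""
    source = re.sub("[\n\r\f]", " ", source)
    source = TAG_RE.sub("", source)
    source = ENTITY_RE.sub("", source)
    return source


print(buggy.count("<"))   # tags left over after the buggy call
print(strip_html(html))   # all tags removed in a single pass
```

With the 40-tag sample, the buggy call leaves 14 tags behind (40 minus 26), while strip_html removes all of them in one pass.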

Neil Cerutti wrote:
> In other words, the fourth argument to sub is count, not a set of
> re flags.

I knew it had to be something very stupid.

Thanks a lot.

Aug 7 '07 #4
