-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
"Michael Vilain <vi****@spamcop.net>" wrote in news:vilain-
8A*******************@comcast.ash.giganews.com:
> Originally, I was using
> $value =~ s/<.*>//g;
> to strip HTML tags from a variable. It actually stripped everything
> from the first "<" to the last ">" after the ending tag. I found this
> regex in this group:
> $value =~ s/\<[^\<]+\>//g;
> and I'm trying to parse it out and figure out why it works. First off,
> some questions:
> - why escape the "<"? It's not one of the meta characters that has
> special meaning in a regex.
> - what's the difference between using ".*" to match any string and "+"
> to match a repeat of the character class "[^\<]".
> Just trying to deepen my understanding of regex. It's like whitewash
> -- it gets more opaque with multiple coats.
Nah, it's not that hard. There's a learning curve, sure, but you'll get
to the top of it in time.
First, you are correct about the "<" -- neither "<" nor ">" is a regex
metacharacter in Perl, so there is no need to escape them; whoever did
it wasn't thinking.
Second, it helps to translate the regex sub-expressions into English
(assuming English is your native tongue):
<.*> means: Match a less-than character, followed by as many
characters as possible, followed by a greater-than character.
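<[^>]+> means: Match a less-than character, followed by one or more
characters that are not greater-than (as many as possible), followed
by a greater-than character.
See the difference? . matches ANY character (except a newline, unless
you use /s); [^>] matches only non-">" characters, so the match cannot
run past the end of the first tag. As for * versus +: * allows zero or
more repetitions, + requires at least one.
Note that the regex you found uses [^<] where I wrote [^>]. The two
give the same result on simple tags, but with [^<] a stray ">" in the
text between two tags gets swallowed along with the tag before it.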
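To see the two side by side, here is a small sketch. The sub names
strip_greedy and strip_class and the sample string are made up for
illustration; only the two substitutions come from the discussion above.

```perl
use strict;
use warnings;

# Hypothetical helper: the original attempt, greedy ".*".
sub strip_greedy {
    my ($text) = @_;
    $text =~ s/<.*>//g;      # one match: first "<" through the LAST ">"
    return $text;
}

# Hypothetical helper: the negated character class.
sub strip_class {
    my ($text) = @_;
    $text =~ s/<[^>]+>//g;   # one match per tag: stops at the first ">"
    return $text;
}

my $html = 'Some <b>bold</b> and <i>italic</i> text';

print "greedy: [", strip_greedy($html), "]\n";   # greedy: [Some  text]
print "class:  [", strip_class($html),  "]\n";   # class:  [Some bold and italic text]
```

The greedy version eats everything between the first "<" and the last
">", including the words you wanted to keep.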
Note that it is not possible in general to process HTML via regular
expressions (at least, not simple regexes). Consider the following
snippet of valid HTML:
<img src="foo.jpg" alt='<<<"cool!">>>' />
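A quick sketch of what the character-class regex does to that snippet.
The sub name strip_tags is made up for illustration; the regex is the
<[^>]+> form discussed above.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the character-class substitution.
sub strip_tags {
    my ($text) = @_;
    $text =~ s/<[^>]+>//g;
    return $text;
}

my $html = q{<img src="foo.jpg" alt='<<<"cool!">>>' />};

# The match ends at the first ">" inside the alt attribute,
# so the tail of the tag is left behind.
print strip_tags($html), "\n";   # prints: >>' />
```

The regex has no idea that the ">" characters sit inside a quoted
attribute value, which is why a real HTML parser is the safer tool for
anything beyond trivial input.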
- --
Eric
$_ = reverse sort $ /. r , qw p ekca lre uJ reh
ts p , map $ _. $ " , qw e p h tona e and print
-----BEGIN PGP SIGNATURE-----
Version: PGPfreeware 7.0.3 for non-commercial use <http://www.pgp.com>
iQA/AwUBP59EVWPeouIeTNHoEQJRGQCguzB4DdBzsa/9dmTMRm4ExzMmxBUAoIIq
bHd4Hbx8MdXgkJm3sWoUu0K1
=ADWR
-----END PGP SIGNATURE-----