regex for stripping HTML

Originally, I was using

$value =~ s/<.*>//g;

to strip HTML tags from a variable. It actually stripped everything
from the first "<" to the last ">" after the ending tag. I found this
regex in this group:

$value =~ s/\<[^\<]+\>//g;

and I'm trying to parse it out and figure out why it works. First off,
some questions:

- why escape the "<"? It's not one of the meta characters that has
special meaning in a regex.

- what's the difference between using ".*" to match any string and "+"
to match a repeat of the character class "[^\<]".

Just trying to deepen my understanding of regex. It's like whitewash --
it gets more opaque with multiple coats.

TIA,

/MeV/

--
DeeDee, don't press that button! DeeDee! NO! Dee...

Jul 19 '05 #1
In article <vi**************************@comcast.ash.giganews .com>,
Michael Vilain <vi****@spamcop.net> wrote:
Originally, I was using

$value =~ s/<.*>//g;

to strip HTML tags from a variable. It actually stripped everything
from the first "<" to the last ">" after the ending tag. I found this
regex in this group:

$value =~ s/\<[^\<]+\>//g;

and I'm trying to parse it out and figure out why it works. First off,
some questions:

- why escape the "<"? It's not one of the meta characters that has
special meaning in a regex.

- what's the difference between using ".*" to match any string and "+"
to match a repeat of the character class "[^\<]".

Just trying to deepen my understanding of regex. It's like whitewash --
it gets more opaque with multiple coats.

TIA,

/MeV/


Hello. This is from the Perl FAQ, queried from the terminal:

$ perldoc -q html

Here's one "simple-minded" approach, that works for most files:

#!/usr/bin/perl -p0777
s/<(?:[^>'"]*|(['"]).*?\1)*>//gs

If you want a more complete solution, see the 3-stage striphtml
program in http://www.cpan.org/authors/Tom_Christiansen/scripts/striphtml.gz .
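To see why the FAQ pattern is built the way it is, it can help to spell it out with the /x flag so each piece carries a comment. This is only a sketch: the pattern is the one quoted above, but the variable names and the sample input are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The FAQ pattern, rewritten with /x so that each part can be commented.
my $tag = qr{
    <                  # opening angle bracket
    (?:                # then any number of the following pieces:
        [^>'"]*        #   a run of characters that are neither ">" nor a quote
      |                #  or
        (['"]) .*? \1  #   a quoted string: a quote, the shortest run, the same quote
    )*
    >                  # closing angle bracket
}xs;

# Hypothetical input: an attribute value that contains a ">".
my $html = q{<a href="x>y" title='it'>link</a> text};

(my $text = $html) =~ s/$tag//g;
print "$text\n";
```

The quoted-string branch is what lets a ">" inside an attribute value pass without ending the tag early.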
--
Koncept <<
"Contrary to popular belief, the most dangerous animal is not the lion or
tiger or even the elephant. The most dangerous animal is a shark riding
on an elephant, just trampling and eating everything they see." - Jack Handey
Jul 19 '05 #2
"Michael Vilain <vi****@spamcop.net>" wrote in news:vilain-
8A*******************@comcast.ash.giganews.com:
Originally, I was using

$value =~ s/<.*>//g;

to strip HTML tags from a variable. It actually stripped everything
from the first "<" to the last ">" after the ending tag. I found this
regex in this group:

$value =~ s/\<[^\<]+\>//g;

and I'm trying to parse it out and figure out why it works. First off,
some questions:

- why escape the "<"? It's not one of the meta characters that has
special meaning in a regex.

- what's the difference between using ".*" to match any string and "+"
to match a repeat of the character class "[^\<]".

Just trying to deepen my understanding of regex. It's like whitewash -- it gets more opaque with multiple coats.


Nah, it's not that hard. There's a learning curve, sure, but you'll get
to the top of it in time.

First, you are correct about the "<" -- no need to escape it; whoever did
it wasn't thinking.
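One way to convince yourself that the backslashes change nothing is to run both spellings on the same input and compare. A minimal sketch; the sample string and variable names are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample input.
my $html = '<p>Hello, <em>world</em>!</p>';

# The pattern as posted, with "<" and ">" escaped.
(my $escaped = $html) =~ s/\<[^\<]+\>//g;

# The same pattern without the backslashes.
(my $plain = $html) =~ s/<[^<]+>//g;

print $escaped eq $plain
    ? "same result: $plain\n"
    : "results differ: '$escaped' vs '$plain'\n";
```

In a Perl regex, a backslash in front of a non-word character just makes that character literal, so both substitutions behave the same.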

Second, it helps to translate the regex sub-expressions into English
(assuming English is your native tongue):

<.*> means: Match a less-than character, followed by as many
characters as possible, followed by a greater-than character.

<[^>]+> means: Match a less-than character, followed by as many non-
greater-than characters as possible, followed by a greater-than
character.
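The two readings above can be checked directly. This is a small sketch with a made-up input string; the variable names are only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample input.
my $html = '<b>bold</b> and <i>italic</i> text';

# Greedy: .* runs to the end of the string, then backs up to the LAST ">".
(my $greedy = $html) =~ s/<.*>//g;

# Negated class: [^>]+ cannot cross a ">", so each tag is matched on its own.
(my $per_tag = $html) =~ s/<[^>]+>//g;

print "greedy:  '$greedy'\n";
print "per tag: '$per_tag'\n";
```

The greedy version swallows everything between the first "<" and the last ">", which is the behaviour described in the original question.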

See the difference? . matches ANY character; [^>] matches only non-">"
characters.

Note that it is not possible in general to process HTML via regular
expressions (at least, not simple regexes). Consider the following
snippet of valid HTML:

<img src="foo.jpg" alt='<<<"cool!">>>' />
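Running the simple pattern on that snippet shows the problem. For comparison, the sketch below also applies the quote-aware pattern from perldoc -q html. The surrounding text and variable names are made up, and the quote-aware pattern is still a heuristic (the FAQ itself calls it "simple-minded"), not a real HTML parser.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The snippet from above, wrapped in some made-up surrounding text.
my $html = q{Look: <img src="foo.jpg" alt='<<<"cool!">>>' /> done};

# Simple pattern: stops at the first ">", which sits inside the alt text.
(my $simple = $html) =~ s/<[^>]+>//g;

# Quote-aware pattern from the Perl FAQ: skips over quoted attribute values.
(my $quoted = $html) =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;

print "simple: '$simple'\n";
print "quoted: '$quoted'\n";
```

The simple pattern leaves part of the alt attribute behind, while the quote-aware one removes the whole tag.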

--
Eric
$_ = reverse sort $ /. r , qw p ekca lre uJ reh
ts p , map $ _. $ " , qw e p h tona e and print

Jul 19 '05 #3
you have to escape < because it can be used as a search delimiter

"Michael Vilain " wrote:
Originally, I was using

$value =~ s/<.*>//g;

to strip HTML tags from a variable. It actually stripped everything
from the first "<" to the last ">" after the ending tag. I found this
regex in this group:

$value =~ s/\<[^\<]+\>//g;

and I'm trying to parse it out and figure out why it works. First off,
some questions:

- why escape the "<"? It's not one of the meta characters that has
special meaning in a regex.

- what's the difference between using ".*" to match any string and "+"
to match a repeat of the character class "[^\<]".

Just trying to deepen my understanding of regex. It's like whitewash --
it gets more opaque with multiple coats.

TIA,

/MeV/


--
Regards,
Dov Levenglick

Jul 19 '05 #4
DOV LEVENGLICK <Do************@motorola.com> wrote in comp.lang.perl.misc:
"Michael Vilain " wrote:


[DOV's top-posting re-arranged]
$value =~ s/\<[^\<]+\>//g;

and I'm trying to parse it out and figure out why it works. First off,
some questions:

- why escape the "<"? It's not one of the meta characters that has
special meaning in a regex.


you have to escape < because it can be used as a search delimiter


This is nonsense. What are you talking about? And don't top-post.

Anno
Jul 19 '05 #5
