regex for stripping HTML

Michael Vilain

Originally, I was using

$value =~ s/<.*>//g;

to strip HTML tags from a variable. It actually stripped everything
from the first "<" to the last ">" after the ending tag. I found this
regex in this group:

$value =~ s/\<[^\<]+\>//g;

and I'm trying to parse it out and figure out why it works. First off,
some questions:

- why escape the "<"? It's not one of the meta characters that has
special meaning in a regex.

- what's the difference between using ".*" to match any string and "+"
to match a repeat of the character class "[^\<]".

Just trying to deepen my understanding of regex. It's like whitewash --
it gets more opaque with multiple coats.

TIA,

/MeV/

--
DeeDee, don't press that button! DeeDee! NO! Dee...

Jul 19 '05 #1

Koncept

In article <vi**************************@comcast.ash.giganews.com>,
Michael Vilain <vi****@spamcop.net> wrote:

Originally, I was using

$value =~ s/<.*>//g;

to strip HTML tags from a variable. It actually stripped everything
from the first "<" to the last ">" after the ending tag. I found this
regex in this group:

$value =~ s/\<[^\<]+\>//g;

and I'm trying to parse it out and figure out why it works. First off,
some questions:

- why escape the "<"? It's not one of the meta characters that has
special meaning in a regex.

- what's the difference between using ".*" to match any string and "+"
to match a repeat of the character class "[^\<]".

Just trying to deepen my understanding of regex. It's like whitewash --
it gets more opaque with multiple coats.

TIA,

/MeV/

Hello. This is what the Perl FAQ has to say; from a terminal:

$ perldoc -q html

Here's one "simple-minded" approach, that works for most files:

#!/usr/bin/perl -p0777
s/<(?:[^>'"]*|(['"]).*?\1)*>//gs

If you want a more complete solution, see the 3-stage striphtml
program in http://www.cpan.org/authors/Tom_Christiansen/scripts/striphtml.gz .
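
To make that FAQ regex easier to read, here is a sketch of the same pattern spread out with the /x modifier and applied to a string instead of a whole file via -p0777. The variable names and the sample input are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample input; note the ">" inside a quoted attribute value.
my $html = qq{<p class="x>y">Hello, <b>world</b>!</p>\n};

(my $text = $html) =~ s{
    <                   # opening angle bracket
    (?:
        [^>'"]*         # a run of characters that are neither ">" nor a quote
      |                 # or
        (['"]) .*? \1   # a quoted string: quote, shortest run, same quote again
    )*
    >                   # closing angle bracket
}{}gsx;

print $text;   # prints "Hello, world!" followed by a newline
```

The quoted-string alternative is what lets a ">" inside an attribute value pass through without ending the tag early.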

--
Koncept <<
"Contrary to popular belief, the most dangerous animal is not the lion or
tiger or even the elephant. The most dangerous animal is a shark riding
on an elephant, just trampling and eating everything they see." - Jack Handey

Jul 19 '05 #2

Eric J. Roode

"Michael Vilain <vi****@spamcop.net>" wrote in
news:vilain-8A*******************@comcast.ash.giganews.com:

Originally, I was using

$value =~ s/<.*>//g;

to strip HTML tags from a variable. It actually stripped everything
from the first "<" to the last ">" after the ending tag. I found this
regex in this group:

$value =~ s/\<[^\<]+\>//g;

and I'm trying to parse it out and figure out why it works. First off,
some questions:

- why escape the "<"? It's not one of the meta characters that has
special meaning in a regex.

- what's the difference between using ".*" to match any string and "+"
to match a repeat of the character class "[^\<]".

Just trying to deepen my understanding of regex. It's like whitewash -- it gets more opaque with multiple coats.

Nah, it's not that hard. There's a learning curve, sure, but you'll get
to the top of it in time.

First, you are correct about the "<" -- no need to escape it; whoever did
it wasn't thinking.
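
A quick way to check this for yourself, as a sketch with a made-up sample string: run the regex as posted and the same regex with the backslashes dropped, then compare the results.

```perl
use strict;
use warnings;

my $value = 'before <b>bold</b> after';   # made-up sample

(my $escaped   = $value) =~ s/\<[^\<]+\>//g;   # the regex as posted
(my $unescaped = $value) =~ s/<[^<]+>//g;      # same regex, backslashes dropped

print "same result\n" if $escaped eq $unescaped;
print "$unescaped\n";   # prints "before bold after"
```

In Perl, a backslash in front of a non-alphanumeric character always gives the literal character, so the extra backslashes are harmless, just noise.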

Second, it helps to translate the regex sub-expressions into English
(assuming English is your native tongue):

<.*> means: Match a less-than character, followed by as many
characters as possible, followed by a greater-than character.

<[^>]+> means: Match a less-than character, followed by as many
non-greater-than characters as possible, followed by a greater-than
character.
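
The two translations above can be tried side by side. This is only a sketch; the sample string is invented for the purpose.

```perl
use strict;
use warnings;

my $value = 'one <b>two</b> three <i>four</i> five';   # made-up sample

# "." also matches ">", so the greedy ".*" runs on to the LAST ">" in the string.
(my $greedy = $value) =~ s/<.*>//g;

# "[^>]" cannot match ">", so each match has to end at the FIRST ">" it meets.
(my $negated = $value) =~ s/<[^>]+>//g;

print "$greedy\n";    # prints "one  five" (two spaces)
print "$negated\n";   # prints "one two three four five"
```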

See the difference? . matches ANY character (except a newline, unless
the /s modifier is used); [^>] matches only non-">" characters.

Note that it is not possible in general to process HTML via regular
expressions (at least, not simple regexes). Consider the following
snippet of valid HTML:

<img src="foo.jpg" alt='<<<"cool!">>>' />
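
To see what goes wrong, here is a small sketch that runs that snippet through the simple negated-class regex, and then through the "simple-minded" regex from perldoc -q html quoted earlier in the thread. The variable names are made up.

```perl
use strict;
use warnings;

my $html = q{<img src="foo.jpg" alt='<<<"cool!">>>' />};

# Simple version: the match stops at the first ">", which sits inside the alt text.
(my $simple = $html) =~ s/<[^>]+>//g;

# perlfaq version: quoted attribute values are skipped over as a unit.
(my $faq = $html) =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;

print "$simple\n";         # prints: >>' />
print length($faq), "\n";  # prints 0, the whole tag is gone
```

Even the second regex only copes with this particular case; comments, scripts and the like would still need a real parser.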

--
Eric
$_ = reverse sort $ /. r , qw p ekca lre uJ reh
ts p , map $ _. $ " , qw e p h tona e and print

Jul 19 '05 #3

DOV LEVENGLICK

you have to escape < because it can be used as a search delimiter

"Michael Vilain " wrote:

Originally, I was using

$value =~ s/<.*>//g;

to strip HTML tags from a variable. It actually stripped everything
from the first "<" to the last ">" after the ending tag. I found this
regex in this group:

$value =~ s/\<[^\<]+\>//g;

and I'm trying to parse it out and figure out why it works. First off,
some questions:

- why escape the "<"? It's not one of the meta characters that has
special meaning in a regex.

- what's the difference between using ".*" to match any string and "+"
to match a repeat of the character class "[^\<]".

Just trying to deepen my understanding of regex. It's like whitewash --
it gets more opaque with multiple coats.

TIA,

/MeV/

--
Regards,
Dov Levenglick

Jul 19 '05 #4

Anno Siegel

DOV LEVENGLICK <Do************@motorola.com> wrote in comp.lang.perl.misc:

"Michael Vilain " wrote:

[DOV's top-posting re-arranged]

$value =~ s/\<[^\<]+\>//g;

and I'm trying to parse it out and figure out why it works. First off,
some questions:

- why escape the "<"? It's not one of the meta characters that has
special meaning in a regex.

you have to escape < because it can be used as a search delimiter

This is nonsense. What are you talking about? And don't top-post.
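
For reference, a sketch of when the delimiter question would come into play at all (sample string made up). With "/" as the delimiter, as in the regex under discussion, "<" is just an ordinary character. Backslashes would only be called for if the angle brackets themselves were chosen as the delimiters.

```perl
use strict;
use warnings;

my $value = 'x <em>y</em> z';   # made-up sample

# Delimited by "/": angle brackets are plain characters, no escaping needed.
(my $slash = $value) =~ s/<[^>]+>//g;

# Delimited by "<" and ">": here the literal angle brackets are backslashed
# so that they are not taken for delimiters.
(my $angle = $value) =~ s<\<[^\>]+\>><>g;

print "$slash\n";   # prints "x y z"
print "$angle\n";   # prints "x y z"
```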

Anno

Jul 19 '05 #5
