Stripping scripts from HTML with regular expressions

Michel Bouwmans

Hey everyone,

I'm trying to strip all script-blocks from a HTML-file using regex.

I tried the following in Python:

testfile = open('testfile')
testhtml = testfile.read()
regex = re.compile('<script\b[^>]*>(.*?)</script>', re.DOTALL)
result = regex.sub('', blaat)
print result

This strips far more away than just the script-blocks. Am I missing
something from the regex-implementation from Python or am I doing something
else wrong?

greetz
MFB

Apr 9 '08 #1

Reedick, Andrew

[Insert obligatory comment about using a html specific parser
(HTMLParser) instead of regexes.]
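
For what the parser route could look like, here is a minimal sketch using only the standard library. It assumes Python 3, where the HTMLParser class lives in html.parser; the names ScriptStripper and strip_scripts are made up for this illustration and are not part of any library. It re-emits whatever it parses and skips everything between a script start tag and its end tag. End tags come back lowercased, and this is a sketch rather than a sanitizer.

```python
from html.parser import HTMLParser


class ScriptStripper(HTMLParser):
    """Re-emit the markup it is fed, leaving out <script> elements."""

    def __init__(self):
        # convert_charrefs=False so entities like &amp; pass through as written.
        super().__init__(convert_charrefs=False)
        self.parts = []
        self.in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
        elif not self.in_script:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>; a self-closing <script/> is dropped.
        if tag != "script" and not self.in_script:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False
        elif not self.in_script:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.in_script:
            self.parts.append(data)

    def handle_entityref(self, name):
        if not self.in_script:
            self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        if not self.in_script:
            self.parts.append("&#%s;" % name)

    def handle_comment(self, data):
        if not self.in_script:
            self.parts.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.parts.append("<!%s>" % decl)


def strip_scripts(html):
    """Hypothetical helper: return html with all script elements removed."""
    parser = ScriptStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

Calling strip_scripts(testhtml) on the file contents would then take the place of the regex.sub call.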

Actually your regex didn't appear to strip anything. You probably saw
stuff disappear because blaat != testhtml:
testhtml = testfile.read()
result = regex.sub('', blaat)
Try this:

import re

testfile = open('a.html')
testhtml = testfile.read()
regex = re.compile('<script\s+.*?>(.*?)</script>', re.DOTALL)
result = regex.sub('',testhtml)

print result
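
A note on why the original pattern matched nothing: in an ordinary Python string literal, '\b' is the backspace character, so the regex was looking for "<script" followed by a backspace. Written as a raw string, \b means a word boundary as intended. The pattern above with \s+ avoids that problem, but it requires whitespace after "script", so a bare <script> tag without attributes is not matched. The sketch below assumes Python 3; SCRIPT_RE and strip_scripts_regex are names invented for this example, and the re.IGNORECASE flag and the \s* before the closing > are additions not present in either pattern in this thread.

```python
import re

# In a normal string literal '\b' is the backspace character (0x08),
# so the pattern has to be a raw string for \b to mean "word boundary".
assert '\b' == '\x08'

# Raw-string version of the original pattern. IGNORECASE and the \s* before
# the final '>' are extras added for this sketch.
SCRIPT_RE = re.compile(r'<script\b[^>]*>.*?</script\s*>',
                       re.DOTALL | re.IGNORECASE)


def strip_scripts_regex(html):
    """Remove <script>...</script> blocks. A sketch, not a sanitizer."""
    return SCRIPT_RE.sub('', html)


if __name__ == '__main__':
    sample = '<p>a</p><script>x()</script><p>b</p>'
    print(strip_scripts_regex(sample))
```

This still inherits the usual limits of regex-based HTML handling, for example a "</script>" that appears inside a JavaScript string literal ends the match early.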

Apr 9 '08 #2

Stefan Behnel

Michel Bouwmans wrote:

I'm trying to strip all script-blocks from a HTML-file using regex.

You might want to take a look at lxml.html instead, which comes with an HTML
cleaner module:

http://codespeak.net/lxml/lxmlhtml.h...eaning-up-html

Stefan

Apr 9 '08 #3

Nikita the Spider

In article <ma**************************************@python.org>,
"Reedick, Andrew" <jr****@ATT.COM> wrote:

[Insert obligatory comment about using a html specific parser
(HTMLParser) instead of regexes.]

Yah, seconded. To the OP - use BeautifulSoup or HtmlData unless you like
to reinvent wheels.

--
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more

Apr 10 '08 #4
