Stripping scripts from HTML with regular expressions

Hey everyone,

I'm trying to strip all script blocks from an HTML file using regex.

I tried the following in Python:

testfile = open('testfile')
testhtml = testfile.read()
regex = re.compile('<script\b[^>]*>(.*?)</script>', re.DOTALL)
result = regex.sub('', blaat)
print result

This strips away far more than just the script blocks. Am I missing
something about Python's regex implementation, or am I doing something
else wrong?

greetz
MFB
Apr 9 '08 #1

-----Original Message-----
From: py********************************@python.org [mailto:python-
li*************************@python.org] On Behalf Of Michel Bouwmans
Sent: Wednesday, April 09, 2008 3:38 PM
To: py*********@python.org
Subject: Stripping scripts from HTML with regular expressions

[Insert obligatory comment about using a html specific parser
(HTMLParser) instead of regexes.]
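
For reference, here is a minimal sketch of the parser-based route mentioned above, using Python 3's standard-library html.parser module (where the HTMLParser class now lives). The names ScriptStripper and strip_scripts are made up for this illustration. The output is rebuilt from parser events, so for malformed markup it may not be byte-identical to the input.

```python
from html.parser import HTMLParser


class ScriptStripper(HTMLParser):
    """Rebuild the HTML from parser events, leaving out <script> elements.

    Hypothetical helper written for this example, not part of any library.
    """

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as separate
        # events, so they can be written back unchanged.
        super().__init__(convert_charrefs=False)
        self.parts = []
        self.in_script = False

    def _emit(self, text):
        # Only keep text that is outside a <script> element.
        if not self.in_script:
            self.parts.append(text)

    def handle_starttag(self, tag, attrs):
        if tag == 'script':
            self.in_script = True
        else:
            self._emit(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags like <br/>; a self-closed <script/> is dropped.
        if tag != 'script':
            self._emit(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == 'script':
            self.in_script = False
        else:
            self._emit('</%s>' % tag)

    def handle_data(self, data):
        self._emit(data)

    def handle_entityref(self, name):
        self._emit('&%s;' % name)

    def handle_charref(self, name):
        self._emit('&#%s;' % name)

    def handle_comment(self, data):
        self._emit('<!--%s-->' % data)

    def handle_decl(self, decl):
        self._emit('<!%s>' % decl)

    def handle_pi(self, data):
        self._emit('<?%s>' % data)


def strip_scripts(html):
    """Return html with every <script>...</script> element removed."""
    parser = ScriptStripper()
    parser.feed(html)
    parser.close()
    return ''.join(parser.parts)


if __name__ == '__main__':
    sample = '<p>a</p><script type="text/javascript">var x = "<b>";</script><p>b &amp; c</p>'
    print(strip_scripts(sample))
```

The parser treats the contents of a script element as opaque text up to the closing tag, so markup-like strings inside the JavaScript (the "<b>" above) do not confuse it.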

Actually your regex didn't appear to strip anything. You probably saw
stuff disappear because blaat != testhtml:
testhtml = testfile.read()
result = regex.sub('', blaat)
Try this:

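import re

# Read the whole test file into one string
with open('a.html') as testfile:
    testhtml = testfile.read()

# Raw string, so \b reaches the regex engine as a word boundary rather than
# a backspace character; \b[^>]* also matches a bare <script> tag
regex = re.compile(r'<script\b[^>]*>(.*?)</script>', re.DOTALL)
result = regex.sub('', testhtml)  # substitute on testhtml, not blaat

print(result)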
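
A note on why the pattern from the question matches nothing, shown as a small sketch: in an ordinary Python string literal, '\b' is the backspace character, so the compiled regex looks for "<script" followed by a literal backspace. Writing the pattern as a raw string passes \b through to the regex engine as a word boundary. The sample HTML and the variable names below are made up for the demonstration.

```python
import re

html = '<p>keep</p><script type="text/javascript">alert("x");</script><p>also keep</p>'

# Ordinary string literal: '\b' is the single backspace character (0x08),
# so this pattern requires a backspace right after "<script".
cooked = re.compile('<script\b[^>]*>(.*?)</script>', re.DOTALL)

# Raw string literal: the backslash survives, and the regex engine sees \b,
# the word-boundary assertion.
raw = re.compile(r'<script\b[^>]*>(.*?)</script>', re.DOTALL)

unchanged = cooked.sub('', html)
stripped = raw.sub('', html)

print(unchanged == html)  # True: nothing was removed
print(stripped)           # <p>keep</p><p>also keep</p>
```

Even with the raw string, this stays a regex over HTML: it will not cope with things like differently cased tags (unless re.IGNORECASE is added) or the text "</script>" appearing inside a JavaScript string.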


Apr 9 '08 #2

Michel Bouwmans wrote:
I'm trying to strip all script-blocks from a HTML-file using regex.
You might want to take a look at lxml.html instead, which comes with an HTML
cleaner module:

http://codespeak.net/lxml/lxmlhtml.h...eaning-up-html

Stefan
Apr 9 '08 #3

In article <ma**************************************@python.org>,
"Reedick, Andrew" <jr****@ATT.COM> wrote:
[Insert obligatory comment about using a html specific parser
(HTMLParser) instead of regexes.]
Yah, seconded. To the OP - use BeautifulSoup or HtmlData unless you like
to reinvent wheels.

--
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
Apr 10 '08 #4