-----Original Message-----
From: py********************************@python.org [mailto:python-
li*************************@python.org] On Behalf Of Michel Bouwmans
Sent: Wednesday, April 09, 2008 3:38 PM
To: py*********@python.org
Subject: Stripping scripts from HTML with regular expressions
Hey everyone,
I'm trying to strip all script blocks from an HTML file using regex.
I tried the following in Python:
testfile = open('testfile')
testhtml = testfile.read()
regex = re.compile('<script\b[^>]*>(.*?)</script>', re.DOTALL)
result = regex.sub('', blaat)
print result
This strips away far more than just the script blocks. Am I missing
something in Python's regex implementation, or am I doing something
else wrong?
[Insert obligatory comment about using an HTML-specific parser
(HTMLParser) instead of regexes.]
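For what that obligatory comment might look like in practice, here is a minimal sketch. It assumes Python 3, where the module is html.parser (in Python 2 it was HTMLParser). ScriptStripper and strip_scripts are names made up for illustration, not part of any library. The parser re-emits everything it sees except script elements, so the output is re-serialised rather than byte-identical (end tags come back lowercased, for instance).

```python
from html.parser import HTMLParser


class ScriptStripper(HTMLParser):
    """Re-emit the input HTML, skipping <script> elements and their contents."""

    def __init__(self):
        # convert_charrefs=False so entities such as &amp; reach the
        # handlers below and can be written back unchanged.
        super().__init__(convert_charrefs=False)
        self.parts = []
        self.in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == 'script':
            self.in_script = True
        elif not self.in_script:
            # get_starttag_text() returns the tag exactly as written.
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>; a self-closing script is dropped.
        if tag != 'script' and not self.in_script:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == 'script':
            self.in_script = False
        elif not self.in_script:
            self.parts.append('</%s>' % tag)

    def handle_data(self, data):
        if not self.in_script:
            self.parts.append(data)

    def handle_entityref(self, name):
        if not self.in_script:
            self.parts.append('&%s;' % name)

    def handle_charref(self, name):
        if not self.in_script:
            self.parts.append('&#%s;' % name)

    def handle_comment(self, data):
        if not self.in_script:
            self.parts.append('<!--%s-->' % data)

    def handle_decl(self, decl):
        self.parts.append('<!%s>' % decl)

    def handle_pi(self, data):
        self.parts.append('<?%s>' % data)


def strip_scripts(html):
    """Return html with every <script>...</script> element removed."""
    parser = ScriptStripper()
    parser.feed(html)
    parser.close()
    return ''.join(parser.parts)


sample = ('<html><head><script type="text/javascript">var a = "<b>";</script>'
          '</head><body><p>Hi &amp; bye</p><script>alert(1)</script>'
          '</body></html>')
cleaned = strip_scripts(sample)
print(cleaned)
```

Usage would be along the lines of strip_scripts(open('a.html').read()). Unlike a regex, this copes with a ">" or a tag-like string inside the script body, because the parser treats script contents as opaque data.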
Actually your regex didn't appear to strip anything. You probably saw
stuff disappear because blaat != testhtml:
testhtml = testfile.read()
result = regex.sub('', blaat)
Try this:
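import re
testfile = open('a.html')
testhtml = testfile.read()
testfile.close()
# Raw string, so the backslash escapes reach the regex engine untouched.
regex = re.compile(r'<script\s+.*?>(.*?)</script>', re.DOTALL)
# Substitute on testhtml, the variable that actually holds the file contents.
result = regex.sub('', testhtml)
print(result)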
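A likely reason the original pattern stripped nothing at all is that it was not written as a raw string: in an ordinary Python string literal, '\b' is a backspace character, so the regex engine never sees a word boundary. The sketch below illustrates this on a small made-up sample (the sample string and variable names are mine, not from the original post). It also shows that the pattern suggested above insists on whitespace after "script", so it skips a bare <script> tag, whereas a raw-string version of the original pattern catches both.

```python
import re

# Hypothetical sample: one script tag with attributes, one without.
sample = ('<p>a</p><script type="text/javascript">x = 1;</script>'
          '<p>b</p><script>y = 2;</script>')

# No r prefix: Python turns '\b' into a backspace (0x08) before the re
# module ever sees the pattern, so it cannot match a script tag.
not_raw = re.compile('<script\b[^>]*>(.*?)</script>', re.DOTALL)

# Raw string: \b reaches the regex engine as a word boundary.
raw = re.compile(r'<script\b[^>]*>.*?</script>', re.DOTALL | re.IGNORECASE)

# The pattern from the reply: \s+ requires whitespace after "script".
reply = re.compile(r'<script\s+.*?>(.*?)</script>', re.DOTALL)

unchanged = not_raw.sub('', sample)
stripped = raw.sub('', sample)
partial = reply.sub('', sample)

print(unchanged == sample)  # the non-raw pattern removed nothing
print(stripped)
print(partial)
```

Even the raw-string version is only as reliable as the markup it is fed; a "</script>" inside a JavaScript string literal, for example, would end the match early.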