[Alan Kennedy]
The optimal solution, IMHO, is to tidy the HTML into XML, and then use
SAX to filter out the stuff you don't want. Here is some code that
does the latter. This should be nice and fast, and use a lot less
memory than object-model based approaches.
Unfortunately, in my haste to post a demonstration of a technique
earlier on, I posted running code that is both buggy and *INSECURE*.
The following are problems with it
1. A bug in making up the attribute string results in loss of
permitted attributes.
2. The failure to escape character data (i.e. map '<' to '<' and
'>' to '>') as it is written out gives rise to the possibility of a
code injection attack. It's easy to circumvent the check for malicious
code: I'll leave to y'all to figure out how.
3. I have a feeling that the failure to escape the attribute values
also opens the possibility of a code injection attack. I'm not
certain: it depends on the browser environment in which the final HTML
is rendered.
Anyway, here's some updated code that closes the SECURITY HOLES in the
earlier-posted version :-(
#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
import xml.sax
from xml.sax.saxutils import escape, quoteattr
import cStringIO as StringIO
permittedElements = ['html', 'body', 'b', 'i', 'p']
permittedAttrs = ['class', 'id', ]
class cleaner(xml.sax.handler.ContentHandler):
def __init__(self):
xml.sax.handler.ContentHandler.__init__(self)
self.outbuf = StringIO.StringIO()
def startElement(self, elemname, attrs):
if elemname in permittedElements:
attrstr = ""
for a in attrs.keys():
if a in permittedAttrs:
attrstr = "%s%s" % (attrstr, " %s=%s" % (a,
quoteattr(attrs[a])))
self.outbuf.write("<%s%s>" % (elemname, attrstr))
def endElement(self, elemname):
if elemname in permittedElements:
self.outbuf.write("</%s>" % (elemname,))
def characters(self, s):
self.outbuf.write("%s" % (escape(s),))
testdoc = """
<html>
<body>
<p class="1" id="2">This paragraph contains <b>only</b> permitted
elements.</p>
<p>This paragraph contains <i
onclick="javascript
:pop('porno.htm')">disallowed
attributes</i>.</p>
<img src="http://www.blackhat.com/session_hijack.gif"/>
<p>This paragraph contains
<script src="blackhat.js"/>a potential script
attack</p>
</body>
</html>
"""
if __name__ == "__main__":
parser = xml.sax.make_parser()
mycleaner = cleaner()
parser.setContentHandler(mycleaner)
parser.setFeature(xml.sax.handler.feature_namespac es, 0)
parser.feed(testdoc)
print mycleaner.outbuf.getvalue()
#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
regards,
--
alan kennedy
------------------------------------------------------
check http headers here:
http://xhaus.com/headers
email alan:
http://xhaus.com/contact/alan