472,958 Members | 2,587 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 472,958 software developers and data experts.

Simple allowing of HTML elements/attributes?

I'm writing a site with mod_python which will have, among other things,
forums. I want to allow users to use some HTML (<em>, <strong>, <p>,
etc.) on the forums, but I don't want to allow bad elements and
attributes (onclick, <script>, etc.). I would also like to do basic
validation (no overlapping elements like <strong><em>foo</em></strong>,
no missing end tags). I'm not asking anyone to write a script for me,
but does anyone have general ideas about how to do this quickly on an
active forum?
Jul 18 '05 #1
4 2409
At some point, Leif K-Brooks <eu*****@ecritters.biz> wrote:
I'm writing a site with mod_python which will have, among other
things, forums. I want to allow users to use some HTML (<em>,
<strong>, <p>, etc.) on the forums, but I don't want to allow bad
elements and attributes (onclick, <script>, etc.). I would also like
to do basic validation (no overlapping elements like
<strong><em>foo</em></strong>, no missing end tags). I'm not asking
anyone to write a script for me, but does anyone have general ideas
about how to do this quickly on an active forum?


You could require valid XML, and use a validating XML parser to
check conformance. You'd have to make sure the output is correctly
quoted (for instance, check that HTML tags in a CDATA block get quoted).

--
|>|\/|<
/--------------------------------------------------------------------------\
|David M. Cooke
|cookedm(at)physics(dot)mcmaster(dot)ca
Jul 18 '05 #2
co**********@physics.mcmaster.ca (David M. Cooke) wrote in message news:<qn*************@arbutus.physics.mcmaster.ca> ...
At some point, Leif K-Brooks <eu*****@ecritters.biz> wrote:
I'm writing a site with mod_python which will have, among other
things, forums. I want to allow users to use some HTML (<em>,
<strong>, <p>, etc.) on the forums, but I don't want to allow bad
elements and attributes (onclick, <script>, etc.). I would also like
to do basic validation (no overlapping elements like
<strong><em>foo</em></strong>, no missing end tags). I'm not asking
anyone to write a script for me, but does anyone have general ideas
about how to do this quickly on an active forum?


You could require valid XML, and use a validating XML parser to
check conformance. You'd have to make sure the output is correctly
quoted (for instance, check that HTML tags in a CDATA block get quoted).


You could use Tidy (or tidylib) to convert error-ridden input into
valid HTML or XHTML, and then grab the BODY contents via an XML
parser, as David suggested. I imagine that the library version of tidy
is quick enough to meet your needs.

Or maybe you could use XSLT to cut the "bad stuff" out of your tidied
XHTML. (Not something I'm familiar with, but someone must have done
this before.)

There's a Python wrapper for tidylib at
http://utidylib.sourceforge.net/ .

-- Graham
Jul 18 '05 #3
[Leif K-Brooks]
I'm writing a site with mod_python which will have, among other
things, forums. I want to allow users to use some HTML (<em>,
<strong>, <p>, etc.) on the forums, but I don't want to allow bad
elements and attributes (onclick, <script>, etc.). I would also like
to do basic validation (no overlapping elements like
<strong><em>foo</em></strong>, no missing end tags). I'm not asking
anyone to write a script for me, but does anyone have general ideas
about how to do this quickly on an active forum?

"Quickly" being an important consideration for you, I'm presuming.

(David M. Cooke) You could require valid XML, and use a validating XML parser to
check conformance. You'd have to make sure the output is correctly
quoted (for instance, check that HTML tags in a CDATA block get quoted).

Hmmm, I'd imagine that the average forum user isn't going to know what
well-formed XML is. Also, validating-XML support is one of the areas
where python is lacking. Lastly, wrapping HTML tags in a CDATA block
won't deliver much benefit. You still have to send that HTML to the
browser, which will probably render the contents of the CDATA block
anyway.

[Graham Fawcett] You could use Tidy (or tidylib) to convert error-ridden input into
valid HTML or XHTML, and then grab the BODY contents via an XML
parser, as David suggested. I imagine that the library version of tidy
is quick enough to meet your needs.
This is a good idea. Tidy is always a good way to get easily
processable XML from badly-formed HTML. There are multiple ways to run
Tidy from python: use MAL's utidy library, use the command line
executable and pipes, or in jython use JTidy.

http://sourceforge.net/projects/jtidy

[Graham Fawcett] Or maybe you could use XSLT to cut the "bad stuff" out of your tidied
XHTML. (Not something I'm familiar with, but someone must have done
this before.)


However, this is not a good idea. XSLT requires an Object Model of the
document, meaning that you're going to use a lot of cpu-time and
memory. In extreme cases, e.g. where some black-hat attempts to upload
a 20 Mbyte HTML file, you're opening yourself up to a
Denial-Of-Service attack, when your server tries to build up a [D]OM
of that document.

The optimal solution, IMHO, is to tidy the HTML into XML, and then use
SAX to filter out the stuff you don't want. Here is some code that
does the latter. This should be nice and fast, and use a lot less
memory than object-model based approaches.

#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
import xml.sax
import cStringIO as StringIO

permittedElements = ['html', 'body', 'b', 'i', 'p']
permittedAttrs = ['class', 'id', ]

class cleaner(xml.sax.handler.ContentHandler):

def __init__(self):
xml.sax.handler.ContentHandler.__init__(self)
self.outbuf = StringIO.StringIO()

def startElement(self, elemname, attrs):
if elemname in permittedElements:
attrstr = ""
for a in attrs.keys():
if a in permittedAttrs:
attrstr = "%s " % "%s='%s'" % (a, attrs[a])
self.outbuf.write("<%s%s>" % (elemname, attrstr))

def endElement(self, elemname):
if elemname in permittedElements:
self.outbuf.write("</%s>" % (elemname,))

def characters(self, s):
self.outbuf.write("%s" % (s,))

testdoc = """
<html>
<body>
<p>This paragraph contains <b>only</b> permitted elements.</p>
<p>This paragraph contains <i
onclick="javascript:pop('porno.htm')">disallowed
attributes</i>.</p>
<img src="http://www.blackhat.com/session_hijack.gif"/>
<p>This paragraph contains
<a href="http://www.jscript-attack.com/">a potential script
attack</a></p>
</body>
</html>
"""

if __name__ == "__main__":
parser = xml.sax.make_parser()
mycleaner = cleaner()
parser.setContentHandler(mycleaner)
parser.setFeature(xml.sax.handler.feature_namespac es, 0)
parser.feed(testdoc)
print mycleaner.outbuf.getvalue()
#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Tidying the HTML to XML is left as an exercise to the reader ;-)

HTH,

--
alan kennedy
------------------------------------------------------
check http headers here: http://xhaus.com/headers
email alan: http://xhaus.com/contact/alan
Jul 18 '05 #4
[Alan Kennedy]
The optimal solution, IMHO, is to tidy the HTML into XML, and then use
SAX to filter out the stuff you don't want. Here is some code that
does the latter. This should be nice and fast, and use a lot less
memory than object-model based approaches.


Unfortunately, in my haste to post a demonstration of a technique
earlier on, I posted running code that is both buggy and *INSECURE*.
The following are problems with it

1. A bug in making up the attribute string results in loss of
permitted attributes.

2. The failure to escape character data (i.e. map '<' to '&lt;' and
'>' to '&gt;') as it is written out gives rise to the possibility of a
code injection attack. It's easy to circumvent the check for malicious
code: I'll leave to y'all to figure out how.

3. I have a feeling that the failure to escape the attribute values
also opens the possibility of a code injection attack. I'm not
certain: it depends on the browser environment in which the final HTML
is rendered.

Anyway, here's some updated code that closes the SECURITY HOLES in the
earlier-posted version :-(

#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
import xml.sax
from xml.sax.saxutils import escape, quoteattr
import cStringIO as StringIO

permittedElements = ['html', 'body', 'b', 'i', 'p']
permittedAttrs = ['class', 'id', ]

class cleaner(xml.sax.handler.ContentHandler):

def __init__(self):
xml.sax.handler.ContentHandler.__init__(self)
self.outbuf = StringIO.StringIO()

def startElement(self, elemname, attrs):
if elemname in permittedElements:
attrstr = ""
for a in attrs.keys():
if a in permittedAttrs:
attrstr = "%s%s" % (attrstr, " %s=%s" % (a,
quoteattr(attrs[a])))
self.outbuf.write("<%s%s>" % (elemname, attrstr))

def endElement(self, elemname):
if elemname in permittedElements:
self.outbuf.write("</%s>" % (elemname,))

def characters(self, s):
self.outbuf.write("%s" % (escape(s),))

testdoc = """
<html>
<body>
<p class="1" id="2">This paragraph contains <b>only</b> permitted
elements.</p>
<p>This paragraph contains <i
onclick="javascript:pop('porno.htm')">disallowed
attributes</i>.</p>
<img src="http://www.blackhat.com/session_hijack.gif"/>
<p>This paragraph contains
<script src="blackhat.js"/>a potential script
attack</p>
</body>
</html>
"""

if __name__ == "__main__":
parser = xml.sax.make_parser()
mycleaner = cleaner()
parser.setContentHandler(mycleaner)
parser.setFeature(xml.sax.handler.feature_namespac es, 0)
parser.feed(testdoc)
print mycleaner.outbuf.getvalue()
#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

regards,

--
alan kennedy
------------------------------------------------------
check http headers here: http://xhaus.com/headers
email alan: http://xhaus.com/contact/alan
Jul 18 '05 #5

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

4
by: scorpion | last post by:
I have a simple type like this: <xs:simpleType name="SizeType"> <xs:restriction base="xs:token"> <xs:enumeration value="small"/> <xs:enumeration value="medium"/> <xs:enumeration...
16
by: Terry | last post by:
Hi, This is a newbie's question. I want to preload 4 images and only when all 4 images has been loaded into browser's cache, I want to start a slideshow() function. If images are not completed...
10
by: Rithish | last post by:
I want to emulate paging in an HTML document. something like, ------------------------- | | | <DIV> | | | | <TABLE></TABLE>...
7
by: J. Hall | last post by:
Hi dudes, Got a simple webpage, with three numeric text input boxes, the idea being that the user is asked to insert percentages of their business around the world... UK, Europe, Other ...
14
by: Brandon | last post by:
I am an amateur working on a first site, I have settled on using FP 2002 for now. My current page is up and live, but I have two errors that I cant seem to get rid of ... Line 29, column 6:...
22
by: Luke | last post by:
Elements with name attribute: form, input, textarea, a, frame, iframe, button, select, map, meta, applet, object, param, img (if you know more reply...) Methods of addresing html elements:...
9
by: Patient Guy | last post by:
Taking the BODY element as an example, all of its style attributes ('alink', 'vlink', 'background', 'text', etc.) are deprecated in HTML 4.01, a fact noted in the DOM Level 2 HTML specification. ...
29
by: Knut Olsen-Solberg | last post by:
I try to change the text in a <p> using getElementById(). I wonder what properties exists, and which one to use here. (The following does not work.) Regards Knut ______________________ ...
5
by: sean.stonehart | last post by:
I've got a menu written in Javascript I'm wanting to enable but I need for it to sit in the center of a 3 column table. The menu keeps anchoring itself to the top left of the display, which is not...
2
by: DJRhino | last post by:
Was curious if anyone else was having this same issue or not.... I was just Up/Down graded to windows 11 and now my access combo boxes are not acting right. With win 10 I could start typing...
2
isladogs
by: isladogs | last post by:
The next Access Europe meeting will be on Wednesday 4 Oct 2023 starting at 18:00 UK time (6PM UTC+1) and finishing at about 19:15 (7.15PM) The start time is equivalent to 19:00 (7PM) in Central...
0
tracyyun
by: tracyyun | last post by:
Hello everyone, I have a question and would like some advice on network connectivity. I have one computer connected to my router via WiFi, but I have two other computers that I want to be able to...
4
NeoPa
by: NeoPa | last post by:
Hello everyone. I find myself stuck trying to find the VBA way to get Access to create a PDF of the currently-selected (and open) object (Form or Report). I know it can be done by selecting :...
3
NeoPa
by: NeoPa | last post by:
Introduction For this article I'll be using a very simple database which has Form (clsForm) & Report (clsReport) classes that simply handle making the calling Form invisible until the Form, or all...
0
isladogs
by: isladogs | last post by:
The next Access Europe meeting will be on Wednesday 1 Nov 2023 starting at 18:00 UK time (6PM UTC) and finishing at about 19:15 (7.15PM) Please note that the UK and Europe revert to winter time on...
3
by: nia12 | last post by:
Hi there, I am very new to Access so apologies if any of this is obvious/not clear. I am creating a data collection tool for health care employees to complete. It consists of a number of...
0
NeoPa
by: NeoPa | last post by:
Introduction For this article I'll be focusing on the Report (clsReport) class. This simply handles making the calling Form invisible until all of the Reports opened by it have been closed, when it...
2
by: GKJR | last post by:
Does anyone have a recommendation to build a standalone application to replace an Access database? I have my bookkeeping software I developed in Access that I would like to make available to other...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.