Simple allowing of HTML elements/attributes?

I'm writing a site with mod_python which will have, among other things,
forums. I want to allow users to use some HTML (<em>, <strong>, <p>,
etc.) on the forums, but I don't want to allow bad elements and
attributes (onclick, <script>, etc.). I would also like to do basic
validation (no overlapping elements like <strong><em>foo</strong></em>,
no missing end tags). I'm not asking anyone to write a script for me,
but does anyone have general ideas about how to do this quickly on an
active forum?
Jul 18 '05 #1
At some point, Leif K-Brooks <eu*****@ecritters.biz> wrote:
I'm writing a site with mod_python which will have, among other
things, forums. I want to allow users to use some HTML (<em>,
<strong>, <p>, etc.) on the forums, but I don't want to allow bad
elements and attributes (onclick, <script>, etc.). I would also like
to do basic validation (no overlapping elements like
<strong><em>foo</strong></em>, no missing end tags). I'm not asking
anyone to write a script for me, but does anyone have general ideas
about how to do this quickly on an active forum?


You could require valid XML, and use a validating XML parser to
check conformance. You'd have to make sure the output is correctly
quoted (for instance, check that HTML tags in a CDATA block get quoted).
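
A minimal sketch of the parser-based check, assuming Python 3 and the standard library's expat-backed `xml.sax`. Note that this parser is non-validating, so the sketch only tests well-formedness (which is enough to catch overlapping elements and missing end tags). The `is_well_formed` name and the `<post>` wrapper element are hypothetical, introduced here for illustration.

```python
import xml.sax


def is_well_formed(fragment):
    """Return (True, None) if the fragment parses as XML, else (False, message)."""
    # Wrap the fragment in a hypothetical <post> root so that a post made of
    # several top-level elements (e.g. two <p> blocks) is still one document.
    doc = "<post>%s</post>" % fragment
    try:
        xml.sax.parseString(doc.encode("utf-8"), xml.sax.ContentHandler())
    except xml.sax.SAXParseException as exc:
        return False, exc.getMessage()
    return True, None


print(is_well_formed("<strong><em>foo</em></strong>"))
print(is_well_formed("<strong><em>foo</strong></em>"))
```

The second call reports a mismatched tag, which is the overlap case from the original question. Checking element and attribute names against a whitelist would still need a separate pass.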

--
|>|\/|<
/--------------------------------------------------------------------------\
|David M. Cooke
|cookedm(at)phy sics(dot)mcmast er(dot)ca
Jul 18 '05 #2
co**********@physics.mcmaster.ca (David M. Cooke) wrote in message news:<qn*************@arbutus.physics.mcmaster.ca>...
At some point, Leif K-Brooks <eu*****@ecritters.biz> wrote:
I'm writing a site with mod_python which will have, among other
things, forums. I want to allow users to use some HTML (<em>,
<strong>, <p>, etc.) on the forums, but I don't want to allow bad
elements and attributes (onclick, <script>, etc.). I would also like
to do basic validation (no overlapping elements like
<strong><em>foo</strong></em>, no missing end tags). I'm not asking
anyone to write a script for me, but does anyone have general ideas
about how to do this quickly on an active forum?


You could require valid XML, and use a validating XML parser to
check conformance. You'd have to make sure the output is correctly
quoted (for instance, check that HTML tags in a CDATA block get quoted).


You could use Tidy (or tidylib) to convert error-ridden input into
valid HTML or XHTML, and then grab the BODY contents via an XML
parser, as David suggested. I imagine that the library version of tidy
is quick enough to meet your needs.

Or maybe you could use XSLT to cut the "bad stuff" out of your tidied
XHTML. (Not something I'm familiar with, but someone must have done
this before.)

There's a Python wrapper for tidylib at
http://utidylib.sourceforge.net/ .

-- Graham
Jul 18 '05 #3
[Leif K-Brooks]
I'm writing a site with mod_python which will have, among other
things, forums. I want to allow users to use some HTML (<em>,
<strong>, <p>, etc.) on the forums, but I don't want to allow bad
elements and attributes (onclick, <script>, etc.). I would also like
to do basic validation (no overlapping elements like
<strong><em>foo</strong></em>, no missing end tags). I'm not asking
anyone to write a script for me, but does anyone have general ideas
about how to do this quickly on an active forum?

"Quickly" being an important consideration for you, I'm presuming.

[David M. Cooke] You could require valid XML, and use a validating XML parser to
check conformance. You'd have to make sure the output is correctly
quoted (for instance, check that HTML tags in a CDATA block get quoted).

Hmmm, I'd imagine that the average forum user isn't going to know what
well-formed XML is. Also, validating-XML support is one of the areas
where Python is lacking. Lastly, wrapping HTML tags in a CDATA block
won't deliver much benefit. You still have to send that HTML to the
browser, which will probably render the contents of the CDATA block
anyway.

[Graham Fawcett] You could use Tidy (or tidylib) to convert error-ridden input into
valid HTML or XHTML, and then grab the BODY contents via an XML
parser, as David suggested. I imagine that the library version of tidy
is quick enough to meet your needs.

This is a good idea. Tidy is a reliable way to get easily processable
XML from badly formed HTML. There are several ways to run Tidy from
Python: use MAL's utidy library, use the command-line executable
through pipes, or, in Jython, use JTidy.

http://sourceforge.net/projects/jtidy
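
The command-line-and-pipes route might look roughly like the sketch below, assuming Python 3 and a `tidy` executable on the PATH. The function name `tidy_to_xhtml`, the timeout, and the exact flag list are illustrative assumptions; check the flags against the tidy version you actually have installed.

```python
import subprocess

# Assumed flags: -asxhtml (emit XHTML), -quiet, -utf8 and
# "--show-warnings no". Verify them against your installed tidy.
TIDY_CMD = ["tidy", "-asxhtml", "-quiet", "-utf8", "--show-warnings", "no"]


def tidy_to_xhtml(html, cmd=TIDY_CMD, timeout=10):
    """Pipe html through an external tidy-like command and return its output."""
    proc = subprocess.run(
        cmd,
        input=html.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout,  # do not let one bad post hang a worker forever
    )
    # tidy conventionally exits with 0 (clean), 1 (warnings) or 2 (errors).
    if proc.returncode >= 2:
        raise ValueError("tidy could not repair the input: %s"
                         % proc.stderr.decode("utf-8", "replace").strip())
    return proc.stdout.decode("utf-8")
```

Spawning a process per post costs more than an in-process binding, so on a busy forum a library wrapper is probably the better fit; the pipe version is mainly useful because it has no dependencies beyond the executable.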

[Graham Fawcett] Or maybe you could use XSLT to cut the "bad stuff" out of your tidied
XHTML. (Not something I'm familiar with, but someone must have done
this before.)


However, this is not a good idea. XSLT requires an object model of the
whole document, meaning that you're going to use a lot of CPU time and
memory. In extreme cases, e.g. where some black hat attempts to upload
a 20 MB HTML file, you're opening yourself up to a denial-of-service
attack when your server tries to build a [D]OM of that document.
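
Whichever parser ends up doing the work, a size cap applied before any tidying or parsing blunts the oversized-upload case. A minimal sketch, assuming Python 3; `check_post_size` and the 64 KB figure are hypothetical and should be tuned to the forum.

```python
MAX_POST_BYTES = 64 * 1024  # hypothetical limit; tune for your forum


def check_post_size(raw, limit=MAX_POST_BYTES):
    """Reject oversized submissions before any tidying or parsing happens."""
    # Measure encoded bytes, since that is what the parser has to consume.
    size = len(raw.encode("utf-8")) if isinstance(raw, str) else len(raw)
    if size > limit:
        raise ValueError("post is %d bytes; the limit is %d" % (size, limit))
    return raw
```

This only guards the application layer; a request-body limit in the web server itself is a sensible second line of defence.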

The optimal solution, IMHO, is to tidy the HTML into XML, and then use
SAX to filter out the stuff you don't want. Here is some code that
does the latter. This should be nice and fast, and use a lot less
memory than object-model based approaches.

#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
import xml.sax
import cStringIO as StringIO

permittedElements = ['html', 'body', 'b', 'i', 'p']
permittedAttrs = ['class', 'id', ]

class cleaner(xml.sax.handler.ContentHandler):

    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self.outbuf = StringIO.StringIO()

    def startElement(self, elemname, attrs):
        if elemname in permittedElements:
            attrstr = ""
            for a in attrs.keys():
                if a in permittedAttrs:
                    attrstr = "%s " % "%s='%s'" % (a, attrs[a])
            self.outbuf.write("<%s%s>" % (elemname, attrstr))

    def endElement(self, elemname):
        if elemname in permittedElements:
            self.outbuf.write("</%s>" % (elemname,))

    def characters(self, s):
        self.outbuf.write("%s" % (s,))

testdoc = """
<html>
<body>
<p>This paragraph contains <b>only</b> permitted elements.</p>
<p>This paragraph contains <i
onclick="javascript:pop('porno.htm')">disallowed
attributes</i>.</p>
<img src="http://www.blackhat.com/session_hijack.gif"/>
<p>This paragraph contains
<a href="http://www.jscript-attack.com/">a potential script
attack</a></p>
</body>
</html>
"""

if __name__ == "__main__":
    parser = xml.sax.make_parser()
    mycleaner = cleaner()
    parser.setContentHandler(mycleaner)
    parser.setFeature(xml.sax.handler.feature_namespaces, 0)
    parser.feed(testdoc)
    print mycleaner.outbuf.getvalue()
#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Tidying the HTML to XML is left as an exercise to the reader ;-)

HTH,

--
alan kennedy
------------------------------------------------------
check http headers here: http://xhaus.com/headers
email alan: http://xhaus.com/contact/alan
Jul 18 '05 #4
[Alan Kennedy]
The optimal solution, IMHO, is to tidy the HTML into XML, and then use
SAX to filter out the stuff you don't want. Here is some code that
does the latter. This should be nice and fast, and use a lot less
memory than object-model based approaches.


Unfortunately, in my haste to post a demonstration of a technique
earlier on, I posted running code that is both buggy and *INSECURE*.
The following are the problems with it:

1. A bug in making up the attribute string results in loss of
permitted attributes.

2. The failure to escape character data (i.e. map '<' to '&lt;' and
'>' to '&gt;') as it is written out gives rise to the possibility of a
code injection attack. It's easy to circumvent the check for malicious
code: I'll leave it to y'all to figure out how.

3. I have a feeling that the failure to escape the attribute values
also opens the possibility of a code injection attack. I'm not
certain: it depends on the browser environment in which the final HTML
is rendered.
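
The three problems can be reproduced in isolation. The sketch below assumes Python 3 and uses only `escape` and `quoteattr` from `xml.sax.saxutils`; the sample attribute values are made up for illustration, and the injection scenarios are one plausible reading of the list above rather than a confirmed account of what was intended.

```python
from xml.sax.saxutils import escape, quoteattr

attrs = {"class": "1", "id": "2"}

# Problem 1: plain assignment overwrites attrstr on every pass, so only the
# last permitted attribute survives (and it has no leading space either).
buggy = ""
for name in attrs:
    buggy = "%s " % "%s='%s'" % (name, attrs[name])

# Accumulating instead keeps every permitted attribute.
fixed = ""
for name in attrs:
    fixed += " %s=%s" % (name, quoteattr(attrs[name]))

# Problem 2: characters() receives decoded text, so "&lt;script&gt;" in the
# input arrives as "<script>" and must be escaped again on the way out.
decoded = "<script>alert(1)</script>"
escaped = escape(decoded)

# Problem 3: a value containing a single quote breaks out of '...' quoting.
value = "x' onmouseover='evil()"
naive_attr = "id='%s'" % value
safe_attr = "id=%s" % quoteattr(value)

for label, text in [("buggy", buggy), ("fixed", fixed), ("escaped", escaped),
                    ("naive", naive_attr), ("safe", safe_attr)]:
    print("%-8s %s" % (label, text))
```

In the naive case the browser would see a second attribute, `onmouseover`, that never passed the whitelist; `quoteattr` picks a quoting style that keeps the whole value inside one attribute.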

Anyway, here's some updated code that closes the SECURITY HOLES in the
earlier-posted version :-(

#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
import xml.sax
from xml.sax.saxutils import escape, quoteattr
import cStringIO as StringIO

permittedElements = ['html', 'body', 'b', 'i', 'p']
permittedAttrs = ['class', 'id', ]

class cleaner(xml.sax.handler.ContentHandler):

    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self.outbuf = StringIO.StringIO()

    def startElement(self, elemname, attrs):
        if elemname in permittedElements:
            attrstr = ""
            for a in attrs.keys():
                if a in permittedAttrs:
                    # Accumulate every permitted attribute, safely quoted.
                    attrstr = "%s%s" % (attrstr, " %s=%s" % (a,
                        quoteattr(attrs[a])))
            self.outbuf.write("<%s%s>" % (elemname, attrstr))

    def endElement(self, elemname):
        if elemname in permittedElements:
            self.outbuf.write("</%s>" % (elemname,))

    def characters(self, s):
        # Re-escape text so that '<' and '>' never reach the browser raw.
        self.outbuf.write("%s" % (escape(s),))

testdoc = """
<html>
<body>
<p class="1" id="2">This paragraph contains <b>only</b> permitted
elements.</p>
<p>This paragraph contains <i
onclick="javascript:pop('porno.htm')">disallowed
attributes</i>.</p>
<img src="http://www.blackhat.com/session_hijack.gif"/>
<p>This paragraph contains
<script src="blackhat.js"/>a potential script
attack</p>
</body>
</html>
"""

if __name__ == "__main__":
    parser = xml.sax.make_parser()
    mycleaner = cleaner()
    parser.setContentHandler(mycleaner)
    parser.setFeature(xml.sax.handler.feature_namespaces, 0)
    parser.feed(testdoc)
    print mycleaner.outbuf.getvalue()
#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
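
For anyone on Python 3, where `cStringIO` and the `print` statement no longer exist, the same filter could be sketched as follows. The names `Cleaner`, `clean`, `PERMITTED_ELEMENTS` and `PERMITTED_ATTRS` are renamings chosen here, and `xml.sax.parseString` stands in for the `make_parser()`/`feed()` sequence; treat it as an adaptation under those assumptions, not as the posted code.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler
from xml.sax.saxutils import escape, quoteattr

PERMITTED_ELEMENTS = frozenset(["html", "body", "b", "i", "p"])
PERMITTED_ATTRS = frozenset(["class", "id"])


class Cleaner(ContentHandler):
    """Copy permitted elements and attributes; drop other tags, keep text."""

    def __init__(self):
        super().__init__()
        self.outbuf = io.StringIO()

    def startElement(self, name, attrs):
        if name in PERMITTED_ELEMENTS:
            # Keep only whitelisted attributes, each safely quoted.
            kept = "".join(
                " %s=%s" % (key, quoteattr(value))
                for key, value in attrs.items()
                if key in PERMITTED_ATTRS
            )
            self.outbuf.write("<%s%s>" % (name, kept))

    def endElement(self, name):
        if name in PERMITTED_ELEMENTS:
            self.outbuf.write("</%s>" % name)

    def characters(self, content):
        # Text arrives decoded, so escape it again before writing it out.
        self.outbuf.write(escape(content))


def clean(xhtml):
    """Run well-formed XHTML through Cleaner and return the filtered markup."""
    handler = Cleaner()
    xml.sax.parseString(xhtml.encode("utf-8"), handler)
    return handler.outbuf.getvalue()


print(clean('<p class="1" onclick="x()">hi <script src="e.js"/>there</p>'))
```

As in the posted version, text inside a dropped element is kept (escaped) while only its tags are removed, so the body of an inline `<script>` would show up as visible text rather than disappear.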

regards,

--
alan kennedy
------------------------------------------------------
check http headers here: http://xhaus.com/headers
email alan: http://xhaus.com/contact/alan
Jul 18 '05 #5
