HTMLPurifier - Standard Compliant HTML Filtering

HTMLPurifier is a new PHP library that filters HTML so that not only is
XSS thwarted, but the resulting HTML is standards-compliant! It's
licensed under LGPL, and is currently undergoing beta testing (beta
meaning that validation routines for a few shorthand CSS properties and
deprecated HTML properties are missing, but everything else is there).

HTMLPurifier's main difference is that while older packages like
kses and HTML_Safe attempt to blacklist XSS, HTMLPurifier employs a
whitelist approach, breaking down an HTML document and rigorously
testing everything, whether it be a color declaration or an external
URI.
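
To make the whitelist idea concrete, here is a minimal sketch in Python. It is not HTMLPurifier's code (the library is PHP); the names ALLOWED, WhitelistFilter and purify, and the tiny set of elements and validators, are assumptions made up for illustration. Anything not explicitly listed is dropped, and every kept attribute value must pass its own check.

```python
import re
from html import escape
from html.parser import HTMLParser
from urllib.parse import urlparse

COLOR = re.compile(r"#[0-9a-fA-F]{3}(?:[0-9a-fA-F]{3})?")

def is_safe_uri(value):
    # Accept only absolute http(s) URIs; javascript:, data:, etc. fail.
    return urlparse(value.strip()).scheme in ("http", "https")

def is_color(value):
    return COLOR.fullmatch(value.strip()) is not None

# Hypothetical whitelist: element -> {attribute -> validator}.
ALLOWED = {
    "p": {},
    "b": {},
    "em": {},
    "a": {"href": is_safe_uri},
    "font": {"color": is_color},
}

class WhitelistFilter(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        rules = ALLOWED.get(tag)
        if rules is None:
            return  # element not on the whitelist: drop the tag
        kept = [
            f' {name}="{escape(value, quote=True)}"'
            for name, value in attrs
            if value is not None and name in rules and rules[name](value)
        ]
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in ALLOWED:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text is always re-escaped, so it cannot introduce markup.
        self.out.append(escape(data, quote=False))

def purify(html):
    parser = WhitelistFilter()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

For example, purify('<a href="javascript:alert(1)" onclick="x()">hi</a>') keeps the link text but strips both attributes, because neither passes the whitelist. A blacklist would have to anticipate each of those tricks by name. Note that this sketch does not balance tags or fix nesting, which a real filter would also need to do.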

Try it out first: http://hp.jpsband.org/live/docs/examples/demo.php
Then grab a copy here: http://hp.jpsband.org/

Aug 18 '06 #1
Ehh... sorry about the parse error. The demo is running off code in the
trunk (which is undergoing active development right now).

Aug 19 '06 #2
Ambush Commander:
> HTMLPurifier is a new PHP library that filters HTML so that not only is
> XSS thwarted, but the resulting HTML is standards-compliant!
Do you mean standards compliant, valid or something else? If you mean
standards compliant - assuming that that includes HTML - you would have
to assign meanings to all the ambiguous clauses of the HTML4.01 spec
(strictly speaking, all of them). If you mean valid, you would have to
guess or somehow infer what any invalid markup was intended to mean
before you could sort it.

--
Jock

Aug 19 '06 #3
John Dunlop wrote:
> Do you mean standards compliant, valid or something else? If you mean
> standards compliant - assuming that that includes HTML - you would have
> to assign meanings to all the ambiguous clauses of the HTML4.01 spec
> (strictly speaking, all of them). If you mean valid, you would have to
> guess or somehow infer what any invalid markup was intended to mean
> before you could sort it.
In a way, both. I can't be completely standards compliant, because
technically that would mean I'd let XSS through. What I can do is,
while disallowing XSS, ensure that any output the filter gives won't
break an XHTML 1.0 Transitional page's validation at the W3C validator.
This is no easy task, especially since the spec doesn't get everything
right (for example, SGML exclusions). Currently, the only things
bothering the filter are control characters and code points that SGML
does not allow: anything else you throw at it will be turned into something
that will validate.
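
For the control-character problem, one possible approach is to strip every code point that the HTML 4.01 SGML declaration marks as unused before the text reaches the validator. The post does not say how the library will handle this, so the following Python sketch is only an assumption about one way to do it; the name strip_non_sgml is made up.

```python
import re

# Code points marked UNUSED in the HTML 4.01 SGML declaration:
# C0 controls other than tab, LF and CR; DEL plus the C1 range;
# and the surrogate block.
NON_SGML = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f\ud800-\udfff]")

def strip_non_sgml(text):
    """Remove characters that an HTML 4.01 document may not contain."""
    return NON_SGML.sub("", text)
```

Tabs, line feeds and carriage returns pass through untouched, so ordinary line breaks in user input survive.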

As for validity, people use deprecated elements and attributes like <font>
and <center> all the time. The filter converts these into their proper
representations (<span style=""> and <div style="text-align:center;">).
So it can be quite smart about that sort of thing (it also does
automatic <p> tag closings, etc). Kind of like Tidy, the only thing is
that Tidy doesn't guarantee validation. We do.
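
The conversion of deprecated elements described above can be sketched roughly as follows. This is an illustrative Python sketch, not the library's PHP implementation; FIXED_STYLE, DeprecatedRewriter and rewrite are hypothetical names, only the color attribute of <font> is translated, and attributes on other elements are dropped for brevity.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical mapping: deprecated element -> (replacement tag, style).
FIXED_STYLE = {"center": ("div", "text-align:center;")}

class DeprecatedRewriter(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == "font":
            # Translate the color attribute into an inline style.
            color = dict(attrs).get("color")
            style = f"color:{color};" if color else ""
            self.out.append(f'<span style="{escape(style, quote=True)}">')
        elif tag in FIXED_STYLE:
            new_tag, style = FIXED_STYLE[tag]
            self.out.append(f'<{new_tag} style="{style}">')
        else:
            self.out.append(f"<{tag}>")  # attributes dropped in this sketch

    def handle_endtag(self, tag):
        if tag == "font":
            self.out.append("</span>")
        elif tag in FIXED_STYLE:
            self.out.append(f"</{FIXED_STYLE[tag][0]}>")
        else:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))

def rewrite(html):
    parser = DeprecatedRewriter()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

A real filter would also validate the color value against a whitelist before emitting it; here it is only escaped so that it cannot break out of the attribute.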

Aug 19 '06 #4
Ambush Commander:
> In a way, both. I can't be completely standards compliant, because
> technically that would mean I'd let XSS through. What I can do is,
> while disallowing XSS, ensure that any output the filter gives won't
> break an XHTML 1.0 Transitional page's validation at the W3C validator.
> This is no easy task, especially since the spec doesn't get everything
> right (for example, SGML exclusions). Currently, the only things
> bothering the filter are control characters and code points that SGML
> does not allow: anything else you throw at it will be turned into something
> that will validate.
I don't mean to sound rude, but what is this 'something'? How do you
know when you come across an error what was originally meant? Do you
flag the error and ask the user what they meant?
> As for validity, people use deprecated elements and attributes like <font>
> and <center> all the time. The filter converts these into their proper
> representations (<span style=""> and <div style="text-align:center;">).
> So it can be quite smart about that sort of thing (it also does
> automatic <p> tag closings, etc). Kind of like Tidy, the only thing is
> that Tidy doesn't guarantee validation. We do.
I don't believe there is any program today that can check conformance
to the HTML spec. Machines have no understanding of the prose of the
spec. Your program, from what I gather, checks validity and a
selection of other criteria that you have chosen: a linter with built
in validator.

--
Jock

Aug 20 '06 #5

John Dunlop wrote:
> I don't mean to sound rude, but what is this 'something'? How do you
> know when you come across an error what was originally meant? Do you
> flag the error and ask the user what they meant?
No, we just mangle it and hope the user notices. :-P Error logging and
feedback is a feature I'd like to implement soon.
> I don't believe there is any program today that can check conformance
> to the HTML spec. Machines have no understanding of the prose of the
> spec.
But the /programmer/ can. I manually went through the HTML and CSS
specs and hand-picked the elements, attributes and properties that
would be acceptable from an untrusted user in a rich text environment.
And then hand-coded their definitions.
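
As a rough picture of what hand-coded property definitions might look like, here is a small Python sketch for inline CSS. The three properties, their patterns, and the names VALIDATORS and filter_style are assumptions chosen for illustration; they are not taken from the library.

```python
import re

# Hypothetical hand-written definitions: property -> allowed value pattern.
VALIDATORS = {
    "color": re.compile(r"#[0-9a-f]{3}(?:[0-9a-f]{3})?|[a-z]+", re.I),
    "text-align": re.compile(r"left|right|center|justify", re.I),
    "font-weight": re.compile(r"normal|bold|[1-9]00"),
}

def filter_style(style):
    """Keep only declarations whose property and value are both known."""
    kept = []
    # Naive split: good enough here because no allowed value contains ';'.
    for declaration in style.split(";"):
        name, sep, value = declaration.partition(":")
        name, value = name.strip().lower(), value.strip()
        rule = VALIDATORS.get(name)
        if sep and rule and rule.fullmatch(value):
            kept.append(f"{name}:{value};")
    return "".join(kept)
```

With these definitions, filter_style("color: red; behavior: url(x.htc); text-align:center") keeps the color and alignment and silently drops the unknown behavior property.
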
> Your program, from what I gather, checks validity and a
> selection of other criteria that you have chosen: a linter with built
> in validator.
Hmm... I don't see how that's much different from a filter. It won't
"fix" an excessive use and duplication of inline styles. It can't
figure out that a user is abusing a certain tag for a different
meaning. But it will make a document conform in the eyes of the W3C
validator, and it will block XSS attempts (by virtue of its whitelist
nature).

I feel like a salesperson trying to "sell" a product. Please feel free
to ask more questions.

Aug 20 '06 #6
