On Jan 28, 6:22 am, Zhang Weiwu <zhangwe...@realss.com>
wrote:
> Hello. I am looking for a command-line tool to take an
> HTML document (or an HTML document fragment, i.e. without
> the beginning "<html><head>..</head><body>") and process it
> by removing all CSS style settings and JavaScript, and
> output clean HTML/XHTML.
> Optionally, it would be nice if this tool could take a
> list of acceptable tags and remove all tags not in this list.
> I need such a tool to process a lot of static HTML
> documents I am working on. Do you happen to know such a
> tool? I am still googling around ;) I tried tidy but
> there seems not to be an option to remove CSS.
Unless your source HTML is so tag-soupy no sane HTML parser
can grok it, XSLT is great for this kind of stuff. Of
course, you'll also need an XSLT processor that can
transform HTML documents (libxslt can do that, and probably
many others).
pavel@debian:~/dev/xslt$ cat raw.html
<!DOCTYPE HTML PUBLIC
"-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<title>Test</title>
<style type="text/css">
body { font-family : monospace ; }
</style>
<script type="text/javascript">
function oink ( ) { alert ( 'oink!' ) ; }
</script>
</head>
<body>
<div style=" color : blue ;">
<span style=" font-style : italic ; "
onclick=" oink ( ) ; ">oink!</span>
</div>
</body>
</html>
pavel@debian:~/dev/xslt$ cat strip_jscss.xsl
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="html"/>
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="style|script|@style|@onclick"/>
</xsl:stylesheet>
pavel@debian:~/dev/xslt$ xsltproc --html strip_jscss.xsl raw.html
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Test</title>
</head>
<body><div>
<span>oink!</span>
</div></body>
</html>
Naturally, you'll want to tinker with xsl:output to get
valid HTML as output, and you'll need to fine-tune the
exclusion template to handle all the event handler
attributes etc. xsltproc is a command-line utility that
comes with libxslt, but as I said, I'd expect most XSLT
processors to be capable of transforming HTML as well.
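As a rough sketch of that fine-tuning, the stylesheet below extends strip_jscss.xsl in two ways: it drops every attribute whose name starts with "on" instead of listing onclick alone, and it unwraps any element that is not on a list of accepted tags, which covers the optional part of the question. The file name strip_all.xsl and the particular tag list are made up for illustration. It assumes the input is parsed with --html, so that elements are in no namespace and names arrive in lower case, and it has not been tried on real-world pages.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>

  <!-- Identity transform: copy whatever no other template claims. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Drop scripts, embedded and linked stylesheets, and inline styles
       together with their content. The explicit priority makes this
       rule win over the tag-list rule below, which would otherwise
       keep the text inside <script> and <style>. -->
  <xsl:template priority="2"
      match="style|script|link[@rel='stylesheet']|@style"/>

  <!-- Drop every attribute whose name starts with "on"
       (onclick, onload, onmouseover, ...). -->
  <xsl:template match="@*[starts-with(name(), 'on')]"/>

  <!-- Illustrative tag list (an assumption, adjust to taste): any
       element not named here is unwrapped, i.e. the tag is removed
       but its children are still processed. -->
  <xsl:template match="*[not(self::html or self::head or self::title
                             or self::body or self::div or self::p
                             or self::span or self::a)]">
    <xsl:apply-templates select="node()"/>
  </xsl:template>
</xsl:stylesheet>
```

It would be run the same way as before, e.g. xsltproc --html strip_all.xsl raw.html. Two caveats: javascript: URLs in href or src attributes are not touched, and class and id attributes are left alone, so add them to the exclusion template if they count as style settings for your purposes.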
--
Pavel Lepin