John Resler wrote:
> Hi all,
> First I want to say I am fully aware of the huge scope of the
> problem of parsing and correcting files of any sort. I have been using
> the jTidy libraries (Dave Raggett W3C, I believe) to attempt to clean up

Dave Raggett wrote the original tidy, but it's been some years since
he was in charge of it.

> the html I use and convert it to xhtml if possible. Not to complain
> about Tidy, it is the only application I'm aware of that does what it
> does... I am just curious if there are any other applications/libraries
> that perform the same function, more completely?

libxml2 parses HTML, including tag-soup HTML, and gives you SAX or DOM
APIs on it. You can then serialise that to better HTML or XHTML.
It's a different approach to tidy, and shares the same fundamental
problem of having to guess blindly when presented with heavy-duty
gibberish.
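
A minimal sketch of that parse-then-serialise approach, assuming the
third-party lxml bindings for libxml2 are installed (lxml and the helper
name html_to_xhtml are illustrative choices, not something named above).
It produces well-formed output but adds no DOCTYPE or XHTML namespace,
and how the parser nests the recovered elements is up to libxml2:

```python
from lxml import etree, html


def html_to_xhtml(markup):
    """Parse tag-soup HTML with libxml2 (via lxml) and return well-formed markup."""
    # libxml2's HTML parser recovers from unclosed and unquoted constructs.
    doc = html.document_fromstring(markup)
    # Serialising the recovered tree with the XML method closes every element
    # and quotes every attribute value.
    return etree.tostring(doc, method="xml", encoding="unicode")


soup = "<p align=center><font color=black>some text here<p>some more text"
print(html_to_xhtml(soup))
```

Note that this keeps the deprecated align attribute and font element; it
only repairs the structure.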
A higher-level application based on libxml2 is AccessValet. Its
real purpose is (X)HTML accessibility analysis and reporting, but it
will also clean up (X)HTML. It takes a more brutal approach than
tidy: instead of attempting to substitute for crap, it strips it.
So if you take the default - which is strict output - it'll remove
everything that's deprecated in HTML4/XHTML1, and
<p align=center><font color=black>some text here<p>some more text
becomes
<p>some text here</p><p>some more text</p>
I wouldn't recommend it over tidy for that particular purpose, but it's
an option:-)
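
To make the strip-rather-than-substitute idea concrete, here is a toy
sketch using only Python's standard html.parser module. It is not how
AccessValet is implemented; the class and function names are hypothetical,
and the two sets cover only an illustrative handful of deprecated
elements and attributes:

```python
from html import escape
from html.parser import HTMLParser

# Illustrative subsets only, not the full HTML4/XHTML1 deprecation list.
DEPRECATED_ELEMENTS = {"font", "center", "basefont", "u", "s", "strike"}
DEPRECATED_ATTRS = {"align", "bgcolor", "color", "face"}


class StrictStripper(HTMLParser):
    """Drop deprecated tags and attributes, and close unclosed <p> elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.p_open = False

    def _close_p(self):
        if self.p_open:
            self.out.append("</p>")
            self.p_open = False

    def handle_starttag(self, tag, attrs):
        if tag in DEPRECATED_ELEMENTS:
            return  # drop the tag itself but keep its text content
        if tag == "p":
            self._close_p()  # a new paragraph implicitly ends the previous one
            self.p_open = True
        kept = "".join(
            f' {name}="{escape(value or name)}"'
            for name, value in attrs
            if name not in DEPRECATED_ATTRS
        )
        self.out.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in DEPRECATED_ELEMENTS:
            return
        if tag == "p":
            self._close_p()
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))

    def close(self):
        super().close()  # flush any buffered text first
        self._close_p()  # then close a paragraph left open at end of input


def strip_deprecated(markup):
    parser = StrictStripper()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


print(strip_deprecated(
    "<p align=center><font color=black>some text here<p>some more text"
))
# prints: <p>some text here</p><p>some more text</p>
```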
You can also fix markup on the fly when serving it. The state of the
art there is mod_publisher, at
http://apache.webthing.com/mod_publisher/
which is far better than any of the tidy-in-a-webserver options.
--
Nick Kew