"Fixing html files"

Hi all,
First I want to say I am fully aware of the huge scope of the problem
of parsing and correcting files of any sort. I have been using the jTidy
libraries (Dave Raggett W3C, I believe) to attempt to clean up the html
I use and convert it to xhtml if possible. Not to complain about Tidy,
it is the only application I'm aware of that does what it does... I am
just curious if there are any other applications/libraries that perform
the same function, more completely?
Jul 20 '05 #1
John Resler <Jo*********@sbcglobal.net> writes:
> Hi all,
> First I want to say I am fully aware of the huge scope of the problem
> of parsing and correcting files of any sort. I have been using the jTidy
> libraries (Dave Raggett W3C, I believe) to attempt to clean up the html
> I use and convert it to xhtml if possible. Not to complain about Tidy,
> it is the only application I'm aware of that does what it does... I am
> just curious if there are any other applications/libraries that perform
> the same function, more completely?

Hard to quantify "more completely". tidy does a better job than most.
Alternative route might be for example John Cowan's tagsoup
http://mercury.ccil.org/~cowan/XML/tagsoup/
which will allow you to parse most html into an xml processing
pipeline. It doesn't do any cleaning up really, but once you have it as
xml you just hit it with enough xslt of your choice and it should all
come out looking lovely, er, in theory....
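To make the "parse it into XML first" idea concrete, here is a minimal sketch in Python. It is not TagSoup (which is a Java library) and it makes no attempt to match TagSoup's rules; it only uses the standard library's html.parser and xml.etree.ElementTree. The names SoupToTree and soup_to_xml are hypothetical, the wrapper <div> root is an assumption so that a fragment has a single root, and the list of void elements is deliberately partial.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never have content or an end tag (partial list, for illustration).
VOID = {"br", "hr", "img", "input", "meta", "link"}


class SoupToTree(HTMLParser):
    """Hypothetical helper: builds an ElementTree from sloppy HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("div")  # assumed wrapper so there is one root
        self.stack = [self.root]       # currently open elements

    def _is_open(self, tag):
        # The wrapper root at stack[0] is never treated as closable.
        return any(e.tag == tag for e in self.stack[1:])

    def _close_through(self, tag):
        # Pop open elements up to and including the nearest <tag>.
        while self.stack.pop().tag != tag:
            pass

    def handle_starttag(self, tag, attrs):
        # A new <p> implicitly closes a <p> that is still open.
        if tag == "p" and self._is_open("p"):
            self._close_through("p")
        # Valueless attributes (e.g. "checked") get their own name as value.
        attrib = {k: (v if v is not None else k) for k, v in attrs}
        elem = ET.SubElement(self.stack[-1], tag, attrib)
        if tag not in VOID:
            self.stack.append(elem)

    def handle_endtag(self, tag):
        # Stray end tags with no matching open element are ignored.
        if self._is_open(tag):
            self._close_through(tag)

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):
            # Text after a closed child belongs to that child's tail.
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def soup_to_xml(fragment):
    """Return a well-formed XML string for an HTML fragment."""
    parser = SoupToTree()
    parser.feed(fragment)
    parser.close()  # flush any buffered trailing text
    return ET.tostring(parser.root, encoding="unicode")


if __name__ == "__main__":
    print(soup_to_xml("<p align=center><b>bold<p>second<br>line</i>"))
```

The output is well-formed, so it can be handed to whatever XML tooling you prefer, XSLT included. Like any such parser, it is guessing at what the author meant, so treat the result as a starting point.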

If you are feeling really brave there's my htmlparse xslt2 stylesheet
but this is decidedly unsupported.
http://www.dcarlisle.demon.co.uk/htmlparse.xsl

David
Jul 20 '05 #2
John Resler wrote:
> Hi all,
> First I want to say I am fully aware of the huge scope of the
> problem of parsing and correcting files of any sort. I have been using
> the jTidy libraries (Dave Raggett W3C, I believe) to attempt to clean up

Dave Raggett wrote the original tidy, but it's been some years since
he was in charge of it.

> the html I use and convert it to xhtml if possible. Not to complain
> about Tidy, it is the only application I'm aware of that does what it
> does... I am just curious if there are any other applications/libraries
> that perform the same function, more completely?

libxml2 parses html, including tagsoup html, and gives you SAX or DOM
APIs on it. You can then serialise that to better HTML or XHTML.
It's a different approach to tidy, and shares the same fundamental
problem of having to guess blindly when presented with heavy-duty
gibberish.

A higher-level application based on libxml2 is AccessValet. Its
real purpose is (X)HTML accessibility analysis and reporting, but it
will also clean up (x)html. It takes a more brutal approach than
tidy: instead of attempting to substitute for crap, it strips it.
So if you take the default - which is strict output - it'll remove
everything that's deprecated in HTML4/XHTML1, and
    <p align=center><font color=black>some text here<p>some more text
becomes
    <p>some text here</p><p>some more text</p>

I wouldn't recommend it over tidy for that particular purpose, but it's
an option:-)
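
The "strip it rather than substitute for it" approach can be illustrated with a short Python sketch. This is not AccessValet's code and does not use libxml2; it relies only on the standard library's html.parser, and the names StrictCleaner and clean are hypothetical. The sets of deprecated elements and attributes are a small assumed subset chosen to cover the example above, not the full HTML4 list.

```python
from html import escape
from html.parser import HTMLParser

# Assumed, partial lists for illustration only.
DEPRECATED_ELEMENTS = {"font", "center"}
DEPRECATED_ATTRS = {"align", "bgcolor"}
VOID = {"br", "hr", "img", "input", "meta", "link"}


class StrictCleaner(HTMLParser):
    """Hypothetical helper: drops deprecated markup, keeps the text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []   # pieces of cleaned output
        self.open = []  # names of elements we have opened and not yet closed

    def _close_through(self, tag):
        # Emit end tags up to and including the nearest open <tag>.
        while self.open:
            top = self.open.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def close_all(self):
        # Close whatever the author left open.
        while self.open:
            self.out.append(f"</{self.open.pop()}>")

    def handle_starttag(self, tag, attrs):
        if tag in DEPRECATED_ELEMENTS:
            return  # strip the tag itself; its text content still comes through
        if tag == "p" and "p" in self.open:
            self._close_through("p")  # a new <p> ends the previous one
        kept = [(k, v) for k, v in attrs if k not in DEPRECATED_ATTRS]
        attr_text = "".join(
            f' {k}="{escape(v if v is not None else k)}"' for k, v in kept
        )
        if tag in VOID:
            self.out.append(f"<{tag}{attr_text} />")
        else:
            self.out.append(f"<{tag}{attr_text}>")
            self.open.append(tag)

    def handle_endtag(self, tag):
        # End tags for stripped or never-opened elements are ignored.
        if tag in self.open:
            self._close_through(tag)

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def clean(fragment):
    """Return the fragment with deprecated markup stripped and tags closed."""
    parser = StrictCleaner()
    parser.feed(fragment)
    parser.close()      # flush any buffered trailing text
    parser.close_all()
    return "".join(parser.out)


if __name__ == "__main__":
    print(clean("<p align=center><font color=black>some text here<p>some more text"))
```

Run on the example input above, the sketch drops the align attribute and the font element and closes both paragraphs. A real tool needs the complete deprecated lists and far more careful nesting rules than this.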

You can also fix markup on the fly when serving it. The state of the
art there is mod_publisher, at
http://apache.webthing.com/mod_publisher/
which is far better than any of the tidy-in-a-webserver options.

--
Nick Kew
Jul 20 '05 #3
