"Fixing html files"

John Resler

Hi all,
First I want to say I am fully aware of the huge scope of the problem
of parsing and correcting files of any sort. I have been using the jTidy
libraries (Dave Raggett W3C, I believe) to attempt to clean up the html
I use and convert it to xhtml if possible. Not to complain about Tidy,
it is the only application I'm aware of that does what it does... I am
just curious if there are any other applications/libraries that perform
the same function, more completely?

Jul 20 '05 #1


David Carlisle

John Resler <Jo*********@sbcglobal.net> writes:

> Hi all,
> First I want to say I am fully aware of the huge scope of the problem
> of parsing and correcting files of any sort. I have been using the jTidy
> libraries (Dave Raggett W3C, I believe) to attempt to clean up the html
> I use and convert it to xhtml if possible. Not to complain about Tidy,
> it is the only application I'm aware of that does what it does... I am
> just curious if there are any other applications/libraries that perform
> the same function, more completely?

Hard to quantify "more completely". Tidy does a better job than most.
An alternative route might be, for example, John Cowan's TagSoup
http://mercury.ccil.org/~cowan/XML/tagsoup/
which will allow you to parse most HTML into an XML processing
pipeline. It doesn't do any cleaning up really, but once you have it as
XML you just hit it with enough XSLT of your choice and it should all
come out looking lovely, er, in theory....
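
To make the "parse it first, clean it with XML tools afterwards" idea concrete, here is a minimal sketch in Python 3. It is not TagSoup (which is a Java SAX parser) and does not use its API; it stands in for the approach using only the standard library's `html.parser` and `xml.etree.ElementTree`. The names `SoupToTree` and `soup_to_tree`, the synthetic `root` wrapper element, and the rule that a new `<p>` closes an open one are all assumptions made for illustration.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never have content or an end tag in HTML 4.
VOID = {"area", "base", "br", "col", "hr", "img", "input", "link", "meta", "param"}


class SoupToTree(HTMLParser):
    """Build an ElementTree from loosely written HTML (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("root")  # synthetic wrapper, not part of the input
        self.stack = [self.root]        # currently open elements, innermost last

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._close("p")  # heuristic: a new <p> ends any open <p>
        # Attributes written without a value (e.g. "checked") become "".
        attrib = {k: (v if v is not None else "") for k, v in attrs}
        elem = ET.SubElement(self.stack[-1], tag, attrib)
        if tag not in VOID:
            self.stack.append(elem)

    def handle_endtag(self, tag):
        self._close(tag)

    def _close(self, tag):
        # Pop back to the nearest open element of this name; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                return

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):  # text following a child element belongs in its tail
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def soup_to_tree(html):
    parser = SoupToTree()
    parser.feed(html)
    parser.close()
    return parser.root


tree = soup_to_tree("<p align=center><font color=black>some text here<p>some more text")
print(ET.tostring(tree, encoding="unicode"))  # well-formed XML, markup otherwise untouched
for p in tree.iter("p"):
    print("".join(p.itertext()))              # ordinary XML tooling now works
```

As with TagSoup, nothing is cleaned up at this stage: the `align` attribute and the `font` element are still in the tree. The point is only that the result is well-formed, so XSLT or any other XML tool can take over from here.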

If you are feeling really brave there's my htmlparse XSLT 2 stylesheet,
but this is decidedly unsupported:
http://www.dcarlisle.demon.co.uk/htmlparse.xsl

David

Jul 20 '05 #2

Nick Kew

John Resler wrote:

> Hi all,
> First I want to say I am fully aware of the huge scope of the
> problem of parsing and correcting files of any sort. I have been using
> the jTidy libraries (Dave Raggett W3C, I believe) to attempt to clean up

Dave Raggett wrote the original tidy, but it's been some years since
he was in charge of it.

> the html I use and convert it to xhtml if possible. Not to complain
> about Tidy, it is the only application I'm aware of that does what it
> does... I am just curious if there are any other applications/libraries
> that perform the same function, more completely?

libxml2 parses html, including tagsoup html, and gives you SAX or DOM
APIs on it. You can then serialise that to better HTML or XHTML.
It's a different approach to tidy, and shares the same fundamental
problem of having to guess blindly when presented with heavy-duty
gibberish.

A higher-level application based on libxml2 is AccessValet. Its
real purpose is (X)HTML accessibility analysis and reporting, but it
will also clean up (x)html. It takes a more brutal approach than
tidy: instead of attempting to substitute for crap, it strips it.
So if you take the default - which is strict output - it'll remove
everything that's deprecated in HTML4/XHTML1, and
<p align=center><font color=black>some text here<p>some more text
becomes
<p>some text here</p><p>some more text</p>

I wouldn't recommend it over tidy for that particular purpose, but it's
an option:-)
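
For readers who want to see what "strip it rather than substitute for it" can look like, here is a rough Python 3 sketch using only the standard library's `html.parser`. It is not AccessValet's code and makes no claim about how that tool is implemented. The names `StrictCleaner` and `clean_strict` are hypothetical, the attribute list is deliberately tiny, and comments, doctypes and `script`/`style` content are not handled properly.

```python
from html import escape
from html.parser import HTMLParser

# Elements deprecated in HTML 4.01 whose tags are dropped (content is kept).
DEPRECATED_TAGS = {"basefont", "center", "font", "s", "strike", "u"}
# Presentational attributes dropped by this sketch. Simplification: a real
# tool would decide per element (align stays valid on table cells in Strict).
DEPRECATED_ATTRS = {"align", "bgcolor", "color"}
# Elements that never have content or an end tag in HTML 4.
VOID = {"area", "base", "br", "col", "hr", "img", "input", "link", "meta", "param"}


class StrictCleaner(HTMLParser):
    """Re-emit markup without deprecated tags/attributes (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # output fragments
        self.open_tags = []  # elements opened in the output, innermost last

    def handle_starttag(self, tag, attrs):
        if tag in DEPRECATED_TAGS:
            return  # drop the tag itself; its text still flows through
        if tag == "p":
            self._close("p")  # heuristic: a new <p> ends any open <p>
        attr_text = ""
        for name, value in attrs:
            if name not in DEPRECATED_ATTRS:
                attr_text += ' {}="{}"'.format(name, escape(value or ""))
        if tag in VOID:
            self.out.append("<{}{} />".format(tag, attr_text))
        else:
            self.out.append("<{}{}>".format(tag, attr_text))
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in DEPRECATED_TAGS:
            self._close(tag)

    def _close(self, tag):
        if tag not in self.open_tags:
            return  # stray end tag: ignore it
        while True:  # close everything nested inside, then the element itself
            name = self.open_tags.pop()
            self.out.append("</{}>".format(name))
            if name == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))

    def result(self):
        self.close()
        while self.open_tags:  # close whatever the author left open
            self.out.append("</{}>".format(self.open_tags.pop()))
        return "".join(self.out)


def clean_strict(html):
    cleaner = StrictCleaner()
    cleaner.feed(html)
    return cleaner.result()


print(clean_strict("<p align=center><font color=black>some text here<p>some more text"))
```

Run on the input from the example above, this sketch should print the same cleaned-up line shown there. It shares the problem already mentioned: with sufficiently broken input, where to close an element is a guess.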

You can also fix markup on the fly when serving it. The state of the
art there is mod_publisher, at
http://apache.webthing.com/mod_publisher/
which is far better than any of the tidy-in-a-webserver options.

--
Nick Kew

Jul 20 '05 #3
