Bytes | Software Development & Data Engineering Community

"Fixing html files"

Hi all,
First, I want to say I am fully aware of the huge scope of the problem
of parsing and correcting files of any sort. I have been using the jTidy
libraries (Dave Raggett W3C, I believe) to attempt to clean up the HTML
I use and convert it to XHTML if possible. Not to complain about Tidy;
it is the only application I'm aware of that does what it does... I am
just curious whether there are any other applications/libraries that
perform the same function more completely?
Jul 20 '05 #1
John Resler writes:
> Hi all,
> First I want to say I am fully aware of the huge scope of the problem
> of parsing and correcting files of any sort. I have been using the jTidy
> libraries (Dave Raggett W3C, I believe) to attempt to clean up the html
> I use and convert it to xhtml if possible. Not to complain about Tidy,
> it is the only application I'm aware of that does what it does... I am
> just curious if there are any other applications/libraries that perform
> the same function, more completely?

It's hard to quantify "more completely"; Tidy does a better job than most.
An alternative route might be, for example, John Cowan's TagSoup
http://mercury.ccil.org/~cowan/XML/tagsoup/
which will let you parse most HTML into an XML processing
pipeline. It doesn't really do any cleaning up, but once you have the
document as XML you just hit it with enough XSLT of your choice and it
should all come out looking lovely, er, in theory....
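
To make the "parse it into an XML pipeline" idea concrete, here is a minimal sketch in Python. It is not TagSoup (a Java library) and does not use its API; it stands in for it with the standard library's html.parser, and the names SoupToXml and soup_to_xml are made up for illustration. It closes elements left open and escapes text so that an XML parser accepts the result, and it assumes the input has a single root element and attribute names that are legal in XML.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET

# Elements that never take a closing tag in HTML (small illustrative subset).
VOID_ELEMENTS = {"br", "hr", "img", "input", "meta", "link"}


class SoupToXml(HTMLParser):
    """Turn loosely structured HTML into a well-formed XML string."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []   # output fragments, joined at the end
        self.stack = []   # names of the elements currently open

    def handle_starttag(self, tag, attrs):
        # A new <p> implicitly closes a <p> that is still open.
        if tag == "p" and "p" in self.stack:
            self._close_through("p")
        # Quote every attribute value; a bare attribute gets its own name.
        text = "".join(
            " %s=%s" % (name, quoteattr(value if value is not None else name))
            for name, value in attrs
        )
        if tag in VOID_ELEMENTS:
            self.parts.append("<%s%s/>" % (tag, text))
        else:
            self.parts.append("<%s%s>" % (tag, text))
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Stray end tags that match nothing open are ignored.
        if tag in self.stack:
            self._close_through(tag)

    def handle_data(self, data):
        self.parts.append(escape(data))

    def _close_through(self, tag):
        # Close every open element up to and including `tag`.
        while self.stack:
            name = self.stack.pop()
            self.parts.append("</%s>" % name)
            if name == tag:
                break

    def result(self):
        self.close()  # flush any text still buffered by the parser
        while self.stack:
            self.parts.append("</%s>" % self.stack.pop())
        return "".join(self.parts)


def soup_to_xml(html):
    parser = SoupToXml()
    parser.feed(html)
    return parser.result()


xml_text = soup_to_xml("<div><p align=center>one<br><p>two & three</div>")
root = ET.fromstring(xml_text)  # succeeds only if the text is well formed
print(xml_text)
```

Once the text parses as XML, any XML tool (XSLT included) can take over; ElementTree appears here only to show that the output is well formed. Comments and the doctype are silently dropped by this sketch.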

If you are feeling really brave, there's my htmlparse XSLT 2 stylesheet,
but this is decidedly unsupported.
http://www.dcarlisle.demon.co.uk/htmlparse.xsl

David
Jul 20 '05 #2
John Resler wrote:
> Hi all,
> First I want to say I am fully aware of the huge scope of the
> problem of parsing and correcting files of any sort. I have been using
> the jTidy libraries (Dave Raggett W3C, I believe) to attempt to clean up

Dave Raggett wrote the original tidy, but it's been some years since
he was in charge of it.

> the html I use and convert it to xhtml if possible. Not to complain
> about Tidy, it is the only application I'm aware of that does what it
> does... I am just curious if there are any other applications/libraries
> that perform the same function, more completely?


libxml2 parses HTML, including tag-soup HTML, and gives you SAX or DOM
APIs on it. You can then serialise that to better HTML or XHTML.
It's a different approach from tidy's, but it shares the same fundamental
problem of having to guess blindly when presented with heavy-duty
gibberish.

A higher-level application based on libxml2 is AccessValet. Its
real purpose is (X)HTML accessibility analysis and reporting, but it
will also clean up (X)HTML. It takes a more brutal approach than
tidy: instead of attempting to substitute for crap, it strips it.
So if you take the default - which is strict output - it'll remove
everything that's deprecated in HTML4/XHTML1, and
<p align=center><font color=black>some text here<p>some more text
becomes
<p>some text here</p><p>some more text</p>

I wouldn't recommend it over tidy for that particular purpose, but it's
an option:-)
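
The strip-rather-than-substitute behaviour in the example above can be sketched in a few lines of Python. This is not AccessValet or libxml2 code, and it says nothing about how either is implemented; StrictCleaner and clean_strict are hypothetical names, the parser is the standard library's html.parser, and the lists of deprecated elements and attributes are small illustrative subsets rather than the full HTML 4.01 Strict rules.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr

# Illustrative subsets only (an assumption of this sketch); the HTML 4.01
# Strict DTD drops more, and some attributes are deprecated only on
# certain elements.
DEPRECATED_ELEMENTS = {"font", "center", "basefont", "u", "s", "strike"}
DEPRECATED_ATTRIBUTES = {"align", "bgcolor", "color", "face"}
VOID_ELEMENTS = {"br", "hr", "img", "input", "meta", "link"}


class StrictCleaner(HTMLParser):
    """Strip deprecated markup and close elements that were left open."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []   # output fragments, joined at the end
        self.stack = []   # names of the elements currently open

    def handle_starttag(self, tag, attrs):
        if tag in DEPRECATED_ELEMENTS:
            return  # drop the tag itself, keep whatever it contains
        # A new <p> implicitly closes a <p> that is still open.
        if tag == "p" and "p" in self.stack:
            self._close_through("p")
        kept = "".join(
            " %s=%s" % (name, quoteattr(value if value is not None else name))
            for name, value in attrs
            if name not in DEPRECATED_ATTRIBUTES
        )
        if tag in VOID_ELEMENTS:
            self.parts.append("<%s%s />" % (tag, kept))
        else:
            self.parts.append("<%s%s>" % (tag, kept))
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # End tags of dropped or never-opened elements are ignored.
        if tag in self.stack:
            self._close_through(tag)

    def handle_data(self, data):
        self.parts.append(escape(data))

    def _close_through(self, tag):
        # Close every open element up to and including `tag`.
        while self.stack:
            name = self.stack.pop()
            self.parts.append("</%s>" % name)
            if name == tag:
                break

    def result(self):
        self.close()  # flush any text still buffered by the parser
        while self.stack:
            self.parts.append("</%s>" % self.stack.pop())
        return "".join(self.parts)


def clean_strict(html):
    cleaner = StrictCleaner()
    cleaner.feed(html)
    return cleaner.result()


print(clean_strict(
    "<p align=center><font color=black>some text here<p>some more text"
))
# prints: <p>some text here</p><p>some more text</p>
```

The sketch makes the trade-off visible: nothing is guessed or substituted, so the centring and colour are simply lost rather than translated into CSS.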

You can also fix markup on the fly when serving it. The state of the
art there is mod_publisher, at
http://apache.webthing.com/mod_publisher/
which is far better than any of the tidy-in-a-webserver options.

--
Nick Kew
Jul 20 '05 #3
