Paul,
It's entirely possible to have HTML so hideous that it's amazing the
browser can even render it. I've used HTMLTidy (the utility) and in my
experience it's far too idealistic (understandable given that it comes
out of people at the W3C) and it upchucks on all sorts of things. It's
a circular problem: you want to tidy up some crap code so you can
figure out WTF it's doing so you can take out the crap. But the crap
keeps it from being tidied up.
What would really be grand is something that riffs off the HTML parser
in IE or Firefox, which are much more, shall we say, "real world"
parsers. There is a Firefox plugin called "View Source Chart" that does
a pretty good job of figuring out random HTML and displaying it in a
useful format. I don't know if the source for it is available, but it
or something out of Firefox open source might be the ticket for
producing a useful API you could leverage. It could provide the implied
missing closures for certain tags, etc., giving you the virtual HTML
that's actually rendering rather than the broken HTML, should you wish.
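
To make the "implied missing closures" idea concrete, here is a rough
sketch of what such a forgiving converter might look like. It is written
in Python using only the standard library's html.parser, not C#, and it
is in no way the IE or Firefox parser. The names SoupToXml, soup_to_xml
and the tiny IMPLIED_CLOSE rule table are all made up for illustration;
a real browser applies far more rules than this.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr

# Elements that never take a closing tag in HTML.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}

# Hypothetical, deliberately small rule table: opening the key tag
# implicitly closes any of the listed tags that are currently innermost.
IMPLIED_CLOSE = {
    "li": {"li"},
    "p": {"p"},
    "option": {"option"},
    "tr": {"tr", "td", "th"},
    "td": {"td", "th"},
    "th": {"td", "th"},
}


class SoupToXml(HTMLParser):
    """Turns sloppy HTML into well-formed XML text (illustrative only)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []    # pieces of the XML output
        self.stack = []  # names of the elements currently open

    def _close_top(self):
        self.out.append("</%s>" % self.stack.pop())

    def handle_starttag(self, tag, attrs):
        # Supply the closing tags the author left out.
        closers = IMPLIED_CLOSE.get(tag, ())
        while self.stack and self.stack[-1] in closers:
            self._close_top()
        # Keep the first occurrence of each attribute; valueless
        # attributes such as "checked" get their own name as the value.
        seen = {}
        for name, value in attrs:
            seen.setdefault(name, name if value is None else value)
        attr_text = "".join(" %s=%s" % (n, quoteattr(v))
                            for n, v in seen.items())
        if tag in VOID:
            self.out.append("<%s%s/>" % (tag, attr_text))
        else:
            self.out.append("<%s%s>" % (tag, attr_text))
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray closing tag: ignore it
        # Close everything down to and including the matching element.
        while self.stack:
            top = self.stack[-1]
            self._close_top()
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data))


def soup_to_xml(html):
    parser = SoupToXml()
    parser.feed(html)
    parser.close()
    while parser.stack:  # close whatever is still open at end of input
        parser._close_top()
    return "".join(parser.out)


if __name__ == "__main__":
    broken = "<table><tr><td>Arsenal<td>2 &amp; 1<tr><td>Chelsea<br></table>"
    print(soup_to_xml(broken))
```

Fed the broken table row above, this should print
<table><tr><td>Arsenal</td><td>2 &amp; 1</td></tr><tr><td>Chelsea<br/></td></tr></table>,
which an XML parser will accept. Comments and doctypes are simply
dropped, and odd attribute names are passed through untouched, so treat
it as a starting point rather than something to point at a live site.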
My kingdom for some free time!
--Bob
pa*******@gmail.com wrote:
Hi,
I am having a problem with HTML Tidy in C#.
I parse a lot of HTML pages with sports data. For this I need them
converted to XML.
I have tried several HTML tidy libraries: zetaHtmlTidy from CodeProject,
EFTidyLib (from SourceForge), the TidyATL COM wrapper, and Chilkat HTML 2
XML.
They work on some pages but not on others, where I get a stack
overflow or an empty string.
The HTML pages on which they fail are the sports data pages on
www.bluesquare.com and www.caribsports.com.
Maybe someone who has converted a lot of HTML could suggest some other
components.
I am also open to non-free components. Any component, solution, or
suggestion is welcome.
Paul