Bytes | Developer Community
Parsing broken HTML via Mozilla

Hello all!

I'm trying to parse broken HTML with several Python tools.
Unfortunately none of them works 100% reliably. Problems are
e.g. nested comments, bare "&" in URLs, and "<" in text (e.g.
"if foo < bar").

All of these pages can be displayed properly in a browser
so why not reuse the parser in e.g. Mozilla? Is there any
way to get proper XML out of Mozilla? Calling mozilla on the
command line would be OK, but it would be better if I could
use Mozilla like a SAX parser. Is there any project that
provides this functionality?

Bye,
Walter Dörwald
Jul 18 '05 #1
6 replies, 1812 views
Walter Dörwald <wa****@livinglogic.de> writes:
[...]
way to get proper XML out of Mozilla? Calling mozilla on the
command line would be OK, but it would be better if I could
use Mozilla like a SAX parser. Is there any project that
provides this functionality?

[...]

PyXPCOM. Good luck compiling it.
John
Jul 18 '05 #2

"Walter Dörwald" <wa****@livinglogic.de> wrote in message
news:ma**************************************@python.org...
Hello all!

I'm trying to parse broken HTML with several Python tools.
Unfortunately none of them work 100% reliable. Problems are
e.g. nested comments, bare "&" in URLs and "<" in text (e.g.
"if foo < bar") etc.

All of these pages can be displayed properly in a browser
so why not reuse the parser in e.g. Mozilla? Is there any
way to get proper XML out of Mozilla? Calling mozilla on the
command line would be OK, but it would be better if I could
use Mozilla like a SAX parser. Is there any project that
provides this functionality?

Bye,
Walter Dörwald

Maybe you should preprocess your files with something like
http://www.zope.org/Members/chrisw/StripOGram
which can help you get rid of the stuff you don't want.

Tom
Jul 18 '05 #3
Walter Dörwald <wa****@livinglogic.de> wrote in message news:<ma**************************************@python.org>...
Hello all!
Hi!
I'm trying to parse broken HTML with several Python tools. Unfortunately none of them work 100% reliable.


What have you tried?

I've been using Tidy with pretty good results; there's a Python
wrapper called utidylib available at http://utidylib.berlios.de

Make sure to use the "force output" option and it'll do a reasonable
job of parsing fairly broken HTML and outputting plain HTML, XHTML,
or several other formats (with lots of tweaky knobs available to
tune the output if you want).
Jul 18 '05 #4
In article <ma**************************************@python.org>, Walter
Dörwald wrote:
I'm trying to parse broken HTML with several Python tools.
Unfortunately none of them work 100% reliable. Problems are e.g.
nested comments, bare "&" in URLs and "<" in text (e.g. "if foo <
bar") etc.


Not a Mozilla solution, but I hear good things about
http://www.crummy.com/software/BeautifulSoup/

--
Paul Wright | http://pobox.com/~pw201 | http://blog.noctua.org.uk/
Reply address is valid but discards mail with attachments: send plain text only
Jul 18 '05 #5
Paul Wright wrote:
In article <ma**************************************@python.org>, Walter
Dörwald wrote:
I'm trying to parse broken HTML with several Python tools.
Unfortunately none of them work 100% reliable. Problems are e.g.
nested comments, bare "&" in URLs and "<" in text (e.g. "if foo <
bar") etc.


Not a Mozilla solution, but I hear good things about
http://www.crummy.com/software/BeautifulSoup/


I already tried that, but it completely ignores encoding issues,
and it passes broken entity references (e.g. a bare & in URLs) along
literally. Furthermore its support for DTD-aware HTML parsing
is incomplete (e.g. <link> is not handled as an empty tag).
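What I'd want instead is something like the following pre-processing
step (a sketch of my own, not anything BeautifulSoup provides): escape
every "&" that doesn't begin a character reference, so a strict parser
accepts the result.

```python
import re

# Sketch (my own, not BeautifulSoup code): escape every "&" that does
# not begin a numeric or named character reference, so that strict
# parsers accept URLs like "x?a=1&b=2".
_bare_amp = re.compile(
    r'&(?!#[0-9]+;|#[xX][0-9a-fA-F]+;|[A-Za-z][A-Za-z0-9]*;)'
)

def fix_bare_amps(text):
    return _bare_amp.sub('&amp;', text)

print(fix_bare_amps('<a href="x?a=1&b=2">A &amp; B</a>'))
# existing references like "&amp;" and "&#160;" are left alone
```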

Bye,
Walter Dörwald

Jul 18 '05 #6
