
Parsing broken HTML via Mozilla

Hello all!

I'm trying to parse broken HTML with several Python tools.
Unfortunately none of them works 100% reliably. Problems include
nested comments, bare "&" in URLs, and "<" in text (e.g.
"if foo < bar").

All of these pages can be displayed properly in a browser,
so why not reuse the parser from e.g. Mozilla? Is there any
way to get proper XML out of Mozilla? Calling Mozilla on the
command line would be OK, but it would be better if I could
use Mozilla like a SAX parser. Is there any project that
provides this functionality?
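For example, feeding such a page to a strict XML parser fails
immediately (a minimal, made-up snippet):

from xml.etree import ElementTree

broken = '<p>if foo < bar</p><a href="index?a=1&b=2">link</a>'
try:
    ElementTree.fromstring("<root>" + broken + "</root>")
except ElementTree.ParseError as exc:
    # the bare "<" and "&" make the document not well-formed
    print("parse failed:", exc)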

Bye,
Walter Dörwald
Jul 18 '05 #1


Walter Dörwald <wa****@livinglogic.de> writes:
[...]
> way to get proper XML out of Mozilla? Calling Mozilla on the
> command line would be OK, but it would be better if I could
> use Mozilla like a SAX parser. Is there any project that
> provides this functionality?
[...]

PyXPCOM. Good luck compiling it.
John
Jul 18 '05 #2


"Walter Dörwald" <wa****@livinglogic.de> wrote in message
news:ma**************************************@pyth on.org...
Hello all!

I'm trying to parse broken HTML with several Python tools.
Unfortunately none of them work 100% reliable. Problems are
e.g. nested comments, bare "&" in URLs and "<" in text (e.g.
"if foo < bar") etc.

All of these pages can be displayed properly in a browser
so why not reuse the parser in e.g. Mozilla? Is there any
way to get proper XML out of Mozilla? Calling mozilla on the
command line would be OK, but it would be better if I could
use Mozilla like a SAX parser. Is there any project that
provides this functionality?

Bye,
Walter Dörwald

Maybe you should preprocess your files with something like
http://www.zope.org/Members/chrisw/StripOGram
which can help you get rid of the stuff you don't want.
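Or, if you just want the idea without another dependency, a rough
standard-library sketch of the same preprocessing step might look
like this (the regexes are only illustrative, not a complete cleaner):

import re

def preclean(html):
    # "&" that does not start an entity or character reference -> "&amp;"
    html = re.sub(r"&(?![a-zA-Z]+;|#\d+;)", "&amp;", html)
    # "<" that does not start a tag, end tag, comment or PI -> "&lt;"
    html = re.sub(r"<(?![a-zA-Z/!?])", "&lt;", html)
    return html

print(preclean('<p>if foo < bar &amp; x=1&y=2</p>'))
# -> <p>if foo &lt; bar &amp; x=1&amp;y=2</p>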

Tom
Jul 18 '05 #3

Walter Dörwald <wa****@livinglogic.de> wrote in message
news:<ma**************************************@python.org>...
> Hello all!

Hi!

> I'm trying to parse broken HTML with several Python tools.
> Unfortunately none of them works 100% reliably.

What have you tried?

I've been using Tidy with pretty good results; there's a Python
wrapper called utidylib available at http://utidylib.berlios.de

Make sure to use the "force output" option and it'll do a reasonable
job of parsing fairly broken HTML and outputting plain HTML, XHTML,
or several other formats (with lots of tweakable knobs available to
tune the output if you want to).
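Something along these lines should work (a sketch only; it assumes
utidylib's tidy.parseString interface with Tidy configuration keys
passed as keyword arguments, so check the option names against your
installed version):

import tidy

broken = '<p>if foo < bar<p>unclosed & messy'
doc = tidy.parseString(
    broken,
    output_xhtml=1,   # emit well-formed XHTML
    force_output=1,   # produce output even for badly broken input
    tidy_mark=0,      # skip the "generated by HTML Tidy" meta tag
)
print(str(doc))       # cleaned-up markup, ready for a real XML parser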
Jul 18 '05 #4

In article <ma**************************************@python.org>,
Walter Dörwald wrote:
> I'm trying to parse broken HTML with several Python tools.
> Unfortunately none of them works 100% reliably. Problems include
> nested comments, bare "&" in URLs, and "<" in text (e.g. "if foo <
> bar").


Not a Mozilla solution, but I hear good things about
http://www.crummy.com/software/BeautifulSoup/
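For what it's worth, a minimal sketch with the newer bs4 package
looks something like this (the module name and constructor differ
from older BeautifulSoup releases, and the sample markup is made up):

from bs4 import BeautifulSoup

broken = '<p>if foo < bar<p>unclosed <a href="index?a=1&b=2">link'
soup = BeautifulSoup(broken, "html.parser")
print(soup.prettify())          # repaired, properly nested tree
print(soup.find("a")["href"])   # the attribute survives the bare "&"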

--
Paul Wright | http://pobox.com/~pw201 | http://blog.noctua.org.uk/
Reply address is valid but discards mail with attachments: send plain text only
Jul 18 '05 #5

Paul Wright wrote:
> In article <ma**************************************@python.org>,
> Walter Dörwald wrote:
>> I'm trying to parse broken HTML with several Python tools.
>> Unfortunately none of them works 100% reliably. Problems include
>> nested comments, bare "&" in URLs, and "<" in text (e.g. "if foo <
>> bar").
>
> Not a Mozilla solution, but I hear good things about
> http://www.crummy.com/software/BeautifulSoup/


I already tried that, but it completely ignores encoding issues
and passes broken entity references (e.g. a bare & in URLs) through
literally. Furthermore, its support for DTD-aware HTML parsing
is not complete (e.g. <link> is not handled as an empty tag).

Bye,
Walter Dörwald

Jul 18 '05 #6

