How to read information from tables in HTML?

Hi all,

I'm having trouble processing some HTML files.

The files contain JavaScript and some information stored in tables.
They don't seem to be well-formed: when I parse them with minidom, it
reports "mismatched tag".
How can I get the information out of those files? Is there a library
that would help?

Many thanks ;-)
Aug 3 '07 #1
ZelluX wrote:
> I'm having trouble processing some HTML files.
>
> The files contain JavaScript and some information stored in tables.
> They don't seem to be well-formed: when I parse them with minidom, it
> reports "mismatched tag".

minidom is an XML parser. What you're trying to read is (something similar
to) HTML, which is much less strict than XML, so minidom rejects it.
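
To see why minidom complains, here is a minimal sketch using only the standard library. The SAMPLE markup and the helper name try_minidom are made up for illustration. The <br> is never closed, which browsers accept in HTML but an XML parser does not.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical sample: fine as HTML, but <br> is never closed,
# so it is not well-formed XML.
SAMPLE = "<html><body><table><tr><td>1<br></td></tr></table></body></html>"


def try_minidom(markup):
    """Return None if minidom parses the markup, else the error message."""
    try:
        minidom.parseString(markup)
    except ExpatError as exc:
        return str(exc)
    return None


error = try_minidom(SAMPLE)
print(error)  # the message starts with "mismatched tag"
```

The same helper returns None for well-formed input such as "<a><b/></a>", so the failure comes from the markup, not from minidom itself.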

> How can I get the information out of those files? Is there a library
> that would help?

BeautifulSoup or lxml.html (which supports the BeautifulSoup parser, btw).

Both can deal with broken HTML, but lxml.html has better support for cleaning
up HTML (e.g. removing JavaScript or embedded content) and for handling forms.

http://codespeak.net/lxml/

The lxml.html package is not currently in an official lxml release, but you
can install it from SVN sources:

http://codespeak.net/svn/lxml/branch/html/

A release is expected soon.

Stefan
Aug 3 '07 #2
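
For readers who want to try the idea before installing anything, below is a rough sketch of reading table cells from sloppy HTML. It is not the BeautifulSoup or lxml.html API recommended above. It swaps in Python's standard-library html.parser, which is also tolerant of unclosed tags. The class TableReader, the function read_table_rows and the SAMPLE markup are all hypothetical names invented for this example, and nested tables are simply flattened into one list of rows.

```python
from html.parser import HTMLParser


class TableReader(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None
        self._skip = 0     # > 0 while inside <script> or <style>

    def _close_cell(self):
        # Store the current cell (whitespace-normalised), if there is one.
        if self._cell is not None and self._row is not None:
            self._row.append(" ".join("".join(self._cell).split()))
        self._cell = None

    def _close_row(self):
        # Store the current row, if it holds any cells.
        self._close_cell()
        if self._row:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "tr":
            self._close_row()   # a new <tr> implicitly ends the previous row
            self._row = []
        elif tag in ("td", "th"):
            self._close_cell()  # a new cell implicitly ends the previous one
            if self._row is None:
                self._row = []
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = max(0, self._skip - 1)
        elif tag in ("td", "th"):
            self._close_cell()
        elif tag in ("tr", "table"):
            self._close_row()

    def handle_data(self, data):
        # Keep text only inside a cell and outside script/style content.
        if self._cell is not None and not self._skip:
            self._cell.append(data)


def read_table_rows(markup):
    """Return all table rows found in the markup as lists of cell text."""
    reader = TableReader()
    reader.feed(markup)
    reader.close()
    reader._close_row()  # flush a row left open at end of input
    return reader.rows


# Hypothetical broken HTML: unclosed cells and rows, inline script, no </html>.
SAMPLE = """
<html><body>
<table>
<tr><th>Name<th>Score
<tr><td>Alice<td>90</tr>
<tr><td>Bob<script>var x = 1;</script><td>85
</table>
</body>
"""

rows = read_table_rows(SAMPLE)
print(rows)  # [['Name', 'Score'], ['Alice', '90'], ['Bob', '85']]
```

This only covers the simple case. For real pages, the libraries named in the reply handle far more kinds of breakage and also offer the cleaning and form support mentioned there.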
