On Fri, 02 Mar 2007 15:32:58 -0800,
se******@spawar.navy.mil wrote:
I'm trying to extract some data from an XHTML Transitional web page.
xml.dom.minidom.parseString("text of web page") gives errors about it
not being well formed XML.
Do I just need to add something like <?xml ...?> or what?
Since many HTML Transitional pages are badly formed, you can't really
build a DOM from them.
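To see the failure for yourself, here is a minimal sketch. The helper name try_parse and the two sample strings are made up for illustration; the only real API used is xml.dom.minidom.parseString, which raises ExpatError when the markup is not well-formed XML.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def try_parse(markup):
    """Return the root tag name, or None if markup is not well-formed XML."""
    try:
        doc = minidom.parseString(markup)
    except ExpatError:
        # Typical for real-world HTML: unclosed tags, bare ampersands, etc.
        return None
    return doc.documentElement.tagName


# Hypothetical samples: the second one has an unclosed <br>.
good = "<html><body><p>ok</p></body></html>"
bad = "<html><body><p>unclosed<br></body></html>"

print(try_parse(good))
print(try_parse(bad))
```

Adding an <?xml ...?> declaration does not change this, since the declaration says nothing about whether the tags that follow are properly closed.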
I've written several grabbers that pull TV data from HTML pages and
convert it into XML.
Basically, there are three ways to get the info:
# Use find(): If you only need a few pieces of data, you may be able to
find some HTML code that always appears right before the data you need.
# Use regular expressions: These can very quickly pull all the data from a
table or similar into a nice list. The only problem is that regular
expressions have a somewhat steep learning curve.
# Use a SAX parser: This iterates through all the HTML items without
caring whether they validate or not. You define a method to be called
each time it finds a tag, a piece of text, etc.
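A rough sketch of the first two approaches, to make them concrete. The page fragment, the class names in it, and the helper names extract_after and extract_rows are all invented for this example; a real page will need its own marker string and pattern.

```python
import re

# Hypothetical page fragment; note the table is never closed properly,
# which does not matter for either approach.
PAGE = """
<table>
<tr><td class="time">18:00</td><td class="title">News</td></tr>
<tr><td class="time">18:30</td><td class="title">Weather</td></tr>
"""


def extract_after(page, marker, terminator="<"):
    """find() approach: return the text between marker and terminator."""
    start = page.find(marker)
    if start == -1:
        return None  # marker not on the page
    start += len(marker)
    end = page.find(terminator, start)
    if end == -1:
        return None  # nothing closes the value
    return page[start:end]


# Regex approach: one pattern per table row, non-greedy groups for the cells.
ROW = re.compile(r'<td class="time">(.*?)</td><td class="title">(.*?)</td>')


def extract_rows(page):
    """Return a list of (time, title) tuples, one per matching row."""
    return ROW.findall(page)


print(extract_after(PAGE, '<td class="title">'))
print(extract_rows(PAGE))
```

Both break as soon as the site changes its markup, so it pays to keep the marker and the pattern in one place where they are easy to update.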
What is the best way to do this?
In the beginning I mostly went the SAX way, but it really generates a lot
of code, which is not necessarily more readable than the regular
expressions.
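For comparison, the callback style could look roughly like the sketch below. It uses html.parser.HTMLParser from the Python standard library as a stand-in for a SAX-style parser, which is an assumption on my part and not necessarily the parser used in the grabbers. The class name TitleCollector, the helper collect_titles, and the class="title" attribute are made up for the example.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collect the text of every <td class="title"> cell."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == "td" and ("class", "title") in attrs:
            self.in_title = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_title = False

    def handle_data(self, data):
        # Text may arrive in several pieces, so append to the current cell.
        if self.in_title:
            self.titles[-1] += data


def collect_titles(page):
    parser = TitleCollector()
    parser.feed(page)
    parser.close()
    return parser.titles


# Deliberately sloppy markup: unclosed <br>, unquoted attribute, no </table>.
sloppy = '<table><tr><td class="title">News<br></td><td class=title>Weather</td></tr>'
print(collect_titles(sloppy))
```

That is a whole class with state tracking to get what a one-line regular expression returns, which is the trade-off described above. In exchange, it copes with markup that would trip up a stricter parser.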