I've fed some data to an HTML parser that I wrote myself. Here is the
beginning of the data I fed to it:
=====
<!doctype html public "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<html><head><meta http-equiv="content-type" content="text/html;
charset=ISO-8859-1"><link rel="stylesheet"
href="http://us.i1.yimg.com/us.yimg.com/lib/s/yschx_040927.css" type="text/css"
media="all">
<![if !IE]>
....
=====
However, when "<![if !IE]>" is encountered, handle_data() is called but
handle_decl() is not (I made handle_decl() print something to the screen,
and nothing appeared), and the following error is displayed:
.......
HTMLParser.HTMLParseError: unknown declaration: 'if !IE', at line 4, column 1
May I ask why this error is raised? Thanks in advance!
"Valkyrie" <va******@cuhk.edu.hk> wrote in message news:1100610863.75889@eng-ser4...
> <![if !IE]>
> HTMLParser.HTMLParseError: unknown declaration: 'if !IE', at line 4, column 1
> May I ask why such error is raised?
HTMLParser isn't very forgiving of bad HTML; if you feed it syntactically invalid HTML,
it tends to give you errors. That includes Microsoft-only extensions like <![if !IE]>.
Unless you know your sources are valid, it may be best to use one of
the forgiving parsers: Beautiful Soup, uTidylib, libxml, etc. (see many past discussions).
Uche's article, http://www.xml.com/pub/a/2004/09/08/pyxml.html, may be of interest.
Thank you. Does that mean there is no way to deal with it using only Python's
standard library?
Richard Brodie wrote:
> "Valkyrie" <va******@cuhk.edu.hk> wrote in message news:1100610863.75889@eng-ser4...
>> <![if !IE]>
>> HTMLParser.HTMLParseError: unknown declaration: 'if !IE', at line 4, column 1
>> May I ask why such error is raised?
> HTMLParser isn't very forgiving of bad HTML; you feed it syntactically invalid HTML,
> it tends to give you errors. That includes Microsoft only extensions like <![if !IE]>.
> Unless you know you have known valid sources it may be best to use one of
> the forgiving parsers: Beautiful Soup, UTidylib, libxml etc. (see many past discussions).
> Uche's article: http://www.xml.com/pub/a/2004/09/08/pyxml.html may be of interest.
On Tue, 16 Nov 2004 22:14:33 +0800, Valkyrie <va******@cuhk.edu.hk> wrote:
> Thank you. That means there is no way to deal with it using simple python
> built-in functions?
Well, you can always preprocess your HTML by replacing dubious
constructs before feeding it to the parser. It's ugly, but it works. You
might even do something smart and put the original constructs back after
processing.
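A rough sketch of that preprocessing idea, assuming the only troublesome constructs are Microsoft conditional markers such as <![if !IE]> and <![endif]>. It is written for Python 3, where the module is called html.parser rather than HTMLParser. The names hide_markers, restore_markers and TagCollector, and the "msmarker:" placeholder format, are made up for illustration; they are not part of any library.

```python
import re
from html.parser import HTMLParser  # Python 3 name; Python 2 used "HTMLParser"

# Microsoft conditional markers such as <![if !IE]>, <![else]>, <![endif]>.
MARKER = re.compile(r"<!\[((?:if|else|endif)\b[^\]]*)\]>", re.IGNORECASE)
# Hypothetical placeholder format: an ordinary HTML comment.
PLACEHOLDER = re.compile(r"<!--msmarker:(.*?)-->", re.DOTALL)


def hide_markers(html):
    """Swap each conditional marker for a harmless HTML comment."""
    return MARKER.sub(lambda m: "<!--msmarker:" + m.group(1) + "-->", html)


def restore_markers(html):
    """Turn the placeholder comments back into the original markers."""
    return PLACEHOLDER.sub(lambda m: "<![" + m.group(1) + "]>", html)


class TagCollector(HTMLParser):
    """Toy parser that records start tags and comments."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.comments = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_comment(self, data):
        self.comments.append(data)


# Made-up sample input, loosely modelled on the page in the question.
sample = (
    '<html><head><link rel="stylesheet" href="style.css">\n'
    "<![if !IE]>\n"
    "<p>not IE</p>\n"
    "<![endif]>\n"
    "</head></html>"
)

safe = hide_markers(sample)      # no "<![" markers left in the text
parser = TagCollector()
parser.feed(safe)                # the parser only sees ordinary comments
parser.close()
round_trip = restore_markers(safe)
```

The placeholders reach the parser through handle_comment(), so the code can still tell where the conditional sections were. A condition that itself contains "--" would break the placeholder comment, so this is only a starting point.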
Good luck,
Dirk.
-------------------------------------
Dirk-Jan C. Binnema (djcb)
mail: djcb [at] djcbsoftware [dot] nl
blog: www.djcbsoftware.nl/ChangeLog
im : dj**@jabber.org
-------------------------------------