HTMLParser problem

I've fed some data to an HTML parser that I built myself. Here is the
beginning of the data I fed it:
=====
<!doctype html public "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<html><head><meta http-equiv="content-type" content="text/html;
charset=ISO-8859-1"><link rel="stylesheet"
href="http://us.i1.yimg.com/us.yimg.com/lib/s/yschx_040927.css" type="text/css"
media="all">

<![if !IE]>
....
=====
However, when "<![if !IE]>" is encountered, handle_data() is called but
handle_decl() is not (I made handle_decl() print something to the screen, and
nothing appeared), and the following error is displayed:

.......
HTMLParser.HTMLParseError: unknown declaration: 'if !IE', at line 4, column 1

May I ask why this error is raised? Thanks in advance!
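
One way to check which callback actually receives each piece of markup is to record the calls instead of printing from a single handler. The sketch below is only an illustration: it assumes Python 3's html.parser rather than the 2004-era HTMLParser module from the post, and the CallbackLogger class and sample string are made up for the example. Current Python 3 releases are lenient and do not raise on this input; which callback ends up with the "<![if !IE]>" marker can differ between versions, so the code simply logs whichever one fires.

```python
from html.parser import HTMLParser


class CallbackLogger(HTMLParser):
    """Hypothetical helper: record which callback receives which text."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_decl(self, decl):
        # Called for <!doctype ...> declarations.
        self.events.append(("handle_decl", decl))

    def unknown_decl(self, data):
        # Called for marked sections the parser does not recognise.
        self.events.append(("unknown_decl", data))

    def handle_comment(self, data):
        self.events.append(("handle_comment", data))

    def handle_data(self, data):
        # Skip whitespace-only text between tags to keep the log short.
        if data.strip():
            self.events.append(("handle_data", data))


# Shortened version of the data from the post.
sample = (
    '<!doctype html public "-//W3C//DTD HTML 4.01//EN"\n'
    '"http://www.w3.org/TR/html4/strict.dtd">\n'
    "<html><head>\n"
    "<![if !IE]>\n"
    "</head></html>"
)

logger = CallbackLogger()
logger.feed(sample)
logger.close()

for name, text in logger.events:
    print(name, repr(text))
```

The first logged entry should be the doctype arriving through handle_decl(); the conditional marker shows up under whichever callback your Python version routes it to.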
Jul 18 '05 #1

"Valkyrie" <va******@cuhk.edu.hk> wrote in message news:1100610863.75889@eng-ser4...
> <![if !IE]>
>
> HTMLParser.HTMLParseError: unknown declaration: 'if !IE', at line 4, column 1
>
> May I ask why such error is raised?

HTMLParser isn't very forgiving of bad HTML: if you feed it syntactically invalid
HTML, it tends to give you errors. That includes Microsoft-only extensions like
<![if !IE]>. Unless you know your sources are valid, it may be best to use one of
the forgiving parsers: Beautiful Soup, uTidylib, libxml, etc. (see many past discussions).
Uche's article, http://www.xml.com/pub/a/2004/09/08/pyxml.html, may be of interest.
Jul 18 '05 #2
Thank you. That means there is no way to deal with it using simple python
built-in functions?
Richard Brodie wrote:
> "Valkyrie" <va******@cuhk.edu.hk> wrote in message news:1100610863.75889@eng-ser4...
>
>> <![if !IE]>
>>
>> HTMLParser.HTMLParseError: unknown declaration: 'if !IE', at line 4, column 1
>>
>> May I ask why such error is raised?
>
> HTMLParser isn't very forgiving of bad HTML; you feed it syntactically invalid HTML,
> it tends to give you errors. That includes Microsoft only extensions like <![if !IE]>.
> Unless you know you have known valid sources it may be best to use one of
> the forgiving parsers: Beautiful Soup, UTidylib, libxml etc. (see many past discussions).
> Uche's article: http://www.xml.com/pub/a/2004/09/08/pyxml.html may be of interest.

Jul 18 '05 #3
On Tue, 16 Nov 2004 22:14:33 +0800, Valkyrie <va******@cuhk.edu.hk> wrote:
> Thank you. That means there is no way to deal with it using simple python
> built-in functions?


Well, you can always preprocess your HTML by replacing dubious
constructs. It's ugly, but it works. You might even do something smart
and put the original constructs back after processing.
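
A rough sketch of that idea, assuming Python 3's html.parser and the re module: swap each "<![if ...]>" / "<![endif]>" marker for an ordinary comment before parsing, remember the originals, and restore them afterwards. The helper names (hide_conditionals, restore_conditionals, TagCollector), the "cond:N" placeholder format, and the sample page are all invented for this illustration, and the regular expression only covers the simple marker forms seen in this thread.

```python
import re
from html.parser import HTMLParser

# Matches markers such as <![if !IE]>, <![else]> and <![endif]>.
COND_MARKER = re.compile(r"<!\[\s*(?:if\b[^\]>]*|else|endif)\s*\]>", re.IGNORECASE)
PLACEHOLDER = re.compile(r"<!--cond:(\d+)-->")


def hide_conditionals(html):
    """Replace each marker with a numbered comment; return text and originals."""
    saved = []

    def stash(match):
        saved.append(match.group(0))
        return "<!--cond:%d-->" % (len(saved) - 1)

    return COND_MARKER.sub(stash, html), saved


def restore_conditionals(html, saved):
    """Put the original markers back in place of the placeholder comments."""
    return PLACEHOLDER.sub(lambda m: saved[int(m.group(1))], html)


class TagCollector(HTMLParser):
    """Stand-in for your own parser: just collects start-tag names."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


page = (
    '<html><head><meta charset="ISO-8859-1">\n'
    "<![if !IE]>\n"
    "<p>hi</p>\n"
    "<![endif]>\n"
    "</head></html>"
)

cleaned, saved = hide_conditionals(page)
parser = TagCollector()
parser.feed(cleaned)   # the parser only ever sees plain comments
parser.close()
restored = restore_conditionals(cleaned, saved)
```

Using comments as placeholders means the parser skips over them through handle_comment(), and the numbering lets you restore each marker to its original position.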

Good luck,
Dirk.

-------------------------------------
Dirk-Jan C. Binnema (djcb)
mail: djcb [at] djcbsoftware [dot] nl
blog: www.djcbsoftware.nl/ChangeLog
im : dj**@jabber.org
-------------------------------------
Jul 18 '05 #4
