html parser , unexpected '<' char in declaration

Sakcee

html =
'<html><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
<head></head> <body bgcolor=#ffffff>\r\n Foo foo , blah blah
</body></html>'

import htmllib
import formatter
parser=htmllib.HTMLParser(formatter.NullFormatter( ))
parser.feed(html)

Traceback (most recent call last):
File "<stdin>", line 1, in ?
File "/usr/lib/python2.4/sgmllib.py", line 95, in feed
self.goahead(0)
File "/usr/lib/python2.4/sgmllib.py", line 165, in goahead
k = self.parse_declaration(i)
File "/usr/lib/python2.4/markupbase.py", line 132, in parse_declaration
self.error(
File "/usr/lib/python2.4/htmllib.py", line 40, in error
raise HTMLParseError(message)
htmllib.HTMLParseError: unexpected '<' char in declaration
the error is generated by unclosed DOCTYPE declaration

what is the best way to handle this kind of document. should I use
regex to check and strip, or does HTMLParser offers something? , can i
override default sgmllib behaviour
I have to work with this htmllib because of existing modules .
thanks

Feb 20 '06 #1

Subscribe Post Reply

3066

Jesus Rivero - (Neurogeek)

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Sakcee wrote:

html =
'<html><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
<head></head> <body bgcolor=#ffffff>\r\n Foo foo , blah blah
</body></html>'

html =
"""
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN">
<html>
<head>
</head>
<body bgcolor="#ffffff">
Foo foo , blah blah
</body>
</html>
"""

Try checking your html code. It looks really messy. ' char is not for
multiple line strings. You can try the code above.

As a suggestion, you should really focus on learning html basics ;)

Regards

Jesus (Neurogeek)

import htmllib
import formatter
parser=htmllib.HTMLParser(formatter.NullFormat ter())
parser.feed(html)

Traceback (most recent call last):
File "<stdin>", line 1, in ?
File "/usr/lib/python2.4/sgmllib.py", line 95, in feed
self.goahead(0)
File "/usr/lib/python2.4/sgmllib.py", line 165, in goahead
k = self.parse_declaration(i)
File "/usr/lib/python2.4/markupbase.py", line 132, in parse_declaration
self.error(
File "/usr/lib/python2.4/htmllib.py", line 40, in error
raise HTMLParseError(message)
htmllib.HTMLParseError: unexpected '<' char in declaration
the error is generated by unclosed DOCTYPE declaration

what is the best way to handle this kind of document. should I use
regex to check and strip, or does HTMLParser offers something? , can i
override default sgmllib behaviour
I have to work with this htmllib because of existing modules .
thanks

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFD+mZzdIssYB9vBoMRAoWXAJ9KuAnLLXhZVv4t6fDBpu 3RW6oxFgCeM/1S
iNScofTDdJxLfOkaAR9Ejws=
=+LTo
-----END PGP SIGNATURE-----

Feb 21 '06 #2

Sakcee

thanks for the reply

well probabbly I should explain more. this is part of an email . after
the mta delivers the email, it is stored in a local dir.
After that the email is being parsed by the parser inside an web based
imap client at display time.

I dont think I have the choice of rewriting the message!? and I dont
want to reject the message alltogether.

I can either 1-fix the incoming html by tidying it up
or 2- strip only plain text out and dispaly that you have spam, 3 - or
ignore that mal-formatted tag and display the rest

Feb 21 '06 #3

Jesus Rivero - (Neurogeek)

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

hmmm, that's kind of different issue then.

I can guess, from the error you pasted earlier, that the problem shown
is due to the fact Python is interpreting a "<" as an expression and not
as a char. review your code or try to figure out the exact input you're
receving within the mta.
Regards,

Jesus (Neurogeek)

Sakcee wrote:

thanks for the reply

well probabbly I should explain more. this is part of an email . after
the mta delivers the email, it is stored in a local dir.
After that the email is being parsed by the parser inside an web based
imap client at display time.

I dont think I have the choice of rewriting the message!? and I dont
want to reject the message alltogether.

I can either 1-fix the incoming html by tidying it up
or 2- strip only plain text out and dispaly that you have spam, 3 - or
ignore that mal-formatted tag and display the rest

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFD+n5gdIssYB9vBoMRAvIHAJ9H+IQWtaEMa9FBYFvDAQ XcIO2SRwCfX3yj
BEvNJ6yWht1b+dBc6ohkwYI=
=X1JL
-----END PGP SIGNATURE-----

Feb 21 '06 #4

Tim Roberts

"Jesus Rivero - (Neurogeek)" <jr*****@latinux.org> wrote:

hmmm, that's kind of different issue then.

I can guess, from the error you pasted earlier, that the problem shown
is due to the fact Python is interpreting a "<" as an expression and not
as a char. review your code or try to figure out the exact input you're
receving within the mta.

Well, Jesus, you are 0 for 2. Sakcee pointed out what the exact problem
was in his original message. The HTML he is being given is ill-formed; the
<!DOCTYPE directive is not closed. The SGML parser finds a <html> tag
which it thinks is inside the <!DOCTYPE, and that's illegal.

well probabbly I should explain more. this is part of an email . after
the mta delivers the email, it is stored in a local dir.
After that the email is being parsed by the parser inside an web based
imap client at display time.

I dont think I have the choice of rewriting the message!? and I dont
want to reject the message alltogether.

I can either 1-fix the incoming html by tidying it up
or 2- strip only plain text out and dispaly that you have spam, 3 - or
ignore that mal-formatted tag and display the rest

If this is happening with more than one message, you could check for it
rather easily with a regular expression, or even just ''.find, and then
either insert a closing '>' or delete everything up to the <html> before
parsing it.
--
- Tim Roberts, ti**@probo.com
Providenza & Boekelheide, Inc.

Feb 21 '06 #5

Sakcee

thanks for the suggestions,

this is not happening frequently, actually this is the first time I
have seen this exception in the system, which means that some spam
message was generated with ill-formated html.
i guess the best way would be to check using regular expression and
delete the unclosed tags.

Feb 21 '06 #6

Jesus Rivero (Neurogeek)

Oopss!

You are totally right guys, i did miss the closing '>' thinking about
maybe errors in the use of ' or ".

Jesus

Tim Roberts wrote:

"Jesus Rivero - (Neurogeek)" <jr*****@latinux.org> wrote:

hmmm, that's kind of different issue then.

I can guess, from the error you pasted earlier, that the problem shown
is due to the fact Python is interpreting a "<" as an expression and not
as a char. review your code or try to figure out the exact input you're
receving within the mta.

Well, Jesus, you are 0 for 2. Sakcee pointed out what the exact problem
was in his original message. The HTML he is being given is ill-formed; the
<!DOCTYPE directive is not closed. The SGML parser finds a <html> tag
which it thinks is inside the <!DOCTYPE, and that's illegal.

well probabbly I should explain more. this is part of an email . after
the mta delivers the email, it is stored in a local dir.
After that the email is being parsed by the parser inside an web based
imap client at display time.

I dont think I have the choice of rewriting the message!? and I dont
want to reject the message alltogether.

I can either 1-fix the incoming html by tidying it up
or 2- strip only plain text out and dispaly that you have spam, 3 - or
ignore that mal-formatted tag and display the rest

If this is happening with more than one message, you could check for it
rather easily with a regular expression, or even just ''.find, and then
either insert a closing '>' or delete everything up to the <html> before
parsing it.

Feb 21 '06 #7

Similar topics

Parsing HTML

by: Anders Eriksson | last post by:

Hello! I want to extract some info from a some specific HTML pages, Microsofts International Word list (e.g. http://msdn.microsoft.com/library/en-us/dnwue/html/swe_word_list.htm). I want to...

Python

minimal xml parser?

by: Mike | last post by:

Does anyone know of a minimal/mini/tiny/small xml parser in c? I'm looking for something small that accepts a stream or string, builds a c structure, and then returns an opaque pointer to that...

.NET Framework

CSS versus HTML tables

by: sinister | last post by:

I wanted to spiff up my overly spartan homepage, and started using some CSS templates I found on a couple of weblogs. It looks fine in my browser (IE 6.0), but it doesn't print right. I tested...

HTML / CSS

friend operator << overload ambiguity error

by: Gianni Mariani | last post by:

Can anyone enligten me why I get the "ambiguous overload" error from the code below: friendop.cpp: In function `int main()': friendop.cpp:36: ambiguous overload for `std::basic_ostream<char,...

C / C++

Operator << not valid on unsigned?

by: Sensei | last post by:

Hi. I have a problem with a C++ code I can't resolve, or better, I can't see what the problem should be! Here's an excerpt of the incriminated code: === bspalgo.cpp // THAT'S THE BAD...

C / C++

Storing HTML in XML

by: bissatch | last post by:

Hi, Is it possible for me to store HTML tags inside XML nodes? I need some way to share news headlines. Because the headlines differ in their presentsation, it would be very difficult to store...

.NET Framework

a struct parser

by: Weiguang Shi | last post by:

Hi, Is there a tool that, given a struct definition, generates a function that parses binary data of this struct and a command that can be used to construct binary data according to...

C / C++

How do I know if there's a certain HTML element in an HTML string (without indexOf)?

by: Gabriella | last post by:

Hi, I have a textarea, where the user can enter any given string. He can also insert HTML tags, if he/she wishes. Once I obtain the textarea's string as HTML through form.body.innerHTML, I...

Javascript

std::operator<<(basic_ostream&, type) vs. std::basic_ostream::operator(type)... ?

by: Martin T. | last post by:

Hello. I tried to overload the operator<< for implicit printing of wchar_t string on a char stream. Normally using it on a ostream will succeed as std::operator<<<std::char_traits<char> will...

C / C++

How to turn on java script in a villaon keypad mobile phone

by: Charles Arthur | last post by:

How do i turn on java script on a villaon, callus and itel keypad mobile phone

Java

Navigating the Data Structures and Algorithms (DSA)

by: BarryA | last post by:

What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...

Algorithms / Advanced Math

Looking to do Android software development, any suggestions? Is flutter better?

by: nemocccc | last post by:

hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?

General

Is that possible of reading the .csv file in column wise and the column have different lengths ?

by: Sonnysonu | last post by:

This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...

C / C++

Problem With Comparison Operator <=> in G++

by: Oralloy | last post by:

Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...

C / C++

Maximizing Business Potential: The Nexus of Website Design and Digital Marketing

by: jinu1996 | last post by:

In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...

Online Marketing

Discussion: How does Zigbee compare with other wireless protocols in smart home applications?

by: tracyyun | last post by:

Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each...

General

AI Job Threat for Devs

by: agi2029 | last post by:

Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing,...

Career Advice

Access Europe - Using VBA to create a class based on a table - Wed 1 May

by: isladogs | last post by:

The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new...

Microsoft Access / VBA