
HTML parser: unexpected '<' char in declaration

import htmllib
import formatter

# Sample of the malformed input: the <!DOCTYPE declaration is never closed.
html = ('<html><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"\n'
        '<head></head> <body bgcolor=#ffffff>\r\n Foo foo , blah blah\n'
        '</body></html>')
parser = htmllib.HTMLParser(formatter.NullFormatter())
parser.feed(html)


Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/lib/python2.4/sgmllib.py", line 95, in feed
    self.goahead(0)
  File "/usr/lib/python2.4/sgmllib.py", line 165, in goahead
    k = self.parse_declaration(i)
  File "/usr/lib/python2.4/markupbase.py", line 132, in parse_declaration
    self.error(
  File "/usr/lib/python2.4/htmllib.py", line 40, in error
    raise HTMLParseError(message)
htmllib.HTMLParseError: unexpected '<' char in declaration
The error is caused by the unclosed DOCTYPE declaration.

What is the best way to handle this kind of document? Should I use a
regex to check and strip the declaration, or does HTMLParser offer
something for this? Can I override the default sgmllib behaviour?
I have to work with htmllib because existing modules depend on it.
Thanks

Feb 20 '06 #1
6 Replies


Sakcee wrote:
> html =
> '<html><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
> <head></head> <body bgcolor=#ffffff>\r\n Foo foo , blah blah
> </body></html>'


html = """
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN">
<html>
<head>
</head>
<body bgcolor="#ffffff">
Foo foo , blah blah
</body>
</html>
"""

Try checking your HTML code; it looks really messy. A single ' quote
cannot delimit a multi-line string. You can try the code above.

As a suggestion, you should really focus on learning HTML basics ;)

Regards

Jesus (Neurogeek)


Feb 21 '06 #2

Thanks for the reply.

Well, I should probably explain more. This is part of an email. After
the MTA delivers the email, it is stored in a local directory.
After that, the email is parsed by the parser inside a web-based
IMAP client at display time.

I don't think I have the option of rewriting the message, and I don't
want to reject the message altogether.

I can either:
1 - fix the incoming HTML by tidying it up,
2 - strip out only the plain text and display a "you have spam" notice, or
3 - ignore the malformed tag and display the rest.
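
Options 2 and 3 could be combined into a fallback path. The sketch below is only an illustration, not the IMAP client's actual code: `render_with_fallback` and its `parse` argument are hypothetical names, with `parse` standing in for whatever wrapper calls the existing htmllib-based parser and raises on malformed input. The regex tag stripper is deliberately crude; it does not decode entities or skip script/style content.

```python
import re

# Anything from '<' up to the next '>' is treated as markup.
# An unclosed declaration is swallowed together with the tag that follows it.
_TAG = re.compile(r'<[^>]*>')


def render_with_fallback(html, parse):
    """Try the normal parser; on failure, return the plain text only.

    `parse` is a hypothetical callable wrapping the existing parser.
    In the Python 2 setup from this thread, the except clause could be
    narrowed to htmllib.HTMLParseError.
    """
    try:
        return parse(html)
    except Exception:
        text = _TAG.sub(' ', html)
        # Collapse runs of whitespace left behind by the removed tags.
        return ' '.join(text.split())
```

The caller can then decide whether to show the extracted text with a warning banner or to flag the message as likely spam.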

Feb 21 '06 #3


Hmmm, that's kind of a different issue then.

I can guess, from the error you pasted earlier, that the problem shown
is due to the fact that Python is interpreting a "<" as an expression and
not as a char. Review your code or try to figure out the exact input
you're receiving from the MTA.
Regards,

Jesus (Neurogeek)



Feb 21 '06 #4

"Jesus Rivero - (Neurogeek)" <jr*****@latinux.org> wrote:

> Hmmm, that's kind of a different issue then.
>
> I can guess, from the error you pasted earlier, that the problem shown
> is due to the fact that Python is interpreting a "<" as an expression and
> not as a char. Review your code or try to figure out the exact input
> you're receiving from the MTA.


Well, Jesus, you are 0 for 2. Sakcee pointed out what the exact problem
was in his original message. The HTML he is being given is ill-formed; the
<!DOCTYPE directive is not closed. The SGML parser hits the '<' of the
next tag (<head> in the sample) while it still thinks it is inside the
<!DOCTYPE, and that's illegal.

> Well, I should probably explain more. This is part of an email. After
> the MTA delivers the email, it is stored in a local directory.
> After that, the email is parsed by the parser inside a web-based
> IMAP client at display time.
>
> I don't think I have the option of rewriting the message, and I don't
> want to reject the message altogether.
>
> I can either 1 - fix the incoming HTML by tidying it up,
> 2 - strip out only the plain text and display a "you have spam" notice,
> or 3 - ignore the malformed tag and display the rest.


If this is happening with more than one message, you could check for it
rather easily with a regular expression, or even just ''.find, and then
either insert a closing '>' or delete everything up to the <html> before
parsing it.
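
A minimal sketch of the "insert a closing '>'" variant, run on the raw message before it reaches the parser. The function name `close_unterminated_doctype` is made up for illustration, and the pattern assumes the declaration contains no '<' or '>' of its own (so a DOCTYPE with an internal subset is out of scope). It uses only the standard `re` module.

```python
import re

# "<!DOCTYPE" followed by text containing neither '<' nor '>', where the
# next thing after optional whitespace is the '<' of a new tag. That means
# the declaration was never closed.
_UNCLOSED_DOCTYPE = re.compile(r'(<!DOCTYPE[^<>]*?)(\s*)(?=<)', re.IGNORECASE)


def close_unterminated_doctype(html):
    """Insert the missing '>' right after an unclosed DOCTYPE declaration."""
    # Keep the original whitespace (group 2) after the inserted '>'.
    return _UNCLOSED_DOCTYPE.sub(r'\1>\2', html)
```

A correctly closed DOCTYPE does not match the pattern, so well-formed messages pass through unchanged.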
--
- Tim Roberts, ti**@probo.com
Providenza & Boekelheide, Inc.
Feb 21 '06 #5

Thanks for the suggestions.

This is not happening frequently. Actually, this is the first time I
have seen this exception in the system, which means that some spam
message was generated with ill-formed HTML.
I guess the best way would be to check using a regular expression and
delete the unclosed tags.
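
One way that check-and-delete step might look, as a sketch only. The name `drop_unclosed_declarations` is hypothetical, and the pattern is an assumption about what "unclosed" means here: a `<!` declaration starting with a letter (so comments and CDATA sections are left alone) that runs into the next '<' without a closing '>'.

```python
import re

# "<!" plus a letter (e.g. <!DOCTYPE), then text with no '<' or '>',
# immediately followed by the '<' of the next tag.
_UNCLOSED_DECL = re.compile(r'<![A-Za-z][^<>]*(?=<)')


def drop_unclosed_declarations(html):
    """Remove declarations that are never closed before the next tag."""
    return _UNCLOSED_DECL.sub('', html)
```

Deleting rather than repairing loses the DOCTYPE itself, which should not matter when the goal is only to display the message body.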

Feb 21 '06 #6

Oops!

You are totally right, guys. I missed the missing closing '>' because I
was thinking about possible errors in the use of ' or ".

Jesus



Feb 21 '06 #7
