html = '<html><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" <head></head> <body bgcolor=#ffffff>\r\n Foo foo , blah blah </body></html>'
import htmllib
import formatter
parser = htmllib.HTMLParser(formatter.NullFormatter())
parser.feed(html)
Traceback (most recent call last):
File "<stdin>", line 1, in ?
File "/usr/lib/python2.4/sgmllib.py", line 95, in feed
self.goahead(0)
File "/usr/lib/python2.4/sgmllib.py", line 165, in goahead
k = self.parse_declaration(i)
File "/usr/lib/python2.4/markupbase.py", line 132, in parse_declaration
self.error(
File "/usr/lib/python2.4/htmllib.py", line 40, in error
raise HTMLParseError(message)
htmllib.HTMLParseError: unexpected '<' char in declaration
The error is caused by the unclosed DOCTYPE declaration.
What is the best way to handle this kind of document? Should I use a
regex to check for it and strip it, or does HTMLParser offer something?
Can I override the default sgmllib behaviour?
I have to work with htmllib because of existing modules.
Thanks
Sakcee wrote: html = '<html><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" <head></head> <body bgcolor=#ffffff>\r\n Foo foo , blah blah </body></html>'
html = """
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN">
<html>
<head>
</head>
<body bgcolor="#ffffff">
Foo foo , blah blah
</body>
</html>
"""
Try checking your HTML code. It looks really messy. The ' character
cannot delimit multi-line strings. You can try the code above.
As a suggestion, you should really focus on learning HTML basics ;)
Regards
Jesus (Neurogeek)
Thanks for the reply.
Well, I should probably explain more. This is part of an email. After
the MTA delivers the email, it is stored in a local directory.
After that, the email is parsed by the parser inside a web-based
IMAP client at display time.
I don't think I have the choice of rewriting the message, and I don't
want to reject the message altogether.
I can either:
1 - fix the incoming HTML by tidying it up,
2 - strip out only the plain text and display "you have spam", or
3 - ignore the malformed tag and display the rest.
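Options 2 and 3 above might be combined roughly as follows: try the normal parser first, and fall back to crude tag stripping only if it raises. This is a sketch, not the client's actual code. The names `strip_tags`, `render_or_fallback`, and the `parse` callable are hypothetical stand-ins for whatever the IMAP client really calls, and the regex is a blunt fallback rather than an HTML parser.

```python
import re

# Anything from '<' up to the next '>' is treated as a tag. An unclosed
# declaration is swallowed together with the tag that follows it.
TAG = re.compile(r'<[^>]*>')

def strip_tags(html):
    """Crude plain-text extraction: drop tags, collapse whitespace."""
    return ' '.join(TAG.sub(' ', html).split())

def render_or_fallback(html, parse):
    """Return parse(html), or stripped plain text if the parser raises.

    `parse` is a hypothetical callable wrapping the existing htmllib code.
    """
    try:
        return parse(html)
    except Exception:
        # Broad on purpose: any parser failure degrades to plain text
        # instead of breaking the display of the whole message.
        return strip_tags(html)
```

With the sample message from the first post, the fallback path yields just the body text, so the user still sees something even when the markup cannot be parsed.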
Hmm, that's a different kind of issue then.
I would guess, from the error you pasted earlier, that the problem shown
is due to Python interpreting a "<" as an expression and not as a
character. Review your code, or try to figure out the exact input you're
receiving from the MTA.
Regards,
Jesus (Neurogeek)
Well, Jesus, you are 0 for 2. Sakcee pointed out what the exact problem
was in his original message. The HTML he is being given is ill-formed; the
<!DOCTYPE directive is not closed. The SGML parser finds a <html> tag
which it thinks is inside the <!DOCTYPE, and that's illegal.
If this is happening with more than one message, you could check for it
rather easily with a regular expression, or even just ''.find, and then
either insert a closing '>' or delete everything up to the <html> before
parsing it.
--
- Tim Roberts, ti**@probo.com
Providenza & Boekelheide, Inc.
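The "insert a closing '>'" variant of the suggestion above could look something like the following. It is a minimal sketch that assumes the declaration contains no '<' or '>' of its own (no internal subset) and that at most one DOCTYPE needs repair. `close_doctype` is a made-up name, and in the sample message the tag that follows the unclosed declaration is <head>.

```python
import re

# "<!DOCTYPE", then any run of characters other than '<' or '>', with
# the '<' of the next tag immediately after (lookahead, not consumed).
# A properly closed declaration hits '>' first and does not match.
UNCLOSED_DOCTYPE = re.compile(r'<!DOCTYPE[^<>]*(?=<)', re.IGNORECASE)

def close_doctype(html):
    """Insert the missing '>' after an unclosed DOCTYPE declaration."""
    return UNCLOSED_DOCTYPE.sub(
        lambda m: m.group(0).rstrip() + '>',  # trim trailing space, close it
        html,
        count=1,
    )
```

The repaired string would then be handed to parser.feed() in place of the original. Well-formed input passes through unchanged, so the function can be applied to every message rather than only to known-bad ones.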
Thanks for the suggestions.
This is not happening frequently. Actually, this is the first time I
have seen this exception in the system, which means that some spam
message was generated with ill-formed HTML.
I guess the best way would be to check using a regular expression and
delete the unclosed tags.
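Deleting the unclosed declaration, as described above, might be sketched like this. The function name `drop_unclosed_doctype` is hypothetical, and the check is deliberately simple: the declaration counts as unclosed when another '<' appears before any '>'.

```python
import re

DOCTYPE_START = re.compile(r'<!DOCTYPE', re.IGNORECASE)

def drop_unclosed_doctype(html):
    """Remove a DOCTYPE declaration that is missing its closing '>'."""
    m = DOCTYPE_START.search(html)
    if m is None:
        return html  # no DOCTYPE at all, nothing to do
    next_lt = html.find('<', m.end())
    next_gt = html.find('>', m.end())
    # Unclosed if another tag opens before any '>' is seen.
    if next_lt != -1 and (next_gt == -1 or next_lt < next_gt):
        # Keep what precedes the declaration, resume at the next tag.
        return html[:m.start()] + html[next_lt:]
    return html
```

Since a DOCTYPE does not affect how the message is displayed here, dropping it loses nothing the viewer needs, and correctly closed declarations are left alone.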
Oops!
You are totally right, guys. I did miss the closing '>' because I was
thinking about possible errors in the use of ' or ".
Jesus