
Once again a unicode question

Hello,

I'm puzzled by this test I made while trying to convert an HTML page
to plain text. I cannot pass either a unicode object or a str to
feed(), so how can I do this?

..nicoe@smarties:~$ python2.4
..Python 2.4.1c2 (#2, Mar 19 2005, 01:04:19)
..[GCC 3.3.5 (Debian 1:3.3.5-12)] on linux2
..Type "help", "copyright", "credits" or "license" for more information.
..>>> import formatter
..>>> import htmllib
..>>> html2txt = htmllib.HTMLParser(formatter.AbstractFormatter(formatter.DumbWriter()))
..>>> html2txt.feed(u'D\xe9but')
..Traceback (most recent call last):
.. File "<stdin>", line 1, in ?
.. File "/usr/lib/python2.4/sgmllib.py", line 95, in feed
.. self.goahead(0)
.. File "/usr/lib/python2.4/sgmllib.py", line 120, in goahead
.. self.handle_data(rawdata[i:j])
.. File "/usr/lib/python2.4/htmllib.py", line 65, in handle_data
.. self.formatter.add_flowing_data(data)
.. File "/usr/lib/python2.4/formatter.py", line 197, in add_flowing_data
.. self.writer.send_flowing_data(data)
.. File "/usr/lib/python2.4/formatter.py", line 421, in send_flowing_data
.. write(word)
..UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 1: ordinal not in range(128)
..>>> html2txt.feed(u'D\xe9but'.encode('latin1'))
..Traceback (most recent call last):
.. File "<stdin>", line 1, in ?
.. File "/usr/lib/python2.4/sgmllib.py", line 94, in feed
.. self.rawdata = self.rawdata + data
..UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 1: ordinal not in range(128)
..>>> html2txt.feed('Début')
..Traceback (most recent call last):
.. File "<stdin>", line 1, in ?
.. File "/usr/lib/python2.4/sgmllib.py", line 94, in feed
.. self.rawdata = self.rawdata + data
..UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
..>>>

--
(°> Nicolas Évrard
/ ) Liège - Belgique
^^
Jul 18 '05 #1
2 Replies


Nicolas Evrard wrote:
> Hello,
>
> I'm puzzled by this test I made while trying to transform a page in
> html to plain text. Because I cannot send unicode to feed, nor str so
> how can I do this ?


It seems the parser is left in a broken state after the first exception.
Feed only byte strings (str) to it.
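
A likely reading of the tracebacks above: the failed feed(u'D\xe9but') left a unicode object in the parser's rawdata buffer, so every later byte string containing non-ASCII bytes triggered an implicit ASCII decode at "self.rawdata + data". The general rule is to pick one string type, convert once at the boundary, and start from a fresh parser after a failure. The sketch below shows that rule on current Python 3, where htmllib no longer exists, so it uses html.parser instead. TextExtractor, html_to_text and the latin-1 default are hypothetical names and choices made for illustration, not part of the original session.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects the text content of an HTML document."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &eacute; into characters
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text found between tags
        self.chunks.append(data)

    def text(self):
        return "".join(self.chunks)


def html_to_text(raw, encoding="latin-1"):
    """Decode bytes once at the boundary, then feed a fresh parser.

    The encoding default is an assumption; use the page's real charset.
    """
    if isinstance(raw, bytes):
        raw = raw.decode(encoding)
    parser = TextExtractor()  # new instance: no buffer left over from a failed feed
    parser.feed(raw)
    parser.close()  # flush any text still held in the parser's buffer
    return parser.text()


# Bytes and text inputs should yield the same result once decoded consistently.
same = html_to_text(b"<p>D\xe9but</p>") == html_to_text("<p>D\u00e9but</p>")
print(same)
```

On Python 2.4 the same idea applies with the types swapped: encode unicode to a byte string before calling feed(), and create a new htmllib.HTMLParser after any exception.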

Serge.
Jul 18 '05 #2

* Serge Orlov [23:45 26/03/05 CET]:
> Nicolas Evrard wrote:
> > Hello,
> >
> > I'm puzzled by this test I made while trying to transform a page in
> > html to plain text. Because I cannot send unicode to feed, nor str so
> > how can I do this ?
>
> Seems like the parser is in the broken state after the first exception.
> Feed only binary strings to it.


That was it, thank you very much.

--
(°> Nicolas Évrard
/ ) Liège - Belgique
^^
Jul 18 '05 #3
