Bytes IT Community

Ascii Encoding Error with UTF-8 encoder

Can anyone explain why I'm getting an ascii encoding error when I'm trying
to write out using a UTF-8 encoder?

Thanks

Python 2.4.3 (#69, Mar 29 2006, 17:35:34) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> filterMap = {}
>>> for i in range(0,255):
...     filterMap[chr(i)] = chr(i)
...
>>> filterMap[chr(9)] = chr(136)
>>> filterMap[chr(10)] = chr(133)
>>> filterMap[chr(136)] = chr(9)
>>> filterMap[chr(133)] = chr(10)
>>> line = '''this has
... tabs and line
... breaks'''
>>> filteredLine = ''.join([ filterMap[a] for a in line])
>>> import codecs
>>> f = codecs.open('foo.txt', 'wU', 'utf-8')
>>> print filteredLine
thisÍhasÍŗtabsÍandÍlineŗbreaks
>>> f.write(filteredLine)

Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "C:\Python24\lib\codecs.py", line 501, in write
    return self.writer.write(data)
  File "C:\Python24\lib\codecs.py", line 178, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x88 in position 4: ordinal not in range(128)
Jun 27 '06 #1
5 Replies


Mike Currie wrote:
> Can anyone explain why I'm getting an ascii encoding error when I'm trying
> to write out using a UTF-8 encoder?


Please read the Python Unicode HOWTO.

http://www.amk.ca/python/howto/unicode

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco

Jun 27 '06 #2

On 28/06/2006 7:46 AM, Mike Currie wrote:
> Can anyone explain why I'm getting an ascii encoding error when I'm trying
> to write out using a UTF-8 encoder?
>
> >>> f = codecs.open('foo.txt', 'wU', 'utf-8')
> >>> print filteredLine
> thisÍhasÍŗtabsÍandÍlineŗbreaks
> >>> f.write(filteredLine)
>
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
>   File "C:\Python24\lib\codecs.py", line 501, in write
>     return self.writer.write(data)
>   File "C:\Python24\lib\codecs.py", line 178, in write
>     data, consumed = self.encode(object, self.errors)
> UnicodeDecodeError: 'ascii' codec can't decode byte 0x88 in position 4: ordinal not in range(128)


Your fundamental problem is that you are trying to encode an 8-bit
string as UTF-8. Encoding starts from Unicode, so the codec first tries
to convert your string to Unicode using the default encoding (ascii),
which fails on the byte 0x88.

Get this into your head:
You encode Unicode as ascii, latin1, cp1252, utf8, gagolitic, whatever
into an 8-bit string.
You decode whatever from an 8-bit string into Unicode.
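
A minimal sketch of the same rule in Python 3, which this thread predates (everything here uses Python 2.4, so the Python 3 names are an assumption about what a current reader runs). There the two kinds of string are separate types, `str` for Unicode and `bytes` for 8-bit data, and the wrong direction simply does not exist as a method:

```python
# Python 3 sketch: str is Unicode text, bytes is the "8-bit string".
text = "caf\u00e9"               # Unicode text
data = text.encode("utf-8")      # encode: Unicode -> bytes
round_trip = data.decode("utf-8")  # decode: bytes -> Unicode

# The wrong directions are not available at all in Python 3.
bytes_can_encode = hasattr(b"", "encode")
str_can_decode = hasattr("", "decode")
```

In Python 2 both methods exist on both types, which is why the implicit ASCII decode described above can happen silently.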

Here is a run-down on your problem, using just the encode/decode methods
instead of codecs for illustration purposes:

(1) Equivalent to what you did.
|>> '\x88'.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0x88 in position 0: ordinal not in range(128)

(2) Same thing, explicitly trying to decode your 8-bit string as ASCII.
|>> '\x88'.decode('ascii').encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0x88 in position 0: ordinal not in range(128)

(3) Encoding Unicode as UTF-8 works, as expected.
|>> u'\x88'.encode('utf-8')
'\xc2\x88'

(4) But you need to know what your 8-bit data is supposed to be encoded
in, before you start.
|>> '\x88'.decode('cp1252').encode('utf-8')
'\xcb\x86'
|>> '\x88'.decode('latin1').encode('utf-8')
'\xc2\x88'
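
Applied to the original write, the fix is to decode the 8-bit data with a known encoding first and only then hand it to the UTF-8 writer. A hedged sketch in Python 3, assuming the data really is latin1 (that is a guess about the poster's data, and the sample bytes and temporary file path below are made up for illustration):

```python
import os
import tempfile

# Hypothetical 8-bit input: "this", HTS (0x88), "has", NEL (0x85).
raw = b"this\x88has\x85"

# Step 1: decode with the encoding the data is assumed to be in.
text = raw.decode("latin-1")

# Step 2: write the Unicode text; the file object does the UTF-8 encoding.
path = os.path.join(tempfile.mkdtemp(), "foo.txt")
with open(path, "w", encoding="utf-8", newline="") as f:
    f.write(text)

# Read the raw bytes back to see what actually landed on disk.
with open(path, "rb") as f:
    written = f.read()
```

Each C1 control comes out as a two-byte UTF-8 sequence, matching case (3) above. In Python 2.4 the equivalent would be passing `filteredLine.decode('latin1')` to `f.write`.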

I am rather puzzled as to what you are trying to achieve. You appear to
believe that you possess one or more 8-bit strings, encoded in latin1,
which contain the C0 controls \x09 (HT) and \x0a (LF) AND the
corresponding C1 controls \x88 (HTS) and \x85 (NEL). You want to change
LF to NEL, and NEL to LF and similarly with the other pair. Then you
want to write the result, encoded in UTF-8, to a file. The purpose
behind that baroque/byzantine capering would be .... what?
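
For reference, the swap described above can be sketched without building a full 256-entry map. This is a Python 3 illustration under the assumption that the data has already been decoded to Unicode; `swap_controls` is a hypothetical helper name, not something from the original post. (Incidentally, the original `range(0,255)` stops at 254, so `chr(255)` never gets an entry in `filterMap`.)

```python
# Translation table that exchanges HT <-> HTS and LF <-> NEL.
SWAP = str.maketrans({
    "\t": "\x88",    # HT  -> HTS
    "\x88": "\t",    # HTS -> HT
    "\n": "\x85",    # LF  -> NEL
    "\x85": "\n",    # NEL -> LF
})

def swap_controls(text):
    """Return text with the two control pairs exchanged."""
    return text.translate(SWAP)

sample = "this\thas\ntabs"
swapped = swap_controls(sample)
```

Because the mapping is symmetric, applying `swap_controls` twice gives back the original text.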

Jun 27 '06 #3

Thanks for the thorough explanation.

What I am doing is converting data for processing that will be tab (for
columns) and newline (for row) delimited. Some of the data contains tabs
and newlines so, I have to convert them to something else so the file
integrity is good.

Not my idea, I've been left with the implementation however.

Jun 27 '06 #4

On 28/06/2006 9:44 AM, Mike Currie wrote:
> What I am doing is converting data for processing that will be tab (for
> columns) and newline (for row) delimited. Some of the data contains tabs
> and newlines so, I have to convert them to something else so the file
> integrity is good.
>
> Not my idea, I've been left with the implementation however.


Do you *need* UTF-8? Or is that only there to hide away the \x88 and
\x85? Apart from tab and linefeed, what (if any) other characters are
there in the data that are not printable ASCII characters?

In any case, if you have 8-bit string data, the CSV file format would
appear to meet the requirement: it preserves your data by "quoting"
delimiters and newlines that appear in the actual data. The Python csv
module is included in every Python distribution since 2.3.
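
A small sketch of that approach with a tab delimiter, written for Python 3 (where the csv module works on Unicode text rather than 8-bit strings). The rows are invented sample data and an in-memory buffer stands in for the real file:

```python
import csv
import io

# Sample data: the second field contains both a tab and a newline.
rows = [["id", "note"], ["1", "has\ttab and\nnewline"]]

# Write tab-delimited output; the writer quotes fields that need it.
buffer = io.StringIO(newline="")
csv.writer(buffer, delimiter="\t").writerows(rows)
encoded = buffer.getvalue()

# Read it back; the quoted tab and newline are restored as data.
restored = list(csv.reader(io.StringIO(encoded, newline=""), delimiter="\t"))
```

The consumer has to parse the file as CSV too, of course; a tool that naively splits on tabs and newlines would still be confused by the quoted field.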

Cheers,
John
Jun 28 '06 #5

On 6/27/06, Mike Currie <de*@null.com> wrote:
> Thanks for the thorough explanation.
>
> What I am doing is converting data for processing that will be tab (for
> columns) and newline (for row) delimited. Some of the data contains tabs
> and newlines so, I have to convert them to something else so the file
> integrity is good.

Usually it is done by escaping: translate tab -> \t, newline -> \n,
backslash -> \\.
Python strings already have a method to do it in just one line:
>>> s=chr(9)+chr(10)+chr(92)
>>> print s.encode("string_escape")
\t\n\\

When you're ready to convert it back, you call decode("string_escape").
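
The "string_escape" codec only exists in Python 2. A rough Python 3 counterpart, sketched here as an assumption rather than a drop-in replacement, is the "unicode_escape" codec, which also escapes non-ASCII characters. `escape_field` and `unescape_field` are hypothetical helper names:

```python
def escape_field(text):
    """Escape tabs, newlines, backslashes and non-ASCII into plain ASCII."""
    return text.encode("unicode_escape").decode("ascii")

def unescape_field(escaped):
    """Reverse escape_field on text that it produced."""
    return escaped.encode("ascii").decode("unicode_escape")

original = "tab\there\nnew line and a backslash \\"
escaped = escape_field(original)
restored = unescape_field(escaped)
```

After escaping, the field contains no literal tab or newline, so it can sit safely in a tab- and newline-delimited file.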

> Not my idea, I've been left with the implementation however.


The idea is actually not bad as long as you know how to cope with unicode.
Jun 28 '06 #6
