Ascii Encoding Error with UTF-8 encoder

Mike Currie

Can anyone explain why I'm getting an ascii encoding error when I'm trying
to write out using a UTF-8 encoder?

Thanks

Python 2.4.3 (#69, Mar 29 2006, 17:35:34) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> filterMap = {}
>>> for i in range(0,255):
...     filterMap[chr(i)] = chr(i)
...
>>> filterMap[chr(9)] = chr(136)
>>> filterMap[chr(10)] = chr(133)
>>> filterMap[chr(136)] = chr(9)
>>> filterMap[chr(133)] = chr(10)
>>> line = '''this has
... tabs and line
... breaks'''
>>> filteredLine = ''.join([ filterMap[a] for a in line])
>>> import codecs
>>> f = codecs.open('foo.txt', 'wU', 'utf-8')
>>> print filteredLine
thisêhasêàtabsêandêlineàbreaks
>>> f.write(filteredLine)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "C:\Python24\lib\codecs.py", line 501, in write
    return self.writer.write(data)
  File "C:\Python24\lib\codecs.py", line 178, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x88 in position 4: ordinal not in range(128)

Jun 27 '06 #1

Robert Kern

Mike Currie wrote:

Can anyone explain why I'm getting an ascii encoding error when I'm trying
to write out using a UTF-8 encoder?

Please read the Python Unicode HOWTO.

http://www.amk.ca/python/howto/unicode

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco

Jun 27 '06 #2

John Machin

On 28/06/2006 7:46 AM, Mike Currie wrote:

Can anyone explain why I'm getting an ascii encoding error when I'm trying
to write out using a UTF-8 encoder?

>>> f = codecs.open('foo.txt', 'wU', 'utf-8')
>>> print filteredLine
thisêhasêàtabsêandêlineàbreaks
>>> f.write(filteredLine)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "C:\Python24\lib\codecs.py", line 501, in write
    return self.writer.write(data)
  File "C:\Python24\lib\codecs.py", line 178, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x88 in position 4: ordinal not in range(128)

Your fundamental problem is that you are trying to encode an 8-bit
string as UTF-8. Encoding needs Unicode input, so the codec first tries
to convert your string to Unicode using the default encoding (ascii),
which fails on the byte 0x88.

Get this into your head:
You encode Unicode as ascii, latin1, cp1252, utf8, gagolitic, whatever
into an 8-bit string.
You decode whatever from an 8-bit string into Unicode.

Here is a run-down on your problem, using just the encode/decode methods
instead of codecs for illustration purposes:

(1) Equivalent to what you did.
>>> '\x88'.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0x88 in position 0: ordinal not in range(128)

(2) Same thing, explicitly trying to decode your 8-bit string as ASCII.
>>> '\x88'.decode('ascii').encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0x88 in position 0: ordinal not in range(128)

(3) Encoding Unicode as UTF-8 works, as expected.
>>> u'\x88'.encode('utf-8')
'\xc2\x88'

(4) But you need to know what your 8-bit data is supposed to be encoded
in, before you start.
>>> '\x88'.decode('cp1252').encode('utf-8')
'\xcb\x86'
>>> '\x88'.decode('latin1').encode('utf-8')
'\xc2\x88'
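
Applied to the original script, points (3) and (4) suggest decoding the 8-bit data to Unicode first, doing the character swap on the Unicode text, and only then encoding as UTF-8. The following is a minimal sketch of that order of operations, not the poster's code. It is written for Python 3, where bytes is the 8-bit type and str is already Unicode, and it assumes the input is latin1. The names SWAP, filter_line and write_utf8 are hypothetical.

```python
# Python 3 sketch: decode first, transform, then encode as UTF-8.
# Assumption: the incoming 8-bit data is latin1, as in example (4).

# Swap HT <-> U+0088 and LF <-> U+0085, mirroring the filterMap entries.
SWAP = str.maketrans({
    "\t": "\x88",
    "\n": "\x85",
    "\x88": "\t",
    "\x85": "\n",
})


def filter_line(raw, source_encoding="latin-1"):
    """Decode 8-bit input to Unicode, then swap the control characters."""
    return raw.decode(source_encoding).translate(SWAP)


def write_utf8(path, text):
    """Write Unicode text to a file, encoding it as UTF-8 on the way out."""
    # newline="" turns off newline translation so the bytes match the text.
    with open(path, "w", encoding="utf-8", newline="") as handle:
        handle.write(text)


filtered = filter_line(b"this\thas\ttabs\nand line breaks")
encoded = filtered.encode("utf-8")
```

Because the table maps each pair in both directions, applying it a second time restores the original text.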

I am rather puzzled as to what you are trying to achieve. You appear to
believe that you possess one or more 8-bit strings, encoded in latin1,
which contain the C0 controls \x09 (HT) and \x0a (LF) AND the
corresponding C1 controls \x88 (HTS) and \x85 (NEL). You want to change
LF to NEL, and NEL to LF and similarly with the other pair. Then you
want to write the result, encoded in UTF-8, to a file. The purpose
behind that baroque/byzantine capering would be .... what?

Jun 27 '06 #3

Mike Currie

Thanks for the thorough explanation.

What I am doing is converting data for processing that will be tab (for
columns) and newline (for row) delimited. Some of the data contains tabs
and newlines, so I have to convert them to something else so the file
integrity is good.

Not my idea, I've been left with the implementation however.

Jun 27 '06 #4

John Machin

On 28/06/2006 9:44 AM, Mike Currie wrote:

What I am doing is converting data for processing that will be tab (for
columns) and newline (for row) delimited. Some of the data contains tabs
and newlines, so I have to convert them to something else so the file
integrity is good.

Not my idea, I've been left with the implementation however.

Do you *need* UTF-8? Or is that only there to hide away the \x88 and
\x85? Apart from tab and linefeed, what (if any) other characters are
there in the data that are not printable ASCII characters?

In any case, if you have 8-bit string data, the CSV file format would
appear to meet the requirement: it preserves your data by "quoting"
delimiters and newlines that appear in the actual data. The Python csv
module is included in every Python distribution since 2.3.
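
As a rough illustration of the quoting behaviour described above, here is a small sketch using the csv module with a tab delimiter. It is written for Python 3 and works on in-memory text rather than a file. The helper names write_rows and read_rows and the sample rows are made up for the example.

```python
import csv
import io


def write_rows(rows):
    """Serialize rows as tab-delimited text; csv quotes fields that need it."""
    buffer = io.StringIO(newline="")
    csv.writer(buffer, delimiter="\t").writerows(rows)
    return buffer.getvalue()


def read_rows(text):
    """Parse tab-delimited text back into a list of rows."""
    # newline="" lets the reader see embedded line breaks unchanged.
    return list(csv.reader(io.StringIO(text, newline=""), delimiter="\t"))


rows = [["id", "note"], ["1", "has\ttab and\nnewline"]]
text = write_rows(rows)
restored = read_rows(text)
```

The field containing a tab and a newline is wrapped in double quotes on output, so restored compares equal to rows without any character swapping.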

Cheers,
John

Jun 28 '06 #5

Serge Orlov

On 6/27/06, Mike Currie <de*@null.com> wrote:

Thanks for the thorough explanation.

What I am doing is converting data for processing that will be tab (for
columns) and newline (for row) delimited. Some of the data contains tabs
and newlines, so I have to convert them to something else so the file
integrity is good.

Usually this is done by escaping: translate tab -> \t, newline -> \n,
backslash -> \\.
Python strings already have a method that does it in one line:

>>> s=chr(9)+chr(10)+chr(92)
>>> print s.encode("string_escape")
\t\n\\

When you're ready to convert it back, call decode("string_escape").
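
The "string_escape" codec exists only in Python 2. For readers on Python 3, a comparable round trip can be sketched with the "unicode_escape" codec. This is a substitute technique, not the same codec: it also escapes non-ASCII characters, as \xNN or \uNNNN. The helper names escape and unescape are hypothetical.

```python
def escape(text):
    """Escape tab, newline, backslash and non-ASCII characters (Python 3)."""
    # unicode_escape produces ASCII-only bytes, so decoding as ASCII is safe.
    return text.encode("unicode_escape").decode("ascii")


def unescape(escaped):
    """Reverse escape(); expects the ASCII-only text that escape() returns."""
    return escaped.encode("ascii").decode("unicode_escape")


s = chr(9) + chr(10) + chr(92)
escaped = escape(s)
print(escaped)
```

Calling unescape(escaped) returns the original three-character string.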

Not my idea, I've been left with the implementation however.

The idea is actually not bad as long as you know how to cope with unicode.

Jun 28 '06 #6
