Ascii Encoding Error with UTF-8 encoder

Can anyone explain why I'm getting an ascii encoding error when I'm trying
to write out using a UTF-8 encoder?

Thanks

Python 2.4.3 (#69, Mar 29 2006, 17:35:34) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> filterMap = {}
>>> for i in range(0,255):
...     filterMap[chr(i)] = chr(i)
...
>>> filterMap[chr(9)] = chr(136)
>>> filterMap[chr(10)] = chr(133)
>>> filterMap[chr(136)] = chr(9)
>>> filterMap[chr(133)] = chr(10)
>>> line = '''this has
... tabs and line
... breaks'''
>>> filteredLine = ''.join([ filterMap[a] for a in line])
>>> import codecs
>>> f = codecs.open('foo.txt', 'wU', 'utf-8')
>>> print filteredLine
thisÍhasÍŗtabsÍandÍlineŗbreaks
>>> f.write(filteredLine)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "C:\Python24\lib\codecs.py", line 501, in write
    return self.writer.write(data)
  File "C:\Python24\lib\codecs.py", line 178, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x88 in position 4:
ordinal not in range(128)
Jun 27 '06 #1
Mike Currie wrote:
> Can anyone explain why I'm getting an ascii encoding error when I'm trying
> to write out using a UTF-8 encoder?


Please read the Python Unicode HOWTO.

http://www.amk.ca/python/howto/unicode

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco

Jun 27 '06 #2
On 28/06/2006 7:46 AM, Mike Currie wrote:
> Can anyone explain why I'm getting an ascii encoding error when I'm trying
> to write out using a UTF-8 encoder?
>
> >>> f = codecs.open('foo.txt', 'wU', 'utf-8')
> >>> print filteredLine
> thisÍhasÍŗtabsÍandÍlineŗbreaks
> >>> f.write(filteredLine)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
>   File "C:\Python24\lib\codecs.py", line 501, in write
>     return self.writer.write(data)
>   File "C:\Python24\lib\codecs.py", line 178, in write
>     data, consumed = self.encode(object, self.errors)
> UnicodeDecodeError: 'ascii' codec can't decode byte 0x88 in position 4:
> ordinal not in range(128)


Your fundamental problem is that you are trying to encode an 8-bit
string as UTF-8. Only Unicode can be encoded, so the codec first tries
to convert your string to Unicode using the default encoding (ascii),
and that implicit decode fails on the byte 0x88.

Get this into your head:
You encode Unicode as ascii, latin1, cp1252, utf8, glagolitic, whatever
into an 8-bit string.
You decode whatever from an 8-bit string into Unicode.

Here is a run-down on your problem, using just the encode/decode methods
instead of codecs for illustration purposes:

(1) Equivalent to what you did.
|>> '\x88'.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0x88 in position 0:
ordinal not in range(128)

(2) Same thing, explicitly trying to decode your 8-bit string as ASCII.
|>> '\x88'.decode('ascii').encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0x88 in position 0:
ordinal not in range(128)

(3) Encoding Unicode as UTF-8 works, as expected.
|>> u'\x88'.encode('utf-8')
'\xc2\x88'

(4) But you need to know what your 8-bit data is supposed to be encoded
in, before you start.
|>> '\x88'.decode('cp1252').encode('utf-8')
'\xcb\x86'
|>> '\x88'.decode('latin1').encode('utf-8')
'\xc2\x88'
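
For readers on Python 3, the four cases above can be re-run roughly as
follows. This is only a sketch and not part of the original exchange; the
variable names are made up for illustration. Python 3 keeps bytes and str
apart, so the hidden ASCII decode behind case (1) has no counterpart there:
bytes objects simply have no encode method.

```python
# Python 3 sketch of cases (1)-(4); all names here are illustrative.
raw = b"\x88"  # a single 8-bit byte whose encoding we have to know

# (1)/(2) Decoding the byte as ASCII fails, this time explicitly.
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# (3) Encoding text (Unicode) as UTF-8 works.
utf8_of_u0088 = "\x88".encode("utf-8")

# (4) The same byte stands for different characters in different encodings,
# so the UTF-8 result depends on which source encoding is assumed.
via_cp1252 = raw.decode("cp1252").encode("utf-8")
via_latin1 = raw.decode("latin-1").encode("utf-8")

print(ascii_failed, utf8_of_u0088, via_cp1252, via_latin1)
```

The point is the same as in the Python 2 session: decide what the 8-bit
data is encoded in, decode it once, and only then encode to UTF-8.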

I am rather puzzled as to what you are trying to achieve. You appear to
believe that you possess one or more 8-bit strings, encoded in latin1,
which contain the C0 controls \x09 (HT) and \x0a (LF) AND the
corresponding C1 controls \x88 (HTS) and \x85 (NEL). You want to change
LF to NEL, and NEL to LF and similarly with the other pair. Then you
want to write the result, encoded in UTF-8, to a file. The purpose
behind that baroque/byzantine capering would be .... what?
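
Whatever the purpose, the swap described above can be done without the
encoding error if the data is decoded first. A minimal sketch, assuming
Python 3 and assuming the input really is latin1; SWAP, swap_controls, the
sample bytes, and the temporary file are all hypothetical names and data
chosen for illustration:

```python
import os
import tempfile

# Swap table keyed by code point: HT <-> U+0088 and LF <-> U+0085.
SWAP = {0x09: 0x88, 0x0A: 0x85, 0x88: 0x09, 0x85: 0x0A}


def swap_controls(text):
    """Exchange tab/newline with the two C1 controls in a text string."""
    return text.translate(SWAP)


raw = b"this\thas\n\ttabs\tand\tline\nbreaks"  # made-up 8-bit input
text = raw.decode("latin-1")                    # assumed source encoding
filtered = swap_controls(text)

path = os.path.join(tempfile.mkdtemp(), "foo.txt")
# newline="" keeps the text layer from translating line endings on write.
with open(path, "w", encoding="utf-8", newline="") as f:
    f.write(filtered)

# Read the file back as raw bytes to see what was actually stored.
with open(path, "rb") as f:
    written = f.read()
```

Because the table is its own inverse, calling swap_controls on the
filtered text gives back the original string.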

Jun 27 '06 #3
Thanks for the thorough explanation.

What I am doing is converting data for processing that will be tab (for
columns) and newline (for row) delimited. Some of the data contains tabs
and newlines, so I have to convert them to something else so the file
integrity is good.

Not my idea, I've been left with the implementation however.

"John Machin" <sj******@lexicon.net> wrote in message
news:44********@news.eftel.com...

Jun 27 '06 #4
On 28/06/2006 9:44 AM, Mike Currie wrote:

> What I am doing is converting data for processing that will be tab (for
> columns) and newline (for row) delimited. Some of the data contains tabs
> and newlines, so I have to convert them to something else so the file
> integrity is good.
>
> Not my idea, I've been left with the implementation however.


Do you *need* UTF-8? Or is that only there to hide away the \x88 and
\x85? Apart from tab and linefeed, what (if any) other characters are
there in the data that are not printable ASCII characters?

In any case, if you have 8-bit string data, the CSV file format would
appear to meet the requirement: it preserves your data by "quoting"
delimiters and newlines that appear in the actual data. The Python csv
module is included in every Python distribution since 2.3.
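
A minimal sketch of that suggestion, written for Python 3 rather than the
2.4 used in this thread. The rows are made-up sample data and the
in-memory buffer stands in for a real file:

```python
import csv
import io

# Made-up rows; two fields contain the delimiters themselves.
rows = [["id", "note"], ["1", "has\ttab"], ["2", "two\nlines"]]

# Write tab-delimited output; the csv module quotes any field that
# contains the delimiter or a line break.
buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t", quoting=csv.QUOTE_MINIMAL)
writer.writerows(rows)
data = buf.getvalue()

# Read it back; newline="" leaves line endings for the csv reader to handle.
recovered = list(csv.reader(io.StringIO(data, newline=""), delimiter="\t"))

print(recovered == rows)
```

The file stays tab and newline delimited, and the embedded tabs and
newlines survive the round trip without any character substitution.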

Cheers,
John
Jun 28 '06 #5
On 6/27/06, Mike Currie <de*@null.com> wrote:
> Thanks for the thorough explanation.
>
> What I am doing is converting data for processing that will be tab (for
> columns) and newline (for row) delimited. Some of the data contains tabs
> and newlines, so I have to convert them to something else so the file
> integrity is good.

Usually it is done by escaping: translate tab -> \t, new line -> \n,
backslash -> \\.
Python strings already have a method to do it in just one line:

>>> s=chr(9)+chr(10)+chr(92)
>>> print s.encode("string_escape")
\t\n\\

When you're ready to convert it back, you call decode("string_escape").
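
The string_escape codec exists only in Python 2. On Python 3 the closest
standard codec is unicode_escape, which works on str and produces bytes. A
sketch of the same round trip under that assumption; note that it also
turns non-ASCII characters into backslash escapes:

```python
# Python 3 sketch; unicode_escape stands in for Python 2's string_escape.
s = chr(9) + chr(10) + chr(92)  # tab, newline, backslash

escaped = s.encode("unicode_escape")         # ASCII-only bytes with backslash escapes
restored = escaped.decode("unicode_escape")  # back to the original characters

print(escaped.decode("ascii"))
print(restored == s)
```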

> Not my idea, I've been left with the implementation however.


The idea is actually not bad as long as you know how to cope with unicode.
Jun 28 '06 #6
