Long way around UnicodeDecodeError, or 'ascii' codec can't decode byte

Hello,

I'm working on a Unicode-aware application. I like to use "print" to
debug programs, but in this case it was a nightmare. The most common
result of "print" was:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xXX in position
0: ordinal not in range(128)

I spent two hours fixing it, and I hope it's done. The solution is one
of the ugliest hacks I have ever written, but it solves the pain. The full
story and the code are on my blog:

http://uucode.com/blog/2007/03/23/sh...-7-bit-python/

--
Oleg Parashchenko olpa@ http://uucode.com/
http://uucode.com/blog/ Generative Programming, XML, TeX, Scheme
http://tohtml.com/ Online syntax highlighting

Mar 29 '07 #1
On 29 Mar, 06:26, "Oleg Parashchenko" <ole...@gmail.com> wrote:
> Hello,
>
> I'm working on a Unicode-aware application. I like to use "print" to
> debug programs, but in this case it was a nightmare. The most common
> result of "print" was:
>
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xXX in position
> 0: ordinal not in range(128)

What does sys.stdout.encoding say?

> I spent two hours fixing it, and I hope it's done. The solution is one
> of the ugliest hacks I have ever written, but it solves the pain. The full
> story and the code are on my blog:
>
> http://uucode.com/blog/2007/03/23/sh...-7-bit-python/

Calling sys.setdefaultencoding might not even help in this case, and
the consensus is that it may be harmful to your code's portability
[1]. Writing output to a terminal may be influenced by your locale,
but I'm not convinced that going through all the locale settings and
setting the character set is the best approach (or even the right
one).

What do you get if you do this...?

import locale
locale.setlocale(locale.LC_ALL, "")
print locale.getlocale()

What is your terminal encoding?
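
On Python 3 the same questions can be asked in one go. This is a minimal sketch rather than anything from the thread: `report_encodings` is a hypothetical helper, and quietly falling back when the configured locale is not installed is an assumption.

```python
import locale
import sys


def report_encodings():
    """Collect the encodings Python associates with the locale and stdout."""
    try:
        # Adopt the user's locale from the environment (LANG, LC_ALL, ...).
        locale.setlocale(locale.LC_ALL, "")
        current = locale.getlocale()
    except (locale.Error, ValueError):
        # Assumption: an unusable locale is reported as unknown, not fatal.
        current = (None, None)
    return {
        "locale": current,
        "preferred": locale.getpreferredencoding(False),
        # stdout may have been replaced by an object without an encoding.
        "stdout": getattr(sys.stdout, "encoding", None),
    }


if __name__ == "__main__":
    for name, value in report_encodings().items():
        print(name, "=", value)
```

If the three values disagree, that mismatch is usually the first thing worth investigating.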

Usually, if I'm wanting to print Unicode objects, I explicitly encode
them into something I know the terminal will support. The codecs
module can help with writing Unicode to streams in different
encodings, too.
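
As a rough illustration of encoding explicitly for a known terminal, here is a Python 3 sketch (the thread itself uses Python 2). The helper name `encode_for_terminal` and the choice of `errors="replace"` are assumptions for illustration, not something prescribed in this post.

```python
import sys


def encode_for_terminal(text, encoding=None, errors="replace"):
    """Encode text into bytes the terminal is expected to understand."""
    if encoding is None:
        # Assumption: fall back to ASCII when stdout reports no encoding.
        encoding = getattr(sys.stdout, "encoding", None) or "ascii"
    return text.encode(encoding, errors)


# Every letter of this word exists in KOI8-R, so it encodes to six bytes.
koi8 = encode_for_terminal("Привет", "koi8-r")

# U+180E has no KOI8-R equivalent, so errors="replace" substitutes "?".
fallback = encode_for_terminal("\u180e", "koi8-r")
```

With `errors="strict"` the second call would raise UnicodeEncodeError instead of degrading to a question mark.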

Paul

[1] http://groups.google.com/group/comp....1017a4cb4bb8ea

Mar 29 '07 #2
Hello,

On Mar 29, 4:53 pm, "Paul Boddie" <p...@boddie.org.uk> wrote:
> On 29 Mar, 06:26, "Oleg Parashchenko" <ole...@gmail.com> wrote:
>> Hello,
>>
>> I'm working on a Unicode-aware application. I like to use "print" to
>> debug programs, but in this case it was a nightmare. The most common
>> result of "print" was:
>>
>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xXX in position
>> 0: ordinal not in range(128)
>
> What does sys.stdout.encoding say?

'KOI8-R'

>> I spent two hours fixing it, and I hope it's done. The solution is one
>> of the ugliest hacks I have ever written, but it solves the pain. The full
>> story and the code are on my blog:
>>
>> http://uucode.com/blog/2007/03/23/sh...-7-bit-python/
>
> Calling sys.setdefaultencoding might not even help in this case, and
> the consensus is that it may be harmful to your code's portability
> [1].

Yes, but I think UTF-8 is now everywhere.

> Writing output to a terminal may be influenced by your locale,
> but I'm not convinced that going through all the locale settings and
> setting the character set is the best approach (or even the right
> one).
>
> What do you get if you do this...?
>
> import locale
> locale.setlocale(locale.LC_ALL, "")
> print locale.getlocale()

('ru_RU', 'koi8-r')

> What is your terminal encoding?

koi8-r

> Usually, if I'm wanting to print Unicode objects, I explicitly encode
> them into something I know the terminal will support. The codecs
> module can help with writing Unicode to streams in different
> encodings, too.

As long as input/output is the only place where this is needed, it's OK
to encode explicitly. But I also had problems with, for example, the md5
module, and I don't know the whole list of potentially problematic
places. Therefore, I'd rather go with my brutal utf8ization.

> Paul
>
> [1] http://groups.google.com/group/comp.lang.python/msg/431017a4cb4bb8ea
--
Oleg Parashchenko olpa@ http://uucode.com/
http://uucode.com/blog/ Generative Programming, XML, TeX, Scheme
http://tohtml.com/ Online syntax highlighting

Mar 31 '07 #3
Oleg Parashchenko wrote:
>>> I spent two hours fixing it, and I hope it's done. The solution is one
>>> of the ugliest hacks I have ever written, but it solves the pain. The full
>>> story and the code are on my blog:
>>>
>>> http://uucode.com/blog/2007/03/23/sh...-7-bit-python/
>>
>> Calling sys.setdefaultencoding might not even help in this case, and
>> the consensus is that it may be harmful to your code's portability
>> [1].
>
> Yes, but I think UTF-8 is now everywhere.

No, it is not. Your own system is "not ready for UTF-8", as you stated
somewhere in this blog entry. How can you expect everybody else's system
to be UTF-8 while "you are not ready for transition"?

It would be better to write your programs in an encoding-agnostic way,
using byte streams only for input and output (yes, printing a debug
statement to the terminal *is* a kind of output). And, oh, you
can't encode or decode text without knowing the encoding...
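
One way to picture this advice is the sketch below, written for Python 3 (which the thread predates). The function name `transcode` and the upper-casing step are purely illustrative assumptions; the point is only that bytes appear at the two edges and text everywhere in between.

```python
import io


def transcode(source, target, source_encoding, target_encoding):
    """Read bytes, work on text, write bytes: encodings live at the edges."""
    raw = source.read()                   # bytes arrive from the outside
    text = raw.decode(source_encoding)    # decode once, at input
    text = text.upper()                   # all processing happens on text
    target.write(text.encode(target_encoding))  # encode once, at output


# In-memory byte streams stand in for a file, socket or terminal.
source = io.BytesIO("привет".encode("koi8-r"))
target = io.BytesIO()
transcode(source, target, "koi8-r", "utf-8")
result = target.getvalue()
```

Both encodings have to be supplied by the caller, which is the catch in the last sentence above: nothing in the bytes themselves says which encoding they are in.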

--
Jarek Zgoda
http://jpa.berlios.de/
Mar 31 '07 #4
Oleg Parashchenko wrote:
> On Mar 29, 4:53 pm, "Paul Boddie" <p...@boddie.org.uk> wrote:
>> On 29 Mar, 06:26, "Oleg Parashchenko" <ole...@gmail.com> wrote:
>>>
>>> I'm working on a Unicode-aware application. I like to use "print" to
>>> debug programs, but in this case it was a nightmare. The most common
>>> result of "print" was:
>>>
>>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xXX in position
>>> 0: ordinal not in range(128)

I think I've found the actual source of this, and it isn't the print
statement. UnicodeDecodeError relates to the construction of Unicode
objects, not the encoding of such objects as byte strings. The
terminology is explained using this simple diagram (which hopefully
won't be ruined in transmission):

byte string in XYZ encoding
        |
(decode from XYZ) -- possible UnicodeDecodeError
        |
        V
Unicode object
        |
(encode to ABC) -- possible UnicodeEncodeError
        |
        V
byte string in ABC encoding
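
The diagram can be walked through in code. The following Python 3 sketch is only an illustration (the thread uses Python 2, where the types are `str` and `unicode` rather than `bytes` and `str`), and `round_trip` is a hypothetical name.

```python
def round_trip(raw, source_encoding, target_encoding):
    """Follow the diagram: bytes -> text -> bytes, naming the step that fails."""
    try:
        text = raw.decode(source_encoding)     # top half of the diagram
    except UnicodeDecodeError:
        return ("decode error", None)
    try:
        encoded = text.encode(target_encoding)  # bottom half of the diagram
    except UnicodeEncodeError:
        return ("encode error", None)
    return ("ok", encoded)


# 0xe6 is not ASCII, so decoding it as ASCII fails at the first step.
first = round_trip(b"\xe6", "ascii", "utf-8")
# Decoded as ISO-8859-15 it is a valid character that UTF-8 can represent.
second = round_trip(b"\xe6", "iso-8859-15", "utf-8")
# The same character cannot be written back out as ASCII.
third = round_trip(b"\xe6", "iso-8859-15", "ascii")
```

The three calls land on the decode error, the success path and the encode error respectively, matching the two failure points marked in the diagram.
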
>> What does sys.stdout.encoding say?
>
> 'KOI8-R'

[...]

>> What do you get if you do this...?
>>
>> import locale
>> locale.setlocale(locale.LC_ALL, "")
>> print locale.getlocale()
>
> ('ru_RU', 'koi8-r')
>
>> What is your terminal encoding?
>
> koi8-r

Here's a transcript on my system answering the same questions:

Python 2.4.1 (#2, Oct 4 2006, 16:53:35)
[GCC 3.3.5 (Debian 1:3.3.5-8ubuntu2.1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.getlocale()
(None, None)
>>> locale.setlocale(locale.LC_ALL, "")
'en_US.ISO-8859-15'
>>> locale.getlocale()
('en_US', 'iso-8859-15')

So Python knows about the locale. Note that neither of us uses UTF-8 as
a system encoding.

>>> import sys
>>> sys.stdout.encoding
'ISO-8859-15'
>>> sys.stdin.encoding
'ISO-8859-15'

This tells us that Python could know things about writing Unicode
objects out in the appropriate encoding. I wasn't sure whether Python
was so smart about this, so let's see what happens...

>>> print unicode("")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position
0: ordinal not in range(128)

Now this isn't anything to do with the print operation: what's
happening here is that I'm explicitly making a Unicode object but
haven't said what the encoding of my byte string is. The default
encoding is 'ascii' as stated in the error message. None of the
characters provided belong to the ASCII character set.
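
For comparison, here is a Python 3 sketch of the same failure, assuming the offending byte is the 0xe6 reported in the traceback. The 'ascii' codec is named explicitly because Python 3 no longer applies it by default; the variable names are illustrative.

```python
raw = b"\xe6"  # the byte named in the traceback

try:
    raw.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    # Keep the message so it can be compared with the traceback.
    ascii_error = str(exc)

# Naming the real encoding of the bytes makes the decode succeed.
text = raw.decode("iso-8859-15")
```

Here `ascii_error` carries the same "can't decode byte 0xe6 in position 0" wording, and `text` is the single character U+00E6.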

We can check this by not printing anything out:

>>> s = unicode("")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position
0: ordinal not in range(128)

So, let's try again and provide an encoding...

>>> print unicode("", sys.stdin.encoding)


Here, we've mentioned the encoding and even though the print statement
is acting on a Unicode object, it seems to be happy to work out the
resulting encoding.

>>> print u""


Here, we've skipped the explicit Unicode object construction by using
a Unicode literal, which works in this simple case.

Of course, if your system encoding (along with the terminal) isn't
capable of displaying every Unicode character, you'll experience
problems doing the above. Frequently, it's interesting to encode
things as UTF-8 and look at them in applications that are capable of
displaying the text. Thus, you'd do something like this:

import unicodedata

(This gets an interesting function to help us look up characters in
the Unicode database.)

somefile = open("somefile.txt", "wb")
print >>somefile, unicodedata.lookup("MONGOLIAN VOWEL SEPARATOR").encode("utf-8")

Or even this:

import codecs
somefile = codecs.open("somefile.txt", "wb", encoding="utf-8")
print >>somefile, unicodedata.lookup("MONGOLIAN VOWEL SEPARATOR")

Here, we only specified the encoding once when opening the file. The
file object accepts Unicode objects thereafter.
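
A roughly equivalent Python 3 sketch follows, where the built-in `open` takes the encoding directly. Writing into a temporary directory is an assumption made to keep the example self-contained; the thread writes to `somefile.txt` in the current directory.

```python
import os
import tempfile
import unicodedata

# Look the character up by name in the Unicode database (U+180E).
char = unicodedata.lookup("MONGOLIAN VOWEL SEPARATOR")

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "somefile.txt")

    # The encoding is named once, when the file is opened in text mode.
    with open(path, "w", encoding="utf-8") as somefile:
        somefile.write(char)

    # Reading the file back in binary mode shows the UTF-8 bytes on disk.
    with open(path, "rb") as somefile:
        written = somefile.read()
```

The file ends up holding three bytes for the one character, which is the kind of detail a UTF-8 capable viewer hides and a hex dump reveals.
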
>> Usually, if I'm wanting to print Unicode objects, I explicitly encode
>> them into something I know the terminal will support. The codecs
>> module can help with writing Unicode to streams in different
>> encodings, too.
>
> As long as input/output is the only place where this is needed, it's OK
> to encode explicitly. But I also had problems with, for example, the md5
> module, and I don't know the whole list of potentially problematic
> places. Therefore, I'd rather go with my brutal utf8ization.

It's best to decode (i.e. construct Unicode objects) upon receiving
data as input, and to encode (i.e. convert Unicode objects to byte
strings) upon producing output. The problem with the md5 module may be
that it assumes byte strings and doesn't work properly with Unicode
objects, although you'd have to post example code for us to help you
out. I can't say for sure, because I'm usually presenting byte strings
to md5 module functions on the rare occasions I do anything with them.
Note that one would usually calculate MD5 checksums on raw data, so it
doesn't necessarily make much sense to present those functions with
Unicode data, although I can imagine a hypothetical (if perhaps
unrealistic) need to do so on Unicode text.
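
To make the byte-string requirement concrete, here is a hedged Python 3 sketch. It uses `hashlib`, which superseded the `md5` module discussed in the thread; the helper name `md5_of_text` and its UTF-8 default are assumptions for illustration.

```python
import hashlib


def md5_of_text(text, encoding="utf-8"):
    """Hash text by first choosing an explicit byte encoding for it."""
    return hashlib.md5(text.encode(encoding)).hexdigest()


# Pure ASCII text hashes the same under any ASCII-compatible encoding.
ascii_digest = md5_of_text("hello")

# Non-ASCII text gives different digests depending on the encoding chosen.
utf8_digest = md5_of_text("\u00e6", "utf-8")
latin9_digest = md5_of_text("\u00e6", "iso-8859-15")
```

Because the digest depends on the bytes, two programs only agree on a checksum of text if they also agree on the encoding, which is one more reason to encode deliberately at the boundary.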

Paul

Mar 31 '07 #5
