How to get Python to default to UTF8

I'm developing a cgi-bin application that must be unicode sensitive. I'm
striving for a UTF8 implementation. I'm running python 2.3 on a development
machine (windows xp) and a server (windows xp server). Both environments are
running Apache 2.2 with the same configuration file.

The problem is this. On my development machine I get the following unicode
error:

UnicodeDecodeError: 'utf8' codec can't decode bytes in position 4-6: invalid
data
args = ('utf8', 'adem\xe3\xa1s', 4, 7, 'invalid data')
encoding = 'utf8'
end = 7
object = 'adem\xe3\xa1s'
reason = 'invalid data'
start = 4

On my server, running exactly the same python code, I see the following
unicode error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 4:
ordinal not in range(128)
args = ('ascii', 'adem\xe3\xa1s', 4, 5, 'ordinal not in range(128)')
encoding = 'ascii'
end = 5
object = 'adem\xe3\xa1s'
reason = 'ordinal not in range(128)'
start = 4

Note the differences in the encoding -- on the development machine it's utf8
but on the server it's ascii.
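The two tracebacks can be reproduced in isolation by decoding the same byte string with each codec. This is only a sketch, written for Python 3 (the thread concerns Python 2.3, where the `b` prefix would be dropped and the exact positions reported by the utf8 codec may differ). The helper name `try_decode` is made up for illustration.

```python
# Byte string copied from the tracebacks above.
raw = b'adem\xe3\xa1s'

def try_decode(data, encoding):
    """Return (text, None) on success or (None, error) on failure."""
    try:
        return data.decode(encoding), None
    except UnicodeDecodeError as exc:
        return None, exc

for codec in ('ascii', 'utf-8'):
    text, err = try_decode(raw, codec)
    if err is None:
        print(codec, 'decoded the data')
    else:
        # err.start and err.end delimit the offending bytes
        print(codec, 'failed at', err.start, 'to', err.end, ':', err.reason)

# For comparison, U+00E1 (a with acute) encoded as UTF-8 is \xc3\xa1,
# so this byte string decodes cleanly.
well_formed = b'adem\xc3\xa1s'.decode('utf-8')
```

Both codecs reject the data, which suggests the bytes themselves are not valid UTF-8; the only difference between the two machines is which codec was asked to try.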

I was under the impression that Python assumed ascii encoding by default.
I'm wondering how did my development machine get to be utf8? And since my
python code is the same on both machines, what is it about my configuration
that could be causing a difference in default encoding? I checked site.py on
both machines and both files default to ASCII, so I assume it's something
else.

Thanks in advance.
Dec 22 '07 #1
weheh wrote:
> I'm developing a cgi-bin application that must be unicode sensitive. I'm
> striving for a UTF8 implementation. I'm running python 2.3 on a development
> machine (windows xp) and a server (windows xp server). Both environments are
> running Apache 2.2 with the same configuration file.
>
> The problem is this. On my development machine I get the following unicode
> error:
>
> UnicodeDecodeError: 'utf8' codec can't decode bytes in position 4-6: invalid
> data
> args = ('utf8', 'adem\xe3\xa1s', 4, 7, 'invalid data')
> encoding = 'utf8'
> end = 7
> object = 'adem\xe3\xa1s'
> reason = 'invalid data'
> start = 4

Could be that sys.stdin.encoding differs between the setups.

*Where* do you get this exception? In the database layer? When the
script is trying to read things from a file? When it's trying to output
things? Somewhere else?

</F>

Dec 22 '07 #2
Hi Fredrik,

Thanks again for your feedback. I am much obliged.

Indeed, I am forced to be extremely rigorous about decoding on the way in
and encoding on the way out everywhere in my program, just as you say. Your
advice is excellent and concurs with other sources of unicode expertise.
Following this approach is the only thing that has made it possible for me
to get my program to work.

However, the situation is still unacceptable to me because I often make
mistakes and it is easy for me to miss places where encoding is necessary. I
rely on testing to find my faults. On my development environment, I get no
error message and it seems that everything works perfectly. However, once
ported to the server, I see a crash. But this is too late a stage to catch
the error since the app is already live.

I assume that the default encoding that you mention shouldn't ever be
changed is stored in the site.py file. I've checked this file and it's set
to ascii in both machines (development and server). I haven't touched
site.py. However, a week or so ago, following the advice of someone I read
on the web, I did create a file in my cgi-bin directory called something
like site-config.py, wherein encoding was set to utf8. I ran my program a
few times, but then reading elsewhere that the site-config.py approach was
outmoded, I decided to remove this file. I'm wondering whether it made a
permanent change somewhere in the bowels of python while I wasn't looking?

Can you elaborate on where to look to see what stdin/stdout encodings are
set to? All inputs are coming at my app either via html forms or input
files. All output goes either to the browser via html or to an output file.

> to fix this, figure out from where you got the encoded (8-bit) string, and
> make sure you decode it properly on the way in. only use Unicode strings
> on the "inside".
>
> (Python does have two encoding defaults; there's a default encoding that
> *shouldn't* ever be changed from the "ascii" default, and there's also a
> stdin/stdout encoding that's correctly set if you run the code in an
> ordinary terminal window. if you get your data from anywhere else, you
> cannot trust any of these, so you should do your own decoding on the way
> in, and encoding things on the way out).
>
> </F>
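
The "decode on the way in, encode on the way out" advice might look roughly like this. It is a sketch in Python 3 syntax, not the actual application: `read_text` and `write_text` are hypothetical helper names, the in-memory streams stand in for a form body and the response, and UTF-8 is an assumption about what the client really sends.

```python
import io

def read_text(stream, encoding='utf-8'):
    """Decode bytes as soon as they enter the program."""
    return stream.read().decode(encoding)

def write_text(stream, text, encoding='utf-8'):
    """Encode text only at the moment it leaves the program."""
    stream.write(text.encode(encoding))

# In-memory byte streams stand in for the CGI input and output.
incoming = io.BytesIO(b'adem\xc3\xa1s')
outgoing = io.BytesIO()

text = read_text(incoming)          # Unicode text on the inside
write_text(outgoing, text.upper())  # bytes again on the way out
```

Keeping the codec name explicit at both boundaries means the result no longer depends on whatever default the interpreter happens to have on a given machine.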

Dec 23 '07 #3
> However, the situation is still unacceptable to me because I often make
> mistakes and it is easy for me to miss places where encoding is necessary. I
> rely on testing to find my faults. On my development environment, I get no
> error message and it seems that everything works perfectly. However, once
> ported to the server, I see a crash. But this is too late a stage to catch
> the error since the app is already live.

If you want to check whether there is indeed no place where you forgot
to properly .encode, you can set the default encoding on your
development machine to "undefined" (see site.py). This will give you an
exception whenever the default encoding is invoked, even if the encoding
would have succeeded under the default default encoding (i.e. "ascii").

Such a setting should not be applied in a production environment.
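
To see what the "undefined" codec does, it can be called directly. This is a sketch in Python 3 syntax and only demonstrates the codec itself; making it the default encoding through site.py, as described above, applies to Python 2. The helper name `fails_with_undefined` is made up for illustration.

```python
import codecs

# The 'undefined' codec ships with the standard library's encodings package;
# lookup() raises LookupError if a codec is missing.
codecs.lookup('undefined')

def fails_with_undefined(data):
    """Return True if decoding with the 'undefined' codec raises."""
    try:
        data.decode('undefined')
    except UnicodeError:
        return True
    return False

# Pure-ASCII bytes decode fine under 'ascii' ...
print(b'plain ascii'.decode('ascii'))
# ... but the same bytes are still rejected by 'undefined'.
print(fails_with_undefined(b'plain ascii'))
```

That is the point of the setting: every implicit conversion fails loudly on the development machine, including the ones that would have slipped through with ASCII-only test data.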
> Can you elaborate on where to look to see what stdin/stdout encodings are
> set to?

Just print out sys.stdin.encoding and sys.stdout.encoding. Or were you
asking for the precise source in the interpreter that sets them?

> All inputs are coming at my app either via html forms or input
> files. All output goes either to the browser via html or to an output file.

Then sys.stdout.encoding will not be set to anything.
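
A small report along these lines could be run as a script on both machines and the output compared. It is a sketch in Python 3 syntax; `encoding_report` is a hypothetical name, and `getattr` is used because a stream may have no `encoding` attribute or have it set to None.

```python
import sys

def encoding_report():
    """Collect the encodings the interpreter is currently using."""
    return {
        'default': sys.getdefaultencoding(),
        # May be None or missing when the stream is not a terminal.
        'stdin': getattr(sys.stdin, 'encoding', None),
        'stdout': getattr(sys.stdout, 'encoding', None),
    }

for name, value in sorted(encoding_report().items()):
    print('%-8s %r' % (name, value))
```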

Regards,
Martin
Dec 23 '07 #4
weheh wrote:
> Hi Fredrik,
>
> Thanks again for your feedback. I am much obliged.

Bear in mind that in Python, ASCII currently means ASCII, values
0..127. Type "str" will accept values > 127. However, the default
conversion from "str" to "unicode" requires true ASCII values, in
0..127. So if you take in data from some source which might have
a byte value > 127, the default conversion to Unicode won't work.

There are conversion functions for specifying the meaning of
values 128..255, (the input might be "latin1" encoding, for
example), or ignoring unexpected characters, or converting them
to "?".
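
The three options can be sketched with the standard `decode`/`encode` error handlers. This uses Python 3 syntax, and the input bytes are a made-up Latin-1 example rather than data from the thread.

```python
# Hypothetical input: "adem" + a-acute + "s" encoded as Latin-1.
raw = b'adem\xe1s'

# Name the real encoding: bytes 128..255 map to U+0080..U+00FF.
as_latin1 = raw.decode('latin-1')                # 'adem\xe1s'

# Or drop whatever is not ASCII.
ignored = raw.decode('ascii', 'ignore')          # 'adems'

# Or substitute a marker. Decoding inserts U+FFFD ...
replaced = raw.decode('ascii', 'replace')        # 'adem\ufffds'
# ... while encoding inserts a literal "?".
question = as_latin1.encode('ascii', 'replace')  # b'adem?s'
```

Only the first variant preserves the data; the other two are lossy and mostly useful for logging or display.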

John Nagle
Dec 24 '07 #5
