How to get Python to default to UTF8

I'm developing a cgi-bin application that must be unicode sensitive. I'm
striving for a UTF8 implementation. I'm running python 2.3 on a development
machine (windows xp) and a server (windows xp server). Both environments are
running Apache 2.2 with the same configuration file.

The problem is this. On my development machine I get the following unicode
error:

UnicodeDecodeError: 'utf8' codec can't decode bytes in position 4-6: invalid
data
args = ('utf8', 'adem\xe3\xa1s', 4, 7, 'invalid data')
encoding = 'utf8'
end = 7
object = 'adem\xe3\xa1s'
reason = 'invalid data'
start = 4

On my server, running exactly the same python code, I see the following
unicode error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 4:
ordinal not in range(128)
args = ('ascii', 'adem\xe3\xa1s', 4, 5, 'ordinal not in range(128)')
encoding = 'ascii'
end = 5
object = 'adem\xe3\xa1s'
reason = 'ordinal not in range(128)'
start = 4

Note the differences in the encoding -- on the development machine it's utf8
but on the server it's ascii.
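
For anyone who wants to reproduce the two tracebacks above, here is a minimal sketch. It is written in Python 3 syntax (an assumption; the thread is about Python 2.3, where the same bytes would live in a plain `str`), and the helper name `try_decode` is made up for illustration.

```python
RAW = b"adem\xe3\xa1s"  # the byte string shown in both tracebacks


def try_decode(data, encoding):
    """Return the UnicodeDecodeError raised while decoding, or None on success."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        return exc
    return None


for name in ("ascii", "utf-8"):
    err = try_decode(RAW, name)
    # Both codecs reject the input; only the codec named in the error differs.
    print(name, "->", err.encoding, err.start, err.reason)
```

Neither codec accepts this input, so the difference between the machines is only which codec was tried. If the word was meant to be "además", its UTF-8 form would be `adem\xc3\xa1s`, which suggests the bytes were already damaged before they reached the decoder.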

I was under the impression that Python assumed ascii encoding by default.
I'm wondering how my development machine came to use utf8. And since my
python code is the same on both machines, what is it about my configuration
that could be causing a difference in default encoding? I checked site.py on
both machines and both files default to ASCII, so I assume it's something
else.

Thanks in advance.
Dec 22 '07 #1
weheh wrote:
> I'm developing a cgi-bin application that must be unicode sensitive. I'm
> striving for a UTF8 implementation. I'm running python 2.3 on a development
> machine (windows xp) and a server (windows xp server). Both environments are
> running Apache 2.2 with the same configuration file.
>
> The problem is this. On my development machine I get the following unicode
> error:
>
> UnicodeDecodeError: 'utf8' codec can't decode bytes in position 4-6: invalid
> data
> args = ('utf8', 'adem\xe3\xa1s', 4, 7, 'invalid data')
> encoding = 'utf8'
> end = 7
> object = 'adem\xe3\xa1s'
> reason = 'invalid data'
> start = 4

Could be that sys.stdin.encoding differs between the setups.

*Where* do you get this exception? In the database layer? When the
script is trying to read things from a file? When it's trying to output
things? Somewhere else?

</F>

Dec 22 '07 #2
Hi Fredrik,

Thanks again for your feedback. I am much obliged.

Indeed, I am forced to be extremely rigorous about decoding on the way in
and encoding on the way out everywhere in my program, just as you say. Your
advice is excellent and concurs with other sources of unicode expertise.
Following this approach is the only thing that has made it possible for me
to get my program to work.

However, the situation is still unacceptable to me because I often make
mistakes and it is easy for me to miss places where encoding is necessary. I
rely on testing to find my faults. On my development environment, I get no
error message and it seems that everything works perfectly. However, once
ported to the server, I see a crash. But this is too late a stage to catch
the error since the app is already live.

I assume that the default encoding that you mention shouldn't ever be
changed is stored in the site.py file. I've checked this file and it's set
to ascii in both machines (development and server). I haven't touched
site.py. However, a week or so ago, following the advice of someone I read
on the web, I did create a file in my cgi-bin directory called something
like site-config.py, wherein encoding was set to utf8. I ran my program a
few times, but then reading elsewhere that the site-config.py approach was
outmoded, I decided to remove this file. I'm wondering whether it made a
permanent change somewhere in the bowels of python while I wasn't looking?

Can you elaborate on where to look to see what stdin/stdout encodings are
set to? All inputs are coming at my app either via html forms or input
files. All output goes either to the browser via html or to an output file.

> to fix this, figure out from where you got the encoded (8-bit) string, and
> make sure you decode it properly on the way in. only use Unicode strings
> on the "inside".
>
> (Python does have two encoding defaults; there's a default encoding that
> *shouldn't* ever be changed from the "ascii" default, and there's also a
> stdin/stdout encoding that's correctly set if you run the code in an
> ordinary terminal window. if you get your data from anywhere else, you
> cannot trust any of these, so you should do your own decoding on the way
> in, and encoding things on the way out).
>
> </F>
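
The "decode on the way in, encode on the way out" advice quoted above can be sketched roughly as follows. This uses Python 3 syntax (an assumption; in Python 2.3 the text type would be `unicode`), the function names `read_text` and `write_text` are hypothetical, and it assumes the form data really arrives as UTF-8.

```python
def read_text(raw, encoding="utf-8"):
    """Decode bytes from the outside world; raises UnicodeDecodeError on bad input."""
    return raw.decode(encoding)


def write_text(text, encoding="utf-8"):
    """Encode text just before it leaves the program."""
    return text.encode(encoding)


incoming = b"adem\xc3\xa1s"   # UTF-8 bytes, e.g. a form field (assumed encoding)
text = read_text(incoming)    # only text objects are used on the inside
shouted = text.upper()        # text operations see 6 characters, not 7 bytes
outgoing = write_text(shouted)
print(len(incoming), len(text), outgoing)
```

Because every boundary names its encoding explicitly, the result no longer depends on whatever default a given machine happens to have.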

Dec 23 '07 #3
> However, the situation is still unacceptable to me because I often make
> mistakes and it is easy for me to miss places where encoding is necessary. I
> rely on testing to find my faults. On my development environment, I get no
> error message and it seems that everything works perfectly. However, once
> ported to the server, I see a crash. But this is too late a stage to catch
> the error since the app is already live.

If you want to check whether there is indeed no place where you forgot
to properly .encode, you can set the default encoding on your
development machine to "undefined" (see site.py). This will give you an
exception whenever the default encoding is invoked, even if the encoding
would have succeeded under the default default encoding (i.e. "ascii").

Such a setting should not be applied in a production environment.
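
To see what the "undefined" codec does without touching any interpreter setting, a small sketch follows. It calls the codec directly, in Python 3 syntax (an assumption; the site.py default-encoding setting discussed here belongs to Python 2), and the function name `undefined_error` is made up for illustration.

```python
def undefined_error(data):
    """Return the error message from decoding with the 'undefined' codec, or None."""
    try:
        data.decode("undefined")
    except UnicodeError as exc:
        return str(exc)
    return None


# Pure ASCII input still fails, which is what makes the codec useful for
# flushing out implicit conversions during development.
print(undefined_error(b"plain ascii"))
```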

> Can you elaborate on where to look to see what stdin/stdout encodings are
> set to?

Just print out sys.stdin.encoding and sys.stdout.encoding. Or were you
asking for the precise source in the interpreter that sets them?

> All inputs are coming at my app either via html forms or input
> files. All output goes either to the browser via html or to an output file.

Then sys.stdout.encoding will not be set to anything.
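
A small sketch of that check, which could be run on both machines to compare them. The function name `stream_encodings` is hypothetical, and `getattr` is used so that a missing or unset attribute shows up as None instead of raising.

```python
import sys


def stream_encodings():
    """Collect the interpreter's default encoding and the stdin/stdout encodings."""
    return {
        "default": sys.getdefaultencoding(),
        "stdin": getattr(sys.stdin, "encoding", None),
        "stdout": getattr(sys.stdout, "encoding", None),
    }


for name, value in sorted(stream_encodings().items()):
    print("%s: %r" % (name, value))
```

Running it once from a console and once from inside the CGI script (writing the result to a log file, say) should show whether the two environments really differ.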

Regards,
Martin
Dec 23 '07 #4
weheh wrote:
> Hi Fredrik,
>
> Thanks again for your feedback. I am much obliged.

Bear in mind that in Python, ASCII currently means ASCII, values
0..127. Type "str" will accept values > 127. However, the default
conversion from "str" to "unicode" requires true ASCII values, in
0..127. So if you take in data from some source which might have
a byte value > 127, the default conversion to Unicode won't work.

There are conversion functions for specifying the meaning of
values 128..255, (the input might be "latin1" encoding, for
example), or ignoring unexpected characters, or converting them
to "?".
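
The options listed above might look like this in practice. This is a sketch in Python 3 syntax (an assumption; in Python 2 the same `decode` and `encode` calls exist on `str` and `unicode`), and the sample bytes are assumed to be Latin-1.

```python
raw = b"adem\xe1s"  # assumed Latin-1 bytes; 0xe1 is outside 0..127

as_latin1 = raw.decode("latin-1")           # say what bytes 128..255 mean
ignored = raw.decode("ascii", "ignore")     # drop bytes the codec rejects
replaced = raw.decode("ascii", "replace")   # substitute U+FFFD when decoding
question = as_latin1.encode("ascii", "replace")  # substitute "?" when encoding

print(repr(as_latin1), repr(ignored), repr(replaced), repr(question))
```

Note that the "replace" handler yields the Unicode replacement character U+FFFD when decoding; the literal "?" appears when encoding text to a codec that cannot represent a character.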

John Nagle
Dec 24 '07 #5
