Bytes IT Community

How to display unicode with the CGI module?

Hi!

I am using the built-in Python web server (CGIHTTPServer) to serve
pages via CGI.
The problem I am having is that I get an error while trying to display
Unicode UTF-8 characters via a Python CGI script.

The error goes like this: "UnicodeEncodeError: 'ascii' codec can't
encode character u'\u026a' in position 12: ordinal not in range(128)".

My questions are: (1) how and (2) where do I set the encoding for the
page?

I have tried adding <meta http-equiv="content-type" content="text/html;
charset=utf-8" /> but this does not seem to help, as this is an
instruction for the browser, not for the webserver and/or CGI script.

Do I have to set the encoding in the server script? Or in the Python
CGI script?

The data that I want to display comes from a sqlite3 database and is
already in Unicode format.

The webserver script looks like this:

    import CGIHTTPServer, BaseHTTPServer

    httpd = BaseHTTPServer.HTTPServer(('', 8080),
                                      CGIHTTPServer.CGIHTTPRequestHandler)
    httpd.serve_forever()
A simplified version of my Python CGI script would be:
    import cgi

    print "Content-Type: text/html"
    print

    print "<html>"
    print " <body>"
    print "my UTF8 string: Français 日本語 Español Português Română"
    print " </body>"
    print "</html>"
Where and what do I need to add to these scripts to get proper display
of UTF8 content?
Nov 25 '07 #1
6 Replies


On Sat, 24 Nov 2007 15:58:56 -0800, coldpizza wrote:
> The problem I am having is that I get an error while trying to display
> Unicode UTF-8 characters via a Python CGI script.
>
> The error goes like this: "UnicodeEncodeError: 'ascii' codec can't
> encode character u'\u026a' in position 12: ordinal not in range(128)".

Unicode != UTF-8. You are not trying to send a UTF-8 encoded byte string
but a *unicode string*. That's not possible. If unicode strings should
"leave" your program they must be encoded into byte strings. If you don't
do this explicitly, Python tries to encode as ASCII and fails if there's
anything non-ASCII in the string. The `encode()` method is your friend.
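For example, a minimal sketch (the string below is made up, but it reuses the character from the traceback; this runs on both Python 2 and Python 3):

```python
# A unicode string containing U+026A, the character from the traceback.
s = u'pronunciation: \u026a'

# encode() turns the unicode string into a UTF-8 byte string that can
# safely be written to stdout / the HTTP response.
data = s.encode('utf-8')
```

Write out the encoded byte string rather than the unicode object, and the ASCII codec never gets involved.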

Ciao,
Marc 'BlackJack' Rintsch
Nov 25 '07 #2

P: n/a
Marc 'BlackJack' Rintsch wrote:
> On Sat, 24 Nov 2007 15:58:56 -0800, coldpizza wrote:
>> The problem I am having is that I get an error while trying to display
>> Unicode UTF-8 characters via a Python CGI script.
>>
>> The error goes like this: "UnicodeEncodeError: 'ascii' codec can't
>> encode character u'\u026a' in position 12: ordinal not in range(128)".
>
> Unicode != UTF-8. You are not trying to send a UTF-8 encoded byte string
> but a *unicode string*.

Just to expand on this... It helps to think of "unicode objects" and
"strings" as separate types (which they are). So there is no such thing
as a "unicode string", and you always need to think about when to
encode() your unicode objects. However, this will change in py3k...
what's the new rule of thumb?

cheers
Paul

Nov 25 '07 #3

Marc 'BlackJack' Rintsch wrote:
> Unicode != UTF-8.
[...]
> The `encode()` method is your friend.

Thanks a lot for the help!

I am always confused as to which one to use: encode() or decode(); I
had initially tried decode() and it did not work.

It is funny that encode() and decode() omit the name of the other
encoding (the internal Unicode one, ucs2?), which makes it far less
readable than something like s.recode('ucs2', 'utf8') would be.

Another weird thing is that by default Python converts internal
Unicode to ascii. Will it be the same in Py3k?
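The direction to remember, sketched with a made-up byte string (syntax that works on Python 3 as well): decode() goes from bytes to unicode, encode() goes from unicode to bytes.

```python
raw = b'Fran\xc3\xa7ais'      # a UTF-8 encoded byte string
text = raw.decode('utf-8')    # bytes -> unicode: decode()
back = text.encode('utf-8')   # unicode -> bytes: encode()
```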
Nov 25 '07 #4

On Sun, 25 Nov 2007 13:02:26 -0800, coldpizza wrote:
> It is funny that encode() and decode() omit the name of the other
> encoding (Unicode ucs2?), which makes it far less readable than a
> s.recode('ucs2','utf8').
The internal encoding/representation of a "string" of Unicode characters
is considered an implementation detail and is in fact not always the same
(e.g. a cpython build parameter selects UCS2 or UCS4, and it might be
something else in other implementations).

See the 'Py_UNICODE' paragraph in:
<http://docs.python.org/api/unicodeObjects.html>
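One way to see which kind of build you have (an illustrative check, not from the original post):

```python
import sys

# On a "narrow" (UCS-2) CPython build sys.maxunicode is 0xFFFF; on a
# "wide" (UCS-4) build it is 0x10FFFF.  (Python 3.3+ always reports
# 0x10FFFF, thanks to the flexible string representation.)
print(sys.maxunicode)
```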
--
JanC
Nov 26 '07 #5

paul wrote:
> However, this will change in py3k...
> what's the new rule of thumb?
In py3k, the str type will be what unicode is now, and there
will be a new type called bytes for holding binary data --
including text in some external encoding. These two types
will not be compatible.

At the lowest level, reading a file will return bytes, which
then have to be decoded to produce a (unicode) str, and a str
will have to be encoded into bytes before being written to a
file.

There will be wrappers for text files that perform the
decoding and encoding automatically, but they will need to
be set up to use a specified encoding if you're dealing
with anything other than ascii. (It may be possible to
set up a system-wide default, I'm not sure.)
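In py3k terms, the wrapper-plus-explicit-decoding split looks roughly like this (a sketch using a made-up temp-file path; runs on Python 3):

```python
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), 'demo_utf8.txt')

# Writing in text mode: the wrapper encodes the (unicode) str for us,
# using the encoding we specify.
with open(path, 'w', encoding='utf-8') as f:
    f.write('Fran\u00e7ais')

# Reading at the lowest level (binary mode) returns bytes...
with open(path, 'rb') as f:
    raw = f.read()

# ...which must be decoded explicitly to get a (unicode) str back.
text = raw.decode('utf-8')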

So you won't be able to get away with ignoring encoding
issues in py3k. On the plus side, it should all be handled
in a much more consistent and less error-prone way. If
you mistakenly try to use encoded data as though it were
decoded data or vice versa, you'll get a type error.
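A quick illustration of that type error (Python 3 behaviour):

```python
# Combining decoded text (str) with encoded data (bytes) fails
# immediately instead of silently producing garbage.
try:
    'abc' + b'def'
    mixed_ok = True
except TypeError:
    mixed_ok = False
```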

--
Greg
Nov 26 '07 #6

greg wrote:
> paul wrote:
>> However, this will change in py3k...
>> what's the new rule of thumb?
[snip]
> So you won't be able to get away with ignoring encoding
> issues in py3k. On the plus side, it should all be handled
> in a much more consistent and less error-prone way. If
> you mistakenly try to use encoded data as though it were
> decoded data or vice versa, you'll get a type error.
Thanks for your detailed answer. In fact, having encode() only for <str>
and decode() only for <bytes> will simplify things a lot. I guess the
implicit encode() of <str> when using print() will stay, but having
utf-8 as the new default encoding will reduce the number of
UnicodeErrors. You'll get weird characters instead ;)

cheers
Paul

Nov 26 '07 #7
