A question about Chinese characters in a Python program

Hope you all had a nice weekend.

I have a question that I hope someone can help me with. I want to run a Python program that uses Tkinter for the user interface (GUI). The program lets me type Chinese characters, but nevertheless fails to display them on screen. The following is part of the error message I received after I closed the program:

"Could not write output: <type 'exceptions.UnicodeEncodeError'>, 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)"

Any suggestion will be appreciated.

Sincerely,

Liang
Liang Chen,Ph.D.
Assistant Professor
University of Georgia
Communication Sciences and Special Education
542 Aderhold Hall
Athens, GA 30602

Phone: 706-542-4566
Oct 20 '08 #1
est
Personally I call it a serious bug in Python, but sadly most Python
community members do not agree. It may be an internal str() call that
caused this issue.

https://groups.google.com/group/comp...6ade6b6f5f3052
http://bugs.python.org/issue3648

Oct 20 '08 #2
On 20 Oct, 07:32, est <electronix...@gmail.com> wrote:
> Personally I call it a serious bug in python
Normally I'd entertain the possibility of bugs in Python, but your
reasoning is a bit thin (in http://bugs.python.org/issue3648): "Why
cann't Python just define ascii to range(256)"

I do accept that it can be awkward to output text to the console, for
example, but you have to consider that the console might not be
configured to display any character you can throw at it. My console is
configured for ISO-8859-15 (something like your magical "ascii to
range(256)" only where someone has to decide what those 256 characters
actually are), but that isn't going to help me display CJK characters.
A solution might be to generate UTF-8 and then get the user to display
the output in an appropriately configured application, but even then
someone has to say that it's UTF-8 and not some other encoding that's
being used. As discussed in another recent thread, Python 2.x does
make some reasonable guesses about such matters to the extent that
it's possible automatically (without magical knowledge).

There is also the problem about use of the "str" built-in function or
any operation where some Unicode object may be converted to a plain
string. It is now recommended that you only convert to plain strings
when you need to produce a sequence of bytes (for output, for
example), and that you indicate how the Unicode values are encoded as
bytes (by specifying an encoding). Python 3.x doesn't really change
this: it just makes the Unicode/text vs. bytes distinction more
obvious.
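
A minimal sketch of the point above, assuming Python 3, where `str` is the Unicode type and `bytes` is the byte string (the thread itself concerns Python 2.x). The sample text and variable names are illustrative assumptions. It reproduces the ASCII failure from the original question and shows what naming the encoding explicitly looks like:

```python
# Assumes Python 3: str holds Unicode text, bytes holds encoded data.
text = "中文"  # sample Chinese text (an assumption, not from the thread)

# Encoding with ASCII fails, because ASCII only defines code points 0-127.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError as exc:
    ascii_failed = True
    print("ASCII failed:", exc)

# Naming the encoding explicitly produces a byte sequence that any reader
# using the same encoding can turn back into the original text.
utf8_bytes = text.encode("utf-8")
assert utf8_bytes.decode("utf-8") == text
```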

Paul
Oct 20 '08 #3
est
Thanks for the long comment Paul, but it didn't help massive errors in
Python encoding.

IMHO it's even better to output wrong encodings rather than halt the
WHOLE damn program by an exception

When debugging encoding problems, the solution is simple. If
characters display wrong, switch to another encoding, one of them must
be right.

But it's tiring in python to deal with encodings, you have to wrap
EVERY SINGLE character expression with try ... except ... just imagine
what pain it is.

Just like the example I gave in Google Groups, u'\ue863' can NEVER be
encoded into '\xfe\x9f'. Not a chance, because Python REFUSES to handle
a byte that falls outside range(128).

Strangely the 'mbcs' encoding system can. Does 'mbcs' have magic or
something? But it's Windows-specific

Dealing with character encodings is really simple. AFAIK the early
encodings before Unicode, although they have many names, are all based
on hacks. Take Chinese characters as an example. The encoding is called
GB2312, and in fact it is totally compatible with range(256)
ANSI. (There are minor issues, like half of a wide character being
displayed as a question mark ?, but at least it's readable.) If you just
output a series of bytes, it IS GB2312. The same is true of BIG5, JIS,
etc.
Like I said, str() should NOT throw an exception BY DESIGN, it's a
basic language standard. str() is not only a convert-to-string
function, but also a serialization in most cases (e.g. socket). My
simple suggestion is: if it's a unicode character, output as UTF-8;
otherwise just output the byte array, please do not encode it with really
stupid range(128) ASCII. It's not guessing, it's totally wrong.
Oct 20 '08 #4
On 20 Oct, 15:30, est <electronix...@gmail.com> wrote:
> Thanks for the long comment Paul, but it didn't help massive errors in
> Python encoding.
>
> IMHO it's even better to output wrong encodings rather than halt the
> WHOLE damn program by an exception
I disagree. Maybe I'll now get round to uploading an amusing pictorial
example of this strategy just to illustrate where it can lead. CJK
characters may be more demanding to deal with than various European
characters, but I've seen public advertisements (admittedly aimed at
IT course applicants) which made jokes about stuff like "Ã¥" and "Ã¸"
appearing in documents instead of the intended European characters, so
it's fairly safe to say that people do care what gets written out from
computer programs.
> When debugging encoding problems, the solution is simple. If
> characters display wrong, switch to another encoding, one of them must
> be right.
>
> But it's tiring in python to deal with encodings, you have to wrap
> EVERY SINGLE character expression with try ... except ... just imagine
> what pain it is.
If everything is in Unicode then you don't have to think about
encodings. I recommend using things like codecs.open to ensure that
input and output even produce and consume Unicode objects when dealing
with files.
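
A minimal sketch of this advice, assuming Python 3. It uses `io.open` with an `encoding` argument instead of `codecs.open`; that substitution, the temporary file name, and the sample text are assumptions made for illustration. The file object does the encoding and decoding, so the rest of the program only handles Unicode text:

```python
import io
import os
import tempfile

text = "中文 and ASCII"  # sample text (an assumption)

# Hypothetical scratch file used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")

# The encoding is stated once, where the data leaves the program...
with io.open(path, "w", encoding="utf-8") as f:
    f.write(text)

# ...and once where it comes back in. read_back is Unicode text again.
with io.open(path, "r", encoding="utf-8") as f:
    read_back = f.read()
```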
> Just like the example I gave in Google Groups, u'\ue863' can NEVER be
> encoded into '\xfe\x9f'. Not a chance, because python REFUSE to handle
> a byte that is greater than range(128).
Aside from the matter of which encoding you'd need to use to convert
u'\ue863' into '\xfe\x9f', it has nothing to do with any implicit byte
value range. To get from a Unicode object to a sequence of bytes
(since that is the external representation of the text for other
programs), Python has to perform a conversion. As a safe (but
obviously conservative) default, Python only attempts to convert each
Unicode character to a byte value using the ASCII character value
table which is only defined for characters 0 to 127 - there's no such
thing as "8-bit ASCII".

Python doesn't attempt to automatically convert using other character
tables (encodings, in other words), since there is quite a large
possibility that the result, if not produced for the correct encoding,
will not produce the desired visual effect. If I start with, say,
character "ø" and encode it using UTF-8, I get a sequence of bytes
which, if interpreted by a program expecting ISO-8859-15 will appear
as "Ã¸". If I encode the character using ISO-8859-15 and then feed the
resulting byte sequence to a program expecting UTF-8, it will probably
either complain or produce an incorrect visual effect. The reason why
ASCII is safer (although not entirely safe) is because many encodings
support ASCII as a subset of themselves.
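
The mismatch described above can be sketched as follows, assuming Python 3 and its built-in `utf-8` and `iso-8859-15` codecs. The variable names are illustrative. The exact characters that appear depend on the encoding the reader assumes, so no particular garbled output is claimed here:

```python
ch = "ø"

# UTF-8 bytes read by a program that assumes ISO-8859-15: no error is
# raised, but the two bytes show up as two unrelated characters.
garbled = ch.encode("utf-8").decode("iso-8859-15")

# ISO-8859-15 bytes read by a program that assumes UTF-8: the single
# byte 0xF8 is not valid UTF-8, so a strict decoder complains.
try:
    ch.encode("iso-8859-15").decode("utf-8")
    utf8_complained = False
except UnicodeDecodeError:
    utf8_complained = True

# Plain ASCII text yields identical bytes in all three encodings, which
# is why ASCII is the comparatively safe default.
same_for_ascii = (
    "abc".encode("ascii")
    == "abc".encode("utf-8")
    == "abc".encode("iso-8859-15")
)
```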
> Strangely the 'mbcs' encoding system can. Does 'mbcs' have magic or
> something? But it's Windows-specific
I thought Microsoft used some UTF-16 variant. That would explain how
it can handle more or less everything.
> Dealing with character encodings is really simple. AFAIK early
> encoding before Unicode, although they have many names, are all based
> on hacks. Take Chinese characters as an example. They are called
> GB2312 encoding, in fact it is totally compatible with range(256)
> ANSI. (There are minor issues like display half of a wide-character in
> a question mark ? but at least it's readable) If you just output
> serials of byte array, it IS GB2312. The same is true with BIG5, JIS,
> etc.
From the Wikipedia page, it appears that you need to convert GB2312
values to EUC-CN by a relatively straightforward process, and can then
output the resulting byte sequence in an ASCII compatible way,
provided that you filter out all the byte values greater than 127:
these filtered bytes would produce nonsense for anyone using a program
not expecting EUC-CN. UTF-8 has some similar properties, but as I
noted above, you wouldn't want to read most of the output if your
program wasn't expecting UTF-8.
> Like I said, str() should NOT throw an exception BY DESIGN, it's a
> basic language standard. str() is not only a convert to string
> function, but also a serialization in most cases.(e.g. socket) My
> simple suggestion is: If it's a unicode character, output as UTF-8;
> other wise just ouput byte array, please do not encode it with really
> stupid range(128) ASCII. It's not guessing, it's totally wrong.
I think it's unfortunate that "str" is now potentially unreliable for
certain uses, but to just output an arbitrary byte sequence (unless by
byte array you mean a representation of the numeric values) is the
wrong thing to do unless you don't care about the output; in which
case, you could just as well use "repr" instead. I think the output of
"str" vs. "unicode" especially with regard to Unicode objects was
discussed extensively on the python-dev mailing list at one point.

I don't disagree that people sometimes miss a way of having Python or
some library "do the right thing" when writing stuff out. I could
imagine a wrapper for Python accepting UTF-8 whose purpose is to
"blank out" characters which the console cannot handle, and people
might use this wrapper explicitly because that is the "right thing"
for them. Indeed, such a program may already exist for a more general
audience since I imagine that it could be fairly useful.
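
One way such a "blank out" step might look, as a rough sketch in Python 3. The function name `blank_out`, its parameters, and the choice of `?` as the placeholder are all hypothetical; this is not an existing tool:

```python
def blank_out(text, encoding, placeholder="?"):
    """Replace every character the target encoding cannot represent."""
    out = []
    for ch in text:
        try:
            ch.encode(encoding)      # can the console's encoding hold it?
            out.append(ch)
        except UnicodeEncodeError:
            out.append(placeholder)  # no: blank it out explicitly
    return "".join(out)


# A console configured for ISO-8859-15 can show "ø" but not Chinese text.
cleaned = blank_out("abc 中文 ø", "iso-8859-15")
```

The point of making this a separate, explicitly called step is that the program author chooses lossy output deliberately, rather than the language doing it silently.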

Paul
Oct 20 '08 #5
On Mon, 20 Oct 2008 06:30:09 -0700, est wrote:
> Like I said, str() should NOT throw an exception BY DESIGN, it's a basic
> language standard.
int() is also a basic language standard, but it is perfectly acceptable
for int() to raise an exception if you ask it to convert something into
an integer that can't be converted:

int("cat")

What else would you expect int() to do but raise an exception?

If you ask str() to convert something into a string which can't be
converted, then what else should it do other than raise an exception?
Whatever answer you give, somebody else will argue it should do another
thing. Maybe I want failed characters replaced with '?'. Maybe Fred wants
failed characters deleted altogether. Susan wants UTF-16. George wants
Latin-1.

The simple fact is that there is no 1:1 mapping from all 65,000+ Unicode
characters to the 256 bytes used by byte strings, so there *must* be an
encoding, otherwise you don't know which characters map to which bytes.

ASCII has the advantage of being the lowest common denominator. Perhaps
it doesn't make too many people very happy, but it makes everyone equally
unhappy.
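
The competing wishes listed above each correspond to an explicit choice the caller can make. A minimal sketch, assuming Python 3 and its standard codec error handlers (`replace`, `ignore`); the sample text is an assumption:

```python
text = "中文"

# "Failed characters replaced with '?'"
replaced = text.encode("ascii", errors="replace")

# "Failed characters deleted altogether"
dropped = text.encode("ascii", errors="ignore")

# "UTF-16": every character is representable (output starts with a BOM).
as_utf16 = text.encode("utf-16")

# "Latin-1": has no Chinese characters, so strict encoding still fails.
try:
    text.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
```

None of these results is the single right answer, which is the argument for making the caller state the encoding and error policy.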
> str() is not only a convert to string function, but
> also a serialization in most cases.(e.g. socket) My simple suggestion
> is: If it's a unicode character, output as UTF-8;
Why UTF-8? That will never do. I want it output as UCS-4.

> other wise just ouput
> byte array, please do not encode it with really stupid range(128) ASCII.
> It's not guessing, it's totally wrong.
If you start with a byte string, you can always get a byte string:

>>> s = '\x96 \xa0 \xaa'  # not ASCII characters
>>> s
'\x96 \xa0 \xaa'
>>> str(s)
'\x96 \xa0 \xaa'

--
Steven

Oct 20 '08 #6
est
In fact Python handles characters better than most other open-source
programming languages. But still:

1. You can explain str() in 1000 ways, but there are 1001 more confusing
errors in all kinds of Python apps. (Not only some of the scripts I've
written, but also well-known apps like Boa Constructor:
http://i36.tinypic.com/1gqekh.jpg. This sucks hard, right?)
2. Can anyone kindly tell me how I can define a customized encoding
(namely 'ansi') which handles range(256), so I can call
sys.setdefaultencoding('ansi') once and for all?
Oct 20 '08 #7
On Sun, 19 Oct 2008 22:32:20 -0700, est wrote:
> Personally I call it a serious bug in python, but sadly most of python
> community members do not agree. It may be a internal str() that
> caused this issue.
No, it's not a bug; it is the correct behavior, although some people
might not immediately grasp the reasons why it is correct and why
defining ascii as range(256) is plain wrong.

Anyway, if you haven't noticed, str() is capable of emitting all
characters in range(256), e.g. str('\xff'). ASCII, though, doesn't allow
that: ASCII is a 7-bit encoding. Latin-1, ANSI, and other ASCII
extensions are 8-bit encodings, but ASCII itself is not.
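
The 7-bit versus 8-bit distinction can be checked directly. A minimal sketch, assuming Python 3 and its built-in `latin-1` and `ascii` codecs; variable names are illustrative:

```python
all_bytes = bytes(range(256))

# Latin-1 assigns a character to every byte value, so any byte string
# decodes without error and encodes back to the same bytes.
as_text = all_bytes.decode("latin-1")
round_trip = as_text.encode("latin-1")

# ASCII stops at 127, so the same bytes are rejected from 0x80 onwards.
try:
    all_bytes.decode("ascii")
    ascii_error_start = None
except UnicodeDecodeError as exc:
    ascii_error_start = exc.start  # index of the first offending byte
```

Note that this only maps bytes onto the first 256 code points; it does not make GB2312 or Big5 bytes decode to Chinese characters.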

Oct 20 '08 #8
est <el***********@gmail.com> writes:
> IMHO it's even better to output wrong encodings rather than halt the
> WHOLE damn program by an exception
I can't agree with this. The correct thing to do in the face of
ambiguity is for Python to refuse to guess.
> When debugging encoding problems, the solution is simple. If
> characters display wrong, switch to another encoding, one of them
> must be right.
That's debugging problems not in the program but in the *data*, which
Python is helping with by making the problems apparent as soon as
feasible to do so.
> But it's tiring in python to deal with encodings, you have to wrap
> EVERY SINGLE character expression with try ... except ... just imagine
> what pain it is.
That sounds like a rather poor program design. Much better to sanitise
the inputs to the program at a few well-defined points, and know from
that point that the program is dealing internally with Unicode.
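
A rough sketch of that design, assuming Python 3. The helper names `read_input` and `write_output`, and the choice of GB2312 for input and UTF-8 for output, are hypothetical and only illustrate decoding once on the way in and encoding once on the way out:

```python
def read_input(raw_bytes, encoding):
    """Decode once, at the point where bytes enter the program."""
    return raw_bytes.decode(encoding)


def write_output(text, encoding):
    """Encode once, at the point where text leaves the program."""
    return text.encode(encoding)


incoming = "中文".encode("gb2312")       # stands in for bytes read from a file
text = read_input(incoming, "gb2312")   # Unicode from here on
shouted = text + "!"                    # internal work needs no try/except
outgoing = write_output(shouted, "utf-8")
```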
> Dealing with character encodings is really simple.
Given that your solutions are baroque and complicated, I don't think
even you yourself can believe that statement.
> Like I said, str() should NOT throw an exception BY DESIGN, it's a
> basic language standard.
Any code should throw an exception if the input is both ambiguous and
invalid by the documented specification.
> str() is not only a convert to string function, but also a
> serialization in most cases.(e.g. socket) My simple suggestion is:
> If it's a unicode character, output as UTF-8; other wise just ouput
> byte array, please do not encode it with really stupid range(128)
> ASCII. It's not guessing, it's totally wrong.
Your assumption would require that UTF-8 be a lowest *common*
denominator for most output devices Python will be connected to.
That's simply not the case; the lowest common denominator is still
ASCII.

I yearn for a future where all output devices can be assumed, in the
absence of other information, to understand a common Unicode encoding
(e.g. UTF-8), but we're not there yet and it would be a grave mistake
for Python to falsely behave as though we were.

--
 \ “I went to a fancy French restaurant called ‘Déjà Vu’. The head |
`\ waiter said, ‘Don't I know you?’” —Steven Wright |
_o__) |
Ben Finney
Oct 20 '08 #9
On Oct 21, 1:45 am, Paul Boddie <p...@boddie.org.uk> wrote:
> From the Wikipedia page, it appears that you need to convert GB2312
> values to EUC-CN by a relatively straightforward process, and can then
> output the resulting byte sequence in an ASCII compatible way,
> provided that you filter out all the byte values greater than 127:
> these filtered bytes would produce nonsense for anyone using a program
> not expecting EUC-CN. UTF-8 has some similar properties, but as I
> noted above, you wouldn't want to read most of the output if your
> program wasn't expecting UTF-8.
What the Wikipedia page doesn't say is that the number of people who
grok the concept of a GB2312 codepoint is vanishingly small, and the
number of people who would actually have GB2312 codepoints in a file
is smaller still. When people say their data is GB2312, they mean
"GB<something> encoded as EUC-CN". So the relatively straightforward
process is not required in practice.

I don't understand the point or value of filtering out all byte values
greater than 127:

If the data is really GB2312, this would throw out all the Chinese
characters.

If the GB<something> is, as is likely, really GBK aka cp936 (a
superset of GB2312), then the second byte of a Chinese character may
be in the ASCII range, and the result of the filter would comprise the
true ASCII characters plus some garbage ASCII characters.
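
Both cases can be checked with a short sketch, assuming Python 3 and its built-in `gb2312` and `gbk` codecs. The sample text, the variable names, and the search over the basic CJK block are illustrative assumptions:

```python
text = "中文"
gb2312_bytes = text.encode("gb2312")

# Every byte of a GB2312 (EUC-CN) Chinese character is above 127, so a
# "drop bytes > 127" filter removes the Chinese text entirely.
filtered = bytes(b for b in gb2312_bytes if b < 0x80)

# In GBK (cp936) the second byte of some characters falls in the ASCII
# range. Search the basic CJK block for the first such character.
example = None
for code_point in range(0x4E00, 0x9FA6):
    encoded = chr(code_point).encode("gbk", errors="ignore")
    if len(encoded) == 2 and encoded[1] < 0x80:
        example = (chr(code_point), encoded)
        break

# The same filter would keep that second byte as a stray ASCII character.
stray = bytes(b for b in example[1] if b < 0x80) if example else b""
```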

Oct 21 '08 #10
