unicode encoding usability problem

I have long found Python's default encoding of strict ASCII frustrating.
For one thing, I would rather get garbage characters than an exception. But the
biggest issue is that Unicode exceptions often pop up in unexpected places, and
only when a non-ASCII or unicode character first finds its way into the
system.

Below is an example. The program may run fine at the beginning. But as
soon as a unicode string u'b' is introduced, the program blows up
unexpectedly.

>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> a='\xe5'
>>> # can print, you think you're ok ...
... print a
å
>>> b=u'b'
>>> a==b
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0:
ordinal not in range(128)

One may suggest the correct way to do it is to use decode, such as

a.decode('latin-1') == b
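
To make the failure concrete: in Python 2 the comparison a == b first promotes the byte string to unicode using the default encoding, which amounts to a.decode('ascii'). Below is a minimal sketch of that hidden step, written for Python 3 syntax (byte strings are spelled b'...' and text strings need no u prefix). The helper name coerce_like_python2 and the variable names are made up for illustration; this is not how the interpreter is implemented.

```python
def coerce_like_python2(raw, default_encoding="ascii"):
    """Spell out the implicit str -> unicode promotion (hypothetical helper)."""
    return raw.decode(default_encoding)


byte_str = b"\xe5"   # the byte string 'a' from the example
text = "b"           # the unicode string 'b' from the example

# Pure-ASCII bytes are promoted silently, so ASCII-only tests pass.
ascii_ok = coerce_like_python2(b"b") == text

# The first non-ASCII byte makes the hidden decode fail.
error = None
try:
    coerce_like_python2(byte_str)
except UnicodeDecodeError as exc:
    error = exc

# An explicit decode with the right encoding succeeds.
explicit = byte_str.decode("latin-1")

print(ascii_ok, type(error).__name__, repr(explicit))
```
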
This brings up another issue. Most references and books focus exclusively on
entering unicode literals and using the encode/decode methods. The fallacy
is that the string is such a basic data type, used throughout a program, that
you really don't want to make an individual decision every time you use a
string (and pay a penalty for any negligence). Java has a much more
usable model, with unicode used internally and encoding/decoding decisions
needed only twice, when dealing with input and output.

I am sure these errors are a nuisance to those who are only half aware of
unicode. Even for those who choose to use unicode, it is almost impossible
to ensure their programs work correctly.
Jul 18 '05 #1
anonymous coward <au******@gmail.com> wrote:

> This brings up another issue. Most references and books focus exclusively on
> entering unicode literals and using the encode/decode methods. The fallacy
> is that the string is such a basic data type, used throughout a program, that
> you really don't want to make an individual decision every time you use a
> string (and pay a penalty for any negligence). Java has a much more
> usable model, with unicode used internally and encoding/decoding decisions
> needed only twice, when dealing with input and output.

that's how you should do things in Python too, of course. a unicode string
uses unicode internally. decode on the way in, encode on the way out, and
things just work.

the fact that you can mess things up by mixing unicode strings with binary
strings doesn't mean that you have to mix unicode strings with binary strings
in your program.
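
The "decode on the way in, encode on the way out" pattern can be sketched as below. This is only an illustration under assumptions: it uses Python 3 syntax, the function names (read_text, write_text, shout) are hypothetical, and in-memory byte streams stand in for real files or sockets.

```python
import io


def read_text(stream, encoding):
    """Input boundary: bytes come in, text goes on to the rest of the program."""
    return stream.read().decode(encoding)


def write_text(stream, text, encoding):
    """Output boundary: text is encoded once, just before it leaves."""
    stream.write(text.encode(encoding))


def shout(text):
    """Internal logic sees only text and never mentions an encoding."""
    return text.upper()


source = io.BytesIO("smörgåsbord".encode("latin-1"))  # stand-in for an input file
sink = io.BytesIO()                                   # stand-in for an output file

text = read_text(source, "latin-1")   # decode on the way in
result = shout(text)                  # work in unicode
write_text(sink, result, "utf-8")     # encode on the way out

print(sink.getvalue().decode("utf-8"))
```

Only the two boundary functions know about encodings; everything between them handles text alone, which is the discipline described above.
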

> Even for those who choose to use unicode, it is almost impossible
> to ensure their programs work correctly.

well, if you use unicode the way it was intended to, it just works.

</F>

Jul 18 '05 #2
On Fri, 18 Feb 2005 19:24:10 +0100, Fredrik Lundh <fr*****@pythonware.com>
wrote:

> that's how you should do things in Python too, of course. a unicode string
> uses unicode internally. decode on the way in, encode on the way out, and
> things just work.
>
> the fact that you can mess things up by mixing unicode strings with binary
> strings doesn't mean that you have to mix unicode strings with binary strings
> in your program.


I don't want to mix them. But how can I find them? How do I know that this
statement is a potential problem

if a==b:

when a and b may be instantiated individually, far away from the line of
code that puts them together?

In Java they are distinct data types and the compiler would catch all
incorrect usage. In Python, the interpreter seems to 'help' us by promoting
binary strings to unicode. Things work fine and unit tests pass, right up until
the first non-ASCII characters come in, and then the program breaks.

Is there a scheme Python developers can use to keep themselves safe from
incorrect mixing?
Jul 18 '05 #3
aurora wrote:
> [...]
> In Java they are distinct data types and the compiler would catch all
> incorrect usage. In Python, the interpreter seems to 'help' us by
> promoting binary strings to unicode. Things work fine and unit tests pass,
> right up until the first non-ASCII characters come in, and then the
> program breaks.
>
> Is there a scheme Python developers can use to keep themselves safe
> from incorrect mixing?


Put the following:

import sys
sys.setdefaultencoding("undefined")

in a file named sitecustomize.py somewhere in your Python path and
Python will complain whenever there's an implicit conversion between
str and unicode.

HTH,
Walter Dörwald
Jul 18 '05 #4
Fredrik Lundh wrote:

>> This brings up another issue. Most references and books focus exclusively on
>> entering unicode literals and using the encode/decode methods. The fallacy
>> is that the string is such a basic data type, used throughout a program, that
>> you really don't want to make an individual decision every time you use a
>> string (and pay a penalty for any negligence). Java has a much more
>> usable model, with unicode used internally and encoding/decoding decisions
>> needed only twice, when dealing with input and output.
>
> that's how you should do things in Python too, of course. a unicode string
> uses unicode internally. decode on the way in, encode on the way out, and
> things just work.


There are implementations of Python where it isn't so easy; Python for
iSeries (http://www.iseriespython.com/) is one of them. Code written
for a "normal" platform doesn't work on AS/400, even if all strings used
internally are unicode objects; unicode literals don't work as
expected either.

Of course, this is the implementation's fault, but it is a headache if you
need to write portable code.

--
Jarek Zgoda
http://jpa.berlios.de/ | http://www.zgodowie.org/
Jul 18 '05 #5
Walter Dörwald <wa****@livinglogic.de> writes:

> aurora wrote:
>> [...]
>> In Java they are distinct data types and the compiler would catch all
>> incorrect usage. In Python, the interpreter seems to 'help' us by
>> promoting binary strings to unicode. Things work fine and unit tests pass,
>> right up until the first non-ASCII characters come in, and then the
>> program breaks.
>>
>> Is there a scheme Python developers can use to keep themselves safe
>> from incorrect mixing?
>
> Put the following:
>
> import sys
> sys.setdefaultencoding("undefined")
>
> in a file named sitecustomize.py somewhere in your Python path and
> Python will complain whenever there's an implicit conversion between
> str and unicode.


Sounds cool, so I did it.
And started a program I was currently working on.
The first function in it is this:

if sys.platform == "win32":

    def _locate_gccxml():
        import _winreg
        for subkey in (r"Software\gccxml", r"Software\Kitware\GCC_XML"):
            for root in (_winreg.HKEY_CURRENT_USER, _winreg.HKEY_LOCAL_MACHINE):
                try:
                    hkey = _winreg.OpenKey(root, subkey, 0, _winreg.KEY_READ)
                except WindowsError, detail:
                    if detail.errno != 2:
                        raise
                else:
                    return _winreg.QueryValueEx(hkey, "loc")[0] + r"\bin"

    loc = _locate_gccxml()
    if loc:
        os.environ["PATH"] = loc

All strings in that snippet are text strings, so the first approach was
to convert them to unicode literals. Doesn't work. Here is the final,
working version (changes are marked):

if sys.platform == "win32":

    def _locate_gccxml():
        import _winreg
        for subkey in (r"Software\gccxml", r"Software\Kitware\GCC_XML"):
            for root in (_winreg.HKEY_CURRENT_USER, _winreg.HKEY_LOCAL_MACHINE):
                try:
                    hkey = _winreg.OpenKey(root, subkey, 0, _winreg.KEY_READ)
                except WindowsError, detail:
                    if detail.errno != 2:
                        raise
                else:
                    return _winreg.QueryValueEx(hkey, "loc")[0] + ur"\bin"
#-----------------------------------------------------------------^
    loc = _locate_gccxml()
    if loc:
        os.environ["PATH"] = loc.encode("mbcs")
#--------------------------------^

So, it appears that:

- the _winreg.QueryValueEx function is strange: it takes ascii strings,
  but returns a unicode string.
- _winreg.OpenKey takes ascii strings
- os.environ["PATH"] accepts an ascii string.

And I won't guess what happens when there are registry entries with
umlauts (ok, they could be handled by the 'mbcs' encoding), or with Chinese
or Japanese characters (no way to represent them in ascii strings with a
western locale and mbcs encoding, afaik).

I suggest that 'sys.setdefaultencoding("undefined")' be the standard
setting for the core developers ;-)
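
The umlaut versus Chinese/Japanese concern can be sketched as follows. Since 'mbcs' exists only on Windows, this Python 3 sketch assumes cp1252 as a stand-in for the ANSI code page of a western locale; the two paths and the helper name encodable are made up for illustration.

```python
def encodable(text, encoding):
    """Return True if text can be encoded to encoding without error."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


ANSI_CODE_PAGE = "cp1252"  # assumption: stand-in for 'mbcs' on a western locale

umlaut_path = "C:\\Programme\\gccxml für alle\\bin"    # hypothetical registry value
japanese_path = "C:\\プログラム\\gccxml\\bin"           # hypothetical registry value

umlaut_ok = encodable(umlaut_path, ANSI_CODE_PAGE)
japanese_ok = encodable(japanese_path, ANSI_CODE_PAGE)
japanese_utf8_ok = encodable(japanese_path, "utf-8")
print(umlaut_ok, japanese_ok, japanese_utf8_ok)
```

The umlaut fits the single-byte code page while the Japanese path does not, which is why encoding to a locale-specific byte string before storing the value in os.environ cannot work for every registry entry.
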

Thomas
Jul 18 '05 #6
aurora wrote:
> Java has a much more usable model, with unicode used internally and
> encoding/decoding decisions needed only twice, when dealing with input
> and output.


In addition to Fredrik's comment (that you should use the same model
in Python) and Walter's comment (that you can enforce it by setting
the default encoding to "undefined"), I'd like to point out the
historical reason: Python predates Unicode, so the byte string type
has many convenience operations that you would only expect of
a character string.

We have come up with a transition strategy, allowing existing
libraries to widen their support from byte strings to character
strings. This isn't a simple task, so many libraries still expect
and return byte strings, when they should process character strings.
Instead of breaking the libraries right away, we have defined
a transitional mechanism, which allows Unicode support to be added
to libraries as the need arises. This transition is still in
progress.

Eventually, the primary string type should be the Unicode
string. If you are curious how far we are still off that goal,
just try running your program with the -U option.

Regards,
Martin
Jul 18 '05 #7
Walter Dörwald wrote:

>> Is there a scheme Python developers can use to keep themselves safe
>> from incorrect mixing?
>
> Put the following:
>
> import sys
> sys.setdefaultencoding("undefined")
>
> in a file named sitecustomize.py somewhere in your Python path and
> Python will complain whenever there's an implicit conversion between
> str and unicode.


This will help in your own code, but there is a big pile of modules in the
stdlib that are not unicode-friendly. From my daily practice come shlex
(the tokenizer works only with encoded strings) and logging (you can't
specify encoding for FileHandler).
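
For comparison, here is a minimal sketch of the same two modules under Python 3, where shlex tokenizes text strings directly and logging.FileHandler accepts an encoding argument. It does not change the situation described for the Python versions discussed in this thread. The sample sentence, the logger name and the temporary file are made up for illustration.

```python
import logging
import os
import shlex
import tempfile

# shlex works on text, so no encode/decode round trip is needed.
tokens = shlex.split('zażółć "gęślą jaźń"')

# FileHandler takes the encoding to use for the log file.
fd, log_path = tempfile.mkstemp(suffix=".log")
os.close(fd)
handler = logging.FileHandler(log_path, encoding="utf-8")
logger = logging.getLogger("unicode_demo")  # hypothetical logger name
logger.propagate = False
logger.addHandler(handler)
logger.warning("token: %s", tokens[0])
logger.removeHandler(handler)
handler.close()

# Read the raw bytes back and decode them explicitly to check the encoding.
with open(log_path, "rb") as log_file:
    logged = log_file.read().decode("utf-8")
os.remove(log_path)

print(tokens)
print(logged)
```
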

--
Jarek Zgoda
http://jpa.berlios.de/ | http://www.zgodowie.org/
Jul 18 '05 #8
"Martin v. Löwis" <ma****@v.loewis.de> writes:

> We have come up with a transition strategy, allowing existing
> libraries to widen their support from byte strings to character
> strings. This isn't a simple task, so many libraries still expect
> and return byte strings, when they should process character strings.
> Instead of breaking the libraries right away, we have defined
> a transitional mechanism, which allows Unicode support to be added
> to libraries as the need arises. This transition is still in
> progress.
>
> Eventually, the primary string type should be the Unicode
> string. If you are curious how far we are still off that goal,
> just try running your program with the -U option.


Is it possible to specify a byte string literal when running with the -U option?

Thomas
Jul 18 '05 #9
"Martin v. Löwis" <ma****@v.loewis.de> writes:

> Eventually, the primary string type should be the Unicode
> string. If you are curious how far we are still off that goal,
> just try running your program with the -U option.


Not very far - can't even call functions ;-)

c:\>py -U
Python 2.5a0 (#60, Dec 29 2004, 11:27:13) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> def f(**kw):
...     pass
...
>>> f(**{"a": 0})
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: f() keywords must be strings


Thomas
Jul 18 '05 #10
