Unicode encoding usability problem

aurora

I have long found Python's default encoding of strict ASCII frustrating.
For one thing, I would rather get garbage characters than an exception. But
the biggest issue is that Unicode exceptions often pop up in unexpected
places, and only when a non-ASCII or unicode character first finds its way
into the system.

Below is an example. The program may run fine at the beginning. But as
soon as a unicode string such as u'b' is introduced, the program blows up
unexpectedly.

>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> a='\xe5'
>>> # can print, you think you're ok
... print a
å
>>> b=u'b'
>>> a==b
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0:
ordinal not in range(128)

One may suggest the correct way to do it is to use decode, such as

a.decode('latin-1') == b

This brings up another issue. Most references and books focus exclusively on
entering unicode literals and using the encode/decode methods. The fallacy
is that the string is such a basic data type, used throughout the program,
that you really don't want to make an individual decision every time you use
a string (and pay a penalty for any negligence). Java has a much more
usable model, with unicode used internally and encoding/decoding decisions
needed only twice, when dealing with input and output.

I am sure these errors are a nuisance to those who are only half aware of
unicode. Even for those who choose to use unicode, it is almost impossible
to ensure their programs work correctly.

Jul 18 '05 #1


Fredrik Lundh

anonymous coward <au******@gmail.com> wrote:

> This brings up another issue. Most references and books focus exclusively on
> entering unicode literals and using the encode/decode methods. The fallacy is
> that the string is such a basic data type, used throughout the program, that
> you really don't want to make an individual decision every time you use a
> string (and pay a penalty for any negligence). Java has a much more usable
> model, with unicode used internally and encoding/decoding decisions needed
> only twice, when dealing with input and output.

that's how you should do things in Python too, of course. a unicode string
uses unicode internally. decode on the way in, encode on the way out, and
things just work.

the fact that you can mess things up by mixing unicode strings with binary
strings doesn't mean that you have to mix unicode strings with binary strings
in your program.
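
A minimal sketch of the decode-in, encode-out pattern described above. The helper names `read_name` and `write_name`, and the choice of latin-1 for input and UTF-8 for output, are assumptions made for illustration; the byte value is the one from the first post. It is written to run on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-

def read_name(raw_bytes, encoding="latin-1"):
    """Decode on the way in: bytes from the outside world become text."""
    return raw_bytes.decode(encoding)


def write_name(text, encoding="utf-8"):
    """Encode on the way out: text becomes bytes only at the boundary."""
    return text.encode(encoding)


# Input boundary: the byte 0xe5 arrives from a latin-1 source.
a = read_name(b"\xe5")

# Inside the program everything is a unicode string, so comparing is safe.
b = u"\xe5"
same = (a == b)

# Output boundary: pick the encoding once, when the data leaves the program.
out = write_name(a.upper())
```

Because both operands of the comparison are unicode strings, no implicit ASCII decoding takes place and the comparison cannot raise.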

> Even for those who choose to use unicode, it is almost impossible to ensure
> their programs work correctly.

well, if you use unicode the way it was intended to, it just works.

</F>

Jul 18 '05 #2

aurora

On Fri, 18 Feb 2005 19:24:10 +0100, Fredrik Lundh <fr*****@pythonware.com>
wrote:

> that's how you should do things in Python too, of course. a unicode string
> uses unicode internally. decode on the way in, encode on the way out, and
> things just work.
>
> the fact that you can mess things up by mixing unicode strings with binary
> strings doesn't mean that you have to mix unicode strings with binary strings
> in your program.

I don't want to mix them. But how can I find them? How do I know that this
statement is a potential problem

if a==b:

where a and b can be instantiated individually, far away from the line of
code that puts them together?

In Java they are distinct data types and the compiler would catch all
incorrect usage. In Python, the interpreter seems to 'help' us by promoting
binary strings to unicode. Things work fine, unit tests pass, all until
the first non-ASCII characters come in, and then the program breaks.

Is there a scheme Python developers can use so that they are safe from
incorrect mixing?

Jul 18 '05 #3

Walter Dörwald

aurora wrote:

> [...]
> In Java they are distinct data types and the compiler would catch all
> incorrect usage. In Python, the interpreter seems to 'help' us by
> promoting binary strings to unicode. Things work fine, unit tests pass,
> all until the first non-ASCII characters come in, and then the program
> breaks.
>
> Is there a scheme Python developers can use so that they are safe
> from incorrect mixing?

Put the following:

import sys
sys.setdefaultencoding("undefined")

in a file named sitecustomize.py somewhere in your Python path and
Python will complain whenever there's an implicit conversion between
str and unicode.
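
Where changing the interpreter-wide default is not an option, a similar "fail loudly" effect can be approximated at individual call sites. The sketch below is only an illustration: `require_text` and `same_name` are hypothetical helpers, not part of Python, and the check relies on nothing more than `isinstance`. It runs on both Python 2 and Python 3.

```python
# unicode on Python 2, str on Python 3
text_type = type(u"")


def require_text(value, name="value"):
    """Return value unchanged, or raise if it is not a unicode string."""
    if not isinstance(value, text_type):
        raise TypeError(
            "%s must be a unicode string, got %s" % (name, type(value).__name__)
        )
    return value


def same_name(a, b):
    """Compare two names, refusing to mix byte strings with unicode."""
    return require_text(a, "a") == require_text(b, "b")


ok = same_name(u"b", u"b")

try:
    same_name(b"\xe5", u"b")   # a byte string sneaks in
    rejected = False
except TypeError:
    rejected = True
```

The error now appears for any byte string, including pure-ASCII ones, so unit tests catch the mix-up before the first non-ASCII character arrives.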

HTH,
Walter Dörwald

Jul 18 '05 #4

Jarek Zgoda

Fredrik Lundh wrote:

>> This brings up another issue. Most references and books focus exclusively on
>> entering unicode literals and using the encode/decode methods. The fallacy is
>> that the string is such a basic data type, used throughout the program, that
>> you really don't want to make an individual decision every time you use a
>> string (and pay a penalty for any negligence). Java has a much more usable
>> model, with unicode used internally and encoding/decoding decisions needed
>> only twice, when dealing with input and output.
>
> that's how you should do things in Python too, of course. a unicode string
> uses unicode internally. decode on the way in, encode on the way out, and
> things just work.

There are implementations of Python where it isn't so easy; Python for
iSeries (http://www.iseriespython.com/) is one of them. Code written
for a "normal" platform doesn't work on AS/400 even if all strings used
internally are unicode objects, and unicode literals don't work as
expected either.

Of course, this is the implementation's fault, but it makes for a headache
if you need to write portable code.

--
Jarek Zgoda
http://jpa.berlios.de/ | http://www.zgodowie.org/

Jul 18 '05 #5

Thomas Heller

Walter Dörwald <wa****@livinglogic.de> writes:

> aurora wrote:
>
>> [...]
>> In Java they are distinct data types and the compiler would catch all
>> incorrect usage. In Python, the interpreter seems to 'help' us by
>> promoting binary strings to unicode. Things work fine, unit tests
>> pass, all until the first non-ASCII characters come in, and then the
>> program breaks.
>>
>> Is there a scheme Python developers can use so that they are safe
>> from incorrect mixing?
>
> Put the following:
>
> import sys
> sys.setdefaultencoding("undefined")
>
> in a file named sitecustomize.py somewhere in your Python path and
> Python will complain whenever there's an implicit conversion between
> str and unicode.

Sounds cool, so I did it.
And started a program I was currently working on.
The first function in it is this:

if sys.platform == "win32":

    def _locate_gccxml():
        import _winreg
        for subkey in (r"Software\gccxml", r"Software\Kitware\GCC_XML"):
            for root in (_winreg.HKEY_CURRENT_USER, _winreg.HKEY_LOCAL_MACHINE):
                try:
                    hkey = _winreg.OpenKey(root, subkey, 0, _winreg.KEY_READ)
                except WindowsError, detail:
                    if detail.errno != 2:
                        raise
                else:
                    return _winreg.QueryValueEx(hkey, "loc")[0] + r"\bin"

    loc = _locate_gccxml()
    if loc:
        os.environ["PATH"] = loc

All strings in that snippet are text strings, so the first approach was
to convert them to unicode literals. Doesn't work. Here is the final,
working version (changes are marked):

if sys.platform == "win32":

    def _locate_gccxml():
        import _winreg
        for subkey in (r"Software\gccxml", r"Software\Kitware\GCC_XML"):
            for root in (_winreg.HKEY_CURRENT_USER, _winreg.HKEY_LOCAL_MACHINE):
                try:
                    hkey = _winreg.OpenKey(root, subkey, 0, _winreg.KEY_READ)
                except WindowsError, detail:
                    if detail.errno != 2:
                        raise
                else:
                    return _winreg.QueryValueEx(hkey, "loc")[0] + ur"\bin"
#-----------------------------------------------------------------^
    loc = _locate_gccxml()
    if loc:
        os.environ["PATH"] = loc.encode("mbcs")
#--------------------------------^

So, it appears that:

- the _winreg.QueryValue function is strange: it takes ascii strings,
  but returns a unicode string.
- _winreg.OpenKey takes ascii strings
- the os.environ["PATH"] accepts an ascii string.

And I won't guess what happens when there are registry entries with
umlauts (ok, they could be handled by the 'mbcs' encoding), and with Chinese
or Japanese characters (no way to represent them in ascii strings with a
western locale and mbcs encoding, afaik).

I suggest that 'sys.setdefaultencoding("undefined")' be the standard
setting for the core developers ;-)
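
To illustrate the concern about Chinese or Japanese registry values, here is a small sketch. It assumes cp1252 as a stand-in for the 'mbcs' codec on a western-locale Windows machine (the real 'mbcs' codec exists only on Windows and may substitute characters silently rather than raise), and the value of `name` is made up. It runs on both Python 2 and Python 3.

```python
# A made-up directory name containing two Chinese characters.
name = u"gccxml-\u4e2d\u6587"

# Strict encoding to a western code page fails loudly...
try:
    name.encode("cp1252")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# ...while a lossy encoding silently turns each unmappable character into "?".
lossy = name.encode("cp1252", "replace")

# An encoding that covers all of Unicode round-trips without loss.
roundtrip = name.encode("utf-8").decode("utf-8")
```

Either way the original path cannot be recovered from the western code page bytes, which is why a byte-string-only API is a problem for such values.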

Thomas

Jul 18 '05 #6

Martin v. Löwis

aurora wrote:

> Java has a much more usable model, with unicode used internally and
> encoding/decoding decisions needed only twice, when dealing with input and
> output.

In addition to Fredrik's comment (that you should use the same model
in Python) and Walter's comment (that you can enforce it by setting
the default encoding to "undefined"), I'd like to point out the
historical reason: Python predates Unicode, so the byte string type
has many convenience operations that you would only expect of
a character string.

We have come up with a transition strategy, allowing existing
libraries to widen their support from byte strings to character
strings. This isn't a simple task, so many libraries still expect
and return byte strings, when they should process character strings.
Instead of breaking the libraries right away, we have defined
a transitional mechanism, which allows to add Unicode support
to libraries as the need arises. This transition is still in
progress.

Eventually, the primary string type should be the Unicode
string. If you are curious how far we are still off that goal,
just try running your program with the -U option.

Regards,
Martin

Jul 18 '05 #7

Jarek Zgoda

Walter Dörwald wrote:

>> Is there a scheme Python developers can use so that they are safe
>> from incorrect mixing?
>
> Put the following:
>
> import sys
> sys.setdefaultencoding("undefined")
>
> in a file named sitecustomize.py somewhere in your Python path and
> Python will complain whenever there's an implicit conversion between
> str and unicode.

This will help in your code, but there is a big pile of modules in the
stdlib that are not unicode-friendly. From my daily practice come shlex
(the tokenizer works only with encoded strings) and logging (you can't
specify an encoding for FileHandler).
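
One possible workaround for the logging case, sketched under assumptions: instead of relying on FileHandler, open the file yourself with an explicit encoding and hand the stream to `logging.StreamHandler`. The function name `make_utf8_logger`, the logger name, and the temporary file are hypothetical; the sketch is written against a current Python (it uses the `io` module) and is not a claim about what the stdlib of the time supported.

```python
import io
import logging
import os
import tempfile


def make_utf8_logger(path, name="demo_utf8"):
    """Return a logger whose output file is explicitly UTF-8 encoded."""
    stream = io.open(path, "a", encoding="utf-8")  # encoding chosen once, here
    handler = logging.StreamHandler(stream)
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False  # keep the demo output out of the root logger
    return logger, handler, stream


# Demo: log a unicode message to a temporary file and read it back.
fd, log_path = tempfile.mkstemp(suffix=".log")
os.close(fd)

logger, handler, stream = make_utf8_logger(log_path)
logger.info(u"Walter D\xf6rwald")
handler.flush()
logger.removeHandler(handler)
stream.close()

with io.open(log_path, encoding="utf-8") as f:
    logged = f.read()
os.remove(log_path)
```

The encoding decision is made in one place, where the file is opened, and the rest of the program passes unicode strings to the logger.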

--
Jarek Zgoda
http://jpa.berlios.de/ | http://www.zgodowie.org/

Jul 18 '05 #8

Thomas Heller

"Martin v. Löwis" <ma****@v.loewis.de> writes:

> We have come up with a transition strategy, allowing existing
> libraries to widen their support from byte strings to character
> strings. This isn't a simple task, so many libraries still expect
> and return byte strings, when they should process character strings.
> Instead of breaking the libraries right away, we have defined
> a transitional mechanism, which allows to add Unicode support
> to libraries as the need arises. This transition is still in
> progress.
>
> Eventually, the primary string type should be the Unicode
> string. If you are curious how far we are still off that goal,
> just try running your program with the -U option.

Is it possible to specify a byte string literal when running with the -U option?

Thomas

Jul 18 '05 #9

Thomas Heller

"Martin v. Löwis" <ma****@v.loewis.de> writes:

> Eventually, the primary string type should be the Unicode
> string. If you are curious how far we are still off that goal,
> just try running your program with the -U option.

Not very far - can't even call functions ;-)

c:\>py -U
Python 2.5a0 (#60, Dec 29 2004, 11:27:13) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> def f(**kw):
...     pass
...
>>> f(**{"a": 0})
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: f() keywords must be strings

Thomas

Jul 18 '05 #10
