I have long found Python's default encoding of strict ASCII frustrating.
For one thing, I would rather get garbage characters than an exception. But the
biggest issue is that Unicode exceptions often pop up in unexpected places, and
only when a non-ASCII or unicode character first finds its way into the
system.
Below is an example. The program may run fine at the beginning. But as
soon as a unicode string u'b' is introduced, the program blows up
unexpectedly.
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> a='\xe5'
>>> # can print, you think you're ok
... print a
å
>>> b=u'b'
>>> a==b
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0:
ordinal not in range(128)
One may suggest the correct way to do it is to use decode, such as
a.decode('latin-1') == b
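To make the failure concrete: under Python 2 the comparison first promotes the byte string with the default 'ascii' codec, and that hidden step is what raises. The following is only a sketch, written so that it also runs on Python 3; the helper name implicit_promotion is hypothetical, and Latin-1 is assumed to be the real encoding of the byte.

```python
raw = b'\xe5'  # one byte; it means 'å' only if we assume Latin-1


def implicit_promotion(data):
    """Mimic the hidden str -> unicode step using the 'ascii' default."""
    return data.decode('ascii')


# The implicit route fails as soon as a byte >= 0x80 shows up.
try:
    implicit_promotion(raw)
    promoted = True
except UnicodeDecodeError:
    promoted = False

# The explicit route names the encoding and always succeeds for Latin-1.
text = raw.decode('latin-1')
same = (text == u'\xe5')
```

The explicit decode works, but it has to be repeated wherever the byte string meets a unicode string.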
This brings up another issue. Most references and books focus exclusively on
entering unicode literals and using the encode/decode methods. The fallacy
is that string is such a basic data type, used throughout the program, that you
really don't want to make an individual decision every time you use a
string (and take a penalty for any negligence). Java has a much more
usable model, with unicode used internally and encoding/decoding decisions
needed only twice, when dealing with input and output.
I am sure these errors are a nuisance to those who are only half aware of
unicode. Even for those who choose to use unicode, it is almost impossible
to ensure their programs work correctly.
anonymous coward <au******@gmail.com> wrote:
> This brings up another issue. Most references and books focus exclusively on
> entering unicode literals and using the encode/decode methods. The fallacy
> is that string is such a basic data type, used throughout the program, that you
> really don't want to make an individual decision every time you use a
> string (and take a penalty for any negligence). Java has a much more
> usable model, with unicode used internally and encoding/decoding decisions
> needed only twice, when dealing with input and output.
that's how you should do things in Python too, of course. a unicode string
uses unicode internally. decode on the way in, encode on the way out, and
things just work.
the fact that you can mess things up by mixing unicode strings with binary
strings doesn't mean that you have to mix unicode strings with binary strings
in your program.
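a minimal sketch of that discipline, assuming Latin-1 input and UTF-8 output; the helper names read_text and write_text are made up for the illustration, and io.BytesIO stands in for a real file or socket:

```python
import io


def read_text(stream, encoding):
    """Decode once, on the way in."""
    return stream.read().decode(encoding)


def write_text(stream, text, encoding):
    """Encode once, on the way out."""
    stream.write(text.encode(encoding))


source = io.BytesIO(b'caf\xe9')       # bytes arriving in Latin-1
text = read_text(source, 'latin-1')   # unicode from here on

shout = text.upper()                  # internal work never sees bytes

sink = io.BytesIO()
write_text(sink, shout, 'utf-8')      # bytes leaving as UTF-8
result = sink.getvalue()
```

everything between the two helpers handles unicode only, so there is nothing left to mix.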
> Even for those who choose to use unicode, it is almost impossible to ensure
> their programs work correctly.
well, if you use unicode the way it was intended to, it just works.
</F>
On Fri, 18 Feb 2005 19:24:10 +0100, Fredrik Lundh <fr*****@pythonware.com>
wrote:
> that's how you should do things in Python too, of course. a unicode string
> uses unicode internally. decode on the way in, encode on the way out, and
> things just work.
>
> the fact that you can mess things up by mixing unicode strings with binary
> strings doesn't mean that you have to mix unicode strings with binary strings
> in your program.
I don't want to mix them. But how can I find them? How do I know that this
statement is a potential problem
    if a==b:
where a and b can be instantiated individually, far away from the line of
code that puts them together?
In Java they are distinct data types and the compiler would catch all
incorrect usage. In Python, the interpreter seems to 'help' us by promoting
binary strings to unicode. Things work fine, unit tests pass, all until
the first non-ASCII characters come in and then the program breaks.
Is there a scheme Python developers can use so that they are safe from
incorrect mixing?
aurora wrote:
> [...] In Java they are distinct data types and the compiler would catch all
> incorrect usage. In Python, the interpreter seems to 'help' us by promoting
> binary strings to unicode. Things work fine, unit tests pass, all until
> the first non-ASCII characters come in and then the program breaks.
>
> Is there a scheme Python developers can use so that they are safe from
> incorrect mixing?
Put the following:
    import sys
    sys.setdefaultencoding("undefined")
in a file named sitecustomize.py somewhere in your Python path and
Python will complain whenever there's an implicit conversion between
str and unicode.
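Where changing the interpreter-wide default is not an option, the same fail-fast idea can be approximated by hand at the places where strings meet. This is only a sketch, not part of Python itself: require_text and same_text are hypothetical helpers, and the code assumes an interpreter where the name bytes exists (Python 2.6 or later, where it is an alias for str, or Python 3).

```python
def require_text(value, name="value"):
    """Return value unchanged, but refuse byte strings outright."""
    if isinstance(value, bytes):
        raise TypeError("%s is a byte string; decode it first" % name)
    return value


def same_text(a, b):
    """Compare two strings only after both are known to be text."""
    return require_text(a, "a") == require_text(b, "b")


# A byte string is rejected even if it happens to be pure ASCII,
# so the mistake shows up in the first test run, not in production.
try:
    same_text(b"\xe5", u"b")
    mixed_rejected = False
except TypeError:
    mixed_rejected = True
```

Unlike the sitecustomize.py setting, this only protects the call sites that use the helper.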
HTH,
Walter Dörwald
Fredrik Lundh wrote:
>> This brings up another issue. Most references and books focus exclusively on
>> entering unicode literals and using the encode/decode methods. The fallacy
>> is that string is such a basic data type, used throughout the program, that you
>> really don't want to make an individual decision every time you use a
>> string (and take a penalty for any negligence). Java has a much more
>> usable model, with unicode used internally and encoding/decoding decisions
>> needed only twice, when dealing with input and output.
> that's how you should do things in Python too, of course. a unicode string
> uses unicode internally. decode on the way in, encode on the way out, and
> things just work.
There are implementations of Python where it isn't so easy; Python for
iSeries (http://www.iseriespython.com/) is one of them. Code written
for a "normal" platform doesn't work on AS/400 even if all strings used
internally are unicode objects, and unicode literals don't work as
expected either.
Of course, this is the implementation's fault, but it is a headache if you
need to write portable code.
--
Jarek Zgoda http://jpa.berlios.de/ | http://www.zgodowie.org/
Walter Dörwald <wa****@livinglogic.de> writes:
> aurora wrote:
>> [...] In Java they are distinct data types and the compiler would catch all
>> incorrect usage. In Python, the interpreter seems to 'help' us by promoting
>> binary strings to unicode. Things work fine, unit tests pass, all until
>> the first non-ASCII characters come in and then the program breaks.
>> Is there a scheme Python developers can use so that they are safe from
>> incorrect mixing?
> Put the following:
>     import sys
>     sys.setdefaultencoding("undefined")
> in a file named sitecustomize.py somewhere in your Python path and Python
> will complain whenever there's an implicit conversion between str and unicode.
Sounds cool, so I did it.
And started a program I was currently working on.
The first function in it is this:
    if sys.platform == "win32":
        def _locate_gccxml():
            import _winreg
            for subkey in (r"Software\gccxml", r"Software\Kitware\GCC_XML"):
                for root in (_winreg.HKEY_CURRENT_USER, _winreg.HKEY_LOCAL_MACHINE):
                    try:
                        hkey = _winreg.OpenKey(root, subkey, 0, _winreg.KEY_READ)
                    except WindowsError, detail:
                        # errno 2: key not found, try the next location
                        if detail.errno != 2:
                            raise
                    else:
                        return _winreg.QueryValueEx(hkey, "loc")[0] + r"\bin"
        loc = _locate_gccxml()
        if loc:
            os.environ["PATH"] = loc
All strings in that snippet are text strings, so the first approach was
to convert them to unicode literals. Doesn't work. Here is the final,
working version (changes are marked):
    if sys.platform == "win32":
        def _locate_gccxml():
            import _winreg
            for subkey in (r"Software\gccxml", r"Software\Kitware\GCC_XML"):
                for root in (_winreg.HKEY_CURRENT_USER, _winreg.HKEY_LOCAL_MACHINE):
                    try:
                        hkey = _winreg.OpenKey(root, subkey, 0, _winreg.KEY_READ)
                    except WindowsError, detail:
                        if detail.errno != 2:
                            raise
                    else:
                        return _winreg.QueryValueEx(hkey, "loc")[0] + ur"\bin"
    #-----------------------------------------------------------------^
        loc = _locate_gccxml()
        if loc:
            os.environ["PATH"] = loc.encode("mbcs")
    #--------------------------------^
So, it appears that:
- the _winreg.QueryValueEx function is strange: it takes ascii strings,
  but returns a unicode string.
- _winreg.OpenKey takes ascii strings
- os.environ["PATH"] accepts an ascii string.
And I won't guess what happens when there are registry entries with
umlauts (ok, they could be handled by 'mbcs' encoding), and with chinese
or japanese characters (no way to represent them in ascii strings with a
western locale and mbcs encoding, afaik).
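The umlaut versus Japanese distinction can be checked on any platform with a rough stand-in. The sketch below assumes that cp1252 approximates what 'mbcs' means on a Western European Windows; the real 'mbcs' codec exists only on Windows and may behave differently (for instance by substituting a replacement character instead of raising). The helper encodable and the sample paths are invented for the illustration.

```python
def encodable(text, codec):
    """Report whether text survives a strict encode with codec."""
    try:
        text.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


umlaut_path = u'C:\\K\xf6ln\\bin'                    # contains an o-umlaut
japanese_path = u'C:\\\u65e5\u672c\u8a9e\\bin'       # contains three kanji

umlaut_ok = encodable(umlaut_path, 'cp1252')         # fits the Western code page
japanese_ok = encodable(japanese_path, 'cp1252')     # does not fit
japanese_utf8_ok = encodable(japanese_path, 'utf-8') # a Unicode encoding copes
```

So a byte-string-only API such as the os.environ assignment above can carry the first path but not the second, whatever the program does beforehand.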
I suggest that 'sys.setdefaultencoding("undefined")' be the standard
setting for the core developers ;-)
Thomas
aurora wrote:
> Java has a much more usable model, with unicode used internally and
> encoding/decoding decisions needed only twice, when dealing with input and output.
In addition to Fredrik's comment (that you should use the same model
in Python) and Walter's comment (that you can enforce it by setting
the default encoding to "undefined"), I'd like to point out the
historical reason: Python predates Unicode, so the byte string type
has many convenience operations that you would only expect of
a character string.
We have come up with a transition strategy, allowing existing
libraries to widen their support from byte strings to character
strings. This isn't a simple task, so many libraries still expect
and return byte strings, when they should process character strings.
Instead of breaking the libraries right away, we have defined
a transitional mechanism, which allows Unicode support to be
added to libraries as the need arises. This transition is still in
progress.
Eventually, the primary string type should be the Unicode
string. If you are curious how far we are still off that goal,
just try running your program with the -U option.
Regards,
Martin
Walter Dörwald wrote:
>> Is there a scheme Python developers can use so that they are safe from
>> incorrect mixing?
> Put the following:
>     import sys
>     sys.setdefaultencoding("undefined")
> in a file named sitecustomize.py somewhere in your Python path and Python
> will complain whenever there's an implicit conversion between str and unicode.
This will help in your own code, but there is a big pile of modules in the stdlib
that are not unicode-friendly. From my daily practice come shlex
(the tokenizer works only with encoded strings) and logging (you can't
specify an encoding for FileHandler).
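One possible workaround for the logging case is to open the file yourself with an explicit encoding and hand the stream to a StreamHandler. The sketch below is written for a current Python 3 interpreter and is not tested against the Python versions discussed in this thread; make_utf8_logger, the logger name and the temporary file are all made up for the illustration.

```python
import io
import logging
import os
import tempfile


def make_utf8_logger(path, name="unicode_demo"):
    """Attach a handler that writes to path through a UTF-8 text stream."""
    stream = io.open(path, "a", encoding="utf-8")
    handler = logging.StreamHandler(stream)
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the demo message out of the root logger
    return logger, handler, stream


fd, path = tempfile.mkstemp(suffix=".log")
os.close(fd)

logger, handler, stream = make_utf8_logger(path)
logger.info(u"Gr\xfc\xdfe")   # a message with non-ASCII characters

# Tidy up so the file can be read back and removed.
handler.flush()
logger.removeHandler(handler)
stream.close()

with io.open(path, "rb") as log_file:
    data = log_file.read()
os.remove(path)
```

The encoding decision is then made once, where the file is opened, and the logging calls themselves only ever see unicode.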
--
Jarek Zgoda http://jpa.berlios.de/ | http://www.zgodowie.org/
"Martin v. Löwis" <ma****@v.loewis.de> writes:
> We have come up with a transition strategy, allowing existing libraries to
> widen their support from byte strings to character strings. This isn't a
> simple task, so many libraries still expect and return byte strings, when
> they should process character strings. Instead of breaking the libraries
> right away, we have defined a transitional mechanism, which allows Unicode
> support to be added to libraries as the need arises. This transition is
> still in progress.
>
> Eventually, the primary string type should be the Unicode string. If you are
> curious how far we are still off that goal, just try running your program
> with the -U option.
Is it possible to specify a byte string literal when running with the -U option?
Thomas
"Martin v. Löwis" <ma****@v.loewis.de> writes:
> Eventually, the primary string type should be the Unicode string. If you are
> curious how far we are still off that goal, just try running your program
> with the -U option.
Not very far - can't even call functions ;-)
c:\>py -U
Python 2.5a0 (#60, Dec 29 2004, 11:27:13) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> def f(**kw):
...     pass
...
>>> f(**{"a": 0})
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: f() keywords must be strings
Thomas