unicode encoding usability problem

I have long found Python's default encoding of strict ASCII frustrating.
For one thing, I would rather get a garbage character than an exception. But
the biggest issue is that Unicode exceptions often pop up in unexpected
places, and only when a non-ASCII or unicode character first finds its way
into the system.

Below is an example. The program may run fine at the beginning. But as
soon as a unicode string such as u'b' is introduced, the program blows up
unexpectedly.

>>> sys.getdefaultencoding()
'ascii'
>>> a='\xe5'
>>> # can print, you think you're ok
... print a
å
>>> b=u'b'
>>> a==b
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0:
ordinal not in range(128)

One may suggest the correct way to do it is to use decode, such as

a.decode('latin-1') == b

This brings up another issue. Most references and books focus exclusively on
entering unicode literals and using the encode/decode methods. The fallacy
is that the string is such a basic data type, used throughout the program,
that you really don't want to make an individual decision every time you use
a string (and pay a penalty for any negligence). Java has a much more
usable model, with unicode used internally and an encoding/decoding decision
needed only twice, when dealing with input and output.

I am sure these errors are a nuisance to those who are only half aware of
unicode. Even for those who choose to use unicode, it is almost impossible
to ensure their programs work correctly.
Jul 18 '05 #1
anonymous coward <au******@gmail.com> wrote:
This brings up another issue. Most references and books focus exclusive on entering unicode
literal and using the encode/decode methods. The fallacy is that string is such a basic data type
use throughout the program, you really don't want to make a individual decision everytime when
you use string (and take a penalty for any negligence). The Java has a much more usable model
with unicode used internally and encoding/decoding decision only need twice when dealing with
input and output.
that's how you should do things in Python too, of course. a unicode string
uses unicode internally. decode on the way in, encode on the way out, and
things just work.

the fact that you can mess things up by mixing unicode strings with binary
strings doesn't mean that you have to mix unicode strings with binary strings
in your program.
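A minimal sketch of the "decode on the way in, encode on the way out" pattern described above. It is written for Python 3 (which postdates this thread), and the helper names read_text and write_text, as well as the choice of Latin-1 for input and UTF-8 for output, are assumptions made for illustration only.

```python
# Python 3 sketch: bytes exist only at the edges, text everywhere in between.

def read_text(raw: bytes, encoding: str = "latin-1") -> str:
    """Decode once, at the input boundary (hypothetical helper)."""
    return raw.decode(encoding)


def write_text(text: str, encoding: str = "utf-8") -> bytes:
    """Encode once, at the output boundary (hypothetical helper)."""
    return text.encode(encoding)


incoming = b"\xe5"               # the byte from the opening post
name = read_text(incoming)       # text from here on
greeting = "hello " + name       # text is only ever combined with text
outgoing = write_text(greeting)  # bytes again, in the chosen output encoding
```

Inside the program no encoding decision is made at all; the two helpers are the only places where an encoding name appears.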
Even for those who choose to use unicode, it is almost impossible to ensure their program work
correctly.


well, if you use unicode the way it was intended to, it just works.

</F>

Jul 18 '05 #2
On Fri, 18 Feb 2005 19:24:10 +0100, Fredrik Lundh <fr*****@pythonware.com>
wrote:
that's how you should do things in Python too, of course. a unicode string
uses unicode internally. decode on the way in, encode on the way out, and
things just work.

the fact that you can mess things up by mixing unicode strings with binary
strings doesn't mean that you have to mix unicode strings with binary strings
in your program.


I don't want to mix them. But how could I find them? How do I know that this
statement can be a potential problem

if a==b:

where a and b can be instantiated individually, far away from the line of
code that puts them together?

In Java they are distinct data types and the compiler would catch all
incorrect usage. In Python, the interpreter seems to 'help' us by promoting
binary strings to unicode. Things work fine, unit tests pass, all until
the first non-ASCII characters come in, and then the program breaks.

Is there a scheme for Python developers to use so that they are safe from
incorrect mixing?
Jul 18 '05 #3
aurora wrote:
[...]
In Java they are distinct data type and the compiler would catch all
incorrect usage. In Python, the interpreter seems to 'help' us to
promote binary string to unicode. Things works fine, unit tests pass,
all until the first non-ASCII characters come in and then the program
breaks.

Is there a scheme for Python developer to use so that they are safe
from incorrect mixing?


Put the following:

import sys
sys.setdefaultencoding("undefined")

in a file named sitecustomize.py somewhere in your Python path and
Python will complain whenever there's an implicit conversion between
str and unicode.
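The point of the "undefined" default encoding is that mixing fails loudly for every value, not only for non-ASCII ones, so ordinary tests can catch it. As a rough analogue only: sys.setdefaultencoding does not exist in Python 3, so the sketch below uses a hypothetical helper, strict_equal, that refuses to compare byte strings with text.

```python
def strict_equal(a, b):
    """Compare two strings, refusing to mix bytes with text (hypothetical)."""
    a_is_bytes = isinstance(a, (bytes, bytearray))
    b_is_bytes = isinstance(b, (bytes, bytearray))
    if a_is_bytes != b_is_bytes:
        raise TypeError("refusing to compare %s with %s"
                        % (type(a).__name__, type(b).__name__))
    return a == b


def mixing_detected(a, b):
    """Return True if strict_equal rejects the pair."""
    try:
        strict_equal(a, b)
    except TypeError:
        return True
    return False
```

Note that mixing_detected(b"b", "b") is true even though both values are pure ASCII, which is exactly the case that otherwise slips through until real data arrives.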

HTH,
Walter Dörwald
Jul 18 '05 #4
Fredrik Lundh wrote:
This brings up another issue. Most references and books focus exclusive on entering unicode
literal and using the encode/decode methods. The fallacy is that string is such a basic data type
use throughout the program, you really don't want to make a individual decision everytime when
you use string (and take a penalty for any negligence). The Java has a much more usable model
with unicode used internally and encoding/decoding decision only need twice when dealing with
input and output.


that's how you should do things in Python too, of course. a unicode string
uses unicode internally. decode on the way in, encode on the way out, and
things just work.


There are implementations of Python where it isn't so easy; Python for
iSeries (http://www.iseriespython.com/) is one of them. Code written
for a "normal" platform doesn't work on AS/400, even if all strings used
internally are unicode objects, and unicode literals don't work as
expected either.

Of course, this is an implementation fault, but it is a headache if you
need to write portable code.

--
Jarek Zgoda
http://jpa.berlios.de/ | http://www.zgodowie.org/
Jul 18 '05 #5
Walter Dörwald <wa****@livinglogic.de> writes:
aurora wrote:
> [...]
In Java they are distinct data type and the compiler would catch all
incorrect usage. In Python, the interpreter seems to 'help' us to
promote binary string to unicode. Things works fine, unit tests
pass, all until the first non-ASCII characters come in and then the
program breaks.
Is there a scheme for Python developer to use so that they are safe
from incorrect mixing?


Put the following:

import sys
sys.setdefaultencoding("undefined")

in a file named sitecustomize.py somewhere in your Python path and
Python will complain whenever there's an implicit conversion between
str and unicode.


Sounds cool, so I did it.
And started a program I was currently working on.
The first function in it is this:

if sys.platform == "win32":

    def _locate_gccxml():
        import _winreg
        for subkey in (r"Software\gccxml", r"Software\Kitware\GCC_XML"):
            for root in (_winreg.HKEY_CURRENT_USER, _winreg.HKEY_LOCAL_MACHINE):
                try:
                    hkey = _winreg.OpenKey(root, subkey, 0, _winreg.KEY_READ)
                except WindowsError, detail:
                    if detail.errno != 2:
                        raise
                else:
                    return _winreg.QueryValueEx(hkey, "loc")[0] + r"\bin"

    loc = _locate_gccxml()
    if loc:
        os.environ["PATH"] = loc

All strings in that snippet are text strings, so the first approach was
to convert them to unicode literals. Doesn't work. Here is the final,
working version (changes are marked):

if sys.platform == "win32":

    def _locate_gccxml():
        import _winreg
        for subkey in (r"Software\gccxml", r"Software\Kitware\GCC_XML"):
            for root in (_winreg.HKEY_CURRENT_USER, _winreg.HKEY_LOCAL_MACHINE):
                try:
                    hkey = _winreg.OpenKey(root, subkey, 0, _winreg.KEY_READ)
                except WindowsError, detail:
                    if detail.errno != 2:
                        raise
                else:
                    return _winreg.QueryValueEx(hkey, "loc")[0] + ur"\bin"
#-----------------------------------------------------------------^
    loc = _locate_gccxml()
    if loc:
        os.environ["PATH"] = loc.encode("mbcs")
#--------------------------------^

So, it appears that:

- the _winreg.QueryValueEx function is strange: it takes ascii strings,
but returns a unicode string.
- _winreg.OpenKey takes ascii strings
- the os.environ["PATH"] accepts an ascii string.

And I won't guess what happens when there are registry entries with
umlauts (ok, they could be handled by 'mbcs' encoding), and with chinese
or japanese characters (no way to represent them in ascii strings with a
western locale and mbcs encoding, afaik).
I suggest that 'sys.setdefaultencoding("undefined")' be the standard
setting for the core developers ;-)
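To make the concern about umlauts versus Chinese or Japanese characters concrete, here is a small Python 3 sketch. The 'mbcs' codec only exists on Windows, so cp1252 is used as a stand-in for what mbcs maps to under a western locale; that substitution, the helper can_encode, and both paths are assumptions for illustration.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Report whether every character of text fits in the given codec."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# Made-up registry values, not taken from the thread.
umlaut_path = "C:\\Programme\\gccxml\\b\u00fccher"   # contains u-umlaut
cjk_path = "C:\\\u65e5\u672c\u8a9e\\gccxml"           # contains Japanese text

western_ok = can_encode(umlaut_path, "cp1252")
cjk_ok = can_encode(cjk_path, "cp1252")

# Forcing the issue with errors="replace" silently loses the characters.
lossy = cjk_path.encode("cp1252", errors="replace")
```

The umlaut survives because cp1252 has a slot for it; the Japanese characters do not, and the lossy variant no longer names the same directory.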

Thomas
Jul 18 '05 #6
aurora wrote:
The Java
has a much more usable model with unicode used internally and
encoding/decoding decision only need twice when dealing with input and
output.


In addition to Fredrik's comment (that you should use the same model
in Python) and Walter's comment (that you can enforce it by setting
the default encoding to "undefined"), I'd like to point out the
historical reason: Python predates Unicode, so the byte string type
has many convenience operations that you would only expect of
a character string.

We have come up with a transition strategy, allowing existing
libraries to widen their support from byte strings to character
strings. This isn't a simple task, so many libraries still expect
and return byte strings, when they should process character strings.
Instead of breaking the libraries right away, we have defined
a transitional mechanism, which allows to add Unicode support
to libraries as the need arises. This transition is still in
progress.

Eventually, the primary string type should be the Unicode
string. If you are curious how far we are still off that goal,
just try running your program with the -U option.

Regards,
Martin
Jul 18 '05 #7
Walter Dörwald wrote:
Is there a scheme for Python developer to use so that they are safe
from incorrect mixing?

Put the following:

import sys
sys.setdefaultencoding("undefined")

in a file named sitecustomize.py somewhere in your Python path and
Python will complain whenever there's an implicit conversion between
str and unicode.


This will help in your code, but there is a big pile of modules in the stdlib
that are not unicode-friendly. From my daily practice come shlex
(the tokenizer works only with encoded strings) and logging (you can't
specify an encoding for FileHandler).

--
Jarek Zgoda
http://jpa.berlios.de/ | http://www.zgodowie.org/
Jul 18 '05 #8
"Martin v. Löwis" <ma****@v.loewis.de> writes:
We have come up with a transition strategy, allowing existing
libraries to widen their support from byte strings to character
strings. This isn't a simple task, so many libraries still expect
and return byte strings, when they should process character strings.
Instead of breaking the libraries right away, we have defined
a transitional mechanism, which allows to add Unicode support
to libraries as the need arises. This transition is still in
progress.

Eventually, the primary string type should be the Unicode
string. If you are curious how far we are still off that goal,
just try running your program with the -U option.


Is it possible to specify a byte string literal when running with the -U option?

Thomas
Jul 18 '05 #9
"Martin v. Löwis" <ma****@v.loewis.de> writes:
Eventually, the primary string type should be the Unicode
string. If you are curious how far we are still off that goal,
just try running your program with the -U option.


Not very far - can't even call functions ;-)

c:\>py -U
Python 2.5a0 (#60, Dec 29 2004, 11:27:13) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> def f(**kw):
...     pass
...
>>> f(**{"a": 0})
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: f() keywords must be strings


Thomas
Jul 18 '05 #10
Martin v. Löwis:
Eventually, the primary string type should be the Unicode
string. If you are curious how far we are still off that goal,
just try running your program with the -U option.


Tried both -U and sys.setdefaultencoding("undefined") on a couple of my
most used programs and saw a few library problems. One program reads job
advertisements from a mailing list, ranks them according to keywords, and
displays them using unicode to ensure that HTML entities like &bull; are
displayed correctly. That program worked without changes.

The second program reads my spam filled mail box removing messages that
match a set of header criteria. It uses decode_header and make_header from
the email.Header library module to convert each header from a set of encoded
strings into a single unicode string. As email.Header is strongly concerned
with unicode, I expected it would be able to handle the two modifications
well.

With -U, there was one bug in my code assuming that a string would be 8
bit and that was easily fixed. In email.Charset, __init__ expects a
non-unicode argument as it immediately calls unicode(input_charset, 'ascii')
which fails when the argument is unicode. This can be fixed explicitly in
the __init__ but I would argue for a more lenient approach with unicode(u,
enc, err) always ignoring the enc and err arguments when the input is
already in unicode. Next sre breaks when building a mapping array because
array.array can not have a unicode type code. This should probably be fixed
in array rather than sre as mapping = array.array('b'.encode('ascii'),
mapping).tostring() is too ugly. The final issue was in encodings.idna where
there is ace_prefix = "xn--"; uace_prefix = unicode(ace_prefix, "ascii")
which again could avoid breakage if unicode was more lenient.
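A sketch of the "more lenient" conversion argued for above: text passes through untouched, and the encoding arguments only matter when the input is still a byte string. It is written for Python 3, and to_text is a hypothetical helper, not an existing builtin or library function.

```python
def to_text(value, encoding="ascii", errors="strict"):
    """Return value as text; decode only when it is still a byte string."""
    if isinstance(value, str):
        return value  # already text: encoding and errors are ignored
    if isinstance(value, (bytes, bytearray)):
        return value.decode(encoding, errors)
    raise TypeError("expected bytes or str, got %s" % type(value).__name__)


ace_prefix = "xn--"
uace_prefix = to_text(ace_prefix, "ascii")   # no failure on text input
from_bytes = to_text(b"xn--", "ascii")       # byte input is decoded
```

With a helper like this, the idna and email.Charset lines mentioned above would behave the same whether the literal is a byte string or already unicode.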

With sys.setdefaultencoding("undefined"), there were more problems and
they were harder to work around. One addition that could help would be a
function similar to str but with an optional encoding that would be used
when the input failed to convert to string because of a UnicodeError.
Something like

def stri(x, enc='us-ascii'):
    try:
        return str(x)
    except UnicodeError:
        return unicode(x).encode(enc)

Neil
Jul 18 '05 #11
On Fri, 18 Feb 2005 20:18:28 +0100, Walter Dörwald <wa****@livinglogic.de>
wrote:
aurora wrote:
> [...]
In Java they are distinct data type and the compiler would catch all
incorrect usage. In Python, the interpreter seems to 'help' us to
promote binary string to unicode. Things works fine, unit tests pass,
all until the first non-ASCII characters come in and then the program
breaks.
Is there a scheme for Python developer to use so that they are safe
from incorrect mixing?


Put the following:

import sys
sys.setdefaultencoding("undefined")

in a file named sitecustomize.py somewhere in your Python path and
Python will complain whenever there's an implicit conversion between
str and unicode.

HTH,
Walter Dörwald


That helps! Running the unit tests caught quite a few potential problems (as
well as a lot of safe ASCII string promotions).
Jul 18 '05 #12
On Fri, 18 Feb 2005 21:16:01 +0100, Martin v. Löwis <ma****@v.loewis.de>
wrote:
I'd like to point out the
historical reason: Python predates Unicode, so the byte string type
has many convenience operations that you would only expect of
a character string.

We have come up with a transition strategy, allowing existing
libraries to widen their support from byte strings to character
strings. This isn't a simple task, so many libraries still expect
and return byte strings, when they should process character strings.
Instead of breaking the libraries right away, we have defined
a transitional mechanism, which allows to add Unicode support
to libraries as the need arises. This transition is still in
progress.

I understand. So I wasn't yelling "why can't Python be more like Java". On
the other hand, I also want to point out that making an individual decision
for each string isn't practical and is very error prone. The fact that
unicode and 8-bit strings look alike and work alike in common situations, and
only run into problems with non-ASCII data, is very confusing for most people.

Eventually, the primary string type should be the Unicode
string. If you are curious how far we are still off that goal,
just try running your program with the -U option.

Lots of errors. Among them are gzip (binary?!) and strftime??

I actually quite appreciate Python's power in processing binary data as
8-bit strings. But perhaps we should transition to using unicode as the text
string and treat the binary string as the exception. Right now we have

'' - 8bit string; u'' unicode string

How about

b'' - 8bit string; '' unicode string

and no automatic conversion. Perhaps this can be activated by something
like the encoding declarations, so that transition can happen module by
module.
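For reference, Python 3 (released after this thread) adopted essentially this split: '' is a unicode string, b'' is a byte string, and there is no automatic conversion between them. A short sketch of what that looks like, assuming a default Python 3 interpreter without the -b flag:

```python
text = "abc"       # text (unicode) string
data = b"abc"      # byte string

# Comparison never decodes implicitly; the two are simply not equal.
equal = (text == data)

# Concatenation refuses to guess an encoding.
try:
    text + data
    concat_allowed = True
except TypeError:
    concat_allowed = False

# The conversion has to be spelled out.
explicit = (data.decode("ascii") == text)
```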



Jul 18 '05 #13
On Fri, 18 Feb 2005 21:43:52 +0100, Thomas Heller wrote:
Eventually, the primary string type should be the Unicode
string. If you are curious how far we are still off that goal,
just try running your program with the -U option.


Not very far - can't even call functions ;-)
>>> def f(**kw):
...     pass
...
>>> f(**{"a": 0})
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: f() keywords must be strings


That is possible:
>>> f(**{chr(ord("a")): 0})


WFM. SCNR,
Alexander
Jul 18 '05 #14
"aurora" <au******@gmail.com> wrote:
I don't want to mix them. But how could I find them? How do I know this statement can be
potential problem

if a==b:

where a and b can be instantiated individually far away from this line of code that put them
together?
if you don't know what a and b comes from, how can you be sure that
your program works at all? how can you be sure they're both strings?

("a op b" can fail in many ways, depending on what "a", "b", and "op" are)
Things works fine, unit tests pass, all until the first non-ASCII characters
come in and then the program breaks.


if you have unit tests, why don't they include Unicode tests?

</F>

Jul 18 '05 #15
Thomas Heller wrote:
"Martin v. Löwis" <ma****@v.loewis.de> writes:

We have come up with a transition strategy, allowing existing
libraries to widen their support from byte strings to character
strings. This isn't a simple task, so many libraries still expect
and return byte strings, when they should process character strings.
Instead of breaking the libraries right away, we have defined
a transitional mechanism, which allows to add Unicode support
to libraries as the need arises. This transition is still in
progress.

Eventually, the primary string type should be the Unicode
string. If you are curious how far we are still off that goal,
just try running your program with the -U option.

Is it possible to specify a byte string literal when running with the -U option?


Not that I know of. If the 'bytes' type happens, then I'd be a fan of b"" to get
a byte string instead of a character string.

Cheers,
Nick.

--
Nick Coghlan | nc******@email.com | Brisbane, Australia
---------------------------------------------------------------
http://boredomandlaziness.skystorm.net
Jul 18 '05 #16
Thomas Heller wrote:
Is it possible to specify a byte string literal when running with the -U option?


Not literally. However, you can specify things like

bytes = [0x47, 0x49, 0x4f, 0x50, 0x01, 0x00]
bytes = ''.join((chr(x) for x in bytes))

Alternatively, you could rely on the 1:1 feature of Latin-1:

bytes = "GIOP\x01\0".encode("l1")
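The "1:1 feature of Latin-1" means that byte value n and code point n correspond for every n from 0 to 255, so any byte sequence survives a round trip through text. A small Python 3 check of that property (the variable names are illustrative only):

```python
all_bytes = bytes(range(256))

# Decoding maps byte n to code point n ...
as_text = all_bytes.decode("latin-1")
code_points = [ord(ch) for ch in as_text]

# ... and encoding maps it straight back.
round_trip = as_text.encode("latin-1")

# The GIOP header from above, built from a text literal.
giop_header = "GIOP\x01\0".encode("latin-1")
```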

Regards,
Martin
Jul 18 '05 #17
aurora wrote:
Lots of errors. Among them are gzip (binary?!) and strftime??

For gzip, this is not surprising. It contains things like

self.fileobj.write('\037\213')

which is not intended to denote characters.
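The two octal escapes are the gzip magic number, i.e. raw bytes rather than characters. A quick Python 3 sketch that confirms this against the standard gzip module:

```python
import gzip

# '\037\213' in octal is 0x1f 0x8b, the gzip magic number.
magic = bytes([0o037, 0o213])

compressed = gzip.compress(b"hello")
starts_with_magic = (compressed[:2] == magic)
```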

How about

b'' - 8bit string; '' unicode string

and no automatic conversion.
This has been proposed before, see PEP 332. The problem is that
people often want byte strings to be mutable as well, so it is
still unclear whether it is better to make the b prefix denote
the current string type (so it would be currently redundant)
or a newly-created mutable string type (similar to array.array).
Perhaps this can be activated by something
like the encoding declarations, so that transition can happen module by
module.


That could work for the literals - a __future__ import would be
most appropriate. For "no automatic conversion", this is very
difficult to implement on a per-module basis. The errors typically
don't occur in the module itself, but in some function called by
the module (e.g. a builtin method of the string type). So the
callee would have to know whether the caller has a future
import...

Regards,
Martin
Jul 18 '05 #18
Martin v. Löwis wrote:
How about

b'' - 8bit string; '' unicode string

and no automatic conversion.

This has been proposed before, see PEP 332. The problem is that
people often want byte strings to be mutable as well, so it is
still unclear whether it is better to make the b prefix denote
the current string type (so it would be currently redundant)
or a newly-created mutable string type (similar to array.array).


Having "", u"", and r"" be immutable, while b"" was mutable would seem rather
inconsistent.

If you want a phased migration to 'assert (str is unicode) == True', then PEP
332 seems to have that covered:

1. Introduce 'bytes' as an alias of str
2. Introduce b"" as an alternate spelling of r""
3. Switch str to be an alias of unicode
4. Switch "" to be an alternate spelling of u""

Trying to intermingle this with making the bytes type mutable seems to be
begging for trouble - consider how many string-keyed dictionaries would break
with that change (the upgrade path is non-existent - you can't stay with str,
because you want byte strings, but you can't go to bytes, because you need
something immutable).
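The dictionary problem can be illustrated with the two byte types that Python 3 eventually shipped, used here purely as stand-ins for the immutable and mutable variants under discussion: bytes is hashable, bytearray is not.

```python
immutable_key = b"header"
mutable_key = bytearray(b"header")

table = {immutable_key: 1}        # fine: bytes is hashable

try:
    table[mutable_key] = 2        # a mutable sequence cannot be a dict key
    mutable_hashable = True
except TypeError:
    mutable_hashable = False

# Lookups only work after freezing the mutable value again.
lookup = table[bytes(mutable_key)]
```

Every string-keyed dictionary would need that extra freezing step if the replacement for str were mutable.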

An alternative would be to have "bytestr" be the immutable type corresponding to
the current str (with b"" literals producing bytestr's), while reserving the
"bytes" name for a mutable byte sequence. That is, change PEP 332's upgrade path
to look more like:

* Add a bytestr builtin which is just a synonym for str. (2.5)
* Add a b"..." string literal which is equivalent to raw string literals,
with the exception that values which conflict with the source encoding of the
containing file do not generate warnings. (2.5)
* Warn about the use of variables named "bytestr". (2.5 or 2.6)
* Introduce a bytestr builtin which refers to a sequence distinct from the
str type. (2.6)
* Make str a synonym for unicode. (3.0)

And separately:
* Introduce a bytes builtin which is a mutable byte sequence

Alternately, add array.bytes as a subclass of array.array, that provides a nicer
API for dealing specifically with byte strings.

The main point being, the replacement for 'str' needs to be immutable or the
upgrade process is going to be a serious PITA.

Cheers,
Nick.

--
Nick Coghlan | nc******@email.com | Brisbane, Australia
---------------------------------------------------------------
http://boredomandlaziness.skystorm.net
Jul 18 '05 #19
Nick Coghlan wrote:
Having "", u"", and r"" be immutable, while b"" was mutable would seem
rather inconsistent.
Yes. However, this inconsistency might be desirable. It would, of
course, mean that the literal cannot be a singleton. Instead, it has
to be a display (?), similar to list or dict displays: each execution
of the byte string literal creates a new object.
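A sketch of what "display" semantics would mean in practice. Python has no mutable byte literal, so a bytearray constructor call stands in for one here; that stand-in and the function names are assumptions for illustration.

```python
def list_display():
    return []                 # a display: evaluated afresh on every call


def mutable_bytes():
    return bytearray(b"abc")  # stand-in for a hypothetical mutable b"abc"


first, second = list_display(), list_display()
first.append(1)               # does not affect the other list

buf_a, buf_b = mutable_bytes(), mutable_bytes()
buf_a[0] = ord("x")           # does not affect the other buffer
```

If the literal were a shared constant instead, the mutation of buf_a would be visible through buf_b, which is why a mutable literal has to create a new object each time it is executed.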
An alternative would be to have "bytestr" be the immutable type
corresponding to the current str (with b"" literals producing
bytestr's), while reserving the "bytes" name for a mutable byte
sequence.
Indeed. This maze of options has caused the process to get stuck.
People also argue that with such an approach, we could as well
tell users to use array.array for the mutable type. But then,
people complain that it doesn't have all the library support that
strings have.
The main point being, the replacement for 'str' needs to be immutable or
the upgrade process is going to be a serious PITA.


Somebody really needs to take this in his hands, completing the PEP,
writing a patch, checking applications to find out what breaks.

Regards,
Martin
Jul 18 '05 #20
Martin v. Löwis wrote:
People also argue that with such an approach, we could as well
tell users to use array.array for the mutable type. But then,
people complain that it doesn't have all the library support that
strings have.


Indeed - I've got a data manipulating program that I figured I could make
slightly less memory hungry by using arrays instead of strings.

I discovered very quickly just how inconvenient such a change would be in terms
of the available API for manipulation of the byte array (the loss of 'join'
support was a serious drawback). The program still uses strings for that reason.

However, I wonder if that might not be better solved by providing an
"array.bytearray" that supported relevant portions of the string API (and easy
conversion to a string), rather than blurring the concept of immutable strings.
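The API gap is easy to see in Python 3, where array.array still has no join, while the later built-in bytearray type offers roughly what is proposed here: a mutable byte sequence with string-style methods and easy conversion back to an immutable string. The sketch below only illustrates that contrast.

```python
from array import array

chunks = [b"GIOP", b"\x01\x00"]

# array.array holds the bytes but lacks the string API.
as_array = array("B", b"".join(chunks))
array_has_join = hasattr(as_array, "join")

# bytearray is mutable and still supports join.
joined = bytearray(b"").join(chunks)
joined[0] = ord("g")          # in-place modification

# Conversion back to an immutable byte string is one call.
frozen = bytes(joined)
```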

Hmm - something else the PEP needs to discuss: What happens to __str__ and
__unicode__? Is there a new __bytes__ slot?

I wonder if Skip is still up for championing this one. . .

Cheers,
Nick.
One PEP's enough for me (even though 338 doesn't seem to generate much interest)

--
Nick Coghlan | nc******@email.com | Brisbane, Australia
---------------------------------------------------------------
http://boredomandlaziness.skystorm.net
Jul 18 '05 #22
On Sat, 19 Feb 2005 18:44:27 +0100, Fredrik Lundh <fr*****@pythonware.com>
wrote:
"aurora" <au******@gmail.com> wrote:
I don't want to mix them. But how could I find them? How do I know this
statement can be potential problem

if a==b:

where a and b can be instantiated individually far away from this line of
code that put them together?


if you don't know what a and b comes from, how can you be sure that
your program works at all? how can you be sure they're both strings?

("a op b" can fail in many ways, depending on what "a", "b", and "op"
are)


a and b are both string. The issue is 8-bit string or unicode string.

Things works fine, unit tests pass, all until the first non-ASCII
characters
come in and then the program breaks.


if you have unit tests, why don't they include Unicode tests?

</F>


How do I structure the test cases to guarantee coverage? It is not
practical to test every combination of unicode/8-bit strings. Adding
non-ASCII characters to the test data would probably make problems pop up
earlier. But it is arduous, and it is hard to spot whether you have left any out.
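One way to keep this manageable is to hold a single list of sample strings, including non-ASCII ones, and feed every sample through the code both as text and as encoded bytes. The sketch below assumes Python 3; normalize is a hypothetical function under test and the sample values are made up.

```python
import unittest


def normalize(value, encoding="latin-1"):
    """Hypothetical function under test: always returns text."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# One shared list of samples; non-ASCII data is part of it from the start.
SAMPLES = ["plain", "\u00e5ngstr\u00f6m", "na\u00efve caf\u00e9"]


class MixedStringTest(unittest.TestCase):
    def test_bytes_and_text_agree_after_normalizing(self):
        for text in SAMPLES:
            raw = text.encode("latin-1")
            with self.subTest(text=text):
                self.assertEqual(normalize(raw), normalize(text))
                self.assertIsInstance(normalize(raw), str)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(MixedStringTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

This does not guarantee coverage of every combination, but it makes the non-ASCII case the default rather than something each test has to remember.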

Jul 18 '05 #23
On Sun, 20 Feb 2005 15:01:09 +0100, Martin v. Löwis <ma****@v.loewis.de>
wrote:
Nick Coghlan wrote:
Having "", u"", and r"" be immutable, while b"" was mutable would seem
rather inconsistent.


Yes. However, this inconsistency might be desirable. It would, of
course, mean that the literal cannot be a singleton. Instead, it has
to be a display (?), similar to list or dict displays: each execution
of the byte string literal creates a new object.
An alternative would be to have "bytestr" be the immutable type
corresponding to the current str (with b"" literals producing
bytestr's), while reserving the "bytes" name for a mutable byte
sequence.


Indeed. This maze of options has caused the process to get stuck.
People also argue that with such an approach, we could as well
tell users to use array.array for the mutable type. But then,
people complain that it doesn't have all the library support that
strings have.
The main point being, the replacement for 'str' needs to be immutable
or the upgrade process is going to be a serious PITA.


Somebody really needs to take this in his hands, completing the PEP,
writing a patch, checking applications to find out what breaks.

Regards,
Martin


What is the process for getting a PEP worked out? Are the work and
discussion carried out on the python-dev mailing list? I would be glad to
help out, especially on this particular issue.
Jul 18 '05 #24
"aurora" <au******@gmail.com> wrote:
if you don't know what a and b comes from, how can you be sure that
your program works at all? how can you be sure they're both strings?


a and b are both string.


how do you know that?
if you have unit tests, why don't they include Unicode tests?


How do I structure the test cases to guarantee coverage? It is not practical to test every
combinations of unicode/8-bit strings. Adding non-ascii characters to test data probably make
problem pop up earlier. But it is arduous


sounds like you don't want to test for it. sorry, cannot help. I prefer
to design libraries so they can be tested, and design tests so they test all
important aspects of my libraries. if you prefer another approach, there's
not much I can do, other than repeating what I said at the start: if you do
things the right way (decode on the way in, encode on the way out), it
just works.

</F>

Jul 18 '05 #25
"Fredrik Lundh" <fr*****@pythonware.com> writes on Sat, 19 Feb 2005 18:44:27 +0100:
"aurora" <au******@gmail.com> wrote:
I don't want to mix them. But how could I find them? How do I know this statement can be
potential problem

if a==b:

where a and b can be instantiated individually far away from this line of code that put them
together?


I do understand aurora's problems very well.

I, too, have suffered from this occasionally:

* some library decides to use unicode (without my having asked it to do so)

* Python then decides to convert other strings to unicode
and boom: "Unicode decode error".

I solve these issues with a "sys.setdefaultencoding(ourDefaultEncoding)"
in "sitecustomize.py".

I know that almost all the characters I have to handle
are encoded in "ourDefaultEncoding" and if something
converts to Unicode without being asked for, then this
is precisely the correct encoding.

I know that Unicode fanatics do not like "setdefaultencoding",
but until we have completely converted to Unicode (which we probably
will do in the more distant future), this is essential to stay sane...
Dieter
Jul 18 '05 #26
aurora wrote:
What is the processing of getting a PEP work out? Does the work and
discussion carry out in the python-dev mailing list? I would be glad to
help out especially on this particular issue.


See PEP 1 for the PEP process. The main point is that discussion is
*not* carried out on any specific forum. But instead, the PEP serves
as a container for all possible considerations people come up with,
formally by writing to the PEP author. Of course, they will use
comp.lang.python and python-dev (and perhaps SIG mailing lists)
instead of writing to the PEP author, so the PEP author may need to
track these as well.

The process is triggered by the author posting revisions of the
PEP at a moderate rate, each time claiming "now I think it is
complete". Then, if nobody comes up with an argument that is
not yet covered in the PEP, it becomes ready for BDFL
pronouncement. It had better also have an implementation at some point
in time.

For a dormant PEP, the prospective author should contact the
original author, and offer co-authoring. Perhaps the original
author even proposes that you can take over the entire thing
sometime.

Notice that, at some point, a patch implementing the PEP will
be needed. So you should indicate from the beginning whether you
are also willing to work on the implementation. If not, there is
a good chance that the PEP again goes dormant after the
specification is complete.

Regards,
Martin

Jul 18 '05 #27
> This will help in your code, but there is a big pile of modules in stdlib
that are not unicode-friendly. From my daily practice come shlex
(tokenizer works only with encoded strings) and logging (you can't
specify encoding for FileHandler).


You can, of course, pass in a stream opened using codecs.open to StreamHandler.
Not quite as friendly, I'll grant you.
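
A minimal sketch of that workaround, assuming UTF-8 output. The helper name make_utf8_handler, the logger name, and the temporary log path are made up for illustration.

```python
import codecs
import logging
import os
import tempfile


def make_utf8_handler(path, encoding="utf-8"):
    """Return a StreamHandler that writes to `path` through a codecs stream."""
    # codecs.open returns a stream that encodes unicode text on write.
    stream = codecs.open(path, "a", encoding=encoding)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(u"%(levelname)s %(message)s"))
    return handler, stream


log_path = os.path.join(tempfile.mkdtemp(), "example.log")  # hypothetical location
handler, stream = make_utf8_handler(log_path)

logger = logging.getLogger("unicode_demo")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the demo output out of the root logger
logger.addHandler(handler)

logger.info(u"caf\xe9 opened")  # non-ASCII message, encoded by the stream

# Detach and close explicitly, since the caller owns the stream.
logger.removeHandler(handler)
handler.flush()
stream.close()
```

The caller has to close the stream itself, which is part of why this is less friendly than an encoding option on the handler. Later Python versions accept an "encoding" argument on logging.FileHandler directly.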

Regards,
Vinay Sajip


Jul 18 '05 #28
Hello !

I've been trying desperately to access http://www.stackless.com but it's
been down for about a week now !
I desperately need to download stackless python...
Of course the stackless mailing list is on their server, so it's down,
too.

Does anybody have any info ?
Does anybody have a tarball of a recent version of stackless that I may
use (with the docs ?)

Thanks !

Regards,
P.F. Caillaud
Jul 18 '05 #29
Hi!

Pierre-Frédéric Caillaud wrote:
I've been trying desperately to access http://www.stackless.com but
it's been down, for about a week now !


The stackless webpage is working again.

Regards,

Carl Friedrich Bolz

Jul 18 '05 #30

Great !
Thanks !

On Tue, 12 Apr 2005 16:15:42 +0200, cfbolz <cf****@gmx.de> wrote:
Hi!

Pierre-Frédéric Caillaud wrote:
I've been trying desperately to access http://www.stackless.com but
it's been down, for about a week now !


The stackless webpage is working again.

Regards,

Carl Friedrich Bolz


Jul 18 '05 #31
