python encoding bug?


I was playing with python encodings and noticed this:

garabik@lancre:~$ python2.4
Python 2.4 (#2, Dec 3 2004, 17:59:05)
[GCC 3.3.5 (Debian 1:3.3.5-2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> unicode('\x9d', 'iso8859_1')
u'\x9d'
>>>


U+009D is NOT a valid unicode character (it is not even a iso8859_1
valid character)

The same happens if I use 'latin-1' instead of 'iso8859_1'.

This caught me by surprise, since I was doing some heuristics guessing
string encodings, and 'iso8859_1' gave no errors even if the input
encoding was different.

Is this a known behaviour, or I discovered a terrible unknown bug in python encoding
implementation that should be immediately reported and fixed? :-)

happy new year,

--
-----------------------------------------------------------
| Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/ |
| __..--^^^--..__ garabik @ kassiopeia.juls.savba.sk |
-----------------------------------------------------------
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!
Dec 30 '05 #1
2 Replies
<ga******************@kassiopeia.juls.savba.sk> wrote in message
news:dp***********@ns.felk.cvut.cz...
|
| I was playing with python encodings and noticed this:
|
| garabik@lancre:~$ python2.4
| Python 2.4 (#2, Dec 3 2004, 17:59:05)
| [GCC 3.3.5 (Debian 1:3.3.5-2)] on linux2
| Type "help", "copyright", "credits" or "license" for more information.
| >>> unicode('\x9d', 'iso8859_1')
| u'\x9d'
| >>>
|
| U+009D is NOT a valid unicode character (it is not even a iso8859_1
| valid character)

That statement is not entirely true. If you check the current
UnicodeData.txt (on http://www.unicode.org/Public/UNIDATA/) you'll find:

009D;<control>;Cc;0;BN;;;;;N;OPERATING SYSTEM COMMAND;;;;
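
The same entry can be checked from Python's standard unicodedata module. This is a minimal sketch that assumes Python 3 (the thread itself uses Python 2.4, where the literal would be written u'\x9d'):

```python
import unicodedata

ch = '\x9d'  # U+009D, the code point from the question

# General category "Cc" (Other, control) marks an assigned control
# character; an unassigned code point would report "Cn" instead.
category = unicodedata.category(ch)

# Bidirectional class, the fifth field of the UnicodeData.txt line above.
bidi = unicodedata.bidirectional(ch)

print(hex(ord(ch)), category, bidi)
```

The values come from the Unicode database bundled with the interpreter, so they reflect whichever Unicode version that Python build ships.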

Regards,

Vincent Wehren

|
| The same happens if I use 'latin-1' instead of 'iso8859_1'.
|
| This caught me by surprise, since I was doing some heuristics guessing
| string encodings, and 'iso8859_1' gave no errors even if the input
| encoding was different.
|
| Is this a known behaviour, or I discovered a terrible unknown bug in python encoding
| implementation that should be immediately reported and fixed? :-)
|
|
| happy new year,
|
| --
| -----------------------------------------------------------
|| Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/ |
|| __..--^^^--..__ garabik @ kassiopeia.juls.savba.sk |
| -----------------------------------------------------------
| Antivirus alert: file .signature infected by signature virus.
| Hi! I'm a signature virus! Copy me into your signature file to help me spread!
Dec 31 '05 #2
ga******************@kassiopeia.juls.savba.sk wrote:

| I was playing with python encodings and noticed this:
|
| garabik@lancre:~$ python2.4
| Python 2.4 (#2, Dec 3 2004, 17:59:05)
| [GCC 3.3.5 (Debian 1:3.3.5-2)] on linux2
| Type "help", "copyright", "credits" or "license" for more information.
| >>> unicode('\x9d', 'iso8859_1')
| u'\x9d'
|
| U+009D is NOT a valid unicode character (it is not even a iso8859_1
| valid character)


It *IS* a valid Unicode and ISO-8859-1 character, so the behaviour of the
Python decoder is correct. The range U+0080 - U+009F is used for various
control characters (the C1 controls). There's rarely a valid use for these
characters in documents, so you can be pretty sure that a document using
them is really windows-1252. It is still valid iso-8859-1, but for a
heuristic guess it's probably safer to assume windows-1252.
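
To illustrate why 'iso8859_1' never reports an error, and how the C1 range can still serve as a hint, here is a minimal sketch assuming Python 3. The helper name count_c1_controls and the sample bytes are made up for this example:

```python
# ISO-8859-1 maps every byte 0x00-0xFF to the code point with the same
# number, so decoding with it cannot fail for any input.
all_bytes = bytes(range(256))
decoded = all_bytes.decode('iso8859_1')
assert [ord(c) for c in decoded] == list(range(256))


def count_c1_controls(data):
    """Count bytes in 0x80-0x9F, which ISO-8859-1 decodes to C1 controls."""
    return sum(1 for b in data if 0x80 <= b <= 0x9F)


# Hypothetical sample: 0x93 and 0x94 are curly quotes in windows-1252,
# but C1 control characters when read as ISO-8859-1.
sample = b'na\xefve \x93quoted\x94 text'
print(count_c1_controls(sample))
```

A non-zero count suggests the input is probably not genuine ISO-8859-1 text. In current CPython the cp1252 codec treats byte 0x9D as undefined and raises UnicodeDecodeError for it, so a strict cp1252 attempt can act as a further check.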

If you want an exception to be thrown, you'll need to implement your own
codec, something like 'iso8859_1_nocc' - mmm.. I could try this myself,
because I do such a test in one of my projects, too ;)
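
A full codec registered as 'iso8859_1_nocc' is not shown here. As a simpler stand-in, the sketch below (Python 3 assumed) uses a plain function that rejects C1 bytes before decoding. The function name decode_latin1_no_c1 is hypothetical, and the codec name from the post is reused only as a label in the exception:

```python
def decode_latin1_no_c1(data):
    """Decode ISO-8859-1 bytes, raising on C1 control bytes (0x80-0x9F)."""
    for pos, byte in enumerate(data):
        if 0x80 <= byte <= 0x9F:
            # Same exception type the built-in codecs raise, so callers
            # can handle it like any other failed decoding attempt.
            raise UnicodeDecodeError(
                'iso8859_1_nocc', data, pos, pos + 1,
                'C1 control character not allowed')
    return data.decode('iso8859_1')


print(decode_latin1_no_c1(b'caf\xe9'))
try:
    decode_latin1_no_c1(b'\x9d')
except UnicodeDecodeError as exc:
    print('rejected:', exc.reason)
```

Because it raises UnicodeDecodeError, this function can be dropped into an encoding-guessing loop alongside calls to bytes.decode with other encodings.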

| The same happens if I use 'latin-1' instead of 'iso8859_1'.
|
| This caught me by surprise, since I was doing some heuristics guessing
| string encodings, and 'iso8859_1' gave no errors even if the input
| encoding was different.
|
| Is this a known behaviour, or I discovered a terrible unknown bug in
| python encoding implementation that should be immediately reported and
| fixed? :-)
|
| happy new year,


--
Benjamin Niemann
Email: pink at odahoda dot de
WWW: http://www.odahoda.de/
Dec 31 '05 #3

