python encoding bug?


I was playing with python encodings and noticed this:

garabik@lancre:~$ python2.4
Python 2.4 (#2, Dec 3 2004, 17:59:05)
[GCC 3.3.5 (Debian 1:3.3.5-2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> unicode('\x9d', 'iso8859_1')
u'\x9d'


U+009D is NOT a valid unicode character (it is not even a iso8859_1
valid character)

The same happens if I use 'latin-1' instead of 'iso8859_1'.

This caught me by surprise, since I was doing some heuristics guessing
string encodings, and 'iso8859_1' gave no errors even if the input
encoding was different.
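
The effect described here can be reproduced with a short sketch. It is written in Python 3 syntax as an assumption (the session above is Python 2.4, where `unicode(s, 'iso8859_1')` plays the role of `bytes.decode`), and the variable names are only illustrative. ISO 8859-1 assigns a code point to every one of the 256 byte values, so decoding with it cannot fail:

```python
# Every possible byte value decodes under ISO 8859-1, so a successful
# decode says nothing about whether the encoding guess was right.
all_bytes = bytes(range(256))
decoded = all_bytes.decode('iso8859_1')

# Each byte maps to the code point with the same number (0x9D -> U+009D).
same_numbers = all(ord(ch) == b for ch, b in zip(decoded, all_bytes))

# By contrast, a stricter encoding such as UTF-8 rejects a lone 0x9D byte.
try:
    b'\x9d'.decode('utf-8')
    utf8_accepts = True
except UnicodeDecodeError:
    utf8_accepts = False
```

For a guessing heuristic this suggests trying 'iso8859_1' last, as a fallback, rather than treating the absence of an error as evidence.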

Is this a known behaviour, or have I discovered a terrible unknown bug in the python encoding
implementation that should be immediately reported and fixed? :-)
happy new year,

--
-----------------------------------------------------------
| Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/ |
| __..--^^^--..__ garabik @ kassiopeia.juls.savba.sk |
-----------------------------------------------------------
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!
Dec 30 '05 #1
<ga******************@kassiopeia.juls.savba.sk> wrote in message
news:dp***********@ns.felk.cvut.cz...
|
| I was playing with python encodings and noticed this:
|
| garabik@lancre:~$ python2.4
| Python 2.4 (#2, Dec 3 2004, 17:59:05)
| [GCC 3.3.5 (Debian 1:3.3.5-2)] on linux2
| Type "help", "copyright", "credits" or "license" for more information.
| >>> unicode('\x9d', 'iso8859_1')
| u'\x9d'
| >>>
|
| U+009D is NOT a valid unicode character (it is not even a iso8859_1
| valid character)

That statement is not entirely true. If you check the current
UnicodeData.txt (on http://www.unicode.org/Public/UNIDATA/) you'll find:

009D;<control>;Cc;0;BN;;;;;N;OPERATING SYSTEM COMMAND;;;;
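
The same fields can probably be checked from Python itself, without downloading the file. This is a minimal sketch in Python 3 syntax (an assumption, since the thread uses Python 2.4) that relies on the standard `unicodedata` module; the variable names are illustrative:

```python
import unicodedata

ch = '\x9d'  # U+009D

category = unicodedata.category(ch)    # general category field
bidi = unicodedata.bidirectional(ch)   # bidirectional class field
# Control characters have no formal name, so supply a default
# instead of letting unicodedata.name() raise ValueError.
name = unicodedata.name(ch, '<control>')

print('%04X' % ord(ch), name, category, bidi)
```

The category `Cc` and the bidirectional class `BN` correspond to the third and fifth fields of the UnicodeData.txt entry quoted above.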

Regards,

Vincent Wehren

Dec 31 '05 #2
ga******************@kassiopeia.juls.savba.sk wrote:

| I was playing with python encodings and noticed this:
|
| garabik@lancre:~$ python2.4
| Python 2.4 (#2, Dec 3 2004, 17:59:05)
| [GCC 3.3.5 (Debian 1:3.3.5-2)] on linux2
| Type "help", "copyright", "credits" or "license" for more information.
| >>> unicode('\x9d', 'iso8859_1')
| u'\x9d'
| >>>
|
| U+009D is NOT a valid unicode character (it is not even a iso8859_1
| valid character)


It *IS* a valid unicode and iso8859-1 character, so the behaviour of the
python decoder is correct. The range U+0080 - U+009F is used for various
control characters. There's rarely a valid use for these characters in
documents, so you can be pretty sure that a document using bytes in this
range is really windows-1252. It is still valid iso-8859-1, but for a
heuristic guess it's probably safer to assume windows-1252.
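
A rough sketch of that heuristic, in Python 3 syntax (an assumption, since the thread uses Python 2.4). The function name `guess_latin1_or_cp1252` is hypothetical. One caveat: Python's 'cp1252' codec leaves a few bytes in this range undefined, 0x9D among them, so the sketch reports no guess at all when the windows-1252 decode fails:

```python
def guess_latin1_or_cp1252(data):
    """Guess between iso8859_1 and cp1252 for a bytes object (heuristic only)."""
    # Bytes 0x80-0x9F are control characters in ISO 8859-1, but mostly
    # printable characters (curly quotes, dashes, euro sign) in windows-1252.
    has_c1_range = any(0x80 <= b <= 0x9F for b in data)
    if not has_c1_range:
        return 'iso8859_1'
    try:
        data.decode('cp1252')
    except UnicodeDecodeError:
        # The data uses a byte that Python's cp1252 codec treats as undefined.
        return None
    return 'cp1252'


print(guess_latin1_or_cp1252(b'caf\xe9'))         # no bytes in 0x80-0x9F
print(guess_latin1_or_cp1252(b'\x93quoted\x94'))  # windows-1252 curly quotes
print(guess_latin1_or_cp1252(b'\x9d'))            # undefined in cp1252
```

Returning `None` for the third case is a design choice of this sketch: the byte from the original post fits neither guess well, so the caller can move on to other candidate encodings.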

If you want an exception to be thrown, you'll need to implement your own
codec, something like 'iso8859_1_nocc' - mmm.. I could try this myself,
because I do such a test in one of my projects, too ;)
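
One way such a codec might look, as a sketch only and not the implementation from the project mentioned above. It uses Python 3 syntax (an assumption, since the thread uses Python 2.4) and the standard `codecs` module. The helper names `decode_nocc` and `_search` are hypothetical, and the codec name 'iso8859_1_nocc' is taken from the suggestion above:

```python
import codecs


def decode_nocc(data, errors='strict'):
    """Decode ISO 8859-1, rejecting bytes in the control range 0x80-0x9F."""
    data = bytes(data)  # the codec machinery may pass a memoryview
    for pos, byte in enumerate(data):
        if 0x80 <= byte <= 0x9F:
            # This sketch always raises here, whatever error handler is given.
            raise UnicodeDecodeError('iso8859_1_nocc', data, pos, pos + 1,
                                     'control character in range 0x80-0x9F')
    return codecs.latin_1_decode(data, errors)


def _search(name):
    # Codec search function: answer only for our own codec name.
    if name == 'iso8859_1_nocc':
        return codecs.CodecInfo(name=name,
                                encode=codecs.latin_1_encode,
                                decode=decode_nocc)
    return None


codecs.register(_search)

print(b'caf\xe9'.decode('iso8859_1_nocc'))
try:
    b'\x9d'.decode('iso8859_1_nocc')
except UnicodeDecodeError as exc:
    print('rejected:', exc.reason)
```

Encoding is left as plain ISO 8859-1 here; only the decode direction is made stricter, since that is the direction an encoding-guessing heuristic depends on.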


--
Benjamin Niemann
Email: pink at odahoda dot de
WWW: http://www.odahoda.de/
Dec 31 '05 #3
