python encoding bug?

garabik-news-2005-05

I was playing with python encodings and noticed this:

garabik@lancre:~$ python2.4
Python 2.4 (#2, Dec 3 2004, 17:59:05)
[GCC 3.3.5 (Debian 1:3.3.5-2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.

>>> unicode('\x9d', 'iso8859_1')
u'\x9d'
>>>

U+009D is NOT a valid unicode character (it is not even a valid
iso8859_1 character)

The same happens if I use 'latin-1' instead of 'iso8859_1'.

This caught me by surprise, since I was doing some heuristics guessing
string encodings, and 'iso8859_1' gave no errors even if the input
encoding was different.

Is this a known behaviour, or have I discovered a terrible unknown bug in the python encoding
implementation that should be immediately reported and fixed? :-)

happy new year,

--
-----------------------------------------------------------
| Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/ |
| __..--^^^--..__ garabik @ kassiopeia.juls.savba.sk |
-----------------------------------------------------------
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!

Dec 30 '05 #1


Vincent Wehren

<ga******************@kassiopeia.juls.savba.sk> wrote in message
news:dp***********@ns.felk.cvut.cz...
|
| I was playing with python encodings and noticed this:
|
| garabik@lancre:~$ python2.4
| Python 2.4 (#2, Dec 3 2004, 17:59:05)
| [GCC 3.3.5 (Debian 1:3.3.5-2)] on linux2
| Type "help", "copyright", "credits" or "license" for more information.
| >>> unicode('\x9d', 'iso8859_1')
| u'\x9d'
| >>>
|
| U+009D is NOT a valid unicode character (it is not even a iso8859_1
| valid character)

That statement is not entirely true. If you check the current
UnicodeData.txt (on http://www.unicode.org/Public/UNIDATA/) you'll find:

009D;<control>;Cc;0;BN;;;;;N;OPERATING SYSTEM COMMAND;;;;
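
For anyone who would rather check this from Python than from UnicodeData.txt, here is a minimal sketch. It is written in Python 3 syntax, where bytes.decode() takes the place of the unicode() call used in this thread; the variable name ch is just for illustration.

```python
import unicodedata

# Decode the single byte 0x9D as ISO-8859-1 (Latin-1).
ch = b"\x9d".decode("iso8859_1")

# The code point is unchanged: Latin-1 maps every byte to the same number.
print(hex(ord(ch)))              # 0x9d

# General category "Cc" is the "control" class from UnicodeData.txt.
print(unicodedata.category(ch))  # Cc
```

The category matches the Cc field in the UnicodeData.txt entry quoted above.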

Regards,

Vincent Wehren

|
| The same happens if I use 'latin-1' instead of 'iso8859_1'.
|
| This caught me by surprise, since I was doing some heuristics guessing
| string encodings, and 'iso8859_1' gave no errors even if the input
| encoding was different.
|
| Is this a known behaviour, or I discovered a terrible unknown bug in python encoding
| implementation that should be immediately reported and fixed? :-)
|
|
| happy new year,
|
| --
| -----------------------------------------------------------
|| Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/ |
|| __..--^^^--..__ garabik @ kassiopeia.juls.savba.sk |
| -----------------------------------------------------------
| Antivirus alert: file .signature infected by signature virus.
| Hi! I'm a signature virus! Copy me into your signature file to help me spread!

Dec 31 '05 #2

Benjamin Niemann

ga******************@kassiopeia.juls.savba.sk wrote:

> I was playing with python encodings and noticed this:
>
> garabik@lancre:~$ python2.4
> Python 2.4 (#2, Dec 3 2004, 17:59:05)
> [GCC 3.3.5 (Debian 1:3.3.5-2)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> unicode('\x9d', 'iso8859_1')
> u'\x9d'
>
> U+009D is NOT a valid unicode character (it is not even a iso8859_1
> valid character)

It *IS* a valid unicode and iso8859-1 character, so the behaviour of the
python decoder is correct. The range U+0080 - U+009F is used for various
control characters. There's rarely a valid use for these characters in
documents, so you can be pretty sure that a document using these characters
is windows-1252 - it is valid iso-8859-1, but for a heuristic guess it's
probably safer to assume windows-1252.
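
The heuristic described above could be sketched roughly as follows. This is only an illustration in Python 3 syntax, not code from anyone in this thread: the function names has_c1_controls and guess_single_byte_encoding are made up, and the rule is simply "any byte in 0x80-0x9F means windows-1252 is the likelier guess".

```python
def has_c1_controls(data: bytes) -> bool:
    """Return True if any byte lies in 0x80-0x9F, the C1 control range of ISO-8859-1."""
    return any(0x80 <= b <= 0x9F for b in data)


def guess_single_byte_encoding(data: bytes) -> str:
    """Hypothetical helper: prefer windows-1252 when C1 bytes are present."""
    return "windows-1252" if has_c1_controls(data) else "iso8859_1"


# Bytes 0x93 and 0x94 are curly quotes in windows-1252, control codes in ISO-8859-1.
sample = b"\x93quoted\x94"
print(guess_single_byte_encoding(sample))      # windows-1252
print(guess_single_byte_encoding(b"caf\xe9"))  # iso8859_1
```

The result is only a label for the guess. Python's cp1252 codec leaves a few bytes in that range unassigned (0x9d among them, as far as I know), so decoding with the guessed encoding can still raise an error.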

If you want an exception to be thrown, you'll need to implement your own
codec, something like 'iso8859_1_nocc' - mmm.. I could try this myself,
because I do such a test in one of my projects, too ;)
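
One possible shape for such a codec, as a minimal sketch in Python 3 syntax rather than the Python 2.4 used in this thread. The codec name 'iso8859_1_nocc' comes from the suggestion above; everything else (the function names decode_nocc and _search, and the choice to reuse the built-in Latin-1 encoder unchanged) is an assumption made for illustration.

```python
import codecs


def decode_nocc(data, errors="strict"):
    """Decode as ISO-8859-1, but reject C1 control characters (U+0080-U+009F)."""
    text, consumed = codecs.latin_1_decode(data, errors)
    for pos, ch in enumerate(text):
        if 0x80 <= ord(ch) <= 0x9F:
            # Latin-1 is one byte per character, so pos is also the byte offset.
            raise UnicodeDecodeError(
                "iso8859_1_nocc", bytes(data), pos, pos + 1,
                "C1 control character not allowed",
            )
    return text, consumed


def _search(name):
    """Codec search function: answer only for the hypothetical codec name."""
    if name == "iso8859_1_nocc":
        return codecs.CodecInfo(
            name="iso8859_1_nocc",
            encode=codecs.latin_1_encode,  # encoding side left as plain Latin-1
            decode=decode_nocc,
        )
    return None


codecs.register(_search)

print(b"caf\xe9".decode("iso8859_1_nocc"))  # café
try:
    b"\x9d".decode("iso8859_1_nocc")
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)
```

With this registered, an encoding-guessing loop can try 'iso8859_1_nocc' like any other codec and treat the exception as "probably not Latin-1".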

> The same happens if I use 'latin-1' instead of 'iso8859_1'.
>
> This caught me by surprise, since I was doing some heuristics guessing
> string encodings, and 'iso8859_1' gave no errors even if the input
> encoding was different.
>
> Is this a known behaviour, or I discovered a terrible unknown bug in
> python encoding implementation that should be immediately reported and
> fixed? :-)
>
> happy new year,

--
Benjamin Niemann
Email: pink at odahoda dot de
WWW: http://www.odahoda.de/

Dec 31 '05 #3
