different encodings for unicode() and u''.encode(), bug?

mario

Hello!

I stumbled on this situation: if I decode some string (below, just the
empty string) using the mcbs encoding, it succeeds, but if I try to
encode it back with the same encoding it surprisingly fails with a
LookupError. This seems like something to be corrected?

$ python
Python 2.5.1 (r251:54869, Apr 18 2007, 22:08:04)
[GCC 4.0.1 (Apple Computer, Inc. build 5367)] on darwin
Type "help", "copyright", "credits" or "license" for more information.

>>> s = ''
>>> unicode(s, 'mcbs')
u''
>>> unicode(s, 'mcbs').encode('mcbs')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
LookupError: unknown encoding: mcbs

Best wishes to everyone for 2008!

mario

Jan 2 '08 #1

Martin v. Löwis

> I stumbled on this situation: if I decode some string (below, just the
> empty string) using the mcbs encoding, it succeeds, but if I try to
> encode it back with the same encoding it surprisingly fails with a
> LookupError. This seems like something to be corrected?

Indeed - in your code. It's not the same encoding.

> >>> unicode(s, 'mcbs')
> u''
> >>> unicode(s, 'mcbs').encode('mcbs')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> LookupError: unknown encoding: mcbs

Use "mbcs" in the second call, not "mcbs".

HTH,
Martin

Jan 2 '08 #2

mario

On Jan 2, 9:30 am, "Martin v. Löwis" <mar...@v.loewis.de> wrote:

> Use "mbcs" in the second call, not "mcbs".

Ooops, sorry about that. When I switched to test it in the interpreter
I mistyped "mbcs" as "mcbs". But note that I did it consistently ;-)
I.e. it was still the same encoding, even if maybe non-existent...?

If I try again using "mbcs" consistently, I still get the same error:
$ python
Python 2.5.1 (r251:54869, Apr 18 2007, 22:08:04)
[GCC 4.0.1 (Apple Computer, Inc. build 5367)] on darwin
Type "help", "copyright", "credits" or "license" for more information.

>>> unicode('', 'mbcs')
u''
>>> unicode('', 'mbcs').encode('mbcs')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
LookupError: unknown encoding: mbcs
>>>

mario

Jan 2 '08 #3

John Machin

On Jan 2, 7:45 pm, mario <ma...@ruggier.org> wrote:

> On Jan 2, 9:30 am, "Martin v. Löwis" <mar...@v.loewis.de> wrote:
>
> > Use "mbcs" in the second call, not "mcbs".
>
> Ooops, sorry about that. When I switched to test it in the interpreter
> I mistyped "mbcs" as "mcbs". But note that I did it consistently ;-)
> I.e. it was still the same encoding, even if maybe non-existent...?
>
> If I try again using "mbcs" consistently, I still get the same error:
>
> $ python
> Python 2.5.1 (r251:54869, Apr 18 2007, 22:08:04)
> [GCC 4.0.1 (Apple Computer, Inc. build 5367)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> >>> unicode('', 'mbcs')
> u''
> >>> unicode('', 'mbcs').encode('mbcs')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> LookupError: unknown encoding: mbcs

Two things for you to do:

(1) Try these at the Python interactive prompt:

unicode('', 'latin1')
unicode('', 'mbcs')
unicode('', 'raboof')
unicode('abc', 'latin1')
unicode('abc', 'mbcs')
unicode('abc', 'raboof')

(2) Read what the manual (Library Reference -> codecs module ->
standard encodings) has to say about mbcs.

Jan 2 '08 #4
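
The six cases in step (1) can be run in one loop. This is a minimal sketch in Python 3 syntax, which is an assumption since the thread uses Python 2.5: `str(data, encoding)` stands in for `unicode(s, encoding)`, and the helper name `try_decode` is hypothetical. The results for empty input and for 'mbcs' depend on the interpreter version and platform, so no output is shown.

```python
def try_decode(data, encoding):
    """Describe what str(data, encoding) does for the given bytes."""
    try:
        return repr(str(data, encoding))
    except LookupError as exc:
        # Raised when no codec is registered under this name.
        return "LookupError: %s" % exc
    except UnicodeDecodeError as exc:
        # The codec exists but the bytes are not valid in it.
        return "UnicodeDecodeError: %s" % exc


# Python 3 counterparts of the six cases: empty and non-empty input,
# each with a common codec, a Windows-only codec and a made-up name.
for data in (b"", b"abc"):
    for encoding in ("latin1", "mbcs", "raboof"):
        print("%-6r %-8s -> %s" % (data, encoding, try_decode(data, encoding)))
```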

John Machin

On Jan 2, 8:44 pm, John Machin <sjmac...@lexicon.net> wrote:

> (1) Try these at the Python interactive prompt:
>
> unicode('', 'latin1')

Also use those 6 cases to check out the difference in behaviour
between unicode(x, y) and x.decode(y)

Jan 2 '08 #5

mario

On Jan 2, 10:44 am, John Machin <sjmac...@lexicon.net> wrote:

> Two things for you to do:
>
> (1) Try these at the Python interactive prompt:
>
> unicode('', 'latin1')
> unicode('', 'mbcs')
> unicode('', 'raboof')
> unicode('abc', 'latin1')
> unicode('abc', 'mbcs')
> unicode('abc', 'raboof')

$ python
Python 2.5.1 (r251:54869, Apr 18 2007, 22:08:04)
[GCC 4.0.1 (Apple Computer, Inc. build 5367)] on darwin
Type "help", "copyright", "credits" or "license" for more information.

>>> unicode('', 'mbcs')
u''
>>> unicode('abc', 'mbcs')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
LookupError: unknown encoding: mbcs
>>>

Hmmn, strange. Same behaviour for "raboof".

> (2) Read what the manual (Library Reference -> codecs module ->
> standard encodings) has to say about mbcs.

The page at http://docs.python.org/lib/standard-encodings.html gives this
"purpose" for mbcs:
Windows only: Encode operand according to the ANSI codepage (CP_ACP)

I do not know what the implications of encoding according to "ANSI
codepage (CP_ACP)" are. "Windows only" seems clear, but why does it only
complain when decoding a non-empty string (or when encoding the empty
unicode string)?

mario

Jan 2 '08 #6

John Machin

On Jan 2, 9:57 pm, mario <ma...@ruggier.org> wrote:

> On Jan 2, 10:44 am, John Machin <sjmac...@lexicon.net> wrote:
>
> > Two things for you to do:
> >
> > (1) Try these at the Python interactive prompt:
> >
> > unicode('', 'latin1')
> > unicode('', 'mbcs')
> > unicode('', 'raboof')
> > unicode('abc', 'latin1')
> > unicode('abc', 'mbcs')
> > unicode('abc', 'raboof')
>
> $ python
> Python 2.5.1 (r251:54869, Apr 18 2007, 22:08:04)
> [GCC 4.0.1 (Apple Computer, Inc. build 5367)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> >>> unicode('', 'mbcs')
> u''
> >>> unicode('abc', 'mbcs')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> LookupError: unknown encoding: mbcs
>
> Hmmn, strange. Same behaviour for "raboof".
>
> > (2) Read what the manual (Library Reference -> codecs module ->
> > standard encodings) has to say about mbcs.
>
> The page at http://docs.python.org/lib/standard-encodings.html gives this
> "purpose" for mbcs:
> Windows only: Encode operand according to the ANSI codepage (CP_ACP)
>
> I do not know what the implications of encoding according to "ANSI
> codepage (CP_ACP)" are.

Neither do I. YAGNI (especially on darwin) so don't lose any sleep
over it.

> "Windows only" seems clear, but why does it only
> complain when decoding a non-empty string (or when encoding the empty
> unicode string)?

My presumption: because it doesn't need a codec to decode '' into u'';
no failed codec look-up, so no complaint. Any realistic app will try
to decode a non-empty string sooner or later.

Jan 2 '08 #7
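
If the empty-input shortcut is a concern, one workaround is to look the codec up explicitly before decoding, so that an unknown name fails regardless of the input. This is a minimal sketch in Python 3 syntax (an assumption; the thread uses Python 2.5). `codecs.lookup` is the standard-library call; `decode_checked` is a hypothetical helper name.

```python
import codecs


def decode_checked(data, encoding):
    """Decode bytes, failing on an unknown codec name even for empty input."""
    # codecs.lookup raises LookupError if no codec is registered for the name.
    codec_info = codecs.lookup(encoding)
    # The codec's own decode function returns (text, bytes_consumed).
    text, _consumed = codec_info.decode(data)
    return text
```

With this helper, `decode_checked(b"", "raboof")` raises LookupError instead of quietly returning an empty string.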

mario

On Jan 2, 12:28 pm, John Machin <sjmac...@lexicon.net> wrote:

> On Jan 2, 9:57 pm, mario <ma...@ruggier.org> wrote:
>
> > I do not know what the implications of encoding according to "ANSI
> > codepage (CP_ACP)" are.
>
> Neither do I. YAGNI (especially on darwin) so don't lose any sleep
> over it.
>
> > "Windows only" seems clear, but why does it only
> > complain when decoding a non-empty string (or when encoding the empty
> > unicode string)?
>
> My presumption: because it doesn't need a codec to decode '' into u'';
> no failed codec look-up, so no complaint. Any realistic app will try
> to decode a non-empty string sooner or later.

Yes, I suspect I will never need it ;)

Incidentally, this came up in a script that tries to guess a file's
encoding: it bombed on the file ".svn/empty-file". That it got so far
with an empty string was really due to a bug elsewhere in the script,
trivially fixed. Still, I was curious about this asymmetric behaviour
for the empty string with some encodings.

Anyhow, thanks a lot to both of you for the great feedback!

mario

Jan 2 '08 #8

Piet van Oostrum

>>>>> mario <ma***@ruggier.org> (M) wrote:

>M> $ python
>M> Python 2.5.1 (r251:54869, Apr 18 2007, 22:08:04)
>M> [GCC 4.0.1 (Apple Computer, Inc. build 5367)] on darwin
>M> Type "help", "copyright", "credits" or "license" for more information.
>M> >>> unicode('', 'mbcs')
>M> u''
>M> >>> unicode('abc', 'mbcs')
>M> Traceback (most recent call last):
>M>   File "<stdin>", line 1, in <module>
>M> LookupError: unknown encoding: mbcs
>M> >>>

>M> Hmmn, strange. Same behaviour for "raboof".

Apparently for the empty string the encoding is irrelevant as it will not
be used. I guess there is an early check for this special case in the code.
--
Piet van Oostrum <pi**@cs.uu.nl>
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: pi**@vanoostrum.org

Jan 2 '08 #9

Martin v. Löwis

> I do not know what the implications of encoding according to "ANSI
> codepage (CP_ACP)" are. "Windows only" seems clear, but why does it only
> complain when decoding a non-empty string (or when encoding the empty
> unicode string)?

It has no implications for this issue here. CP_ACP is a Microsoft
invention of a specific encoding alias - the "ANSI code page"
(as Microsoft calls it) is not a specific encoding where I could
specify a mapping from bytes to characters, but instead a
system-global indirection based on a language default. For example,
in the Western-European/U.S. version of Windows, the default for
CP_ACP is cp1252 (a local installation may change that default,
system-wide).

The issue likely has the cause that Piet also guessed: If the
input is an empty string, no attempt to actually perform an
encoding is done, but the output is assumed to be an empty
string again. This is correct behavior for all codecs that Python
supports in its default installation, at least for the direction
bytes->unicode. For the reverse direction, such an optimization
would be incorrect; consider u"".encode("utf-16").

HTH,
Martin

Jan 2 '08 #10
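
Martin's counter-example can be checked directly: encoding the empty string as UTF-16 still emits a byte order mark, so the result is not empty. This is a minimal sketch in Python 3 syntax (an assumption; the thread's `u""` is written as a plain `""` literal there).

```python
import codecs

encoded = "".encode("utf-16")

# The generic "utf-16" codec writes a byte order mark in the platform's
# native byte order before any characters, so empty text yields 2 bytes.
assert len(encoded) == 2
assert encoded in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# Codecs with an explicit byte order, and UTF-8, write nothing for empty text.
assert "".encode("utf-16-le") == b""
assert "".encode("utf-8") == b""
```

So a shortcut that maps empty input to empty output without consulting the codec would give the wrong answer in the encoding direction.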

mario

On Jan 2, 2:25 pm, Piet van Oostrum <p...@cs.uu.nl> wrote:

> Apparently for the empty string the encoding is irrelevant as it will not
> be used. I guess there is an early check for this special case in the code.

In the module I am working on[*] I am remembering a failed encoding
to allow me, if necessary, to later re-process fewer encodings. In the
case of an empty string AND an unknown encoding this strategy
failed...

Anyhow, the question is, should the behaviour be the same for these
operations, and if so what should it be:

u"".encode("non-existent")
unicode("", "non-existent")

mario
[*] a module to decode heuristically, that imho is actually starting
to look quite good, it is at http://gizmojo.org/code/decodeh/ and any
comments very welcome.

Jan 3 '08 #11

John Machin

On Jan 4, 8:03 am, mario <ma...@ruggier.org> wrote:

> On Jan 2, 2:25 pm, Piet van Oostrum <p...@cs.uu.nl> wrote:
>
> > Apparently for the empty string the encoding is irrelevant as it will not
> > be used. I guess there is an early check for this special case in the code.
>
> In the module I am working on[*] I am remembering a failed encoding
> to allow me, if necessary, to later re-process fewer encodings.

If you were in fact doing that, you would not have had a problem. What
you appear to have been doing is (a) remembering a NON-failing
encoding, and assuming that it would continue not to fail (b) not
differentiating between failure reasons (codec doesn't exist, input
not consistent with specified encoding).

A good strategy when dealing with encodings that are unknown (in the
sense that they come from user input, or a list of encodings you got
out of the manual, or are constructed on the fly (e.g. encoding = 'cp'
+ str(code_page_number) # old MS Excel files)) is to try to decode
some vanilla ASCII alphabetic text, so that you can give an immediate
in-context error message.

> In the
> case of an empty string AND an unknown encoding this strategy
> failed...
>
> Anyhow, the question is, should the behaviour be the same for these
> operations, and if so what should it be:
>
> u"".encode("non-existent")
> unicode("", "non-existent")

Perhaps you should make TWO comparisons:
(1)
unistrg = strg.decode(encoding)
with
unistrg = unicode(strg, encoding)
[the latter "optimises" the case where strg is ''; the former can't
because its output may be '', not u'', depending on the encoding, so
it must do the lookup]
(2)
unistrg = strg.decode(encoding)
with
strg = unistrg.encode(encoding)
[both always do the lookup]

In any case, a pointless question (IMHO); the behaviour is extremely
unlikely to change, as the chance of breaking existing code outvotes
any desire to clean up a minor inconsistency that is easily worked
around.

Jan 3 '08 #12
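
The probing strategy John describes, together with telling the two failure reasons apart, might look like the following. This is a minimal sketch in Python 3 syntax (an assumption; the thread uses Python 2.5); the names `PROBE` and `classify_decode` and the status strings are hypothetical, not part of any library.

```python
PROBE = b"abcdefghijklmnopqrstuvwxyz"


def classify_decode(data, encoding):
    """Return (status, text); status names the reason for any failure."""
    try:
        # Decoding non-empty ASCII text first forces the codec lookup, so
        # an unknown name is reported even when `data` is empty.
        PROBE.decode(encoding)
    except LookupError:
        return ("unknown-codec", None)
    except UnicodeError:
        # The codec exists but rejects plain ASCII letters; let the real
        # input decide below.
        pass
    try:
        return ("ok", data.decode(encoding))
    except UnicodeError:
        # The codec exists, but the input is not consistent with it.
        return ("bad-input", None)
```

A guessing routine could then drop an "unknown-codec" name permanently, while a "bad-input" result only rules the encoding out for that one input.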

mario

On Jan 4, 12:02 am, John Machin <sjmac...@lexicon.net> wrote:

> On Jan 4, 8:03 am, mario <ma...@ruggier.org> wrote:
>
> > On Jan 2, 2:25 pm, Piet van Oostrum <p...@cs.uu.nl> wrote:
> >
> > > Apparently for the empty string the encoding is irrelevant as it will not
> > > be used. I guess there is an early check for this special case in the code.
> >
> > In the module I am working on[*] I am remembering a failed encoding
> > to allow me, if necessary, to later re-process fewer encodings.
>
> If you were in fact doing that, you would not have had a problem. What
> you appear to have been doing is (a) remembering a NON-failing
> encoding, and assuming that it would continue not to fail

Yes, exactly. But it makes no difference which ones I remember, as the
two subsets will anyway always add up to the same thing. In this
special case (empty string!) the unicode() call does not fail...

> (b) not
> differentiating between failure reasons (codec doesn't exist, input
> not consistent with specified encoding).

There is no failure in the first pass in this case... if I do as you
suggest further down, that is to use s.decode(encoding) instead of
unicode(s, encoding) to force the lookup, then I could remember the
failure reason to be able to make a decision about how to proceed.
However I am aiming at an automatic decision, thus an in-context error
message would need to be replaced with more rigorous information about
how the guessing should proceed. I am also trying to keep this simple ;)

<snip>

> In any case, a pointless question (IMHO); the behaviour is extremely
> unlikely to change, as the chance of breaking existing code outvotes
> any desire to clean up a minor inconsistency that is easily worked
> around.

Yes, I would agree. The workaround may not even be worth it though,
as what I really want is a unicode object, so changing from calling
unicode() to s.decode() is not quite right, and will anyway require a
further check. Less clear code, and a small unnecessary performance
hit for the 99.9% majority of cases... Anyhow, I have improved a little
further the "post guess" checking/refining logic of the algorithm[*].

What I'd like to understand better is the "compatibility hierarchy" of
known encodings, in the positive sense that if a string decodes
successfully with encoding A, then it is also possible that it will
encode with encodings B, C; and in the negative sense that if a
string fails to decode with encoding A, then for sure it will also
fail to decode with encodings B, C. Any ideas if such an analysis of
the relationships between encodings exists?

Thanks! mario
[*] http://gizmojo.org/code/decodeh/

Jan 12 '08 #13

Martin v. Löwis

> What I'd like to understand better is the "compatibility hierarchy" of
> known encodings, in the positive sense that if a string decodes
> successfully with encoding A, then it is also possible that it will
> encode with encodings B, C; and in the negative sense that if a
> string fails to decode with encoding A, then for sure it will also
> fail to decode with encodings B, C. Any ideas if such an analysis of
> the relationships between encodings exists?

Most certainly. You'll have to learn a lot about many encodings though
to really understand the relationships.

Many encodings X are "ASCII supersets", in the sense that if you have
only characters in the ASCII set, the encoding of the string in ASCII
is the same as the encoding of the string in X. ISO-8859-X, ISO-2022-X,
koi8-x, and UTF-8 fall in this category.

Other encodings are "ASCII supersets" only in the sense that they
include all characters of ASCII, but encode them differently. EBCDIC
and UCS-2/4, UTF-16/32 fall in that category.

Some encodings are 7-bit, so that they decode as ASCII (producing
moji-bake if the input wasn't ASCII). ISO-2022-X is an example.

Some encodings are 8-bit, so that they can decode arbitrary bytes
(again producing moji-bake if the input wasn't that encoding).
ISO-8859-X are examples, as are some of the EBCDIC encodings, and
koi8-x. Also, things will successfully (but meaninglessly) decode
as UTF-16 if the number of bytes in the input is even (likewise
for UTF-32).

HTH,
Martin

Jan 12 '08 #14
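
A few of the relationships Martin lists can be checked with the codecs that ship with Python. This is a minimal sketch in Python 3 syntax (an assumption; the thread uses Python 2.5), using `cp500` as one example of an EBCDIC code page; the variable names are illustrative only.

```python
text = "abc"
ascii_bytes = text.encode("ascii")

# ASCII-compatible encodings: ASCII text encodes to the same bytes.
for name in ("iso-8859-1", "koi8-r", "utf-8"):
    assert text.encode(name) == ascii_bytes

# Encodings that contain the ASCII characters but encode them differently.
for name in ("cp500", "utf-16-le", "utf-32-le"):
    assert text.encode(name) != ascii_bytes

# A full 8-bit encoding such as ISO-8859-1 decodes every possible byte value.
all_bytes = bytes(range(256))
assert len(all_bytes.decode("iso-8859-1")) == 256

# An even number of bytes can decode as UTF-16 even when the result is
# meaningless; an odd number of bytes is rejected as truncated.
assert len(b"abcd".decode("utf-16-le")) == 2
try:
    b"abc".decode("utf-16-le")
except UnicodeDecodeError:
    odd_length_decodes = False
else:
    odd_length_decodes = True
assert not odd_length_decodes
```

For a guessing module this suggests that a successful decode with an 8-bit codec carries very little evidence, whereas a failure with a stricter codec such as UTF-8 is informative.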
