unicode(s, enc).encode(enc) == s ?

mario

I have checks in code, to ensure a decode/encode cycle returns the
original string.

Given no UnicodeErrors, are there any cases for the following not to
be True?

unicode(s, enc).encode(enc) == s

mario

Dec 27 '07 #1

Subscribe Reply

2394

=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=

Given no UnicodeErrors, are there any cases for the following not to

be True?

unicode(s, enc).encode(enc) == s

Certainly. ISO-2022 is famous for having ambiguous encodings. Try
these:

unicode("Hallo","iso-2022-jp")
unicode("\x1b(BHallo","iso-2022-jp")
unicode("\x1b(JHallo","iso-2022-jp")
unicode("\x1b(BHal\x1b(Jlo","iso-2022-jp")

or likewise

unicode("\x1b$@BB","iso-2022-jp")
unicode("\x1b$BBB","iso-2022-jp")

In iso-2022-jp-3, there are even more ways to encode the same string.

Regards,
Martin

Dec 27 '07 #2

mario

On Dec 27, 7:37 pm, "Martin v. Löwis" <mar...@v.loewis.dewrote:

Certainly. ISO-2022 is famous for having ambiguous encodings. Try
these:

unicode("Hallo","iso-2022-jp")
unicode("\x1b(BHallo","iso-2022-jp")
unicode("\x1b(JHallo","iso-2022-jp")
unicode("\x1b(BHal\x1b(Jlo","iso-2022-jp")

or likewise

unicode("\x1b$@BB","iso-2022-jp")
unicode("\x1b$BBB","iso-2022-jp")

In iso-2022-jp-3, there are even more ways to encode the same string.

Wow, that's not easy to see why would anyone ever want that? Is there
any logic behind this?

In your samples both of unicode("\x1b(BHallo","iso-2022-jp") and
unicode("\x1b(JHallo","iso-2022-jp") give u"Hallo" -- does this mean
that the ignored/lost bytes in the original strings are not illegal
but *represent nothing* in this encoding?

I.e. in practice (in a context limited to the encoding in question)
should this be considered as a data loss, or should these strings be
considered "equivalent"?

Thanks!

mario

Dec 28 '07 #3

Marc 'BlackJack' Rintsch

On Fri, 28 Dec 2007 03:00:59 -0800, mario wrote:

On Dec 27, 7:37 pm, "Martin v. LÃ¶wis" <mar...@v.loewis.dewrote:
>Certainly. ISO-2022 is famous for having ambiguous encodings. Try
these:

unicode("Hallo","iso-2022-jp")
unicode("\x1b(BHallo","iso-2022-jp")
unicode("\x1b(JHallo","iso-2022-jp")
unicode("\x1b(BHal\x1b(Jlo","iso-2022-jp")

or likewise

unicode("\x1b$@BB","iso-2022-jp")
unicode("\x1b$BBB","iso-2022-jp")

In iso-2022-jp-3, there are even more ways to encode the same string.

Wow, that's not easy to see why would anyone ever want that? Is there
any logic behind this?

In your samples both of unicode("\x1b(BHallo","iso-2022-jp") and
unicode("\x1b(JHallo","iso-2022-jp") give u"Hallo" -- does this mean
that the ignored/lost bytes in the original strings are not illegal
but *represent nothing* in this encoding?

They are not lost or ignored but escape sequences that tell how the
following bytes should be interpreted '\x1b(B' switches to ASCII and
'\x1b(J' to some "roman" encoding which is a superset of ASCII, so it
doesn't matter which one you choose unless the following bytes are all
ASCII. And of course you can use that escape prefix as often as you want
within a string of ASCII byte values.

http://en.wikipedia.org/wiki/ISO-202...Character_Sets

I.e. in practice (in a context limited to the encoding in question)
should this be considered as a data loss, or should these strings be
considered "equivalent"?

Equivalent I would say. As Unicode they contain the same characters.
Just differently encoded as bytes.

Ciao,
Marc 'BlackJack' Rintsch

Dec 28 '07 #4

=?UTF-8?B?Ik1hcnRpbiB2LiBMw7Z3aXMi?=

Wow, that's not easy to see why would anyone ever want that? Is there

any logic behind this?

It's the pre-Unicode solution to the "we want to have many characters
encoded in a single file" problem.

Suppose you have pre-defined characters sets A, B, C, and you want text
to contain characters from all three sets, one possible encoding is

<switch-to-A>CharactersInA<switch-to-B>CharactersFromB<and-so-on>

Now also suppose that A, B, and C are not completely different, but
have slight overlap - and you get ambiguous encodings.

ISO-2022 works that way. IPSJ maintains a registry of character
sets for ISO, and assigns escape codes to them. There are currently
about 200 character sets registered.

Somebody decoding this would have to know all the character sets
(remember it's a growing registry), hence iso-2022-jp restricts
the character sets that you can use for that particular encoding.
(Likewise, iso-2022-kr also restricts it, but to a different set
of sets).

It's a mess, sure, and one of the primary driving force of Unicode
(which even has the unification - ie. lack of ambiguity - in
its name).

In your samples both of unicode("\x1b(BHallo","iso-2022-jp") and
unicode("\x1b(JHallo","iso-2022-jp") give u"Hallo" -- does this mean
that the ignored/lost bytes in the original strings are not illegal
but *represent nothing* in this encoding?

See above, and Marc's explanation.
ESC ( B switches to "ISO 646, USA Version X3.4 - 1968";
ESC ( J to "ISO 646, Japanese Version for Roman Characters JIS C6220-1969"

These are identical, except for the following differences:
- The USA version has "reverse solidus" at 5/12; the Japanese
version "Yen sign"
- The USA version has "Tilde (overline; general accent)" at
7/14 (depicted as tilde); the Japanese version "Overline"
(depicted as straight overline)
- The Japanese version specifies that you can switch between
roman and katakana mode by sending shift out (SO, '\x0e')
and shift-in (SI, '\x0F') respectively; this switches to
the "JIS KATAKANA character set".
(source:
http://www.itscj.ipsj.or.jp/ISO-IR/006.pdf
http://www.itscj.ipsj.or.jp/ISO-IR/014.pdf
)

I.e. in practice (in a context limited to the encoding in question)
should this be considered as a data loss, or should these strings be
considered "equivalent"?

These particular differences should be considered as irrelevant. There
are some cases where Unicode had introduced particular compatibility
characters to accommodate such encodings (specifically, the "full-width"
latin (*) and "half-width" Japanese characters). Good codecs are
supposed to round-trip the relevant differences to Unicode, and generate
the appropriate compatibility characters.

Bad codecs might not, and in some cases, users might complain that
certain compatibility characters are lacking in Unicode so that correct
round-tripping is not possible. I believe the Unicode consortium has
resolved all these complaints by adding the missing characters; but
I'm not sure.

Regards,
Martin

(*) As an example for full-width characters, consider these
two strings:
ï¼¨ï½…ï½Œï½Œï½
Hello
Should they be equivalent, or not? They are under NFKD, but not
NFD.

Dec 29 '07 #5

mario

Thanks a lot Martin and Marc for the really great explanations! I was
wondering if it would be reasonable to imagine a utility that will
determine whether, for a given encoding, two byte strings would be
equivalent. But I think such a utility will require *extensive*
knowledge about many bizarrities of many encodings -- and has little
chance of being pretty!

In any case, it goes well beyond the situation that triggered my
original question in the first place, that basically was to provide a
reasonable check on whether round-tripping a string is successful --
this is in the context of a small utility to guess an encoding and to
use it to decode a byte string. This utility module was triggered by
one that Skip Montanaro had written some time ago, but I wanted to add
and combine several ideas and techniques (and support for my usage
scenarios) for guessing a string's encoding in one convenient place. I
provide a write-up and the code for it here:

http://gizmojo.org/code/decodeh/

I will be very interested in any remarks any of you may have!

Best regards, mario

Jan 2 '08 #6

=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=

Thanks a lot Martin and Marc for the really great explanations! I was

wondering if it would be reasonable to imagine a utility that will
determine whether, for a given encoding, two byte strings would be
equivalent.

But that is much easier to answer:

s1.decode(enc) == s2.decode(enc)

Assuming Unicode's unification, for a single encoding, this should
produce correct results in all cases I'm aware of.

If the you also have different encodings, you should add

def normal_decode(s, enc):
return unicode.normalize("NFKD", s.decode(enc))

normal_decode(s1, enc) == normal_decode(s2, enc)

This would flatten out compatibility characters, and ambiguities
left in Unicode itself.

But I think such a utility will require *extensive*
knowledge about many bizarrities of many encodings -- and has little
chance of being pretty!

See above.

In any case, it goes well beyond the situation that triggered my
original question in the first place, that basically was to provide a
reasonable check on whether round-tripping a string is successful --
this is in the context of a small utility to guess an encoding and to
use it to decode a byte string. This utility module was triggered by
one that Skip Montanaro had written some time ago, but I wanted to add
and combine several ideas and techniques (and support for my usage
scenarios) for guessing a string's encoding in one convenient place.

Notice that this algorithm is not capable of detecting the ISO-2022
encodings - they look like ASCII to this algorithm. This is by design,
as the encoding was designed to only use 7-bit bytes, so that you can
safely transport them in Email and such (*)

If you want to add support for ISO-2022, you should look for escape
characters, and then check whether the escape sequences are among
the ISO-2022 ones:
- ESC ( - 94-character graphic character set, G0
- ESC ) - 94-character graphic character set, G1
- ESC * - 94-character graphic character set, G2
- ESC + - 94-character graphic character set, G3
- ESC - - 96-character graphic character set, G1
- ESC . - 96-character graphic character set, G2
- ESC / - 96-character graphic character set, G3
- ESC $ - Multibyte
( G0
) G1
* G2
+ G3
- ESC % - Non-ISO-2022 (e.g. UTF-8)

If you see any of these, it should be ISO-2022; see
the Wiki page as to what subset may be in use.

G0..G3 means what register the character set is loaded
into; when you have loaded a character set into a register,
you can switch between registers through ^N (to G1),
^O (to G0), ESC n (to G2), ESC o (to G3) (*)

http://gizmojo.org/code/decodeh/

I will be very interested in any remarks any of you may have!

From a shallow inspection, it looks right. I would have spelled
"losses" as "loses".

Regards,
Martin

(*) For completeness: ISO-2022 also supports 8-bit characters,
and there are more control codes to shift between the various
registers.

Jan 2 '08 #7

mario

Thanks again. I will chunk my responses as your message has too much
in it for me to process all at once...

On Jan 2, 9:34 pm, "Martin v. Löwis" <mar...@v.loewis.dewrote:

Thanks a lot Martin and Marc for the really great explanations! I was
wondering if it would be reasonable to imagine a utility that will
determine whether, for a given encoding, two byte strings would be
equivalent.

But that is much easier to answer:

s1.decode(enc) == s2.decode(enc)

Assuming Unicode's unification, for a single encoding, this should
produce correct results in all cases I'm aware of.

If the you also have different encodings, you should add

def normal_decode(s, enc):
return unicode.normalize("NFKD", s.decode(enc))

normal_decode(s1, enc) == normal_decode(s2, enc)

This would flatten out compatibility characters, and ambiguities
left in Unicode itself.

Hmmn, true, it would be that easy.

I am now not sure why I needed that check, or how to use this version
of it... I am always starting from one string, and decoding it... that
may be lossy when that is re-encoded, and compared to original.
However it is clear that the test above should always pass in this
case, so doing it seems superfluos.

Thanks for the unicodedata.normalize() tip.

mario

Jan 3 '08 #8

mario

On Jan 2, 9:34 pm, "Martin v. Löwis" <mar...@v.loewis.dewrote:

In any case, it goes well beyond the situation that triggered my
original question in the first place, that basically was to provide a
reasonable check on whether round-tripping a string is successful --
this is in the context of a small utility to guess an encoding and to
use it to decode a byte string. This utility module was triggered by
one that Skip Montanaro had written some time ago, but I wanted to add
and combine several ideas and techniques (and support for my usage
scenarios) for guessing a string's encoding in one convenient place.

Notice that this algorithm is not capable of detecting the ISO-2022
encodings - they look like ASCII to this algorithm. This is by design,
as the encoding was designed to only use 7-bit bytes, so that you can
safely transport them in Email and such (*)

Well, one could specify decode_heuristically(s, enc="iso-2022-jp") and
that
encoding will be checked before ascii or any other encoding in the
list.

If you want to add support for ISO-2022, you should look for escape
characters, and then check whether the escape sequences are among
the ISO-2022 ones:
- ESC ( - 94-character graphic character set, G0
- ESC ) - 94-character graphic character set, G1
- ESC * - 94-character graphic character set, G2
- ESC + - 94-character graphic character set, G3
- ESC - - 96-character graphic character set, G1
- ESC . - 96-character graphic character set, G2
- ESC / - 96-character graphic character set, G3
- ESC $ - Multibyte
( G0
) G1
* G2
+ G3
- ESC % - Non-ISO-2022 (e.g. UTF-8)

If you see any of these, it should be ISO-2022; see
the Wiki page as to what subset may be in use.

G0..G3 means what register the character set is loaded
into; when you have loaded a character set into a register,
you can switch between registers through ^N (to G1),
^O (to G0), ESC n (to G2), ESC o (to G3) (*)

OK, suppose we do not know the string is likely to be iso-2022, but we
still want to detect it if it is. I have added a "may_do_better"
mechanism to the algorithm, to add special checks on a *guessed*
algorithm. I am not sure this will not however introduce more or other
problems than the one it is addressing...

I have re-instated checks for iso-8859-1 control chars (likely to be
cp1252), for special symbols in iso-8859-15 when they occur in
iso-8859-1 and cp1252, and for the iso-2022-jp escape sequences. To
flesh out with other checks is mechanical work...

If you could take a look at the updated page:

http://gizmojo.org/code/decodeh/

I still have issues with what happens in situations when for example a
file contains iso-2022 esc sequences but is anyway actally in ascii
or utf-8? e.g. this mail message! I'll let this issue turn for a
little while...

I will be very interested in any remarks any of you may have!

From a shallow inspection, it looks right. I would have spelled
"losses" as "loses".

Yes, corrected.

Thanks, mario

Jan 3 '08 #9

Similar topics

2711

unicode encoding usablilty problem

by: aurora | last post by:

I have long find the Python default encoding of strict ASCII frustrating. For one thing I prefer to get garbage character than an exception. But the biggest issue is Unicode exception often pop up...

Python

2200

unicode(obj, errors='foo') raises TypeError - bug?

by: Mike Brown | last post by:

This works as expected (this is on an ASCII terminal): >>> unicode('asdf\xff', errors='replace') u'asdf\ufffd' This does not work as I expect it to: >>> class C: .... def __str__(self):

Python

1755

ASCII or Unicode

by: Scott Duckworth | last post by:

Can anyone provide a quick code snippit to open a text file and tell if it's ASCII or Unicode? Thanks

Visual Basic .NET

4067

Unicode in MIMEText

by: Damjan | last post by:

Why doesn't this work: from email.MIMEText import MIMEText msg = MIMEText(u'\u043a\u0438\u0440\u0438\u043b\u0438\u0446\u0430') msg.set_charset('utf-8') msg.as_string() Traceback (most recent...

Python

4182

Unicode & Pythonwin / win32 / console?

by: Robert | last post by:

Hello, I'm using Pythonwin and py2.3 (py2.4). I did not come clear with this: I want to use win32-fuctions like win32ui.MessageBox, listctrl.InsertItem ..... to get unicode strings on the...

Python

1840

Writing a Unicode String into file

by: kenny | last post by:

Hello, I am trying to read the contents of 01.bin (unicode) into a String, to modify it and finally to write it back into an other file named 02.bin. If the file really contains "a b c", then...

Visual Basic .NET

1075

Conversion: hex to unicode

by: kenny | last post by:

Hello, I have a problem converting a hexadecimal string to unicode text and save it to a file. I do not know how to make it, so I ask here... And I had much problems replacing unicode strings...

Visual Basic .NET

1574

unicode, bytes redux

by: willie | last post by:

(beating a dead horse) Is it too ridiculous to suggest that it'd be nice if the unicode object were to remember the encoding of the string it was decoded from? So that it's feasible to...

Python

1279

Issue with Base64Encoding for Unicode Data

by: amollokhande1 | last post by:

Hi All, Currently we are facing an issue while decoding the Base64Encoded unicode data. Here is the scenario We have one custom javascript function that encodes the unicode data using Base64...

Visual Basic .NET

7055

What is ONU?

by: marktang | last post by:

ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...

General

6920

Changing the language in Windows 10

by: Hystou | last post by:

Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...

Windows Server

7106

Maximizing Business Potential: The Nexus of Website Design and Digital Marketing

by: jinu1996 | last post by:

In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...

Online Marketing

6760

The easy way to turn off automatic updates for Windows 10/11

by: Hystou | last post by:

Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows...

Windows Server

7022

Discussion: How does Zigbee compare with other wireless protocols in smart home applications?

by: tracyyun | last post by:

Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each...

General

4501

Couldn’t get equations in html when convert word .docx file to html file in C#.

by: conductexam | last post by:

I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and...

C# / C Sharp

1311

transfer the data from one system to another through ip address

by: 6302768590 | last post by:

Hai team i want code for transfer the data from one system to another through IP address by using C# our system has to for every 5mins then we have to update the data what the data is updated ...

C# / C Sharp

572

php

by: muto222 | last post by:

How can i add a mobile payment intergratation into php mysql website.

PHP

206

Comprehensive Guide to Website Development in Toronto: Expert Insights from BSMN Consultancy

by: bsmnconsultancy | last post by:

In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence...

General