
How to get an encoding a value?

Golawala, Moiz M (GE Infrastructure)
Guest
 
Posts: n/a
#1: Jul 18 '05
Hi all,

I have some data that is encoded in some way, and I want to find out the encoding of that piece of data. For example:
s = u"somedata"
I want to do something like
ThisIsTheEncodingOfS = s.getencoding()

Is there a method that tells me it is a unicode value if I provide it with a unicode string?


Thanks
Moiz Golawala


Diez B. Roggisch
Guest
 
Posts: n/a
#2: Jul 18 '05

re: How to get an encoding a value?


> I have some data that is encoded in some way, and I want to find out the
> encoding of that piece of data. For example s = u"somedata"
> I want to do something like
> ThisIsTheEncodingOfS = s.getencoding()
>
> Is there a method that tells me it is a unicode value if I provide it
> with a unicode string?

You are confusing unicode with strings with a certain encoding.

Unicode is an abstract specification of a huge number of characters,
hopefully covering everything from the close-to-unknown glyphs of some ancient
Himalayan mountain tribe to the commonly used Latin alphabet. There are no
actual numeric values associated with those glyphs.

An encoding, on the other hand, maps certain sets of glyphs to actual numbers
- e.g. the subset of common European language glyphs commonly known as
iso-8859-1, and many more - including utf-8, an encoding that's capable of
encoding all glyphs specified in unicode, at the cost of possibly using
more than one byte per glyph.

Now if you have a unicode object u, you can _encode_ it in a certain
encoding like this:

u.encode("utf-8")

If, on the other hand, you have a string s of known encoding, you can decode
it to a unicode object like this:

s.decode("latin1")
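
As a minimal sketch of the two calls above, here is the same round trip in Python 3, where `str` plays the role of the unicode object and `bytes` the role of the encoded string (that mapping is an assumption for illustration; the thread itself uses Python 2). The sample text is made up.

```python
# Python 3 sketch: `str` stands in for the thread's unicode object,
# `bytes` for its encoded string. The sample text is illustrative only.
text = "Grüße"  # abstract characters, no encoding attached

utf8_bytes = text.encode("utf-8")      # characters -> bytes
latin1_bytes = text.encode("latin1")

# The same text yields different byte sequences under different encodings.
print(len(latin1_bytes))  # 5: one byte per character in latin1
print(len(utf8_bytes))    # 7: ü and ß take two bytes each in UTF-8

# Decoding with the matching encoding recovers the original text.
print(utf8_bytes.decode("utf-8") == text)     # True
print(latin1_bytes.decode("latin1") == text)  # True
```

Note that nothing in `utf8_bytes` or `latin1_bytes` records which encoding produced it, which is why the encoding has to be known or guessed when decoding.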

That's the basics. Now to your actual question: your example makes no sense,
as you have a unicode object, which has no encoding whatsoever. And
unfortunately, if you have a string instead of a unicode object, you can
only guess what encoding it has. If you are lucky, that works, but no one
can guarantee it, neither in Python nor in any other
programming language.

A common approach to guessing the encoding of said string is to try
something like this:

s = <some string with unknown encoding>
encodings = ['ascii', 'latin1', 'utf-8', ....] # list of encodings you expect
for e in encodings:
    try:
        # a clean round trip means e is at least a possible encoding
        if s == s.decode(e).encode(e):
            break
    except UnicodeError:
        pass


--
Regards,

Diez B. Roggisch
Peter Otten
Guest
 
Posts: n/a
#3: Jul 18 '05

re: How to get an encoding a value?


Diez B. Roggisch wrote:

> A common approach to guessing the encoding of said string is to try
> something like this:
>
> s = <some string with unknown encoding>
> encodings = ['ascii', 'latin1', 'utf-8', ....] # list of encodings you expect
> for e in encodings:
>     try:
>         if s == s.decode(e).encode(e):
>             break
>     except UnicodeError:
>         pass

However, you must be very careful with the order in which to test the
encodings. The example code will never detect "utf-8":
>>> s = "".join(map(chr, range(256)))
>>> s.decode("latin1").encode("latin1") == s
True

This equality holds for every encoding where one byte is one character and
uses the full range of 256 bytes/characters. You cannot discriminate
between such encodings using the above method:
>>> s.decode("latin1").encode("latin1") == s
True
>>> s.decode("latin2").encode("latin2") == s
True
>>> s.decode("latin2") == s.decode("latin1")
False
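
The same effect can be reproduced in Python 3 with `bytes` in place of the byte string. This is only a sketch; `round_trips` is a hypothetical helper, not something from the posts above.

```python
def round_trips(raw, encoding):
    """Return True if raw survives decode followed by encode in `encoding`."""
    try:
        return raw.decode(encoding).encode(encoding) == raw
    except UnicodeError:
        return False

data = bytes(range(256))  # every possible byte value, once

# Single-byte codecs that define all 256 values accept anything.
print(round_trips(data, "latin1"))  # True
print(round_trips(data, "latin2"))  # True

# Stricter codecs reject byte sequences that are invalid for them.
print(round_trips(data, "ascii"))   # False
print(round_trips(data, "utf-8"))   # False

# UTF-8 encoded text still round-trips through latin1, so a loop that
# tries latin1 before utf-8 stops before it ever reaches utf-8.
utf8_data = "é".encode("utf-8")
print(round_trips(utf8_data, "latin1"))  # True
```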

A statistical approach seems more promising, e. g. some smart variant of
"looking for umlauts" in a text known to be German.

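One very crude way to picture the "looking for umlauts" idea, assuming Python 3 and a text known to be German: decode the bytes with each candidate and keep the candidate that yields the most German-specific letters. The function name, the character set, and the candidate list are all made up for this sketch; a real detector would need far better statistics.

```python
GERMAN_CHARS = set("äöüÄÖÜß")  # illustrative set, not an exhaustive one

def guess_by_umlauts(raw, candidates):
    """Return the candidate whose decoding contains the most German letters.

    Returns None if no candidate can decode `raw` at all. Ties go to the
    candidate listed first.
    """
    best, best_score = None, -1
    for enc in candidates:
        try:
            text = raw.decode(enc)
        except UnicodeError:
            continue  # not valid in this encoding, skip it
        score = sum(ch in GERMAN_CHARS for ch in text)
        if score > best_score:
            best, best_score = enc, score
    return best

candidates = ["utf-8", "iso8859_5", "latin1"]
sample = "Grüße aus München"
print(guess_by_umlauts(sample.encode("latin1"), candidates))  # latin1
print(guess_by_umlauts(sample.encode("utf-8"), candidates))   # utf-8
```

Encodings that place the umlauts at the same byte values (latin1 and latin2, for instance) score identically here, so this heuristic cannot tell them apart either.
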
Peter




Alex Martelli
Guest
 
Posts: n/a
#4: Jul 18 '05

re: How to get an encoding a value?


Diez B. Roggisch <deets.nospaaam@web.de> wrote:

> A common approach to guessing the encoding of said string is to try
> something like this:
>
> s = <some string with unknown encoding>
> encodings = ['ascii', 'latin1', 'utf-8', ....] # list of encodings you expect
> for e in encodings:
>     try:
>         if s == s.decode(e).encode(e):
>             break
>     except UnicodeError:
>         pass

Yeah, but it doesn't work. iso-8859-x would break for any value of x;
can't tell this way if it was latin-1, or any of the others...


Alex
Diez B. Roggisch
Guest
 
Posts: n/a
#5: Jul 18 '05

re: How to get an encoding a value?


Alex Martelli wrote:

> Yeah, but it doesn't work. iso-8859-x would break for any value of x;
> can't tell this way if it was latin-1, or any of the others...

You and Peter are right, of course - the first try should be utf-8. And of
course, a one-byte-based encoding will always match. I know that there are
tools out there like recode that try to make an educated guess, by taking the
context of non-ASCII characters into account and the like.
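
A minimal sketch of that ordering in Python 3, assuming `bytes` input: the strict codecs go first and a single-byte codec comes last as a catch-all. `guess_encoding` and its default candidate list are hypothetical names chosen for this example.

```python
def guess_encoding(raw, candidates=("ascii", "utf-8", "latin1")):
    """Return the first candidate that decodes `raw` without error.

    Order matters: strict codecs must come before single-byte codecs,
    because the latter accept any byte sequence.
    """
    for enc in candidates:
        try:
            raw.decode(enc)
        except UnicodeError:
            continue
        return enc
    return None

print(guess_encoding(b"abc"))              # ascii
print(guess_encoding("é".encode("utf-8")))  # utf-8
print(guess_encoding(b"\xe9"))             # latin1
```

A result of "latin1" here only means that the stricter candidates were ruled out; it does not show that the data really is latin-1 rather than some other single-byte encoding.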
--
Regards,

Diez B. Roggisch
Piet van Oostrum
Guest
 
Posts: n/a
#6: Jul 18 '05

re: How to get an encoding a value?


>>>>> "Diez B. Roggisch" <deets.nospaaam@web.de> (DBR) wrote:

DBR> You are confusing unicode with strings with a certain encoding.

DBR> Unicode is an abstract specification of a huge number of characters,
DBR> hopefully covering everything from the close-to-unknown glyphs of some ancient
DBR> Himalayan mountain tribe to the commonly used Latin alphabet. There are no
DBR> actual numeric values associated with those glyphs.

You mix up characters and glyphs which makes it confusing.
There are no numeric values associated with glyphs in Unicode, but there
are numeric values associated with abstract characters.

(http://www.unicode.org/standard/WhatIsUnicode.html)
Unicode provides a unique number for every character, no matter what the
platform, no matter what the program, no matter what the language.

These numbers are called `code points'. (It says `unique' above, but later
they relax that).

But you are right regarding the encodings. The Unicode code points can be
encoded in different ways e.g. with the UTF-8 encoding.
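
To make the distinction concrete, here is a small Python 3 sketch (the sample character is arbitrary): the code point is a property of the character, while the bytes depend on the encoding chosen.

```python
ch = "ü"

# The code point is a number assigned to the abstract character.
code_point = ord(ch)
print(f"U+{code_point:04X}")  # U+00FC

# The same code point becomes different bytes under different encodings.
encoded = {enc: ch.encode(enc).hex() for enc in ("utf-8", "utf-16-be", "latin1")}
for enc, hex_bytes in encoded.items():
    print(enc, hex_bytes)
# utf-8 c3bc
# utf-16-be 00fc
# latin1 fc
```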
--
Piet van Oostrum <piet@cs.uu.nl>
URL: http://www.cs.uu.nl/~piet [PGP]
Private email: P.van.Oostrum@hccnet.nl
Diez B. Roggisch
Guest
 
Posts: n/a
#7: Jul 18 '05

re: How to get an encoding a value?


> You mix up characters and glyphs which makes it confusing.
> There are no numeric values associated with glyphs in Unicode, but there
> are numeric values associated with abstract characters.
> (http://www.unicode.org/standard/WhatIsUnicode.html)

> Unicode provides a unique number for every character, no matter what the
> platform, no matter what the program, no matter what the language.
>
> These numbers are called `code points'. (It says `unique' above, but later
> they relax that).
>
> But you are right regarding the encodings. The Unicode code points can be
> encoded in different ways e.g. with the UTF-8 encoding.

Just checked - yup, you're right: a character might in fact be composed of
several glyphs. So they are closely related (especially in your common
western language), but not the same.

Sheesh, that stuff is always a bit more complicated than one actually thinks
- I usually get the application side of it right, but the inner details
of Unicode are still foggy...

--
Regards,

Diez B. Roggisch