How to get the encoding of a value?

Hi all,

I have some data that is encoded in some way, and I want to find out the encoding of that piece of data. For example:
s = u"somedata"
I want to do something like:
ThisIsTheEncodingOfS = s.getencoding()

Is there a method that tells me that it is a unicode value if I provide it with a unicode string?
Thanks
Moiz Golawala

Jul 18 '05 #1
> I have a some data is encoded into something thing. I want to find out the
encoding of that piece of data. For example s = u"somedata"
I want to do something like
ThisIsTheEncodingOfS = s.getencoding()

is there are method that tell me that it is unicode value if I provide it
with a unicode string?


You are confusing unicode with strings that have a certain encoding.

Unicode is an abstract specification of a huge number of characters,
hopefully covering everything from the close-to-unknown glyphs of some
ancient Himalayan mountain tribe to the commonly used Latin alphabet. There
are no actual numeric values associated with those glyphs.

An encoding, on the other hand, maps certain sets of glyphs to actual numbers
- e.g. the subset of common European language glyphs commonly known as
iso-8859-1, and many more - including utf-8, an encoding that is capable of
encoding all glyphs specified in unicode, at the cost of possibly using
more than one byte per glyph.

Now if you have a unicode object u, you can _encode_ it in a certain
encoding like this:

u.encode("utf-8")

If, on the other hand, you have a string s of known encoding, you can decode
it to a unicode object like this:

s.decode("latin1")
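For readers on Python 3, where the thread's `unicode` type is called `str` and the old byte string type is `bytes`, the two calls above might look roughly like this. The sample text is an assumption chosen only for illustration.

```python
# Python 3 sketch: text (str) is encoded to bytes, bytes are decoded to text.
text = "Grüße"  # sample text, chosen only for illustration

utf8_bytes = text.encode("utf-8")      # two bytes each for 'ü' and 'ß'
latin1_bytes = text.encode("latin1")   # one byte per character

# Decoding with the matching encoding recovers the original text.
assert utf8_bytes.decode("utf-8") == text
assert latin1_bytes.decode("latin1") == text

# The same text occupies a different number of bytes per encoding.
print(len(text), len(utf8_bytes), len(latin1_bytes))  # 5 7 5
```

The byte strings differ even though both stand for the same text, which is why bytes alone do not tell you their encoding.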

Those are the basics. Now to your actual question: your example makes no
sense, as you have a unicode object - which lacks any encoding whatsoever.
And unfortunately, if you have a string instead of a unicode object, you can
only guess what encoding it has - if you are lucky, that works. But no one
can guarantee that it works out - neither in Python nor in other
programming languages.
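What can be checked reliably is the type of the value, not its encoding. A minimal sketch in Python 3 terms (where `str` is the decoded text type and `bytes` holds encoded data); the helper name `describe` is hypothetical and not part of any library:

```python
def describe(value):
    """Report whether value is decoded text or still-encoded bytes."""
    if isinstance(value, str):
        # Decoded text carries no encoding at all.
        return "text (no encoding attached)"
    if isinstance(value, (bytes, bytearray)):
        # Raw bytes do not record which encoding produced them.
        return "bytes (encoding unknown unless stated elsewhere)"
    return "neither text nor bytes"


print(describe("somedata"))
print(describe("somedata".encode("utf-8")))
```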

A common approach to guessing the encoding of said string is to try
something like this:

    s = <some string with unknown encoding>
    encodings = ['ascii', 'latin1', 'utf-8', ....]  # list of encodings you expect
    for e in encodings:
        try:
            # a clean round trip means e is at least a possible encoding
            if s == s.decode(e).encode(e):
                break
        except UnicodeError:
            pass
--
Regards,

Diez B. Roggisch
Jul 18 '05 #2
Diez B. Roggisch wrote:
A common approach to guessing the encoding of said string is to try
something like this:

    s = <some string with unknown encoding>
    encodings = ['ascii', 'latin1', 'utf-8', ....]  # list of encodings you expect
    for e in encodings:
        try:
            if s == s.decode(e).encode(e):
                break
        except UnicodeError:
            pass


However, you must be very careful with the order in which to test the
encodings. The example code will never detect "utf-8":
>>> s = "".join(map(chr, range(256)))
>>> s.decode("latin1").encode("latin1") == s
True

This equality holds for every encoding where one byte is one character and
uses the full range of 256 bytes/characters. You cannot discriminate
between such encodings using the above method:
>>> s.decode("latin1").encode("latin1") == s
True
>>> s.decode("latin2").encode("latin2") == s
True
>>> s.decode("latin2") == s.decode("latin1")
False
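The same demonstration, sketched for Python 3, where the byte string is a `bytes` object (the variable name `all_bytes` is just illustrative):

```python
# Every possible byte value, 0 through 255.
all_bytes = bytes(range(256))

for name in ("latin1", "latin2"):
    # Both codecs map every byte to exactly one character,
    # so the round trip succeeds for any input.
    assert all_bytes.decode(name).encode(name) == all_bytes

# The decoded text differs, so a successful round trip says nothing
# about which of the two encodings is the right one.
print(all_bytes.decode("latin1") == all_bytes.decode("latin2"))  # False
```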

A statistical approach seems more promising, e.g. some smart variant of
"looking for umlauts" in a text known to be German.

Peter
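One very rough way to sketch the "looking for umlauts" idea in Python 3: decode the bytes with each candidate and prefer the one that yields the most German special letters. The function name, the candidate list and the scoring rule are all assumptions made for illustration; a real detector would need far better statistics.

```python
GERMAN_LETTERS = set("äöüÄÖÜß")


def guess_german_encoding(data, candidates=("utf-8", "latin1", "cp437")):
    """Pick the candidate whose decoding yields the most German special letters."""
    best_name, best_score = None, -1
    for name in candidates:
        try:
            text = data.decode(name)
        except UnicodeDecodeError:
            continue  # this candidate cannot decode the bytes at all
        score = sum(ch in GERMAN_LETTERS for ch in text)
        if score > best_score:  # ties keep the earlier candidate
            best_name, best_score = name, score
    return best_name


sample = "Grüße aus München"
for enc in ("utf-8", "latin1", "cp437"):
    print(enc, "->", guess_german_encoding(sample.encode(enc)))
```

This only works because the text is known to be German; for text without any such letters every decodable candidate scores zero and the first one wins.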


Jul 18 '05 #3
Diez B. Roggisch <de************@web.de> wrote:
A common approach to guessing the encoding of said string is to try
something like this:

    s = <some string with unknown encoding>
    encodings = ['ascii', 'latin1', 'utf-8', ....]  # list of encodings you expect
    for e in encodings:
        try:
            if s == s.decode(e).encode(e):
                break
        except UnicodeError:
            pass


Yeah, but it doesn't work. iso-8859-x would break for any value of x;
can't tell this way if it was latin-1, or any of the others...
Alex
Jul 18 '05 #4
Alex Martelli wrote:
Yeah, but it doesn't work. iso-8859-x would break for any value of x;
can't tell this way if it was latin-1, or any of the others...


You and Peter are right of course - the first try should be utf-8. And of
course, a one-byte-based encoding will always match. I know that there are
tools out there, like recode, that try to make an educated guess by taking
the context of non-ASCII chars into account and the like.
--
Regards,

Diez B. Roggisch
Jul 18 '05 #5
>>>>> "Diez B. Roggisch" <de************@web.de> (DBR) wrote:

DBR> You are confusing unicode with strings with a certain encoding.

DBR> Unicode is an abstract specification of a huge number of characters,
DBR> hopefully covering even the close-to-unknown glyphs of some ancient
DBR> himalayan mountain tribe to the commonly used latin alphabet. There are no
DBR> actual numeric values associated with that glyphs.

You mix up characters and glyphs which makes it confusing.
There are no numeric values associated with glyphs in Unicode, but there
are numeric values associated with abstract characters.

(http://www.unicode.org/standard/WhatIsUnicode.html)
Unicode provides a unique number for every character, no matter what the
platform, no matter what the program, no matter what the language.

These numbers are called `code points'. (It says `unique' above, but later
they relax that).

But you are right regarding the encodings. The Unicode code points can be
encoded in different ways e.g. with the UTF-8 encoding.
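To make the distinction concrete, here is a small Python 3 sketch showing one character, its code point, and the different byte sequences several encodings produce for it. The euro sign is just an illustrative choice.

```python
ch = "€"

# The code point is a property of the character, independent of any encoding.
print(hex(ord(ch)))  # 0x20ac

# Each encoding turns the same code point into different bytes.
for name in ("utf-8", "utf-16-be", "cp1252"):
    print(name, ch.encode(name).hex())
```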
--
Piet van Oostrum <pi**@cs.uu.nl>
URL: http://www.cs.uu.nl/~piet [PGP]
Private email: P.***********@hccnet.nl
Jul 18 '05 #6
> You mix up characters and glyphs which makes it confusing.
There are no numeric values associated with glyphs in Unicode, but there
are numeric values associated with abstract characters.
(http://www.unicode.org/standard/WhatIsUnicode.html)
Unicode provides a unique number for every character, no matter what the
platform, no matter what the program, no matter what the language.

These numbers are called `code points'. (It says `unique' above, but later
they relax that).

But you are right regarding the encodings. The Unicode code points can be
encoded in different ways e.g. with the UTF-8 encoding.


Just checked - yup, you're right: a character might in fact be composed of
several glyphs. So they are closely related (especially in your common
western language), but not the same.

Sheesh, that stuff is always a bit more complicated than one thinks - I
usually get the application side of it right, but the inner details of
unicode are still foggy...

--
Regards,

Diez B. Roggisch
Jul 18 '05 #7
