How to get the encoding of a value?

Golawala, Moiz M (GE Infrastructure)

Hi all,

I have some data that is encoded in some way, and I want to find out the encoding of that piece of data. For example:
s = u"somedata"
I want to do something like:
ThisIsTheEncodingOfS = s.getencoding()

Is there a method that tells me that it is a unicode value if I provide it with a unicode string?
Thanks
Moiz Golawala

Jul 18 '05 #1


Diez B. Roggisch

> I have some data that is encoded in some way, and I want to find out the
> encoding of that piece of data. For example:
> s = u"somedata"
> I want to do something like:
> ThisIsTheEncodingOfS = s.getencoding()
>
> Is there a method that tells me that it is a unicode value if I provide it
> with a unicode string?

You are confusing unicode with strings with a certain encoding.

Unicode is an abstract specification of a huge number of characters,
hopefully covering everything from the close-to-unknown glyphs of some ancient
Himalayan mountain tribe to the commonly used Latin alphabet. There are no
actual numeric values associated with those glyphs.

An encoding, on the other hand, maps certain sets of glyphs to actual numbers
- e.g. the subset of common European language glyphs commonly known as
iso-8859-1, and many more - including utf-8, an encoding that's capable of
encoding all glyphs specified in unicode, at the cost of possibly using
more than one byte per glyph.

Now if you have a unicode object u, you can _encode_ it in a certain
encoding like this:

u.encode("utf-8")

If, on the other hand, you have a string s of known encoding, you can decode it to a
unicode object like this:

s.decode("latin1")
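
For readers on Python 3, where the unicode type is called str and encoded data is bytes, the same round trip might look like the sketch below. The sample text and the variable names are made up for illustration.

```python
# Python 3 sketch: str holds Unicode text, bytes holds encoded data.
text = "Grüße"                        # illustrative sample text

utf8_bytes = text.encode("utf-8")     # str -> bytes
latin1_bytes = text.encode("latin1")  # same text, different byte sequence

# The encoded lengths differ because utf-8 needs two bytes for ü and ß.
print(len(utf8_bytes), len(latin1_bytes))

# Decoding with the matching encoding recovers the original text.
assert utf8_bytes.decode("utf-8") == text
assert latin1_bytes.decode("latin1") == text
```

Note that the bytes themselves carry no label saying which encoding produced them; you have to know it from somewhere else.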

That's the basics. Now to your actual question: your example makes no sense,
as you have a unicode object - which lacks any encoding whatsoever. And
unfortunately, if you have a string instead of a unicode object, you can
only guess what encoding it has - if you are lucky, that works. But no one
can guarantee that it works out - neither in python, nor in other
programming languages.
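
What you can ask for is the type. A minimal sketch, assuming Python 3, where the unicode type is called str and encoded data is bytes (in Python 2 the check would be against unicode and str instead); describe is a made-up helper name:

```python
def describe(value):
    """Report whether value is Unicode text or encoded bytes (Python 3 names)."""
    if isinstance(value, str):
        return "unicode text (no encoding attached)"
    if isinstance(value, (bytes, bytearray)):
        return "encoded bytes (encoding unknown)"
    return "neither text nor bytes"


print(describe("somedata"))
print(describe(b"somedata"))
```

This tells you which kind of object you hold, but it still says nothing about which encoding a byte string is in.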

A common approach to guessing the encoding of said string is to try
something like this:

s = <some string with unknown encoding>
encodings = ['ascii', 'latin1', 'utf-8', ....]  # list of encodings you expect
for e in encodings:
    try:
        # a successful round trip means s is at least valid in encoding e
        if s == s.decode(e).encode(e):
            break
    except UnicodeError:
        pass
--
Regards,

Diez B. Roggisch

Jul 18 '05 #2

Peter Otten

Diez B. Roggisch wrote:

> A common approach to guessing the encoding of said string is to try
> something like this:
>
> s = <some string with unknown encoding>
> encodings = ['ascii', 'latin1', 'utf-8', ....]  # list of encodings you expect
> for e in encodings:
>     try:
>         if s == s.decode(e).encode(e):
>             break
>     except UnicodeError:
>         pass

However, you must be very careful with the order in which to test the
encodings. The example code will never detect "utf-8":

>>> s = "".join(map(chr, range(256)))
>>> s.decode("latin1").encode("latin1") == s
True

This equality holds for every encoding where one byte is one character and
uses the full range of 256 bytes/characters. You cannot discriminate
between such encodings using the above method:
>>> s.decode("latin1").encode("latin1") == s
True
>>> s.decode("latin2").encode("latin2") == s
True
>>> s.decode("latin2") == s.decode("latin1")
False
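
For what it's worth, the ordering issue can be sketched in Python 3 terms (bytes in, encoding name out). This is only an illustration under assumptions: guess_encoding and its default candidate list are made up, and the last candidate is a catch-all that never fails, so getting it back is a fallback label, not a detection.

```python
def guess_encoding(data, candidates=("ascii", "utf-8", "latin1")):
    """Return the first candidate that decodes data without error, else None.

    Strict encodings come first; a single-byte catch-all such as latin1
    accepts every byte sequence and therefore has to come last.
    """
    for encoding in candidates:
        try:
            data.decode(encoding)
        except UnicodeDecodeError:
            continue
        return encoding
    return None


print(guess_encoding(b"plain"))                   # pure ASCII bytes
print(guess_encoding("naïve".encode("utf-8")))    # valid multi-byte utf-8
print(guess_encoding("naïve".encode("latin1")))   # a lone 0xEF is not valid utf-8
```

With latin1 listed before utf-8, the second call would report latin1 as well, which is exactly the problem described above.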

A statistical approach seems more promising, e.g. some smart variant of
"looking for umlauts" in a text known to be German.

Peter

Jul 18 '05 #3

Alex Martelli

Diez B. Roggisch <de************@web.de> wrote:

> A common approach to guessing the encoding of said string is to try
> something like this:
>
> s = <some string with unknown encoding>
> encodings = ['ascii', 'latin1', 'utf-8', ....]  # list of encodings you expect
> for e in encodings:
>     try:
>         if s == s.decode(e).encode(e):
>             break
>     except UnicodeError:
>         pass

Yeah, but it doesn't work. iso-8859-x would break for any value of x;
can't tell this way if it was latin-1, or any of the others...
Alex

Jul 18 '05 #4

Diez B. Roggisch

Alex Martelli wrote:

> Yeah, but it doesn't work. iso-8859-x would break for any value of x;
> can't tell this way if it was latin-1, or any of the others...

You and Peter are right, of course - the first try should be utf-8. And of course,
a one-byte-based encoding will always match. I know that there are tools
out there like recode that try to make an educated guess, by taking the
context of non-ASCII chars into account and the like.
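
To give a rough idea of what such an educated guess could look like: the following is only a toy sketch in Python 3, not how recode or any real tool works. EXPECTED, score and guess_german_encoding are made-up names, and the candidate list and the assumption that the text is German are mine.

```python
# Letters we expect to see often in German text (an assumption of this sketch).
EXPECTED = set("äöüÄÖÜß")


def score(data, encoding):
    """Count decoded characters that belong to EXPECTED; -1 if decoding fails."""
    try:
        text = data.decode(encoding)
    except UnicodeDecodeError:
        return -1
    return sum(1 for ch in text if ch in EXPECTED)


def guess_german_encoding(data, candidates=("utf-8", "latin1", "cp437", "latin2")):
    """Return the candidate whose decoding contains the most expected letters."""
    return max(candidates, key=lambda enc: score(data, enc))


sample = "Größe und Länge prüfen".encode("cp437")
print(guess_german_encoding(sample))
```

A scheme this simple cannot separate encodings that put the umlauts at the same byte values (latin1 and latin2, for instance); on a tie, the earlier candidate in the list wins.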
--
Regards,

Diez B. Roggisch

Jul 18 '05 #5

Piet van Oostrum

>>>>> "Diez B. Roggisch" <de************@web.de> (DBR) wrote:

DBR> You are confusing unicode with strings with a certain encoding.

DBR> Unicode is an abstract specification of a huge number of characters,
DBR> hopefully covering even the close-to-unknown glyphs of some ancient
DBR> himalayan mountain tribe to the commonly used latin alphabet. There are no
DBR> actual numeric values associated with those glyphs.

You mix up characters and glyphs which makes it confusing.
There are no numeric values associated with glyphs in Unicode, but there
are numeric values associated with abstract characters.

(http://www.unicode.org/standard/WhatIsUnicode.html)
Unicode provides a unique number for every character, no matter what the
platform, no matter what the program, no matter what the language.

These numbers are called `code points'. (It says `unique' above, but later
they relax that).

But you are right regarding the encodings. The Unicode code points can be
encoded in different ways e.g. with the UTF-8 encoding.
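
A small illustration, assuming Python 3 (where ord() works directly on str); the euro sign is just a sample character picked for this sketch:

```python
ch = "€"                      # EURO SIGN, chosen only for illustration
code_point = ord(ch)          # the number Unicode assigns to this character
print(hex(code_point))

# One code point, different byte sequences depending on the encoding.
utf8 = ch.encode("utf-8")
utf16 = ch.encode("utf-16-be")
cp1252 = ch.encode("cp1252")
print(utf8, utf16, cp1252)
```

The code point stays the same in all three cases; only its byte representation changes with the encoding.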
--
Piet van Oostrum <pi**@cs.uu.nl>
URL: http://www.cs.uu.nl/~piet [PGP]
Private email: P.***********@hccnet.nl

Jul 18 '05 #6

Diez B. Roggisch

> You mix up characters and glyphs which makes it confusing.
> There are no numeric values associated with glyphs in Unicode, but there
> are numeric values associated with abstract characters.
>
> (http://www.unicode.org/standard/WhatIsUnicode.html)
> Unicode provides a unique number for every character, no matter what the
> platform, no matter what the program, no matter what the language.
>
> These numbers are called `code points'. (It says `unique' above, but later
> they relax that).
>
> But you are right regarding the encodings. The Unicode code points can be
> encoded in different ways e.g. with the UTF-8 encoding.

Just checked - yup, you're right: a character might in fact be composed of
several glyphs. So they are closely related (especially in your common
western language), but not the same.

Sheesh, that stuff is always a bit more complicated than one actually thinks
- I usually get the applicational part of it right, but the inner details
of unicode are still foggy...

--
Regards,

Diez B. Roggisch

Jul 18 '05 #7
