Detect character encoding

Hello,
is there any way to detect a string's encoding in Python?

I need to process several files. Each of them could be encoded in a
different charset (iso-8859-2, cp1250, etc.). I want to detect it and
encode it to utf-8 (with the string method encode).

Thank you for any answer
Regards
Michal
Dec 4 '05 #1
Michal wrote:
Hello,
is there any way how to detect string encoding in Python?

I need to process several files. Each of them could be encoded in
different charset (iso-8859-2, cp1250, etc). I want to detect it, and
encode it to utf-8 (with string function encode).

Thank you for any answer
Regards
Michal

The two ways to detect a string's encoding are:
(1) know the encoding ahead of time
(2) guess correctly

This is the whole point of Unicode -- an encoding that works for _lots_
of languages.

--Scott David Daniels
Dec 4 '05 #2
Michal wrote:
Hello,
is there any way how to detect string encoding in Python?

I need to process several files. Each of them could be encoded in
different charset (iso-8859-2, cp1250, etc). I want to detect it, and
encode it to utf-8 (with string function encode).


You can only guess, e.g. by looking for words that contain umlauts.
Recode might be of help here; it has such heuristics built in, AFAIK.

But there is _no_ way to be absolutely sure. 8 bits are 8 bits, so each file
is "legal" in all encodings.
Diez
Dec 4 '05 #3
"Diez B. Roggisch" writes:
Michal wrote:
is there any way how to detect string encoding in Python?
I need to process several files. Each of them could be encoded in
different charset (iso-8859-2, cp1250, etc). I want to detect it,
and encode it to utf-8 (with string function encode).

But there is _no_ way to be absolutely sure. 8bit are 8bit, so each
file is "legal" in all encodings.


Not quite. Some encodings don't use all the valid 8-bit characters, so
if you encounter a character not in an encoding, you can eliminate it
from the list of possible encodings. This doesn't really help much by
itself, though.

<mike
--
Mike Meyer http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
Dec 4 '05 #4
While I was trying to think of a nice intro, "Michal" wrote:
Hello,
is there any way how to detect string encoding in Python?
I need to process several files. Each of them could be encoded in
different charset (iso-8859-2, cp1250, etc). I want to detect it, and
encode it to utf-8 (with string function encode).
Thank you for any answer


Hi,
As you've already heard, you can't be sure, but you can guess.

I use a method like this:

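def guess_encoding(text):
    # Python 2: try each charset in order and return the first that decodes.
    for best_enc in guess_list:
        try:
            unicode(text, best_enc, "strict")
        except (UnicodeError, LookupError):
            continue
        return best_enc
    return None  # nothing in guess_list fits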

'guess_list' is an ordered charset name list like this:

['us-ascii','iso-8859-1','iso-8859-2',...,'windows-1250','windows-1252'...]

Of course, you can remove charsets you are sure you'll never find.
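
For readers on Python 3, the same trial-decoding idea might look like the sketch below. It assumes the input is `bytes`; the candidate tuple and the helper `to_utf8` are illustrative and not part of the method posted above. One thing to keep in mind: Python's iso-8859-1 and iso-8859-2 codecs accept every byte value, so any charset listed after one of them is never reached.

```python
def guess_encoding(data, candidates=("ascii", "utf-8", "cp1250", "iso-8859-2")):
    """Return the first candidate that decodes `data` strictly, or None."""
    for enc in candidates:
        try:
            data.decode(enc, "strict")
        except UnicodeDecodeError:
            continue  # this charset cannot represent the bytes; try the next
        return enc
    return None


def to_utf8(data, candidates=("ascii", "utf-8", "cp1250", "iso-8859-2")):
    """Re-encode `data` as UTF-8 using the guessed source charset (hypothetical helper)."""
    enc = guess_encoding(data, candidates)
    if enc is None:
        raise ValueError("no candidate encoding fits the data")
    return data.decode(enc).encode("utf-8")
```

The order of the candidates is the whole heuristic here: a successful decode only means the bytes are legal in that charset, not that the guess is right.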
--
This could really be the spark that makes the drop
overflow.

|\ | |HomePage : http://nem01.altervista.org
| \|emesis |XPN (my nr): http://xpn.altervista.org

Dec 4 '05 #5
You may want to look at some Python Cookbook recipes, such as
http://aspn.activestate.com/ASPN/Coo...n/Recipe/52257
"Auto-detect XML encoding" by Paul Prescod
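
As a rough illustration of how XML encoding sniffing can work (this is not the recipe's code, just a simplified sketch): look for a byte-order mark first, then for the `encoding` pseudo-attribute in the XML declaration. The function name `sniff_xml_encoding` and the fallback to UTF-8 are assumptions of this sketch, and the declaration check only works for ASCII-compatible encodings.

```python
import codecs
import re

# Byte-order marks, UTF-32 first so it is not mistaken for UTF-16.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Matches <?xml ... encoding="name" at the very start of the document.
_DECL = re.compile(rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']')


def sniff_xml_encoding(data, default="utf-8"):
    """Guess the encoding of an XML document given as bytes."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    match = _DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    return default  # no BOM and no declaration
```

This works for XML because the format carries its own label; plain text files like the OP's have no such hint.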

Dec 4 '05 #6
Mike Meyer wrote:
"Diez B. Roggisch" writes:
Michal wrote:
is there any way how to detect string encoding in Python?
I need to process several files. Each of them could be encoded in
different charset (iso-8859-2, cp1250, etc). I want to detect it,
and encode it to utf-8 (with string function encode).

But there is _no_ way to be absolutely sure. 8bit are 8bit, so each
file is "legal" in all encodings.


Not quite. Some encodings don't use all the valid 8-bit characters, so
if you encounter a character not in an encoding, you can eliminate it
from the list of possible encodings. This doesn't really help much by
itself, though.

<mike


I read or heard (can't remember where) that MS IE has a quite good
implementation of guessing the language and character encoding of web
pages when they're missing or wrongly specified.
From what I can remember, they used an algorithm to gather statistics
on the specific page, compared them with statistics for all kinds of
languages and encodings, and picked the most likely match.

Please be aware that I don't know if the above has even the slightest
amount of truth in it, however it didn't prevent me from posting anyway ;-)
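
A toy version of that statistical idea, in Python 3, could look like this. It is only a sketch of the general approach and makes no claim about what IE actually does: each candidate decoding is scored by how many of its non-ASCII characters belong to an alphabet you expect. The `EXPECTED` set (Czech letters here) and both function names are hypothetical.

```python
# Hypothetical alphabet: accented letters we expect to see (Czech here).
EXPECTED = set("áčďéěíňóřšťúůýžÁČĎÉĚÍŇÓŘŠŤÚŮÝŽ")


def score(data, encoding, expected=EXPECTED):
    """Expected non-ASCII characters minus unexpected ones; None if undecodable."""
    try:
        text = data.decode(encoding)
    except UnicodeDecodeError:
        return None
    non_ascii = [ch for ch in text if ord(ch) > 127]
    hits = sum(1 for ch in non_ascii if ch in expected)
    return hits - (len(non_ascii) - hits)


def guess_by_statistics(data, candidates=("cp1250", "iso-8859-2")):
    """Return the candidate with the best score; earlier candidates win ties."""
    scored = [(score(data, enc), enc) for enc in candidates]
    scored = [(s, enc) for s, enc in scored if s is not None]
    if not scored:
        return None
    return max(scored, key=lambda pair: pair[0])[1]
```

A real detector would use per-language frequency tables rather than a fixed letter set, but even this crude score can separate cp1250 from iso-8859-2, since they place letters such as ž and ť at different byte values.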

--
mph
Dec 4 '05 #7
Martin> I read or heard (can't remember the origin) that MS IE has a
Martin> quite good implementation of guessing the language and character
Martin> encoding of web pages when they're not or falsely specified.

Gee, that's nice. Too bad the source isn't available... <0.5 wink>

Skip
Dec 4 '05 #8
Mike Meyer wrote:
"Diez B. Roggisch" writes:
Michal wrote:
is there any way how to detect string encoding in Python?
I need to process several files. Each of them could be encoded in
different charset (iso-8859-2, cp1250, etc). I want to detect it,
and encode it to utf-8 (with string function encode).


But there is _no_ way to be absolutely sure. 8bit are 8bit, so each
file is "legal" in all encodings.

Not quite. Some encodings don't use all the valid 8-bit characters, so
if you encounter a character not in an encoding, you can eliminate it
from the list of possible encodings. This doesn't really help much by
itself, though.

----- test.py
for enc in ["cp1250", "latin1", "iso-8859-2"]:
    print enc
    try:
        # Python 2: decode a string holding every possible byte value.
        str.decode("".join([chr(i) for i in xrange(256)]), enc)
    except UnicodeDecodeError, e:
        print e
-----

192:~ deets$ python2.4 /tmp/test.py
cp1250
'charmap' codec can't decode byte 0x81 in position 129: character maps
to <undefined>
latin1
iso-8859-2

So cp1250 doesn't have all code points defined, but the others do.
Sure, this helps you eliminate one of the three choices the OP wanted
to choose between, but how many texts do you have that contain a byte 129 (0x81)?
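
The same check can be written for Python 3 so that it lists every undefined byte rather than stopping at the first. This is a sketch that only makes sense for single-byte charsets, and the helper name `undefined_bytes` is made up for illustration.

```python
def undefined_bytes(encoding):
    """Return the byte values that a single-byte `encoding` cannot decode."""
    missing = []
    for i in range(256):
        try:
            bytes([i]).decode(encoding)
        except UnicodeDecodeError:
            missing.append(i)
    return missing


for enc in ("cp1250", "latin1", "iso-8859-2"):
    print(enc, [hex(b) for b in undefined_bytes(enc)])
```

Charsets with an empty list can never be ruled out by trial decoding alone, which is the limitation described above.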

Regards,

Diez
Dec 4 '05 #9
[Diez B. Roggisch]
Michal wrote:
is there any way how to detect string encoding in Python?

Recode might be of help here, it has such heuristics built in AFAIK.


If we are speaking about the same Recode ☺, there are some built in
tools that could help a human to discover a charset, but this requires
work and time, and is far from fully automated as one might dream.
While some charsets could be guessed almost correctly by automatic
means, most are difficult to recognise. The whole problem is not easy.

--
François Pinard http://pinard.progiciels-bpi.ca
Dec 5 '05 #10
