Detect character encoding

Michal

Hello,
is there any way to detect a string's encoding in Python?

I need to process several files. Each of them could be encoded in a
different charset (iso-8859-2, cp1250, etc.). I want to detect it and
encode it to utf-8 (with the string function encode).

Thank you for any answer
Regards
Michal

Dec 4 '05 #1

Scott David Daniels

Michal wrote:

Hello,
is there any way to detect a string's encoding in Python?

I need to process several files. Each of them could be encoded in a
different charset (iso-8859-2, cp1250, etc.). I want to detect it and
encode it to utf-8 (with the string function encode).

Thank you for any answer
Regards
Michal

The two ways to detect a string's encoding are:
(1) know the encoding ahead of time
(2) guess correctly

This is the whole point of Unicode -- an encoding that works for _lots_
of languages.

--Scott David Daniels
sc***********@acm.org

Dec 4 '05 #2

Diez B. Roggisch

Michal wrote:

Hello,
is there any way to detect a string's encoding in Python?

I need to process several files. Each of them could be encoded in a
different charset (iso-8859-2, cp1250, etc.). I want to detect it and
encode it to utf-8 (with the string function encode).

You can only guess, e.g. by looking for words that contain umlauts.
Recode might be of help here; it has such heuristics built in AFAIK.
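
A minimal sketch of the "look for words with umlauts" idea, in present-day Python 3. This is not how Recode does it; the function names, the candidate list and the tiny word list are made up for illustration, and a real word list would need to be much larger.

```python
import re


def count_known_words(text, known_words):
    """Count how many words of `text` appear in `known_words` (lowercase set)."""
    words = re.findall(r"\w+", text.lower())
    return sum(1 for word in words if word in known_words)


def guess_by_known_words(raw, candidates, known_words):
    """Return the candidate encoding whose decoding yields the most known words.

    Undecodable bytes are replaced rather than raising, so every candidate
    gets a score. Ties go to the earlier entry in `candidates`.
    """
    def score(enc):
        return count_known_words(raw.decode(enc, errors="replace"), known_words)

    return max(candidates, key=score)


# Hypothetical word list: a few German words containing umlauts or sharp s.
KNOWN = {"grüße", "münchen", "für", "über"}
raw = "Grüße aus München".encode("cp850")
print(guess_by_known_words(raw, ["latin-1", "cp850", "mac_roman"], KNOWN))
```

It is still only a guess: if none of the known words occur in the text, every candidate scores zero and the first one in the list wins.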

But there is _no_ way to be absolutely sure. 8bit are 8bit, so each file
is "legal" in all encodings.
Diez

Dec 4 '05 #3

Mike Meyer

"Diez B. Roggisch" <de***@nospam.web.de> writes:

Michal wrote:
is there any way to detect a string's encoding in Python?
I need to process several files. Each of them could be encoded in a
different charset (iso-8859-2, cp1250, etc.). I want to detect it
and encode it to utf-8 (with the string function encode).

But there is _no_ way to be absolutely sure. 8bit are 8bit, so each
file is "legal" in all encodings.

Not quite. Some encodings don't use all the valid 8-bit characters, so
if you encounter a character not in an encoding, you can eliminate it
from the list of possible encodings. This doesn't really help much by
itself, though.
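
Roughly, the elimination step could look like this in present-day Python 3 (a sketch only; the function name and the candidate list are invented for the example):

```python
def surviving_encodings(raw, candidates):
    """Return the candidates under which `raw` decodes without error."""
    survivors = []
    for enc in candidates:
        try:
            raw.decode(enc, errors="strict")
        except UnicodeDecodeError:
            continue  # this encoding has no character for some byte: rule it out
        survivors.append(enc)
    return survivors


# Byte 0x81 is not ASCII and is undefined in cp1250, so both are eliminated.
print(surviving_encodings(b"caf\x81", ["ascii", "cp1250", "latin-1", "iso-8859-2"]))
```

Whatever is left in the list is still only a set of possibilities; something else has to pick between them.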

<mike
--
Mike Meyer <mw*@mired.org> http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.

Dec 4 '05 #4

Nemesis

While I was thinking of a nice intro, "Michal" wrote:

Hello,
is there any way to detect a string's encoding in Python?
I need to process several files. Each of them could be encoded in a
different charset (iso-8859-2, cp1250, etc.). I want to detect it and
encode it to utf-8 (with the string function encode).
Thank you for any answer

Hi,
As you've already heard, you can't be sure, but you can guess.

I use a method like this:

    def guess_encoding(text):
        # Return the first charset in guess_list that decodes text, or None.
        for enc in guess_list:
            try:
                unicode(text, enc, "strict")
            except (UnicodeError, LookupError):
                continue
            return enc
        return None

'guess_list' is an ordered charset name list like this:

['us-ascii','iso-8859-1','iso-8859-2',...,'windows-1250','windows-1252'...]

Of course, you can remove charsets you are sure you'll never encounter.
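
For the conversion to utf-8 that was asked about, the same first-match idea might be sketched like this in present-day Python 3. The names to_utf8 and convert_file are invented for the example. Keep in mind that iso-8859-1 decodes every possible byte sequence, so any charset listed after it in the guess list is never reached.

```python
def to_utf8(raw, guess_list):
    """Decode `raw` with the first charset that accepts it; return (charset, utf-8 bytes)."""
    for enc in guess_list:
        try:
            text = raw.decode(enc)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong guess or unknown charset name: try the next one
        return enc, text.encode("utf-8")
    raise ValueError("no charset in guess_list could decode the data")


def convert_file(src_path, dst_path, guess_list):
    """Read src_path as bytes, write it to dst_path as utf-8, return the charset used."""
    with open(src_path, "rb") as src:
        enc, data = to_utf8(src.read(), guess_list)
    with open(dst_path, "wb") as dst:
        dst.write(data)
    return enc
```

Files are opened in binary mode on purpose, so that no implicit decoding happens before the guess is made.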
--
This could really be the spark that makes the drop
overflow.

|\ | |HomePage : http://nem01.altervista.org
| \|emesis |XPN (my nr): http://xpn.altervista.org

Dec 4 '05 #5

B Mahoney

You may want to look at some Python Cookbook recipes, such as
http://aspn.activestate.com/ASPN/Coo...n/Recipe/52257
"Auto-detect XML encoding" by Paul Prescod
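
For XML the situation is better than for plain text, because a file may carry a byte order mark or an encoding declaration. The following is only a rough sketch of that idea in present-day Python 3, not the code from the recipe; the function name is made up, and it does not handle a declaration written in UTF-16 or UTF-32 without a byte order mark.

```python
import codecs
import re

# UTF-32 marks are checked first: the UTF-32 LE mark starts with the UTF-16 LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

# Matches e.g. <?xml version="1.0" encoding="iso-8859-2"?> at the very start.
_DECL = re.compile(rb"""<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']""")


def sniff_xml_encoding(raw, default="utf-8"):
    """Guess the encoding of XML bytes from a byte order mark or the declaration."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    match = _DECL.match(raw[:200])
    if match:
        return match.group(1).decode("ascii").lower()
    # XML without a mark or a declared encoding is UTF-8 by default.
    return default
```

The returned names are Python codec names, so the result can be passed straight to bytes.decode.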

Dec 4 '05 #6

Martin P. Hellwig

Mike Meyer wrote:

"Diez B. Roggisch" <de***@nospam.web.de> writes:
Michal wrote:
is there any way to detect a string's encoding in Python?
I need to process several files. Each of them could be encoded in a
different charset (iso-8859-2, cp1250, etc.). I want to detect it
and encode it to utf-8 (with the string function encode).

But there is _no_ way to be absolutely sure. 8bit are 8bit, so each
file is "legal" in all encodings.

Not quite. Some encodings don't use all the valid 8-bit characters, so
if you encounter a character not in an encoding, you can eliminate it
from the list of possible encodings. This doesn't really help much by
itself, though.

<mike

I read or heard (can't remember the origin) that MS IE has a quite good
implementation of guessing the language and character encoding of web
pages when they're not specified or falsely specified.
From what I can remember, they used an algorithm to create some
statistics of the specific page, compared those with statistics about
all kinds of languages and encodings, and just picked the most likely.

Please be aware that I don't know if the above has even the slightest
amount of truth in it, however it didn't prevent me from posting anyway ;-)
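
To make the idea concrete, here is how I would imagine such a statistical comparison in present-day Python 3. It is purely my own sketch and says nothing about what IE actually does; the function names, the single-character statistics and the floor value for unseen characters are all assumptions.

```python
import math
from collections import Counter


def build_profile(sample):
    """Relative character frequencies of a sample text in the expected language."""
    counts = Counter(sample.lower())
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}


def log_likelihood(text, profile, floor=1e-6):
    """Sum of log frequencies; characters missing from the profile get `floor`."""
    return sum(math.log(profile.get(ch, floor)) for ch in text.lower())


def most_likely_encoding(raw, candidates, profile):
    """Return the candidate whose decoding looks most like the profiled language."""
    def score(enc):
        return log_likelihood(raw.decode(enc, errors="replace"), profile)

    return max(candidates, key=score)


# A one-sentence Czech sample stands in for real language statistics.
profile = build_profile("Příliš žluťoučký kůň úpěl ďábelské ódy")
raw = "žluťoučký kůň".encode("iso-8859-2")
print(most_likely_encoding(raw, ["cp1250", "iso-8859-2", "latin-1"], profile))
```

With a realistic profile per language you could score language and encoding pairs the same way and take the best pair.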

--
mph

Dec 4 '05 #7

skip

Martin> I read or heard (can't remember the origin) that MS IE has a
Martin> quite good implementation of guessing the language and character
Martin> encoding of web pages when they're not specified or falsely specified.

Gee, that's nice. Too bad the source isn't available... <0.5 wink>

Skip

Dec 4 '05 #8

Diez B. Roggisch

Mike Meyer wrote:

"Diez B. Roggisch" <de***@nospam.web.de> writes:
Michal wrote:
is there any way to detect a string's encoding in Python?
I need to process several files. Each of them could be encoded in a
different charset (iso-8859-2, cp1250, etc.). I want to detect it
and encode it to utf-8 (with the string function encode).

But there is _no_ way to be absolutely sure. 8bit are 8bit, so each
file is "legal" in all encodings.

Not quite. Some encodings don't use all the valid 8-bit characters, so
if you encounter a character not in an encoding, you can eliminate it
from the list of possible encodings. This doesn't really help much by
itself, though.

----- test.py
# Try to decode all 256 byte values with each codec (Python 2).
for enc in ["cp1250", "latin1", "iso-8859-2"]:
    print enc
    try:
        str.decode("".join([chr(i) for i in xrange(256)]), enc)
    except UnicodeDecodeError, e:
        print e
-----

192:~ deets$ python2.4 /tmp/test.py
cp1250
'charmap' codec can't decode byte 0x81 in position 129: character maps
to <undefined>
latin1
iso-8859-2

So cp1250 doesn't have all codepoints defined, but the others do.
Sure, this helps you to eliminate one of the three choices the OP wanted
to choose between, but how many texts do you have that contain byte 129
(0x81)?
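
The more common case is a byte that every candidate accepts but maps to a different character. A small illustration in present-day Python 3 (the helper name is invented for the example):

```python
def decode_all(raw, candidates):
    """Decode the same bytes under each candidate; return {encoding: text}."""
    return {enc: raw.decode(enc) for enc in candidates}


# Byte 0xB9 is valid in all three charsets, so no decoding error occurs.
for enc, text in decode_all(b"\xb9", ["cp1250", "iso-8859-2", "latin-1"]).items():
    print(enc, ascii(text))
```

All three decodings succeed, yet each yields a different character (U+0105, U+0161 and U+00B9 respectively), so a decoding error alone cannot tell them apart.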

Regards,

Diez

Dec 4 '05 #9

François Pinard

[Diez B. Roggisch]

Michal wrote:
is there any way to detect a string's encoding in Python?

Recode might be of help here, it has such heuristics built in AFAIK.

If we are speaking about the same Recode ☺, there are some built in
tools that could help a human to discover a charset, but this requires
work and time, and is far from fully automated as one might dream.
While some charsets could be guessed almost correctly by automatic
means, most are difficult to recognise. The whole problem is not easy.

--
François Pinard http://pinard.progiciels-bpi.ca

Dec 5 '05 #10
