Detect character encoding

Hello,
Is there any way to detect a string's encoding in Python?

I need to process several files. Each of them could be encoded in a
different charset (iso-8859-2, cp1250, etc.). I want to detect it and
encode it to utf-8 (with the string method encode).

Thank you for any answer
Regards
Michal
Dec 4 '05 #1
Michal wrote:
> Hello,
> Is there any way to detect a string's encoding in Python?
>
> I need to process several files. Each of them could be encoded in a
> different charset (iso-8859-2, cp1250, etc.). I want to detect it and
> encode it to utf-8 (with the string method encode).
>
> Thank you for any answer
> Regards
> Michal

The two ways to detect a string's encoding are:
(1) know the encoding ahead of time
(2) guess correctly

This is the whole point of Unicode -- an encoding that works for _lots_
of languages.

--Scott David Daniels
sc***********@acm.org
Dec 4 '05 #2
Michal wrote:
> Hello,
> Is there any way to detect a string's encoding in Python?
>
> I need to process several files. Each of them could be encoded in a
> different charset (iso-8859-2, cp1250, etc.). I want to detect it and
> encode it to utf-8 (with the string method encode).


You can only guess, for example by looking for words that contain
characters such as umlauts. Recode might be of help here; it has such
heuristics built in, AFAIK.

But there is _no_ way to be absolutely sure. 8 bits are 8 bits, so each
file is "legal" in all encodings.
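
A minimal sketch of the ambiguity, assuming Python 3 (the thread itself predates it): the same single byte decodes without error under both charsets the OP mentioned, but to different letters. The variable names are only for illustration.

```python
# Python 3 sketch: one byte, two valid but different readings.
raw = b"\xb9"  # a single byte above 0x7F

as_latin2 = raw.decode("iso-8859-2")  # LATIN SMALL LETTER S WITH CARON
as_cp1250 = raw.decode("cp1250")      # LATIN SMALL LETTER A WITH OGONEK

print(repr(as_latin2), repr(as_cp1250))
# Both decodes succeed, so validity alone cannot tell the charsets apart.
```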
Diez
Dec 4 '05 #3
"Diez B. Roggisch" <de***@nospam.web.de> writes:
> Michal wrote:
>> Is there any way to detect a string's encoding in Python?
>> I need to process several files. Each of them could be encoded in a
>> different charset (iso-8859-2, cp1250, etc.). I want to detect it
>> and encode it to utf-8 (with the string method encode).
>
> But there is _no_ way to be absolutely sure. 8 bits are 8 bits, so each
> file is "legal" in all encodings.


Not quite. Some encodings don't use all the valid 8-bit characters, so
if you encounter a character that an encoding doesn't define, you can
eliminate that encoding from the list of possible encodings. This
doesn't really help much by itself, though.

<mike
--
Mike Meyer <mw*@mired.org> http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
Dec 4 '05 #4
While I was thinking of a nice intro, "Michal" wrote:
> Hello,
> Is there any way to detect a string's encoding in Python?
> I need to process several files. Each of them could be encoded in a
> different charset (iso-8859-2, cp1250, etc.). I want to detect it and
> encode it to utf-8 (with the string method encode).
> Thank you for any answer


Hi,
As you have already heard, you can't be sure, but you can guess.

I use a method like this:

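def guess_encoding(text):
    # Return the first charset in guess_list that decodes text without error.
    for enc in guess_list:
        try:
            unicode(text, enc, "strict")
        except (UnicodeError, LookupError):
            continue
        return enc
    return None  # nothing in guess_list fits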

'guess_list' is an ordered charset name list like this:

['us-ascii','iso-8859-1','iso-8859-2',...,'windows-1250','windows-1252'...]

Of course, you can remove charsets you are sure you'll never encounter.
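
For anyone reading this later: a rough Python 3 take on the same idea, with the conversion to utf-8 that the OP asked for bolted on. This is only a sketch; `GUESS_LIST` and `to_utf8` are made-up names, and the order of the list is an assumption that decides almost everything, because codecs such as iso-8859-1 and iso-8859-2 accept every byte and so anything listed after them is never reached.

```python
# Python 3 sketch; GUESS_LIST and to_utf8 are hypothetical names.
GUESS_LIST = ["ascii", "utf-8", "cp1250", "iso-8859-2"]

def guess_encoding(data, candidates=GUESS_LIST):
    """Return the first candidate that decodes `data` strictly, else None."""
    for enc in candidates:
        try:
            data.decode(enc, "strict")
        except UnicodeDecodeError:
            continue
        return enc
    return None

def to_utf8(data, candidates=GUESS_LIST):
    """Re-encode `data` as UTF-8 using the guessed source encoding."""
    enc = guess_encoding(data, candidates)
    if enc is None:
        raise ValueError("no candidate encoding fits")
    return data.decode(enc).encode("utf-8")
```

Reading each file in binary mode (`open(path, "rb").read()`) and passing the bytes to `to_utf8` would cover the OP's case, with the caveat that a successful decode only means "not impossible", not "correct".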
--
This could really be the spark that makes the drop overflow.

|\ | |HomePage : http://nem01.altervista.org
| \|emesis |XPN (my nr): http://xpn.altervista.org

Dec 4 '05 #5
You may want to look at some Python Cookbook recipes, such as
http://aspn.activestate.com/ASPN/Coo...n/Recipe/52257
"Auto-detect XML encoding" by Paul Prescod

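For XML specifically, the document usually announces its own encoding, so less guessing is needed. The following is not the recipe's code, just a simplified Python 3 sketch of the general idea: look for a byte order mark first, then for the `encoding` attribute of the XML declaration, and fall back to UTF-8. The function name `sniff_xml_encoding` is hypothetical, and the declaration check assumes an ASCII-compatible encoding.

```python
import codecs
import re

# Byte order marks, longest first so UTF-32 LE is not mistaken for UTF-16 LE.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Matches the encoding name inside an XML declaration at the start of the data.
DECL = re.compile(rb"""^<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z0-9._-]+)["']""")

def sniff_xml_encoding(data):
    """Guess the encoding of XML bytes from a BOM or the XML declaration."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    match = DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    return "utf-8"  # XML's default when nothing is declared
```
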
Dec 4 '05 #6
Mike Meyer wrote:
> "Diez B. Roggisch" <de***@nospam.web.de> writes:
>> Michal wrote:
>>> Is there any way to detect a string's encoding in Python?
>>> I need to process several files. Each of them could be encoded in a
>>> different charset (iso-8859-2, cp1250, etc.). I want to detect it
>>> and encode it to utf-8 (with the string method encode).
>>
>> But there is _no_ way to be absolutely sure. 8 bits are 8 bits, so each
>> file is "legal" in all encodings.
>
> Not quite. Some encodings don't use all the valid 8-bit characters, so
> if you encounter a character that an encoding doesn't define, you can
> eliminate that encoding from the list of possible encodings. This
> doesn't really help much by itself, though.
>
> <mike


I read or heard (can't remember where) that MS IE has quite a good
implementation of guessing the language and character encoding of web
pages when they are missing or wrongly specified.
From what I remember, they used an algorithm to compute some statistics
for the specific page, compared those with statistics about all kinds of
languages and encodings, and picked the most likely match.

Please be aware that I don't know if the above has even the slightest
amount of truth in it, however it didn't prevent me from posting anyway ;-)
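
Whatever IE actually does, the statistical idea itself is easy to sketch. The toy below (Python 3, all names hypothetical) decodes the bytes with each candidate and prefers the one whose non-ASCII characters look most like letters of an expected language. Polish is an arbitrary assumption here; a real detector would use proper frequency tables for many languages.

```python
# Python 3 sketch; EXPECTED_LETTERS, score and most_likely are hypothetical.
EXPECTED_LETTERS = set("ąćęłńóśźżĄĆĘŁŃÓŚŹŻ")  # Polish letters outside ASCII

def score(data, enc):
    """Fraction of non-ASCII characters that are expected letters; -1.0 if decoding fails."""
    try:
        text = data.decode(enc)
    except UnicodeDecodeError:
        return -1.0
    non_ascii = [ch for ch in text if ord(ch) > 127]
    if not non_ascii:
        return 0.0
    return sum(ch in EXPECTED_LETTERS for ch in non_ascii) / len(non_ascii)

def most_likely(data, candidates=("iso-8859-2", "cp1250")):
    """Pick the candidate whose decoding looks most like the expected language."""
    return max(candidates, key=lambda enc: score(data, enc))
```

On ties `max` simply returns the first candidate, so pure-ASCII input tells you nothing, which is as it should be.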

--
mph
Dec 4 '05 #7
Martin> I read or heard (can't remember where) that MS IE has quite a
Martin> good implementation of guessing the language and character
Martin> encoding of web pages when they are missing or wrongly specified.

Gee, that's nice. Too bad the source isn't available... <0.5 wink>

Skip
Dec 4 '05 #8
Mike Meyer wrote:
> "Diez B. Roggisch" <de***@nospam.web.de> writes:
>> Michal wrote:
>>> Is there any way to detect a string's encoding in Python?
>>> I need to process several files. Each of them could be encoded in a
>>> different charset (iso-8859-2, cp1250, etc.). I want to detect it
>>> and encode it to utf-8 (with the string method encode).
>>
>> But there is _no_ way to be absolutely sure. 8 bits are 8 bits, so each
>> file is "legal" in all encodings.
>
> Not quite. Some encodings don't use all the valid 8-bit characters, so
> if you encounter a character that an encoding doesn't define, you can
> eliminate that encoding from the list of possible encodings. This
> doesn't really help much by itself, though.

----- test.py
for enc in ["cp1250", "latin1", "iso-8859-2"]:
    print enc
    try:
        # Decode a string holding every byte value 0..255 with this codec.
        "".join([chr(i) for i in xrange(256)]).decode(enc)
    except UnicodeDecodeError, e:
        print e
-----

192:~ deets$ python2.4 /tmp/test.py
cp1250
'charmap' codec can't decode byte 0x81 in position 129: character maps
to <undefined>
latin1
iso-8859-2

So cp1250 doesn't have all codepoints defined, but the others do.
Sure, this helps you eliminate one of the three choices the OP wanted
to choose between, but how many texts do you have that contain byte 129
(0x81)?
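
A Python 3 variant of the same check, sketched here for readers on newer interpreters: instead of stopping at the first failure, it collects every byte value a codec refuses to decode. `undefined_bytes` is a hypothetical helper name, and probing one byte at a time only makes sense for single-byte charsets like the three tested above.

```python
# Python 3 sketch; undefined_bytes is a hypothetical helper name.
def undefined_bytes(enc):
    """Return the byte values (0..255) that `enc` refuses to decode."""
    missing = []
    for i in range(256):
        try:
            bytes([i]).decode(enc)
        except UnicodeDecodeError:
            missing.append(i)
    return missing

for enc in ["cp1250", "latin1", "iso-8859-2"]:
    print(enc, [hex(b) for b in undefined_bytes(enc)])
```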

Regards,

Diez
Dec 4 '05 #9
[Diez B. Roggisch]
> Michal wrote:
>> Is there any way to detect a string's encoding in Python?
>
> Recode might be of help here; it has such heuristics built in, AFAIK.


If we are speaking about the same Recode ☺, there are some built in
tools that could help a human to discover a charset, but this requires
work and time, and is far from fully automated as one might dream.
While some charsets could be guessed almost correctly by automatic
means, most are difficult to recognise. The whole problem is not easy.

--
François Pinard http://pinard.progiciels-bpi.ca
Dec 5 '05 #10
