Hello,
is there any way to detect string encoding in Python?
I need to process several files. Each of them could be encoded in a
different charset (iso-8859-2, cp1250, etc). I want to detect it, and
encode it to utf-8 (with the string method encode).
Thank you for any answer
Regards
Michal
Michal wrote:
> Hello, is there any way to detect string encoding in Python?
> I need to process several files. Each of them could be encoded in a
> different charset (iso-8859-2, cp1250, etc). I want to detect it, and
> encode it to utf-8 (with the string method encode).
> Thank you for any answer
> Regards
> Michal
The two ways to detect a string's encoding are:
(1) know the encoding ahead of time
(2) guess correctly
This is the whole point of Unicode -- an encoding that works for _lots_
of languages.
--Scott David Daniels sc***********@acm.org
Michal wrote:
> Hello, is there any way to detect string encoding in Python?
> I need to process several files. Each of them could be encoded in a
> different charset (iso-8859-2, cp1250, etc). I want to detect it, and
> encode it to utf-8 (with the string method encode).
You can only guess, e.g. by looking for words that contain umlauts.
Recode might be of help here; it has such heuristics built in, AFAIK.
But there is _no_ way to be absolutely sure. 8 bits are 8 bits, so each file
is "legal" in all encodings.
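
To illustrate the ambiguity, here is a minimal sketch, assuming Python 3 (the thread itself predates it) and a made-up Czech sample string. The same bytes decode without any error under both charsets the original poster mentioned, but only one of the results is the intended text:

```python
# Hypothetical sample text; any string with Central European letters would do.
text = "žluťoučký kůň"

# Pretend this is the content of a file written as iso-8859-2.
raw = text.encode("iso-8859-2")

# Both decodes succeed, so a decode error alone cannot tell the two apart.
as_latin2 = raw.decode("iso-8859-2")
as_cp1250 = raw.decode("cp1250")

print(as_latin2 == text)   # the right charset gives the text back
print(as_cp1250 == text)   # the wrong charset gives different characters
```

The second decode does not fail; it just yields other letters in place of some of the accented ones, which is why a guess has to look at the decoded text and not only at whether decoding succeeded.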
Diez
"Diez B. Roggisch" <de***@nospam.web.de> writes:
> Michal wrote:
>> is there any way to detect string encoding in Python? I need to
>> process several files. Each of them could be encoded in a different
>> charset (iso-8859-2, cp1250, etc). I want to detect it, and encode it
>> to utf-8 (with the string method encode).
> But there is _no_ way to be absolutely sure. 8 bits are 8 bits, so each
> file is "legal" in all encodings.
Not quite. Some encodings don't use all the valid 8-bit characters, so
if you encounter a character not in an encoding, you can eliminate it
from the list of possible encodings. This doesn't really help much by
itself, though.
<mike
--
Mike Meyer <mw*@mired.org> http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
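
The elimination step Mike describes could be sketched roughly like this, assuming Python 3. The function name `possible_encodings` and the candidate list are made up for the example; the only library call is `bytes.decode`:

```python
def possible_encodings(data, candidates):
    """Return the candidates that decode data without error."""
    survivors = []
    for enc in candidates:
        try:
            data.decode(enc, "strict")
        except UnicodeDecodeError:
            continue  # data contains a byte sequence this charset rejects
        survivors.append(enc)
    return survivors


# Hypothetical candidate list; trim it to the charsets you actually expect.
candidates = ["ascii", "utf-8", "cp1250", "iso-8859-1", "iso-8859-2"]

print(possible_encodings(b"abc", candidates))    # plain ASCII fits all of them
print(possible_encodings(b"\x81", candidates))   # 0x81 rules out some of them
```

As Mike says, this only narrows the list. Several candidates usually survive, so something else has to pick among them.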
While I was thinking of a nice intro, "Michal" wrote:
> Hello, is there any way to detect string encoding in Python? I need to
> process several files. Each of them could be encoded in a different
> charset (iso-8859-2, cp1250, etc). I want to detect it, and encode it
> to utf-8 (with the string method encode). Thank you for any answer
Hi,
As you have already heard, you can't be sure, but you can guess.
I use a method like this:

def guess_encoding(text):
    # Return the first charset in guess_list that decodes text strictly.
    for enc in guess_list:
        try:
            unicode(text, enc, "strict")
        except UnicodeError:
            continue  # wrong charset, try the next one
        return enc
    return None  # nothing in guess_list fits

'guess_list' is an ordered list of charset names like this:
['us-ascii','iso-8859-1','iso-8859-2',...,'windows-1250','windows-1252'...]
Of course you can remove charsets you are sure you'll never find.
--
This could really be the spark that makes the drop overflow.
|\ | |HomePage : http://nem01.altervista.org
| \|emesis |XPN (my nr): http://xpn.altervista.org
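
For readers on Python 3, the same ordered-guess idea, plus the conversion to utf-8 the original poster asked for, might look like the sketch below. The helper `to_utf8` and the contents of `GUESS_LIST` are assumptions made for the example, not part of the post above. One thing to keep in mind when ordering the list: Python's iso-8859-1 and iso-8859-2 codecs accept every byte value, so whichever of them comes first acts as a catch-all and nothing after it is ever reached.

```python
def guess_encoding(data, guess_list):
    """Return the first encoding in guess_list that decodes data, else None."""
    for enc in guess_list:
        try:
            data.decode(enc, "strict")
        except UnicodeDecodeError:
            continue  # does not fit, try the next candidate
        return enc
    return None


def to_utf8(data, guess_list):
    """Decode data with the guessed encoding and re-encode it as utf-8."""
    enc = guess_encoding(data, guess_list)
    if enc is None:
        raise ValueError("no candidate encoding fits the data")
    return data.decode(enc).encode("utf-8")


# Hypothetical ordering: strict encodings first, the catch-all last.
GUESS_LIST = ["ascii", "utf-8", "cp1250", "iso-8859-2"]

sample = "žluťoučký kůň".encode("cp1250")  # stand-in for bytes read from a file
print(guess_encoding(sample, GUESS_LIST))
print(to_utf8(sample, GUESS_LIST).decode("utf-8"))
```

For real files the bytes would come from `open(path, "rb").read()`. The result is still only a guess: text written as iso-8859-2 with no byte that cp1250 rejects will be reported as cp1250 with this ordering.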
Mike Meyer wrote:
> "Diez B. Roggisch" <de***@nospam.web.de> writes:
>> Michal wrote:
>>> is there any way to detect string encoding in Python? I need to
>>> process several files. Each of them could be encoded in a different
>>> charset (iso-8859-2, cp1250, etc). I want to detect it, and encode
>>> it to utf-8 (with the string method encode).
>> But there is _no_ way to be absolutely sure. 8 bits are 8 bits, so
>> each file is "legal" in all encodings.
> Not quite. Some encodings don't use all the valid 8-bit characters, so
> if you encounter a character not in an encoding, you can eliminate it
> from the list of possible encodings. This doesn't really help much by
> itself, though.
> <mike
I read or heard (can't remember the origin) that MS IE has a quite good
implementation of guessing the language and character encoding of web
pages when they're not specified or falsely specified.
From what I can remember, they used an algorithm to create some
statistics of the specific page, compared them with statistics about
all kinds of languages and encodings, and just picked the most likely match.
Please be aware that I don't know if the above has even the slightest
amount of truth in it, however it didn't prevent me from posting anyway ;-)
--
mph
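
A very crude version of that statistical idea can be sketched in Python 3 as follows. This is not IE's algorithm (which, as Martin says, is hearsay here), only a toy under a stated assumption: the files are known to contain Czech text, so a decode that yields mostly Czech accented letters is more plausible than one that yields other symbols. The names `EXPECTED`, `score` and `most_likely_encoding` are made up for the example.

```python
# Assumption: the text is Czech, so these are the non-ASCII letters we expect.
EXPECTED = set("áčďéěíňóřšťúůýžÁČĎÉĚÍŇÓŘŠŤÚŮÝŽ")


def score(text):
    """Fraction of the non-ASCII characters that are expected letters."""
    non_ascii = [ch for ch in text if ord(ch) > 127]
    if not non_ascii:
        return 0.0
    return sum(ch in EXPECTED for ch in non_ascii) / len(non_ascii)


def most_likely_encoding(data, candidates):
    """Decode data with every candidate and return the best-scoring one."""
    best, best_score = None, -1.0
    for enc in candidates:
        try:
            text = data.decode(enc)
        except UnicodeDecodeError:
            continue  # this candidate is ruled out entirely
        current = score(text)
        if current > best_score:
            best, best_score = enc, current
    return best


sentence = "Příliš žluťoučký kůň úpěl ďábelské ódy"
candidates = ["iso-8859-2", "cp1250"]
print(most_likely_encoding(sentence.encode("cp1250"), candidates))
print(most_likely_encoding(sentence.encode("iso-8859-2"), candidates))
```

A serious detector would use per-language frequency tables for many languages; this sketch only shows the decode-then-score shape of the approach.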
Martin> I read or heard (can't remember the origin) that MS IE has a
Martin> quite good implementation of guessing the language and character
Martin> encoding of web pages when they're not or falsely specified.
Gee, that's nice. Too bad the source isn't available... <0.5 wink>
Skip
Mike Meyer wrote:
> "Diez B. Roggisch" <de***@nospam.web.de> writes:
>> Michal wrote:
>>> is there any way to detect string encoding in Python? I need to
>>> process several files. Each of them could be encoded in a different
>>> charset (iso-8859-2, cp1250, etc). I want to detect it, and encode
>>> it to utf-8 (with the string method encode).
>> But there is _no_ way to be absolutely sure. 8 bits are 8 bits, so
>> each file is "legal" in all encodings.
> Not quite. Some encodings don't use all the valid 8-bit characters, so
> if you encounter a character not in an encoding, you can eliminate it
> from the list of possible encodings. This doesn't really help much by
> itself, though.
----- test.py
for enc in ["cp1250", "latin1", "iso-8859-2"]:
    print enc
    try:
        str.decode("".join([chr(i) for i in xrange(256)]), enc)
    except UnicodeDecodeError, e:
        print e
-----
192:~ deets$ python2.4 /tmp/test.py
cp1250
'charmap' codec can't decode byte 0x81 in position 129: character maps
to <undefined>
latin1
iso-8859-2
So cp1250 doesn't have all code points defined, but the others do.
Sure, this helps you eliminate one of the three choices the OP wanted
to choose between, but how many texts do you have that contain byte 129 (0x81)?
Regards,
Diez
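
The script above stops at the first byte a codec rejects. To see every byte value a codec leaves undefined, a Python 3 variant could look like this sketch (the function name `undefined_bytes` is made up for the example):

```python
def undefined_bytes(encoding):
    """Return every byte value (0-255) that the codec refuses to decode."""
    missing = []
    for i in range(256):
        try:
            bytes([i]).decode(encoding)
        except UnicodeDecodeError:
            missing.append(i)
    return missing


for enc in ["cp1250", "latin1", "iso-8859-2"]:
    print(enc, [hex(b) for b in undefined_bytes(enc)])
```

The shorter the list for a charset, the less often the elimination trick can rule that charset out, which is Diez's point about byte 129.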
[Diez B. Roggisch]
> Michal wrote:
>> is there any way to detect string encoding in Python?
> Recode might be of help here, it has such heuristics built in AFAIK.
If we are speaking about the same Recode ☺, there are some built-in
tools that could help a human to discover a charset, but this requires
work and time, and is far from being as fully automated as one might dream.
While some charsets could be guessed almost correctly by automatic
means, most are difficult to recognise. The whole problem is not easy.
--
François Pinard http://pinard.progiciels-bpi.ca