Distinguishing cp850 and cp1252?

I'm working on some Python code for reading files in a certain format,
and the examples of such files I've found on the internet appear to be
in either cp850 or cp1252 encoding (except for one exception for which I
can't find a correct encoding among the standard Python ones).

The file format itself includes nothing about which encoding is used,
but only one of the two produces sensible results in the non-ascii
examples I've seen.

Is there an easy way of guessing with reasonable accuracy which of these
two encodings was used for a particular file?

--
David Eppstein http://www.ics.uci.edu/~eppstein/
Univ. of California, Irvine, School of Information & Computer Science
Jul 18 '05 #1

"David Eppstein" <ep******@ics.uci.edu> wrote:
Is there an easy way of guessing with reasonable accuracy which of these
two encodings was used for a particular file?

The only way I know of is to do a statistical analysis of letter
frequencies. To do that, you have to know your data fairly well.
For example, CP850 devotes a number of code points to box-drawing
characters. If your data doesn't involve drawing boxes, and decoding
it as CP850 produces those characters, I'd say that's a strong
clue that you're dealing with CP1252.
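
As a rough illustration of the box-drawing clue, the sketch below counts how many characters fall in the Unicode box-drawing and block-element ranges (U+2500 to U+259F) when the raw bytes are decoded as cp850. It assumes Python 3, and the function name `count_box_drawing` is made up for this example.

```python
def count_box_drawing(data):
    """Count box-drawing/block characters seen when `data` is read as cp850.

    `data` is a bytes object holding the raw file contents.
    """
    # errors="replace" is only a safety net; it turns any undecodable
    # byte into U+FFFD, which is outside the range tested below.
    text = data.decode("cp850", errors="replace")
    return sum(1 for ch in text if "\u2500" <= ch <= "\u259f")


if __name__ == "__main__":
    sample = "ÉCOLE".encode("cp1252")  # cp1252 bytes, misread as cp850 below
    print(count_box_drawing(sample))
```

A count well above zero on data that should contain no boxes would point toward cp1252; a count of zero says little on its own.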

I know this doesn't help all that much, but it's the only thing
that has worked for me.

John Roth

Jul 18 '05 #2
David Eppstein wrote:
Is there an easy way of guessing with reasonable accuracy which of these
two encodings was used for a particular file?


You could try the assumption that most characters should be letters,
assuming your documents are likely text documents of some sort. The idea
is that what is a letter in one code is some non-letter graphical symbol
in the other.

So you would create a predicate "isletter" for each character set, and
then count the number of bytes in a document which are not letters. You
should probably exclude the ASCII characters in counting, since they
would have the same interpretation in either code. The code that gives
you fewer (or no) non-letter characters is likely the correct
interpretation.

To find out which bytes are letters, you could use unicodedata.category;
letter categories start with "L" (for example "Ll" and "Lu" for lowercase
and uppercase letters). You should compute a bitmap for each character set
up front, and you should find out how much the set bits overlap.
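
The approach above could be sketched roughly as follows, assuming Python 3. The names `letter_table`, `count_non_letters`, `letter_overlap` and `guess_encoding` are invented for this example. Bytes that cp1252 leaves undefined are decoded with `errors="replace"` and so count as non-letters.

```python
import unicodedata

CANDIDATES = ("cp850", "cp1252")


def letter_table(encoding):
    """Return a 256-entry list; entry b is True if byte b is a letter.

    Only bytes 0x80..0xFF are examined, since ASCII reads the same
    in both encodings.
    """
    table = [False] * 256
    for b in range(0x80, 0x100):
        # Undefined bytes become U+FFFD, which is not a letter.
        ch = bytes([b]).decode(encoding, errors="replace")
        table[b] = unicodedata.category(ch).startswith("L")
    return table


# Computed once, up front, for each candidate character set.
TABLES = {enc: letter_table(enc) for enc in CANDIDATES}


def count_non_letters(data, encoding):
    """Count non-ASCII bytes in `data` that are not letters in `encoding`."""
    table = TABLES[encoding]
    return sum(1 for b in data if b >= 0x80 and not table[b])


def letter_overlap():
    """Number of byte values that are letters in every candidate encoding."""
    return sum(all(TABLES[enc][b] for enc in CANDIDATES) for b in range(256))


def guess_encoding(data):
    """Pick the candidate with the fewest non-letter bytes (ties: first one)."""
    return min(CANDIDATES, key=lambda enc: count_non_letters(data, enc))
```

Bytes counted by `letter_overlap()` carry no signal either way, so the more of a file's non-ASCII bytes fall in that overlap, the less the guess should be trusted.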

To get higher accuracy, you need advance knowledge of the natural
language your documents are in, and then you need to use a dictionary
of that language.

HTH,
Martin

Jul 18 '05 #3
In article <vq************@news.supernews.com>,
"John Roth" <ne********@jhrothjr.com> wrote:
The only way I know of is to do a statistical analysis of letter
frequencies. To do that, you have to know your data fairly well.
For example, CP850 devotes a number of code points to box-drawing
characters. If your data doesn't involve drawing boxes, and decoding
it as CP850 produces those characters, I'd say that's a strong
clue that you're dealing with CP1252.


Thanks. After trying some other more hackish things which sort of
worked (e.g. does the encoding produce characters with ord > 255?) I
settled on a very simple statistical scheme: for each encoding, count
how many of the decoded characters answer true to isalpha(), and pick
the encoding with the most votes. Seems to be working...
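
A minimal sketch of such a voting scheme, assuming Python 3 (where `bytes.decode` returns `str`). This is an illustration, not the code actually used, and the names `alpha_votes` and `guess_by_isalpha` are invented for it.

```python
def alpha_votes(data, encoding):
    """Count decoded characters for which str.isalpha() is true."""
    # cp1252 leaves a few byte values undefined; errors="replace" maps
    # them to U+FFFD, which is not alphabetic and so earns no vote.
    text = data.decode(encoding, errors="replace")
    return sum(1 for ch in text if ch.isalpha())


def guess_by_isalpha(data, candidates=("cp850", "cp1252")):
    """Return the candidate encoding with the most alphabetic characters.

    ASCII letters score the same under both encodings, so only the
    non-ASCII bytes affect the outcome. Ties go to the first candidate.
    """
    return max(candidates, key=lambda enc: alpha_votes(data, enc))


if __name__ == "__main__":
    raw = "Ünïcödé".encode("cp1252")
    print(guess_by_isalpha(raw))
```

With only a handful of non-ASCII bytes in a file the vote can easily tie, so the result is best treated as a guess.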

--
David Eppstein http://www.ics.uci.edu/~eppstein/
Univ. of California, Irvine, School of Information & Computer Science
Jul 18 '05 #4
