
An attempt at guessing the encoding of a (non-unicode) string

This is a subject that comes up fairly often. Last night, I had the
following idea, for which I would like feedback from you.

This could be implemented as a function in codecs.py (let's call it
"wild_guess"), that is based on some pre-calculated data. These
pre-calculated data would be produced as follows:

1. Create a dictionary (key: encoding, value: set of valid bytes for the
encoding)

1a. the sets can be constructed by trial and error:

def valid_bytes(encoding):
    # return the set of single-byte strings that decode successfully
    # under the given encoding
    result = set()
    for byte in xrange(256):
        char = chr(byte)
        try:
            char.decode(encoding)
        except UnicodeDecodeError:
            pass
        else:
            result.add(char)
    return result
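
For example, the step-1 dictionary could then be built as follows (the
list of encodings here is only illustrative):

valid_sets = dict([(enc, valid_bytes(enc))
                   for enc in ('iso8859-1', 'iso8859-7', 'cp1252')])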

2. for every 8-bit encoding, some "representative" text is given (the
longer, the better)

2a. the following function is a quick generator of all two-char
sequences from its string argument. It can be used both for the
production of the pre-calculated data and for the analysis of a given
string in the 'wild_guess' function.

import itertools

def str_window(text):
    # generate every overlapping two-character slice of text
    return itertools.imap(
        text.__getslice__, xrange(0, len(text) - 1), xrange(2, len(text) + 1)
    )

So for every encoding and 'representative' text, a bag of two-char
sequences and their frequencies is calculated:
frequencies[encoding] = dict(key: two-char sequence, value: count)
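
A rough sketch of how these bags could be produced; the
representative_texts mapping is hypothetical (it would hold the
collected sample text for each encoding):

def bigram_bag(text):
    # count the occurrences of every two-char window in the text
    bag = {}
    for pair in str_window(text):
        bag[pair] = bag.get(pair, 0) + 1
    return bag

frequencies = {}
for encoding, sample in representative_texts.items():
    frequencies[encoding] = bigram_bag(sample)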

2b. do a lengthy comparison of the bags in order to find the most common
two-char sequences that, as a set, can be considered unique for the
specific encoding.

2c. For every encoding, keep only a set of the (chosen in step 2b)
two-char sequences that were judged as 'representative'. Store these
calculated sets plus those from step 1a as Python code in a helper
module to be imported from codecs.py for the wild_guess function
(regenerate the helper module every time some 'representative' text is
added or modified).
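
The uniqueness criterion of step 2b is left open above; as one crude
possibility, the sets of step 2c could keep only the most frequent
bigrams of each corpus that occur in no other corpus:

def distinctive_bigrams(frequencies, keep=200):
    # frequencies: encoding -> dict(two-char sequence -> count)
    result = {}
    for encoding, bag in frequencies.items():
        # collect every bigram seen in any of the other corpora
        others = set()
        for other, other_bag in frequencies.items():
            if other != encoding:
                others.update(other_bag)
        # rank this corpus's unique bigrams by frequency
        unique = [(count, pair) for pair, count in bag.items()
                  if pair not in others]
        unique.sort()
        # keep the 'keep' most frequent ones
        result[encoding] = set([pair for count, pair in unique[-keep:]])
    return result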

3. write the wild_guess function

3a. the function 'wild_guess' would first construct a set from its
argument:

sample_set = set(argument)

and by set operations against the sets from step 1a, we can exclude
encodings for which the sample set is not a subset of the encoding's
valid set. I don't expect this step to exclude many encodings, but I
think it should not be skipped.

3b. pass the argument through the str_window function, and construct a
set of all two-char sequences

3c. from all sets from step 2c, find the one whose intersection with
the set from 3b is largest as a ratio of
len(intersection)/len(encoding_set), and suggest the relevant encoding.
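
Putting 3a-3c together, a minimal sketch of wild_guess; it assumes the
precomputed tables valid_sets (step 1a) and distinctive_sets (step 2c),
both names being illustrative:

def wild_guess(data):
    # 3a: exclude encodings whose valid set cannot contain the sample
    sample_set = set(data)
    candidates = [enc for enc, valid in valid_sets.items()
                  if sample_set <= valid]
    # 3b: all two-char sequences present in the sample
    pairs = set(str_window(data))
    # 3c: score each candidate by the fraction of its distinctive
    # bigrams that occur in the sample, and suggest the best scorer
    best, best_score = None, -1.0
    for enc in candidates:
        enc_set = distinctive_sets[enc]
        if not enc_set:
            continue
        score = len(pairs & enc_set) / float(len(enc_set))
        if score > best_score:
            best, best_score = enc, score
    return best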

What do you think? I can't test whether that would work unless I have
'representative' texts for various encodings. Please feel free to help
or bash :)

PS I know how generic 'representative' is, and how hard it is to qualify
some text as such, therefore the quotes. That is why I said 'the
longer, the better'.
--
TZOTZIOY, I speak England very best,
Ils sont fous ces Redmontains! --Harddix
Jul 18 '05 #1
Christos TZOTZIOY Georgiou wrote:
This is a subject that comes up fairly often. Last night, I had the
following idea, for which I would like feedback from you.

This could be implemented as a function in codecs.py (let's call it
"wild_guess"), that is based on some pre-calculated data. These
pre-calculated data would be produced as follows:

...

What do you think? I can't test whether that would work unless I have
'representative' texts for various encodings. Please feel free to help
or bash :)


The representative text would, in some circles, be called a training
corpus. See the Natural Language Toolkit for some modules that may help
you prototype this approach:

<http://nltk.sf.net/>

In particular, check out the probability tutorial.
Jul 18 '05 #2
On Fri, 02 Apr 2004 15:05:42 GMT, rumours say that Jon Willeke
<j.***********@verizon.dot.net> might have written:
Christos TZOTZIOY Georgiou wrote:

<snip>

This could be implemented as a function in codecs.py (let's call it
"wild_guess"), that is based on some pre-calculated data. These
pre-calculated data would be produced as follows:

...

<snip>

[Jon] The representative text would, in some circles, be called a training
corpus. See the Natural Language Toolkit for some modules that may help
you prototype this approach:

<http://nltk.sf.net/>

In particular, check out the probability tutorial.


Thanks for the hint, and I am browsing the documentation now. However,
I'd like to create something that does not depend on external Python
libraries, so that anyone interested could just download a small module
that would, hopefully, do the job well.
--
TZOTZIOY, I speak England very best,
Ils sont fous ces Redmontains! --Harddix
Jul 18 '05 #3
I've been getting decent results by a much simpler approach:
count the number of characters for which the encoding produces a symbol
c for which c.isalpha() or c.isspace(), subtract a large penalty if
using the encoding leads to UnicodeDecodeError, and take the encoding
with the largest count.
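
For reference, a minimal sketch of that scoring approach; the function
name, the candidate-encoding list, and the penalty value are all
illustrative:

def guess_by_alpha_score(data, encodings, penalty=1000):
    best, best_score = None, None
    for enc in encodings:
        score = 0
        try:
            text = data.decode(enc)
        except UnicodeDecodeError:
            # a failed decode costs a large penalty; still score what
            # can be decoded using replacement characters
            text = data.decode(enc, 'replace')
            score = -penalty
        for c in text:
            if c.isalpha() or c.isspace():
                score += 1
        if best_score is None or score > best_score:
            best, best_score = enc, score
    return best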

--
David Eppstein http://www.ics.uci.edu/~eppstein/
Univ. of California, Irvine, School of Information & Computer Science
Jul 18 '05 #4

"David Eppstein" <ep******@ics.uci.edu> wrote in message
news:ep****************************@news.service.uci.edu...
I've been getting decent results by a much simpler approach:
count the number of characters for which the encoding produces a symbol
c for which c.isalpha() or c.isspace(), subtract a large penalty if
using the encoding leads to UnicodeDecodeError, and take the encoding
with the largest count.
Shouldn't that be isalnum()? Or does your data not have
very many numbers?

John Roth

Jul 18 '05 #5
In article <10*************@news.supernews.com>,
"John Roth" <ne********@jhrothjr.com> wrote:
"David Eppstein" <ep******@ics.uci.edu> wrote in message
news:ep****************************@news.service.uci.edu...
I've been getting decent results by a much simpler approach:
count the number of characters for which the encoding produces a symbol
c for which c.isalpha() or c.isspace(), subtract a large penalty if
using the encoding leads to UnicodeDecodeError, and take the encoding
with the largest count.


Shouldn't that be isalnum()? Or does your data not have
very many numbers?


It's only important if your text has many code positions which produce a
digit in one encoding and not in another, and which are hard to
disambiguate using isalpha() alone. I haven't encountered that
situation.

--
David Eppstein http://www.ics.uci.edu/~eppstein/
Univ. of California, Irvine, School of Information & Computer Science
Jul 18 '05 #6
Christos TZOTZIOY Georgiou wrote:
This could be implemented as a function in codecs.py (let's call it
"wild_guess"), that is based on some pre-calculated data.


Windows already has a related function:

http://msdn.microsoft.com/library/de...icode_81np.asp

Read more about it here:

http://weblogs.asp.net/oldnewthing/a.../24/95235.aspx

Roger
Jul 18 '05 #7
On Sat, 3 Apr 2004 12:22:05 -0800, rumours say that "Roger Binns"
<ro****@rogerbinns.com> might have written:
Christos TZOTZIOY Georgiou wrote:
This could be implemented as a function in codecs.py (let's call it
"wild_guess"), that is based on some pre-calculated data.
Windows already has a related function:

http://msdn.microsoft.com/library/de...icode_81np.asp


As far as I understand, this function tests whether its argument is
valid Unicode text, so it has little to do with the issue I brought up:
take a Python string (8-bit bytes) and try to guess its encoding (e.g.,
iso8859-1, iso8859-7 etc).

There must be a similar function behind the "auto guess encoding"
feature of MS Internet Explorer; however:

1. even if it is exported and usable under Windows, it is not platform
independent

2. its guessing success rate (as of IE 5.5, which I happen to use) is
not very high

<snip>

Thanks for your reply, anyway.
--
TZOTZIOY, I speak England very best,
Ils sont fous ces Redmontains! --Harddix
Jul 18 '05 #8
On Fri, 02 Apr 2004 14:49:07 -0800, rumours say that David Eppstein
<ep******@ics.uci.edu> might have written:
I've been getting decent results by a much simpler approach:
count the number of characters for which the encoding produces a symbol
c for which c.isalpha() or c.isspace(), subtract a large penalty if
using the encoding leads to UnicodeDecodeError, and take the encoding
with the largest count.


Somebody (by email only so far) has suggested that spambayes could be
used for the task... perhaps they're right; however, that is not as
simple and independent a solution as I would like to deliver.

I believe your idea of a score is a good one; I feel that the key
should be two-char combinations, but I'll have to compare the success
rates of both one-char and two-char keys.

I'll try to search for "representative" texts on the web for as many
encodings as I can; any pointers or links from non-English speakers
would be welcome in the thread.
--
TZOTZIOY, I speak England very best,
Ils sont fous ces Redmontains! --Harddix
Jul 18 '05 #9
I think you will find Mozilla's charset autodetection method
interesting.

A composite approach to language/encoding detection
http://www.mozilla.org/projects/intl...Detection.html

Perhaps this can be used with PyXPCOM. I don't know.
Jul 18 '05 #10
In article <6p********************************@4ax.com>,
Christos "TZOTZIOY" Georgiou <tz**@sil-tec.gr> wrote:
I've been getting decent results by a much simpler approach:
count the number of characters for which the encoding produces a symbol
c for which c.isalpha() or c.isspace(), subtract a large penalty if
using the encoding leads to UnicodeDecodeError, and take the encoding
with the largest count.


Somebody (by email only so far) has suggested that spambayes could be
used for the task... perhaps they're right; however, that is not as
simple and independent a solution as I would like to deliver.

I believe your idea of a score is a good one; I feel that the key
should be two-char combinations, but I'll have to compare the success
rates of both one-char and two-char keys.

I'll try to search for "representative" texts on the web for as many
encodings as I can; any pointers or links from non-English speakers
would be welcome in the thread.


BTW, if you're going to implement the single-char version, at least for
encodings that translate one byte -> one unicode position (e.g., not
utf8), and your texts are large enough, it will be faster to precompute
a table of byte frequencies in the text and then compute the score by
summing the frequencies of alphabetic bytes.
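
A sketch of that optimization for single-byte encodings (again, the
names are illustrative):

def alpha_scores_fast(data, encodings):
    # count each byte value in the text once
    freq = [0] * 256
    for byte in data:
        freq[ord(byte)] += 1
    # per encoding, sum the frequencies of byte values that decode to
    # an alphabetic or whitespace character
    scores = {}
    for enc in encodings:
        total = 0
        for value in xrange(256):
            if not freq[value]:
                continue
            try:
                c = chr(value).decode(enc)
            except UnicodeDecodeError:
                continue
            if c.isalpha() or c.isspace():
                total += freq[value]
        scores[enc] = total
    return scores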

--
David Eppstein http://www.ics.uci.edu/~eppstein/
Univ. of California, Irvine, School of Information & Computer Science
Jul 18 '05 #11
On 5 Apr 2004 08:14:54 -0700, rumours say that un********@hanmail.net
(Seo Sanghyeon) might have written:
I think you will find Mozilla's charset autodetection method
interesting.

A composite approach to language/encoding detection
http://www.mozilla.org/projects/intl...Detection.html
Thank you!
Perhaps this can be used with PyXPCOM. I don't know.


Neither do I...
--
TZOTZIOY, I speak England very best,
Ils sont fous ces Redmontains! --Harddix
Jul 18 '05 #12
On Mon, 05 Apr 2004 13:37:34 -0700, rumours say that David Eppstein
<ep******@ics.uci.edu> might have written:
BTW, if you're going to implement the single-char version, at least for
encodings that translate one byte -> one unicode position (e.g., not
utf8), and your texts are large enough, it will be faster to precompute
a table of byte frequencies in the text and then compute the score by
summing the frequencies of alphabetic bytes.


Thanks for the pointer, David. However, as often happens, I came
second (or, probably, n-th :). Seo Sanghyeon sent a URL that includes a
two-char proposal, and it provides an algorithm in section 4.7.1 that I
find appropriate for this matter:

http://www.mozilla.org/projects/intl...Detection.html
--
TZOTZIOY, I speak England very best,
Ils sont fous ces Redmontains! --Harddix
Jul 18 '05 #13
