An attempt at guessing the encoding of a (non-unicode) string

This is a subject that comes up fairly often. Last night, I had the
following idea, for which I would like feedback from you.

This could be implemented as a function in codecs.py (let's call it
"wild_guess") that is based on some pre-calculated data. The
pre-calculated data would be produced as follows:

1. Create a dictionary (key: encoding, value: set of valid bytes for the
encoding)

1a. the sets can be constructed by trial and error:

def valid_bytes(encoding):
    result = set()
    for byte in xrange(256):
        char = chr(byte)
        try:
            char.decode(encoding)
        except UnicodeDecodeError:
            pass
        else:
            result.add(char)
    return result
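
Step 1 itself then reduces to a small loop over whatever list of 8-bit
encodings we decide to support; a minimal sketch (the encoding names
here are only examples):

ENCODINGS = ['iso8859-1', 'iso8859-7', 'cp1252', 'koi8-r']

valid_sets = {}
for encoding in ENCODINGS:
    # key: encoding, value: set of valid chars for the encoding
    valid_sets[encoding] = valid_bytes(encoding)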

2. for every 8-bit encoding, some "representative" text is given (the
longer, the better)

2a. the following function is a quick generator of all two-char
sequences from its string argument. It can be used both for producing
the pre-calculated data and for analyzing a given string in the
'wild_guess' function.

import itertools

def str_window(text):
    return itertools.imap(
        text.__getslice__, xrange(0, len(text)-1), xrange(2, len(text)+1)
    )

So for every encoding and its 'representative' text, a bag of two-char
sequences and their frequencies is calculated: frequencies[encoding] =
dict(key: two-char pair, value: count).
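
A minimal sketch of that calculation, assuming a hypothetical
'samples' dict that maps each encoding name to its 'representative'
text:

def bigram_frequencies(text):
    # count every two-char window yielded by str_window
    freq = {}
    for pair in str_window(text):
        freq[pair] = freq.get(pair, 0) + 1
    return freq

frequencies = {}
for encoding, text in samples.items():   # samples is hypothetical
    frequencies[encoding] = bigram_frequencies(text)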

2b. do a lengthy comparison of the bags in order to find the most common
two-char sequences that, as a set, can be considered unique for the
specific encoding.

2c. For every encoding, keep only the set of the two-char sequences
chosen in step 2b as 'representative'. Store these calculated sets,
plus those from step 1a, as Python code in a helper module to be
imported from codecs.py for the wild_guess function (regenerate the
helper module every time some 'representative' text is added or
modified).
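
One crude way to do the selection of 2b and 2c: keep each encoding's
most frequent pairs and drop those that also rank highly for another
encoding (the cut-off of 200 below is an arbitrary assumption):

def representative_sets(frequencies, top_n=200):
    # keep each encoding's top_n most frequent pairs...
    tops = {}
    for encoding, freq in frequencies.items():
        pairs = sorted(freq.items(), key=lambda item: -item[1])
        tops[encoding] = set([pair for pair, count in pairs[:top_n]])
    # ...then discard any pair that is also frequent in another encoding
    result = {}
    for encoding in tops:
        others = set()
        for other in tops:
            if other != encoding:
                others |= tops[other]
        result[encoding] = tops[encoding] - others
    return result

representative = representative_sets(frequencies)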

3. write the wild_guess function

3a. the function 'wild_guess' would first construct a set from its
argument:

sample_set = set(argument)

and by set operations against the sets from step 1a, we can exclude
codecs where the sample set is not a subset of the encoding's valid set.
I don't expect that this step would exclude many encodings, but I think
it should not be skipped.

3b. pass the argument through the str_window function, and construct a
set of all two-char sequences

3c. from all the sets from step 2c, find the one whose intersection
with the set from 3b is largest as a ratio
len(intersection)/len(encoding_set), and suggest the corresponding
encoding.
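
Putting 3a-3c together, a rough sketch of the whole function, assuming
the precomputed sets from steps 1a and 2c are passed in (the parameter
names are mine):

def wild_guess(argument, valid_sets, representative):
    # 3a: exclude encodings whose valid set does not cover the sample
    sample_set = set(argument)
    candidates = [encoding for encoding in valid_sets
                  if sample_set <= valid_sets[encoding]]
    # 3b: all two-char sequences present in the argument
    pairs = set(str_window(argument))
    # 3c: suggest the candidate with the largest overlap ratio
    best, best_ratio = None, -1.0
    for encoding in candidates:
        enc_set = representative[encoding]
        if not enc_set:
            continue
        ratio = len(pairs & enc_set) / float(len(enc_set))
        if ratio > best_ratio:
            best, best_ratio = encoding, ratio
    return best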

What do you think? I can't test whether that would work unless I have
'representative' texts for various encodings. Please feel free to help
or bash :)

PS I know how generic 'representative' is, and how hard it is to
qualify some text as such, hence the quotes. That is why I said 'the
longer, the better'.
--
TZOTZIOY, I speak England very best,
Ils sont fous ces Redmontains! --Harddix
Jul 18 '05 #1
Christos TZOTZIOY Georgiou wrote:
This is a subject that comes up fairly often. Last night, I had the
following idea, for which I would like feedback from you.

This could be implemented as a function in codecs.py (let's call it
"wild_guess"), that is based on some pre-calculated data. These
pre-calculated data would be produced as follows:

[...]

What do you think? I can't test whether that would work unless I have
'representative' texts for various encodings. Please feel free to help
or bash :)


The representative text would, in some circles, be called a training
corpus. See the Natural Language Toolkit for some modules that may help
you prototype this approach:

<http://nltk.sf.net/>

In particular, check out the probability tutorial.
Jul 18 '05 #2
On Fri, 02 Apr 2004 15:05:42 GMT, rumours say that Jon Willeke
<j.***********@verizon.dot.net> might have written:
Christos TZOTZIOY Georgiou wrote:

<snip>

This could be implemented as a function in codecs.py (let's call it
"wild_guess"), that is based on some pre-calculated data. These
pre-calculated data would be produced as follows:

...

<snip>

[Jon] The representative text would, in some circles, be called a training
corpus. See the Natural Language Toolkit for some modules that may help
you prototype this approach:

<http://nltk.sf.net/>

In particular, check out the probability tutorial.


Thanks for the hint; I am browsing the documentation now. However,
I'd like to create something that would not depend on external
Python libraries, so that anyone interested could just download a small
module that would do the job, hopefully well.
--
TZOTZIOY, I speak England very best,
Ils sont fous ces Redmontains! --Harddix
Jul 18 '05 #3
I've been getting decent results by a much simpler approach:
count the number of characters for which the encoding produces a symbol
c for which c.isalpha() or c.isspace(), subtract a large penalty if
using the encoding leads to UnicodeDecodeError, and take the encoding
with the largest count.
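
If I understand the description correctly, that approach would look
roughly like the sketch below (the size of the penalty is a guess):

def guess_encoding(data, encodings, penalty=1000):
    best, best_score = None, None
    for encoding in encodings:
        try:
            text = data.decode(encoding)
            score = 0
        except UnicodeDecodeError:
            # decoding failed: decode leniently and apply the penalty
            text = data.decode(encoding, 'replace')
            score = -penalty
        # count letters and whitespace in the decoded result
        for c in text:
            if c.isalpha() or c.isspace():
                score += 1
        if best_score is None or score > best_score:
            best, best_score = encoding, score
    return best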

--
David Eppstein http://www.ics.uci.edu/~eppstein/
Univ. of California, Irvine, School of Information & Computer Science
Jul 18 '05 #4

"David Eppstein" <ep******@ics.uci.edu> wrote in message
news:ep****************************@news.service.uci.edu...
I've been getting decent results by a much simpler approach:
count the number of characters for which the encoding produces a symbol
c for which c.isalpha() or c.isspace(), subtract a large penalty if
using the encoding leads to UnicodeDecodeError, and take the encoding
with the largest count.
Shouldn't that be isalnum()? Or does your data not have
very many numbers?

John Roth

Jul 18 '05 #5
In article <10*************@news.supernews.com>,
"John Roth" <ne********@jhrothjr.com> wrote:
"David Eppstein" <ep******@ics.uci.edu> wrote in message
news:ep****************************@news.service.uci.edu...
I've been getting decent results by a much simpler approach:
count the number of characters for which the encoding produces a symbol
c for which c.isalpha() or c.isspace(), subtract a large penalty if
using the encoding leads to UnicodeDecodeError, and take the encoding
with the largest count.


Shouldn't that be isalnum()? Or does your data not have
very many numbers?


It's only important if your text has many code positions which produce a
digit in one encoding and not in another, and which are hard to
disambiguate using isalpha() alone. I haven't encountered that
situation.

--
David Eppstein http://www.ics.uci.edu/~eppstein/
Univ. of California, Irvine, School of Information & Computer Science
Jul 18 '05 #6
Christos TZOTZIOY Georgiou wrote:
This could be implemented as a function in codecs.py (let's call it
"wild_guess"), that is based on some pre-calculated data.


Windows already has a related function:

http://msdn.microsoft.com/library/de...icode_81np.asp

Read more about it here:

http://weblogs.asp.net/oldnewthing/a.../24/95235.aspx

Roger
Jul 18 '05 #7
On Sat, 3 Apr 2004 12:22:05 -0800, rumours say that "Roger Binns"
<ro****@rogerbinns.com> might have written:
Christos TZOTZIOY Georgiou wrote:
This could be implemented as a function in codecs.py (let's call it
"wild_guess"), that is based on some pre-calculated data.
Windows already has a related function:

http://msdn.microsoft.com/library/de...icode_81np.asp


As far as I understand, this function tests whether its argument is
valid Unicode text, so it has little to do with the issue I brought up:
take a Python string (8-bit bytes) and try to guess its encoding (e.g.
iso8859-1, iso8859-7, etc.).

There must be a similar function behind the "auto guess encoding"
feature of MS Internet Explorer; however:

1. even if it is exported and usable under Windows, it is not platform
independent

2. its guessing success rate (up to IE 5.5, which I happen to use) is
not very high

<snip>

Thanks for your reply, anyway.
--
TZOTZIOY, I speak England very best,
Ils sont fous ces Redmontains! --Harddix
Jul 18 '05 #8
On Fri, 02 Apr 2004 14:49:07 -0800, rumours say that David Eppstein
<ep******@ics.uci.edu> might have written:
I've been getting decent results by a much simpler approach:
count the number of characters for which the encoding produces a symbol
c for which c.isalpha() or c.isspace(), subtract a large penalty if
using the encoding leads to UnicodeDecodeError, and take the encoding
with the largest count.


Somebody (by email only so far) has suggested that spambayes could be
used for the task... perhaps they're right; however, this is not as
simple and independent a solution as I would like to deliver.

I believe your idea of a score is a good one; I feel that the key
should be two-char combinations, but I'll have to compare the success
rates of both one-char and two-char keys.

I'll try to search for "representative" texts on the web for as many
encodings as I can; any pointers or links from non-English speakers
would be welcome in the thread.
--
TZOTZIOY, I speak England very best,
Ils sont fous ces Redmontains! --Harddix
Jul 18 '05 #9
I think you will find Mozilla's charset autodetection method
interesting.

A composite approach to language/encoding detection
http://www.mozilla.org/projects/intl...Detection.html

Perhaps this can be used with PyXPCOM. I don't know.
Jul 18 '05 #10
In article <6p********************************@4ax.com>,
Christos "TZOTZIOY" Georgiou <tz**@sil-tec.gr> wrote:
I've been getting decent results by a much simpler approach:
count the number of characters for which the encoding produces a symbol
c for which c.isalpha() or c.isspace(), subtract a large penalty if
using the encoding leads to UnicodeDecodeError, and take the encoding
with the largest count.


Somebody (by email only so far) has suggested that spambayes could be
used for the task... perhaps they're right; however, this is not as
simple and independent a solution as I would like to deliver.

I believe your idea of a score is a good one; I feel that the key
should be two-char combinations, but I'll have to compare the success
rates of both one-char and two-char keys.

I'll try to search for "representative" texts on the web for as many
encodings as I can; any pointers or links from non-English speakers
would be welcome in the thread.


BTW, if you're going to implement the single-char version, at least for
encodings that translate one byte -> one unicode position (e.g., not
utf8), and your texts are large enough, it will be faster to precompute
a table of byte frequencies in the text and then compute the score by
summing the frequencies of alphabetic bytes.
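
A sketch of that speed-up for single-byte encodings (the wording of
the code is mine):

def score_by_byte_frequency(data, encoding):
    # tabulate how often each byte value occurs in the text
    freq = [0] * 256
    for byte in data:
        freq[ord(byte)] += 1
    # sum the frequencies of byte values that decode to letters or
    # whitespace: at most 256 decode attempts instead of one test
    # per character of the text
    score = 0
    for value in xrange(256):
        if not freq[value]:
            continue
        try:
            char = chr(value).decode(encoding)
        except UnicodeDecodeError:
            continue
        if char.isalpha() or char.isspace():
            score += freq[value]
    return score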

--
David Eppstein http://www.ics.uci.edu/~eppstein/
Univ. of California, Irvine, School of Information & Computer Science
Jul 18 '05 #11
On 5 Apr 2004 08:14:54 -0700, rumours say that un********@hanmail.net
(Seo Sanghyeon) might have written:
I think you will find Mozilla's charset autodetection method
interesting.

A composite approach to language/encoding detection
http://www.mozilla.org/projects/intl...Detection.html
Thank you!
Perhaps this can be used with PyXPCOM. I don't know.


Neither do I...
--
TZOTZIOY, I speak England very best,
Ils sont fous ces Redmontains! --Harddix
Jul 18 '05 #12
On Mon, 05 Apr 2004 13:37:34 -0700, rumours say that David Eppstein
<ep******@ics.uci.edu> might have written:
BTW, if you're going to implement the single-char version, at least for
encodings that translate one byte -> one unicode position (e.g., not
utf8), and your texts are large enough, it will be faster to precompute
a table of byte frequencies in the text and then compute the score by
summing the frequencies of alphabetic bytes.


Thanks for the pointer, David. However, as often happens, I came
second (or, probably, n-th :). Seo Sanghyeon sent a URL that includes a
two-char proposal, and it provides an algorithm in section 4.7.1 that I
find appropriate for this matter:

http://www.mozilla.org/projects/intl...Detection.html
--
TZOTZIOY, I speak England very best,
Ils sont fous ces Redmontains! --Harddix
Jul 18 '05 #13
