This is a subject that comes up fairly often. Last night, I had the
following idea, for which I would like feedback from you.

This could be implemented as a function in codecs.py (let's call it
"wild_guess") that is based on some pre-calculated data. This
pre-calculated data would be produced as follows:

1. Create a dictionary (key: encoding, value: set of valid bytes for the
encoding).

1a. The sets can be constructed by trial and error:

def valid_bytes(encoding):
    result = set()
    for byte in xrange(256):
        char = chr(byte)
        try:
            char.decode(encoding)
        except UnicodeDecodeError:
            pass
        else:
            result.add(char)
    return result
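For anyone on Python 3, where a str can no longer be decoded, the same trial-and-error idea would look roughly like this (a sketch, keeping the valid_bytes name from above; the set holds byte values rather than chars):

```python
def valid_bytes(encoding):
    """Return the set of byte values that decode successfully in `encoding`."""
    result = set()
    for byte in range(256):
        try:
            bytes([byte]).decode(encoding)
        except UnicodeDecodeError:
            pass
        else:
            result.add(byte)
    return result

# every byte 0-127 is valid ASCII, nothing above is
print(len(valid_bytes("ascii")))  # 128
```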

2. For every 8-bit encoding, some "representative" text is given (the
longer, the better).

2a. The following function is a quick generator of all two-char
sequences from its string argument; it can be used both for the
production of the pre-calculated data and for the analysis of a given
string in the 'wild_guess' function:

import itertools

def str_window(text):
    return itertools.imap(
        text.__getslice__, xrange(0, len(text)-1), xrange(2, len(text)+1)
    )
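In passing, on Python 3 both itertools.imap and __getslice__ are gone; the same window can be had with zip (just a sketch of the same idea):

```python
def str_window(text):
    # all two-char sequences of text, e.g. "abcd" -> "ab", "bc", "cd"
    return (a + b for a, b in zip(text, text[1:]))

print(list(str_window("abcd")))  # ['ab', 'bc', 'cd']
```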

So for every encoding and 'representative' text, a bag of two-char
sequences and their frequencies is calculated: frequencies[encoding] =
dict(key: two-chars, value: count).
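That "bag" is exactly what collections.Counter provides; a minimal sketch (the sample text here is made up):

```python
from collections import Counter

def bigram_frequencies(text):
    # count every two-char sequence in text
    return Counter(a + b for a, b in zip(text, text[1:]))

freqs = bigram_frequencies("ababc")  # bigrams: ab, ba, ab, bc
print(freqs["ab"])  # 2
```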

2b. Do a lengthy comparison of the bags in order to find the most common
two-char sequences that, as a set, can be considered unique to the
specific encoding.
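One possible (naive) way to do the comparison in step 2b: for each encoding, keep its most frequent bigrams and subtract anything that also ranks highly for another encoding. A toy sketch with invented frequency data:

```python
from collections import Counter

def representative_sets(frequencies, top=50):
    # frequencies: {encoding: Counter of bigrams}
    tops = {enc: {bg for bg, _ in c.most_common(top)}
            for enc, c in frequencies.items()}
    result = {}
    for enc, bigrams in tops.items():
        # bigrams that are top-ranked for this encoding and no other
        others = set().union(*(s for e, s in tops.items() if e != enc))
        result[enc] = bigrams - others
    return result

freqs = {
    "enc-a": Counter({"th": 9, "he": 7, "xq": 1}),
    "enc-b": Counter({"ab": 5, "he": 4}),
}
print(representative_sets(freqs))
```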

2c. For every encoding, keep only a set of the two-char sequences
(chosen in step 2b) that were judged as 'representative'. Store these
calculated sets, plus those from step 1a, as Python code in a helper
module to be imported from codecs.py for the wild_guess function
(regenerate the helper module every time some 'representative' text is
added or modified).

3. Write the wild_guess function.

3a. The function 'wild_guess' would first construct a set from its
argument:

sample_set = set(argument)

and by set operations against the sets from step 1a, we can exclude
codecs where the sample set is not a subset of the encoding's valid set.
I don't expect that this step would exclude many encodings, but I think
it should not be skipped.
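Step 3a is a plain subset test; with the step-1a sets in a dict (the data below is invented for illustration), the surviving candidates are:

```python
def surviving_encodings(sample, valid_sets):
    # keep only encodings whose valid set covers every char in the sample
    sample_set = set(sample)
    return [enc for enc, valid in valid_sets.items()
            if sample_set <= valid]

valid_sets = {
    "enc-a": set("abcdef"),
    "enc-b": set("abc"),
}
print(surviving_encodings("fad", valid_sets))  # ['enc-a']
```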

3b. Pass the argument through the str_window function, and construct a
set of all two-char sequences.

3c. From all the sets from step 2c, find the one whose intersection with
the set from 3b is largest as a ratio of
len(intersection)/len(encoding_set), and suggest the relevant encoding.
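Putting 3b and 3c together, the guess itself is just a max over that overlap ratio; a self-contained sketch with made-up representative sets:

```python
def wild_guess(text, representative_sets):
    # bigrams of the sample text (step 3b)
    sample = {a + b for a, b in zip(text, text[1:])}

    def score(enc):
        # ratio len(intersection)/len(encoding_set) from step 3c
        enc_set = representative_sets[enc]
        return len(sample & enc_set) / len(enc_set)

    return max(representative_sets, key=score)

reps = {
    "enc-a": {"th", "he", "in"},
    "enc-b": {"xq", "zj"},
}
print(wild_guess("the thin", reps))  # 'enc-a'
```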

What do you think? I can't test whether that would work unless I have
'representative' texts for various encodings. Please feel free to help
or bash :)

PS I know how generic 'representative' is, and how hard it is to qualify
some text as such, therefore the quotes. That is why I said 'the longer,
the better'.

--
TZOTZIOY, I speak England very best,
Ils sont fous ces Redmontains! --Harddix