Encoding sniffer?

Does anyone know of a Python module that is able to sniff the encoding of
text? Please: I know that there is no reliable way to do this, but I need
something that works in most cases, so please no discussion about
whether such a module and approach make sense.

Andreas

Jan 5 '06 #1
Andreas Jung <li***@andreas-jung.com> wrote:
> Does anyone know of a Python module that is able to sniff the encoding of
> text? Please: I know that there is no reliable way to do this but I need
> something that works for most of the case...so please no discussion about
> the sense of such a module and approach.

It depends on what exactly you need.
One approach is pyenca.

The other is:

    def try_encodings(s, encodings):
        "Try to guess the encoding of byte string s, testing the encodings given in the second parameter."
        for enc in encodings:
            try:
                s.decode(enc)  # raises UnicodeDecodeError if s is not valid in enc
                return enc
            except UnicodeDecodeError:
                pass
        return None  # no candidate encoding matched

    print(try_encodings(text, ['ascii', 'utf-8', 'iso8859_1', 'cp1252', 'macroman']))

Depending on what language and encodings you expect the text to be in,
the first or the second approach is better.
--
Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/
garabik @ kassiopeia.juls.savba.sk
Jan 5 '06 #2
> print try_encodings(text, ['ascii', 'utf-8', 'iso8859_1', 'cp1252', 'macroman']

I've fallen into that trap before: nothing listed after iso8859_1 will
ever be tried. The reason is that an eight-bit encoding usually has all
256 code points assigned, so decoding never fails. (There are exceptions,
but you would have to be lucky to have a string containing a value that
is not assigned in one of them, which is highly unlikely.)

AFAIK iso-8859-1 has all code points taken, so you won't get beyond it
in your example.
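
To illustrate the point, here is a minimal sketch, assuming Python 3; the helper name undecodable_bytes is made up for this example and is not part of any library. It tries every single byte value against a codec and collects the ones that are rejected. It only makes sense for single-byte encodings.

```python
def undecodable_bytes(encoding):
    """Return the single byte values that `encoding` refuses to decode."""
    bad = []
    for value in range(256):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            bad.append(value)
    return bad


# An empty list means the codec accepts any byte string whatsoever.
latin1_gaps = undecodable_bytes("iso8859_1")
cp1252_gaps = undecodable_bytes("cp1252")

print("iso8859_1 rejects:", latin1_gaps)
print("cp1252 rejects:", [hex(v) for v in cp1252_gaps])
```

If a codec rejects nothing, a try-each-encoding-in-turn loop can never get past it, so anything listed after it is dead weight.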
Regards,

Diez
Jan 5 '06 #3
Diez B. Roggisch <de***@nospam.web.de> wrote:
> > print try_encodings(text, ['ascii', 'utf-8', 'iso8859_1', 'cp1252', 'macroman']
>
> I've fallen into that trap before - it won't work after the iso8859_1.
> The reason is that an eight-bit encoding have all 256 code-points
> assigned (usually, there are exceptions but you have to be lucky to have
> a string that contains a value not assigned in one of them - which is
> highly unlikely)
>
> AFAIK iso-8859-1 has all codepoints taken - so you won't go beyond that
> in your example.

I pasted from the wrong file :-)
See my previous posting (a few days ago): what I did was implement an
iso8859_1_ncc encoding (iso8859_1 without control codes), and
the line should have been

    try_encodings(text, ['ascii', 'utf-8', 'iso8859_1_ncc', 'cp1252', 'macroman'])

where iso8859_1_ncc.py is the same as iso8859_1.py from the Python
distribution, except for this line:

    decoding_map = codecs.make_identity_dict(list(range(32, 128)) + list(range(128 + 32, 256)))
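
For readers who do not want to install a custom codec module, roughly the same check can be sketched as a plain function. This is only an approximation of the iso8859_1_ncc idea, assuming Python 3; the names FORBIDDEN and decode_latin1_ncc are hypothetical. Note that, like the ranges in the post, it also rejects tab, LF and CR, which may or may not be what you want.

```python
# Byte values rejected by the "no control codes" variant:
# 0-31 (C0 controls) and 128-159 (C1 controls), i.e. everything
# outside range(32, 128) and range(160, 256).
FORBIDDEN = frozenset(range(0, 32)) | frozenset(range(128, 160))


def decode_latin1_ncc(data):
    """Decode bytes as ISO 8859-1, but fail on any control code."""
    for pos, byte in enumerate(data):
        if byte in FORBIDDEN:
            raise UnicodeDecodeError(
                "iso8859_1_ncc", bytes(data), pos, pos + 1,
                "control code not allowed",
            )
    return data.decode("iso8859_1")


print(decode_latin1_ncc(b"caf\xe9"))
```

Because it raises UnicodeDecodeError, a function like this can stand in for a decode call inside a try-each-encoding loop, so that text using bytes 128-159 (typical for cp1252) falls through to the next candidate.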
Jan 5 '06 #4
ga******************@kassiopeia.juls.savba.sk wrote:
> Diez B. Roggisch <de***@nospam.web.de> wrote:
> > > print try_encodings(text, ['ascii', 'utf-8', 'iso8859_1', 'cp1252', 'macroman']
> >
> > I've fallen into that trap before - it won't work after the iso8859_1.
> > The reason is that an eight-bit encoding have all 256 code-points
> > assigned (usually, there are exceptions but you have to be lucky to have
> > a string that contains a value not assigned in one of them - which is
> > highly unlikely)
> >
> > AFAIK iso-8859-1 has all codepoints taken - so you won't go beyond that
> > in your example.
>
> I pasted from a wrong file :-)
> See my previous posting (a few days ago) - what I did was to implement
> iso8859_1_ncc encoding (iso8859_1 without control codes) and
> the line should have been
> try_encodings(text, ['ascii', 'utf-8', 'iso8859_1_ncc', 'cp1252', 'macroman']
>
> where iso8859_1_ncc.py is the same as iso8859_1.py from python
> distribution, with this line different:
>
> decoding_map = codecs.make_identity_dict(range(32, 128)+range(128+32,256))

OK, I can see that. But there would still be quite a few overlapping
code points.

I think what the OP (and many other people) wants is something that
tries to guess the encoding based on probabilities, for example of
certain trigrams containing an umlaut.

There seems to be a tool called "konwert" out there that does such
things, and recode has some guessing features too, AFAIK, but I haven't
seen any dedicated Python modules so far.
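
The trigram idea could be sketched very roughly as follows. This is a toy illustration, not how konwert or recode work: it assumes Python 3 and that you have some sample text in the expected language (the reference_text parameter), and the function names are invented for this example. Each candidate decoding is scored by how many of its trigrams also occur in the sample.

```python
from collections import Counter


def trigrams(text):
    """Count all overlapping 3-character sequences in text."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def guess_encoding(data, candidates, reference_text):
    """Return the candidate whose decoding shares most trigrams with reference_text."""
    known = trigrams(reference_text.lower())
    best, best_score = None, -1
    for enc in candidates:
        try:
            decoded = data.decode(enc).lower()
        except UnicodeDecodeError:
            continue  # this encoding cannot represent the data at all
        # Score: number of trigram occurrences that the reference also contains.
        score = sum(n for tri, n in trigrams(decoded).items() if tri in known)
        if score > best_score:
            best, best_score = enc, score
    return best


sample = "f\u00fcr \u00fcber sch\u00f6n m\u00fcde k\u00f6nnen h\u00f6ren"
data = "\u00fcber sch\u00f6n".encode("iso8859_1")
print(guess_encoding(data, ["iso8859_1", "mac_roman"], sample))
```

A real detector would need a much larger reference corpus and proper probabilities rather than raw overlap counts; the sketch only shows why a wrong decoding (umlauts turned into unrelated symbols) scores lower than the right one.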

Diez
Jan 5 '06 #5
Diez B. Roggisch wrote:
> AFAIK iso-8859-1 has all codepoints taken - so you won't go beyond that
> in your example.

IIRC the range 128-159 (i.e. control codes with the high bit set)
is unused.

Ralf
Jan 6 '06 #6
Ralf Muschall:
> Diez B. Roggisch wrote:
> > AFAIK iso-8859-1 has all codepoints taken - so you won't go beyond that
> > in your example.
>
> IIRC the range 128-159 (i.e. control codes with the high bit set)
> are unused.

ISO 8859-1 and ISO-8859-1 (extra hyphen) differ in that ISO-8859-1
includes the control codes in 128-159 (as well as the low control codes)
as defined by ISO 6429. ISO 6429 is not freely available online but the
equivalent ECMA standard ECMA 48 is:
http://www.ecma-international.org/pu...T/Ecma-048.pdf
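
As a quick check of how this plays out in Python, the sketch below (assuming Python 3 and only the standard library) decodes bytes 128-159 with the iso8859_1 codec and looks at what comes out. The codec maps each byte to the code point with the same number, so these bytes become the C1 control characters rather than being rejected.

```python
import unicodedata

# Bytes 128-159: the range under discussion.
c1_bytes = bytes(range(128, 160))
decoded = c1_bytes.decode("iso8859_1")

code_points = [ord(ch) for ch in decoded]
categories = {unicodedata.category(ch) for ch in decoded}

print(code_points[0], code_points[-1])
print(categories)
```

In other words, Python's codec behaves like the variant that includes the control codes, which is why it never raises on any input.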

Neil
Jan 6 '06 #7
