Determining the encoding of a text file

Hello!
How do I determine the encoding of a text file? That is,
given a text file, I want to know which encoding it is in:
UTF-8, UTF-16, Latin, etc. It would be very helpful if
you could tell me how to do this in Python on Linux, but
just the method is acceptable.
Thanks in advance!
Jul 18 '05 #1

rajorshi> How do I determine the encoding of a text file ? That is,
rajorshi> given a text file I want to know the encoding it is in UTF8 or
rajorshi> UTF16 or Latin etc. It would be very helpful if you could tell
rajorshi> me how to do this in python on Linux. But just the method is
rajorshi> acceptable.

In general this is not possible. You can guess using heuristics, but there is
no predefined file attribute that indicates a file's encoding.
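
To illustrate why heuristics are needed, here is a minimal sketch (the variable names are only for illustration) showing that the same bytes decode without error under several encodings, so a successful decode alone does not prove which encoding was intended:

```python
# Two bytes that are valid in more than one encoding.
data = b"\xc3\xa9"

as_utf8 = data.decode("utf-8")          # one character: e with acute accent
as_latin1 = data.decode("iso-8859-1")   # two characters: A with tilde, copyright sign
as_cp1252 = data.decode("cp1252")       # same two characters as iso-8859-1 here

# All three decodes succeed; only the reader can tell which text was meant.
print(repr(as_utf8), repr(as_latin1), repr(as_cp1252))
```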

If you have a small set of candidate encodings you can generally do a decent
job guessing the encoding of a string by considering them in order. I placed
an example on my Python Bits page: <http://www.musi-cal.com/~skip/python/>. I
don't claim it's perfect and it's really only concerned with distinguishing
utf-8 and a few encodings which are similar to iso-8859-1, but it does a
decent job for me given the types of inputs I see.
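
The "consider candidates in order" idea might be sketched roughly as follows. This is not the code from the page linked above; the function name `guess_encoding` and the default candidate list are assumptions made for illustration. The order matters: iso-8859-1 accepts every byte sequence, so it has to come last as a fallback.

```python
def guess_encoding(data, candidates=("ascii", "utf-8", "iso-8859-1")):
    """Return the first candidate that decodes data without error, or None.

    Hypothetical helper: the stricter encodings are tried first because
    iso-8859-1 never fails and would otherwise always win.
    """
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # not valid in this encoding, try the next one
        return name
    return None


print(guess_encoding(b"plain text"))
print(guess_encoding("caf\u00e9".encode("utf-8")))
print(guess_encoding(b"caf\xe9"))
```

A result of "iso-8859-1" here only means that nothing stricter matched; it is a guess, not a proof.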

Skip

Jul 18 '05 #2
In article <85*************************@posting.google.com>,
ra******@fastmail.fm (Rajorshi) wrote:
How do I determine the encoding of a text file ? That is,
given a text file I want to know the encoding it is in
UTF8 or UTF16 or Latin etc. It would be very helpful if
you could tell me how to do this in python on Linux. But
just the method is acceptable.


If the first byte in the file is 0xFE and the second is 0xFF, then it's
likely the file is encoded in big-endian UTF-16. If the first byte is
0xFF and the second is 0xFE, then it's likely to be little-endian UTF-16.

Once you've eliminated those possibilities, then it gets trickier...
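
A minimal sketch of that byte order mark check in Python, using the constants from the standard `codecs` module. The function name `sniff_bom` is made up for illustration, and the UTF-8 signature check is an addition beyond the two UTF-16 cases described above.

```python
import codecs


def sniff_bom(head):
    """Return an encoding name suggested by a leading byte order mark, or None.

    head is the first few bytes of the file. A match is only a strong hint:
    for example, a UTF-32 little-endian file also starts with 0xFF 0xFE.
    """
    if head.startswith(codecs.BOM_UTF8):      # EF BB BF
        return "utf-8-sig"
    if head.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "utf-16-be"
    if head.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "utf-16-le"
    return None


print(sniff_bom(b"\xfe\xff\x00h\x00i"))
print(sniff_bom(b"\xff\xfeh\x00i\x00"))
print(sniff_bom(b"no mark here"))
```

For a real file, one would open it in binary mode and pass the first few bytes, e.g. `sniff_bom(open(path, "rb").read(4))`. A result of None means the trickier guessing still has to be done.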

Dave
Jul 18 '05 #3

"Rajorshi" <ra******@fastmail.fm> wrote in message
news:85*************************@posting.google.com...
Hello!
How do I determine the encoding of a text file ? That is,
given a text file I want to know the encoding it is in
UTF8 or UTF16 or Latin etc. It would be very helpful if
you could tell me how to do this in python on Linux. But
just the method is acceptable.
Thanks in advance!


The Python integrated development environment IDLE, which is distributed
along with Python, shows one approach to decoding a
string. You can find it in the file $PYTHON/lib/idlelib/IOBinding.py; look for
decode().

It's not perfect, but you could combine it with Skip's example when writing
your own.
Additionally, if you want to guess a Chinese encoding, the Perl library
http://www.mandarintools.com/download/codelib.zip
may be a useful reference; it supports GB2312-80, Hz, Big5, UTF-8, etc.

J.R.
Jul 18 '05 #4
Thanks for your suggestions!
Jul 18 '05 #5
