"R. Rajesh Jeba Anbiah" <ng**********@rediffmail.com> wrote in message
news:ab*************************@posting.google.com...
Here is a nice piece of code that detects UTF-8:
<http://in2.php.net/utf8_encode#39986>. But I couldn't work out the
logic behind the script. If anyone knows it, please share.
In particular, I would like to detect other encodings too, so I would
like to understand the logic.
For example, this text is in the TSCII encoding (for Tamil):
Žì¸õ. þÐ ¾Á¢Æ¢ø ¯ûÇ Åâ. þ¨¾ ¯*¸Ç¡ø ÀÊì¸ ÓÊ󾡸, ¯*¸û ¯Ä¡Å¢Â¢ø ±ó¾
Å¢¾ Á¡üÈò¨¾Ôõ ¦ºöÂò §¾¨Å þø¨Ä. ¦¾¡¼÷óÐ þó¾ þ¨½Âò ¾Çò¾¢ø «¨Áó¾¢ÕìÌõ
¾Á¢úô Àì¸*¸¨Çò ¾¨¼Â¢ýÈ¢ô ÀÊì¸Ä¡õ.
Any ideas? TIA.
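UTF-8 has a strict structure: every byte above 0x7F must be part of a
multi-byte sequence made up of a lead byte followed by the right number of
continuation bytes (0x80 to 0xBF). A text that is not encoded in UTF-8 will
therefore usually contain many byte sequences that are invalid in UTF-8, and
that is what such a script tests for. Single-byte encodings like TSCII are
much more difficult to detect, because practically every byte sequence is
valid in them (even though it would not necessarily be a meaningful character
sequence for a human reader).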
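
To make the logic concrete, here is a minimal sketch of such a validity check. It is written in Python rather than PHP, the function name is_valid_utf8 is my own, and it is not the script from the php.net comment, only an illustration of the same idea:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data is a well-formed UTF-8 byte sequence.

    Hypothetical helper for illustration, not the php.net script.
    """
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        # The lead byte determines the length of the sequence.
        if lead < 0x80:
            length = 1                      # plain ASCII
        elif 0xC2 <= lead <= 0xDF:
            length = 2
        elif 0xE0 <= lead <= 0xEF:
            length = 3
        elif 0xF0 <= lead <= 0xF4:
            length = 4
        else:
            return False                    # stray continuation or illegal lead byte
        if i + length > n:
            return False                    # sequence is cut off at the end
        # Every following byte must look like 10xxxxxx.
        for j in range(1, length):
            if data[i + j] & 0xC0 != 0x80:
                return False
        if length > 1:
            second = data[i + 1]
            if lead == 0xE0 and second < 0xA0:
                return False                # overlong 3-byte form
            if lead == 0xED and second > 0x9F:
                return False                # UTF-16 surrogate range
            if lead == 0xF0 and second < 0x90:
                return False                # overlong 4-byte form
            if lead == 0xF4 and second > 0x8F:
                return False                # beyond U+10FFFF
        i += length
    return True
```

Note that pure ASCII text also passes this test, so a positive result only means "could be UTF-8". A negative result is the reliable one.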
When I have a text in an unknown encoding, I simply load it in an editor
that supports many encodings and try them out one by one until I find the
setting that makes the text readable. Writing a script that can detect
such an encoding automatically is very difficult.
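
The manual trial-and-error approach can at least be semi-automated. The following is a rough Python sketch under my own assumptions: the function name preview_decodings and the candidate list are hypothetical, and as far as I know Python's standard library has no TSCII codec, so for TSCII you would need a third-party codec or a conversion table. A human still has to judge which preview is readable:

```python
def preview_decodings(data: bytes, candidates: list, width: int = 60) -> dict:
    """Decode data with each candidate encoding and collect short previews.

    Encodings that reject the bytes (or that Python does not know) are skipped.
    Hypothetical helper for illustration only.
    """
    previews = {}
    for name in candidates:
        try:
            text = data.decode(name)        # strict decoding raises on invalid bytes
        except (UnicodeDecodeError, LookupError):
            continue
        previews[name] = text[:width]
    return previews


if __name__ == "__main__":
    sample = "Grüße".encode("latin-1")      # stand-in for a text of unknown encoding
    for name, preview in preview_decodings(
        sample, ["utf-8", "latin-1", "cp1252", "koi8_r"]
    ).items():
        print(f"{name:10} {preview}")
```

UTF-8 drops out here because the bytes are invalid, while the single-byte encodings all "succeed" and differ only in how readable the result is, which is exactly the problem described above.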
I would say: forget it. It is not worth the trouble.
Gerard van Wilgen
--
www.majstro.com (On-line translation dictionary / Enreta tradukvortaro)
www.travlang.com/Ergane (Free translation dictionary for Windows / Senpaga
tradukvortaro por Windows)