Howto: Detect encodings?

R. Rajesh Jeba Anbiah

Here is a nice piece of code to detect UTF-8:
<http://in2.php.net/utf8_encode#39986>. But I couldn't work out the
logic behind the script. If anyone knows it, please share.

In particular, I would like to detect other encodings too, so I would
like to know the logic.

For example, this text is in TSCII encoding (for Tamil):
Å½ì¸õ. þÐ ¾Á¢Æ¢ø ¯ûÇ ÅÃ¢. þ¨¾ ¯í¸Ç¡ø ÀÊì¸ ÓÊó¾¡ø, ¯í¸û ¯Ä¡Å¢Â¢ø ±ó¾
Å¢¾ Á¡üÈò¨¾Ôõ ¦ºöÂò §¾¨Å þø¨Ä. ¦¾¡¼÷óÐ þó¾ þ¨½Âò ¾Çò¾¢ø «¨Áó¾¢ÕìÌõ
¾Á¢úô Àì¸í¸¨Çò ¾¨¼Â¢ýÈ¢ô ÀÊì¸Ä¡õ.

Any ideas? TIA.

--
| Just another PHP saint |
Email: rrjanbiah-at-Y!com

Jul 17 '05 #1


Chung Leong

"R. Rajesh Jeba Anbiah" <ng**********@rediffmail.com> wrote in message
news:ab*************************@posting.google.com...

> Here is a nice piece of code to detect UTF-8:
> <http://in2.php.net/utf8_encode#39986>. But I couldn't work out the
> logic behind the script. If anyone knows it, please share.

The code tries to decode the text as UTF-8. If it runs into an error, then
the text is not UTF-8.
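
The decode-and-watch-for-errors logic described above can be sketched in a
few lines. The example is in Python rather than PHP, purely for illustration,
and is_utf8 is a made-up helper name, not the function from the linked
php.net comment. It relies on Python's strict UTF-8 decoder raising an error
on the first malformed byte sequence.

```python
def is_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8, False otherwise."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


# The first word of the TSCII sample, taken as the raw bytes that the
# characters "Å½ì¸õ" stand for when read as Latin-1 (an assumption about
# how the sample was posted).
tscii_like = "Å½ì¸õ".encode("latin-1")

# Genuine UTF-8 for comparison: the Tamil word for "Tamil".
real_utf8 = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd".encode("utf-8")

sample_is_utf8 = is_utf8(tscii_like)
tamil_is_utf8 = is_utf8(real_utf8)
```

Plain ASCII also passes this check, since ASCII is a subset of UTF-8, so a
True result only says "could be UTF-8", not "is definitely UTF-8".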

> For example, this text is in TSCII encoding (for Tamil):
> Å½ì¸õ. þÐ ¾Á¢Æ¢ø ¯ûÇ ÅÃ¢. þ¨¾ ¯í¸Ç¡ø ÀÊì¸ ÓÊó¾¡ø, ¯í¸û ¯Ä¡Å¢Â¢ø ±ó¾
> Å¢¾ Á¡üÈò¨¾Ôõ ¦ºöÂò §¾¨Å þø¨Ä. ¦¾¡¼÷óÐ þó¾ þ¨½Âò ¾Çò¾¢ø «¨Áó¾¢ÕìÌõ
> ¾Á¢úô Àì¸í¸¨Çò ¾¨¼Â¢ýÈ¢ô ÀÊì¸Ä¡õ.

There is no easy way to do it. The question is: are you trying to distinguish
between the different possible ways of encoding Tamil, or to identify TSCII
among all possible encodings?

Jul 17 '05 #2

Gerard van Wilgen

"R. Rajesh Jeba Anbiah" <ng**********@rediffmail.com> wrote in message
news:ab*************************@posting.google.com...

> Here is a nice piece of code to detect UTF-8:
> <http://in2.php.net/utf8_encode#39986>. But I couldn't work out the
> logic behind the script. If anyone knows it, please share.
>
> In particular, I would like to detect other encodings too, so I would
> like to know the logic.
>
> For example, this text is in TSCII encoding (for Tamil):
> Å½ì¸õ. þÐ ¾Á¢Æ¢ø ¯ûÇ ÅÃ¢. þ¨¾ ¯í¸Ç¡ø ÀÊì¸ ÓÊó¾¡ø, ¯í¸û ¯Ä¡Å¢Â¢ø ±ó¾
> Å¢¾ Á¡üÈò¨¾Ôõ ¦ºöÂò §¾¨Å þø¨Ä. ¦¾¡¼÷óÐ þó¾ þ¨½Âò ¾Çò¾¢ø «¨Áó¾¢ÕìÌõ
> ¾Á¢úô Àì¸í¸¨Çò ¾¨¼Â¢ýÈ¢ô ÀÊì¸Ä¡õ.
>
> Any ideas? TIA.

A text that is not encoded in UTF-8 will usually contain many byte sequences
that are invalid in UTF-8. Encodings like TSCII are much more difficult to
detect, because in such 8-bit encodings virtually every byte sequence is
valid (even though it would not necessarily be a meaningful character
sequence for a human reader).
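
The asymmetry described above can be checked directly. The sketch below is
in Python and uses Latin-1 as a stand-in for TSCII, on the assumption that
the standard library ships no TSCII codec; Latin-1 is likewise an 8-bit
encoding that assigns a character to every byte value. The helper name
decodes_as is made up for this example.

```python
def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if data can be decoded with the given codec."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


# Every single byte with the high bit set, each tested on its own.
high_bytes = [bytes([value]) for value in range(0x80, 0x100)]

# A lone high byte is never well-formed UTF-8, but always valid Latin-1.
valid_as_utf8 = sum(decodes_as(b, "utf-8") for b in high_bytes)
valid_as_latin1 = sum(decodes_as(b, "latin-1") for b in high_bytes)
```

A failed decode therefore rules UTF-8 out, while a successful decode under
an 8-bit codec proves nothing about whether the guess was right.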

When I have a text with an unknown encoding I simply load it in an editor
that supports many encodings, and then try them out until I have found the
setting that causes the text to become readable. Writing a script that can
detect the encoding is obviously very difficult.

I should say, forget it. It is not worth the trouble.

Gerard van Wilgen
--
www.majstro.com (On-line translation dictionary / Enreta tradukvortaro)
www.travlang.com/Ergane (Free translation dictionary for Windows / Senpaga
tradukvortaro por Windows)

Jul 17 '05 #3

R. Rajesh Jeba Anbiah

"Chung Leong" <ch***********@hotmail.com> wrote in message news:<SM********************@comcast.com>...

"R. Rajesh Jeba Anbiah" <ng**********@rediffmail.com> wrote in message
news:ab*************************@posting.google.com...
> > Here is a nice piece of code to detect UTF-8:
> > <http://in2.php.net/utf8_encode#39986>. But I couldn't work out the
> > logic behind the script. If anyone knows it, please share.
>
> The code tries to decode the text as UTF-8. If it runs into an error, then
> the text is not UTF-8.

Thanks for the info/logic. Though I'm a bit aware of Unicode, this is
the first time I'm getting my hands on it... It's a bit of a pain, as
PHP's Unicode support is broken and strange...

> > For example, this text is in TSCII encoding (for Tamil):
> > Å½ì¸õ. þÐ ¾Á¢Æ¢ø ¯ûÇ ÅÃ¢. þ¨¾ ¯í¸Ç¡ø ÀÊì¸ ÓÊó¾¡ø, ¯í¸û ¯Ä¡Å¢Â¢ø ±ó¾
> > Å¢¾ Á¡üÈò¨¾Ôõ ¦ºöÂò §¾¨Å þø¨Ä. ¦¾¡¼÷óÐ þó¾ þ¨½Âò ¾Çò¾¢ø «¨Áó¾¢ÕìÌõ
> > ¾Á¢úô Àì¸í¸¨Çò ¾¨¼Â¢ýÈ¢ô ÀÊì¸Ä¡õ.
>
> There is no easy way to do it. The question is: are you trying to distinguish
> between the different possible ways of encoding Tamil, or to identify TSCII
> among all possible encodings?

I'd be interested in trying both. Are you hinting that at least one of them
is easier? Thanks.

--
| Just another PHP saint |
Email: rrjanbiah-at-Y!com

Jul 17 '05 #4

R. Rajesh Jeba Anbiah

"Gerard van Wilgen" <gv********@planet.nl> wrote in message news:<ca**********@reader08.wxs.nl>...

"R. Rajesh Jeba Anbiah" <ng**********@rediffmail.com> wrote in message
news:ab*************************@posting.google.com... <snip>
> A text that is not encoded in UTF-8 will usually contain many byte sequences
> that are invalid in UTF-8. Encodings like TSCII are much more difficult to
> detect, because in such 8-bit encodings virtually every byte sequence is
> valid (even though it would not necessarily be a meaningful character
> sequence for a human reader).
>
> When I have a text with an unknown encoding I simply load it in an editor
> that supports many encodings, and then try them out until I have found the
> setting that causes the text to become readable.

Yes, I understand what you mean. Only a human can identify it
clearly...

> Writing a script that can
> detect the encoding is obviously very difficult.
>
> I should say, forget it. It is not worth the trouble.

This tool <http://www.murasu.com/converter/> can auto-detect
encodings. So I think it is still possible?

--
| Just another PHP saint |
Email: rrjanbiah-at-Y!com

Jul 17 '05 #5

Chung Leong

"R. Rajesh Jeba Anbiah" <ng**********@rediffmail.com> wrote in message
news:ab**************************@posting.google.com...

> Thanks for the info/logic. Though I'm a bit aware of Unicode, this is
> the first time I'm getting my hands on it... It's a bit of a pain, as
> PHP's Unicode support is broken and strange...

Yeah, Unicode support in PHP is practically non-existent. You can still get
by, though. More recent versions of PHP support Unicode code points in
character classes in regular expressions (with the u modifier), so you can
do things like
/([\x{0900}-\x{09FF}]+)/u.
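
For comparison, here is a minimal sketch of the same kind of character-class
match in Python's re module (Python is used only for illustration; the
function name find_indic_runs is made up). The range is the one from the
pattern above, U+0900 to U+09FF, which spans the Devanagari and Bengali
blocks.

```python
import re

# Same code point range as the pattern above, written with \u escapes,
# which Python's re module understands inside character classes.
INDIC_RUN = re.compile(r"[\u0900-\u09FF]+")


def find_indic_runs(text: str) -> list:
    """Return every maximal run of characters in U+0900..U+09FF."""
    return INDIC_RUN.findall(text)


# "Hindi" written in Devanagari, surrounded by ASCII text.
hindi_word = "\u0939\u093f\u0928\u094d\u0926\u0940"
runs = find_indic_runs("abc " + hindi_word + " xyz")
```

This works on already-decoded text, so it only helps once the encoding is
known; it can confirm that a decoded string contains the expected script.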

UTF-8 is in general rather tricky to work with. For example, you can't limit
the length of text entered by users using just the maxlength attribute in
HTML, since it counts characters rather than bytes. And when a database width
constraint chops off some UTF-8 text in mid-character, all sorts of funky
things happen in the browser.
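
One way to avoid the mid-character chop is to trim the bytes yourself before
they reach the database. The following is a minimal Python sketch under the
assumptions that the input is already valid UTF-8 and that max_bytes is not
negative; truncate_utf8 is a made-up name. It backs up from the cut point
while the next byte is a continuation byte (binary 10xxxxxx), so a
multi-byte character is either kept whole or dropped whole.

```python
def truncate_utf8(data: bytes, max_bytes: int) -> bytes:
    """Cut valid UTF-8 data to at most max_bytes without splitting a character."""
    if len(data) <= max_bytes:
        return data
    cut = max_bytes
    # If the first dropped byte is a continuation byte, the cut falls inside
    # a character: move left until the cut sits on a character boundary.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut]


# Two Devanagari characters, three bytes each in UTF-8.
two_chars = "\u0939\u093f".encode("utf-8")
safe = truncate_utf8(two_chars, 4)  # keeps only the first character
```

The result always decodes cleanly, at the cost of sometimes returning a few
bytes fewer than the limit allows.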

My advice is not to use Unicode unless you have to. I am not familiar with
the Tamil script, but I have done a lot of work with Hindi. Most Hindi
websites do not use Unicode (e.g. www.webdunia.com), because Unicode Hindi
text requires rendering support from the operating system, which essentially
limits you to Windows/IE only.

> I'd be interested in trying both. Are you hinting that at least one of them
> is easier? Thanks.

Choosing one encoding out of three is obviously easier than choosing one out
of several hundred. As far as I know, the only foolproof way is to run a
spell check on the text. Statistical analysis could also work: just count
how often each letter occurs and compare that to a known profile for
that language.
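
The counting idea above might look roughly like this in Python. Everything
here is an assumption made for illustration: the function names are made up,
the "profile" is built from a single Russian sentence rather than a real
corpus, and the candidates are cp1251 and koi8_r, chosen only because Python
ships codecs for them (it does not, as far as I know, ship one for TSCII).
Each candidate decoding is scored by how much its letter distribution
overlaps the reference profile.

```python
from collections import Counter


def letter_profile(text: str) -> dict:
    """Relative frequency of each letter in text (non-letters ignored)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return {}
    total = len(letters)
    return {ch: n / total for ch, n in Counter(letters).items()}


def overlap(observed: dict, reference: dict) -> float:
    """Shared probability mass of two profiles, between 0.0 and 1.0."""
    return sum(min(p, reference.get(ch, 0.0)) for ch, p in observed.items())


def guess_encoding(data: bytes, candidates: list, reference: dict):
    """Return the candidate whose decoding best matches the reference profile."""
    best_name, best_score = None, -1.0
    for name in candidates:
        try:
            text = data.decode(name)
        except UnicodeDecodeError:
            continue  # this candidate cannot be right
        score = overlap(letter_profile(text), reference)
        if score > best_score:
            best_name, best_score = name, score
    return best_name


# Toy reference text; a real profile needs a large sample of the language.
REFERENCE = letter_profile("съешь ещё этих мягких французских булок да выпей чаю")
CANDIDATES = ["utf-8", "koi8_r", "cp1251"]
guess = guess_encoding("выпей чаю".encode("cp1251"), CANDIDATES, REFERENCE)
```

With only a handful of candidates and a reasonably long sample this kind of
scoring can separate them; with very short input the scores get too noisy to
trust.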

Jul 17 '05 #6

R. Rajesh Jeba Anbiah

"Chung Leong" <ch***********@hotmail.com> wrote in message news:<Q-********************@comcast.com>...

"R. Rajesh Jeba Anbiah" <ng**********@rediffmail.com> wrote in message
news:ab**************************@posting.google.com...
> > Thanks for the info/logic. Though I'm a bit aware of Unicode, this is
> > the first time I'm getting my hands on it... It's a bit of a pain, as
> > PHP's Unicode support is broken and strange...
>
> Yeah, Unicode support in PHP is practically non-existent. You can still get
> by, though. More recent versions of PHP support Unicode code points in
> character classes in regular expressions (with the u modifier), so you can
> do things like
> /([\x{0900}-\x{09FF}]+)/u.
>
> UTF-8 is in general rather tricky to work with. For example, you can't limit
> the length of text entered by users using just the maxlength attribute in
> HTML, since it counts characters rather than bytes. And when a database width
> constraint chops off some UTF-8 text in mid-character, all sorts of funky
> things happen in the browser.

Thanks a lot for your comments and help. As you said, UTF-8 behaves
rather strangely; if we include UTF-8 text from other files, it works
differently than expected. Anyway, we can somehow get it to work.

> My advice is not to use Unicode unless you have to. I am not familiar with
> the Tamil script, but I have done a lot of work with Hindi. Most Hindi
> websites do not use Unicode (e.g. www.webdunia.com), because Unicode Hindi
> text requires rendering support from the operating system, which essentially
> limits you to Windows/IE only.

Yeah, I understand. But for Tamil, staying away from Unicode may not
help much, as many people are moving towards it. The reason is probably
that many people here use only Windows/IE. <OT>BTW, is www.webdunia.com
your work? :-)</OT>

> > I'd be interested in trying both. Are you hinting that at least one of
> > them is easier? Thanks.
>
> Choosing one encoding out of three is obviously easier than choosing one out
> of several hundred. As far as I know, the only foolproof way is to run a
> spell check on the text. Statistical analysis could also work: just count
> how often each letter occurs and compare that to a known profile for
> that language.

In Tamil, some characters never start a word (unless someone has made
a typo). I'd thought of using such grammar rules if there is no
direct way to detect the encoding. Thanks a lot for your help.
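
A rough Python sketch of that word-start check follows. It does not encode
the actual Tamil grammar rules; as a stand-in it uses one generic property
of Unicode text, namely that a word should not begin with a combining mark
such as a Tamil vowel sign. The function name is made up. A candidate
decoding that produces many such word starts is probably the wrong one.

```python
import unicodedata


def suspicious_word_starts(text: str) -> int:
    """Count whitespace-separated words whose first character is a combining mark."""
    return sum(
        1
        for word in text.split()
        if unicodedata.category(word[0]).startswith("M")
    )


# The Tamil word for "Tamil": starts with a consonant letter, so it is fine.
good = suspicious_word_starts("\u0ba4\u0bae\u0bbf\u0bb4\u0bcd")

# Two made-up words; the first starts with the vowel sign U+0BBF.
bad = suspicious_word_starts("\u0bbf\u0ba4 \u0ba4\u0bae")
```

A real detector could replace the category test with a list of letters that
Tamil grammar forbids at the start of a word and compare the counts across
the candidate encodings.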

--
| Just another PHP saint |
Email: rrjanbiah-at-Y!com

Jul 17 '05 #7

Chung Leong

"R. Rajesh Jeba Anbiah" <ng**********@rediffmail.com> wrote in message
news:ab**************************@posting.google.com...

> Yeah, I understand. But for Tamil, staying away from Unicode may not
> help much, as many people are moving towards it. The reason is probably
> that many people here use only Windows/IE. <OT>BTW, is www.webdunia.com
> your work? :-)</OT>

If that's the case, then Unicode is definitely the preferred route. With
Hindi, for some reason a lot of people are still using hack encodings. Just
the other day I had to re-type a whole bunch of stuff, and I don't know a
word of Hindi. I don't work for webdunia.com, but my project makes use of
their content. Let me tell you, converting their custom encoding into Unicode
was quite a challenge.

Jul 17 '05 #8

R. Rajesh Jeba Anbiah

"Chung Leong" <ch***********@hotmail.com> wrote in message news:<X9********************@comcast.com>...

"R. Rajesh Jeba Anbiah" <ng**********@rediffmail.com> wrote in message
news:ab**************************@posting.google.com...
> > Yeah, I understand. But for Tamil, staying away from Unicode may not
> > help much, as many people are moving towards it. The reason is probably
> > that many people here use only Windows/IE. <OT>BTW, is www.webdunia.com
> > your work? :-)</OT>
>
> If that's the case, then Unicode is definitely the preferred route. With
> Hindi, for some reason a lot of people are still using hack encodings. Just
> the other day I had to re-type a whole bunch of stuff, and I don't know a
> word of Hindi. I don't work for webdunia.com, but my project makes use of
> their content. Let me tell you, converting their custom encoding into Unicode
> was quite a challenge.

Thanks, Chung, for your comments and help. AFAIK, webdunia uses the Dunia
encoding; here is a Dunia-to-Unicode map
<http://crl.nmsu.edu/~mleisher/naicode.html> and a Perl script
<http://crl.nmsu.edu/~mleisher/nai2ucs.pl> (in case you want them).
Thanks.

--
| Just another PHP saint |
Email: rrjanbiah-at-Y!com

Jul 17 '05 #9
