Howto: Detect encodings?

Here is some nice code to detect UTF-8:
<http://in2.php.net/utf8_encode#39986>. But I couldn't figure out the
logic behind the script. If anyone knows it, please share.

In particular, I would like to detect other encodings too, so I would
like to know the logic.

For example these texts are in TSCII encoding (for Tamil):
Žì¸õ. þÐ ¾Á¢Æ¢ø ¯ûÇ Åâ. þ¨¾ ¯*¸Ç¡ø ÀÊì¸ ÓÊó¾¡ø, ¯*¸û ¯Ä¡Å¢Â¢ø ±ó¾
Å¢¾ Á¡üÈò¨¾Ôõ ¦ºöÂò §¾¨Å þø¨Ä. ¦¾¡¼÷óÐ þó¾ þ¨½Âò ¾Çò¾¢ø «¨Áó¾¢ÕìÌõ
¾Á¢úô Àì¸*¸¨Çò ¾¨¼Â¢ýÈ¢ô ÀÊì¸Ä¡õ.

Any ideas? TIA.

--
| Just another PHP saint |
Email: rrjanbiah-at-Y!com
Jul 17 '05 #1
8 Replies


"R. Rajesh Jeba Anbiah" <ng**********@rediffmail.com> wrote in message
news:ab*************************@posting.google.com...
> Here is some nice code to detect UTF-8:
> <http://in2.php.net/utf8_encode#39986>. But I couldn't figure out the
> logic behind the script. If anyone knows it, please share.

The code tries to decode the text as UTF-8. If it runs into an error, then
the text is not UTF-8.
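
A minimal sketch of that trial-decode idea, written in Python rather than PHP for brevity. The function name `looks_like_utf8` and the sample bytes are illustrative assumptions, not taken from the linked script:

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if `data` decodes cleanly as UTF-8."""
    try:
        # A strict decode raises on the first invalid byte sequence.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# Tamil text that really is UTF-8.
utf8_sample = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd".encode("utf-8")
# Arbitrary high bytes, as an 8-bit legacy encoding might produce (hypothetical sample).
legacy_sample = bytes([0xAB, 0xBE, 0xD0])

print(looks_like_utf8(utf8_sample))
print(looks_like_utf8(legacy_sample))
```

Note that pure ASCII text also passes this check, since ASCII is a subset of UTF-8, so a positive result only says the bytes are consistent with UTF-8.
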
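> For example these texts are in TSCII encoding (for Tamil):
> Žì¸õ. þÐ ¾Á¢Æ¢ø ¯ûÇ Åâ. þ¨¾ ¯*¸Ç¡ø ÀÊì¸ ÓÊó¾¡ø, ¯*¸û ¯Ä¡Å¢Â¢ø ±ó¾
> Å¢¾ Á¡üÈò¨¾Ôõ ¦ºöÂò §¾¨Å þø¨Ä. ¦¾¡¼÷óÐ þó¾ þ¨½Âò ¾Çò¾¢ø «¨Áó¾¢ÕìÌõ
> ¾Á¢úô Àì¸*¸¨Çò ¾¨¼Â¢ýÈ¢ô ÀÊì¸Ä¡õ.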


No easy way to do it. The question is, are you trying to distinguish between
different possible ways of encoding Tamil or identify TSCII from all
possible encodings?
Jul 17 '05 #2

"R. Rajesh Jeba Anbiah" <ng**********@rediffmail.com> wrote in message
news:ab*************************@posting.google.com...
> Here is some nice code to detect UTF-8:
> <http://in2.php.net/utf8_encode#39986>. But I couldn't figure out the
> logic behind the script. If anyone knows it, please share.
>
> In particular, I would like to detect other encodings too, so I would
> like to know the logic.
>
> For example these texts are in TSCII encoding (for Tamil):
> Žì¸õ. þÐ ¾Á¢Æ¢ø ¯ûÇ Åâ. þ¨¾ ¯*¸Ç¡ø ÀÊì¸ ÓÊó¾¡ø, ¯*¸û ¯Ä¡Å¢Â¢ø ±ó¾
> Å¢¾ Á¡üÈò¨¾Ôõ ¦ºöÂò §¾¨Å þø¨Ä. ¦¾¡¼÷óÐ þó¾ þ¨½Âò ¾Çò¾¢ø «¨Áó¾¢ÕìÌõ
> ¾Á¢úô Àì¸*¸¨Çò ¾¨¼Â¢ýÈ¢ô ÀÊì¸Ä¡õ.
>
> Any ideas? TIA.


A text that is not encoded in utf-8 will usually contain many byte sequences
that are invalid in utf-8. Encodings like TSCII are much more difficult to
detect, because every possible byte sequence would be valid (even though it
would not necessarily be a meaningful character sequence for a human
reader).
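
To make the contrast concrete, here is a small Python sketch. The helper name `accepting_encodings` and the candidate list are assumptions for illustration; Latin-1 and CP437 stand in for TSCII, since Python's standard library does not, as far as I know, include a TSCII codec. It feeds every possible byte value to each decoder and reports which ones accept the input without error:

```python
def accepting_encodings(data: bytes, candidates):
    """Return the candidate encodings that decode `data` without error."""
    accepted = []
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # this encoding rejects the byte sequence
        accepted.append(name)
    return accepted


# All 256 byte values in a row: a worst case for any strict decoder.
every_byte = bytes(range(256))
print(accepting_encodings(every_byte, ["utf-8", "latin-1", "cp437"]))
```

UTF-8 rejects the input, while a single-byte encoding that assigns a character to every byte value accepts anything, so a successful decode tells you nothing about whether the guess was right.
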

When I have a text with an unknown encoding I simply load it in an editor
that supports many encodings, and then try them out until I have found the
setting that causes the text to become readable. Writing a script that can
detect the encoding is obviously very difficult.

I should say, forget it. It is not worth the trouble.

Gerard van Wilgen
--
www.majstro.com (On-line translation dictionary / Enreta tradukvortaro)
www.travlang.com/Ergane (Free translation dictionary for Windows / Senpaga
tradukvortaro por Windows)

Jul 17 '05 #3

"Chung Leong" <ch***********@hotmail.com> wrote in message news:<SM********************@comcast.com>...
> "R. Rajesh Jeba Anbiah" <ng**********@rediffmail.com> wrote in message
> news:ab*************************@posting.google.com...
>> Here is some nice code to detect UTF-8:
>> <http://in2.php.net/utf8_encode#39986>. But I couldn't figure out the
>> logic behind the script. If anyone knows it, please share.
>
> The code tries to decode the text as UTF-8. If it runs into an error, then
> the text is not UTF-8.

Thanks for the info/logic. Though I'm a bit familiar with Unicode, this is
the first time I'm getting my hands on it... It's kind of a pain, as
PHP's Unicode support is broken and strange...

>> For example these texts are in TSCII encoding (for Tamil):
>> Žì¸õ. þÐ ¾Á¢Æ¢ø ¯ûÇ Åâ. þ¨¾ ¯*¸Ç¡ø ÀÊì¸ ÓÊó¾¡ø, ¯*¸û ¯Ä¡Å¢Â¢ø ±ó¾
>> Å¢¾ Á¡üÈò¨¾Ôõ ¦ºöÂò §¾¨Å þø¨Ä. ¦¾¡¼÷óÐ þó¾ þ¨½Âò ¾Çò¾¢ø «¨Áó¾¢ÕìÌõ
>> ¾Á¢úô Àì¸*¸¨Çò ¾¨¼Â¢ýÈ¢ô ÀÊì¸Ä¡õ.
>
> No easy way to do it. The question is, are you trying to distinguish between
> different possible ways of encoding Tamil or identify TSCII from all
> possible encodings?

I'd be interested in trying both. Are you hinting that at least one is
easier? Thanks.

--
| Just another PHP saint |
Email: rrjanbiah-at-Y!com
Jul 17 '05 #4

"Gerard van Wilgen" <gv********@planet.nl> wrote in message news:<ca**********@reader08.wxs.nl>...
> "R. Rajesh Jeba Anbiah" <ng**********@rediffmail.com> wrote in message
> news:ab*************************@posting.google.com... <snip>
> A text that is not encoded in utf-8 will usually contain many byte sequences
> that are invalid in utf-8. Encodings like TSCII are much more difficult to
> detect, because every possible byte sequence would be valid (even though it
> would not necessarily be a meaningful character sequence for a human
> reader).
>
> When I have a text with an unknown encoding I simply load it in an editor
> that supports many encodings, and then try them out until I have found the
> setting that causes the text to become readable.

Yes, I understand what you mean. Only a human can identify it
clearly...

> Writing a script that can
> detect the encoding is obviously very difficult.
>
> I should say, forget it. It is not worth the trouble.

This tool <http://www.murasu.com/converter/> can auto-detect the
encoding. So I think it is still possible?

--
| Just another PHP saint |
Email: rrjanbiah-at-Y!com
Jul 17 '05 #5

"R. Rajesh Jeba Anbiah" <ng**********@rediffmail.com> wrote in message
news:ab**************************@posting.google.com...
> Thanks for the info/logic. Though I'm a bit familiar with Unicode, this is
> the first time I'm getting my hands on it... It's kind of a pain, as
> PHP's Unicode support is broken and strange...

Yeah, Unicode support in PHP is practically non-existent. You can still get
by, though. More recent versions of PHP support Unicode code points in
regular-expression character classes (with the u modifier), so you can do
things like /([\x{0900}-\x{09FF}]+)/u.
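
For comparison, a rough Python equivalent of that pattern. Python's `re` module writes the escapes as `\u0900` rather than `\x{0900}`; the function name and sample string are made up for illustration. The range U+0900 to U+09FF spans the Devanagari and Bengali blocks:

```python
import re

# Matches runs of code points in U+0900..U+09FF (Devanagari and Bengali blocks).
INDIC_RUN = re.compile(r"[\u0900-\u09FF]+")


def find_indic_runs(text: str):
    """Return every maximal run of characters in the U+0900..U+09FF range."""
    return INDIC_RUN.findall(text)


# "Hindi" written in Devanagari, embedded in ASCII text.
sample = "Hello \u0939\u093f\u0928\u094d\u0926\u0940 world"
print(find_indic_runs(sample))
```
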

UTF-8 is in general rather tricky to work with. For example, you can't limit
the length of text entered by users using just the maxlength attribute in
HTML, because it counts characters rather than bytes. And when a database
width constraint chops off some UTF-8 text in mid-character, all sorts of
funky things happen in the browser.
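
One way to avoid the mid-character cut is to truncate on a byte budget and then drop any incomplete trailing sequence. A minimal Python sketch, assuming the limit is expressed in bytes (the function name `truncate_utf8` is hypothetical):

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Shorten `text` so its UTF-8 encoding fits in `max_bytes` bytes,
    without leaving a partial multi-byte character at the end."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # Cutting valid UTF-8 can leave at most one incomplete sequence at the
    # end; errors="ignore" silently drops those trailing bytes.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


# Each Devanagari character below takes 3 bytes in UTF-8.
word = "\u0939\u093f\u0928\u094d\u0926\u0940"
short = truncate_utf8(word, 4)
print(len(short), len(short.encode("utf-8")))
```

With a 4-byte budget only the first 3-byte character survives; a naive `encoded[:4]` would have kept one stray byte of the second character.
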

My advice is not to use Unicode unless you have to. I am not familiar with
the Tamil script, but I have done a lot of work with Hindi. Most Hindi
websites do not use Unicode (e.g. www.webdunia.com), because Unicode Hindi
text requires rendering support from the operating system, which essentially
limits you to Windows/IE only.

> I'd be interested in trying both. Are you hinting that at least one is
> easier? Thanks.


Choosing one encoding out of three is obviously easier than choosing one out
of several hundred. As far as I know, the only foolproof way is to run a
spell check on the text. Statistical analysis could also work: just count
how often each letter occurs and compare that to a known profile for
the language.
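
A rough Python sketch of the statistical approach, under stated assumptions: the function names, the distance measure (sum of absolute frequency differences), and the tiny French reference text are all illustrative choices, and Latin-1/CP437 stand in for the Tamil encodings because Python's standard library has no TSCII codec as far as I know. A real profile would be built from a large corpus of the target language.

```python
from collections import Counter


def char_profile(text: str) -> dict:
    """Relative frequency of each non-whitespace character in `text`."""
    counts = Counter(ch for ch in text if not ch.isspace())
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()} if total else {}


def profile_distance(a: dict, b: dict) -> float:
    """Sum of absolute frequency differences; 0.0 means identical profiles."""
    return sum(abs(a.get(k, 0.0) - b.get(k, 0.0)) for k in set(a) | set(b))


def guess_encoding(data: bytes, candidates, reference: dict):
    """Decode `data` with each candidate and return the one whose character
    profile is closest to `reference`, or None if none of them decodes."""
    best = None
    for name in candidates:
        try:
            text = data.decode(name)
        except UnicodeDecodeError:
            continue
        distance = profile_distance(char_profile(text), reference)
        if best is None or distance < best[1]:
            best = (name, distance)
    return best[0] if best else None


# Toy reference profile (hypothetical; far too small for real use).
reference = char_profile("le caf\u00e9 d\u00e9j\u00e0 vu, \u00e9t\u00e9 \u00e0 la plage")
data = "d\u00e9j\u00e0 vu caf\u00e9".encode("latin-1")
print(guess_encoding(data, ["utf-8", "latin-1", "cp437"], reference))
```

The wrong single-byte decoding still succeeds, but it produces characters that never occur in the reference profile, which is what pushes its distance up.
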
Jul 17 '05 #6

"Chung Leong" <ch***********@hotmail.com> wrote in message news:<Q-********************@comcast.com>...
> "R. Rajesh Jeba Anbiah" <ng**********@rediffmail.com> wrote in message
> news:ab**************************@posting.google.com...
>> Thanks for the info/logic. Though I'm a bit familiar with Unicode, this is
>> the first time I'm getting my hands on it... It's kind of a pain, as
>> PHP's Unicode support is broken and strange...
>
> Yeah, Unicode support in PHP is practically non-existent. You can still get
> by, though. More recent versions of PHP support Unicode code points in
> regular-expression character classes (with the u modifier), so you can do
> things like /([\x{0900}-\x{09FF}]+)/u.
>
> UTF-8 is in general rather tricky to work with. For example, you can't limit
> the length of text entered by users using just the maxlength attribute in
> HTML, because it counts characters rather than bytes. And when a database
> width constraint chops off some UTF-8 text in mid-character, all sorts of
> funky things happen in the browser.

Thanks a lot for your comments and help. As you said, UTF-8 behaves
quite strangely; if we include UTF-8 text from other files, it works
differently than expected. Anyway, we can somehow get it to work.

> My advice is not to use Unicode unless you have to. I am not familiar with
> the Tamil script, but I have done a lot of work with Hindi. Most Hindi
> websites do not use Unicode (e.g. www.webdunia.com), because Unicode Hindi
> text requires rendering support from the operating system, which essentially
> limits you to Windows/IE only.

Yeah, I understand. But for Tamil, staying away from Unicode may not
help much, as many people are moving towards it. The reason is probably
that many people here use only Windows/IE. <OT>BTW, is www.webdunia.com
your work? :-)</OT>

>> I'd be interested in trying both. Are you hinting that at least one is
>> easier? Thanks.
>
> Choosing one encoding out of three is obviously easier than choosing one out
> of several hundred. As far as I know, the only foolproof way is to run a
> spell check on the text. Statistical analysis could also work: just count
> how often each letter occurs and compare that to a known profile for
> the language.


In Tamil, some characters never start a word (unless someone made
a typo). I had thought of using such grammar rules if there is no
direct way to detect the encoding. Thanks a lot for your help.
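
A minimal Python sketch of that kind of grammar check, with a loud assumption: it treats Unicode combining marks (general category M, which includes the Tamil vowel signs and the pulli) as the characters that may not begin a word. The exact set of disallowed word-initial characters is not given here, so this is only a stand-in, and the function name is hypothetical. It would be applied to each candidate decoding, preferring the one with the fewest violations.

```python
import unicodedata


def suspicious_word_starts(text: str) -> int:
    """Count words whose first character is a combining mark.

    Assumption: a dependent vowel sign or virama cannot start a word,
    so each such word hints that the text was decoded with the wrong
    encoding (or contains a typo).
    """
    return sum(
        1
        for word in text.split()
        if unicodedata.category(word[0]).startswith("M")
    )


# Well-formed Tamil words ("Tamil language").
plausible = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd \u0bae\u0bca\u0bb4\u0bbf"
# Two made-up "words" that begin with a vowel sign and a pulli.
implausible = "\u0bbf\u0ba4\u0bae \u0bcd\u0bb4"

print(suspicious_word_starts(plausible), suspicious_word_starts(implausible))
```
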

--
| Just another PHP saint |
Email: rrjanbiah-at-Y!com
Jul 17 '05 #7

"R. Rajesh Jeba Anbiah" <ng**********@rediffmail.com> wrote in message
news:ab**************************@posting.google.com...
> Yeah, I understand. But for Tamil, staying away from Unicode may not
> help much, as many people are moving towards it. The reason is probably
> that many people here use only Windows/IE. <OT>BTW, is www.webdunia.com
> your work? :-)</OT>

If that's the case then Unicode is definitely the preferred route. With
Hindi, for some reason a lot of people are still using hack encodings. Just
the other day I had to re-type a whole bunch of stuff, and I don't know a
word of Hindi. I don't work for webdunia.com, but my project makes use of
their content. Let me tell you, converting their custom encoding into Unicode
was quite a challenge.
Jul 17 '05 #8

"Chung Leong" <ch***********@hotmail.com> wrote in message news:<X9********************@comcast.com>...
> "R. Rajesh Jeba Anbiah" <ng**********@rediffmail.com> wrote in message
> news:ab**************************@posting.google.com...
>> Yeah, I understand. But for Tamil, staying away from Unicode may not
>> help much, as many people are moving towards it. The reason is probably
>> that many people here use only Windows/IE. <OT>BTW, is www.webdunia.com
>> your work? :-)</OT>
>
> If that's the case then Unicode is definitely the preferred route. With
> Hindi, for some reason a lot of people are still using hack encodings. Just
> the other day I had to re-type a whole bunch of stuff, and I don't know a
> word of Hindi. I don't work for webdunia.com, but my project makes use of
> their content. Let me tell you, converting their custom encoding into Unicode
> was quite a challenge.

Thanks, Chung, for your comments and help. AFAIK, webdunia uses the Dunia
encoding; here is a Dunia-to-Unicode map
<http://crl.nmsu.edu/~mleisher/naicode.html> and a Perl script
<http://crl.nmsu.edu/~mleisher/nai2ucs.pl> (in case you want them).
Thanks.

--
| Just another PHP saint |
Email: rrjanbiah-at-Y!com
Jul 17 '05 #9
