
Howto: Detect encodings?

 
  #1  
Old July 17th, 2005, 05:43 AM
R. Rajesh Jeba Anbiah

Here is a nice piece of code to detect UTF-8:
<http://in2.php.net/utf8_encode#39986>. But I couldn't work out the
logic behind the script. If anyone knows it, please share.

In particular, I would like to detect other encodings too, so I would
like to understand the logic.

For example, this text is in TSCII encoding (for Tamil):
Žì¸õ. þÐ ¾Á¢Æ¢ø ¯ûÇ Åâ. þ¨¾ ¯*¸Ç¡ø ÀÊì¸ ÓÊ󾡸, ¯*¸û ¯Ä¡Å¢Â¢ø ±ó¾
Å¢¾ Á¡üÈò¨¾Ôõ ¦ºöÂò §¾¨Å þø¨Ä. ¦¾¡¼÷óÐ þó¾ þ¨½Âò ¾Çò¾¢ø «¨Áó¾¢ÕìÌõ
¾Á¢úô Àì¸*¸¨Çò ¾¨¼Â¢ýÈ¢ô ÀÊì¸Ä¡õ.

Any ideas? TIA.

--
| Just another PHP saint |
Email: rrjanbiah-at-Y!com

  #2  
Old July 17th, 2005, 05:43 AM
Chung Leong

"R. Rajesh Jeba Anbiah" <ng4rrjanbiah@rediffmail.com> wrote in message
news:abc4d8b8.0406090618.bab78e5@posting.google.com...
> Here is a nice code to detect utf-8
> <http://in2.php.net/utf8_encode#39986> But, I couldn't find out the
> logic behind the script. If anyone knows that please share.

The code tries to decode the text as UTF-8. If it runs into an error, the
text is not UTF-8.
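
A minimal sketch of that idea, in Python rather than PHP (the function name is made up for illustration, and this is not the script from the linked page): attempt a strict UTF-8 decode and treat any decoding error as "not UTF-8".

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# Well-formed UTF-8 (a Tamil word written with escapes)
print(is_valid_utf8("\u0ba4\u0bae\u0bbf\u0bb4\u0bcd".encode("utf-8")))  # True

# Typical 8-bit legacy bytes; 0xFE can never appear in UTF-8
print(is_valid_utf8(b"\xfe\xd0 \xbe\xc1"))  # False
```

Plain ASCII also passes, since ASCII is a subset of UTF-8, so a "yes" only means the bytes are consistent with UTF-8. In PHP, `mb_check_encoding($str, 'UTF-8')` from the mbstring extension should give a comparable yes/no answer.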
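
> For example these texts are in TSCII encoding (for Tamil):
> Žì¸õ. þÐ ¾Á¢Æ¢ø ¯ûÇ Åâ. þ¨¾ ¯*¸Ç¡ø ÀÊì¸ ÓÊ󾡸, ¯*¸û ¯Ä¡Å¢Â¢ø ±ó¾
> Å¢¾ Á¡üÈò¨¾Ôõ ¦ºöÂò §¾¨Å þø¨Ä. ¦¾¡¼÷óÐ þó¾ þ¨½Âò ¾Çò¾¢ø «¨Áó¾¢ÕìÌõ
> ¾Á¢úô Àì¸*¸¨Çò ¾¨¼Â¢ýÈ¢ô ÀÊì¸Ä¡õ.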

No easy way to do it. The question is, are you trying to distinguish between
different possible ways of encoding Tamil or identify TSCII from all
possible encodings?


  #3  
Old July 17th, 2005, 05:43 AM
Gerard van Wilgen


"R. Rajesh Jeba Anbiah" <ng4rrjanbiah@rediffmail.com> wrote in message
news:abc4d8b8.0406090618.bab78e5@posting.google.com...
> Here is a nice code to detect utf-8
> <http://in2.php.net/utf8_encode#39986> But, I couldn't find out the
> logic behind the script. If anyone knows that please share.
>
> Particularly I would like to detect other encodings too. So, I would
> like to know the logic.
>
> For example these texts are in TSCII encoding (for Tamil):
> Žì¸õ. þÐ ¾Á¢Æ¢ø ¯ûÇ Åâ. þ¨¾ ¯*¸Ç¡ø ÀÊì¸ ÓÊ󾡸, ¯*¸û ¯Ä¡Å¢Â¢ø ±ó¾
> Å¢¾ Á¡üÈò¨¾Ôõ ¦ºöÂò §¾¨Å þø¨Ä. ¦¾¡¼÷óÐ þó¾ þ¨½Âò ¾Çò¾¢ø «¨Áó¾¢ÕìÌõ
> ¾Á¢úô Àì¸*¸¨Çò ¾¨¼Â¢ýÈ¢ô ÀÊì¸Ä¡õ.
>
> Any ideas? TIA.

A text that is not encoded in utf-8 will usually contain many byte sequences
that are invalid in utf-8. Encodings like TSCII are much more difficult to
detect, because every possible byte sequence would be valid (even though it
would not necessarily be a meaningful character sequence for a human
reader).
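
To illustrate the difference, here is a small Python sketch. It uses Latin-1 as a stand-in for a single-byte encoding like TSCII (as far as I know Python's standard library does not ship a TSCII codec, so that substitution is an assumption made only for the demo).

```python
# Every byte value above the ASCII range
high_bytes = bytes(range(0x80, 0x100))

# A single-byte charset such as Latin-1 maps all of them to some character,
# so decoding can never fail, whatever the bytes "really" are.
as_latin1 = high_bytes.decode("latin-1")

# The same bytes are not well-formed UTF-8 (0x80 is a stray continuation byte).
try:
    high_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(len(as_latin1), utf8_ok)  # 128 False
```

So a failed decode can rule UTF-8 out, but a successful decode in an 8-bit encoding tells you nothing.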

When I have a text with an unknown encoding I simply load it in an editor
that supports many encodings, and then try them out until I have found the
setting that causes the text to become readable. Writing a script that can
detect the encoding is obviously very difficult.

I should say, forget it. It is not worth the trouble.

Gerard van Wilgen
--
www.majstro.com (On-line translation dictionary / Enreta tradukvortaro)
www.travlang.com/Ergane (Free translation dictionary for Windows / Senpaga
tradukvortaro por Windows)

  #4  
Old July 17th, 2005, 05:44 AM
R. Rajesh Jeba Anbiah

"Chung Leong" <chernyshevsky@hotmail.com> wrote in message news:<SM6dne6GEs2kAFrdRVn-vw@comcast.com>...
> "R. Rajesh Jeba Anbiah" <ng4rrjanbiah@rediffmail.com> wrote in message
> news:abc4d8b8.0406090618.bab78e5@posting.google.com...
> > Here is a nice code to detect utf-8
> > <http://in2.php.net/utf8_encode#39986> But, I couldn't find out the
> > logic behind the script. If anyone knows that please share.
>
> The code tries to decode the UTF8 text. When it runs into an error, then
> it's not UTF8.

Thanks for the info/logic. Though I'm a bit familiar with Unicode, this is
the first time I'm getting my hands on it... It's a bit of a pain, as
PHP's Unicode support is broken and strange...

> > For example these texts are in TSCII encoding (for Tamil):
> > Žì¸õ. þÐ ¾Á¢Æ¢ø ¯ûÇ Åâ. þ¨¾ ¯*¸Ç¡ø ÀÊì¸ ÓÊ󾡸, ¯*¸û ¯Ä¡Å¢Â¢ø ±ó¾
> > Å¢¾ Á¡üÈò¨¾Ôõ ¦ºöÂò §¾¨Å þø¨Ä. ¦¾¡¼÷óÐ þó¾ þ¨½Âò ¾Çò¾¢ø «¨Áó¾¢ÕìÌõ
> > ¾Á¢úô Àì¸*¸¨Çò ¾¨¼Â¢ýÈ¢ô ÀÊì¸Ä¡õ.
>
> No easy way to do it. The question is, are you trying to distinguish between
> different possible ways of encoding Tamil or identify TSCII from all
> possible encodings?

I'd be interested in trying both. Are you hinting that at least one of
them is easier? Thanks.

--
| Just another PHP saint |
Email: rrjanbiah-at-Y!com
  #5  
Old July 17th, 2005, 05:44 AM
R. Rajesh Jeba Anbiah

"Gerard van Wilgen" <gvanwilgen@planet.nl> wrote in message news:<ca8eav$1p9$1@reader08.wxs.nl>...
> "R. Rajesh Jeba Anbiah" <ng4rrjanbiah@rediffmail.com> wrote in message
> news:abc4d8b8.0406090618.bab78e5@posting.google.com...
<snip>
>
> A text that is not encoded in utf-8 will usually contain many byte sequences
> that are invalid in utf-8. Encodings like TSCII are much more difficult to
> detect, because every possible byte sequence would be valid (even though it
> would not necessarily be a meaningful character sequence for a human
> reader).
>
> When I have a text with an unknown encoding I simply load it in an editor
> that supports many encodings, and then try them out until I have found the
> setting that causes the text to become readable.

Yes, I understand what you mean. Only a human can identify it
reliably...

> Writing a script that can
> detect the encoding is obviously very difficult.
>
> I should say, forget it. It is not worth the trouble.

This tool <http://www.murasu.com/converter/> can auto-detect the
encoding. So I think it is still possible?

--
| Just another PHP saint |
Email: rrjanbiah-at-Y!com
  #6  
Old July 17th, 2005, 05:44 AM
Chung Leong


"R. Rajesh Jeba Anbiah" <ng4rrjanbiah@rediffmail.com> wrote in message
news:abc4d8b8.0406100019.61b48ff7@posting.google.com...
> Thanks for the info/logic. Though I'm bit aware of unicode, this is
> the first time I'm putting my hands on it... It's bit kinda pain as
> PHP's unicode support is broken and strange...

Yeah, Unicode support in PHP is practically non-existent. You can still get
by, though. More recent versions of PHP support Unicode code point ranges in
regular expression character classes (with the /u modifier), so you can do
things like /([\x{0900}-\x{09FF}]+)/u.
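
For comparison, a sketch of the same kind of code point range in Python's `re` module (the variable names and the sample string are made up for illustration). The range is the one given above, U+0900 to U+09FF, which covers the Devanagari and Bengali blocks.

```python
import re

# One or more consecutive characters in U+0900..U+09FF
INDIC_RUN = re.compile(r"[\u0900-\u09FF]+")

# "Hello <Hindi word> world", with the Devanagari written as escapes
sample = "Hello \u0939\u093f\u0928\u094d\u0926\u0940 world"

matches = INDIC_RUN.findall(sample)
print(len(matches), len(matches[0]))  # 1 6
```

The pattern works on decoded text, so the input has to be converted to a Unicode string first; it does not help with raw bytes in an unknown encoding.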

UTF-8 is in general rather tricky to work with. For example, you can't limit
the byte length of text entered by users using just the maxlength attribute
in HTML, because it counts characters rather than bytes. And when a database
width constraint chops off some UTF-8 text in mid-character, all sorts of
funky things happen in the browser.
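
A short Python sketch of the mid-character problem and one possible way around it. The helper name `truncate_utf8` is hypothetical, and it assumes the input is valid UTF-8 to begin with.

```python
text = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"  # a Tamil word, 5 code points
encoded = text.encode("utf-8")            # each of these takes 3 bytes
print(len(text), len(encoded))            # 5 15

# A naive cut at a byte limit, as a too-narrow column might do
cut = encoded[:7]
try:
    cut.decode("utf-8")
    cut_is_valid = True
except UnicodeDecodeError:
    cut_is_valid = False
print(cut_is_valid)  # False: the third character was split


def truncate_utf8(data: bytes, limit: int) -> bytes:
    """Keep at most `limit` bytes without leaving a partial character."""
    clipped = data[:limit]
    # errors="ignore" drops the incomplete trailing sequence
    return clipped.decode("utf-8", errors="ignore").encode("utf-8")


print(len(truncate_utf8(encoded, 7)))  # 6
```

This only protects code points. It can still separate a vowel sign from its consonant, so the last visible letter may change shape.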

My advice is not to use Unicode unless you have to. I am not familiar with
the Tamil script, but I have done a lot of work with Hindi. Most Hindi
websites do not use Unicode (e.g. www.webdunia.com), because Unicode Hindi
text requires rendering support from the operating system, which essentially
limits you to Windows/IE only.

> I'll be interested to try both. Are you hinting that at least one is
> easier? Thanks.

Choosing one encoding out of three is obviously easier than choosing one out
of several hundred. As far as I know, the only foolproof way is to run a
spell check on the text. Statistical analysis could also work: just count
how often each letter occurs and compare that to a known profile for
the language.
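
A rough Python sketch of the statistical approach for the "one out of three" case. Python has no codecs for the Tamil or Hindi legacy encodings discussed here, so the demo uses three Cyrillic encodings as stand-ins. The profile values are rough, hand-picked numbers for illustration, not measured data, and `guess_encoding` is a made-up name.

```python
# Hypothetical profile: approximate share of the most common Russian letters
PROFILE = {
    "о": 0.110, "е": 0.085, "а": 0.080, "и": 0.074, "н": 0.067,
    "т": 0.063, "с": 0.055, "р": 0.047, "в": 0.045, "л": 0.044,
}

CANDIDATES = ["koi8_r", "cp1251", "iso8859_5"]


def guess_encoding(data: bytes, candidates=CANDIDATES) -> str:
    """Pick the candidate whose decoded letters best match the profile."""
    def score(encoding: str) -> float:
        text = data.decode(encoding, errors="replace")
        # Frequent letters under the right decoding pull the average up;
        # a wrong decoding yields rare letters or the wrong case.
        return sum(PROFILE.get(ch, 0.0) for ch in text) / max(len(text), 1)
    return max(candidates, key=score)


SAMPLE = "это простой тест на определение кодировки текста"

for enc in CANDIDATES:
    print(enc, "->", guess_encoding(SAMPLE.encode(enc)))
```

With very short or unusual input the scores get close together, so a guess like this is a hint rather than a proof.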


  #7  
Old July 17th, 2005, 05:44 AM
R. Rajesh Jeba Anbiah

"Chung Leong" <chernyshevsky@hotmail.com> wrote in message news:<Q-2dnYW2xMUIc1XdRVn-ig@comcast.com>...
> "R. Rajesh Jeba Anbiah" <ng4rrjanbiah@rediffmail.com> wrote in message
> news:abc4d8b8.0406100019.61b48ff7@posting.google.com...
> > Thanks for the info/logic. Though I'm bit aware of unicode, this is
> > the first time I'm putting my hands on it... It's bit kinda pain as
> > PHP's unicode support is broken and strange...
>
> Yeah, Unicode support in PHP is practically non-existence. You can still get
> by though. More recent version of PHP supports character classes in regular
> expressions, so you can do things like
> /([\x{0900}-\x{09FF}]+)/.
>
> UTF8 is in general rather tricky to work with. For example, you can't limit
> the length of text entered by users using just the length attribute in HTML.
> And when database width constraint chops off some UTF8 text in
> mid-character, all sort of funky things happen in the browser.

Thanks a lot for your comments and help. As you said, UTF-8 behaves
rather strangely; if we include UTF-8 text from other files, it works
differently than expected. Anyway, we can somehow get it to work.

> My advise is not to use Unicode unless you have to. I am not familiar with
> the Tamil script, but I think done a lot of work with Hindi. Most Hindi
> websites do not use Unicode (e.g. www.webdunia.com), because Unicode Hindi
> text requires rendering support from the operation system, which essentially
> limits you to Windows/IE only.

Yeah, I understand. But for Tamil, staying away from Unicode may not
help much, as many people are moving towards it. The reason is probably
that many people here use only Windows/IE. <OT>BTW, is www.webdunia.com
your work? :-)</OT>

> > I'll be interested to try both. Are you hinting that at least one is
> > easier? Thanks.
>
> Choosing one encoding out of three is obviously easier than choosing one out
> of several hundred. As far as I know the only fool proof way is to run a
> spell check on the text. Statistical analysis could also work. Just count
> how often the letters are occurring and compare that to a known profile for
> that language.

In Tamil, some characters never start a word (unless someone made
a typo). I had thought of using such grammar rules if there is no
direct way to detect the encoding. Thanks a lot for your help.
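
A very rough Python sketch of that word-start idea. It assumes each candidate encoding has already been converted to Unicode Tamil by some converter (not shown, and not part of Python), and it approximates "cannot start a word" with Unicode combining marks such as vowel signs, which is narrower than the real grammar rule. Both function names are made up.

```python
import unicodedata


def bad_word_starts(text: str) -> int:
    """Count words whose first character is a combining mark (category M*)."""
    return sum(
        1
        for word in text.split()
        if unicodedata.category(word[0]).startswith("M")
    )


def pick_decoding(candidates: dict) -> str:
    """Return the label of the candidate text with the fewest bad word starts."""
    return min(candidates, key=lambda label: bad_word_starts(candidates[label]))


# Two hypothetical conversion results for the same bytes
plausible = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd \u0ba8\u0ba9\u0bcd\u0bb1\u0bbf"
implausible = "\u0bbf\u0ba4 \u0bcd\u0bae"  # words starting with a vowel sign / pulli

print(bad_word_starts(plausible), bad_word_starts(implausible))  # 0 2
```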

--
| Just another PHP saint |
Email: rrjanbiah-at-Y!com
  #8  
Old July 17th, 2005, 05:45 AM
Chung Leong

"R. Rajesh Jeba Anbiah" <ng4rrjanbiah@rediffmail.com> wrote in message
news:abc4d8b8.0406110038.7d7b2009@posting.google.com...
> Yeah I understand. But, for Tamil staying behind Unicode may not
> help much as many people are moving towards it. The reason should be
> many people here use Windows/IE alone. <OT>BTW, www.webdunia.com is
> your work? :-)</OT>

If that's the case, then Unicode is definitely the preferred route. With
Hindi, for some reason a lot of people are still using hack encodings. Just
the other day I had to re-type a whole bunch of stuff, and I don't know a
word of Hindi. I don't work for webdunia.com, but my project makes use of
their content. Let me tell you, converting their custom encoding into Unicode
was quite a challenge.


  #9  
Old July 17th, 2005, 05:45 AM
R. Rajesh Jeba Anbiah

"Chung Leong" <chernyshevsky@hotmail.com> wrote in message news:<X9Gdnc9H998xqlfdRVn-iQ@comcast.com>...
> "R. Rajesh Jeba Anbiah" <ng4rrjanbiah@rediffmail.com> wrote in message
> news:abc4d8b8.0406110038.7d7b2009@posting.google.com...
> > Yeah I understand. But, for Tamil staying behind Unicode may not
> > help much as many people are moving towards it. The reason should be
> > many people here use Windows/IE alone. <OT>BTW, www.webdunia.com is
> > your work? :-)</OT>
>
> If that's the case then Unicode is definitely the preferred route. With
> Hindi, for some reason a lot of people are still using hack encoding. Just
> the other day I had to re-type a whole bunch of stuff and I don't know a
> word of Hindi. I don't work for webdunia.com, but my project make use of
> their content. Let me tell you converting their custom encoding into Unicode
> was quite a challenge.

Thanks, Chung, for your comments and help. AFAIK, webdunia uses the Dunia
encoding; here is a Dunia-to-Unicode map
<http://crl.nmsu.edu/~mleisher/naicode.html> and a Perl script
<http://crl.nmsu.edu/~mleisher/nai2ucs.pl> (in case you want them).
Thanks.

--
| Just another PHP saint |
Email: rrjanbiah-at-Y!com
 
