Howto: Detect encodings?

Here is a nice piece of code to detect UTF-8:
<http://in2.php.net/utf8_encode#39986>. But I couldn't work out the
logic behind the script. If anyone knows it, please share.

In particular, I would like to detect other encodings too, so I would
like to understand the logic.

For example these texts are in TSCII encoding (for Tamil):
Žì¸õ. þÐ ¾Á¢Æ¢ø ¯ûÇ Åâ. þ¨¾ ¯*¸Ç¡ø ÀÊì¸ ÓÊó¾¡ø, ¯*¸û ¯Ä¡Å¢Â¢ø ±ó¾
Å¢¾ Á¡üÈò¨¾Ôõ ¦ºöÂò §¾¨Å þø¨Ä. ¦¾¡¼÷óÐ þó¾ þ¨½Âò ¾Çò¾¢ø «¨Áó¾¢ÕìÌõ
¾Á¢úô Àì¸*¸¨Çò ¾¨¼Â¢ýÈ¢ô ÀÊì¸Ä¡õ.

Any ideas? TIA.

--
| Just another PHP saint |
Email: rrjanbiah-at-Y!com
Jul 17 '05 #1
"R. Rajesh Jeba Anbiah" <ng**********@rediffmail.com> wrote in message
news:ab*************************@posting.google.co m...
Here is a nice piece of code to detect UTF-8:
<http://in2.php.net/utf8_encode#39986>. But I couldn't work out the
logic behind the script. If anyone knows it, please share.
The code tries to decode the text as UTF-8. If it runs into an error, the
text is not UTF-8.
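
As a rough illustration of that logic (in Python rather than PHP, and not the script from the linked note): attempt a strict UTF-8 decode and treat a decoding error as "not UTF-8". The function name `looks_like_utf8` is made up for this sketch.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 without error."""
    try:
        data.decode("utf-8")  # strict mode: raises on any invalid byte sequence
    except UnicodeDecodeError:
        return False
    return True


print(looks_like_utf8("தமிழ்".encode("utf-8")))  # bytes produced by a UTF-8 encoder
print(looks_like_utf8(b"\xab\xcd"))               # starts with a stray continuation byte
```

Note that plain ASCII also passes this check, since ASCII is a subset of UTF-8, so a positive result only means "could be UTF-8".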
For example these texts are in TSCII encoding (for Tamil):
<snip>


No easy way to do it. The question is, are you trying to distinguish between
different possible ways of encoding Tamil or identify TSCII from all
possible encodings?
Jul 17 '05 #2

"R. Rajesh Jeba Anbiah" <ng**********@rediffmail.com> wrote in message
news:ab*************************@posting.google.co m...
Here is a nice piece of code to detect UTF-8:
<http://in2.php.net/utf8_encode#39986>. But I couldn't work out the
logic behind the script. If anyone knows it, please share.

In particular, I would like to detect other encodings too, so I would
like to understand the logic.

For example these texts are in TSCII encoding (for Tamil):
Žì¸õ. þÐ ¾Á¢Æ¢ø ¯ûÇ Åâ. þ¨¾ ¯*¸Ç¡ø ÀÊì¸ ÓÊó¾¡ø, ¯*¸û ¯Ä¡Å¢Â¢ø ±ó¾
Å¢¾ Á¡üÈò¨¾Ôõ ¦ºöÂò §¾¨Å þø¨Ä. ¦¾¡¼÷óÐ þó¾ þ¨½Âò ¾Çò¾¢ø «¨Áó¾¢ÕìÌõ
¾Á¢úô Àì¸*¸¨Çò ¾¨¼Â¢ýÈ¢ô ÀÊì¸Ä¡õ.

Any ideas? TIA.


A text that is not encoded in utf-8 will usually contain many byte sequences
that are invalid in utf-8. Encodings like TSCII are much more difficult to
detect, because every possible byte sequence would be valid (even though it
would not necessarily be a meaningful character sequence for a human
reader).
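
A small sketch of why the decode test stops helping for single-byte encodings. It is in Python, and it uses Latin-1 as a stand-in because, as far as I know, Python's standard library has no TSCII codec; `decodes_cleanly` is a hypothetical helper name.

```python
def decodes_cleanly(data: bytes, encoding: str) -> bool:
    """Return True if data can be decoded with the given codec without error."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


every_byte = bytes(range(256))  # one of each possible byte value

# A single-byte codec that assigns a character to every byte never fails,
# so "did it decode?" tells us nothing about whether the guess was right.
print(decodes_cleanly(every_byte, "latin-1"))

# UTF-8 has structural rules, so arbitrary bytes are rejected.
print(decodes_cleanly(every_byte, "utf-8"))
```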

When I have a text with an unknown encoding I simply load it in an editor
that supports many encodings, and then try them out until I have found the
setting that causes the text to become readable. Writing a script that can
detect the encoding is obviously very difficult.
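
The "try them out until it is readable" routine can be partly scripted, with a human still doing the judging. A minimal Python sketch, assuming a hand-picked candidate list; `preview_decodings` and the candidate codecs are illustrative choices, not anything from this thread.

```python
def preview_decodings(data, candidates):
    """Map each candidate encoding to the decoded text, or None if decoding fails."""
    previews = {}
    for encoding in candidates:
        try:
            previews[encoding] = data.decode(encoding)
        except UnicodeDecodeError:
            previews[encoding] = None  # structurally impossible for this codec
    return previews


sample = "Привет".encode("koi8_r")  # pretend the encoding is unknown
for encoding, text in preview_decodings(sample, ["utf-8", "cp1251", "koi8_r"]).items():
    print(encoding, "->", text)
```

The script only narrows the list by discarding codecs that fail outright; picking the readable line is still up to the reader.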

I should say, forget it. It is not worth the trouble.

Gerard van Wilgen
--
www.majstro.com (On-line translation dictionary / Enreta tradukvortaro)
www.travlang.com/Ergane (Free translation dictionary for Windows / Senpaga
tradukvortaro por Windows)

Jul 17 '05 #3
"Chung Leong" <ch***********@hotmail.com> wrote in message news:<SM********************@comcast.com>...
"R. Rajesh Jeba Anbiah" <ng**********@rediffmail.com> wrote in message
news:ab*************************@posting.google.co m...
Here is a nice piece of code to detect UTF-8:
<http://in2.php.net/utf8_encode#39986>. But I couldn't work out the
logic behind the script. If anyone knows it, please share.


The code tries to decode the text as UTF-8. If it runs into an error, the
text is not UTF-8.


Thanks for the info/logic. Though I'm a bit familiar with Unicode, this is
the first time I'm getting my hands on it... It's a bit of a pain, as
PHP's Unicode support is broken and strange...
For example these texts are in TSCII encoding (for Tamil):
<snip>


No easy way to do it. The question is, are you trying to distinguish between
different possible ways of encoding Tamil or identify TSCII from all
possible encodings?


I'd be interested in trying both. Are you hinting that at least one is
easier? Thanks.

--
| Just another PHP saint |
Email: rrjanbiah-at-Y!com
Jul 17 '05 #4
"Gerard van Wilgen" <gv********@planet.nl> wrote in message news:<ca**********@reader08.wxs.nl>...
"R. Rajesh Jeba Anbiah" <ng**********@rediffmail.com> wrote in message
news:ab*************************@posting.google.co m... <snip>
A text that is not encoded in utf-8 will usually contain many byte sequences
that are invalid in utf-8. Encodings like TSCII are much more difficult to
detect, because every possible byte sequence would be valid (even though it
would not necessarily be a meaningful character sequence for a human
reader).

When I have a text with an unknown encoding I simply load it in an editor
that supports many encodings, and then try them out until I have found the
setting that causes the text to become readable.
Yes, I understand what you mean. Only a human can identify it
clearly...
Writing a script that can
detect the encoding is obviously very difficult.

I should say, forget it. It is not worth the trouble.


This tool <http://www.murasu.com/converter/> can auto-detect the
encoding. So I think it is still possible?

--
| Just another PHP saint |
Email: rrjanbiah-at-Y!com
Jul 17 '05 #5

"R. Rajesh Jeba Anbiah" <ng**********@rediffmail.com> wrote in message
news:ab**************************@posting.google.c om...
Thanks for the info/logic. Though I'm a bit familiar with Unicode, this is
the first time I'm getting my hands on it... It's a bit of a pain, as
PHP's Unicode support is broken and strange...
Yeah, Unicode support in PHP is practically non-existent. You can still get
by though. More recent versions of PHP support Unicode code point ranges in
regular-expression character classes (with the /u modifier), so you can do
things like
/([\x{0900}-\x{09FF}]+)/u.
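
For comparison, a sketch of the same character-class idea in Python (not PHP). The range U+0900 to U+09FF is taken from the pattern above and spans the Devanagari and Bengali blocks; the name `INDIC_RUN` and the sample string are made up for illustration.

```python
import re

# Same code point range as the pattern above: U+0900 through U+09FF.
INDIC_RUN = re.compile(r"[\u0900-\u09FF]+")

sample = "hello हिन्दी world"
print(INDIC_RUN.findall(sample))  # runs of characters inside the range
```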

UTF-8 is in general rather tricky to work with. For example, you can't limit
the byte length of text entered by users using just the maxlength attribute
in HTML. And when a database width constraint chops off some UTF-8 text in
mid-character, all sorts of funky things happen in the browser.
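
One way to avoid the mid-character chop is to truncate by bytes yourself and drop any incomplete trailing sequence before the text reaches the database. A minimal Python sketch under that assumption; `truncate_utf8` is a hypothetical helper, not a PHP or database function.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Shorten text so its UTF-8 encoding fits in max_bytes without splitting a character."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # The input was valid UTF-8, so the only invalid bytes after slicing are
    # a partial character at the very end; errors="ignore" drops just those.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


print(truncate_utf8("தமிழ்", 4))  # each Tamil character here takes 3 bytes
print(truncate_utf8("héllo", 2))  # "é" takes 2 bytes and does not fit after "h"
```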

My advice is not to use Unicode unless you have to. I am not familiar with
the Tamil script, but I have done a lot of work with Hindi. Most Hindi
websites do not use Unicode (e.g. www.webdunia.com), because Unicode Hindi
text requires rendering support from the operating system, which essentially
limits you to Windows/IE only.
I'd be interested in trying both. Are you hinting that at least one is
easier? Thanks.


Choosing one encoding out of three is obviously easier than choosing one out
of several hundred. As far as I know, the only foolproof way is to run a
spell check on the text. Statistical analysis could also work: just count
how often each letter occurs and compare that to a known profile for
that language.
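
A rough Python sketch of the letter-frequency idea, under several assumptions: the candidate codecs are ones Python ships (it has no TSCII codec as far as I know, so Cyrillic codecs stand in), the distance measure is a simple sum of absolute differences, and all function names are invented here. The demo reuses the reference text as the input, so the match is exact; real use would need a profile built from a large corpus.

```python
from collections import Counter


def letter_profile(text):
    """Relative frequency of each alphabetic character in text (lowercased)."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    if not letters:
        return {}
    counts = Counter(letters)
    return {ch: n / len(letters) for ch, n in counts.items()}


def profile_distance(a, b):
    """Sum of absolute frequency differences; 0.0 means identical profiles."""
    return sum(abs(a.get(ch, 0.0) - b.get(ch, 0.0)) for ch in set(a) | set(b))


def guess_encoding(data, candidates, reference_profile):
    """Return the candidate whose decoding is closest to the reference profile."""
    best = None
    for encoding in candidates:
        try:
            text = data.decode(encoding)
        except UnicodeDecodeError:
            continue  # this codec cannot have produced these bytes
        score = profile_distance(letter_profile(text), reference_profile)
        if best is None or score < best[1]:
            best = (encoding, score)
    return best[0] if best else None


reference_text = "съешь ещё этих мягких французских булок да выпей чаю"
reference = letter_profile(reference_text)
data = reference_text.encode("koi8_r")  # pretend the encoding is unknown
print(guess_encoding(data, ["cp1251", "koi8_r"], reference))
```

With only a few candidates for one known language, as suggested above, even a crude profile like this has a fair chance of separating them; across hundreds of encodings it would need far better statistics.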
Jul 17 '05 #6
"Chung Leong" <ch***********@hotmail.com> wrote in message news:<Q-********************@comcast.com>...
"R. Rajesh Jeba Anbiah" <ng**********@rediffmail.com> wrote in message
news:ab**************************@posting.google.c om...
Thanks for the info/logic. Though I'm a bit familiar with Unicode, this is
the first time I'm getting my hands on it... It's a bit of a pain, as
PHP's Unicode support is broken and strange...
Yeah, Unicode support in PHP is practically non-existent. You can still get
by though. More recent versions of PHP support Unicode code point ranges in
regular-expression character classes (with the /u modifier), so you can do
things like
/([\x{0900}-\x{09FF}]+)/u.

UTF-8 is in general rather tricky to work with. For example, you can't limit
the byte length of text entered by users using just the maxlength attribute
in HTML. And when a database width constraint chops off some UTF-8 text in
mid-character, all sorts of funky things happen in the browser.


Thanks a lot for your comments and help. As you said, UTF-8 behaves
quite strangely; if we include UTF-8 text from other files, it works
differently than expected. Anyway, we can get it to work somehow.
My advice is not to use Unicode unless you have to. I am not familiar with
the Tamil script, but I have done a lot of work with Hindi. Most Hindi
websites do not use Unicode (e.g. www.webdunia.com), because Unicode Hindi
text requires rendering support from the operating system, which essentially
limits you to Windows/IE only.


Yeah, I understand. But for Tamil, staying away from Unicode may not
help much, as many people are moving towards it. The reason is probably
that many people here use only Windows/IE. <OT>BTW, is www.webdunia.com
your work? :-)</OT>
I'd be interested in trying both. Are you hinting that at least one is
easier? Thanks.


Choosing one encoding out of three is obviously easier than choosing one out
of several hundred. As far as I know, the only foolproof way is to run a
spell check on the text. Statistical analysis could also work: just count
how often each letter occurs and compare that to a known profile for
that language.


In Tamil, some characters never start a word (unless someone made
a typo). I had thought of using such grammar rules if there is no
direct way to detect the encoding. Thanks a lot for your help.
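
A sketch of how such a rule could score a candidate decoding, in Python. It assumes the text has already been converted to Unicode under some guessed encoding, and it uses only one script-level rule as a stand-in for the grammar rules mentioned above: a word should not begin with a combining mark such as a dependent vowel sign or the pulli. The helper name `count_bad_word_starts` is invented for this example.

```python
import unicodedata


def count_bad_word_starts(text):
    """Count whitespace-separated words whose first character is a combining mark."""
    return sum(
        1
        for word in text.split()
        # Unicode general categories Mn, Mc and Me are all combining marks.
        if unicodedata.category(word[0]).startswith("M")
    )


good = "தமிழ் மொழி"
bad = "\u0bcdதமிழ் மொழி"  # first word wrongly begins with the pulli (U+0BCD)
print(count_bad_word_starts(good), count_bad_word_starts(bad))
```

A decoding that produces many such violations is probably the wrong guess; a low count is only weak evidence that the guess is right.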

--
| Just another PHP saint |
Email: rrjanbiah-at-Y!com
Jul 17 '05 #7
"R. Rajesh Jeba Anbiah" <ng**********@rediffmail.com> wrote in message
news:ab**************************@posting.google.c om...
Yeah, I understand. But for Tamil, staying away from Unicode may not
help much, as many people are moving towards it. The reason is probably
that many people here use only Windows/IE. <OT>BTW, is www.webdunia.com
your work? :-)</OT>


If that's the case then Unicode is definitely the preferred route. With
Hindi, for some reason a lot of people are still using hack encodings. Just
the other day I had to re-type a whole bunch of stuff, and I don't know a
word of Hindi. I don't work for webdunia.com, but my project makes use of
their content. Let me tell you, converting their custom encoding into Unicode
was quite a challenge.
Jul 17 '05 #8
"Chung Leong" <ch***********@hotmail.com> wrote in message news:<X9********************@comcast.com>...
"R. Rajesh Jeba Anbiah" <ng**********@rediffmail.com> wrote in message
news:ab**************************@posting.google.c om...
Yeah, I understand. But for Tamil, staying away from Unicode may not
help much, as many people are moving towards it. The reason is probably
that many people here use only Windows/IE. <OT>BTW, is www.webdunia.com
your work? :-)</OT>


If that's the case then Unicode is definitely the preferred route. With
Hindi, for some reason a lot of people are still using hack encodings. Just
the other day I had to re-type a whole bunch of stuff, and I don't know a
word of Hindi. I don't work for webdunia.com, but my project makes use of
their content. Let me tell you, converting their custom encoding into Unicode
was quite a challenge.


Thanks, Chung, for your comments and help. AFAIK, webdunia uses the Dunia
encoding; here is a Dunia-to-Unicode map
<http://crl.nmsu.edu/~mleisher/naicode.html> and a Perl script
<http://crl.nmsu.edu/~mleisher/nai2ucs.pl> (in case you want them).
Thanks.

--
| Just another PHP saint |
Email: rrjanbiah-at-Y!com
Jul 17 '05 #9
