Howto: Detect encodings?

Here is a nice piece of code to detect UTF-8:
<http://in2.php.net/utf8_encode#39986>. But I couldn't work out the
logic behind the script. If anyone knows it, please share.

In particular, I would like to detect other encodings too, so I would
like to understand the logic.

For example, this text is in TSCII encoding (for Tamil):
Žì¸õ. þÐ ¾Á¢Æ¢ø ¯ûÇ Åâ. þ¨¾ ¯*¸Ç¡ø ÀÊì¸ ÓÊó¾¡ø, ¯*¸û ¯Ä¡Å¢Â¢ ±ó¾
Å¢¾ Á¡üÈò¨¾ õ ¦ºöÂò §¾¨Å þø¨Ä. ¦¾¡¼÷óÐ þó¾ þ¨½Âò ¾Çò¾¢ø «¨Áó¾¢Õ Ìõ
¾Á¢úô Àì¸*¸¨Ç ¾¨¼Â¢ýÈ ô ÀÊì¸Ä¡õ.

Any ideas? TIA.

--
| Just another PHP saint |
Email: rrjanbiah-at-Y!com
Jul 17 '05 #1
"R. Rajesh Jeba Anbiah" <ng**********@rediffmail.com> wrote in message
news:ab*************************@posting.google.com...
> Here is a nice piece of code to detect UTF-8:
> <http://in2.php.net/utf8_encode#39986>. But I couldn't work out the
> logic behind the script. If anyone knows it, please share.

The code tries to decode the text as UTF-8. If it runs into an error, the
text is not UTF-8.

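A minimal sketch of that logic, written in Python rather than PHP so it is easy to run: attempt a strict UTF-8 decode and treat any failure as "not UTF-8". The function name `is_valid_utf8` is made up for illustration, and this is not the script from the php.net comment.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


tamil_utf8 = "தமிழ்".encode("utf-8")     # well-formed UTF-8
eight_bit = bytes([0xAB, 0xA2, 0xFE])    # arbitrary high bytes, as in an 8-bit encoding

print(is_valid_utf8(tamil_utf8))  # True
print(is_valid_utf8(eight_bit))   # False
```

Note that pure ASCII text also passes this check, so a positive result only means "could be UTF-8".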
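> For example, this text is in TSCII encoding (for Tamil):
> Žì¸õ. þÐ ¾Á¢Æ¢ø ¯ûÇ Åâ. þ¨¾ ¯*¸Ç¡ø ÀÊì¸ ÓÊó¾¡ø, ¯*¸û ¯Ä¡Å¢Â¢ ±ó¾
> Å¢¾ Á¡üÈò¨¾ õ ¦ºöÂò §¾¨Å þø¨Ä. ¦¾¡¼÷óÐ þó¾ þ¨½Âò ¾Çò¾¢ø «¨Áó¾¢Õ Ìõ
> ¾Á¢úô Àì¸*¸¨Ç ¾¨¼Â¢ýÈ ô ÀÊì¸Ä¡õ.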


No easy way to do it. The question is, are you trying to distinguish between
different possible ways of encoding Tamil or identify TSCII from all
possible encodings?
Jul 17 '05 #2

"R. Rajesh Jeba Anbiah" <ng**********@rediffmail.com> wrote in message
news:ab*************************@posting.google.com...
> Here is a nice piece of code to detect UTF-8:
> <http://in2.php.net/utf8_encode#39986>. But I couldn't work out the
> logic behind the script. If anyone knows it, please share.
>
> In particular, I would like to detect other encodings too, so I would
> like to understand the logic.
>
> For example, this text is in TSCII encoding (for Tamil):
> Žì¸õ. þÐ ¾Á¢Æ¢ø ¯ûÇ Åâ. þ¨¾ ¯*¸Ç¡ø ÀÊì¸ ÓÊó¾¡ø, ¯*¸û ¯Ä¡Å¢Â¢ ±ó¾
> Å¢¾ Á¡üÈò¨¾ õ ¦ºöÂò §¾¨Å þø¨Ä. ¦¾¡¼÷óÐ þó¾ þ¨½Âò ¾Çò¾¢ø «¨Áó¾¢Õ Ìõ
> ¾Á¢úô Àì¸*¸¨Ç ¾¨¼Â¢ýÈ ô ÀÊì¸Ä¡õ.
>
> Any ideas? TIA.


A text that is not encoded in UTF-8 will usually contain many byte sequences
that are invalid in UTF-8. 8-bit encodings like TSCII are much more difficult
to detect, because in them every possible byte sequence would be valid (even
though it would not necessarily be a meaningful character sequence for a human
reader).
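To illustrate the difference, here is a small Python sketch. Python's standard library does not appear to ship a TSCII codec, so Latin-1 stands in as the 8-bit encoding; that substitution, and the helper name `decodes_as`, are assumptions made only for this example.

```python
def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if data can be decoded with the given encoding without errors."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


all_bytes = bytes(range(256))  # every possible byte value, once each

print(decodes_as(all_bytes, "latin-1"))  # True: no byte is ever invalid
print(decodes_as(all_bytes, "utf-8"))    # False: a lone 0x80 is already invalid
```

Because the 8-bit decoder accepts anything, a successful decode tells you nothing; only the content of the result can.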

When I have a text with an unknown encoding I simply load it in an editor
that supports many encodings, and then try them out until I have found the
setting that causes the text to become readable. Writing a script that can
detect the encoding is obviously very difficult.

I should say, forget it. It is not worth the trouble.

Gerard van Wilgen
--
www.majstro.com (On-line translation dictionary / Enreta tradukvortaro)
www.travlang.com/Ergane (Free translation dictionary for Windows / Senpaga
tradukvortaro por Windows)

Jul 17 '05 #3

"Chung Leong" <ch***********@hotmail.com> wrote in message news:<SM********************@comcast.com>...
> "R. Rajesh Jeba Anbiah" <ng**********@rediffmail.com> wrote in message
> news:ab*************************@posting.google.com...
> > Here is a nice piece of code to detect UTF-8:
> > <http://in2.php.net/utf8_encode#39986>. But I couldn't work out the
> > logic behind the script. If anyone knows it, please share.
>
> The code tries to decode the text as UTF-8. If it runs into an error, the
> text is not UTF-8.


Thanks for the info/logic. Though I'm a bit aware of Unicode, this is
the first time I'm getting my hands on it... It's kind of a pain, as
PHP's Unicode support is broken and strange...

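> > For example, this text is in TSCII encoding (for Tamil):
> > Žì¸õ. þÐ ¾Á¢Æ¢ø ¯ûÇ Åâ. þ¨¾ ¯*¸Ç¡ø ÀÊì¸ ÓÊó¾¡ø, ¯*¸û ¯Ä¡Å¢Â¢ ±ó¾
> > Å¢¾ Á¡üÈò¨¾ õ ¦ºöÂò §¾¨Å þø¨Ä. ¦¾¡¼÷óÐ þó¾ þ¨½Âò ¾Çò¾¢ø «¨Áó¾¢Õ Ìõ
> > ¾Á¢úô Àì¸*¸¨Ç ¾¨¼Â¢ýÈ ô ÀÊì¸Ä¡õ.
>
> No easy way to do it. The question is, are you trying to distinguish between
> different possible ways of encoding Tamil or identify TSCII from all
> possible encodings?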


I'll be interested to try both. Are you hinting that at least one is
easier? Thanks.

--
| Just another PHP saint |
Email: rrjanbiah-at-Y!com
Jul 17 '05 #4

"Gerard van Wilgen" <gv********@planet.nl> wrote in message news:<ca**********@reader08.wxs.nl>...
> "R. Rajesh Jeba Anbiah" <ng**********@rediffmail.com> wrote in message
> news:ab*************************@posting.google.com... <snip>
> A text that is not encoded in UTF-8 will usually contain many byte sequences
> that are invalid in UTF-8. 8-bit encodings like TSCII are much more difficult
> to detect, because in them every possible byte sequence would be valid (even
> though it would not necessarily be a meaningful character sequence for a human
> reader).
>
> When I have a text with an unknown encoding I simply load it in an editor
> that supports many encodings, and then try them out until I have found the
> setting that causes the text to become readable.

Yes, I understand what you mean. Only a human can identify it
clearly...

> Writing a script that can
> detect the encoding is obviously very difficult.
>
> I should say, forget it. It is not worth the trouble.


This tool <http://www.murasu.com/converter/> can auto-detect the
encoding. So I think it is still possible?

--
| Just another PHP saint |
Email: rrjanbiah-at-Y!com
Jul 17 '05 #5

"R. Rajesh Jeba Anbiah" <ng**********@rediffmail.com> wrote in message
news:ab**************************@posting.google.com...
> Thanks for the info/logic. Though I'm a bit aware of Unicode, this is
> the first time I'm getting my hands on it... It's kind of a pain, as
> PHP's Unicode support is broken and strange...

Yeah, Unicode support in PHP is practically non-existent. You can still get
by, though. More recent versions of PHP support Unicode character classes in
regular expressions (with the u modifier), so you can do things like
/([\x{0900}-\x{09FF}]+)/u.
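For comparison, a rough Python equivalent of that pattern is sketched below. The range 0900-09FF in the post spans two Unicode blocks, Devanagari (U+0900 to U+097F) and Bengali (U+0980 to U+09FF); the sketch uses the Devanagari block alone and adds the Tamil block (U+0B80 to U+0BFF) because the thread is about Tamil. The constant names and sample string are made up for illustration.

```python
import re

# Unicode block ranges taken from the Unicode code charts.
DEVANAGARI = re.compile("[\u0900-\u097F]+")
TAMIL = re.compile("[\u0B80-\u0BFF]+")

text = "abc हिन्दी xyz தமிழ்"

print(DEVANAGARI.findall(text))  # ['हिन्दी']
print(TAMIL.findall(text))       # ['தமிழ்']
```

This only works once the text has been decoded to Unicode; it cannot identify the script of raw bytes in a legacy encoding.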

UTF-8 is in general rather tricky to work with. For example, you can't limit
the byte length of text entered by users using just the maxlength attribute in
HTML, because it counts characters rather than bytes. And when a database width
constraint chops off some UTF-8 text in mid-character, all sorts of funky
things happen in the browser.
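One way to avoid the mid-character cut is to trim to a byte budget before the text reaches the database. The following is a minimal Python sketch; the function name `truncate_utf8` and the byte limits are hypothetical.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Cut text to at most max_bytes of UTF-8 without splitting a character."""
    cut = text.encode("utf-8")[:max_bytes]
    # The slice may end inside a multi-byte sequence; "ignore" drops that partial tail.
    return cut.decode("utf-8", errors="ignore")


word = "தமிழ்"                     # 5 code points, 3 bytes each in UTF-8

print(len(word.encode("utf-8")))  # 15
print(truncate_utf8(word, 4))     # த
```

This keeps every code point whole, but it can still separate a consonant from its vowel sign, so the last visible letter may change shape.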

My advice is not to use Unicode unless you have to. I am not familiar with
the Tamil script, but I have done a lot of work with Hindi. Most Hindi
websites do not use Unicode (e.g. www.webdunia.com), because Unicode Hindi
text requires rendering support from the operating system, which essentially
limits you to Windows/IE only.

> I'll be interested to try both. Are you hinting that at least one is
> easier? Thanks.


Choosing one encoding out of three is obviously easier than choosing one out
of several hundred. As far as I know, the only foolproof way is to run a
spell check on the text. Statistical analysis could also work: just count
how often the letters occur and compare that to a known profile for
that language.
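The statistical idea can be sketched in Python as follows. Since Python's standard library does not appear to include a TSCII codec, three legacy Cyrillic encodings stand in as the candidates; the reference pangram, the sample text and all function names are assumptions chosen for illustration, not a tested detector.

```python
from collections import Counter


def letter_profile(reference: str) -> dict:
    """Relative frequency of each character in a reference text of the language."""
    counts = Counter(reference)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}


def score(text: str, profile: dict) -> float:
    """Average profile frequency of the characters in text (higher is more plausible)."""
    if not text:
        return 0.0
    return sum(profile.get(ch, 0.0) for ch in text) / len(text)


def guess_encoding(data: bytes, candidates, profile: dict) -> str:
    """Decode data with every candidate encoding and keep the best-scoring one."""
    return max(candidates,
               key=lambda enc: score(data.decode(enc, errors="replace"), profile))


# Hypothetical inputs: a lowercase Russian pangram as the language profile,
# and three legacy Cyrillic encodings as the candidates.
REFERENCE = "съешь ещё этих мягких французских булок да выпей чаю"
CANDIDATES = ["koi8_r", "cp1251", "cp866"]

profile = letter_profile(REFERENCE)
mystery = "привет мир".encode("koi8_r")

print(guess_encoding(mystery, CANDIDATES, profile))  # koi8_r
```

The wrong candidates lose here because they turn the bytes into uppercase letters or box-drawing characters that never occur in the profile. With real text the margins are much smaller, so a usable detector would need a far larger reference sample and probably letter-pair statistics as well.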
Jul 17 '05 #6

"Chung Leong" <ch***********@hotmail.com> wrote in message news:<Q-********************@comcast.com>...
> "R. Rajesh Jeba Anbiah" <ng**********@rediffmail.com> wrote in message
> news:ab**************************@posting.google.com...
> > Thanks for the info/logic. Though I'm a bit aware of Unicode, this is
> > the first time I'm getting my hands on it... It's kind of a pain, as
> > PHP's Unicode support is broken and strange...
>
> Yeah, Unicode support in PHP is practically non-existent. You can still get
> by, though. More recent versions of PHP support Unicode character classes in
> regular expressions (with the u modifier), so you can do things like
> /([\x{0900}-\x{09FF}]+)/u.
>
> UTF-8 is in general rather tricky to work with. For example, you can't limit
> the byte length of text entered by users using just the maxlength attribute in
> HTML, because it counts characters rather than bytes. And when a database width
> constraint chops off some UTF-8 text in mid-character, all sorts of funky
> things happen in the browser.

Thanks a lot for your comments and help. As you said, UTF-8 behaves
rather strangely; if we include UTF-8 text from other files, it works
differently than expected. Anyway, we can somehow get it to work.

> My advice is not to use Unicode unless you have to. I am not familiar with
> the Tamil script, but I have done a lot of work with Hindi. Most Hindi
> websites do not use Unicode (e.g. www.webdunia.com), because Unicode Hindi
> text requires rendering support from the operating system, which essentially
> limits you to Windows/IE only.


Yeah, I understand. But for Tamil, staying away from Unicode may not
help much, as many people are moving towards it. The reason is probably
that many people here use only Windows/IE. <OT>BTW, is www.webdunia.com
your work? :-)</OT>

> > I'll be interested to try both. Are you hinting that at least one is
> > easier? Thanks.
>
> Choosing one encoding out of three is obviously easier than choosing one out
> of several hundred. As far as I know, the only foolproof way is to run a
> spell check on the text. Statistical analysis could also work: just count
> how often the letters occur and compare that to a known profile for
> that language.


In Tamil, some characters never start a word (unless someone made
a typo). I had thought of using such grammar rules if there is no
direct way to detect the encoding. Thanks a lot for your help.
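A rough Python sketch of such a rule-based check follows. The post does not say which Tamil letters are meant, so the sketch substitutes a simpler rule that holds for any Unicode text: a combining mark (such as a Tamil vowel sign or the pulli) cannot be the first character of a word. The function names and the sample strings are made up for illustration.

```python
import unicodedata


def bad_word_starts(text: str) -> int:
    """Count words whose first character is a combining mark (category M*)."""
    return sum(1 for word in text.split()
               if unicodedata.category(word[0]).startswith("M"))


def most_plausible(decodings: dict) -> str:
    """Return the label of the candidate decoding with the fewest violations."""
    return min(decodings, key=lambda label: bad_word_starts(decodings[label]))


candidates = {
    # Vowel signs follow their consonants, as in real text.
    "good": "தமிழ் மொழி",
    # A vowel sign (U+0BBF) and a pulli (U+0BCD) placed at the start of words.
    "bad": "\u0BBF\u0BA4\u0BAE \u0BCD\u0BB4",
}

print(bad_word_starts(candidates["good"]))  # 0
print(bad_word_starts(candidates["bad"]))   # 2
print(most_plausible(candidates))           # good
```

In practice each candidate string would be the same byte sequence converted to Unicode under a different assumed encoding, and real grammar rules could be added as extra checks.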

--
| Just another PHP saint |
Email: rrjanbiah-at-Y!com
Jul 17 '05 #7

"R. Rajesh Jeba Anbiah" <ng**********@rediffmail.com> wrote in message
news:ab**************************@posting.google.com...
> Yeah, I understand. But for Tamil, staying away from Unicode may not
> help much, as many people are moving towards it. The reason is probably
> that many people here use only Windows/IE. <OT>BTW, is www.webdunia.com
> your work? :-)</OT>


If that's the case then Unicode is definitely the preferred route. With
Hindi, for some reason a lot of people are still using hack encodings. Just
the other day I had to re-type a whole bunch of stuff, and I don't know a
word of Hindi. I don't work for webdunia.com, but my project makes use of
their content. Let me tell you, converting their custom encoding into Unicode
was quite a challenge.
Jul 17 '05 #8

"Chung Leong" <ch***********@hotmail.com> wrote in message news:<X9********************@comcast.com>...
> "R. Rajesh Jeba Anbiah" <ng**********@rediffmail.com> wrote in message
> news:ab**************************@posting.google.com...
> > Yeah, I understand. But for Tamil, staying away from Unicode may not
> > help much, as many people are moving towards it. The reason is probably
> > that many people here use only Windows/IE. <OT>BTW, is www.webdunia.com
> > your work? :-)</OT>
>
> If that's the case then Unicode is definitely the preferred route. With
> Hindi, for some reason a lot of people are still using hack encodings. Just
> the other day I had to re-type a whole bunch of stuff, and I don't know a
> word of Hindi. I don't work for webdunia.com, but my project makes use of
> their content. Let me tell you, converting their custom encoding into Unicode
> was quite a challenge.


Thanks, Chung, for your comments and help. AFAIK, webdunia uses the Dunia
encoding, and here is a Dunia to Unicode map
<http://crl.nmsu.edu/~mleisher/naicode.html> and a Perl script
<http://crl.nmsu.edu/~mleisher/nai2ucs.pl> (in case you want them).
Thanks.

--
| Just another PHP saint |
Email: rrjanbiah-at-Y!com
Jul 17 '05 #9
