OK Andreas,
I realize two things now, after studying the subject further:
1. I was indeed confused about the encoding format when I asked the
question.
2. I am still not clear about the whole thing.
So here is my question again, in more specific terms:
I have web logs that contain the query strings sent to search engines.
They are encoded by the server (as %XX escapes) if they contain Unicode
characters. What I am trying to do is get back the original Unicode
characters from the encoded form. Here are three examples:
a) %E3%83%95%E3%82%A3%E3%83%AA%E3%83%83%E3%83%97%E3%82%B9
This is UTF-8 encoded (I can tell by looking at it), and I decoded
it correctly to:
フィリップス
b) %83t%83B%83%8A%83b%83v%83X
I have no clue what this one is, how it is encoded, or how to
decode it. The only thing I know is that it is supposed to be some
Japanese word.
c) %B7%C9%C0%FB%C6%D1
I guess this is GB2312 encoding. The output should be
飞利浦, but I still don't know how to decode it.
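To make the three cases concrete, here is a minimal Python 3 sketch of
the two steps involved: undo the %XX escaping to get raw bytes, then
interpret those bytes in a given charset. The helper name decode_query
is made up for illustration. The charset for (b) is an assumption: the
bytes look like Shift_JIS, and decoding them that way gives the same
フィリップス as (a). The charset for (c) follows the GB2312 guess above;
whether its output is exactly the expected word should be checked
against the real data.

```python
from urllib.parse import unquote_to_bytes

def decode_query(encoded: str, charset: str) -> str:
    """Undo the %XX escaping, then interpret the bytes in `charset`."""
    raw = unquote_to_bytes(encoded)   # '%E3%83%95' -> b'\xe3\x83\x95'
    return raw.decode(charset)        # raises UnicodeDecodeError on a mismatch

# (encoded string, assumed charset) for examples a), b) and c)
examples = [
    ("%E3%83%95%E3%82%A3%E3%83%AA%E3%83%83%E3%83%97%E3%82%B9", "utf-8"),
    ("%83t%83B%83%8A%83b%83v%83X", "shift_jis"),   # assumption: Shift_JIS
    ("%B7%C9%C0%FB%C6%D1", "gb2312"),              # assumption: GB2312
]

decoded = [decode_query(encoded, charset) for encoded, charset in examples]
```

Literal characters such as the "t" and "B" in (b) are passed through
as their ASCII byte values, which is why they can form the second byte
of a two-byte Shift_JIS character.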
So here is my grand question:
Is there a way (an algorithm or an already available function) that I
can use to
a) determine which encoding was used, for all such encodings, and
b) decode it to get the Unicode characters?
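One possible approach to (a) and (b) together is trial decoding: try a
list of candidate charsets in order and keep the first one that decodes
without error. This is only a heuristic sketch, not a reliable
detector. The candidate list, its order and the name guess_and_decode
are assumptions made for illustration. UTF-8 goes first because random
legacy bytes rarely form valid UTF-8, but the legacy multi-byte
charsets overlap, so a clean decode does not prove the guess is right.

```python
from urllib.parse import unquote_to_bytes

# Hypothetical candidate list; the order is a guess, strictest first.
CANDIDATES = ("utf-8", "gb2312", "shift_jis")

def guess_and_decode(encoded, candidates=CANDIDATES):
    """Return (charset, text) for the first candidate that decodes
    cleanly, or (None, None) if every candidate fails."""
    raw = unquote_to_bytes(encoded)   # undo the %XX escaping
    for charset in candidates:
        try:
            return charset, raw.decode(charset)
        except UnicodeDecodeError:
            continue                  # not valid in this charset, try next
    return None, None

charset, text = guess_and_decode("%83t%83B%83%8A%83b%83v%83X")
```

With this candidate order the three examples above come out as utf-8,
shift_jis and gb2312 respectively. Short strings can decode cleanly in
more than one charset, so extra hints (the search engine's domain, a
charset parameter in the URL, the language of the site) are worth
using wherever the logs provide them.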
I hope I have expressed myself more clearly this time.
Thanks for helping me out
Supratim
Andreas Prilop <nh******@rrzn-user.uni-hannover.de> wrote in message news:<220120042354363258%nh******@rrzn-user.uni-hannover.de>...
su******@sagemetrics.com (Supratim) wrote:
%B7%C9 is UTF-16 encoded version of unicode 98DE (39134 in decimal)
You are confused and you are confusing us. Please tell us which
character(s) you have in mind.
<http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=98de>
<http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=b7c9>
<http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=c9b7>
And read <http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html>