
Converting UTF-16 encoded chars in querystring to unicode

Hi,
For the past few weeks I have been working on a function that takes
encoded Unicode characters from the query string of HTTP requests and
decodes them back to Unicode code points.
I have had full success with UTF-8 encoding, but UTF-16 is where I
stumble. Can somebody help me with the following example, which
puzzles me:

%B7%C9 is the UTF-16 encoded version of Unicode 98DE (39134 in decimal)

But looking at the decoding algorithm given in RFC 2781 for UTF-16,
I don't understand how it is decoded to 98DE!
The algorithm says that if W1, the first 2 bytes (B7C9), is less than
D800, then the character value is the value of W1. If so, the
Unicode value should be B7C9 (47049 in decimal), or C9B7 (51639 in
decimal) in the case of LE, and both are wrong.
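
The decoding step described above can be sketched in Python as follows. This is a minimal illustration of RFC 2781, section 2.2, not a production decoder; the function names `bytes_to_units` and `decode_utf16_units` are made up for this example. It reproduces the observation in the post: read as UTF-16, the bytes B7 C9 can only give B7C9 or C9B7, never 98DE, which suggests the data is not UTF-16 at all.

```python
def bytes_to_units(data, big_endian=True):
    """Split raw bytes into 16-bit code units (hypothetical helper)."""
    if len(data) % 2:
        raise ValueError("UTF-16 data must have an even number of bytes")
    order = "big" if big_endian else "little"
    return [int.from_bytes(data[i:i + 2], order) for i in range(0, len(data), 2)]


def decode_utf16_units(units):
    """Turn 16-bit code units into code points, following RFC 2781 sec. 2.2."""
    code_points = []
    i = 0
    while i < len(units):
        w1 = units[i]
        # Step 1: outside the surrogate range, W1 is the character value itself.
        if w1 < 0xD800 or w1 > 0xDFFF:
            code_points.append(w1)
            i += 1
            continue
        # Step 2: a surrogate in first position must be a high surrogate.
        if w1 > 0xDBFF:
            raise ValueError("unexpected low surrogate %04X" % w1)
        # Step 3: it must be followed by a low surrogate W2.
        if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
            raise ValueError("high surrogate %04X without low surrogate" % w1)
        w2 = units[i + 1]
        # Steps 4 and 5: combine the low 10 bits of each unit, then add 0x10000.
        code_points.append(0x10000 + ((w1 & 0x3FF) << 10) + (w2 & 0x3FF))
        i += 2
    return code_points


raw = bytes([0xB7, 0xC9])
print([hex(cp) for cp in decode_utf16_units(bytes_to_units(raw, big_endian=True))])
print([hex(cp) for cp in decode_utf16_units(bytes_to_units(raw, big_endian=False))])
```
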

Can anyone help me with this puzzle and tell me how the following
string in the query string can be decoded?
%B7%C9%C0%FB%C6%D6
Thanks a lot for helping,
Supratim
Jul 20 '05 #1
3 Replies


su******@sagemetrics.com (Supratim) wrote:
%B7%C9 is the UTF-16 encoded version of Unicode 98DE (39134 in decimal)


You are confused and you are confusing us. Please tell us which
character(s) you have in mind.
<http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=98de>
<http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=b7c9>
<http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=c9b7>

And read <http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html>

--
Top-posting.
What's the most irritating thing on Usenet?
Jul 20 '05 #2

OK Andreas,
I realize two things now, after more study of the subject:
1. I was indeed confused about the encoding format when I asked the
question.
2. I am still not clear about the whole thing.

So here is my question again in specifics:

I have web logs that record the query strings sent to search engines.
They are encoded by the server if they contain Unicode characters. What
I am trying to do is get back the original Unicode characters from the
encoded form. Here are three examples:
a) %E3%83%95%E3%82%A3%E3%83%AA%E3%83%83%E3%83%97%E3%82%B9
This is UTF-8 encoded (I can tell by looking at it), so I correctly
decoded it to:
&#12501&#12451&#12522&#12483&#12503&#12473

b) %83t%83B%83%8A%83b%83v%83X
I have no clue what this one is, how it is encoded, or how to
decode it. The only thing I know is that it is supposed to be some
Japanese word.

c) %B7%C9%C0%FB%C6%D1
This I guess is GB2312 encoding. The output should be
&#39134&#21033&#28006. I still don't know how to decode it though.
So here is my grand question:
1. Is there a way (algorithm/already available function) that I can
use to
a) determine what type of encoding it is, for all such
encodings.
b) decode it to get the Unicode characters.

I hope I am able to express myself more clearly this time.
Thanks for helping me out
Supratim
Andreas Prilop <nh******@rrzn-user.uni-hannover.de> wrote in message news:<220120042354363258%nh******@rrzn-user.uni-hannover.de>...
su******@sagemetrics.com (Supratim) wrote:
%B7%C9 is the UTF-16 encoded version of Unicode 98DE (39134 in decimal)


You are confused and you are confusing us. Please tell us which
character(s) you have in mind.
<http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=98de>
<http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=b7c9>
<http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=c9b7>

And read <http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html>

Jul 20 '05 #3

su******@sagemetrics.com (Supratim) wrote:
I have web logs that indicate query strings of search engines.
Search engines include the applied encoding in their URLs.
See <http://www.unics.uni-hannover.de/nhtcapri/#search_engines>
<http://www.unics.uni-hannover.de/nhtcapri/arabic.html#search_engines>
and following pages for some examples.
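
As a rough sketch of how that hint could be used when processing logged URLs, the Python below looks for an encoding parameter and then decodes the search term with it. The parameter names `ie`, `cs` and `enc` are the ones this reply lists for Google, AllTheWeb and AltaVista; the function names, and the assumption that the search term lives in a parameter called `q`, are only for illustration. It also assumes the declared encoding name is one Python's codecs recognise.

```python
from urllib.parse import parse_qs, urlsplit

# Parameter names taken from this reply; other engines may use different ones.
ENCODING_PARAMS = ("ie", "cs", "enc")


def declared_encoding(url):
    """Return the encoding named in the query string, or None if absent."""
    query = urlsplit(url).query
    # latin-1 maps every byte to a character, so this first pass cannot fail;
    # the encoding name itself is plain ASCII.
    params = parse_qs(query, encoding="latin-1")
    for name in ENCODING_PARAMS:
        if name in params:
            return params[name][0]
    return None


def decode_search_term(url, term_param="q"):
    """Decode the search term using the encoding the URL declares."""
    encoding = declared_encoding(url)
    if encoding is None:
        return None  # no hint available; the caller has to guess
    # errors="strict" raises UnicodeDecodeError instead of silently
    # replacing bytes that do not fit the declared encoding.
    params = parse_qs(urlsplit(url).query, encoding=encoding, errors="strict")
    values = params.get(term_param)
    return values[0] if values else None


url = "http://google.com/search?q=%83t%83B%83%8A%83b%83v%83X&ie=Shift_JIS&oe=UTF-8"
print(declared_encoding(url))
print(decode_search_term(url))
```

For a URL with no such parameter, `declared_encoding` returns None and the encoding has to be determined some other way.
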
b) %83t%83B%83%8A%83b%83v%83X
<http://google.com/search?q=%83t%83B%83%8A%83b%83v%83X>
<http://google.com/search?q=%83t%83B%83%8A%83b%83v%83X&ie=Shift_JIS&oe=UTF-8>
c) %B7%C9%C0%FB%C6%D1
<http://google.com/search?q=%B7%C9%C0%FB%C6%D1>
<http://google.com/search?q=%B7%C9%C0%FB%C6%D1&ie=GB2312&oe=UTF-8>
1. Is there a way (algorithm/already available function) that I can
use to
a) determine what type of encoding it is, for all such
encodings.
As stated above, the query string probably includes the encoding, e.g.
"cs=cp932" with AllTheWeb
"enc=cp932" with AltaVista
"ie=Shift_JIS" with Google
b) decode it to get the Unicode characters.


Various programs exist to convert between different encodings;
which ones are available depends on your operating system.
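
Python's built-in codecs are one such tool. The sketch below assumes the encoding is already known, undoes the percent-escaping and returns the decimal code points in the form used in the question. The function name `decode_to_code_points` is made up for this example. Only the first two characters of example c) are used, because the thread gives two different final bytes for that string.

```python
from urllib.parse import unquote_to_bytes


def decode_to_code_points(escaped, encoding):
    """Percent-decode `escaped`, decode the bytes, return decimal code points."""
    raw = unquote_to_bytes(escaped)   # "%83t" becomes the bytes 0x83 0x74
    text = raw.decode(encoding)       # strict: raises on bytes that do not fit
    return [ord(ch) for ch in text]


# a) UTF-8
print(decode_to_code_points(
    "%E3%83%95%E3%82%A3%E3%83%AA%E3%83%83%E3%83%97%E3%82%B9", "utf-8"))
# [12501, 12451, 12522, 12483, 12503, 12473]

# b) Shift_JIS: the same Japanese word as in a)
print(decode_to_code_points("%83t%83B%83%8A%83b%83v%83X", "shift_jis"))
# [12501, 12451, 12522, 12483, 12503, 12473]

# c) GB2312, first two characters only
print(decode_to_code_points("%B7%C9%C0%FB", "gb2312"))
# [39134, 21033]
```
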

Did you read
<http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html> ?

I think we are in the wrong group, BTW.

--
Top-posting.
What's the most irritating thing on Usenet?
Jul 20 '05 #4
