Converting UTF-16 encoded chars in querystring to unicode

Hi,
For the past few weeks I have been working on a function that takes
percent-encoded Unicode characters from the query string of HTTP
requests and decodes them back to Unicode code points.
I have had full success with UTF-8 encoding, but it is UTF-16 where I
stumble. Can somebody help me with the following example, which
puzzles me:

%B7%C9 is UTF-16 encoded version of unicode 98DE (39134 in decimal)

But looking at the decoding algorithm given in RFC 2781 for UTF-16,
I don't understand how it is decoded to 98DE!
The algorithm says that if W1, the first 2 bytes (B7C9), is less than
D800, then the character value is the value of W1. If that is so, then
the Unicode value should be B7C9 (47049 in decimal), or C9B7 (51639 in
decimal) in the case of LE, which are both wrong.
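
For readers following along, the arithmetic above can be checked with a few lines of Python. This is only a sketch: it reads the two bytes B7 C9 as a single UTF-16 code unit in both byte orders, applies the RFC 2781 range test, and then, assuming (as is suggested later in this thread) that the bytes are really GB2312 rather than UTF-16, decodes them with Python's built-in `gb2312` codec. The helper names `utf16_unit` and `is_surrogate` are made up for illustration.

```python
import struct
from urllib.parse import unquote_to_bytes

raw = unquote_to_bytes("%B7%C9")  # the two raw bytes 0xB7 0xC9


def utf16_unit(data: bytes, little_endian: bool) -> int:
    """Read the first 16-bit code unit (W1) from data (hypothetical helper)."""
    fmt = "<H" if little_endian else ">H"
    return struct.unpack(fmt, data[:2])[0]


def is_surrogate(w1: int) -> bool:
    """RFC 2781: a W1 outside 0xD800..0xDFFF is the character value itself."""
    return 0xD800 <= w1 <= 0xDFFF


w1_be = utf16_unit(raw, little_endian=False)
w1_le = utf16_unit(raw, little_endian=True)
print("UTF-16BE:", hex(w1_be), w1_be, "surrogate:", is_surrogate(w1_be))
print("UTF-16LE:", hex(w1_le), w1_le, "surrogate:", is_surrogate(w1_le))

# Assumption: the bytes are GB2312, not UTF-16.
gb_char = raw.decode("gb2312")
print("GB2312:  ", hex(ord(gb_char)), ord(gb_char))
```

Neither UTF-16 reading gives 98DE, while the GB2312 reading does, which is consistent with the bytes not being UTF-16 at all.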

Can anyone help me with this puzzle and tell me how the following
string in the query string can be decoded?
%B7%C9%C0%FB%C6%D6
Thanks a lot for helping,
Supratim
Jul 20 '05 #1
su******@sagemetrics.com (Supratim) wrote:
> %B7%C9 is UTF-16 encoded version of unicode 98DE (39134 in decimal)


You are confused and you are confusing us. Please tell us which
character(s) you have in mind.
<http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=98de>
<http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=b7c9>
<http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=c9b7>

And read <http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html>

--
Top-posting.
What's the most irritating thing on Usenet?
Jul 20 '05 #2
OK Andreas,
I realize two things now, after more study of the subject:
1. I was indeed confused about the encoding format when I asked the
question.
2. I am still not clear about the whole thing.

So here is my question again in specifics:

I have web logs that indicate query strings of search engines. They are
encoded by the server if they are Unicode characters. What I am trying
to do is get back the original Unicode characters from the encoded
form. Here are three examples:
a) %E3%83%95%E3%82%A3%E3%83%AA%E3%83%83%E3%83%97%E3%82%B9
This is encoded with UTF-8 (I can tell by looking at it), so I
correctly decoded it to:
&#12501&#12451&#12522&#12483&#12503&#12473

b) %83t%83B%83%8A%83b%83v%83X
For this one I have no clue what it is, how it is encoded, or how to
decode it. The only thing I know is that it is supposed to be a
Japanese word.

c) %B7%C9%C0%FB%C6%D1
This I guess is GB2312 encoding. The output should be
&#39134&#21033&#28006. I still don't know how to decode it though.
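
If the encoding of each string is known, the decoding itself is short in most languages. The Python sketch below is one way to do it, not the only one. It assumes the encodings are UTF-8 for a), Shift_JIS for b) (the encoding named later in this thread) and GB2312 for c), and it uses only the standard library. The function names `decode_query_value` and `code_points` are hypothetical. Example c) is decoded exactly as posted here; note that the first post had %D6 as its last byte where this one has %D1, so the last code point printed may not match the expected 28006.

```python
from urllib.parse import unquote_to_bytes


def decode_query_value(value: str, encoding: str) -> str:
    """Percent-decode to raw bytes, then interpret the bytes in `encoding`."""
    return unquote_to_bytes(value).decode(encoding)


def code_points(text: str) -> list:
    """Return the Unicode code point (as an int) of each character."""
    return [ord(ch) for ch in text]


# Encodings are assumptions taken from this thread, not detected.
examples = [
    ("%E3%83%95%E3%82%A3%E3%83%AA%E3%83%83%E3%83%97%E3%82%B9", "utf-8"),
    ("%83t%83B%83%8A%83b%83v%83X", "shift_jis"),
    ("%B7%C9%C0%FB%C6%D1", "gb2312"),
]

for value, encoding in examples:
    text = decode_query_value(value, encoding)
    # Print in the same numeric-reference style used in this post.
    print(encoding, "".join(f"&#{cp};" for cp in code_points(text)))
```

Under these assumptions, a) and b) decode to the same six characters, which fits b) being the same Japanese word in a different encoding.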
So here is my grand question:
1. Is there a way (algorithm/already available function) that I can
use to
a) determine what type of encoding it is, for all such
encodings.
b) decode it to get the Unicode characters.

I hope I am able to express myself more clearly this time.
Thanks for helping me out
Supratim
Jul 20 '05 #3
su******@sagemetrics.com (Supratim) wrote:
> I have web logs that indicate query strings of search engines.

Search engines include the applied encoding in their URLs.
See <http://www.unics.uni-hannover.de/nhtcapri/#search_engines>
<http://www.unics.uni-hannover.de/nhtcapri/arabic.html#search_engines>
and following pages for some examples.

> b) %83t%83B%83%8A%83b%83v%83X

<http://google.com/search?q=%83t%83B%83%8A%83b%83v%83X>
<http://google.com/search?q=%83t%83B%83%8A%83b%83v%83X&ie=Shift_JIS&oe=UTF-8>

> c) %B7%C9%C0%FB%C6%D1

<http://google.com/search?q=%B7%C9%C0%FB%C6%D1>
<http://google.com/search?q=%B7%C9%C0%FB%C6%D1&ie=GB2312&oe=UTF-8>

> 1. Is there a way (algorithm/already available function) that I can
> use to
> a) determine what type of encoding it is, for all such
> encodings.

As stated above, the query string probably includes the encoding, e.g.
"cs=cp932" with AllTheWeb
"enc=cp932" with AltaVista
"ie=Shift_JIS" with Google

> b) decode it to get the Unicode characters.


Various programs exist to convert between different encodings;
depends on your operating system.
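
In Python, for instance, the two steps (read the declared encoding from the URL, then convert) might look like the sketch below. It assumes the search term is in a parameter named `q`, as in the Google URLs above, and that the encoding is declared in one of the parameters `ie`, `cs` or `enc` mentioned above. The function names and the fallback to UTF-8 when no encoding parameter is present are illustrative choices, not documented behaviour of any search engine.

```python
from urllib.parse import urlsplit, unquote_to_bytes

# Parameter names taken from this thread: Google, AllTheWeb, AltaVista.
ENCODING_PARAMS = ("ie", "cs", "enc")


def raw_params(url: str) -> dict:
    """Split the query string into key/value pairs WITHOUT decoding values."""
    params = {}
    for pair in urlsplit(url).query.split("&"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        params.setdefault(key, value)  # keep the first occurrence of each key
    return params


def decode_search_term(url: str, term_param: str = "q",
                       default: str = "utf-8") -> str:
    """Decode the search term using the encoding the URL declares, if any."""
    params = raw_params(url)
    encoding = default  # assumption: fall back to UTF-8 if nothing is declared
    for name in ENCODING_PARAMS:
        if name in params:
            encoding = params[name]
            break
    raw = unquote_to_bytes(params.get(term_param, "").replace("+", " "))
    # An encoding name Python does not know raises LookupError here.
    return raw.decode(encoding, errors="replace")


url = "http://google.com/search?q=%83t%83B%83%8A%83b%83v%83X&ie=Shift_JIS&oe=UTF-8"
print(decode_search_term(url))
```

The values are split by hand rather than with `urllib.parse.parse_qs` because the raw bytes are needed before any character decoding takes place.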

Did you read
<http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html> ?

I think we are in the wrong group, BTW.

--
Top-posting.
What's the most irritating thing on Usenet?
Jul 20 '05 #4
