Converting UTF-16 encoded chars in querystring to unicode

Supratim

Hi,
For the past few weeks I have been working on a function that takes
encoded Unicode characters from the query string of HTTP requests and
decodes them back to Unicode code points.
I have had full success with UTF-8 encoding, but UTF-16 is where I
stumble. Can somebody help me with the following example, which
puzzles me:

%B7%C9 is UTF-16 encoded version of unicode 98DE (39134 in decimal)

But looking at the decoding algorithm given in RFC 2781 for UTF-16,
I don't understand how it is decoded to 98DE!
The algorithm says that if W1, the first 2 bytes (B7C9), is less than
D800, then the character value is the value of W1. If so, the
Unicode value should be B7C9 (47049 in decimal), or C9B7 (51639 in
decimal) in the case of LE, and both are wrong.
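
For reference, the RFC 2781 decoding rule described above can be sketched in Python as below. This is only an illustration: the function name decode_utf16_units is made up here, and the input is assumed to be a list of 16-bit integers already assembled from the bytes. With either byte order the result is B7C9 or C9B7, never 98DE, which suggests the bytes are not UTF-16 in the first place.

```python
def decode_utf16_units(units):
    """Decode 16-bit code units to code points, following RFC 2781 section 2.2."""
    code_points = []
    i = 0
    while i < len(units):
        w1 = units[i]
        # Outside the surrogate range: the code unit is the character value.
        if w1 < 0xD800 or w1 > 0xDFFF:
            code_points.append(w1)
            i += 1
            continue
        # A sequence must start with a high surrogate (D800..DBFF).
        if not 0xD800 <= w1 <= 0xDBFF:
            raise ValueError("unexpected low surrogate at index %d" % i)
        # It must be followed by a low surrogate (DC00..DFFF).
        if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
            raise ValueError("high surrogate without low surrogate at index %d" % i)
        w2 = units[i + 1]
        # Combine the low 10 bits of each unit and add 0x10000.
        code_points.append((((w1 & 0x3FF) << 10) | (w2 & 0x3FF)) + 0x10000)
        i += 2
    return code_points


raw = bytes([0xB7, 0xC9])
big_endian_unit = int.from_bytes(raw, "big")        # 0xB7C9
little_endian_unit = int.from_bytes(raw, "little")  # 0xC9B7

print(hex(decode_utf16_units([big_endian_unit])[0]))     # 0xb7c9
print(hex(decode_utf16_units([little_endian_unit])[0]))  # 0xc9b7
```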

Can anyone help me with this puzzle and tell me how the following
string in the query string can be decoded?
%B7%C9%C0%FB%C6%D6
Thanks a lot for helping,
Supratim

Jul 20 '05 #1


Andreas Prilop

su******@sagemetrics.com (Supratim) wrote:

> %B7%C9 is UTF-16 encoded version of unicode 98DE (39134 in decimal)

You are confused and you are confusing us. Please tell us which
character(s) you have in mind.
<http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=98de>
<http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=b7c9>
<http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=c9b7>

And read <http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html>

--
Top-posting.
What's the most irritating thing on Usenet?

Jul 20 '05 #2

Supratim

OK Andreas,
I realize two things now, after more study on the subject :
1. I was indeed confused about the encoding format when I asked the
question.
2. That I still am not clear about the whole thing.

So here is my question again in specifics:

I have web logs that indicate query strings of search engines. They
are encoded by the server if they contain Unicode characters. What I
am trying to do is get back the original Unicode characters from the
encoded form. Here are three examples:
a) %E3%83%95%E3%82%A3%E3%83%AA%E3%83%83%E3%83%97%E3%82%B9
This is UTF-8 encoded (I can tell by looking at it), so I
correctly decoded it to:
&#12501&#12451&#12522&#12483&#12503&#12473
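
A minimal Python sketch of the decoding in example a), assuming the string really is percent-encoded UTF-8: unquote_to_bytes turns the %XX escapes back into raw bytes, and the bytes are then decoded strictly. The helper name percent_decode_to_code_points is made up for illustration.

```python
from urllib.parse import unquote_to_bytes


def percent_decode_to_code_points(encoded, encoding):
    """Undo the %XX escapes, decode the bytes, and return the code points."""
    raw = unquote_to_bytes(encoded)
    # Strict decoding: raises UnicodeDecodeError if the bytes do not fit.
    text = raw.decode(encoding)
    return [ord(ch) for ch in text]


example_a = "%E3%83%95%E3%82%A3%E3%83%AA%E3%83%83%E3%83%97%E3%82%B9"
print(percent_decode_to_code_points(example_a, "utf-8"))
# [12501, 12451, 12522, 12483, 12503, 12473]
```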

b) %83t%83B%83%8A%83b%83v%83X
I have no clue what this one is, how it is encoded, or how to
decode it. The only thing I know is that it is supposed to be some
Japanese word.

c) %B7%C9%C0%FB%C6%D1
This I guess is GB2312 encoding. The output should be
&#39134&#21033&#28006. I still don't know how to decode it though.
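
If the GB2312 guess is right, the same percent-decoding step followed by a GB2312 decode should do it. The sketch below uses Python's built-in "gb2312" codec on the string exactly as posted in c), and also shows that a strict UTF-8 decode rejects these bytes; the variable names are illustrative only.

```python
from urllib.parse import unquote_to_bytes

# Example c) exactly as posted above.
raw = unquote_to_bytes("%B7%C9%C0%FB%C6%D1")

# A strict UTF-8 decode fails, because 0xB7 cannot start a UTF-8 sequence.
try:
    raw.decode("utf-8")
    is_valid_utf8 = True
except UnicodeDecodeError:
    is_valid_utf8 = False

# GB2312 uses two bytes per Chinese character here, so six bytes
# become three characters.
code_points = [ord(ch) for ch in raw.decode("gb2312")]

print(is_valid_utf8)
print(code_points)
```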
So here is my grand question:
1. Is there a way (algorithm/already available function) that I can
use to
a) determine what type of encoding it is, for all such
encodings.
b) decode it to get the Unicode characters.

I hope I am able to express myself more clearly this time.
Thanks for helping me out
Supratim
Andreas Prilop <nh******@rrzn-user.uni-hannover.de> wrote in message news:<220120042354363258%nh******@rrzn-user.uni-hannover.de>...

> su******@sagemetrics.com (Supratim) wrote:
>> %B7%C9 is UTF-16 encoded version of unicode 98DE (39134 in decimal)
>
> You are confused and you are confusing us. Please tell us which
> character(s) you have in mind.
> <http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=98de>
> <http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=b7c9>
> <http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=c9b7>
>
> And read <http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html>

Jul 20 '05 #3

Andreas Prilop

su******@sagemetrics.com (Supratim) wrote:

> I have web logs that indicate query strings of search engines.

Search engines include the applied encoding in their URLs.
See <http://www.unics.uni-hannover.de/nhtcapri/#search_engines>
<http://www.unics.uni-hannover.de/nhtcapri/arabic.html#search_engines>
and following pages for some examples.

> b) %83t%83B%83%8A%83b%83v%83X

<http://google.com/search?q=%83t%83B%83%8A%83b%83v%83X>
<http://google.com/search?q=%83t%83B%83%8A%83b%83v%83X&ie=Shift_JIS&oe=UTF-8>

> c) %B7%C9%C0%FB%C6%D1

<http://google.com/search?q=%B7%C9%C0%FB%C6%D1>
<http://google.com/search?q=%B7%C9%C0%FB%C6%D1&ie=GB2312&oe=UTF-8>

> 1. Is there a way (algorithm/already available function) that I can
> use to
> a) determine what type of encoding it is, for all such
> encodings.

As stated above, the query string probably includes the encoding, e.g.
"cs=cp932" with AllTheWeb
"enc=cp932" with AltaVista
"ie=Shift_JIS" with Google

> b) decode it to get the Unicode characters.

Various programs exist to convert between different encodings;
depends on your operating system.
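
As one possible illustration in Python (only a sketch, not the only way): look for one of the encoding parameters named above ("ie", "cs", "enc"), then decode the search term with that encoding. The function name decode_search_query and the UTF-8 fallback are assumptions made for this example, and so is the search-term parameter name "q", which is taken from the Google URLs above and may differ for other engines.

```python
from urllib.parse import urlsplit, unquote_to_bytes

# Parameter names that carry the input encoding, as listed above.
ENCODING_PARAMS = ("ie", "cs", "enc")


def decode_search_query(url, query_param="q", default_encoding="utf-8"):
    """Return the search term of a logged URL as a Unicode string."""
    raw_params = {}
    for pair in urlsplit(url).query.split("&"):
        name, _, value = pair.partition("=")
        # '+' stands for a space in form-encoded query strings.
        raw_params[name] = unquote_to_bytes(value.replace("+", " "))

    # Assumption: fall back to UTF-8 when no encoding parameter is present.
    encoding = default_encoding
    for name in ENCODING_PARAMS:
        if name in raw_params:
            encoding = raw_params[name].decode("ascii")
            break

    # Strict decoding: raises UnicodeDecodeError on a mismatch.
    return raw_params[query_param].decode(encoding)


url_b = "http://google.com/search?q=%83t%83B%83%8A%83b%83v%83X&ie=Shift_JIS&oe=UTF-8"
print([ord(ch) for ch in decode_search_query(url_b)])
# [12501, 12451, 12522, 12483, 12503, 12473]
```

Without such a parameter the encoding can only be guessed, since the same bytes may be valid in more than one encoding.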

Did you read
<http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html> ?

I think we are in the wrong group, BTW.

--
Top-posting.
What's the most irritating thing on Usenet?

Jul 20 '05 #4
