
Convert Encoding from Shift-JIS to UTF-8

I am trying to convert some Japanese text encoded as Shift-JIS/ISO-2022-JP
to UTF-8 so I can store all data in my database with a common encoding.

My problem is that the encoding conversion code works for Japanese characters
encoded as "iso-2022-jp" but not for those encoded as "shift-jis".

What looked straightforward is proving less so. My test code looks like
this:

<%@ Page Language="C#"%>

<script language="C#" runat="server">
////////////////////////////////////////////////////////////////////////////
public void Page_Load()
////////////////////////////////////////////////////////////////////////////
{
    string S = Request.Form["text"];

    Encoding SourceEncoding = Encoding.GetEncoding( "shift-jis" );
    Encoding TargetEncoding = Encoding.UTF8;

    Response.Write( SourceEncoding.GetString( TargetEncoding.GetBytes( S ) ) );
}
</script>

Thanks in advance


Nov 16 '05 #1
5 Replies


DbNetLink <robin@____dbnetlink.co.uk> wrote:
> I am trying to convert some Japanese text encoded as Shift-JIS/ISO-2022-JP
> to UTF-8 so I can store all data in my database with a common encoding.


There's something wrong here. The request value is a Unicode string -
all strings are Unicode in .NET. Any encoding has already been taken
into account. You should be able to just write the string to the
database without any change.

See http://www.pobox.com/~skeet/csharp/unicode.html
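
To make the point above concrete, here is a minimal sketch (not code from the thread) of where an encoding actually applies: at the boundary between bytes and strings. The `TranscodeSketch` class and `Transcode` helper are hypothetical names, and the sample bytes are the Shift-JIS encoding of 日本. It assumes .NET Core / .NET 5+, where code-page encodings have to be registered first; on .NET Framework that registration line should not be needed.

```csharp
using System;
using System.Text;

public static class TranscodeSketch
{
    // Hypothetical helper: decode bytes with their real encoding, then re-encode.
    public static byte[] Transcode(byte[] input, Encoding source, Encoding target)
    {
        string text = source.GetString(input); // bytes -> UTF-16 string
        return target.GetBytes(text);          // UTF-16 string -> bytes
    }

    public static void Main()
    {
        // Assumption: running on .NET Core / .NET 5+, where code-page encodings
        // such as Shift-JIS are opt-in. Remove this line on .NET Framework.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding shiftJis = Encoding.GetEncoding("shift_jis");
        byte[] sjisBytes = { 0x93, 0xFA, 0x96, 0x7B }; // "日本" in Shift-JIS

        byte[] utf8Bytes = Transcode(sjisBytes, shiftJis, Encoding.UTF8);
        Console.WriteLine(BitConverter.ToString(utf8Bytes)); // E6-97-A5-E6-9C-AC
    }
}
```

The built-in `Encoding.Convert(srcEncoding, dstEncoding, bytes)` performs the same two steps. Either way, the conversion starts from bytes; a value that is already a `string`, such as `Request.Form["text"]`, has no Shift-JIS or UTF-8 form left to convert.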

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Nov 16 '05 #2

Is that true even if the web page transmitting the form has had its
encoding set to "shift-jis"?

When you say Unicode, I assume you mean UTF-16?

If that is the case, I would expect to be able to convert
the page like this:

////////////////////////////////////////////////////////////////////////////
public void Page_Load()
////////////////////////////////////////////////////////////////////////////
{
    string S = Request.Form["text"];

    Encoding SourceEncoding = Encoding.Unicode;
    Encoding TargetEncoding = Encoding.UTF8;

    Response.Write( SourceEncoding.GetString( TargetEncoding.GetBytes( S ) ) );
}

But this does not appear to work as I would expect either.

Nov 16 '05 #3

DbNetLink <robin@____dbnetlink.co.uk> wrote:
> Is that true even if the web page transmitting the form has had its
> encoding set to "shift-jis"?
>
> When you say Unicode, I assume you mean UTF-16?

Yes.

> If that is the case, I would expect to be able to convert
> the page like this:
>
> public void Page_Load()
> {
>     string S = Request.Form["text"];
>
>     Encoding SourceEncoding = Encoding.Unicode;
>     Encoding TargetEncoding = Encoding.UTF8;
>
>     Response.Write( SourceEncoding.GetString( TargetEncoding.GetBytes( S ) ) );
> }
>
> But this does not appear to work as I would expect either.


No, that shouldn't work. That's trying to use the UTF-8 encoding of a
string as if it were the UTF-16 (Encoding.Unicode) encoding of a string.

If you want the UTF-8 encoded bytes, just use Encoding.UTF8.GetBytes(S)

Did you read the page I linked to?

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Nov 16 '05 #4

>> If you want the UTF-8 encoded bytes, just use Encoding.UTF8.GetBytes(S)

Is that not what I am doing in the line:

Response.Write( SourceEncoding.GetString( TargetEncoding.GetBytes( S ) ) );

given the earlier line:

Encoding TargetEncoding = Encoding.UTF8;

I did read the link but was unable to relate it directly to my problem of
converting one encoding to another using .NET.

If it is simply down to an error in my code, perhaps you could point it out,
as I have already spent 2 days trying to understand what I am doing wrong
and would love to be put out of my misery :(
Thanks for your help BTW

Nov 16 '05 #5

DbNetLink <robin@____dbnetlink.co.uk> wrote:
> > If you want the UTF-8 encoded bytes, just use Encoding.UTF8.GetBytes(S)
>
> Is that not what I am doing in the line:
>
> Response.Write( SourceEncoding.GetString( TargetEncoding.GetBytes( S ) ) );


No. You're converting the string into UTF-8, but then using the result
as if it were a valid shift-jis-encoded byte array.
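
As an illustration of that mismatch, the following sketch reproduces the pattern from the question in isolation. The `MismatchSketch` class, the `EncodeThenDecode` helper, and the sample text are assumptions made for the example; the registration line assumes .NET Core / .NET 5+ and should not be needed on .NET Framework.

```csharp
using System;
using System.Text;

public static class MismatchSketch
{
    // Same shape as the line in the question: encode with one encoding,
    // then decode the resulting bytes with another.
    public static string EncodeThenDecode(string text, Encoding encodeWith, Encoding decodeWith)
    {
        return decodeWith.GetString(encodeWith.GetBytes(text));
    }

    public static void Main()
    {
        // Assumption: .NET Core / .NET 5+; remove on .NET Framework.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding shiftJis = Encoding.GetEncoding("shift_jis");

        string original = "\u65E5\u672C"; // the two characters of "日本"

        // UTF-8 bytes read back as if they were Shift-JIS: the text is garbled.
        string garbled = EncodeThenDecode(original, Encoding.UTF8, shiftJis);
        Console.WriteLine(garbled == original); // False

        // The same encoding on both sides round-trips cleanly.
        string intact = EncodeThenDecode(original, Encoding.UTF8, Encoding.UTF8);
        Console.WriteLine(intact == original); // True
    }
}
```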
> Given the earlier line:
>
> Encoding TargetEncoding = Encoding.UTF8;
>
> I did read the link but was unable to relate it directly to my problem of
> converting one encoding to another using .NET.

It gives the fundamentals, which should explain why the line of code at
the top is a really bad idea.

> If it is simply down to an error in my code, perhaps you could point it out,
> as I have already spent 2 days trying to understand what I am doing wrong
> and would love to be put out of my misery :(


You should just be able to use the string, without venturing into
encodings at all.

If that's not working, you need to work through it step by step - see
http://www.pobox.com/~skeet/csharp/d...ngunicode.html
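
One possible first step when working through it, sketched here as an assumption rather than taken from the linked page, is to print the UTF-16 code units of the string exactly as the server received it. If the values match the expected Japanese characters, the string is already correct and any corruption happens later (for example when writing to the database or the response). The `UnicodeDump` class and `DumpCodeUnits` helper are hypothetical names.

```csharp
using System;
using System.Text;

public static class UnicodeDump
{
    // Hypothetical helper: list each UTF-16 code unit of a string as 4-digit hex.
    public static string DumpCodeUnits(string text)
    {
        var sb = new StringBuilder();
        foreach (char c in text)
        {
            if (sb.Length > 0) sb.Append(' ');
            sb.Append(((int)c).ToString("X4"));
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // In the page from the question this would be Request.Form["text"].
        string received = "\u65E5\u672C";
        Console.WriteLine(DumpCodeUnits(received)); // 65E5 672C
    }
}
```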

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Nov 16 '05 #6
