Convert Encoding from Shift-JIS to UTF-8

I am trying to convert some Japanese text encoded as Shift-JIS/ISO-2022-JP
to UTF-8 so I can store all data in my database with a common encoding.

My problem is that the encoding conversion code works for Japanese characters
encoded as "iso-2022-jp" but not for "shift-jis".

What looked straightforward is proving less so. My test code looks like
this:

<%@ Page Language="C#"%>

<script language="C#" runat="server">
////////////////////////////////////////////////////////////////////////////
public void Page_Load()
////////////////////////////////////////////////////////////////////////////
{
    string S = Request.Form["text"];

    Encoding SourceEncoding = Encoding.GetEncoding( "shift-jis" );
    Encoding TargetEncoding = Encoding.UTF8;

    Response.Write( SourceEncoding.GetString( TargetEncoding.GetBytes( S ) ) );
}
</script>

Thanks in advance


Nov 16 '05 #1
DbNetLink <robin@____dbnetlink.co.uk> wrote:
> I am trying to convert some Japanese text encoded as Shift-JIS/ISO-2022-JP
> to UTF-8 so I can store all data in my database with a common encoding.


There's something wrong here. The request value is a Unicode string -
all strings are Unicode in .NET. Any encoding has already been taken
into account. You should be able to just write the string to the
database without any change.

See http://www.pobox.com/~skeet/csharp/unicode.html
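
To make the point concrete, here is a minimal sketch of what a conversion looks like when you really do start from raw Shift-JIS bytes (for example, read from a file) rather than from a string. The class and method names are hypothetical, and the `RegisterProvider` call is an assumption about the runtime: it is only needed on .NET Core / .NET 5+, where code page encodings are opt-in, and is not needed on the classic .NET Framework.

```csharp
using System;
using System.Text;

public static class TranscodeSketch
{
    // Hypothetical helper: decode Shift-JIS bytes into a .NET string
    // (UTF-16 internally), then encode that string as UTF-8 bytes.
    public static byte[] ShiftJisToUtf8(byte[] shiftJisBytes)
    {
        // Assumption: running on .NET Core / .NET 5+, where code page
        // encodings such as Shift-JIS must be registered before use.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding shiftJis = Encoding.GetEncoding("shift_jis");
        string text = shiftJis.GetString(shiftJisBytes); // bytes -> string
        return Encoding.UTF8.GetBytes(text);             // string -> bytes
    }

    public static void Main()
    {
        // 93 FA 96 7B is the Shift-JIS byte sequence for U+65E5 U+672C.
        byte[] input = { 0x93, 0xFA, 0x96, 0x7B };
        byte[] utf8 = ShiftJisToUtf8(input);
        Console.WriteLine(BitConverter.ToString(utf8)); // E6-97-A5-E6-9C-AC
    }
}
```

The direction matters: `GetString` is always called on the encoding the bytes are already in, and `GetBytes` on the encoding you want to produce. `Encoding.Convert(shiftJis, Encoding.UTF8, bytes)` performs the same two steps in one call.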

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Nov 16 '05 #2
Is that true even if the web page transmitting the form has had its
encoding set to "shift-jis"?

When you say Unicode I am assuming this means UTF-16?

Assuming that were true then I would therefore expect to be able to convert
the page like this:

////////////////////////////////////////////////////////////////////////////
public void Page_Load()
////////////////////////////////////////////////////////////////////////////
{
    string S = Request.Form["text"];

    Encoding SourceEncoding = Encoding.Unicode;
    Encoding TargetEncoding = Encoding.UTF8;

    Response.Write( SourceEncoding.GetString( TargetEncoding.GetBytes( S ) ) );
}

But this does not appear to work as I would expect either.


Nov 16 '05 #3
DbNetLink <robin@____dbnetlink.co.uk> wrote:
> Is that true even if the web page transmitting the form has had its
> encoding set to "shift-jis"?
>
> When you say Unicode I am assuming this means UTF-16?

Yes.

> Assuming that were true then I would therefore expect to be able to convert
> the page like this:
>
> ////////////////////////////////////////////////////////////////////////////
> public void Page_Load()
> ////////////////////////////////////////////////////////////////////////////
> {
>     string S = Request.Form["text"];
>
>     Encoding SourceEncoding = Encoding.Unicode;
>     Encoding TargetEncoding = Encoding.UTF8;
>
>     Response.Write( SourceEncoding.GetString( TargetEncoding.GetBytes( S ) ) );
> }
>
> But this does not appear to work as I would expect either.


No, that shouldn't work. That's trying to use the UTF-8 encoding of a
string as if it were a UTF-16 ("Unicode") encoding of a string.

If you want the UTF-8 encoded bytes, just use Encoding.UTF8.GetBytes(S).
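
A small sketch of why the nested call fails, assuming nothing beyond the standard `System.Text` encodings (the class name is hypothetical). Bytes produced by one encoding only turn back into the original string when they are decoded with that same encoding:

```csharp
using System;
using System.Text;

public static class MismatchSketch
{
    public static void Main()
    {
        // Three Japanese characters, written as escapes so the example
        // does not depend on the source file's own encoding.
        string s = "\u65E5\u672C\u8A9E";

        // This byte array is the UTF-8 form of the string; nothing more
        // is needed to "convert to UTF-8".
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(s);
        Console.WriteLine(utf8Bytes.Length);     // 9

        // Decoding with the same encoding restores the original string.
        string same = Encoding.UTF8.GetString(utf8Bytes);
        Console.WriteLine(same == s);            // True

        // Decoding with a different encoding (UTF-16 here, as in the
        // Page_Load code) misreads the bytes and yields garbage.
        string garbled = Encoding.Unicode.GetString(utf8Bytes);
        Console.WriteLine(garbled == s);         // False
    }
}
```

The same reasoning applies when the mismatched decoder is Shift-JIS instead of UTF-16.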

Did you read the page I linked to?

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Nov 16 '05 #4
>> If you want the UTF-8 encoded bytes, just use Encoding.UTF8.GetBytes(S)

Is that not what I am doing in the line:

Response.Write( SourceEncoding.GetString( TargetEncoding.GetBytes( S ) ) );

Given the earlier line:

Encoding TargetEncoding = Encoding.UTF8;

I did read the link but was unable to relate it directly to my problem of
converting one encoding to another using .NET.

If it is simply down to an error in my code perhaps you could point it out,
as I have already spent 2 days trying to understand what I am doing wrong
and would love to be put out of my misery :(

Thanks for your help BTW.

Nov 16 '05 #5
DbNetLink <robin@____dbnetlink.co.uk> wrote:
> > If you want the UTF-8 encoded bytes, just use Encoding.UTF8.GetBytes(S)
>
> Is that not what I am doing in the line:
>
> Response.Write( SourceEncoding.GetString( TargetEncoding.GetBytes( S ) ) );

No. You're converting the string into UTF-8, but then using the result
as if it were a valid shift-jis-encoded byte array.

> Given the earlier line:
>
> Encoding TargetEncoding = Encoding.UTF8;
>
> I did read the link but was unable to relate it directly to my problem of
> converting one encoding to another using .NET.

It gives the fundamentals, which should explain why the line of code at
the top is a really bad idea.

> If it is simply down to an error in my code perhaps you could point it out,
> as I have already spent 2 days trying to understand what I am doing wrong
> and would love to be put out of my misery :(


You should just be able to use the string, without venturing into
encodings at all.

If that's not working, you need to work through it step by step - see
http://www.pobox.com/~skeet/csharp/d...ngunicode.html
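
One way to start working through it step by step is to look at what the string actually contains before it goes anywhere near the database or the browser. The helper below is a hypothetical sketch, not code from the linked page: it prints each UTF-16 code unit as a hexadecimal value, so the result does not depend on fonts or on the output encoding.

```csharp
using System;
using System.Text;

public static class UnicodeDump
{
    // Hypothetical helper: render every UTF-16 code unit in the string
    // as U+XXXX, separated by spaces.
    public static string Dump(string s)
    {
        StringBuilder sb = new StringBuilder();
        foreach (char c in s)
        {
            if (sb.Length > 0) sb.Append(' ');
            sb.Append("U+").Append(((int)c).ToString("X4"));
        }
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Dump("\u65E5\u672C")); // U+65E5 U+672C
    }
}
```

If the values dumped for `Request.Form["text"]` are the expected code points, the string was decoded correctly and the problem lies later, in storage or output. If they are not, the request was decoded with the wrong encoding before your code ever saw it.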

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Nov 16 '05 #6
