Basic Conversion Query

Hi,

I have a string (System.String) which holds some data. This data is
encoded in UTF8 (i.e. anywhere in the string where there should be a
single 'é' character, there will be two characters holding the
equivalent of that character in the UTF8 format).

How can I decode this UTF8-encoded string?

In Delphi I could simply say:

myString = UTF8ToAnsi(myString);

How can I do this using .NET?

I tried making a general-purpose static method to do this:

private static string Utf8ToAscii(string value)
{
    byte[] utf8Bytes = Encoding.UTF8.GetBytes(value);
    byte[] asciiBytes = Encoding.Convert(Encoding.UTF8,
        Encoding.Unicode, utf8Bytes);

    return Encoding.Unicode.GetString(asciiBytes);
}

...but it doesn't work as desired. It just causes each pair of
encoded characters to be replaced with '?' characters.

TIA
Nov 15 '05 #1
Check out the System.Text.UTF8Encoding class:
http://tinyurl.com/kh15

--
Greetz

Jan Tielens
________________________________
Read my weblog: http://weblogs.asp.net/jan
Nov 15 '05 #2
C# Learner <cs****@learner.here> wrote:
I have a string (System.String) which holds some data. This data is
encoded in UTF8 (i.e. anywhere in the string where there should be a
single 'é' character, there will be two characters holding the
equivalent of that character in the UTF8 format).

How can I decode this UTF8-encoded string?


Please see my responses in microsoft.public.dotnet.framework. Your
basic problem is mixing up binary data and character data.

See http://www.pobox.com/~skeet/csharp/unicode.html

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Nov 15 '05 #3
Jon Skeet [C# MVP] <sk***@pobox.com> wrote:
C# Learner <cs****@learner.here> wrote:
I have a string (System.String) which holds some data. This data is
encoded in UTF8 (i.e. anywhere in the string where there should be a
single 'é' character, there will be two characters holding the
equivalent of that character in the UTF8 format).

How can I decode this UTF8-encoded string?


Please see my responses in microsoft.public.dotnet.framework. Your
basic problem is mixing up binary data and character data.

See http://www.pobox.com/~skeet/csharp/unicode.html


I really can't see where I'm mixing up anything with anything. I
simply have a string in UTF8 format. I just want to decode it.

Is it not possible in .NET? I've done this in Delphi without problem.

Thanks for your patient replies.
Nov 15 '05 #4
"Jan Tielens" <ja*@no.spam.please.leadit.be> wrote:
Check out the System.Text.UTF8Encoding class:
http://tinyurl.com/kh15


Hi,

Thanks for the reply, but this doesn't seem to be working:

private static string Utf8ToAscii(string value)
{
    byte[] utf8Bytes = Encoding.UTF8.GetBytes(value);

    System.Text.UTF8Encoding u = new UTF8Encoding();
    return u.GetString(utf8Bytes);
}
Nov 15 '05 #5
C# Learner <cs****@learner.here> wrote:
I really can't see where I'm mixing up anything with anything. I
simply have a string in UTF8 format.


That's mixing things up to start with. "<x> in UTF8 format/encoding" is
only valid where <x> is a sequence of bytes. A string is a sequence of
Unicode characters.

I just want to decode it.


You're looking at the wrong thing though - you need to decode the bytes
you read from the socket, rather than generating character data from
those bytes in some way which you haven't defined and then talking
about that character data as if it were "in UTF8 format".

Is it not possible in .NET? I've done this in Delphi without problem.


I suspect a string in Delphi isn't the same as it is in .NET.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Nov 15 '05 #6
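
A minimal sketch of the distinction drawn above, assuming the bytes arrive as the UTF-8 encoding of "café" (the sample data and the class name are made up for illustration). It contrasts treating each byte as a character with decoding the bytes as UTF-8:

```csharp
using System;
using System.Text;

class DecodeAtTheSource
{
    static void Main()
    {
        // UTF-8 encoding of "café": 'é' (U+00E9) occupies two bytes, 0xC3 0xA9.
        byte[] received = { 0x63, 0x61, 0x66, 0xC3, 0xA9 };

        // Treating each byte as a character yields five chars: c, a, f, Ã, ©.
        char[] chars = new char[received.Length];
        for (int i = 0; i < received.Length; i++)
        {
            chars[i] = (char)received[i];
        }
        string castResult = new string(chars);

        // Decoding the bytes as UTF-8 yields the four intended characters.
        string decoded = Encoding.UTF8.GetString(received);

        Console.WriteLine(castResult.Length);        // 5
        Console.WriteLine(decoded.Length);           // 4
        Console.WriteLine(decoded == "caf\u00E9");   // True
    }
}
```

The five-character result is the "string in UTF8 format" from the original question: it is really a string holding the wrong characters.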
C# Learner <cs****@learner.here> wrote:
Thanks for the reply, but this doesn't seem to be working:

private static string Utf8ToAscii(string value)
{
    byte[] utf8Bytes = Encoding.UTF8.GetBytes(value);

    System.Text.UTF8Encoding u = new UTF8Encoding();
    return u.GetString(utf8Bytes);
}


That code works fine, but doesn't do what you want it to do. Look very
closely at the documentation for Encoding.GetBytes and
Encoding.GetString. Between those and the web page I pointed you at
before, you should end up with an understanding of why "a UTF-8 encoded
string" is like saying "a hex-formatted number" - the number itself, as
a number, has no encoding, only a string representation of the number
has a format.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Nov 15 '05 #7
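
To illustrate why that method "works fine" yet changes nothing, here is a small sketch (the input string and class name are assumptions for the example). GetBytes encodes the characters the string already holds, and GetString decodes exactly those bytes again, so the result equals the input:

```csharp
using System;
using System.Text;

class RoundTrip
{
    static void Main()
    {
        // A string that already holds the wrong characters: "caf" + 'Ã' + '©'.
        string value = "caf\u00C3\u00A9";

        // GetBytes encodes those five characters as UTF-8 (seven bytes)...
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(value);

        // ...and GetString decodes the same seven bytes straight back.
        string result = new UTF8Encoding().GetString(utf8Bytes);

        Console.WriteLine(utf8Bytes.Length);     // 7
        Console.WriteLine(result == value);      // True

        // The hex analogy: the number has no format, only its text form does.
        int number = 255;
        Console.WriteLine(number.ToString("X")); // FF
    }
}
```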
Jon Skeet [C# MVP] <sk***@pobox.com> wrote:
C# Learner <cs****@learner.here> wrote:
I really can't see where I'm mixing up anything with anything. I
simply have a string in UTF8 format.


That's mixing things up to start with. "<x> in UTF8 format/encoding" is
only valid where <x> is a sequence of bytes. A string is a sequence of
Unicode characters.

I just want to decode it.


You're looking at the wrong thing though - you need to decode the bytes
you read from the socket, rather than generating character data from
those bytes in some way which you haven't defined and then talking
about that character data as if it were "in UTF8 format".

Is it not possible in .NET? I've done this in Delphi without problem.


I suspect a string in Delphi isn't the same as it is in .NET.


Okay, I've got it working.

I had to use the following:

private static string Utf8ToAnsi(string value)
{
    byte[] utf8Bytes = RawEncoding.GetBytes(value);
    byte[] ansiBytes = Encoding.Convert(Encoding.UTF8,
        Encoding.Default, utf8Bytes);

    return Encoding.Default.GetString(ansiBytes);
}

public class RawEncoding
{
    public static byte[] GetBytes(string text)
    {
        byte[] result = new byte[text.Length];

        for(int i = 0; i < text.Length; ++i) {
            result[i] = (byte)text[i];
        }

        return result;
    }
}

Thanks
Nov 15 '05 #8
C# Learner <cs****@learner.here> wrote:
Okay, I've got it working.

I had to use the following:

private static string Utf8ToAnsi(string value)
{
    byte[] utf8Bytes = RawEncoding.GetBytes(value);
    byte[] ansiBytes = Encoding.Convert(Encoding.UTF8,
        Encoding.Default, utf8Bytes);

    return Encoding.Default.GetString(ansiBytes);
}

public class RawEncoding
{
    public static byte[] GetBytes(string text)
    {
        byte[] result = new byte[text.Length];

        for(int i = 0; i < text.Length; ++i) {
            result[i] = (byte)text[i];
        }

        return result;
    }
}


That's still ignoring the original problem though, and you may well
find you get corrupted data. (It's also doing one conversion more than
you really need to.)

The key thing is how you're getting this "UTF-8 encoded string" in the
first place. Something must be converting bytes into a string - and
*that's* the place to fix. Either it shouldn't be doing a conversion at
all (in which case it should pass the byte array along) or it should be
doing the conversion using the UTF-8 encoding (in which case your
string will then be correct).

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Nov 15 '05 #9
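
A sketch of how the extra conversion can corrupt data. Encoding.Default depends on the machine, so as an assumption this example substitutes Latin-1 with an explicit '?' fallback for it; the sample bytes and class name are also made up. Any character the intermediate code page cannot represent is lost, while decoding the UTF-8 bytes directly keeps it:

```csharp
using System;
using System.Text;

class ExtraConversion
{
    static void Main()
    {
        // UTF-8 bytes for "éΩ": 0xC3 0xA9 (é) and 0xCE 0xA9 (Ω).
        byte[] utf8Bytes = { 0xC3, 0xA9, 0xCE, 0xA9 };

        // Assumption: Latin-1 with a '?' fallback stands in for Encoding.Default.
        Encoding narrow = Encoding.GetEncoding(
            "iso-8859-1",
            new EncoderReplacementFallback("?"),
            new DecoderReplacementFallback("?"));

        // Two steps, as in Utf8ToAnsi: UTF-8 bytes -> code page bytes -> string.
        byte[] narrowBytes = Encoding.Convert(Encoding.UTF8, narrow, utf8Bytes);
        string viaCodePage = narrow.GetString(narrowBytes);

        // One step: UTF-8 bytes -> string.
        string direct = Encoding.UTF8.GetString(utf8Bytes);

        Console.WriteLine(viaCodePage == "\u00E9?");   // True: Ω was replaced
        Console.WriteLine(direct == "\u00E9\u03A9");   // True: nothing lost
    }
}
```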
Jon Skeet [C# MVP] <sk***@pobox.com> wrote:
That's still ignoring the original problem though, and you may well
find you get corrupted data. (It's also doing one conversion more than
you really need to.)

The key thing is how you're getting this "UTF-8 encoded string" in the
first place. Something must be converting bytes into a string - and
*that's* the place to fix. Either it shouldn't be doing a conversion at
all (in which case it should pass the byte array along) or it should be
doing the conversion using the UTF-8 encoding (in which case your
string will then be correct).


Just for reference, here's the method I use to "convert" the bytes to
a string after reading them from the socket:

public static string GetString(byte[] data)
{
    StringBuilder sb = new StringBuilder();

    for(int i = 0; i < data.Length; ++i) {
        sb.Append((char)data[i]);
    }

    return sb.ToString();
}

Regards
Nov 15 '05 #10
C# Learner <cs****@learner.here> wrote:
Just for reference, here's the method I use to "convert" the bytes to
a string after reading them from the socket:

public static string GetString(byte[] data)
{
    StringBuilder sb = new StringBuilder();

    for(int i = 0; i < data.Length; ++i) {
        sb.Append((char)data[i]);
    }

    return sb.ToString();
}


Right. Don't do that. *That's* where you're mixing binary and character
data (essentially treating binary data as character data).

Either keep it as bytes, or use Encoding.UTF8.GetString(data) instead
of the above when you're reading the text.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Nov 15 '05 #11
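
One caveat when decoding socket data, which the thread does not go into: a multi-byte character can be split across two reads. The sketch below assumes two hypothetical reads that cut 'é' in half and uses the stateful Decoder from Encoding.UTF8.GetDecoder(), which holds the incomplete byte until the rest arrives. The chunk contents and names are illustrative only.

```csharp
using System;
using System.Text;

class ChunkedDecode
{
    static void Main()
    {
        // "café!" in UTF-8, with the two bytes of 'é' (0xC3 0xA9) split across reads.
        byte[] firstRead = { 0x63, 0x61, 0x66, 0xC3 };
        byte[] secondRead = { 0xA9, 0x21 };

        // The decoder remembers the dangling 0xC3 between calls.
        Decoder decoder = Encoding.UTF8.GetDecoder();
        StringBuilder text = new StringBuilder();

        foreach (byte[] chunk in new[] { firstRead, secondRead })
        {
            char[] buffer = new char[decoder.GetCharCount(chunk, 0, chunk.Length)];
            int written = decoder.GetChars(chunk, 0, chunk.Length, buffer, 0);
            text.Append(buffer, 0, written);
        }

        Console.WriteLine(text.ToString() == "caf\u00E9!"); // True
    }
}
```

Calling Encoding.UTF8.GetString on each chunk separately would not give this result, because each call decodes its chunk in isolation.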
Jon Skeet [C# MVP] <sk***@pobox.com> wrote:
Right. Don't do that. *That's* where you're mixing binary and character
data (essentially treating binary data as character data).

Either keep it as bytes, or use Encoding.UTF8.GetString(data) instead
of the above when you're reading the text.


The problem with using UTF8.GetString is that it just seems to remove
important bytes.

For example, the raw packet read from the socket might be something
like (note, these are raw bytes, and I'm displaying them as a string
literal for convenience):

"FOOPROTOCOL\xC0\x801\xC0\x80Field1\xC0\x80"

So that's:
"FOOPROTOCOL"
0xC0
0x80
'1'
0xC0
0x80
"Field1"
0xC0
0x80

Now, using UTF8.GetString on the above will do something like the
following:

"FOOPROTOCOL1Field1"

i.e. all the "\xC0\x80" delimiters were removed.
Nov 15 '05 #12
C# Learner <cs****@learner.here> wrote:
The problem with using UTF8.GetString is that it just seems to remove
important bytes.


Not if those bytes are part of the text...

For example, the raw packet read from the socket might be something
like (note, these are raw bytes, and I'm displaying them as a string
literal for convenience):

"FOOPROTOCOL\xC0\x801\xC0\x80Field1\xC0\x80"

So that's:
"FOOPROTOCOL"
0xC0
0x80
'1'
0xC0
0x80
"Field1"
0xC0
0x80

Now, using UTF8.GetString on the above will do something like the
following:

"FOOPROTOCOL1Field1"

i.e. all the "\xC0\x80" delimiters were removed.


Yes, because you're *again* mixing binary data and character data. The
0xc0 and 0x80 bytes aren't part of the text, they're delimiters. You
should only convert the text data into a string, however you do the
conversion.

It sounds like what you actually should be doing is finding the
delimiters within the byte array and converting each text section of
the binary data into a string separately. Unless you do that, you will
definitely be mixing binary and character data.

I don't know if you've got access to the protocol itself by the way,
but if you have I'd suggest changing it so that rather than using
delimiters, you prefix each string with the length in bytes.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Nov 15 '05 #13
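
The approach suggested above (find the delimiters in the byte array, then decode each text section separately) could be sketched roughly as follows. The DecodeFields helper and the sample packet are hypothetical; the only thing taken from the thread is the two-byte delimiter 0xC0 0x80. Since 0xC0 never occurs in well-formed UTF-8, searching for it at the byte level should not split a character.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

class DelimitedPacket
{
    // Splits the packet on the delimiter 0xC0 0x80 and decodes each section as UTF-8.
    static List<string> DecodeFields(byte[] packet)
    {
        List<string> fields = new List<string>();
        int start = 0;

        for (int i = 0; i + 1 < packet.Length; i++)
        {
            if (packet[i] == 0xC0 && packet[i + 1] == 0x80)
            {
                fields.Add(Encoding.UTF8.GetString(packet, start, i - start));
                i++;            // step over the second delimiter byte
                start = i + 1;  // next field begins after the delimiter
            }
        }

        // Bytes after the last delimiter form a final, unterminated field.
        if (start < packet.Length)
        {
            fields.Add(Encoding.UTF8.GetString(packet, start, packet.Length - start));
        }

        return fields;
    }

    static void Main()
    {
        // Build a sample packet: each field is followed by the delimiter.
        byte[] delimiter = { 0xC0, 0x80 };
        List<byte> packet = new List<byte>();
        foreach (string field in new[] { "FOOPROTOCOL", "1", "Caf\u00E9" })
        {
            packet.AddRange(Encoding.UTF8.GetBytes(field));
            packet.AddRange(delimiter);
        }

        List<string> decoded = DecodeFields(packet.ToArray());
        Console.WriteLine(decoded.Count);                 // 3
        Console.WriteLine(decoded[2] == "Caf\u00E9");     // True
    }
}
```

A single pass over the array like this stays linear in the packet size, so a packet with 100 fields costs one scan plus 100 small decodes.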
Jon Skeet [C# MVP] <sk***@pobox.com> wrote:
Yes, because you're *again* mixing binary data and character data. The
0xc0 and 0x80 bytes aren't part of the text, they're delimiters. You
should only convert the text data into a string, however you do the
conversion.

It sounds like what you actually should be doing is finding the
delimiters within the byte array and converting each text section of
the binary data into a string separately. Unless you do that, you will
definitely be mixing binary and character data.

I don't know if you've got access to the protocol itself by the way,
but if you have I'd suggest changing it so that rather than using
delimiters, you prefix each string with the length in bytes.


Hi Jon,

I guess all the problems I'm running into are due to the fact that I
basically think of a string as an array of bytes.

In this case, I don't have access to the protocol, and can't change
it. Also, separating the packets into an array of fields would be
less efficient and more work than is desired in this particular
scenario. The reason for this is that a packet may have a large
number of fields, say 100.

I think the only way of doing this correctly would be to keep the C#
code I have currently. It works as expected.

Thanks for your patience in this matter. It's much appreciated.
Nov 15 '05 #14
C# Learner <cs****@learner.here> wrote:
I guess all the problems I'm running into are due to the fact that I
basically think of a string as an array of bytes.


Yes indeed - it's not, it's a sequence of *characters*.

In this case, I don't have access to the protocol, and can't change
it. Also, separating the packets into an array of fields would be
less efficient and more work than is desired in this particular
scenario. The reason for this is that a packet may have a large
number of fields, say 100.


It's really not going to take long to sort them out though...

I think the only way of doing this correctly would be to keep the C#
code I have currently. It works as expected.


Well, in that case I'd at least recommend changing your code to:

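public static string GetString(byte[] data)
{
    // Widen each byte to the char with the same numeric value.
    char[] chars = new char[data.Length];
    for (int i = 0; i < data.Length; i++)
    {
        // Explicit cast needed: C# has no implicit byte-to-char conversion.
        chars[i] = (char)data[i];
    }
    return new string(chars);
}

private static string Utf8ToAnsi(string value)
{
    // Narrow each char back to the byte it came from...
    byte[] utf8Bytes = new byte[value.Length];
    for (int i = 0; i < utf8Bytes.Length; i++)
    {
        utf8Bytes[i] = (byte)value[i];
    }

    // ...then decode those bytes as UTF-8 in a single step.
    return Encoding.UTF8.GetString(utf8Bytes);
}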

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Nov 15 '05 #15
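
As a side note, the two hand-written loops above map each byte to the char with the same value and back. As far as I know, .NET's ISO-8859-1 (Latin-1) encoding does the same one-to-one mapping, so a sketch like the following should be equivalent, provided every char in the string is at most U+00FF. The sample bytes and class name are assumptions for illustration.

```csharp
using System;
using System.Text;

class Latin1RoundTrip
{
    static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

        // Text bytes plus the protocol's 0xC0 0x80 delimiter.
        byte[] data = { 0x63, 0x61, 0x66, 0xC3, 0xA9, 0xC0, 0x80 };

        // Equivalent of the byte-to-char loop.
        string carrier = latin1.GetString(data);

        // Equivalent of the char-to-byte loop.
        byte[] restored = latin1.GetBytes(carrier);

        bool sameAsCast = carrier.Length == data.Length;
        bool roundTrips = restored.Length == data.Length;
        for (int i = 0; i < data.Length && sameAsCast && roundTrips; i++)
        {
            sameAsCast = carrier[i] == (char)data[i];
            roundTrips = restored[i] == data[i];
        }

        Console.WriteLine(sameAsCast);  // True
        Console.WriteLine(roundTrips);  // True
    }
}
```

This only carries bytes through a string unchanged; the text sections still need Encoding.UTF8.GetString before they hold the right characters.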
Jon Skeet [C# MVP] <sk***@pobox.com> wrote:
C# Learner <cs****@learner.here> wrote:
I guess all the problems I'm running into are due to the fact that I
basically think of a string as an array of bytes.


Yes indeed - it's not, it's a sequence of *characters*.


This is something I'd better look into!

In this case, I don't have access to the protocol, and can't change
it. Also, separating the packets into an array of fields would be
less efficient and more work than is desired in this particular
scenario. The reason for this is that a packet may have a large
number of fields, say 100.


It's really not going to take long to sort them out though...

I think the only way of doing this correctly would be to keep the C#
code I have currently. It works as expected.


Well, in that case I'd at least recommend changing your code to:


<snipped for brevity>

Will do, thanks again.
Nov 15 '05 #16
