This is going to be a question for anyone who is an expert in C# Text Encoding.
My situation is this: I have a Sybase database which is firing back ISO-8859 encoded strings. I am unable to get the database to translate to UTF-8, for non-technical reasons.
So I have a string coming back with the character œ (byte value 156, 0x9C). This character appears in .NET as a box character because code point 156 (U+009C) is a non-printable control character in Unicode, not œ.
I have been scratching my head over this one and have produced a series of tests to try to get the conversion correct.
My code is below followed by the output:
string sybaseRawString = DataAccessLayer.GetXXX();
Encoding iso = Encoding.GetEncoding("iso-8859-1");
Encoding sbcs = Encoding.Default; //SBCSCodePageEncoding
Encoding unicode = Encoding.Unicode;
Encoding utf8 = Encoding.UTF8;
byte[] isoBytes = iso.GetBytes(sybaseRawString);
byte[] sbcsBytes = sbcs.GetBytes(sybaseRawString);
byte[] utf8Bytes = Encoding.Convert(iso, utf8, isoBytes);
byte[] unicodeBytes = Encoding.Convert(utf8, unicode, utf8Bytes);
WriteLine("SYBASE ISO-8559 STRING");
WriteLine(sybaseRawString);
WriteLine(ToString(isoBytes));
WriteLine("ISO -> SBCS ENCODED STRING");
WriteLine(new String(sbcs.GetChars(sbcsBytes)));
WriteLine(ToString(sbcsBytes));
string expected = "FTSE TECHMARK 100 (œ)";
WriteLine("EXPECTED .NET STRING");
WriteLine(expected);
WriteLine(ToString(Encoding.Unicode.GetBytes(expected)));
WriteLine("ISO -> UNICODE");
WriteLine(new String(unicode.GetChars(unicodeBytes)));
WriteLine(ToString(unicodeBytes));
WriteLine("ISO -> UTF8");
WriteLine(new String(utf8.GetChars(utf8Bytes)));
WriteLine(ToString(utf8Bytes));
N.B. In the output I have replaced the box characters with question marks, except in the SBCS case, which genuinely did produce a question mark. I did this because HTML understands the box characters and translates them to œ!
The output in the DEBUG window is as follows:
SYBASE ISO-8559 STRING
FTSE TECHMARK 100 (?)
46-54-53-45-20-54-45-43-48-4D-41-52-4B-20-31-30-30-20-28-9C-29
ISO -> SBCS ENCODED STRING
FTSE TECHMARK 100 (?)
46-54-53-45-20-54-45-43-48-4D-41-52-4B-20-31-30-30-20-28-3F-29
EXPECTED .NET STRING
FTSE TECHMARK 100 (œ)
46-00-54-00-53-00-45-00-20-00-54-00-45-00-43-00-48-00-4D-00-41-00-52-00-4B-00-20-00-31-00-30-00-30-00-20-00-28-00-53-01-29-00
ISO -> UNICODE
FTSE TECHMARK 100 (?)
46-00-54-00-53-00-45-00-20-00-54-00-45-00-43-00-48-00-4D-00-41-00-52-00-4B-00-20-00-31-00-30-00-30-00-20-00-28-00-9C-00-29-00
ISO -> UTF8
FTSE TECHMARK 100 (?)
46-54-53-45-20-54-45-43-48-4D-41-52-4B-20-31-30-30-20-28-C2-9C-29
However, when I view this in NUnit, all the ? characters appear correctly as œ, albeit ever so slightly different from the Expected .NET version (ISO vs Unicode?). Is NUnit detecting the encoding format of the character and printing it correctly?
My question is: how do I get from my original Sybase ISO-8859 string to the Expected .NET bytes (Unicode), so that I can be sure that all of my .NET apps will display the characters correctly?
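One avenue that may be worth trying, offered only as a sketch: byte 0x9C is œ in Windows-1252 (code page 1252), whereas in ISO-8859-1 it is a control character. If the Sybase data is really Windows-1252 that has been decoded as ISO-8859-1 (an assumption, not something confirmed above), the original bytes can be recovered with the ISO-8859-1 encoder and decoded again as code page 1252. The class and method names below are hypothetical. The `RegisterProvider` call is assumed to be needed only on .NET Core / .NET 5+; on the classic .NET Framework, where code page 1252 is built in, it can probably be dropped.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of any library.
static class SybaseText
{
    // Re-interprets a string whose chars are really raw Windows-1252 bytes
    // that were decoded as ISO-8859-1 (assumption about the Sybase data).
    public static string FromMisdecodedWindows1252(string raw)
    {
        // Makes code page 1252 available on .NET Core / .NET 5+.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        Encoding cp1252 = Encoding.GetEncoding(1252);

        // ISO-8859-1 maps chars U+0000..U+00FF straight back to bytes 0x00..0xFF,
        // so this recovers the bytes the database originally sent.
        byte[] originalBytes = latin1.GetBytes(raw);

        // Decode those bytes as Windows-1252: 0x9C becomes U+0153 (œ).
        return cp1252.GetString(originalBytes);
    }
}
```

With the string from the tests above, `SybaseText.FromMisdecodedWindows1252(sybaseRawString)` should then yield a string whose `Encoding.Unicode` bytes contain 53-01 where the ISO -> UNICODE output shows 9C-00. Any character above U+00FF in the input would be lost in the round trip, so this only suits strings that came through the ISO-8859-1 path.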
Many thanks for any help received!