UTF-8 Character Decoding Problem

I am using an HtmlInputFile control in ASP.NET 2.0 to upload a file in a UserControl. After the upload, I examine the control's PostedFile property (an HttpPostedFile) to read the bytes of the uploaded file's input stream, convert them to characters, and store them in a database.

For the most part, the process I had in place was working, but I noticed that certain characters in my test file were not decoded correctly. I inspected the corresponding byte values I received from the input stream and concluded that the stream was encoded as UTF-8 (which I have also come to understand is the default in many places in .NET when no other encoding is specified). I think the problem is that every character in the stream was encoded using a single byte, so the few characters in the file whose UTF-8 encodings require more than one byte are decoded improperly.

An example:

One of the characters I am having an issue with is the dash ( hyphen, minus-sign operator ) character (-). It is encoded from the input stream as a single byte with the value 0x96, but I found a resource that lists this character as requiring two bytes ( 0xc2, 0x96 ) in the UTF-8 encoding. The result is that when I convert the byte that is supposed to represent this character to a char, it ends up with the value ( bit-wise ) 0xfffd.
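For reference, a minimal sketch of how the single byte 0x96 decodes under a few encodings. It assumes the test file was saved in Windows-1252 (the usual "ANSI" code page on Western-European Windows), which the thread does not confirm; `ByteDemo` and `Decode` are hypothetical names used only for illustration.

```csharp
using System;
using System.Text;

public static class ByteDemo
{
    // Hypothetical helper: decode raw bytes with the given encoding.
    public static string Decode(byte[] bytes, Encoding encoding)
    {
        return encoding.GetString(bytes);
    }

    public static void Main()
    {
        byte[] single = { 0x96 };

        // Convert.ToChar(byte) only widens the value, so 0x96 becomes U+0096,
        // a C1 control character, not a dash.
        Console.WriteLine("Convert.ToChar: U+{0:X4}", (int)Convert.ToChar(single[0]));

        // A lone 0x96 is a UTF-8 continuation byte with no lead byte, so the
        // default UTF-8 decoder substitutes U+FFFD (the replacement character).
        Console.WriteLine("UTF-8:          U+{0:X4}", (int)Decode(single, Encoding.UTF8)[0]);

        // Assumption: the file is Windows-1252, where 0x96 is U+2013 (en dash).
        // On .NET Core / .NET 5+, code page 1252 must first be registered via
        // CodePagesEncodingProvider; on .NET Framework it is built in.
        Console.WriteLine("Windows-1252:   U+{0:X4}", (int)Decode(single, Encoding.GetEncoding(1252))[0]);

        // The en dash in genuine UTF-8 is three bytes, not one or two.
        Console.WriteLine("U+2013 as UTF-8: {0}", BitConverter.ToString(Encoding.UTF8.GetBytes("\u2013")));

        // The two-byte sequence C2 96 is the UTF-8 form of the control character U+0096.
        Console.WriteLine("U+0096 as UTF-8: {0}", BitConverter.ToString(Encoding.UTF8.GetBytes("\u0096")));
    }
}
```

On .NET Framework this prints U+0096, U+FFFD, U+2013, E2-80-93 and C2-96 in turn, which suggests that the 0xFFFD value comes from a UTF-8 decoding step somewhere in the pipeline, and that a table listing the dash as 0xC2 0x96 is describing U+0096 rather than the en dash.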

Here is the code I am trying to use to accomplish this:

    private string serializeFile( )
        {
        StringBuilder fileContents = new StringBuilder( );
        if( this._file != null )
            {
            HttpPostedFile file = this._file;

            byte[ ] fileBytes = new byte[file.ContentLength];
            file.InputStream.Read( fileBytes, 0, file.ContentLength );
            foreach( byte fileByte in fileBytes )
                {
                Char character = Convert.ToChar( fileByte );
                fileContents.Append( character );
                }
            }
    ...

My question is this: can anyone see what I am doing wrong? Or, if I'm not doing anything obviously wrong (which would surprise me), how can I properly decode these characters without resorting to statically testing for the characters I know to be a problem?

Thank you very much for taking the time to read this, and if you choose to help. Let me know if you need any more information.
Nov 5 '07 #1
Plater
Take a look at the System.Text.Encoding section.

Probably in particular:
System.Text.Encoding.UTF8

and say: System.Text.Encoding.UTF8.GetString(byte[])
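A minimal sketch of that suggestion, decoding the whole stream at once instead of converting byte by byte. `UploadDecoder` and `ReadAllText` are hypothetical names; in the original page the stream would presumably be `file.InputStream`, but a `MemoryStream` stands in here so the example runs on its own.

```csharp
using System;
using System.IO;
using System.Text;

public static class UploadDecoder
{
    // Hypothetical helper: read an entire stream as text in the given encoding.
    // StreamReader loops internally, which also avoids relying on a single
    // Stream.Read call filling the whole buffer.
    public static string ReadAllText(Stream input, Encoding encoding)
    {
        // The third argument lets a byte order mark, if present, override 'encoding'.
        // Disposing the reader also closes the underlying stream.
        using (StreamReader reader = new StreamReader(input, encoding, true))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        // "a", space, en dash, space, "b": the dash takes three bytes in UTF-8.
        byte[] utf8Bytes = Encoding.UTF8.GetBytes("a \u2013 b");
        using (MemoryStream stream = new MemoryStream(utf8Bytes))
        {
            string text = ReadAllText(stream, Encoding.UTF8);
            Console.WriteLine("{0} bytes decoded to {1} chars", utf8Bytes.Length, text.Length);
        }
    }
}
```

This only helps if the bytes really are UTF-8; if the file was saved in another encoding, the matching `Encoding` has to be passed instead.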
Nov 5 '07 #2
I'm sorry. I forgot to mention that I've tried several different solutions involving the System.Text.Encoding class. I have tried using the UTF8 property to decode my byte array. The result is the same as using the Convert.ToChar method. I have also tried creating a UTF8 Decoder, and using its GetChars method; that led me to trying to detect when the byte value was over 127 ( outside of ASCII range ) and create a UTF32 character, using three other bytes, each with value 0x0, for padding. I then tried to use the UTF32 property to decode that byte array, with the same output all around.

I think I am well and truly not getting something here.
Nov 5 '07 #3
Plater
Hmm, well what you described seemed more like utf-16
(System.Text.Encoding.BigEndianUnicode)
utf-8 means 8bits per character, seems like there shouldn't BE any utf-8 encoded character that takes 2 8bit values to create.
That would make it utf-16?
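For reference, the "8" in UTF-8 is the size of the code unit, not of the character: UTF-8 is a variable-width encoding that uses one to four bytes per code point, and UTF-16 likewise uses one or two 16-bit units. A small sketch that counts the bytes (`Utf8Widths` and `Utf8ByteCount` are hypothetical names for illustration):

```csharp
using System;
using System.Text;

public static class Utf8Widths
{
    // Hypothetical helper: number of bytes the string occupies when encoded as UTF-8.
    public static int Utf8ByteCount(string s)
    {
        return Encoding.UTF8.GetByteCount(s);
    }

    public static void Main()
    {
        Console.WriteLine(Utf8ByteCount("-"));          // U+002D hyphen-minus (ASCII): 1 byte
        Console.WriteLine(Utf8ByteCount("\u00E9"));     // U+00E9 e with acute accent: 2 bytes
        Console.WriteLine(Utf8ByteCount("\u2013"));     // U+2013 en dash: 3 bytes
        Console.WriteLine(Utf8ByteCount("\U0001F600")); // U+1F600, outside the BMP: 4 bytes
    }
}
```

So a character that needs two or three bytes is perfectly ordinary UTF-8 and does not by itself indicate UTF-16.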
Nov 5 '07 #4
I looked into the UTF-8 characters in question, and am now very suspicious of the characters themselves, as they are only supported by a handful of fonts.

Furthermore, after editing my test file in Notepad, to replace the offending characters in an environment in which I was sure the resultant character would be encoded as ASCII, voila!, no more problems when uploading the file.

I think the problem is therefore with the encoding of the files themselves; it is still puzzling, but I guess no longer within the scope of this forum. I'll shut up about it now =).
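If the uploaded files can arrive in either encoding, one possible approach is to try a strict UTF-8 decode first and fall back to a single-byte code page when the bytes are not valid UTF-8. The sketch below assumes Windows-1252 as the fallback, which is a guess about how the test file was saved; `FallbackDecoder` and `Decode` are hypothetical names.

```csharp
using System;
using System.Text;

public static class FallbackDecoder
{
    // Hypothetical helper: strict UTF-8 first, Windows-1252 if that fails.
    public static string Decode(byte[] bytes)
    {
        // Second argument 'true' makes the decoder throw on invalid byte
        // sequences instead of silently substituting U+FFFD.
        Encoding strictUtf8 = new UTF8Encoding(false, true);
        try
        {
            return strictUtf8.GetString(bytes);
        }
        catch (DecoderFallbackException)
        {
            // Assumption: non-UTF-8 input is Windows-1252. On .NET Core / .NET 5+
            // this code page must be registered via CodePagesEncodingProvider first.
            return Encoding.GetEncoding(1252).GetString(bytes);
        }
    }

    public static void Main()
    {
        byte[] ansi = { 0x61, 0x96, 0x62 };             // "a", 0x96, "b" as saved in Windows-1252
        byte[] utf8 = { 0x61, 0xE2, 0x80, 0x93, 0x62 }; // the same text saved as UTF-8

        // Both inputs should decode to the same three-character string.
        Console.WriteLine(Decode(ansi) == "a\u2013b");
        Console.WriteLine(Decode(utf8) == "a\u2013b");
    }
}
```

This is a heuristic rather than real detection: a Windows-1252 file whose high bytes happen to form valid UTF-8 sequences would be decoded as UTF-8, so knowing the encoding up front is more reliable where that is possible.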

Thanks for your suggestions, Plater.
Nov 5 '07 #5
