UTF-8 Character Decoding Problem

I am using an HtmlInputFile control in ASP.NET 2.0 to upload a file in a UserControl. After upload, I am examining the control's PostedFile property (an HttpPostedFile) to read the bytes of the uploaded file's input stream, convert them to characters, and store them in a database.

For the most part, the process I had in place was working, but I noticed certain characters in my test file were not decoded correctly. I investigated the corresponding byte values I received from the input stream, and came to the conclusion that the stream was encoded as UTF-8 (which I have also come to understand is the default in many places in .NET when no other encoding scheme is specified). I think the problem is that all the characters in the stream were encoded using a single byte, so the few characters in the file whose UTF-8 encodings require more than one byte are improperly decoded.

An example:

One of the characters I am having an issue with is the dash ( hyphen, minus-sign operator ) character (-). It is encoded from the input stream as a single byte with the value 0x96, but I found a resource that lists this character as requiring two bytes ( 0xc2, 0x96 ) in the UTF-8 encoding. The result is that when I convert the byte that is supposed to represent this character to a char, it ends up with the value ( bit-wise ) 0xfffd.

Here is the code I am trying to use to accomplish this:

private string serializeFile( )
    {
    StringBuilder fileContents = new StringBuilder( );
    if( this._file != null )
        {
        HttpPostedFile file = this._file;

        byte[ ] fileBytes = new byte[file.ContentLength];
        file.InputStream.Read( fileBytes, 0, file.ContentLength );
        foreach( byte fileByte in fileBytes )
            {
            Char character = Convert.ToChar( fileByte );
            fileContents.Append( character );
            }
        }

...

My question is this: can anyone see what I am doing wrong? Or, if I'm not doing anything obviously wrong (which would surprise me), how can I properly decode these characters without resorting to statically testing for the characters I know to be a problem?

Thank you very much for taking the time to read this, and for any help you choose to offer. Let me know if you need any more information.

Nov 5 '07 #1

Plater

Take a look at the System.Text.Encoding section.

Probably in particular: System.Text.Encoding.UTF8

and, say: System.Text.Encoding.UTF8.GetString(byte[])
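
As a minimal sketch of that suggestion, assuming the uploaded bytes really are well-formed UTF-8, the whole buffer can be decoded in one call instead of converting byte by byte. The class name, the `DecodeUtf8` helper, and the sample bytes are illustrative and not part of the original code.

```csharp
using System;
using System.Text;

public static class DecodeSketch
{
    // Hypothetical helper (not from the thread): decode an entire
    // buffer at once so multi-byte sequences stay together.
    public static string DecodeUtf8(byte[] fileBytes)
    {
        return Encoding.UTF8.GetString(fileBytes);
    }

    public static void Main()
    {
        // 'a', then an en dash (U+2013) as the three UTF-8 bytes E2 80 93, then 'b'.
        byte[] sample = { 0x61, 0xE2, 0x80, 0x93, 0x62 };
        string text = DecodeUtf8(sample);

        Console.WriteLine(text.Length);            // 3
        Console.WriteLine((int)text[1] == 0x2013); // True
    }
}
```

Decoding the buffer as a unit matters because a byte-at-a-time conversion treats each byte of a multi-byte sequence as its own character.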

Nov 5 '07 #2

seedstorm

I'm sorry. I forgot to mention that I've tried several different solutions involving the System.Text.Encoding class. I have tried using the UTF8 property to decode my byte array. The result is the same as using the Convert.ToChar method. I have also tried creating a UTF8 Decoder, and using its GetChars method; that led me to trying to detect when the byte value was over 127 ( outside of ASCII range ) and create a UTF32 character, using three other bytes, each with value 0x0, for padding. I then tried to use the UTF32 property to decode that byte array, with the same output all around.
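
One possible explanation for seeing the same output everywhere: a lone byte 0x96 is not a valid UTF-8 sequence, so the UTF-8 decoder substitutes U+FFFD for it. In Windows-1252, on the other hand, 0x96 is the en dash (U+2013). The sketch below assumes, without confirmation from the thread, that the file was saved in Windows-1252; the class and the `Decode` helper are hypothetical names. It tries strict UTF-8 first and falls back to code page 1252 when the bytes are malformed.

```csharp
using System;
using System.Text;

public static class FallbackDecodeSketch
{
    // Hypothetical helper: strict UTF-8 first, then Windows-1252.
    public static string Decode(byte[] bytes)
    {
        // The second argument (throwOnInvalidBytes) makes malformed input
        // raise an exception instead of silently becoming U+FFFD.
        Encoding strictUtf8 = new UTF8Encoding(false, true);
        try
        {
            return strictUtf8.GetString(bytes);
        }
        catch (DecoderFallbackException)
        {
            // Assumption: anything that is not valid UTF-8 is Windows-1252.
            return Encoding.GetEncoding(1252).GetString(bytes);
        }
    }

    public static void Main()
    {
        // Needed on .NET Core / .NET 5+ only. On .NET Framework, code page
        // 1252 is available without registration, so drop this line there.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        byte[] loneByte = { 0x96 };

        // The lenient UTF-8 decoder replaces the invalid byte with U+FFFD.
        Console.WriteLine((int)Encoding.UTF8.GetString(loneByte)[0] == 0xFFFD); // True

        // The fallback path maps 0x96 to the en dash.
        Console.WriteLine((int)Decode(loneByte)[0] == 0x2013); // True
    }
}
```

If the source encoding is known in advance, passing it explicitly is simpler than guessing; the fallback here is only a heuristic.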

I think I am well and truly not getting something here.

Nov 5 '07 #3

Plater

Hmm, well what you described seemed more like utf-16
(System.Text.Encoding.BigEndianUnicode)
utf-8 means 8bits per character, seems like there shouldn't BE any utf-8 encoded character that takes 2 8bit values to create.
That would make it utf-16?
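
For reference, the "8" in UTF-8 refers to the size of its code unit, not to a fixed width per character: UTF-8 uses between one and four bytes per code point, with ASCII taking one byte. A small sketch (class and helper names are illustrative) that prints the encoded length of three characters relevant to this thread:

```csharp
using System;
using System.Text;

public static class Utf8WidthSketch
{
    // Hypothetical helper: number of bytes a string occupies in UTF-8.
    public static int Utf8Length(string s)
    {
        return Encoding.UTF8.GetBytes(s).Length;
    }

    public static void Main()
    {
        Console.WriteLine(Utf8Length("-"));      // 1: ASCII hyphen-minus, U+002D
        Console.WriteLine(Utf8Length("\u0096")); // 2: C1 control U+0096
        Console.WriteLine(Utf8Length("\u2013")); // 3: en dash U+2013

        // U+0096 encodes as the two bytes C2 96, matching the
        // two-byte listing mentioned in the original question.
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes("\u0096"))); // C2-96
    }
}
```

So a two-byte sequence does not by itself indicate UTF-16; it is ordinary UTF-8 for any code point from U+0080 through U+07FF.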

Nov 5 '07 #4

seedstorm

I looked into the UTF-8 characters in question, and am now very suspicious of the characters themselves, as they are only supported by a handful of fonts.

Furthermore, after editing my test file in Notepad, to replace the offending characters in an environment in which I was sure the resultant character would be encoded as ASCII, voila!, no more problems when uploading the file.

I think the problem is therefore with the encoding of the files themselves; it is still puzzling, but I guess no longer within the scope of this forum. I'll shut up about it now =).

Thanks for your suggestions, Plater.

Nov 5 '07 #5
