
Read UTF8 (mixed byte) file & convert to Unicode

I have a file which has no BOM and contains mostly single-byte chars. There
are numerous double-byte chars (Japanese) which appear throughout. I need to
take the resulting Unicode, store it in a DB and display it onscreen. No
matter which way I open the file, convert it to Unicode, leave it as is, or
whatever, I see all single bytes OK, but double bytes become 2 separate
single bytes. Surely there is an easy way to convert these mixed bytes to
Unicode? Below are 2 (of many) attempts at doing the conversion. I was
expecting that Encoding.Convert would be able to do this. My HTML charset,
session codepage, locale and thread culture are all set correctly for Japanese
(reading Japanese from a Unicode file works).

Attempt 1:
Fs = New FileStream(Page.MapPath("/mixed_byte-jp.html"), FileMode.Open,
FileAccess.Read, FileShare.None)
Dim bytUTF8(Fs.Length) As Byte
Fs.Read(bytUTF8, 0, bytUTF8.Length)
bytUni = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, bytUTF8)
Response.Write(Encoding.Unicode.GetString(bytUni))

Attempt 2:
reader = New System.IO.StreamReader(Page.MapPath("/mixed_byte-jp.html"),
System.Text.Encoding.UTF8, True)
bytUTF8 = System.Text.Encoding.UTF8.GetBytes(reader.ReadToEnd())
bytUni = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, bytUTF8)
lblMessage.Text = Encoding.Unicode.GetString(bytUni)

In ASP3 I had to pass the text through ADO to do the conversion which was
very ugly to do - surely that is not required now?
Thanks very much,
Hunter
Jul 21 '05 #1
3 Replies


<"=?Utf-8?B?aHVudGVyYg==?=" <Hunter
Be******@discussions.microsoft.com>> wrote:
In ASP3 I had to pass the text through ADO to do the conversion which was
very ugly to do - surely that is not required now?


No. Your first problem is that you're reading the text in assuming it's
UTF-8, then converting it *back* to UTF-8 bytes, then treating those
bytes as if they were UTF-16 (Unicode) bytes. There's no need to
convert them into bytes again - reader.ReadToEnd() is giving you a
string, so just use that string!
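
For instance, something along these lines (a rough sketch of your second
attempt with the round trip removed, still assuming UTF-8 for the moment):

Dim reader As New System.IO.StreamReader( _
    Page.MapPath("/mixed_byte-jp.html"), System.Text.Encoding.UTF8, True)
' ReadToEnd returns a String, and .NET strings are already Unicode
' (UTF-16) internally - no Encoding.Convert step is needed.
lblMessage.Text = reader.ReadToEnd()
reader.Close()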

Now, that assumes that the file is *actually* in UTF-8. In my
experience Japanese characters come out as 3 bytes in UTF-8, so you may
actually have a Shift-JIS file instead.

You should note that your first attempt isn't guaranteed to read the
whole file, by the way - see
http://www.pobox.com/~skeet/csharp/readbinary.html
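
Roughly, the idea is to keep calling Read until the buffer is full -
something like this sketch:

Dim fs As New System.IO.FileStream(Page.MapPath("/mixed_byte-jp.html"), _
    System.IO.FileMode.Open, System.IO.FileAccess.Read, System.IO.FileShare.None)
' Note: VB array bounds are inclusive, so size the buffer with Length - 1.
Dim data(CInt(fs.Length) - 1) As Byte
Dim offset As Integer = 0
Do While offset < data.Length
    ' Read may return fewer bytes than asked for; 0 means end of stream.
    Dim bytesRead As Integer = fs.Read(data, offset, data.Length - offset)
    If bytesRead = 0 Then Exit Do
    offset += bytesRead
Loop
fs.Close()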

For more information about Unicode issues, see
http://www.pobox.com/~skeet/csharp/unicode.html
http://www.pobox.com/~skeet/csharp/d...ngunicode.html
--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Jul 21 '05 #2

Is the HTML file really a Unicode file? Open the HTML file in notepad
and do a "Save As". What do you see in the "Encoding" option at the end
of the "Save As" dialog? Is it Unicode?

--
Cheers,
Gaurav Vaish
http://mastergaurav.org
http://mastergaurav.blogspot.com
--------------------------

Jul 21 '05 #3

I got it working after a bit. A big thank you, Jon.
Recognising the file encoding was most of the battle. It wasn't UTF8 at all -
the Japanese chars were always 2 bytes, not 3. When I started thinking about
shift_jis the solution was obvious - in hindsight.

Instead of using: Encoding.UTF8
I used: Encoding.GetEncoding("shift_jis")
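
So the reading code ended up roughly like this (same file and label as before):

Dim reader As New System.IO.StreamReader( _
    Page.MapPath("/mixed_byte-jp.html"), _
    System.Text.Encoding.GetEncoding("shift_jis"), True)
' The decoded string is already Unicode, so it can go straight to the
' label (and into the DB) with no Encoding.Convert step.
lblMessage.Text = reader.ReadToEnd()
reader.Close()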

Thanks!

"Jon Skeet [C# MVP]" wrote:
<"=?Utf-8?B?aHVudGVyYg==?=" <Hunter
Be******@discussions.microsoft.com>> wrote:
I have a file which has no BOM and contains mostly single byte chars. There
are numerous double byte chars (Japanese) which appear throughout. I need to
take the resulting Unicode and store it in a DB and display it onscreen. No
matter which way I open the file, convert it to Unicode/leave it as is or
what ever, I see all single bytes ok, but double bytes become 2 seperate
single bytes. Surely there is an easy way to convert these mixed bytes to
Unicode? Below is 2 (of many) attempts at doing the conversion. I was
expecting that Encoding.Convert would be able to do this. My HTML charset,
session codepage, locale, thread culture are all set correctly for Japanese.
(reading Japanese from a unicode file works).

Attempt 1:
Fs = New FileStream(Page.MapPath("/mixed_byte-jp.html"), FileMode.Open,
FileAccess.Read, FileShare.None)
Dim bytUTF8(Fs.Length) As Byte
Fs.Read(bytUTF8, 0, bytUTF8.Length)
bytUni = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, bytUTF8)
Response.Write(Encoding.Unicode.GetString(bytUni))

Attempt 2:
reader = New System.IO.StreamReader(Page.MapPath("/mixed_byte-jp.html"),
System.Text.Encoding.UTF8, True)
bytUTF8 = System.Text.Encoding.UTF8.GetBytes(reader.ReadToEn d())
bytUni = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, bytUTF8)
lblMessage.Text = Encoding.Unicode.GetString(bytUni)

In ASP3 I had to pass the text through ADO to do the conversion which was
very ugly to do - surely that is not required now?


No. Your first problem is that you're reading the text in assuming it's
UTF-8, then converting it *back* to UTF-8 bytes, then treating those
bytes as if they were UTF-16 (Unicode) bytes. There's no need to
convert them into bytes again - reader.ReadToEnd() is giving you a
string, so just use that string!

Now, that assumes that the file is *actually* in UTF-8. In my
experience Japanese characters come out as 3 bytes in UTF-8, so you may
actually have a Shift-JIS file instead.

You should not that your first attempt doesn't guarantee to read the
whole file, by the way - see
http://www.pobox.com/~skeet/csharp/readbinary.html

For more information about Unicode issues, see
http://www.pobox.com/~skeet/csharp/unicode.html
http://www.pobox.com/~skeet/csharp/d...ngunicode.html
--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too

Jul 21 '05 #4
