Read UTF8 (mixed byte) file & convert to Unicode

I have a file which has no BOM and contains mostly single-byte chars. There
are numerous double-byte chars (Japanese) which appear throughout. I need to
take the resulting Unicode and store it in a DB and display it onscreen. No
matter which way I open the file (convert it to Unicode, leave it as is, or
whatever), I see all single bytes OK, but each double-byte char becomes 2
separate single bytes. Surely there is an easy way to convert these mixed bytes
to Unicode? Below are 2 (of many) attempts at doing the conversion. I was
expecting that Encoding.Convert would be able to do this. My HTML charset,
session codepage, locale and thread culture are all set correctly for Japanese
(reading Japanese from a Unicode file works).

Attempt 1:
Fs = New FileStream(Page.MapPath("/mixed_byte-jp.html"), FileMode.Open, FileAccess.Read, FileShare.None)
Dim bytUTF8(Fs.Length) As Byte
Fs.Read(bytUTF8, 0, bytUTF8.Length)
bytUni = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, bytUTF8)
Response.Write(Encoding.Unicode.GetString(bytUni))

Attempt 2:
reader = New System.IO.StreamReader(Page.MapPath("/mixed_byte-jp.html"), System.Text.Encoding.UTF8, True)
bytUTF8 = System.Text.Encoding.UTF8.GetBytes(reader.ReadToEnd())
bytUni = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, bytUTF8)
lblMessage.Text = Encoding.Unicode.GetString(bytUni)

In ASP3 I had to pass the text through ADO to do the conversion which was
very ugly to do - surely that is not required now?
Thanks very much,
Hunter
Jul 21 '05 #1
No. Your first problem is that you're reading the text in assuming it's
UTF-8, then converting it *back* to UTF-8 bytes, then converting those
bytes to UTF-16 (Unicode) bytes and decoding them into a string again.
There's no need to convert to bytes at all - reader.ReadToEnd() is
giving you a string, so just use that string!
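
As a minimal sketch of "just use that string": the module and function names below (ReadAsString, ReadText) are made up for illustration, and the Using statement assumes VB 2005 or later. The encoding is a parameter because the result is only correct if it matches what the file really contains.

```vbnet
Imports System.IO
Imports System.Text

Module ReadAsString
    ' Hypothetical helper: decodes the whole file with the given encoding
    ' and returns the text. No byte arrays or Encoding.Convert calls needed.
    Function ReadText(ByVal path As String, ByVal enc As Encoding) As String
        ' True = honour a byte order mark if the file happens to have one.
        Using reader As New StreamReader(path, enc, True)
            Return reader.ReadToEnd()
        End Using
    End Function
End Module
```

The returned String is already Unicode (UTF-16) in memory, so it can be assigned to lblMessage.Text or passed to Response.Write as it is.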

Now, that assumes that the file is *actually* in UTF-8. In my
experience Japanese characters come out as 3 bytes in UTF-8, so you may
actually have a Shift-JIS file instead.
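
One way to check the byte-count argument is to encode a single kanji both ways and compare. This is a sketch, not a diagnostic tool: the module name is made up, the character U+65E5 is an arbitrary choice, and on .NET Core / .NET 5+ the "shift_jis" lookup is assumed to need the code pages provider registered first (on the .NET Framework it is built in).

```vbnet
Imports System
Imports System.Text

Module ByteCounts
    Sub Main()
        ' On .NET Core / .NET 5+ this line is assumed to be required first:
        ' Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)

        ' U+65E5, written with ChrW so the source file's own encoding
        ' does not matter.
        Dim kanji As String = ChrW(&H65E5)
        Dim sjis As Encoding = Encoding.GetEncoding("shift_jis")

        Console.WriteLine(Encoding.UTF8.GetByteCount(kanji)) ' 3 bytes in UTF-8
        Console.WriteLine(sjis.GetByteCount(kanji))          ' 2 bytes in Shift-JIS
        Console.WriteLine(Encoding.UTF8.GetByteCount("A"))   ' 1 byte
        Console.WriteLine(sjis.GetByteCount("A"))            ' 1 byte
    End Sub
End Module
```

So a file where every Japanese character takes exactly 2 bytes and every ASCII character takes 1 fits Shift-JIS, not UTF-8.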

You should note that your first attempt isn't guaranteed to read the
whole file, by the way - see
http://www.pobox.com/~skeet/csharp/readbinary.html
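
The underlying issue is that Stream.Read may return fewer bytes than requested, so a single call is not enough. A hedged sketch of the usual loop follows; the names ReadBinary and ReadFully are made up here and this is not necessarily the code from the linked page.

```vbnet
Imports System.IO

Module ReadBinary
    ' Hypothetical helper: keeps calling Read until it returns 0,
    ' which signals the end of the stream.
    Function ReadFully(ByVal input As Stream) As Byte()
        ' VB declares arrays by upper bound, so (8191) means 8192 elements.
        Dim buffer(8191) As Byte
        Using ms As New MemoryStream()
            Dim read As Integer = input.Read(buffer, 0, buffer.Length)
            While read > 0
                ms.Write(buffer, 0, read)
                read = input.Read(buffer, 0, buffer.Length)
            End While
            Return ms.ToArray()
        End Using
    End Function
End Module
```

The same upper-bound rule means Dim bytUTF8(Fs.Length) in Attempt 1 allocates one element more than the file length, leaving a trailing zero byte. On .NET 2.0 and later, File.ReadAllBytes(path) covers the simple whole-file case.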

For more information about Unicode issues, see
http://www.pobox.com/~skeet/csharp/unicode.html
http://www.pobox.com/~skeet/csharp/d...ngunicode.html
--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Jul 21 '05 #2
Is the HTML file really a Unicode file? Open the HTML file in Notepad
and do a "Save As". What do you see in the "Encoding" option at the bottom
of the "Save As" dialog? Is it Unicode?

--
Cheers,
Gaurav Vaish
http://mastergaurav.org
http://mastergaurav.blogspot.com
--------------------------

Jul 21 '05 #3
I got it working after a bit. A big thank you, Jon.
Recognising the file encoding was most of the battle. It wasn't UTF8 at all -
the Japanese chars were always 2 bytes, not 3. When I started thinking about
shift_jis the solution was obvious - in hindsight.

Instead of using: Encoding.UTF8
I used: Encoding.GetEncoding("shift_jis")
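
For anyone finding this later, the swap looks roughly like the sketch below. The module and function names are made up for illustration; the only part taken from the fix above is Encoding.GetEncoding("shift_jis"). On .NET Core / .NET 5+ the code pages provider is assumed to need registering before that call succeeds.

```vbnet
Imports System.IO
Imports System.Text

Module ShiftJisReader
    ' Hypothetical helper: reads a Shift-JIS file into a normal .NET String.
    Function ReadShiftJis(ByVal path As String) As String
        ' On .NET Core / .NET 5+ this line is assumed to be required first:
        ' Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)
        Dim sjis As Encoding = Encoding.GetEncoding("shift_jis")
        Using reader As New StreamReader(path, sjis)
            Return reader.ReadToEnd()
        End Using
    End Function
End Module
```

In the page from the original question this would be called as lblMessage.Text = ReadShiftJis(Page.MapPath("/mixed_byte-jp.html")).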

Thanks!

Jul 21 '05 #4
