I have a file which has no BOM and contains mostly single-byte chars. There
are numerous double-byte chars (Japanese) which appear throughout. I need to
take the resulting Unicode and store it in a DB and display it onscreen. No
matter which way I open the file, whether I convert it to Unicode, leave it
as is or whatever, I see all single bytes OK, but double bytes become 2
separate single bytes. Surely there is an easy way to convert these mixed
bytes to Unicode? Below are 2 (of many) attempts at doing the conversion. I
was expecting that Encoding.Convert would be able to do this. My HTML
charset, session codepage, locale and thread culture are all set correctly
for Japanese (reading Japanese from a Unicode file works).
Attempt 1:
Fs = New FileStream(Page.MapPath("/mixed_byte-jp.html"), FileMode.Open, FileAccess.Read, FileShare.None)
Dim bytUTF8(Fs.Length) As Byte
Fs.Read(bytUTF8, 0, bytUTF8.Length)
bytUni = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, bytUTF8)
Response.Write(Encoding.Unicode.GetString(bytUni))
Attempt 2:
reader = New System.IO.StreamReader(Page.MapPath("/mixed_byte-jp.html"), System.Text.Encoding.UTF8, True)
bytUTF8 = System.Text.Encoding.UTF8.GetBytes(reader.ReadToEnd())
bytUni = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, bytUTF8)
lblMessage.Text = Encoding.Unicode.GetString(bytUni)
In ASP3 I had to pass the text through ADO to do the conversion which was
very ugly to do - surely that is not required now?
Thanks very much,
Hunter
No. Your first problem is that you're reading the text in assuming it's
UTF-8, then converting it *back* to UTF-8 bytes, then treating those
bytes as if they were UTF-16 (Unicode) bytes. There's no need to
convert them into bytes again - reader.ReadToEnd() is giving you a
string, so just use that string!
Now, that assumes that the file is *actually* in UTF-8. In my
experience Japanese characters come out as 3 bytes in UTF-8, so you may
actually have a Shift-JIS file instead.
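The byte counts mentioned above can be checked with a small VB.NET sketch. The sample character (U+65E5) and the module name are chosen here purely for illustration and are not from the thread. On .NET Core and later, code-page encodings such as shift_jis are assumed to need registering first, as the comment notes.

```vbnet
Imports System
Imports System.Text

Module ByteCountSketch
    Sub Main()
        ' On .NET Core / .NET 5+, code-page encodings must be registered first
        ' (System.Text.Encoding.CodePages package):
        '   Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)
        ' On .NET Framework this registration is not needed.
        Dim sjis As Encoding = Encoding.GetEncoding("shift_jis")

        ' U+65E5, a common kanji, used only as a sample character.
        Dim sample As String = ChrW(&H65E5)

        Console.WriteLine(Encoding.UTF8.GetByteCount(sample)) ' 3
        Console.WriteLine(sjis.GetByteCount(sample))          ' 2
        Console.WriteLine(Encoding.UTF8.GetByteCount("A"))    ' 1
        Console.WriteLine(sjis.GetByteCount("A"))             ' 1
    End Sub
End Module
```

So a file in which every Japanese character takes exactly 2 bytes is a strong hint that it is not UTF-8.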
You should note that your first attempt doesn't guarantee to read the
whole file, by the way - see http://www.pobox.com/~skeet/csharp/readbinary.html
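One way to read a stream completely is to loop until Read returns 0. The sketch below is only an illustration of that idea, not the code from the linked article; ReadFully and its parameter name are hypothetical. Note also that VB sizes arrays by upper bound, so Dim bytUTF8(Fs.Length) in Attempt 1 allocates one element more than the file length, leaving a trailing zero byte.

```vbnet
Imports System.IO

Module ReadAllSketch
    ' Reads every byte from a stream by looping until Read returns 0.
    ' A single Stream.Read call may return fewer bytes than requested.
    Function ReadFully(ByVal source As Stream) As Byte()
        Dim chunk(4095) As Byte ' upper bound 4095, so 4096 elements
        Using ms As New MemoryStream()
            Dim bytesRead As Integer = source.Read(chunk, 0, chunk.Length)
            While bytesRead > 0
                ms.Write(chunk, 0, bytesRead)
                bytesRead = source.Read(chunk, 0, chunk.Length)
            End While
            Return ms.ToArray()
        End Using
    End Function
End Module
```

For a plain file on disk, File.ReadAllBytes (available from .NET 2.0) does the same job in one call.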
For more information about Unicode issues, see http://www.pobox.com/~skeet/csharp/unicode.html http://www.pobox.com/~skeet/csharp/d...ngunicode.html
--
Jon Skeet - <sk***@pobox.com> http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Is the HTML file really a Unicode file? Open the HTML file in notepad
and do a "Save As". What do you see in the "Encoding" option at the end
of the "Save As" dialog? Is it Unicode?
--
Cheers,
Gaurav Vaish http://mastergaurav.org http://mastergaurav.blogspot.com
--------------------------
I got it working after a bit. A big thank you, Jon.
Recognising the file encoding was most of the battle. It wasn't UTF-8 at all:
the Japanese chars were always 2 bytes, not 3. When I started thinking about
shift_jis the solution was obvious, in hindsight.
Instead of using: Encoding.UTF8
I used: Encoding.GetEncoding("shift_jis")
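Putting that together with the earlier advice to use the string from ReadToEnd() directly, the fix might look roughly like the sketch below. ReadShiftJisFile and filePath are hypothetical names. No Encoding.Convert step is needed, because a .NET String is already Unicode in memory.

```vbnet
Imports System.IO
Imports System.Text

Module ShiftJisReadSketch
    ' Decodes a Shift-JIS file straight into a .NET String.
    Function ReadShiftJisFile(ByVal filePath As String) As String
        ' On .NET Core / .NET 5+, register CodePagesEncodingProvider first;
        ' on .NET Framework shift_jis is available without registration.
        Dim sjis As Encoding = Encoding.GetEncoding("shift_jis")
        Using reader As New StreamReader(filePath, sjis)
            Return reader.ReadToEnd()
        End Using
    End Function
End Module
```

In the page from the question this would be called along the lines of lblMessage.Text = ReadShiftJisFile(Page.MapPath("/mixed_byte-jp.html")).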
Thanks!