I have a file which has no BOM and contains mostly single-byte chars. There
are numerous double-byte chars (Japanese) which appear throughout. I need to
take the resulting Unicode and store it in a DB and display it onscreen. No
matter which way I open the file, whether I convert it to Unicode or leave it
as is, I see all single bytes OK, but double bytes become 2 separate
single bytes. Surely there is an easy way to convert these mixed bytes to
Unicode? Below are 2 (of many) attempts at doing the conversion. I was
expecting that Encoding.Convert would be able to do this. My HTML charset,
session codepage, locale and thread culture are all set correctly for Japanese
(reading Japanese from a Unicode file works).
Attempt 1:
Fs = New FileStream(Page.MapPath("/mixed_byte-jp.html"), FileMode.Open, _
    FileAccess.Read, FileShare.None)
Dim bytUTF8(Fs.Length) As Byte
Fs.Read(bytUTF8, 0, bytUTF8.Length)
bytUni = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, bytUTF8)
Response.Write(Encoding.Unicode.GetString(bytUni))
Attempt 2:
reader = New System.IO.StreamReader(Page.MapPath("/mixed_byte-jp.html"), _
    System.Text.Encoding.UTF8, True)
bytUTF8 = System.Text.Encoding.UTF8.GetBytes(reader.ReadToEnd())
bytUni = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, bytUTF8)
lblMessage.Text = Encoding.Unicode.GetString(bytUni)
In ASP3 I had to pass the text through ADO to do the conversion which was
very ugly to do - surely that is not required now?
Thanks very much,
Hunter
hunterb wrote: In ASP3 I had to pass the text through ADO to do the conversion which was very ugly to do - surely that is not required now?
No. Your first problem is that you're reading the text in assuming it's
UTF-8, then converting it *back* to UTF-8 bytes, then treating those
bytes as if they were UTF-16 (Unicode) bytes. There's no need to
convert them into bytes again - reader.ReadToEnd() is giving you a
string, so just use that string!
Now, that assumes that the file is *actually* in UTF-8. In my
experience Japanese characters come out as 3 bytes in UTF-8, so you may
actually have a Shift-JIS file instead.
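The byte-count point can be checked directly. The following is a minimal sketch, assuming the .NET Framework (where "shift_jis" is available through Encoding.GetEncoding without extra setup; on .NET Core and later, the code pages encoding provider would need to be registered first). The module name and the choice of U+65E5 as a sample kanji are illustrative, not taken from the original file.

```vb
Imports System
Imports System.Text

Module ByteCountDemo
    Sub Main()
        ' U+65E5 is a common kanji; it is written as a code point so the
        ' encoding of this source file does not affect the result.
        Dim kanji As String = ChrW(&H65E5)
        Dim sjis As Encoding = Encoding.GetEncoding("shift_jis")

        Console.WriteLine(Encoding.UTF8.GetByteCount(kanji)) ' prints 3
        Console.WriteLine(sjis.GetByteCount(kanji))          ' prints 2
    End Sub
End Module
```

So a file in which every Japanese character occupies exactly two bytes is unlikely to be UTF-8.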
You should note that your first attempt doesn't guarantee to read the
whole file, by the way - see http://www.pobox.com/~skeet/csharp/readbinary.html
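As a rough sketch of what reading the whole file could look like: Stream.Read may return fewer bytes than requested, so the usual approach is to loop until it returns 0. ReadFully is a hypothetical helper name, and the 8 KB buffer size is an arbitrary choice. Note also that in VB, Dim bytUTF8(Fs.Length) declares an array with Fs.Length + 1 elements, so the first attempt carries one extra zero byte at the end.

```vb
Imports System.IO

Module StreamHelpers
    ' Hypothetical helper: reads a stream to its end instead of
    ' trusting a single Read call to fill the buffer.
    Function ReadFully(ByVal source As Stream) As Byte()
        Dim buffer(8191) As Byte ' upper bound 8191 = 8192 elements
        Using ms As New MemoryStream()
            Dim count As Integer = source.Read(buffer, 0, buffer.Length)
            While count > 0
                ' Copy only the bytes actually read in this pass.
                ms.Write(buffer, 0, count)
                count = source.Read(buffer, 0, buffer.Length)
            End While
            Return ms.ToArray()
        End Using
    End Function
End Module
```

The returned array is exactly as long as the data read, so it can be passed straight to an Encoding's GetString method.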
For more information about Unicode issues, see
http://www.pobox.com/~skeet/csharp/unicode.html
http://www.pobox.com/~skeet/csharp/d...ngunicode.html
--
Jon Skeet - <sk***@pobox.com> http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Is the HTML file really a Unicode file? Open the HTML file in notepad
and do a "Save As". What do you see in the "Encoding" option at the end
of the "Save As" dialog? Is it Unicode?
--
Cheers,
Gaurav Vaish http://mastergaurav.org http://mastergaurav.blogspot.com
--------------------------
I got it working after a bit. A big thank you, Jon.
Recognising the file encoding was most of the battle. It wasn't UTF-8 at all -
the Japanese chars were always 2 bytes, not 3. When I started thinking about
shift_jis the solution was obvious - in hindsight.
Instead of using: Encoding.UTF8
I used: Encoding.GetEncoding("shift_jis")
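Put together, the working version might look something like the sketch below. It assumes the .NET Framework and that the whole file really is Shift-JIS; ReadShiftJis and filePath are illustrative names, not from the original code. The StreamReader decodes the bytes straight into a .NET string, so no Encoding.Convert step is needed afterwards.

```vb
Imports System.IO
Imports System.Text

Module ShiftJisReader
    ' Hypothetical helper: decodes a Shift-JIS file into a String.
    Function ReadShiftJis(ByVal filePath As String) As String
        Dim sjis As Encoding = Encoding.GetEncoding("shift_jis")
        Using reader As New StreamReader(filePath, sjis)
            ' ReadToEnd already returns a String, so it can be
            ' used directly for display or storage.
            Return reader.ReadToEnd()
        End Using
    End Function
End Module
```

In the page it would be called along the lines of lblMessage.Text = ReadShiftJis(Page.MapPath("/mixed_byte-jp.html")).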
Thanks!