How to read html files AS IS. Encoding seems to change the characters.

My task is to read HTML files from disk and save them into a SQL Server
database field. I have created an nvarchar(max) field to hold them.
The problem is that some characters, particularly html entities, and
French/German special characters are lost and/or replaced by a
question mark.
This is really frustrating. I have tried using StreamReader with ALL
the encodings available and none work correctly. Each encoding handles
some characters but loses others. I also tried reading into a byte
array, but as soon as I converted the array to string the encoding
ruined the text.
Maybe the solution is not to convert to string? but then how will I
save it to the database?
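
One way to keep a file byte-for-byte is to skip the string conversion and store the raw bytes in a binary column. The following is only a sketch in Python, assuming SQLite (from the standard library) as a stand-in for a SQL Server varbinary(max) column; the table name, column name and sample bytes are made up for illustration.

```python
import sqlite3

# Sample bytes in Windows-1252: e-acute (0xE9), en dash (0x96), i-diaeresis (0xEF).
raw = b"<p>caf\xe9 \x96 na\xefve &eacute;</p>"

conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
conn.execute("CREATE TABLE pages (id INTEGER PRIMARY KEY, body BLOB)")
conn.execute("INSERT INTO pages (body) VALUES (?)", (raw,))

# No decoding happens anywhere, so nothing can be replaced by "?".
stored = conn.execute("SELECT body FROM pages WHERE id = 1").fetchone()[0]
print(stored == raw)
```

The trade-off is that a binary column cannot be searched or displayed as text without decoding it later, which is why the rest of the thread works with strings and explicit encodings.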

Is there a way to get this html text AS IS - with no encoding and no
changes into the database?
I could do it in Delphi and many other pre-.NET languages, so there must
be a way in C# too - surely?!

Thanks a lot for your help,
zoro.

Mar 30 '07
Zoro wrote:
> Thanks again Goran for your help.
>> You are writing it back as UTF-8, as you are not specifying any encoding
>> in the WriteAllText method call.
>
> It looks like I may be able to do it with string after all.
> It looks like the problem with my test was - like you suggested - that
> I didn't specify the write encoding. When I do, as long as I use the
> same encoding when reading and writing, it worked with all 3 codes you
> have suggested (but not with any of the built in codes - e.g. UTF-n!).

Then you have successfully decoded the file into text, as you are not
losing any characters.

If you save the file using utf-8, all the characters will still be
there, as strings are unicode and utf-8 can store any unicode
characters. The reason that your test did not succeed with the unicode
encodings is because the utility that you are using doesn't support
unicode. You would need the "Pro" version for that.
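
To illustrate the point that UTF-8 can hold any character a string contains while a narrower encoding cannot, here is a small Python sketch. The sample text is an assumption chosen to mix German and French letters with an en dash and an em dash.

```python
# German/French letters plus an en dash (U+2013) and an em dash (U+2014).
text = "Gr\u00f6\u00dfe, fa\u00e7ade, \u0153uvre, \u2013 \u2014"

# UTF-8 round-trips every character.
utf8_bytes = text.encode("utf-8")
round_tripped = utf8_bytes.decode("utf-8")
print(round_tripped == text)

# ISO-8859-1 has no slot for the oe ligature or the two dashes;
# with errors="replace" each of them becomes "?".
latin1_bytes = text.encode("iso-8859-1", errors="replace")
print(latin1_bytes.decode("iso-8859-1"))
```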
> I am still not clear on how it's going to work away from the test -
> using the database situation, but I am HOPING it would work as
> follows:
> 1. I will use 1 of these codes to read the file
> 2. then store the string into nvarchar field and add a note informing
> users of the encoding I used
> 3. specify the same encoding when creating the file, after reading the
> string from the db.
>
> Do you think this would work?
> Thanks again,
> zoro.

As you successfully decoded the file to a string, you can store that in
a nvarchar/ntext field and you are done. You can also store the encoding
used if you like to recreate the file exactly, but you can create a file
using any encoding that supports the characters in the text.
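
The read, store, recreate plan can be sketched roughly as follows. This is a Python illustration, not the C# code from the thread: the helper names, the temporary paths and the dictionary standing in for the database row are all hypothetical, and Windows-1252 is only an assumed source encoding.

```python
import os
import tempfile

def read_html(path, encoding):
    """Decode the file with the given encoding and return (text, encoding)."""
    # newline="" keeps line endings exactly as they are in the file.
    with open(path, "r", encoding=encoding, newline="") as f:
        return f.read(), encoding

def write_html(path, text, encoding):
    """Write the text back using the encoding recorded alongside it."""
    with open(path, "w", encoding=encoding, newline="") as f:
        f.write(text)

original = b"<p>Fran\xe7ais \x96 Gr\xf6\xdfe</p>\r\n"  # assumed Windows-1252 bytes
tmp = tempfile.mkdtemp()
src, dst = os.path.join(tmp, "in.html"), os.path.join(tmp, "out.html")
with open(src, "wb") as f:
    f.write(original)

text, enc = read_html(src, "cp1252")                  # 1. decode with a known encoding
record = {"body": text, "encoding": enc}              # 2. stand-in for the text field plus the note
write_html(dst, record["body"], record["encoding"])   # 3. recreate the file

with open(dst, "rb") as f:
    print(f.read() == original)
```

Writing with the recorded encoding reproduces the original bytes; writing with another encoding that covers the same characters gives a different but equally valid file.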

One advantage with using utf-8 encoding is that it places a BOM (byte
order mark) at the beginning of the file, that can be used to identify
the encoding used. If you use File.ReadAllText to read a file that
contains a BOM, it will read the file correctly, even if you specify a
completely different encoding.
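
For readers who want to see what the BOM looks like at the byte level, here is a short Python sketch. The `detect_bom` helper is hypothetical and deliberately simplified: it only checks for UTF-8 and UTF-16 marks and ignores UTF-32.

```python
import codecs

text = "Gr\u00f6\u00dfe"

with_bom = text.encode("utf-8-sig")  # Python's name for UTF-8 preceded by a BOM
without_bom = text.encode("utf-8")

print(with_bom.startswith(codecs.BOM_UTF8))     # the file starts with EF BB BF
print(without_bom.startswith(codecs.BOM_UTF8))

def detect_bom(data):
    """Return an encoding name if the data starts with a known BOM, else None."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith(codecs.BOM_UTF16_LE) or data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"
    return None

print(detect_bom(with_bom), detect_bom(without_bom))
```

A file without a BOM gives no such hint, so the reader still has to know or guess its encoding.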

--
Göran Andersson
_____
http://www.guffa.com
Apr 1 '07 #11
> As you successfully decoded the file to a string, you can store that in
> a nvarchar/ntext field and you are done. You can also store the encoding
> used if you like to recreate the file exactly, but you can create a file
> using any encoding that supports the characters in the text.

This doesn't seem to work. When I tried reading the files using
encoding 28591 (ISO-8859-1) and writing using 65001 (UTF-8), I seem to
have lost the dashes (-) in the text. I am not sure why this happens,
but it looks like I must save the text with the same encoding I read
from the file.
btw, this is not only observed in the utility, I also opened the file
in a browser and the browser interpreted it differently too.

Thanks again,
zoro.

Apr 1 '07 #12
Zoro wrote:
>> As you successfully decoded the file to a string, you can store that in
>> a nvarchar/ntext field and you are done. You can also store the encoding
>> used if you like to recreate the file exactly, but you can create a file
>> using any encoding that supports the characters in the text.
>
> This doesn't seem to work. When I tried reading the files using
> encoding 28591 (ISO-8859-1) and writing using 65001 (UTF-8), I seem to
> have lost the dashes (-) in the text. I am not sure why this happens,
> but it looks like I must save the text with the same encoding I read
> from the file.
> btw, this is not only observed in the utility, I also opened the file
> in a browser and the browser interpreted it differently too.
>
> Thanks again,
> zoro.

That is strange. A dash is a regular ASCII character, which should work
in any encoding.

However, the browser might misinterpret the encoding. Have you tried
opening the file in Notepad?

--
Göran Andersson
_____
http://www.guffa.com
Apr 1 '07 #13
> That is strange. A dash is a regular ASCII character, which should work
> in any encoding.
>
> However, the browser might misinterpret the encoding, have you tried to
> open the file in Notepad?

I did open them in Notepad. In the original file it's a dash and in
the new file it's a question mark (?) or missing char - depending on
the encoding.
However, I suspect this is because they are en dashes or em dashes
(bytes 150 and 151 in Windows-1252), which are not ASCII characters.
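
A quick way to check this suspicion is to decode bytes 150 and 151 under both encodings. The sketch below uses Python; `cp1252` and `iso-8859-1` are Python's codec names for Windows-1252 and code page 28591, and the sample bytes are made up.

```python
raw = b"a \x96 b \x97 c"  # bytes 150 and 151 between plain ASCII letters

as_1252 = raw.decode("cp1252")       # Windows-1252: en dash and em dash
as_28591 = raw.decode("iso-8859-1")  # ISO-8859-1: invisible C1 control characters

print([hex(ord(ch)) for ch in as_1252 if ord(ch) > 127])   # ['0x2013', '0x2014']
print([hex(ord(ch)) for ch in as_28591 if ord(ch) > 127])  # ['0x96', '0x97']

# The same encoding in and out restores the bytes, which hides the wrong decode.
print(as_28591.encode("iso-8859-1") == raw)  # True

# Switching to UTF-8 on output writes the control characters, not dashes.
print(as_28591.encode("utf-8"))  # b'a \xc2\x96 b \xc2\x97 c'
```

If this is what happened, reading the files as Windows-1252 instead of 28591 should keep the dashes when writing UTF-8.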
Thanks,
zoro.

Apr 1 '07 #14
> As you successfully decoded the file to a string, you can store that in
> a nvarchar/ntext field and you are done. You can also store the encoding
> used if you like to recreate the file exactly, but you can create a file
> using any encoding that supports the characters in the text.

One problem with loading with one encoding and saving with another is
that if the original encoding was not correct, you get junk.
Imagine loading a Greek file using the Russian codepage (cp1251), then
saving it as UTF-8.
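
The Greek example can be reproduced in a few lines. This is a Python sketch that assumes the file was really Windows-1253 (Greek) and was decoded by mistake with Windows-1251 (Cyrillic); the sample word is arbitrary.

```python
greek = "\u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac"  # the Greek word for "Greek"
on_disk = greek.encode("cp1253")   # how a Windows-1253 file stores it

wrong = on_disk.decode("cp1251")   # decoded with the Cyrillic code page by mistake
saved = wrong.encode("utf-8")      # the junk is now stored as perfectly valid UTF-8

print(wrong)                           # Cyrillic letters, not the Greek original
print(saved.decode("utf-8") == greek)  # the saved text no longer matches

# The damage can be undone only if the wrongly used encoding is known.
repaired = saved.decode("utf-8").encode("cp1251").decode("cp1253")
print(repaired == greek)
```

No error is raised at any step, which is what makes this kind of corruption easy to miss.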

The second problem arises if some files in fact have encoding metadata
in the head (or will start having it at some point).
You then end up writing a file as UTF-8 while the meta tag says
something else (the original encoding).
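
A rough way to spot such a mismatch before writing is to look for the declared charset in the first part of the file. The Python sketch below is an assumption-laden illustration: the regular expression and the `declared_charset` helper are hypothetical, only handle simple `<meta ... charset=...>` forms, and are no substitute for a real HTML parser.

```python
import re

# Matches charset=... inside a <meta> tag, with or without quotes.
META_CHARSET = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9._-]+)', re.IGNORECASE
)

def declared_charset(html_bytes):
    """Return the charset named in a <meta> tag near the start of the file, or None."""
    match = META_CHARSET.search(html_bytes[:2048])
    return match.group(1).decode("ascii").lower() if match else None

html = ('<html><head><meta http-equiv="Content-Type" '
        'content="text/html; charset=windows-1252"></head>'
        '<body>Gr\u00f6\u00dfe</body></html>')

written = html.encode("utf-8")  # file rewritten as UTF-8, meta tag left untouched
declared = declared_charset(written)
print(declared, declared != "utf-8")  # the declaration now contradicts the bytes
```

When the two disagree, either the file should be written in the declared encoding or the meta tag should be updated to match.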

Other than this it should be OK.
--
Mihai Nita [Microsoft MVP, Windows - SDK]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email
Apr 3 '07 #15
