HTML/XML character encoding getting changed

I have written a software application called PowerBlog (PowerBlog.net)
that uses the editing capability of the Internet Explorer WebBrowser
control (essentially a DHTMLTextBox), extracts the user-typed HTML, and assigns
it to an XML node's InnerText property (in C#: System.Xml.XmlDocument
obj; obj.InnerText = myHTML). Later I read the InnerText back as a string and
write it to disk.

When this text is displayed in a web browser, special characters that are
beyond the standard ASCII charset are not rendered correctly. Frequently, I
have copied text from a web site, pasted it into the DHTMLTextBox, saved, and
published it, and my published output has corrupt characters. However, prior
to publishing, when previewing my document it looks fine -- it is only when
it is published (extracted, written to disk, uploaded to the server via FTP,
downloaded via HTTP) that the corruption occurs.

There are several places where this problem could be occurring, and I don't
know how to figure it out.

- A "design feature" in the XmlNode's InnerText property that converts the
&###; encoding into an actual character.
- An encoding flaw when written to disk (currently I'm using the default,
UTF-8 I guess).
- A flaw in the FTP client class where the file is being corrupted during
upload (I think I'm using binary upload format but perhaps I should
double-check).
- A flaw in IIS (no known strange settings exist)
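
The first hypothesis can be checked in isolation from the rest of the pipeline. The sketch below is in Python rather than C#, and uses the standard library's ElementTree as a stand-in for System.Xml; the assumption (not verified here against .NET) is that any conforming XML API behaves the same way. It shows the two directions separately: a parser resolves a numeric character reference into the real character, while text assigned through the API is treated as literal characters and escaped on output.

```python
import xml.etree.ElementTree as ET

# Parsing: the XML parser resolves the numeric character reference &#233;
# into the actual character before the application sees the text.
parsed = ET.fromstring("<post>caf&#233;</post>")
parsed_text = parsed.text

# Assigning: text set through the API is taken literally, so the
# ampersand itself is escaped when the node is serialized.
node = ET.Element("post")
node.text = "caf&#233;"
serialized = ET.tostring(node, encoding="unicode")

print(repr(parsed_text))
print(serialized)
```

If the equivalent check in the real application shows the expected characters at this stage, the XML layer is probably not where the corruption happens.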

I still need to do some homework on this but I was wondering if anyone has
any bright ideas before I continue searching this out?

Thanks,
Jon
Nov 15 '05 #1

"Jon Davis" <jo*@REMOVE.ME.PLEASE.jondavis.net> wrote in message
news:Ok**************@tk2msftngp13.phx.gbl...
> - A "design feature" in the XmlNode's InnerText property that converts the
> &###; encoding into an actual character.
> - An encoding flaw when written to disk (currently I'm using the default,
> UTF-8 I guess).
> - A flaw in the FTP client class where the file is being corrupted during
> upload (I think I'm using binary upload format but perhaps I should
> double-check).
> - A flaw in IIS (no known strange settings exist)

For starters I'd rule out the last two options - I think it's almost got to
be in character encoding or the way you're writing it to disk.

As you note, if the source text in your DHTML component is stored in
a different encoding from the one you use to write to disk, then
you'll lose information, or it will be written incorrectly. Most encodings
store the ASCII characters (values 0 to 127) identically, so errors only
become obvious at 128 and above.
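
A small sketch of that point, in Python because the byte values are the same in any language. The list of encodings and the helper name `encode_all` are illustrative choices, not anything from PowerBlog. It encodes one ASCII character and one accented character in several common encodings and prints the resulting bytes.

```python
ENCODINGS = ["ascii", "iso-8859-1", "cp1252", "utf-8"]

def encode_all(text):
    """Return {encoding: bytes}, with None where the encoding cannot represent text."""
    result = {}
    for name in ENCODINGS:
        try:
            result[name] = text.encode(name)
        except UnicodeEncodeError:
            result[name] = None
    return result

ascii_bytes = encode_all("A")
accented_bytes = encode_all("\u00e9")  # e with acute accent, code point 233

for name in ENCODINGS:
    print(name, ascii_bytes[name], accented_bytes[name])
```

The letter "A" is the single byte 0x41 in all four, while the accented letter is one byte (0xE9) in iso-8859-1 and cp1252 and two bytes (0xC3 0xA9) in UTF-8. UTF-16 is the exception to "most encodings": it uses two bytes even for ASCII characters.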

I'd be interested to find out what encoding the DHTML control is using to
store its source code. UCS-2 is, as far as I'm aware, the standard Windows
encoding, so you might want to try writing out to disk using this encoding
rather than UTF-8. The StreamWriter class lets you set the encoding before
writing. Hopefully you'll not get any loss of information, which is what is
happening when you try to write UCS-2 as UTF-8! Just a guess, but worth a
try!?
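
As a rough illustration of choosing the encoding at write time, here is a Python sketch (the thread's StreamWriter is the .NET counterpart; this is not its API). The helper `write_and_read_bytes` and the sample string are made up for the example. It writes the same text twice with an explicit encoding and reads the raw bytes back, so the on-disk difference is visible.

```python
import os
import tempfile

html = "<p>caf\u00e9 \u20ac5</p>"  # contains two characters outside ASCII

def write_and_read_bytes(text, encoding):
    """Write text with an explicit encoding and return the raw bytes on disk."""
    fd, path = tempfile.mkstemp(suffix=".html")
    os.close(fd)
    try:
        # newline="" keeps line endings untouched so only the encoding varies
        with open(path, "w", encoding=encoding, newline="") as f:
            f.write(text)
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)

utf8_bytes = write_and_read_bytes(html, "utf-8")
utf16_bytes = write_and_read_bytes(html, "utf-16-le")

print(len(utf8_bytes), len(utf16_bytes))
```

Both files decode back to the identical string when read with the encoding they were written in; what matters downstream is that the reader is told which one was used.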

HTH

Tobin

Nov 15 '05 #2
Thanks Tobin. I'll check out UCS-2, et al.

Jon

Nov 15 '05 #3
> Hopefully you'll not get any loss of information, which is what is
> happening when you try to write UCS-2 as UTF-8!

Unlikely. UCS-2 and UTF-8 are two different representations of the same
character set (Unicode). There is no loss of information when you convert
from one to the other (if the conversion is done correctly).
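
This is easy to confirm with a round trip. The sketch below is Python, with a sample string chosen for the example; it uses UTF-16 little-endian, of which UCS-2 is the fixed-width subset. It transcodes to UTF-8 and back, then shows for contrast what a genuinely lossy target looks like.

```python
original = "na\u00efve caf\u00e9 \u20ac"

# UTF-16 -> UTF-8 -> UTF-16, decoding with the matching codec at each step.
as_utf16 = original.encode("utf-16-le")
as_utf8 = as_utf16.decode("utf-16-le").encode("utf-8")
back_to_utf16 = as_utf8.decode("utf-8").encode("utf-16-le")

round_trip_ok = back_to_utf16 == as_utf16 and as_utf8.decode("utf-8") == original
print(round_trip_ok)

# Loss appears only when the target encoding cannot represent a character.
lossy = original.encode("ascii", errors="replace")
print(lossy)
```

The first print shows that nothing is lost between the two Unicode encodings. The second shows the three non-ASCII characters replaced by question marks, which is what real information loss looks like.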

> - A flaw in the FTP client class where the file is being corrupted during
> upload (I think I'm using binary upload format but perhaps I should
> double-check).

Unlikely. Even if binary mode is not set, the only characters damaged will be
the control characters (below 0x20).
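
To make that concrete, here is a simulation in Python. It is not an FTP client: `simulate_ascii_mode` is a hypothetical stand-in that assumes a text-mode transfer only rewrites line endings (LF to CRLF), which is an assumption about the client and server in use. Under that assumption the multi-byte UTF-8 sequences pass through untouched.

```python
def simulate_ascii_mode(data: bytes) -> bytes:
    """Hypothetical text-mode transfer: normalize line endings to CRLF, change nothing else."""
    return data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")

page = "<p>caf\u00e9</p>\n<p>\u20ac5</p>\n".encode("utf-8")
transferred = simulate_ascii_mode(page)

# Bytes at 0x80 and above carry the non-ASCII characters in UTF-8.
high_bytes_before = [b for b in page if b >= 0x80]
high_bytes_after = [b for b in transferred if b >= 0x80]
decoded = transferred.decode("utf-8")

print(high_bytes_before == high_bytes_after)
print(repr(decoded))
```

Only the line endings differ after the simulated transfer. Comparing the local and uploaded files byte for byte would settle whether the real FTP class behaves this way.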

> - An encoding flaw when written to disk (currently I'm using the default,
> UTF-8 I guess).

Most probable. As a test, add this to the HTML file, as the first element in
the <head> section:

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8">

Without it the browser will typically assume the default is iso-8859-1.
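
The effect of a missing charset declaration can be reproduced without a browser. In this Python sketch, `render_as` is an illustrative helper that decodes bytes the way a browser would under an assumed charset; the sample word is made up. The file content is correct UTF-8 in both cases, and only the assumed charset changes.

```python
def render_as(raw: bytes, assumed_encoding: str) -> str:
    """Decode bytes as a browser would if it assumed the given charset."""
    return raw.decode(assumed_encoding, errors="replace")

on_disk = "caf\u00e9".encode("utf-8")  # what a UTF-8 writer puts in the file

wrong = render_as(on_disk, "iso-8859-1")  # no charset declared
right = render_as(on_disk, "utf-8")       # charset=utf-8 declared

print(wrong)
print(right)
```

The two UTF-8 bytes of the accented letter show up as two separate iso-8859-1 characters in the first case, which matches the "corrupt characters" symptom described at the top of the thread even though the file itself is intact.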

Mihai

Nov 15 '05 #4
