
HTML/XML character encoding getting changed

I have written a software application called PowerBlog (PowerBlog.net)
that takes the editing capability of the Internet Explorer WebBrowser
control (essentially a DHTMLTextBox), extracts the user-typed HTML, and
assigns it to an XML node's InnerText property (using C#:
System.Xml.XmlDocument obj; obj.InnerText = myHTML). Later I read the
InnerText back as a string and write it to disk.

When this text is displayed in a web browser, special characters that are
beyond the standard ASCII charset are not rendered correctly. Frequently, I
have copied text from a web site, pasted in the DHTMLTextbox, saved, and
published it, and my published output has corrupt characters. However, prior
to publishing, when previewing my document it looks fine -- it is only when
it is published (extracted, written to disk, uploaded to the server via FTP,
downloaded via HTTP) that the corruption occurs.

There are several places where this problem could be occurring, and I don't
know how to figure it out.

- A "design feature" in the XmlNode's InnerText property that converts the
&###; encoding into an actual character.
- An encoding flaw when written to disk (currently I'm using the default,
UTF-8 I guess).
- A flaw in the FTP client class where the file is being corrupted during
upload (I think I'm using binary upload format but perhaps I should
double-check).
- A flaw in IIS (no known strange settings exist)
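
The first hypothesis can be checked in isolation. The sketch below uses Python's standard xml.etree.ElementTree rather than System.Xml (an assumption made only to keep the example self-contained; the element name `post` is hypothetical). A conforming XML parser expands a numeric character reference into the character it names, so this step is expected behaviour rather than corruption. Whether the character survives depends on how the string is encoded afterwards.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: "caf" followed by the numeric reference for U+00E9.
source = "<post>caf&#233;</post>"

node = ET.fromstring(source)
inner_text = node.text

# The parser has already replaced the reference with the real character,
# so the text is 4 characters long rather than the 9 typed in the source.
print(len(inner_text))
print(inner_text == "caf\u00e9")

# Serialising to a Unicode string keeps the character itself.
serialised = ET.tostring(node, encoding="unicode")
print(serialised)
```

If the text read back from the node already contains the real character, the reference expansion is not the problem and the later encoding steps are the ones to examine.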

I still need to do some homework on this but I was wondering if anyone has
any bright ideas before I continue searching this out?

Thanks,
Jon
Nov 15 '05 #1
3 Replies


"Jon Davis" <jo*@REMOVE.ME.PLEASE.jondavis.net> wrote in message
news:Ok**************@tk2msftngp13.phx.gbl...
> - A "design feature" in the XmlNode's InnerText property that converts the
> &###; encoding into an actual character.
> - An encoding flaw when written to disk (currently I'm using the default,
> UTF-8 I guess).
> - A flaw in the FTP client class where the file is being corrupted during
> upload (I think I'm using binary upload format but perhaps I should
> double-check).
> - A flaw in IIS (no known strange settings exist)

For starters I'd rule out the last two options. I think it's almost got to
be in the character encoding or the way you're writing it to disk.

As you've noticed, if the source text in your DHTML component is stored in
a different encoding from the one you use to write to disk, then you'll
lose information, or it will be written incorrectly. Most encodings store
the ASCII characters (0 to 127) identically, so errors only become obvious
from 128 upwards.
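
To make that concrete, here is a minimal sketch that encodes one ASCII character and one accented character in a few common encodings. It is written in Python rather than C# purely to show the bytes, and the choice of encodings and sample characters is an assumption for illustration.

```python
# Compare how several encodings store an ASCII and a non-ASCII character.
ENCODINGS = ["ascii", "iso-8859-1", "cp1252", "utf-8"]

def encoded_forms(ch):
    """Return {encoding name: bytes, or None if the character cannot be encoded}."""
    forms = {}
    for name in ENCODINGS:
        try:
            forms[name] = ch.encode(name)
        except UnicodeEncodeError:
            forms[name] = None  # not representable in this encoding
    return forms

ascii_forms = encoded_forms("A")
accent_forms = encoded_forms("\u00e9")  # e with acute accent

for name in ENCODINGS:
    print(name, ascii_forms[name], accent_forms[name])
```

"A" comes out as the same single byte everywhere, while the accented character is one byte in ISO-8859-1 and Windows-1252, two bytes in UTF-8, and not representable in plain ASCII at all. UTF-16 is the notable exception to the "ASCII is stored the same" rule, since it uses two bytes even for ASCII characters.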

I'd be interested to find out what encoding the DHTML control uses to
store its source text. UCS-2 is, as far as I'm aware, the standard Windows
encoding, so you might want to try writing out to disk using this encoding
rather than UTF-8. StreamWriter lets you specify the encoding when you
construct it. Hopefully you'll not get any loss of information, which is what is
happening when you try to write UCS-2 as UTF-8! Just a guess, but worth a
try!?
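
As a rough illustration of choosing the encoding explicitly rather than relying on a default, the sketch below writes the same string twice and looks at the raw bytes on disk. It is Python, not the .NET StreamWriter API, and the file names and sample text are hypothetical. Python has no separate UCS-2 codec, so UTF-16 little-endian stands in for it here; the two are identical for the characters used.

```python
import os
import tempfile

def write_text(path, text, encoding):
    """Write text using an explicitly named encoding rather than a default."""
    with open(path, "w", encoding=encoding, newline="") as handle:
        handle.write(text)

def read_bytes(path):
    """Return the raw bytes stored on disk."""
    with open(path, "rb") as handle:
        return handle.read()

sample = "caf\u00e9"  # hypothetical text with one non-ASCII character

with tempfile.TemporaryDirectory() as folder:
    utf8_path = os.path.join(folder, "utf8.html")
    utf16_path = os.path.join(folder, "utf16.html")
    write_text(utf8_path, sample, "utf-8")
    write_text(utf16_path, sample, "utf-16-le")
    utf8_bytes = read_bytes(utf8_path)
    utf16_bytes = read_bytes(utf16_path)

print(utf8_bytes)
print(utf16_bytes)
```

Both files hold the same text, but the bytes differ, so whatever reads the file (here, eventually, the browser) has to be told which encoding was used.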

HTH

Tobin

Nov 15 '05 #2

Thanks Tobin. I'll check out UCS-2, et al.

Jon

Nov 15 '05 #3

> Hopefully you'll not get any loss of information, which is what is
> happening when you try to write UCS-2 as UTF-8!

Unlikely. UCS-2 and UTF-8 are two different encodings of the same
character set (Unicode). No information is lost when you convert from one
to the other (if the conversion is done correctly).
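
A quick round trip shows this. The sketch is in Python and uses UTF-16 little-endian as a stand-in for UCS-2 (an assumption: Python has no dedicated UCS-2 codec, and the two agree for characters in the Basic Multilingual Plane, which is all the sample uses). The sample string is hypothetical, chosen to resemble text pasted from a web page.

```python
def transcode(data, source, target):
    """Decode bytes from one encoding and re-encode them in another."""
    return data.decode(source).encode(target)

# Curly quotes, a euro sign and an accented letter.
original = "\u201cquote\u201d \u20ac5 caf\u00e9"

as_utf16 = original.encode("utf-16-le")
as_utf8 = transcode(as_utf16, "utf-16-le", "utf-8")
back_to_utf16 = transcode(as_utf8, "utf-8", "utf-16-le")

print(back_to_utf16 == as_utf16)
print(as_utf8.decode("utf-8") == original)
```

The bytes differ between the two forms, but decoding either one yields the same characters, so the conversion itself loses nothing.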
> - A flaw in the FTP client class where the file is being corrupted during
> upload (I think I'm using binary upload format but perhaps I should
> double-check).
Unlikely. Even if binary mode is not set, the only characters damaged will
be control characters (below 0x20), such as line endings.
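
A simplified model of that, for illustration only: the function below is a hypothetical stand-in for an ASCII-mode transfer that does nothing except normalise line endings to CRLF. Real FTP clients and servers vary, so treat it as a sketch of the idea rather than a description of any particular implementation.

```python
def simulate_ascii_transfer(data):
    """Rough model of ASCII-mode transfer: rewrite every line ending as CRLF."""
    normalised = data.replace(b"\r\n", b"\n")
    return normalised.replace(b"\n", b"\r\n")

# Hypothetical page content containing non-ASCII characters, stored as UTF-8.
page = "<p>caf\u00e9</p>\n<p>\u20ac5</p>\n".encode("utf-8")
transferred = simulate_ascii_transfer(page)

# Bytes at or above 0x80 make up the multi-byte UTF-8 sequences.
high_before = [b for b in page if b >= 0x80]
high_after = [b for b in transferred if b >= 0x80]

print(high_before == high_after)
print(transferred.decode("utf-8").replace("\r\n", "\n") == page.decode("utf-8"))
```

Every byte of a multi-byte UTF-8 sequence is 0x80 or higher, so a translation that only touches line-ending bytes cannot break the non-ASCII characters.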
> - An encoding flaw when written to disk (currently I'm using the default,
> UTF-8 I guess).

Most probable. As a test, add this to the HTML file as the first element
in the <head> section:

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8">

Without it, the browser will typically fall back to a default such as
ISO-8859-1.
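
The effect is easy to reproduce. In the Python sketch below, `browser_decode` is a hypothetical stand-in for what a browser does with the page bytes, and the ISO-8859-1 fallback is an assumption taken from the statement above rather than a claim about any specific browser.

```python
FALLBACK_CHARSET = "iso-8859-1"  # assumed default when no charset is declared

def browser_decode(page_bytes, declared_charset=None):
    """Decode page bytes with the declared charset, else with the fallback."""
    return page_bytes.decode(declared_charset or FALLBACK_CHARSET)

# The file on disk was written as UTF-8.
page_bytes = "caf\u00e9".encode("utf-8")

without_meta = browser_decode(page_bytes)
with_meta = browser_decode(page_bytes, "utf-8")

print(without_meta)  # the two UTF-8 bytes of the accented letter show as two characters
print(with_meta)     # the original text
```

The bytes are identical in both cases; only the charset used to interpret them changes. That matches the symptom described: the document looks fine in preview and only turns to garbage once it is served and the browser has to guess.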

Mihai

Nov 15 '05 #4
