
About charset setting and replacing

Hi there,
I am writing a program that loads HTML from a file and sends it to IE
directly, and I have run into a problem with the charset setting. Most of
the HTML files declare the charset "us-ascii", but for some reason some
Unicode text has to be inserted into the HTML before it is sent to IE.
My questions are:

1) Can I specify a different charset for a single element, e.g.
<span charset="UTF-8">SOME UNICODE HERE</span>

2) If the answer to 1) is "no", is there any way to change the charset of
the original HTML? Because I have no HTML parser handy, I can only search
and replace the charset programmatically. I've checked several HTML files
and found the charset declared like this:

<META http-equiv=Content-Type content="text/html; charset=us-ascii">

So, to make the program replace the right text, I search for the
keyword "charset=" and get its position, then search for the position
of the closing double quotation mark, and finally replace the substring
in between with UTF-8. Everything seems fine. However, I am worried that
there may be exceptions. Could these, for example, occur?

<META http-equiv=Content-Type content="text/html;" charset="us-ascii">

OR

<META http-equiv=Content-Type content='text/html;' charset='us-ascii'>

OR

<META http-equiv=Content-Type content='text/html; charset=us-ascii'>
Is there a better approach to my problem?

P.S. Someone suggested that I send the original code to IE and then call
IE's charset-setting function to change the charset. I tried that, but
after changing the charset my Unicode text turned into meaningless
characters!

Thanks in advance.

Jul 14 '06 #1
7 Replies


gm****@21cn.com writes:
> I am writing a program to load HTML from file and send it to IE
> directly. I've met some problem in charset setting. Most of HTML have
> charset "us-ascii", for some reason, some UNICODE TEXT will be
> inserted into the HTML before sending to IE. The problem is
>
> 1) Can I specify special charset for some component, e.g.
> <span charset="UTF-8">SOME UNICODE HERE</span>

No.

> 2) If "NO" for 1), so any way to change the charset of the original
> HTML? Because I have no HTML parser handy, I can only SEARCH & REPLACE
> the charset programmatically. I've checked the several HTML and find the
> CHARSET format like
>
> <META http-equiv=Content-Type content="text/html; charset=us-ascii">

The usual best solution is to set the real HTTP content type header:
    Content-type: text/html; charset=UTF-8
This will override the <meta> element if there is one, so you don't need
to worry about the format.

Since any valid us-ascii character is also (the same) valid UTF-8
character you might as well do this all the time.
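
To illustrate the point about US-ASCII and UTF-8, here is a minimal Python sketch. Python is an editorial choice (the thread names no language) and the function name is hypothetical. It checks that pure-ASCII text encodes to the same bytes under both encodings, which is why relabelling an ASCII page as UTF-8 changes nothing.

```python
def same_bytes_in_ascii_and_utf8(text: str) -> bool:
    """Return True if text encodes to identical bytes in US-ASCII and UTF-8."""
    try:
        ascii_bytes = text.encode("ascii")
    except UnicodeEncodeError:
        # The text contains non-ASCII characters, so US-ASCII cannot hold it.
        return False
    return ascii_bytes == text.encode("utf-8")


if __name__ == "__main__":
    # Pure-ASCII markup: the bytes are the same either way.
    print(same_bytes_in_ascii_and_utf8("<p>Hello, world!</p>"))
    # Markup with inserted Asian characters: only UTF-8 can represent it.
    print(same_bytes_in_ascii_and_utf8("<p>Hello, 世界</p>"))
```

The second case is the situation in the question: once non-ASCII text is inserted, the page has to be declared and saved as UTF-8 (or the inserted text has to be escaped).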

However, from the description you gave, it doesn't sound like you're
using HTTP.
> So, for leading the program to replace the correct one, I search the
> keyword "charset=" and get the position, and then search the position
> of double quotation marks, finally, I replace the substring with UTF8,
> everything seems fine. However, I am worrying about if there are some
> exceptions. Will these, for example, happen?
>
> <META http-equiv=Content-Type content="text/html;" charset="us-ascii">
> <META http-equiv=Content-Type content='text/html;' charset='us-ascii'>

No.

> <META http-equiv=Content-Type content='text/html; charset=us-ascii'>

Might happen.

Additionally, the attribute names and tag name may or may not be
(partially) capitalised, as may the charset value, and possibly other
bits. There may be a slash immediately before the end of the tag (if
it's an XHTML document rather than an HTML document). The order of the
attributes may be reversed, so:
<MeTA ConTenT='text/html; charset=US-ascII'
htTp-EQUiv="Content-Type" />
is an unusual combination of the above, but still perfectly legal...

You might also get cases which have nothing to do with a <meta>
element, but trigger your pattern matching anyway.
> Any better approach for my problem?

Setting the HTTP headers is the best solution. If you can't do that
then using a real HTML parser is likely to be more reliable than any
search-and-replace you put together.
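
As a rough sketch of the parser-based route, the following uses Python's standard html.parser module to find the <meta http-equiv=Content-Type> tag regardless of capitalisation, quoting style, attribute order, or a trailing slash, and rewrites the charset only inside that tag. The class and function names are hypothetical, and it is an illustration under those assumptions rather than a complete solution.

```python
import re
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Collect the raw text of each <meta http-equiv="Content-Type"> tag."""

    def __init__(self):
        super().__init__()
        self.raw_tags = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names for us.
        attr_map = {name: (value or "") for name, value in attrs}
        if tag == "meta" and attr_map.get("http-equiv", "").lower() == "content-type":
            self.raw_tags.append(self.get_starttag_text())


def set_meta_charset(html: str, new_charset: str = "UTF-8") -> str:
    """Replace the charset value inside Content-Type meta tags only."""
    finder = MetaCharsetFinder()
    finder.feed(html)
    finder.close()
    for raw in finder.raw_tags:
        # Rewrite just the value after "charset=" within the matched tag.
        fixed = re.sub(
            r"(charset\s*=\s*)[^\s;\"']+",
            lambda m: m.group(1) + new_charset,
            raw,
            flags=re.IGNORECASE,
        )
        html = html.replace(raw, fixed)
    return html


if __name__ == "__main__":
    page = (
        "<html><head><MeTA ConTenT='text/html; charset=US-ascII'\n"
        'htTp-EQUiv="Content-Type" /></head>'
        "<body><p>charset=us-ascii</p></body></html>"
    )
    print(set_meta_charset(page))
```

Because the substitution is applied only to the text of the matched tag, the "charset=us-ascii" string in the body paragraph is left alone, which is the false-positive case a plain search for "charset=" would trip over.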

--
Chris
Jul 14 '06 #2

gm****@21cn.com wrote:
> Hi there, I am writing a program to load HTML from file and send it
> to IE directly. I've met some problem in charset setting. Most of
> HTML have charset "us-ascii", for some reason, some UNICODE TEXT
> will be inserted into the HTML before sending to IE. The problem is
>
> 1) Can I specify special charset for some component, e.g. <span
> charset="UTF-8">SOME UNICODE HERE</span>

1. UTF-8 isn't a charset, it's an encoding.
2. The UTF-8 encoding includes and encompasses all of US-ASCII.
3. Encodings apply to pages, not to HTML fragments.

If you create a page that is encoded as UTF-8, and serve it as UTF-8,
US-ASCII characters will automatically be rendered correctly.

What I don't understand is what you mean by "send it to IE directly".
Are you writing a server? If so, then you need to look into how to serve
pages encoded as UTF-8 (and that would be off-topic here).

--
Jack.
Jul 14 '06 #3

gm****@21cn.com wrote:
>> Hi there, I am writing a program to load HTML from file and send it
>> to IE directly. I've met some problem in charset setting. Most of
>> HTML have charset "us-ascii", for some reason, some UNICODE TEXT
>> will be inserted into the HTML before sending to IE. The problem is
>>
>> 1) Can I specify special charset for some component, e.g. <span
>> charset="UTF-8">SOME UNICODE HERE</span>
>
> 1. UTF-8 isn't a charset, it's an encoding.

Anyway, the following meta element is extracted from a real page (the
HTML source of a Google search results page):

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

> 2. The UTF-8 encoding includes and encompasses all of US-ASCII.
> 3. Encodings apply to pages, not to HTML fragments.
>
> If you create a page that is encoded as UTF-8, and serve it as UTF-8,
> US-ASCII characters will automatically be rendered correctly.

What I mean is: I insert some Unicode text (e.g. Asian characters) into
the HTML, so if the charset is US-ASCII, the text cannot be rendered
correctly.

> What I don't understand is what you mean by "send it to IE directly".
> Are you writing a server? If so, then you need to look into how to serve
> pages encoded as UTF-8 (and that would be off-topic here).

I am sorry for misleading you. I am writing a client which sends the
HTML code to IE through Microsoft's IWebBrowser2 and IHTMLDocument2
interfaces. With those interfaces, I can change the HTML of any page
dynamically.

Jul 14 '06 #4


Chris Morris wrote:
> gm****@21cn.com writes:
>> I am writing a program to load HTML from file and send it to IE
>> directly. I've met some problem in charset setting. Most of HTML have
>> charset "us-ascii", for some reason, some UNICODE TEXT will be
>> inserted into the HTML before sending to IE. The problem is
>>
>> 1) Can I specify special charset for some component, e.g.
>> <span charset="UTF-8">SOME UNICODE HERE</span>
>
> No.
>
>> 2) If "NO" for 1), so any way to change the charset of the original
>> HTML? Because I have no HTML parser handy, I can only SEARCH & REPLACE
>> the charset programmatically. I've checked the several HTML and find the
>> CHARSET format like
>>
>> <META http-equiv=Content-Type content="text/html; charset=us-ascii">
>
> The usual best solution is to set the real HTTP content type header.
> Content-type: text/html; charset=UTF-8
> This will override the <meta> element if there is one, so you don't need
> to worry about the format.
>
> Since any valid us-ascii character is also (the same) valid UTF-8
> character you might as well do this all the time.
>
> However, from the description you gave, it doesn't sound like you're
> using HTTP.

I am writing a client that changes HTML dynamically. All of the HTML is
saved on the local hard disk; it has nothing to do with any network
protocol.
>> So, for leading the program to replace the correct one, I search the
>> keyword "charset=" and get the position, and then search the position
>> of double quotation marks, finally, I replace the substring with UTF8,
>> everything seems fine. However, I am worrying about if there are some
>> exceptions. Will these, for example, happen?
>>
>> <META http-equiv=Content-Type content="text/html;" charset="us-ascii">
>> <META http-equiv=Content-Type content='text/html;' charset='us-ascii'>
>
> No.
>
>> <META http-equiv=Content-Type content='text/html; charset=us-ascii'>
>
> Might happen.
>
> Additionally, the attribute names and tag name may or may not be
> (partially) capitalised, as may the charset value, and possibly other
> bits. There may be a slash immediately before the end of the tag (if
> it's an XHTML document rather than an HTML document). The order of the
> attributes may be reversed, so:
> <MeTA ConTenT='text/html; charset=US-ascII'
> htTp-EQUiv="Content-Type" />
> is an unusual combination of the above, but still perfectly legal...

I am not very familiar with HTML. Given what you mention above, is the
following valid for both HTML and XHTML?

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
> You might also get cases which have nothing to do with a <meta>
> element, but trigger your pattern matching anyway.
>
>> Any better approach for my problem?
>
> Setting the HTTP headers is the best solution. If you can't do that
> then using a real HTML parser is likely to be more reliable than any
> search-and-replace you put together.

Thanks. I see.

Jul 14 '06 #5

gm****@21cn.com writes:
> I am not quite familiar with HTML,

See http://www.w3.org/TR/HTML4/ for the official specifications.

> As you mention above, for both HTML
> and XHTML, if the following valid ?
>
> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

No - this is valid in HTML, but not in XHTML. Internet Explorer does
not support XHTML and treats it as if it were HTML. You may find in
XHTML source documents something like this:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
which is valid in XHTML but not valid in HTML.

--
Chris
Jul 14 '06 #6

gm****@21cn.com <gm****@21cn.com> scripsit:

>> 1. UTF-8 isn't a charset, it's an encoding.
>
> Anyway, the following meta is extract from some page (the source HTML
> of the searching result of google)
>
> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

The meta tag itself is correct, though it will be (by the specifications and
in actual practice) ignored, if the server specifies a charset parameter in
actual HTTP headers. You need to find out what the server does, typically by
using an HTTP header viewer.
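
The precedence described here can be sketched in Python as follows. The function names are hypothetical, and the sketch models only the two sources discussed in this thread (the HTTP header and the meta element), not everything a browser may take into account. It uses the standard library's email.message.Message to read the charset parameter from a Content-Type value.

```python
from email.message import Message
from typing import Optional


def charset_param(content_type_value: Optional[str]) -> Optional[str]:
    """Extract the lowercased charset parameter from a Content-Type value."""
    if not content_type_value:
        return None
    msg = Message()
    msg["Content-Type"] = content_type_value
    return msg.get_content_charset()


def effective_charset(http_header: Optional[str], meta_content: Optional[str]) -> Optional[str]:
    """The HTTP header's charset wins; the meta element is only a fallback."""
    return charset_param(http_header) or charset_param(meta_content)


if __name__ == "__main__":
    # The server sends a charset, so the meta element is ignored.
    print(effective_charset("text/html; charset=ISO-8859-1",
                            "text/html; charset=UTF-8"))
    # No charset in the header (or no HTTP at all): the meta element applies.
    print(effective_charset(None, "text/html; charset=UTF-8"))
```

For HTML loaded from the local disk, as in the original question, there is no HTTP header at all, so the second case applies.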

Anyway, UTF-8 is a "charset" in the technical sense that the HTTP header and
its <meta> simulation uses the name "charset" for the parameter that
specifies the character encoding. The choice of the name "charset" is
unfortunate but cannot be changed any more.
> What I mean is : insert some UNICODE (e.g. Asian Character) into the
> HTML, so if the charset is US-ASCII, it cannot render the text
> correctly.

You haven't understood the answers. You cannot change the encoding
("charset") in the midst of a document. Period. Stop trying.

Why can't you simply use UTF-8 for the entire document? As explained, ASCII
characters need not be changed in any way when you put them into a UTF-8
document.

--
Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/

Jul 15 '06 #7

gm****@21cn.com wrote:
> What I mean is : insert some UNICODE (e.g. Asian Character) into the
> HTML, so if the charset is US-ASCII, it cannot render the text
> correctly.

Is this what you need?

<http://www.w3.org/TR/html4/charset.html#h-5.3.1>

It is independent of the charset (notwithstanding the encoding of the
'&', '#', 'x', ';' and digit characters used).
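
The linked section covers numeric character references. As a minimal sketch, assuming the text to insert is available as a Python string (the function name is hypothetical), the built-in "xmlcharrefreplace" error handler converts every non-ASCII character into a decimal reference, so the result can go into a page that stays labelled us-ascii.

```python
def to_ascii_with_char_refs(text: str) -> str:
    """Replace each non-ASCII character with a decimal reference like &#20013;."""
    # Characters that US-ASCII cannot encode are emitted as &#NNNN; instead.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    print(to_ascii_with_char_refs("<p>中文</p>"))  # <p>&#20013;&#25991;</p>
```

This produces the decimal form; the hexadecimal form (&#x4E2D; for the first character above) is equivalent. The escaped text consists only of ASCII characters, so the existing charset declaration does not need to be touched.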

--
ss at comp dot lancs dot ac dot uk
Jul 15 '06 #8
