About charset setting and replacing

Hi there,
I am writing a program that loads HTML from a file and sends it to IE
directly, and I've run into a problem with the charset setting. Most of
the HTML files declare charset "us-ascii", but for certain reasons some
Unicode text gets inserted into the HTML before it is sent to IE.
My questions are:

1) Can I specify a separate charset for a single element, e.g.
<span charset="UTF-8">SOME UNICODE HERE</span>

2) If the answer to 1) is "no", is there any way to change the charset
of the original HTML? Because I have no HTML parser handy, I can only
search and replace the charset programmatically. I've checked several
HTML files and found that the charset declaration looks like

<META http-equiv=Content-Type content="text/html; charset=us-ascii">

So, to make the program replace the right text, I search for the
keyword "charset=" and record its position, then search for the position
of the closing double quotation mark, and finally replace the substring
in between with UTF-8. Everything seems fine. However, I am worried that
there may be exceptions. Could these, for example, occur?

<META http-equiv=Content-Type content="text/html;" charset="us-ascii">

OR

<META http-equiv=Content-Type content='text/html;' charset='us-ascii'>

OR

<META http-equiv=Content-Type content='text/html; charset=us-ascii'>
Any better approach for my problem?

P.S. Someone suggested that I send the original code to IE and then call
IE's charset-setting function to change the charset. I tried that, but
after changing the charset, my Unicode text turns into meaningless
characters!

Thanks in advance.

Jul 14 '06 #1
gm****@21cn.com writes:
> I am writing a program to load HTML from file and send it to IE
> directly. I've met some problem in charset setting. Most of HTML have
> charset "us-ascii", for some reason, some UNICODE TEXT will be
> inserted into the HTML before sending to IE. The problem is
>
> 1) Can I specify special charset for some component, e.g.
> <span charset="UTF-8">SOME UNICODE HERE</span>

No.

> 2) If "NO" for 1), so any way to change the charset of the original
> HTML? Because I have no HTML parser handy, I can only SEARCH & REPLACE
> the charset programmatically. I've checked the several HTML and find the
> CHARSET format like
>
> <META http-equiv=Content-Type content="text/html; charset=us-ascii">

The usual best solution is to set the real HTTP content type header.
Content-type: text/html; charset=UTF-8
This will override the <meta> element if there is one, so you don't need
to worry about the format.

Since any valid us-ascii character is also (the same) valid UTF-8
character you might as well do this all the time.

However, from the description you gave, it doesn't sound like you're
using HTTP.

> So, for leading the program to replace the correct one, I search the
> keyword "charset=" and get the position, and then search the position
> of double quotation marks, finally, I replace the substring with UTF8,
> everything seems fine. However, I am worrying about if there are some
> exception. Will these, for example, happen?
>
> <META http-equiv=Content-Type content="text/html;" charset="us-ascii">
> <META http-equiv=Content-Type content='text/html;' charset='us-ascii'>

No.

> <META http-equiv=Content-Type content='text/html; charset=us-ascii'>

Might happen.

Additionally, the attribute names and tag name may or may not be
(partially) capitalised, as may the charset value, and possibly other
bits. There may be a slash immediately before the end of the tag (if
it's an XHTML document rather than an HTML document). The order of the
attributes may be reversed, so:
<MeTA ConTenT='text/html; charset=US-ascII'
htTp-EQUiv="Content-Type" />
is an unusual combination of the above, but still perfectly legal...

You might also get cases which have nothing to do with a <meta>
element, but trigger your pattern matching anyway.

> Any better approach for my problem?

Setting the HTTP headers is the best solution. If you can't do that
then using a real HTML parser is likely to be more reliable than any
search-and-replace you put together.

--
Chris
Jul 14 '06 #2
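
As a rough illustration of the "real HTML parser" suggestion above, here is a minimal sketch using Python's standard html.parser module. The thread itself names no parser or language, so the choice of Python and the names MetaCharsetFinder and find_meta_charsets are assumptions made purely for illustration. The sketch only looks inside <meta> elements, so mixed case, either quote style, reversed attribute order and a trailing slash are all handled, and stray "charset=" text elsewhere in the page is ignored.

```python
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Collect charset declarations found in <meta http-equiv> elements."""

    def __init__(self):
        super().__init__()
        self.charsets = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names before calling us;
        # "<meta ... />" also ends up here via handle_startendtag.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("http-equiv", "").lower() != "content-type":
            return
        # The content value looks like "text/html; charset=us-ascii".
        for part in attr.get("content", "").split(";"):
            key, _, value = part.strip().partition("=")
            if key.strip().lower() == "charset":
                self.charsets.append(value.strip().strip("\"'").lower())


def find_meta_charsets(html_text):
    """Return the declared charsets, lower-cased, in document order."""
    finder = MetaCharsetFinder()
    finder.feed(html_text)
    finder.close()
    return finder.charsets


if __name__ == "__main__":
    odd_but_legal = (
        "<MeTA ConTenT='text/html; charset=US-ascII'\n"
        'htTp-EQUiv="Content-Type" />'
    )
    print(find_meta_charsets(odd_but_legal))
```

This only detects the declaration; rewriting it safely would still need care, but knowing that a match really came from a <meta> element removes the false positives that a plain search for "charset=" can produce.
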
gm****@21cn.com wrote:
> Hi there, I am writing a program to load HTML from file and send it
> to IE directly. I've met some problem in charset setting. Most of
> HTML have charset "us-ascii", for some reason, some UNICODE TEXT
> will be inserted into the HTML before sending to IE. The problem is
>
> 1) Can I specify special charset for some component, e.g. <span
> charset="UTF-8">SOME UNICODE HERE</span>

1. UTF-8 isn't a charset, it's an encoding.
2. The UTF-8 encoding includes and encompasses all of US-ASCII.
3. Encodings apply to pages, not to HTML fragments.

If you create a page that is encoded as UTF-8, and serve it as UTF-8,
US-ASCII characters will automatically be rendered correctly.

What I don't understand is what you mean by "send it to IE directly".
Are you writing a server? If so, then you need to look into how to serve
pages encoded as UTF-8 (and that would be off-topic here).

--
Jack.
Jul 14 '06 #3
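
The claim that UTF-8 encompasses all of US-ASCII can be checked directly. The following is a small Python sketch (Python is an assumption here; the thread names no language, and encodes_identically is a hypothetical helper). It shows that pure-ASCII text produces the same bytes under both encodings, so relabelling an ASCII page as UTF-8 changes nothing, while text with non-ASCII characters cannot be written as US-ASCII at all.

```python
def encodes_identically(text):
    """True if text yields the same bytes as US-ASCII and as UTF-8."""
    try:
        return text.encode("ascii") == text.encode("utf-8")
    except UnicodeEncodeError:
        # The text contains characters that US-ASCII cannot represent.
        return False


if __name__ == "__main__":
    ascii_page = '<META http-equiv=Content-Type content="text/html; charset=us-ascii">'
    mixed_page = "Hello, \u4e16\u754c"  # two CJK characters after the ASCII part

    print(encodes_identically(ascii_page))
    print(encodes_identically(mixed_page))
    # Each of the two CJK characters takes three bytes in UTF-8,
    # while the seven ASCII characters take one byte each.
    print(len(mixed_page.encode("utf-8")))
```
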
gm****@21cn.com wrote:
>> Hi there, I am writing a program to load HTML from file and send it
>> to IE directly. I've met some problem in charset setting. Most of
>> HTML have charset "us-ascii", for some reason, some UNICODE TEXT
>> will be inserted into the HTML before sending to IE. The problem is
>>
>> 1) Can I specify special charset for some component, e.g. <span
>> charset="UTF-8">SOME UNICODE HERE</span>
>
> 1. UTF-8 isn't a charset, it's an encoding.

Anyway, the following meta is extracted from some page (the source HTML
of a Google search result):

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

> 2. The UTF-8 encoding includes and encompasses all of US-ASCII.
> 3. Encodings apply to pages, not to HTML fragments.
>
> If you create a page that is encoded as UTF-8, and serve it as UTF-8,
> US-ASCII characters will automatically be rendered correctly.

What I mean is: I insert some Unicode text (e.g. Asian characters) into
the HTML, so if the charset is US-ASCII, IE cannot render the text
correctly.

> What I don't understand is what you mean by "send it to IE directly".
> Are you writing a server? If so, then you need to look into how to serve
> pages encoded as UTF-8 (and that would be off-topic here).

I am sorry for misleading you. I am writing a client which sends the
HTML code to IE through Microsoft's IWebBrowser2 and IHTMLDocument2
interfaces. With those interfaces, I can change the HTML of any page
dynamically.

Jul 14 '06 #4

Chris Morris wrote:
> gm****@21cn.com writes:
>> I am writing a program to load HTML from file and send it to IE
>> directly. I've met some problem in charset setting. Most of HTML have
>> charset "us-ascii", for some reason, some UNICODE TEXT will be
>> inserted into the HTML before sending to IE. The problem is
>>
>> 1) Can I specify special charset for some component, e.g.
>> <span charset="UTF-8">SOME UNICODE HERE</span>
>
> No.
>
>> 2) If "NO" for 1), so any way to change the charset of the original
>> HTML? Because I have no HTML parser handy, I can only SEARCH & REPLACE
>> the charset programmatically. I've checked the several HTML and find the
>> CHARSET format like
>>
>> <META http-equiv=Content-Type content="text/html; charset=us-ascii">
>
> The usual best solution is to set the real HTTP content type header.
> Content-type: text/html; charset=UTF-8
> This will override the <meta> element if there is one, so you don't need
> to worry about the format.
>
> Since any valid us-ascii character is also (the same) valid UTF-8
> character you might as well do this all the time.
>
> However, from the description you gave, it doesn't sound like you're
> using HTTP.

I am writing a client that changes HTML dynamically. All the HTML files
are saved on the local hard disk; no network protocol is involved.

>> So, for leading the program to replace the correct one, I search the
>> keyword "charset=" and get the position, and then search the position
>> of double quotation marks, finally, I replace the substring with UTF8,
>> everything seems fine. However, I am worrying about if there are some
>> exception. Will these, for example, happen?
>>
>> <META http-equiv=Content-Type content="text/html;" charset="us-ascii">
>> <META http-equiv=Content-Type content='text/html;' charset='us-ascii'>
>
> No.
>
>> <META http-equiv=Content-Type content='text/html; charset=us-ascii'>
>
> Might happen.
>
> Additionally, the attribute names and tag name may or may not be
> (partially) capitalised, as may the charset value, and possibly other
> bits. There may be a slash immediately before the end of the tag (if
> it's an XHTML document rather than an HTML document). The order of the
> attributes may be reversed, so:
> <MeTA ConTenT='text/html; charset=US-ascII'
> htTp-EQUiv="Content-Type" />
> is an unusual combination of the above, but still perfectly legal...

I am not very familiar with HTML. Given what you mention above, is the
following valid in both HTML and XHTML?

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

> You might also get cases which have nothing to do with a <meta>
> element, but trigger your pattern matching anyway.
>
>> Any better approach for my problem?
>
> Setting the HTTP headers is the best solution. If you can't do that
> then using a real HTML parser is likely to be more reliable than any
> search-and-replace you put together.

Thanks. I see.

Jul 14 '06 #5
gm****@21cn.com writes:
> I am not quite familiar with HTML,

See http://www.w3.org/TR/HTML4/ for the official specifications.

> As you mention above, for both HTML
> and XHTML, is the following valid?
>
> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

No - this is valid in HTML, but not in XHTML. Internet Explorer does
not support XHTML and treats it as if it were HTML. You may find in
XHTML source documents something like this:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
which is valid in XHTML but not valid in HTML.

--
Chris
Jul 14 '06 #6
gm****@21cn.com <gm****@21cn.com> scripsit:
>> 1. UTF-8 isn't a charset, it's an encoding.
>
> Anyway, the following meta is extracted from some page (the source HTML
> of a Google search result)
>
> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

The meta tag itself is correct, though it will be (by the specifications and
in actual practice) ignored, if the server specifies a charset parameter in
actual HTTP headers. You need to find out what the server does, typically by
using an HTTP header viewer.

Anyway, UTF-8 is a "charset" in the technical sense that the HTTP header and
its <meta> simulation use the name "charset" for the parameter that
specifies the character encoding. The choice of the name "charset" is
unfortunate but cannot be changed any more.

> What I mean is : insert some UNICODE (e.g. Asian Character) into the
> HTML, so if the charset is US-ASCII, it cannot render the text
> correctly.

You haven't understood the answers. You cannot change the encoding
("charset") in the midst of a document. Period. Stop trying.

Why can't you simply use UTF-8 for the entire document? As explained, ASCII
characters need not be changed in any way when you put them into a UTF-8
document.

--
Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/

Jul 15 '06 #7
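
The precedence rule described above (a charset in the real HTTP header wins over the <meta> declaration) can be sketched as follows. This is a simplified Python model, not a description of what any browser actually implements: real browsers also consider byte order marks and other heuristics. The function names are hypothetical, and the standard library's email.message.Message is used only as a convenient way to parse a Content-Type value.

```python
from email.message import Message


def charset_from_content_type(value):
    """Extract the charset parameter from a Content-Type value, or None."""
    if value is None:
        return None
    msg = Message()
    msg["Content-Type"] = value
    # get_content_charset() returns the charset lower-cased, or None.
    return msg.get_content_charset()


def effective_charset(http_content_type, meta_content):
    """HTTP header charset first; fall back to the <meta> content value."""
    from_header = charset_from_content_type(http_content_type)
    if from_header is not None:
        return from_header
    return charset_from_content_type(meta_content)


if __name__ == "__main__":
    # Served over HTTP with an explicit charset: the header decides.
    print(effective_charset("text/html; charset=UTF-8",
                            "text/html; charset=us-ascii"))
    # Loaded from local disk (no HTTP header): the <meta> value decides.
    print(effective_charset(None, "text/html; charset=us-ascii"))
```

The second call mirrors the situation in this thread, where the HTML comes from local files and no HTTP header exists, which is why the <meta> declaration matters here at all.
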
gm****@21cn.com wrote:
> What I mean is : insert some UNICODE (e.g. Asian Character) into the
> HTML, so if the charset is US-ASCII, it cannot render the text
> correctly.

Is this what you need?:

<http://www.w3.org/TR/html4/charset.html#h-5.3.1>

It is independent of the charset (notwithstanding the encoding of the
'&', '#', 'x', ';' and digit characters used).

--
ss at comp dot lancs dot ac dot uk |
Jul 15 '06 #8
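
The section linked above covers numeric character references, which let non-ASCII characters be written using only ASCII, so the inserted text survives whatever charset the page declares. A minimal Python sketch of the idea follows (Python and the helper name to_ascii_with_char_refs are assumptions for illustration; the thread's actual client uses IE's COM interfaces). It relies on the standard "xmlcharrefreplace" error handler, which emits decimal references.

```python
import html


def to_ascii_with_char_refs(text):
    """Replace every non-ASCII character with a decimal &#NNNN; reference.

    ASCII characters, including '&' and '<', are left untouched, so this
    is meant for text that is already valid HTML content.
    """
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    original = "Hello, \u4e16\u754c"  # ASCII followed by two CJK characters
    safe = to_ascii_with_char_refs(original)
    print(safe)
    # html.unescape() turns the references back into the characters,
    # which is roughly what a browser does when it renders them.
    print(html.unescape(safe) == original)
```

With this approach the original "us-ascii" declaration can stay as it is, since the document never contains a byte outside ASCII.
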
