gmclee@21cn.com writes:
Quote:
I am writing a program to load HTML from a file and send it to IE
directly. I've run into a problem with the charset setting. Most of the
HTML files have charset "us-ascii", but for some reason some Unicode
text will be inserted into the HTML before it is sent to IE. My questions are:
>
1) Can I specify a special charset for a single component, e.g.
<span charset="UTF-8">SOME UNICODE HERE</span>
No.
Quote:
2) If the answer to 1) is "no", is there any way to change the charset
of the original HTML? Because I have no HTML parser handy, I can only
search and replace the charset programmatically. I've checked several
HTML files and found the charset declared like this:
>
<META http-equiv=Content-Type content="text/html; charset=us-ascii">
The usual best solution is to set the real HTTP content type header.
Content-type: text/html; charset=UTF-8
This will override the <meta> element if there is one, so you don't need
to worry about the format.
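
As a rough illustration of the header approach, here is a minimal Python sketch. It only applies if the HTML is actually delivered over HTTP. The helper name make_server, the loopback address and the port handling are assumptions made for this example, not anything from the original program.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def make_server(html_text, port=0):
    """Hypothetical helper: build a server that sends html_text as UTF-8.

    port=0 asks the operating system for any free port.
    """
    body = html_text.encode("utf-8")

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            self.send_response(200)
            # The real HTTP header, sent regardless of what any
            # <meta> element inside the document claims.
            self.send_header("Content-Type", "text/html; charset=UTF-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args):
            pass  # keep the sketch quiet

    return HTTPServer(("127.0.0.1", port), Handler)
```

Calling make_server(text).serve_forever() would then serve the page until interrupted.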
Since any valid us-ascii character is also (the same) valid UTF-8
character you might as well do this all the time.
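
The ASCII/UTF-8 compatibility claim can be checked directly. The short Python sketch below decodes every ASCII byte both ways and compares the results; the function name and the sample string are made up for the illustration.

```python
def same_text_in_both(data: bytes) -> bool:
    """True if the bytes decode to identical text as us-ascii and as UTF-8."""
    return data.decode("ascii") == data.decode("utf-8")


# All 128 ASCII code points in one byte string.
every_ascii_byte = bytes(range(128))
print(same_text_in_both(every_ascii_byte))  # True

# Text with a non-ASCII character still encodes cleanly as UTF-8,
# and the ASCII part of it keeps exactly the same bytes.
page = "<p>caf\u00e9</p>"
encoded = page.encode("utf-8")
print(encoded.startswith(b"<p>caf"))  # True
```

So relabelling a pure us-ascii document as UTF-8 changes nothing about how its existing bytes are read.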
However, from the description you gave, it doesn't sound like you're
using HTTP.
Quote:
So, to make the program replace the correct one, I search for the
keyword "charset=" and get its position, then search for the position
of the double quotation mark, and finally replace the substring with
UTF-8. Everything seems fine. However, I am worried that there may be
some exceptions. Will these, for example, happen?
>
<META http-equiv=Content-Type content="text/html;" charset="us-ascii">
<META http-equiv=Content-Type content='text/html;' charset='us-ascii'>
No.
Quote:
<META http-equiv=Content-Type content='text/html; charset=us-ascii'>
Might happen.
Additionally, the attribute names and tag name may or may not be
(partially) capitalised, as may the charset value, and possibly other
bits. There may be a slash immediately before the end of the tag (if
it's an XHTML document rather than an HTML document). The order of the
attributes may be reversed, so:
<MeTA ConTenT='text/html; charset=US-ascII'
htTp-EQUiv="Content-Type" />
is an unusual combination of the above, but still perfectly legal...
You might also get cases which have nothing to do with a <meta>
element, but trigger your pattern matching anyway.
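
To make the false-match risk concrete, here is a small Python sketch of a naive "charset=" search run against an invented document. The function name and the sample markup are hypothetical; the point is only that several hits have nothing to do with the <meta> declaration.

```python
def naive_charset_positions(html):
    """Return every index where 'charset=' occurs, ignoring case."""
    lowered = html.lower()
    positions = []
    start = lowered.find("charset=")
    while start != -1:
        positions.append(start)
        start = lowered.find("charset=", start + 1)
    return positions


# Invented sample: one real declaration plus three look-alikes.
doc = (
    '<html><head>\n'
    '<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">\n'
    '<script src="app.js" charset="iso-8859-1"></script>\n'
    '</head><body>\n'
    '<p>Set charset=UTF-8 in your header.</p>\n'
    '<form accept-charset="utf-8"></form>\n'
    '</body></html>'
)

print(len(naive_charset_positions(doc)))  # 4
```

Only the first hit is the declaration; the others come from a <script> attribute, body text, and an accept-charset attribute.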
Quote:
Any better approach for my problem?
Setting the HTTP headers is the best solution. If you can't do that
then using a real HTML parser is likely to be more reliable than any
search-and-replace you put together.
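
For comparison, here is a minimal sketch of the parser route using Python's standard html.parser module. It only locates the declared charset and does not rewrite the document. The class and function names are made up for this example, and Python itself is an assumption, since the original program's language isn't stated.

```python
from html.parser import HTMLParser


class CharsetFinder(HTMLParser):
    """Hypothetical helper: remember the first charset declared in a
    <meta http-equiv="Content-Type" ...> element."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names and strips the
        # quotes around values, so case and quoting style don't matter.
        if tag != "meta" or self.charset is not None:
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("http-equiv", "").lower() != "content-type":
            return
        for part in attributes.get("content", "").split(";"):
            key, _, value = part.strip().partition("=")
            if key.strip().lower() == "charset":
                self.charset = value.strip().lower()


def declared_charset(html):
    finder = CharsetFinder()
    finder.feed(html)
    finder.close()
    return finder.charset


unusual = ("<MeTA ConTenT='text/html; charset=US-ascII'\n"
           "htTp-EQUiv=\"Content-Type\" />")
print(declared_charset(unusual))  # us-ascii
```

Because the parser reports only real tags and attributes, a stray "charset=" in body text or in another element's attribute is ignored.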
--
Chris