Transcoding HTML

I'm sure that I read somewhere that an HTML document might be
transcoded to a different character set at some stage in its journey,
so while it might start out as (for example) ISO-8859-15, by the time
it is actually viewed it's been converted to UTF-8. Maybe by whatever
the author used to upload the document to the server, maybe by a proxy,
maybe by the user agent (if it saves it to disk), maybe by the httpd
in some content negotiation.

Does anybody have any information on systems that do this in practice?

--
David Dorward
http://dorward.me.uk/
Jul 20 '05 #1
On Tue, 28 Oct 2003, David Dorward wrote:
I'm sure that I read somewhere that an HTML document might be
transcoded to a different character set at some stage in its journey,
so while it might start out as (for example) ISO-8859-15, by the time
it is actually viewed it's been converted to UTF-8.


In theory this is true. In practice the use of such transcoding
features in servers or proxies seems to be confined to particular
communities where, for whatever reason, several incompatible character
codings are in use. I have heard of Japanese transcoding proxies, but
the only ones I have met directly were Russian ones; see Russian Apache
for details.

There's a URL here http://apache.lexa.ru/english/meta-http-eng.html
(with a rather remarkable figurehead ;-) but I suspect it may be out
of date. Still, it'll give you the flavour of the thing, I guess.
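
To give the flavour of what a transcoding server or proxy does, here is a minimal Python sketch. It is an illustration only and not how Russian Apache works: the function names `parse_accept_charset` and `transcode_for_client` are invented for this example, and it assumes the server knows the charset the document is stored in and picks the output charset from the client's Accept-Charset header.

```python
def parse_accept_charset(header):
    """Return the charset names from an Accept-Charset header, best first."""
    prefs = []
    for item in header.split(","):
        parts = item.strip().split(";")
        name = parts[0].strip().lower()
        if not name:
            continue
        q = 1.0  # a missing q-value means full preference
        for param in parts[1:]:
            key, _, value = param.strip().partition("=")
            if key.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        if q > 0:
            prefs.append((q, name))
    # sorted() is stable, so charsets with equal q keep their header order.
    return [name for q, name in sorted(prefs, key=lambda p: -p[0])]


def transcode_for_client(body, source_charset, accept_charset):
    """Re-encode body in the first preferred charset that can represent it.

    Hypothetical helper: returns (bytes, charset actually used).
    """
    text = body.decode(source_charset)
    for charset in parse_accept_charset(accept_charset):
        if charset == "*":
            break  # client accepts anything, so keep the stored bytes
        try:
            return text.encode(charset), charset
        except (LookupError, UnicodeEncodeError):
            continue  # unknown charset, or text not representable in it
    return body, source_charset


stored = "Привет".encode("koi8-r")
body, charset = transcode_for_client(stored, "koi8-r", "windows-1251, utf-8;q=0.5")
print(charset, body)
```

A real server would also have to send the matching charset parameter in the Content-Type response header, which this sketch leaves out.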

Jul 20 '05 #2
David Dorward wrote:
I'm sure that I read somewhere that an HTML document might be
transcoded to a different character set at some stage in its journey,
so while it might start out as (for example) ISO-8859-15, by the time
it is actually viewed it's been converted to UTF-8. Maybe by whatever
the author used to upload the document to the server, maybe by a proxy,
maybe by the user agent (if it saves it to disk), maybe by the httpd
in some content negotiation.

Does anybody have any information on systems that do this in practice?


IE6 will often do this when saving a document locally. The FileSave
dialog box lets the user choose an encoding, and an appropriate element
like
<META http-equiv=Content-Type content="text/html; charset=utf-8">
is added or changed depending on whether the document had the element
originally.

Other changes that are made:
- <!DOCTYPE...> (HTML4.0 trans.) is added if it wasn't there.
- <META content="MSHTML 6.00.2800.1264" name=GENERATOR> is added
- All the elements are capitalized.
- Line breaks are adjusted.
- Quotes around attribute values are stripped where not required.
- Numeric character references like &#169; may be rewritten as the
actual character if supported by the encoding.

I'm sure more changes are made, but I noticed these in a quick
examination.

I'll speculate that IE6 creates the new document from its internal
representation without reference to the original source.

Even more oddly, sometimes the document is saved as a verbatim copy of
the source. Perhaps this only happens when the declared encoding and the
user's chosen encoding are identical.
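
As a rough illustration of the kind of rewrite described above (this is not IE6's actual code, which is not public), the Python sketch below transcodes an HTML document and then changes or adds the META charset declaration. The name `transcode_html` and both regular expressions are assumptions made for this example; a real tool would work on a parsed document, not on regular expressions.

```python
import re

# Matches up to and including "charset=" inside a <meta> tag (group 1),
# followed by the declared charset name (group 2).
META_CHARSET = re.compile(
    r"""(<meta\b[^>]*\bcharset\s*=\s*["']?)([A-Za-z0-9._:-]+)""",
    re.IGNORECASE,
)
HEAD_OPEN = re.compile(r"<head\b[^>]*>", re.IGNORECASE)


def transcode_html(raw, source_charset, target_charset):
    """Decode raw bytes, fix the charset declaration, re-encode."""
    text = raw.decode(source_charset)

    # Change an existing declaration if there is one.
    text, changed = META_CHARSET.subn(
        lambda m: m.group(1) + target_charset, text, count=1
    )
    if not changed:
        # Otherwise add one right after <head>, if the document has a head.
        meta = (
            '<meta http-equiv="Content-Type" '
            'content="text/html; charset=%s">' % target_charset
        )
        text = HEAD_OPEN.sub(lambda m: m.group(0) + "\n" + meta, text, count=1)

    # Characters the target encoding lacks become numeric character
    # references, the reverse of the rewrite noted in the list above.
    return text.encode(target_charset, errors="xmlcharrefreplace")


source = (
    '<html><head><meta http-equiv="Content-Type" '
    'content="text/html; charset=ISO-8859-15"><title>Prix</title></head>'
    "<body>10 \u20ac</body></html>"
).encode("iso-8859-15")

print(transcode_html(source, "iso-8859-15", "utf-8"))
print(transcode_html(source, "iso-8859-15", "iso-8859-1"))
```

In the second call the euro sign has no ISO-8859-1 code point, so it is written out as a numeric character reference.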

Andrew Graham
Jul 20 '05 #3
On Tue, 28 Oct 2003 18:09:47 GMT, "Andrew Graham"
<an*********************@nospam.invalid> wrote:

I'll speculate that IE6 creates the new document from its internal
representation without reference to the original source.

Yes, it's a representation of the document tree, and bears no relation
to the original source.

Even more oddly, sometimes the document is saved as a verbatim copy of
the source. Perhaps this only happens when the declared encoding and the
user's chosen encoding are identical.


It normally depends on whether you say "save web page complete" or
"save web page html only": the first is a normalised source, the second
the actual source.

Jim.
--
comp.lang.javascript FAQ - http://jibbering.com/faq/

Jul 20 '05 #4
On Tue, 28 Oct 2003, Andrew Graham wrote:
IE6 will often do this when saving a document locally.


Good point. Mozilla Composer can also do this when one chooses an
encoding and then saves the edited document.

I thought the questioner was more interested in automated transcoding
in servers and proxies...?

Jul 20 '05 #5
Alan J. Flavell wrote:
I thought the questioner was more interested in automated transcoding
in servers and proxies...?


No no, any system that does it is of interest.

--
David Dorward http://dorward.me.uk/
Jul 20 '05 #6
On Tue, 28 Oct 2003, David Dorward wrote:
I thought the questioner was more interested in automated transcoding
in servers and proxies...?


No no, any system that does it is of interest.


Well, you're in the best position to know what you're interested in
;-) so please excuse me for assuming. Can't think of any other
examples at the moment though.

Jul 20 '05 #7
In article <ca*************************@posting.google.com> , one of infinite monkeys
at the keyboard of do*****@yahoo.com (David Dorward) wrote:
I'm sure that I read somewhere that an HTML document might be
transcoded to a different character set at some stage in its journey,
so while it might start out as (for example) ISO-8859-15, by the time
it is actually viewed it's been converted to UTF-8.

Yes, there are certainly reasons why that might happen.

Most markup parsers work internally with a selected charset, and
transcode documents to it at input. They can transcode back on output,
but this
is then an extra overhead. Several of my modules generate all output
as UTF-8, leaving you the option to filter it through a transcoding
module if you want something else. XSLT of course has its own rules,
but will typically be fastest if you use the processor's internal
charset for output.
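
The pipeline described above can be sketched in Python as three stages: transcode to an internal representation at input, process, then emit UTF-8 or filter through a different output charset. This is only an illustration and not the code of any Apache module; `decode_input`, `process` and `encode_output` are names made up for this example, and `process` is a trivial stand-in for real markup processing.

```python
import codecs


def decode_input(chunks, charset):
    """Transcode incoming byte chunks to the internal representation (str)."""
    decoder = codecs.getincrementaldecoder(charset)()
    for chunk in chunks:
        text = decoder.decode(chunk)  # holds back incomplete multi-byte sequences
        if text:
            yield text
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


def process(texts):
    """Stand-in for the markup stage; it only ever sees decoded text."""
    for text in texts:
        yield text.replace("\t", " ")


def encode_output(texts, charset="utf-8"):
    """Emit UTF-8 by default; pass another charset to transcode on output."""
    encoder = codecs.getincrementalencoder(charset)(errors="xmlcharrefreplace")
    for text in texts:
        data = encoder.encode(text)
        if data:
            yield data
    tail = encoder.encode("", final=True)
    if tail:
        yield tail


# ISO-8859-15 input arriving in two chunks.
chunks = ["10\t\u20ac".encode("iso-8859-15"), " caf\u00e9".encode("iso-8859-15")]

utf8 = b"".join(encode_output(process(decode_input(chunks, "iso-8859-15"))))
latin1 = b"".join(
    encode_output(process(decode_input(chunks, "iso-8859-15")), "iso-8859-1")
)
print(utf8)
print(latin1)
```

The incremental decoder matters for multi-byte input such as UTF-8, where a chunk boundary can fall in the middle of a character.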

Does anybody have any information on systems that do this in practice?


Come and see my talk at ApacheCon!

--
Nick Kew

In urgent need of paying work - see http://www.webthing.com/~nick/cv.html
Jul 20 '05 #8
In article <3f***************@news.cis.dfn.de>,
ji*@jibbering.com (Jim Ley) wrote:
On Tue, 28 Oct 2003 18:09:47 GMT, "Andrew Graham"
<an*********************@nospam.invalid> wrote:

I'll speculate that IE6 creates the new document from its internal
representation without reference to the original source.


Yes it's a representation of the document tree, and bears no relation
to the original source.


However, if the document is reparsed, the new tree is not necessarily
the same due to whitespace introduced by pretty printing, which may
affect scripts. Also, due to the doctype change, the layout mode may be
different after reparse.
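
The whitespace effect is easy to demonstrate. The Python sketch below lists the nodes a tokenizer reports for the same markup before and after pretty printing. It uses the standard library's `html.parser`, which is not a browser DOM, so treat it as an analogy; `NodeLister` and `list_nodes` are names invented for this example.

```python
from html.parser import HTMLParser


class NodeLister(HTMLParser):
    """Records element starts and text runs in document order."""

    def __init__(self):
        super().__init__()
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        self.nodes.append(("element", tag))

    def handle_data(self, data):
        self.nodes.append(("text", data))


def list_nodes(markup):
    parser = NodeLister()
    parser.feed(markup)
    parser.close()
    return parser.nodes


def whitespace_text_nodes(markup):
    """Count text nodes that consist of whitespace only."""
    return sum(
        1 for kind, value in list_nodes(markup)
        if kind == "text" and not value.strip()
    )


compact = "<ul><li>one</li><li>two</li></ul>"
pretty = "<ul>\n  <li>one</li>\n  <li>two</li>\n</ul>"

print(list_nodes(compact))
print(list_nodes(pretty))
print(whitespace_text_nodes(compact), whitespace_text_nodes(pretty))
```

A script that expects the first child of the list to be an element can therefore find a whitespace text node there after the reparse.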

--
Henri Sivonen
hs******@iki.fi
http://iki.fi/hsivonen/
Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html
Jul 20 '05 #9
"Andrew Graham" <an*********************@nospam.invalid> wrote:
IE6 will often do this when saving a document locally.


Don't do this then. Rather choose "View source" and save in your text
editor.
Jul 20 '05 #10
