  #1  
Old July 20th, 2005, 05:26 PM
David Dorward
Default Transcoding HTML

I'm sure that I read somewhere that an HTML document might be
transcoded to a different character set at some stage in its journey,
so while it might start out as (for example) ISO-8859-15, by the time
it is actually viewed it's been converted to UTF-8. Maybe by whatever
the author used to upload the document to the server, maybe by a proxy,
maybe by the user agent (if it saves it to disk), maybe by the httpd
in some content negotiation.

Does anybody have any information on systems that do this in practice?

--
David Dorward
http://dorward.me.uk/
  #2  
Old July 20th, 2005, 05:26 PM
Alan J. Flavell
Default Re: Transcoding HTML

On Tue, 28 Oct 2003, David Dorward wrote:
> I'm sure that I read somewhere that an HTML document might be
> transcoded to a different character set at some stage in its journey,
> so while it might start out as (for example) ISO-8859-15, by the time
> it is actually viewed it's been converted to UTF-8.

In theory this is true. In practice the use of such transcoding
features in servers or proxies seems to be confined to particular
communities where, for whatever reason, several incompatible character
codings are in use. I have heard of Japanese transcoding proxies, but the
only ones I met directly were Russian ones, see Russian Apache for
details.

There's a URL here http://apache.lexa.ru/english/meta-http-eng.html
(with a rather remarkable figurehead ;-) but I suspect it may be out
of date. Still, it'll give you the flavour of the thing, I guess.
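
For readers who want to picture what such a transcoding proxy does, here is a minimal Python sketch. It is not taken from Russian Apache or any real proxy: the function name `transcode_response` and the ISO-8859-1 fallback are assumptions made for illustration. It re-encodes the body and rewrites the charset parameter of the Content-Type header to match.

```python
import re

def transcode_response(body: bytes, content_type: str, target: str = "utf-8"):
    """Return (new_body, new_content_type) with the body re-encoded to `target`.

    Hypothetical helper for illustration only; not any real proxy's API.
    """
    # Pull the declared charset out of the Content-Type header value.
    match = re.search(r"charset=([^\s;]+)", content_type, re.IGNORECASE)
    # Assumption: fall back to ISO-8859-1 when no charset is declared.
    source = match.group(1).strip("\"'") if match else "iso-8859-1"

    # Decode with the source charset, then encode with the target one.
    new_body = body.decode(source).encode(target)

    # Keep the media type, replace the charset parameter.
    media_type = content_type.split(";")[0].strip()
    return new_body, f"{media_type}; charset={target}"

# A euro sign is byte 0xA4 in ISO-8859-15 and three bytes in UTF-8.
original = "Price: 5 \u20ac".encode("iso-8859-15")
body, ctype = transcode_response(original, "text/html; charset=ISO-8859-15")
```

Note that a sketch like this leaves any charset declared in a META element inside the document untouched, so the header and the document can end up disagreeing.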

  #3  
Old July 20th, 2005, 05:26 PM
Andrew Graham
Default Re: Transcoding HTML

David Dorward wrote:
> I'm sure that I read somewhere that an HTML document might be
> transcoded to a different character set at some stage in its journey,
> so while it might start out as (for example) ISO-8859-15, by the time
> it is actually viewed it's been converted to UTF-8. Maybe by whatever
> the author used to upload the document to the server, maybe by a proxy,
> maybe by the user agent (if it saves it to disk), maybe by the httpd
> in some content negotiation.
>
> Does anybody have any information on systems that do this in practice?

IE6 will often do this when saving a document locally. The File > Save As
dialog box lets the user choose an encoding, and an appropriate element
like
<META http-equiv=Content-Type content="text/html; charset=utf-8">
is added or changed depending on whether the document had the element
originally.

Other changes that are made:
- <!DOCTYPE...> (HTML4.0 trans.) is added if it wasn't there.
- <META content="MSHTML 6.00.2800.1264" name=GENERATOR> is added
- All the elements are capitalized.
- Line breaks are adjusted.
- Quotes around attribute values are stripped where not required.
- Numeric character references like &#169; may be rewritten as the
actual character if supported by the encoding.

I'm sure more changes are made, but I noticed these in a quick
examination.

I'll speculate that IE6 creates the new document from its internal
representation without reference to the original source.

Even more oddly, sometimes the document is saved as a verbatim copy of
the source. Perhaps this only happens when the declared encoding and the
user's chosen encoding are identical.
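
Two of the changes described above, the rewritten charset declaration and the expanded numeric character references, can be imitated in a few lines of Python. This is only a rough sketch of the observed behaviour, not how IE6 is implemented: the function name `resave` is made up, only decimal references are handled, and references to markup-significant characters are deliberately left alone.

```python
import re

def resave(html: str, encoding: str) -> bytes:
    """Imitate a re-save in `encoding` (hypothetical helper, illustration only)."""

    def expand(match):
        code_point = int(match.group(1))
        if code_point > 0x10FFFF:
            return match.group(0)          # not a valid code point, keep as-is
        ch = chr(code_point)
        if ch in '<>&"':
            return match.group(0)          # expanding these would break the markup
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            return match.group(0)          # target encoding cannot represent it
        return ch

    # Replace decimal numeric character references where the encoding allows.
    html = re.sub(r"&#(\d+);", expand, html)
    # Point an existing charset declaration at the new encoding.
    html = re.sub(r"charset=[\w-]+", f"charset={encoding}", html, flags=re.IGNORECASE)
    return html.encode(encoding)

source = ('<META http-equiv=Content-Type content="text/html; charset=iso-8859-1">'
          "<P>&#169; &#8364; &#60;</P>")
as_latin1 = resave(source, "iso-8859-1")
as_utf8 = resave(source, "utf-8")
```

Saved as ISO-8859-1, the copyright sign becomes a literal character while the euro sign stays a reference, because that encoding has no euro sign. Saved as UTF-8, both are expanded.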

Andrew Graham


  #4  
Old July 20th, 2005, 05:26 PM
Jim Ley
Default Re: Transcoding HTML

On Tue, 28 Oct 2003 18:09:47 GMT, "Andrew Graham"
<andrewgraham.at.att.net@nospam.invalid> wrote:

>I'll speculate that IE6 creates the new document from its internal
>representation without reference to the original source.

Yes, it's a representation of the document tree, and bears no relation
to the original source.
>Even more oddly, sometimes the document is saved as a verbatim copy of
>the source. Perhaps this only happens when the declared encoding and the
>user's chosen encoding are identical.

It normally depends on whether you choose "save web page complete" or
"save web page html only": the first is a normalised source, the second
the actual source.

Jim.
--
comp.lang.javascript FAQ - http://jibbering.com/faq/

  #5  
Old July 20th, 2005, 05:26 PM
Alan J. Flavell
Default Re: Transcoding HTML

On Tue, 28 Oct 2003, Andrew Graham wrote:
> IE6 will often do this when saving a document locally.

Good point. Mozilla Composer can also do this when one chooses an
encoding and then saves the edited document.

I thought the questioner was more interested in automated transcoding
in servers and proxies...?

  #6  
Old July 20th, 2005, 05:26 PM
David Dorward
Default Re: Transcoding HTML

Alan J. Flavell wrote:
> I thought the questioner was more interested in automated transcoding
> in servers and proxies...?

No no, any system that does it is of interest.

--
David Dorward http://dorward.me.uk/
  #7  
Old July 20th, 2005, 05:26 PM
Alan J. Flavell
Default Re: Transcoding HTML

On Tue, 28 Oct 2003, David Dorward wrote:
> > I thought the questioner was more interested in automated transcoding
> > in servers and proxies...?
>
> No no, any system that does it is of interest.

Well, you're in the best position to know what you're interested in
;-) so please excuse me for assuming. Can't think of any other
examples at the moment though.

  #8  
Old July 20th, 2005, 05:26 PM
Nick Kew
Default Re: Transcoding HTML

In article <caa3f16.0310280618.6f4001eb@posting.google.com>, one of infinite monkeys
at the keyboard of dorward@yahoo.com (David Dorward) wrote:
> I'm sure that I read somewhere that an HTML document might be
> transcoded to a different character set at some stage in its journey,
> so while it might start out as (for example) ISO-8859-15, by the time
> it is actually viewed it's been converted to UTF-8.

Yes, there are certainly reasons why that might happen.

Most markup parsers work internally with a selected charset, and
transcode documents at input. They can transcode back on output, but this
is then an extra overhead. Several of my modules generate all output
as UTF-8, leaving you the option to filter it through a transcoding
module if you want something else. XSLT of course has its own rules,
but will typically be fastest if you use the processor's internal
charset for output.
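
The decode-on-input, encode-on-output pattern described above can be sketched with Python's standard incremental codecs. This is an illustration only and not the code of any of the modules mentioned: the generator name `transcode_stream` is made up. The incremental decoder is what lets it cope with a multi-byte character that is split across two chunks.

```python
import codecs

def transcode_stream(chunks, source, target="utf-8"):
    """Yield `chunks` (bytes in `source` encoding) re-encoded as `target`.

    Hypothetical filter for illustration; processes the stream chunk by chunk.
    """
    decoder = codecs.getincrementaldecoder(source)()
    encoder = codecs.getincrementalencoder(target)()
    for chunk in chunks:
        # The decoder buffers an incomplete trailing byte sequence itself.
        text = decoder.decode(chunk)
        if text:
            yield encoder.encode(text)
    # Flush both codecs; an incomplete sequence at end of input raises here.
    tail = decoder.decode(b"", final=True)
    yield encoder.encode(tail, final=True)

# The two bytes of a UTF-8 "e acute" arrive in separate chunks.
pieces = [b"caf\xc3", b"\xa9"]
result = b"".join(transcode_stream(pieces, "utf-8", "iso-8859-1"))
```

Skipping the output step, as suggested above, simply means handing on the decoded form (or its UTF-8 encoding) and leaving any further conversion to a separate filter.
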
> Does anybody have any information on systems that do this in practice?

Come and see my talk at ApacheCon!

--
Nick Kew

In urgent need of paying work - see http://www.webthing.com/~nick/cv.html
  #9  
Old July 20th, 2005, 05:26 PM
Henri Sivonen
Default Re: Transcoding HTML

In article <3f9eb19a.99017038@news.cis.dfn.de>,
jim@jibbering.com (Jim Ley) wrote:
> On Tue, 28 Oct 2003 18:09:47 GMT, "Andrew Graham"
> <andrewgraham.at.att.net@nospam.invalid> wrote:
>
> >I'll speculate that IE6 creates the new document from its internal
> >representation without reference to the original source.
>
> Yes it's a representation of the document tree, and bears no relation
> to the original source.

However, if the document is reparsed, the new tree is not necessarily
the same due to whitespace introduced by pretty printing, which may
affect scripts. Also, due to the doctype change, the layout mode may be
different after reparse.
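
The whitespace effect is easy to reproduce in miniature. The sketch below uses Python's `xml.dom.minidom`, which is an XML parser and not the parser of any browser, so it only illustrates the principle; the markup and the helper name `child_node_names` are made up for the example.

```python
from xml.dom import minidom

# The same list, once as originally authored and once after pretty printing.
compact = "<ul><li>one</li><li>two</li></ul>"
pretty = "<ul>\n  <li>one</li>\n  <li>two</li>\n</ul>"

def child_node_names(markup):
    """Return the node names of the root element's direct children."""
    root = minidom.parseString(markup).documentElement
    return [node.nodeName for node in root.childNodes]

compact_children = child_node_names(compact)
pretty_children = child_node_names(pretty)
```

Here `compact_children` contains only the two `li` elements, while `pretty_children` also contains `#text` nodes for the inserted line breaks and indentation. A script that takes the first child of the list and expects an element would behave differently on the re-saved copy.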

--
Henri Sivonen
hsivonen@iki.fi
http://iki.fi/hsivonen/
Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html
  #10  
Old July 20th, 2005, 05:27 PM
Andreas Prilop
Default Re: Transcoding HTML

"Andrew Graham" <andrewgraham.at.att.net@nospam.invalid> wrote:
> IE6 will often do this when saving a document locally.

Don't do this then. Rather choose "View source" and save in your text
editor.
 
