Content-type META tag paradox?

P: n/a
Why would anyone ever have expected a content-type META tag to be effective
at all? Is it because someone was misled by the happenstance that the letters of
the alphabet, the digits, and the characters {< > / ; , " ' =} happen to be
at the same locations in several particular common encodings (US-ASCII,
ISO-8859-1, etc.)?

Even assuming that these characters are always in the same locations, before
it can find my META tag in the first place, the UA has to make an *initial*
assumption as to whether the document is even 8-bit versus 16-bit.

Let's say I were perverse enough to save my HTML document using an editor
that encodes text as EBCDIC. How would the UA figure out not only where my
META tag is, but where ANYTHING is?

And that just addresses the encoding. How about the content-type? What if I
were twisted enough to tell the UA, which has been assuming that my document
is an HTML text document just long enough to reach and parse my META tag in
the first place, that the document is really text/plain? (In that case,
would it start over from the beginning based on text/plain, this time not
doing any parsing at all, and therefore not finding a META tag at all, and
therefore not finding an override for the default supposition of text/html,
and therefore AGAIN starting from the beginning and parsing the document as
HTML, and then finding the META tag and realizing the document is
text/plain, and starting all over again based on that assumption, and then
.....) Or image/gif?

So I'm just curious how the content-type META tag got into the spec in the
first place. It seems to defy logic.

--
Harlan Messinger
Remove the first dot from my e-mail address.
Veuillez ôter le premier point de mon adresse de courriel.

Jul 20 '05 #1
40 Replies


P: n/a
"Harlan Messinger" <h.*********@comcast.net> wrote in message
news:c5************@ID-114100.news.uni-berlin.de
Let's say I were perverse enough to save my HTML document using an
editor that encodes text as EBCDIC. How would the UA figure out not
only where my META tag is, but where ANYTHING is?


I don't understand: HTML must use only US-ASCII characters, and so must the
content-type attribute values...
Huh?

Jul 20 '05 #2

P: n/a
Harlan Messinger wrote:
[snip]
And that just addresses the encoding. How about the content-type? What if I
were twisted enough to tell the UA, which has been assuming that my
document is an HTML text document just long enough to reach and parse my
META tag in the first place, that the document is really text/plain? (In
that case, would it start over from the beginning based on text/plain,
this time not doing any parsing at all, and therefore not finding a META
tag at all, and therefore not finding an override for the default
supposition of text/html, and therefore AGAIN starting from the beginning
and parsing the document as HTML, and then finding the META tag and
realizing the document is text/plain, and starting all over again based on
that assumption, and then ....) Or image/gif?

The HTTP 1.1 specification makes it clear that the Content-Type header
should take precedence over anything that may be in the response body:

"If and only if the media type is not given by a Content-Type field, the
recipient MAY attempt to guess the media type via inspection of its content
and/or the name extension(s) of the URI used to identify the resource."

-- <URL:http://www.w3.org/Protocols/rfc2616/rfc2616-sec7.html#sec7.2.1>

So if the HTTP headers say it's text/plain, text/plain it is (unless you use
a browser that violates the HTTP 1.1 specification).
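
A minimal Python sketch of that precedence rule. The function names and the guess_from_body callback are hypothetical, chosen only for illustration. The point is that inspection of the body (META included) is consulted only when no Content-Type header arrived, so the restart loop described in the question never begins.

```python
from email.message import Message


def media_type_from_header(header_value):
    """Return the lower-cased media type from a Content-Type value, or None."""
    if not header_value:
        return None
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_type() drops parameters such as charset and lower-cases.
    return msg.get_content_type()


def effective_media_type(header_value, guess_from_body):
    """Prefer the HTTP header; fall back to guessing only when it is absent."""
    declared = media_type_from_header(header_value)
    if declared is not None:
        return declared
    # Hypothetical callback: whatever body or URI inspection the client does.
    return guess_from_body()


if __name__ == "__main__":
    print(effective_media_type("text/plain; charset=iso-8859-1",
                               lambda: "text/html"))
```

With a text/plain header the callback is never invoked, so whatever the document claims about itself is not even looked at.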

So I'm just curious how the content-type META tag got into the spec in the
first place. It seems to defy logic.


According to the HTML 4.01 specification, <meta> elements with http-equiv
attributes are designed to be parsed by the server and converted to proper
HTTP headers. In reality, this is usually impractical, and browsers
started to pay attention themselves a long time ago. As you've said, this
can lead to some stupid results.

"HTTP servers may use the property name specified by the http-equiv
attribute to create an [RFC822]-style header in the HTTP response."

-- <URL:http://www.w3.org/TR/html401/struct/global.html#h-7.4.4.2>

As far as the paradox of figuring out the character encoding goes, the way I
understand it is that if the HTTP headers don't indicate the encoding, it
defaults to US-ASCII, and when it gets to the relevant <meta> element, the
browser has the option of starting again with that character encoding.
This means that as long as you use a superset of US-ASCII, it will "work".
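
A rough Python sketch of that "scan as ASCII, then start again" idea. The regular expression, the 1024-byte limit, and the function names are assumptions made for illustration; real browsers use a more elaborate prescan.

```python
import re

# Looks for charset=... inside a <meta ...> tag, working on raw bytes so that
# no decoding decision is needed before the declaration has been found.
META_CHARSET = re.compile(
    rb"<meta[^>]*charset\s*=\s*[\"']?\s*([A-Za-z0-9._:-]+)", re.IGNORECASE
)


def sniff_meta_charset(raw, limit=1024):
    """Return the charset named in an early META declaration, or None."""
    match = META_CHARSET.search(raw[:limit])
    return match.group(1).decode("ascii").lower() if match else None


def decode_html(raw, http_charset=None, default="us-ascii"):
    """Decode using the HTTP charset, else the META charset, else the default."""
    charset = http_charset or sniff_meta_charset(raw) or default
    try:
        return raw.decode(charset, errors="replace")
    except LookupError:
        # Unknown charset name: fall back to the default rather than fail.
        return raw.decode(default, errors="replace")
```

This only works when the bytes of the declaration itself look like ASCII, which is why a superset of US-ASCII is needed for it to "work".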

I believe the rules for default character encodings change when you start
talking about XHTML (plus you have to throw the XML prolog into the mix).
Also remember that a browser can (reliably?) detect UTF-16 by the BOM.
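
BOM detection can be sketched in a few lines of Python. This is an illustration only: the function name is made up, and UTF-32 is deliberately ignored (its little-endian BOM begins with the same two bytes as the UTF-16 one, so a fuller version would have to check it first).

```python
import codecs

# (BOM bytes, encoding name) pairs checked in order.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def detect_bom(raw):
    """Return the encoding implied by a leading byte order mark, or None."""
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name
    return None
```

A document without a BOM gives None, in which case the client is back to headers, META, or defaults.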

You'll probably want to read through this lot if you haven't already:

<URL:http://ppewww.ph.gla.ac.uk/~flavell/charset/>
--
Jim Dabell

Jul 20 '05 #4

P: n/a
"Jim Dabell" <ji********@jimdabell.com> wrote in message
news:7K********************@giganews.com
According to the HTML 4.01 specification, <meta> elements with
http-equiv attributes are designed to be parsed by the server and
converted to proper HTTP headers.

The rec says "may", so this is not mandatory. By the way, if anyone could
post a list of HTTP servers that really read the HTML content-type META value
and send it in the HTTP response header, I would be very glad 8) I haven't
been able to find this information anywhere for years.

In reality, this is usually impractical, and browsers
started to pay attention themselves a long time ago.


I didn't understand that; can you please explain?

Jul 20 '05 #6

P: n/a
"Pierre Goiffon" <pg******@nowhere.invalid> wrote:
Let's say I were perverse enough to save my HTML document using an
editor that encodes text as EBCDIC. How would the UA figure out not
only where my META tag is, but where ANYTHING is?


I don't understand: HTML must use only US-ASCII characters, and so
must the content-type attribute values...


Where did you get such ideas?

HTML surely needs some characters that belong to the Ascii repertoire,
such as "<" and "a". But there is no requirement that they be represented
in the Ascii encoding.
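
The distinction between characters and their encoded form is easy to see in Python. This is a small illustration, not part of any spec; cp037 is just one EBCDIC code page that Python happens to ship, picked here as an assumption.

```python
def byte_forms(text, codecs_to_try=("ascii", "utf-8", "utf-16-le", "cp037")):
    """Map each codec name to the bytes it produces for the same characters."""
    return {name: text.encode(name) for name in codecs_to_try}


if __name__ == "__main__":
    # Same three characters, several different byte sequences.
    for name, raw in byte_forms("<a>").items():
        print(f"{name:10} {raw.hex(' ')}")
```

The characters "<", "a" and ">" are in the Ascii repertoire in every case; only the bytes used to represent them change.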

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html

Jul 20 '05 #8

P: n/a
"Jukka K. Korpela" <jk******@cs.tut.fi> wrote in message
news:Xn*****************************@193.229.0.31
I don't understand: HTML must use only US-ASCII characters, and so
must the content-type attribute values...


Where did you get such ideas?


I was talking about HTML code, not about content! Tags like <html>,
<head>, etc.
The content of an HTML document of course needs to be able to contain
international characters, from European to Chinese.

I was pretty sure I had read that HTML code is composed only of US-ASCII
characters, but... I can't find that anywhere now :(
If this is confirmed then, as Jim Dabell said just below, a browser that
retrieves a document with no HTTP charset information could, I guess, easily
get the HTML content-type meta value by parsing the document as US-ASCII.
In a well-formed document, all the structural delimiters leading up to that
meta (doctype, html, head) would indeed be encoded using US-ASCII only.

Did I explain it more clearly? Sorry, but it's difficult for me to
express myself in English, which is not at all my mother tongue.

Jul 20 '05 #10

P: n/a

"Jim Dabell" <ji********@jimdabell.com> wrote in message
news:7K********************@giganews.com...
Harlan Messinger wrote:
[snip]
And that just addresses the encoding. How about the content-type? What if I
were twisted enough to tell the UA, which has been assuming that my
document is an HTML text document just long enough to reach and parse my
META tag in the first place, that the document is really text/plain? (In
that case, would it start over from the beginning based on text/plain,
this time not doing any parsing at all, and therefore not finding a META
tag at all, and therefore not finding an override for the default
supposition of text/html, and therefore AGAIN starting from the beginning
and parsing the document as HTML, and then finding the META tag and
realizing the document is text/plain, and starting all over again based on
that assumption, and then ....) Or image/gif?

The HTTP 1.1 specification makes it clear that the Content-Type header
should take precedence over anything that may be in the response body:

"If and only if the media type is not given by a Content-Type field, the
recipient MAY attempt to guess the media type via inspection of its content
and/or the name extension(s) of the URI used to identify the resource."

-- <URL:http://www.w3.org/Protocols/rfc2616/rfc2616-sec7.html#sec7.2.1>

So if the HTTP headers say it's text/plain, text/plain it is (unless you use
a browser that violates the HTTP 1.1 specification).

If we always assumed there was a valid HTTP header that indicated the
content type and encoding to use, then there wouldn't be any point to the
META tag at all, so by discussing the META tag, it seems to me we must be
assuming conditions under which there isn't an appropriate HTTP header.

So I'm just curious how the content-type META tag got into the spec in the
first place. It seems to defy logic.

According to the HTML 4.01 specification, <meta> elements with http-equiv
attributes are designed to be parsed by the server and converted to proper
HTTP headers. In reality, this is usually impractical, and browsers
started to pay attention themselves a long time ago. As you've said, this
can lead to some stupid results.

"HTTP servers may use the property name specified by the http-equiv
attribute to create an [RFC822]-style header in the HTTP response."

-- <URL:http://www.w3.org/TR/html401/struct/global.html#h-7.4.4.2>

As far as the paradox of figuring out the character encoding goes, the way I
understand it is that if the HTTP headers don't indicate the encoding, it
defaults to US-ASCII,

OK, if that's true, that explains it--except it still doesn't help if
you've, heaven help you, used EBCDIC.

and when it gets to the relevant <meta> element, the
browser has the option of starting again with that character encoding.
This means that as long as you use a superset of US-ASCII, it will "work".


Jul 20 '05 #12

P: n/a
On Wed, 14 Apr 2004, Pierre Goiffon wrote:
"Jim Dabell" <ji********@jimdabell.com> wrote in message
news:7K********************@giganews.com
According to the HTML 4.01 specification, <meta> elements with
http-equiv attributes are designed to be parsed by the server and
converted to proper HTTP headers.


The rec says "may", so this is not mandatory.


Right. That -had- been the originally stated intention of <meta...>
headers, but in fact it's rarely implemented on the server side.

And there seems good reason for that, because server-side control of
HTTP Content-type header, charset attribute and so on and so forth, is
needed for a whole range of content types, not limited to HTML. So, as
a server implementer, you're surely going to say we've implemented the
necessary mechanisms which work for -any- content-type, so why would
we go and do the extra work to implement a mechanism that's only
defined for HTML?
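
A sketch of the kind of content-type-agnostic server mechanism meant here, in Python. The function name, the extension-based lookup, and the single configured charset are all assumptions for illustration, not how any particular server is actually set up.

```python
import mimetypes


def content_type_header(filename, text_charset="iso-8859-1"):
    """Build a Content-Type value for any file, HTML or otherwise."""
    media_type, _ = mimetypes.guess_type(filename)
    if media_type is None:
        media_type = "application/octet-stream"
    # The charset parameter is attached to every text/* type, not just HTML.
    if media_type.startswith("text/"):
        return f"{media_type}; charset={text_charset}"
    return media_type


if __name__ == "__main__":
    for name in ("index.html", "notes.txt", "logo.gif"):
        print(name, "->", content_type_header(name))
```

Nothing in it needs to open the file and parse HTML, which is the reason a META-reading step looks like extra work to a server implementer.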
In reality, this is usually impractical, and browsers
started to pay attention themselves a long time ago.


I didn't understand, can you please explain ?


I think Jim is saying that if the browser does not get the information
from an HTTP header, then it will parse and use the meta http-equiv.

And indeed the notorious "meta http-equiv=Refresh" purports to be
equivalent to an HTTP header that doesn't officially exist (and which,
last time I tried, wasn't implemented in MSIE - not that MSIE even
supports some official parts of HTTP).

Jul 20 '05 #14

P: n/a
On Wed, 14 Apr 2004, Jim Dabell wrote:
As far as the paradox of figuring out the character encoding goes,
the way I understand it is that if the HTTP headers don't indicate
the encoding, it defaults to US-ASCII, and when it gets to the
relevant <meta> element, the browser has the option of starting
again with that character encoding.


At risk of counting angels on pinheads, there are several mutually
contradictory specifications: one ruling that in general all text/*
content types default to us-ascii, one ruling that HTML <= 3.2
defaults to iso-8859-1, and one providing that XHTML can default to
whatever the BOM indicates, if there is one - although some
authorities would have it that XHTML can only do that when served as
an application/something content-type, whereas a text/* content type
must default to us-ascii if nothing else is specified. The meta
http-equiv is OF NO RELEVANCE to XHTML, but appendix C permits it to
be present in XHTML/1.0 for its so-called compatibility (huh!) with
HTML client agents.

It's all rather confusing, really. It's much better to have a simple
set of practical working rules, than to try to understand just exactly
whether every particular combination is technically permissible.
(How about that for a statement by a state-registered pedant, eh?).
cheers
Jul 20 '05 #16

P: n/a
/Harlan Messinger/:
Why would anyone ever have expected a content-type META tag to be effective
at all? Is it because someone was misled by the happenstance the letters of
the alphabet, the digits, and the characters {< > / ; , " ' =} happen to be
at the same locations in several particular common encodings (US-ASCII,
ISO-8859-1, etc.)?

Even assuming that these characters are always in the same locations, before
it can find my META tag in the first place, the UA has to make an *initial*
assumption as to whether the document is even 8-bit versus 16-bit.

The HTML 4 spec
<http://www.w3.org/TR/html4/charset.html#spec-char-encoding> states:
The META declaration must only be used when the character encoding
is organized such that ASCII-valued bytes stand for ASCII characters
(at least until the META element is parsed). META declarations
should appear as early as possible in the HEAD element.


If your document is "16-bit" then most probably its bytes don't map to
ASCII characters.
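
The condition quoted from the spec can be checked mechanically. A small Python sketch, where the sample declaration and the function name are made up for illustration:

```python
# A sample declaration; the charset value is a placeholder, not a real name.
DECLARATION = (
    '<meta http-equiv="Content-Type" '
    'content="text/html; charset=example">'
)


def meta_is_findable(codec):
    """True if ASCII-valued bytes stand for ASCII characters in the declaration."""
    return DECLARATION.encode(codec) == DECLARATION.encode("ascii")


if __name__ == "__main__":
    for codec in ("utf-8", "iso-8859-1", "cp1252", "utf-16", "cp037"):
        print(f"{codec:12} {meta_is_findable(codec)}")
```

For UTF-16 and for the EBCDIC code page cp037 the check fails, so a client scanning for ASCII bytes never sees the declaration at all.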

--
Stanimir
Jul 20 '05 #18

P: n/a
"Pierre Goiffon" <pg******@nowhere.invalid> wrote:
I was talking about HTML code, not about content! Tags like <html>,
<head>, etc.

So was I. Please read my entire message, it wasn't that long.

I was pretty sure I had read that HTML code is composed only of
US-ASCII characters, but... I can't find that anywhere now :(

Well, if you cannot check such simple things, maybe you shouldn't be too
sure. But this isn't about characters, this is about character encodings.

Did I explain it more clearly?


Your explanation was pretty clear the first time, just factually wrong.

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html

Jul 20 '05 #20

P: n/a

"Stanimir Stamenkov" <s7****@netscape.net> wrote in message
news:c5************@ID-207379.news.uni-berlin.de...
/Harlan Messinger/:
Why would anyone ever have expected a content-type META tag to be effective
at all? Is it because someone was misled by the happenstance that the letters of
the alphabet, the digits, and the characters {< > / ; , " ' =} happen to be
at the same locations in several particular common encodings (US-ASCII,
ISO-8859-1, etc.)?

Even assuming that these characters are always in the same locations, before
it can find my META tag in the first place, the UA has to make an *initial*
assumption as to whether the document is even 8-bit versus 16-bit.


The HTML 4 spec
<http://www.w3.org/TR/html4/charset.html#spec-char-encoding> states:
The META declaration must only be used when the character encoding
is organized such that ASCII-valued bytes stand for ASCII characters
(at least until the META element is parsed). META declarations
should appear as early as possible in the HEAD element.


If your document is "16-bit" then most probably its bytes don't map to
ASCII characters.


Ah. Thank you. In other words, its use really is as limited as I suspected.
(Out of curiosity, how many people will encode files as ASCII up until the
end of this META tag and then switch to something else?)

Jul 20 '05 #22

P: n/a
/Harlan Messinger/:
(Out of curiosity, how many people will encode files as ASCII up until the
end of this META tag and then switch to something else?)


Quite a lot, I suspect. ASCII, being such a long-standing standard, is
used as the base for many encodings, including UTF-8, which will most
probably become the "ASCII of the future".

--
Stanimir
Jul 20 '05 #24

P: n/a
"Jukka K. Korpela" <jk******@cs.tut.fi> wrote in message
news:Xn*****************************@193.229.0.31
I was pretty sure to have read that HTML code is only composed of
us-ascii characters but... can't find out that somewhere now :(
Well, if you cannot check such simple things, maybe you shouldn't be
too sure.


Well, I was wrong, so thank you for correcting me! Usenet is a good place
to learn and exchange information, don't you think?

I didn't find anything on the W3C website indicating that an HTML file must
be encoded using a charset that is a superset of us-ascii, and you're
perfectly right, that would be something of an aberration.
If I get it right, the positions of the 128 us-ascii characters are the same
in UTF-8 but not at all in UTF-16 or Unicode, and there are surely a lot of
charsets where the first 128 characters are not the same as in us-ascii.

So again, thanks for correcting me, and don't hesitate to add more
information or correct what I just wrote :)
But this isn't about characters, this is about character
encodings.


Yes, we're talking about the same thing, but your English is dramatically
better than mine :)

Jul 20 '05 #26

P: n/a
"Alan J. Flavell" <fl*****@ph.gla.ac.uk> wrote in message
news:Pi*******************************@ppepc56.ph. gla.ac.uk
So, as a server implementer, you're surely going to say we've
implemented the necessary mechanisms which work for -any-
content-type, so why would we go and do the extra work to
implement a mechanism that's only defined for HTML?


The charset definition is critical for HTML documents, and it's part of the
content-type. This is a good reason to implement this mechanism on a server,
especially if it's going to be used in a shared context (n sites on a single
server).
In reality, this is usually impractical, and browsers
started to pay attention themselves a long time ago.


I didn't understand that; can you please explain?


I think Jim is saying that if the browser does not get the information
from an HTTP header, then it will parse and use the meta http-equiv.
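
A rough Python sketch of that fallback order follows. It is not how any particular browser does it. The function names, the 1024-byte scan limit, and the iso-8859-1 default are all assumptions made for illustration.

```python
import re
from typing import Optional


def charset_from_header(content_type: Optional[str]) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value, if any."""
    if not content_type:
        return None
    match = re.search(r'charset\s*=\s*"?([^\s;"]+)', content_type, re.IGNORECASE)
    return match.group(1).lower() if match else None


def charset_from_meta(body: bytes, scan_limit: int = 1024) -> Optional[str]:
    """Look for charset=... inside a <meta> tag near the start of the body.

    This can only work when the bytes up to the tag are ASCII-compatible.
    The scan limit is a hypothetical value, not taken from any spec.
    """
    head = body[:scan_limit].decode("ascii", errors="replace")
    match = re.search(
        r'<meta[^>]*charset\s*=\s*["\']?([^\s"\'/>;]+)', head, re.IGNORECASE
    )
    return match.group(1).lower() if match else None


def pick_charset(content_type: Optional[str], body: bytes,
                 default: str = "iso-8859-1") -> str:
    """HTTP header first, then the meta tag, then an assumed default."""
    return charset_from_header(content_type) or charset_from_meta(body) or default
```

For example, `pick_charset("text/html", body)` falls through to the META scan because the header carries no charset parameter, while a header such as `text/html; charset=ISO-8859-1` wins regardless of what the document says.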


Err, that is what the rec states, so I thought Jim Dabell wanted to say
something else?

Jul 20 '05 #28

P: n/a
"Stanimir Stamenkov" <s7****@netscape.net> wrote in message
news:c5************@ID-207379.news.uni-berlin.de
ASCII being such a long standing standard is
used as base for many encodings including the UTF-8, which most
probably would become the "ASCII of the future".


Does UTF-8 include all the characters needed for European languages?
Asian languages? African languages? (...)
It's not really clear to me whether UTF-8 is a small piece of Unicode or just
a means of keeping compatibility with us-ascii. Reading the unicode.org
FAQ wasn't really helpful :/ Can anyone explain this to me?

Jul 20 '05 #30

P: n/a
/Pierre Goiffon/:
Does UTF-8 include all the characters needed for European languages?
Asian languages? African languages? (...)
It's not really clear to me whether UTF-8 is a small piece of Unicode or just
a means of keeping compatibility with us-ascii. Reading the unicode.org
FAQ wasn't really helpful :/ Can anyone explain this to me?


With UTF-8 you can encode any character specified in the Unicode
character set. The way it works: ASCII and basic latin characters
are represented as single bytes where these bytes correspond to
ASCII byte codes. Cyrillic and other character groups are
represented with two bytes, further character groups are represented
with three or more bytes.
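
As a quick illustration (a sketch added here, not part of the original post), Python can show the byte counts directly. The sample characters are arbitrary choices. Under the current definition of UTF-8 the longest sequence is four bytes.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes UTF-8 uses for a single character."""
    return len(ch.encode("utf-8"))


samples = {
    "A": "Latin capital A (ASCII)",
    "\u042f": "Cyrillic capital Ya",
    "\u20ac": "euro sign",
    "\U0001d11e": "musical G clef (outside the Basic Multilingual Plane)",
}

for ch, name in samples.items():
    # Print the code point, a description, the byte count and the bytes in hex.
    print(f"U+{ord(ch):04X} {name}: {utf8_len(ch)} byte(s), {ch.encode('utf-8').hex()}")
```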

--
Stanimir
Jul 20 '05 #32

P: n/a
"Stanimir Stamenkov" <s7****@netscape.net> wrote in message
news:c5************@ID-207379.news.uni-berlin.de
With UTF-8 you can encode any character specified in the Unicode
character set.


Thanks very much for the answer.

I learnt a lot from this thread! :)

Jul 20 '05 #34

P: n/a
On Thu, 15 Apr 2004, Pierre Goiffon wrote:
Does UTF-8 include all the characters needed for European languages?
utf-8 doesn't "include" anything. It's one of the possible encoding
formats for Unicode.
Asian languages ? African languages ? (...)
http://www.unicode.org
It's not really clear to me if UTF-8 is a small piece of Unicode


Then kindly read the Unicode FAQ.

Jul 20 '05 #36

P: n/a
"Alan J. Flavell" <fl*****@ph.gla.ac.uk> writes:
On Wed, 14 Apr 2004, Pierre Goiffon wrote:
"Jim Dabell" <ji********@jimdabell.com> wrote in message
news:7K********************@giganews.com
According to the HTML 4.01 specification, <meta> elements with
http-equiv attributes are designed to be parsed by the server and
converted to proper HTTP headers.


The rec says "may", so this is not mandatory.


Right. That -had- been the originally stated intention of <meta...>
headers, but in fact it's rarely implemented on the server side.


Out of curiosity, is it *ever* implemented server-side? I realise in
theory an Apache module or similar could, but I haven't heard of
anyone actually doing so.

--
Chris
Jul 20 '05 #38

P: n/a
On Thu, 15 Apr 2004, Chris Morris wrote:
"Alan J. Flavell" <fl*****@ph.gla.ac.uk> writes:
Right. That -had- been the originally stated intention of <meta...>
headers, but in fact it's rarely implemented on the server side.
Out of curiosity, is it *ever* implemented server-side?


I haven't seen it myself, I must admit, but since I can't claim to
know all servers, I was making only limited claims ;-)
I realise in theory an Apache module or similar could, but I haven't
heard of anyone actually doing so.


Russian Apache's on-the-fly transcoding module does something with
meta...charset, but as far as I know, what it does with it is to take
it out (to avoid confusion with the gen-u-wine HTTP charset, which it
obviously needs to set).
Jul 20 '05 #40
