On Sun, 1 Aug 2004, Nick Kew wrote:
Someone just drew attention to an open bug report in Apache concerning
shipping with an AddDefaultCharset set by default in httpd.conf.
Thanks for pointing this issue out. This is only a first attempt at
an answer.
This leads to bogus charsets being served in many cases.
Inevitably it does, yes; I can't disagree with that.
I've just put forward a suggestion, but I'd welcome review from those
familiar with character encoding issues
I think that might include me...
and CA-2000-02
Well, I know that its conclusion is that documents should not be
served out without a charset. On the other hand, Martin D's comment -
that serving out a wrong charset is worse than useless - cannot be
refuted.
What I cannot claim to be an expert on are the minutiae of cross-site
scripting as explored in that CERT advisory.
Comments and suggestions please:
http://issues.apache.org/bugzilla/show_bug.cgi?id=23421
It seems to me to be most unfortunate that Martin D presents a
detailed essay on the harmfulness of this procedure without apparently
making a single mention of the CA-2000-02 issue which motivated the
original introduction of this default. That makes it so much harder
to form a balanced view of the logic.
<advocatus-diaboli>
Maybe the default should be x-user-defined ?
</>
The implication of his point (2), which I interpret as saying it would
be better to get rid of server charset altogether, and rely on meta
http-equiv for HTML and the <?xml..encoding for XHTML, would I think
be energetically disputed by quite a number of respected contributors.
[Aside - I wish people wouldn't refer to cross-site scripting as CSS!
Those who insist on using a TLA would be better advised to use XSS -
Google suggests
http://www.cgisecurity.com/articles/...shtml#whatdoes ]
Keep in mind that character encoding is an issue for most kinds of
text/* content, as well as being an issue for some kinds of
application/* content. Some of those content types have their own
machinery for indicating character representations, but most of them
have not (text/plain for example). In general, you need an HTTP
mechanism which works for all of those content-types: and Apache *has*
such a mechanism. The hard part seems to be persuading people to use
it!!
However, this report seems to be confined to (X)HTML content.
The complainant is probably right that more emphasis should be put on
producing author-oriented documentation about this issue. (But I'm
not about to volunteer to write it, I'm afraid.)
My *reluctant* conclusion, bearing in mind the item (2) in Martin D's
original report, and the widespread observed reliance on meta
http-equiv and f(r)iends, would be that there needs to be a module
which parses the actual documents at least as far as the meta
http-equiv charset, the <?xml...encoding and BOM, and copies the
"correct"[1] information into the real HTTP header. And that this
behaviour should be enabled by default, with the documentation saying
to more clueful authors/admins that they would do well to turn this
default behaviour off, and handle the job properly for themselves.
[1] Unfortunately, what is "correct" is by no means obvious, when
confronted with an arbitrary document. I've seen occasional documents
claiming to be XHTML/1.0 and served out as text/html, in which the
HTTP header, the BOM, the meta..http-equiv and the <?xml..encoding
were apparently saying different things.
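
To make the above a little more concrete, here is a rough Python sketch
(not a proposal for the actual Apache module, which would presumably be
written in C) of the kind of sniffing I have in mind. The function names
declared_charsets and is_consistent are my own invention for
illustration. It assumes an ASCII-compatible document, looks only at the
first 1024 bytes, expects http-equiv to come before content in the meta
element, and makes no attempt to treat aliases such as "latin1" and
"iso-8859-1" as equal.

```python
import codecs
import re
from typing import Dict, Optional

# BOM signatures; UTF-32 is checked before UTF-16 because the UTF-32LE
# BOM begins with the same two bytes as the UTF-16LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32le"),
    (codecs.BOM_UTF32_BE, "utf-32be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16le"),
    (codecs.BOM_UTF16_BE, "utf-16be"),
]

# <?xml ... encoding="..." at the very start of the document.
_XML_DECL = re.compile(
    rb"""^<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z0-9._-]+)"""
)

# <meta http-equiv="Content-Type" content="...; charset=...">
# Simplification: http-equiv must precede the content attribute.
_META = re.compile(
    rb"""<meta[^>]+http-equiv\s*=\s*["']?content-type["']?"""
    rb"""[^>]*?charset\s*=\s*["']?([A-Za-z0-9._-]+)""",
    re.IGNORECASE,
)


def declared_charsets(data: bytes,
                      http_charset: Optional[str] = None) -> Dict[str, str]:
    """Collect every charset claim found, keyed by where it came from."""
    found: Dict[str, str] = {}
    if http_charset:
        found["http"] = http_charset.lower()

    body = data
    for bom, name in _BOMS:
        if data.startswith(bom):
            found["bom"] = name
            body = data[len(bom):]  # drop the BOM before further sniffing
            break

    head = body[:1024]  # only inspect the start of the document
    m = _XML_DECL.match(head)
    if m:
        found["xml"] = m.group(1).decode("ascii").lower()
    m = _META.search(head)
    if m:
        found["meta"] = m.group(1).decode("ascii").lower()
    return found


def is_consistent(found: Dict[str, str]) -> bool:
    """True if all sources that said anything said the same thing."""
    return len(set(found.values())) <= 1


if __name__ == "__main__":
    # A deliberately self-contradictory document of the kind footnote [1]
    # describes.
    doc = (codecs.BOM_UTF8
           + b'<?xml version="1.0" encoding="UTF-8"?>\n'
           + b'<html><head><meta http-equiv="Content-Type" '
           + b'content="text/html; charset=ISO-8859-1"></head></html>')
    claims = declared_charsets(doc, http_charset="ISO-8859-1")
    print(claims, is_consistent(claims))
```

Note that the sketch only reports what each source claims. Deciding
which claim wins when they disagree is exactly the part that is by no
means obvious, and I have not tried to settle it here.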
Your hint about multiviews is a Good Thing, but I'm not sure that the
use of multiple filename extensions has been sufficiently promoted yet
for it to be simply dropped onto users as a new Apache default. If
that route is chosen, I think it'll need to be staged in, and there
will still need to be a default for use when there is no other source
of information - unless someone can interpret CA-2000-02 as harmless,
which I'm certainly not keen to do.
<digression> One should note that MultiViews sometimes accidentally
exposes the existence of other documents in a web subdirectory, which
the author had intended should be hidden from casual browsing.
(mod_speling sometimes does that, too.)
</>
Was that any use? These are only first reactions.