On 21 Sep 2004 11:30:45 -0700,
lkrubner@geocities.com (lawrence) wrote:
[color=blue]
>Andy Hassall <andy@andyh.co.uk> wrote in message news:<p4eok059t03jj84ssu1n6tkgped5dfijhv@4ax.com>. ..[color=green]
>> It may send Content-type determined by the MIME type for the extension, or
>> looked up through mime-magic, but it generally doesn't know character set, and
>> to my knowledge Apache itself won't send the character set part of the header
>> itself - it just sends 'data' in a character-set agnostic way.
>>
>> You can set it up so that Apache sends a character set header with content
>> negotiation settings, though, but you need to provide the server with more
>> information in that case.
>>[color=darkred]
>> >A weaker solution is send a meta
>> >http-equiv tag specifying the charset. But something somewhere has to
>> >send that info. If the web server has no way to know the charset
>> >because all the characters are being generated by PHP, the PHP should
>> >send a charset header, yes?[/color]
>>
>> Yes. There's an option in php.ini as to which character set to default to - I
>> think the default default is iso8859-1. (Although really ought to be iso8859-15
>> due to the Euro).[/color]
>
>Okay, I don't get this at all. What sends the character encoding
>information? If you have a set of static HTML files sitting on a
>server, what is responsible for sending the character encoding?[/color]
Done a bit more digging, and there's this in my httpd.conf:
#
# Specify a default charset for all pages sent out. This is
# always a good idea and opens the door for future internationalisation
# of your web site, should you ever want it. Specifying it as
# a default does little harm; as the standard dictates that a page
# is in iso-8859-1 (latin1) unless specified otherwise i.e. you
# are merely stating the obvious. There are also some security
# reasons in browsers, related to javascript and URL parsing
# which encourage you to always set a default char set.
#
AddDefaultCharset ISO-8859-1
OK, so Apache does send out a character set header under the recommended
configuration - although it's effectively hardcoded; it doesn't 'detect' the
encoding of the file, since that's basically impossible in isolation.
To get Apache to send a different character set header for a particular file,
you'd then need to use Apache content negotiation - either with a type-map, or
I believe it can base it off suffixes of the filename (index.html.iso8859-p15
and so on).
Consider the following response from Apache:
andyh@server:~/public_html$ touch utf8.html.utf8
andyh@server:~/public_html$ telnet localhost 80
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
HEAD /~andyh/utf8.html HTTP/1.0
HTTP/1.1 200 OK
Date: Tue, 21 Sep 2004 19:19:03 GMT
Server: Apache/2.0.51 (Unix) PHP/5.0.1 DAV/2 SVN/1.0.6
Content-Location: utf8.html.utf8
Vary: negotiate
TCN: choice
Last-Modified: Tue, 21 Sep 2004 19:18:47 GMT
ETag: "3811f-0-7f9b93c0;7f9b93c0"
Accept-Ranges: bytes
Connection: close
Content-Type: text/html; charset=utf-8
Connection closed by foreign host.
OK - so a filename of utf8.html.utf8 means that a request for utf8.html comes
out labelled as charset=utf-8. (I've got content negotiation enabled on my
server.) Presumably, where there are multiple encodings for the same URI, the
browser's Accept-Charset header comes into play for Apache to pick which one to
serve.
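As a rough illustration of how an Accept-Charset header could drive that choice, here's a minimal Python sketch. It is not Apache's actual negotiation algorithm, just the q-value rule from RFC 2616 section 14.2 applied to a made-up set of variants; the function names and the example filenames are hypothetical.

```python
def parse_accept_charset(header):
    """Parse an Accept-Charset value into a {charset: q} dict (lowercased names)."""
    prefs = {}
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        name, _, params = item.partition(";")
        q = 1.0  # q defaults to 1 when not given
        params = params.strip()
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                q = 0.0
        prefs[name.strip().lower()] = q
    # RFC 2616 14.2: with no "*" present, ISO-8859-1 gets q=1 unless mentioned.
    if "*" not in prefs and "iso-8859-1" not in prefs:
        prefs["iso-8859-1"] = 1.0
    return prefs


def pick_variant(variants, accept_charset):
    """variants maps a (hypothetical) filename to its charset.

    Returns the filename whose charset has the highest q-value, or None if
    nothing is acceptable. The first variant wins a tie.
    """
    prefs = parse_accept_charset(accept_charset)
    best_name, best_q = None, 0.0
    for filename, charset in variants.items():
        q = prefs.get(charset.lower(), prefs.get("*", 0.0))
        if q > best_q:
            best_name, best_q = filename, q
    return best_name


# Hypothetical variants of the same URI, keyed by filename.
variants = {"page.html.utf8": "utf-8", "page.html.latin1": "iso-8859-1"}
chosen = pick_variant(variants, "utf-8, iso-8859-1;q=0.5")
```

With the header above, utf-8 has the higher q-value, so the sketch picks the utf-8 variant; a real server weighs other dimensions (language, media type) as well.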
[color=blue]
> If I,
>as a web-designer, am not supposed to use http-equiv meta tags,
>because they are weak, then the information is not inside of the HTML
>file. So the information needs to be outside of the HTML file. And
>what is outside of the HTML file? If Apache remains agnostic about
>character encoding, then at what point does character encoding get
>sent? Where is the information stored, and how is it sent out to web
>browsers?[/color]
Either a type map, or encoded in the filename. (can't speak for other servers
apart from Apache).
[color=blue]
>Every character has an encoding by default, right? If no encoding is
>given, then there are a series of possible defaults, right? An Apache
>server may have a default, or PHP may have a default encoding set in
>the php.ini file, right?[/color]
Right.
[color=blue]
> If no default is set anywhere then the
>characters are basically raw text, right? In other words, ASCII?[/color]
Ah, but even ASCII isn't raw text, depending on your definition of raw - it's
the ASCII encoding of a small-ish character set.
'Binary' is the usual definition of completely raw data - it's just a stream
of bytes with no defined correspondence to characters.
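To make the bytes-versus-characters distinction concrete, here's a small Python sketch (my own illustration, not anything Apache or PHP does): the same single byte turns into different characters depending on which encoding you assume, and isn't valid ASCII at all. It also shows why iso8859-15 matters for the Euro.

```python
raw = b"\xa4"  # one byte; on its own it has no defined character meaning

as_latin1 = raw.decode("iso-8859-1")   # U+00A4, the generic currency sign
as_latin9 = raw.decode("iso-8859-15")  # U+20AC, the Euro sign

# ASCII only defines 0x00-0x7F, so this byte cannot be decoded as ASCII.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# The same Euro character is a three-byte sequence in UTF-8.
euro_utf8 = "\u20ac".encode("utf-8")
```

So "no charset" doesn't mean "plain characters" - without a label, the recipient has to assume or guess one of these interpretations.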
As to what the default in HTTP is - time to dig out the HTTP standards.
RFC 2616: Hypertext Transfer Protocol -- HTTP/1.1
<ftp://ftp.isi.edu/in-notes/rfc2616.txt>
"
3.4.1 Missing Charset
Some HTTP/1.0 software has interpreted a Content-Type header without
charset parameter incorrectly to mean "recipient should guess."
Senders wishing to defeat this behavior MAY include a charset
parameter even when the charset is ISO-8859-1 and SHOULD do so when
it is known that it will not confuse the recipient.
Unfortunately, some older HTTP/1.0 clients did not deal properly with
an explicit charset parameter. HTTP/1.1 recipients MUST respect the
charset label provided by the sender; and those user agents that have
a provision to "guess" a charset MUST use the charset from the
content-type field if they support that charset, rather than the
recipient's preference, when initially displaying a document. See
section 3.7.1.
"
"
3.7.1 Canonicalization and Text Defaults
[...]
The "charset" parameter is used with some media types to define the
character set (section 3.4) of the data. When no explicit charset
parameter is provided by the sender, media subtypes of the "text"
type are defined to have a default charset value of "ISO-8859-1" when
received via HTTP. Data in character sets other than "ISO-8859-1" or
its subsets MUST be labeled with an appropriate charset value. See
section 3.4.1 for compatibility problems.
"
OK - so we officially default to ISO-8859-1, at least for text/* content
types. ISO-8859-1 is a superset of ASCII, but it's definitely a well-defined
character set and not just a raw stream of bytes. Makes sense.
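The rule quoted from section 3.7.1 can be sketched in a few lines of Python. This is only an illustration of the RFC 2616 default as a recipient might apply it; the function name effective_charset is made up, and the parsing is deliberately simplistic. (The later HTTP/1.1 revision, RFC 7231, dropped this default, so treat it as describing RFC 2616 only.)

```python
def effective_charset(content_type):
    """Return the charset a recipient would assume under RFC 2616 3.7.1.

    An explicit charset parameter wins; otherwise text/* defaults to
    ISO-8859-1, and other media types have no defined charset (None).
    """
    media_type, *params = [part.strip() for part in content_type.split(";")]
    for param in params:
        name, _, value = param.partition("=")
        if name.strip().lower() == "charset" and value:
            # Charset names are case-insensitive and may be quoted.
            return value.strip().strip('"').lower()
    if media_type.lower().startswith("text/"):
        return "iso-8859-1"
    return None


labelled = effective_charset("text/html; charset=utf-8")
unlabelled = effective_charset("text/html")
binary = effective_charset("image/png")
```

The first call mirrors the Content-Type header from the telnet session earlier; the second is what you'd get for a static page served with no charset configured at all.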
[color=blue]
>Or do I have it all wrong?[/color]
Definitely sounds like you've got the idea.
[color=blue][color=green][color=darkred]
>> >> >I'm doing this so I can output to XML without getting errors about
>> >> >"You should not sent plain text".
>> >>
>> >> Don't know what you mean here. XML content doesn't have to be UTF-8 encoded,
>> >> just properly escaped and the encoding set correctly.[/color][/color]
>
>Sorry, I meant RSS. Most RSS validators throw an error if you try to
>set up an RSS feed using plain text.[/color]
Oh, is this just a case of the wrong Content-type though - text/plain or
text/html vs. text/xml or whatever it is?
[snip]
[color=blue][color=green]
>> I *think* form data is always in the character set of the page containing the
>> original form. I haven't got a reference to back that up, though.[/color]
>
>Yes, we had quite a conversation about that over on another newsgroup.
>It was quite informative. You can read it here, if you've any
>interest:
>
>
http://groups.google.com/groups?hl=e...%3D10%26sa%3DN[/color]
Hm - Netscape 4 as ever is a complete mess then! Does anyone actually use NN4
any more? It's well past time it was blasted out of existence - does it do
_anything_ right?
--
Andy Hassall / <andy@andyh.co.uk> / <http://www.andyh.co.uk>
<http://www.andyhsoftware.co.uk/space> Space: disk usage analysis tool