Zhang Weiwu <zhangweiwu@realss.com> wrote:
> Hello. I am working with a php software project, in it
> (www.egroupware.org) Chinese simplified locate is "zh" while
> Traditional Chinese "tw".
I presume you refer to locale names here. Locales are a world of their
own, and many people think that they are a wrong approach to the
problems of cultural diversity. HTML itself has nothing to do with
locales, even though some locale names may coincide with values of the
lang="..." attribute.
> I wish to send correct language attribute in http header,
That's not of much use, but if you do so, the correct values are
ISO 639 language codes optionally followed by a subcode. Actually the
details are a bit muddy, since the HTTP protocol definition refers to
RFC 1766, which has now been obsoleted by RFC 3066 and RFC 3282.
But apparently RFC 3282 is to be taken as overriding the HTTP/1.1
specification if there is a conflict, since it explicitly defines the
Content-Language header.
> I found "zh" is not standard.
It definitely is the standard ISO 639 code for the Chinese language and
the one that shall be used both in lang="..." attributes in HTML and in
Content-Language headers in HTTP. Whether it is followed by a subcode
is a different issue, and so is the fairly complex question of what
really constitutes "the Chinese language" here, but apparently it is to
be understood in a very broad sense (not limited to putonghua).
> I found this line in apache2's default
> httpd.conf
>
> # Simplified Chinese (zh-CN)
> AddLanguage zh-CN .zh-cn
>
> So it seems zh-CN is the correct language attribute to send.
The Apache default configuration has no authoritative role. It is
something that should comply with specifications, _be_ correct by the
specs, not try to define what is correct.
But zh-CN is _a_ correct value for the Content-Language header.
It specifies the particular variant of Chinese spoken in China (country
code CN), though this probably raises more questions than it can
answer. This is very confusing, since what people probably _mean_ when
they use that code is the "simplified Chinese" _writing system_, and for
such purposes, there are also some IANA registered subcodes, see
http://www.iana.org/assignments/language-tags
which lists, among others, "zh-Hans" defined as 'Chinese, in simplified
script'. But the fact that it is registered properly according to the
procedures set up in the relevant RFCs doesn't mean that it would be
used and usable in practice.
Whether Content-Language should specify "zh-CN" is debatable. According
to the HTTP protocol, "The Content-Language entity-header field
describes the natural language(s) of the intended audience for the
enclosed entity" (i.e., of the document sent). Note that it is not
defined as the language of the document. The distinction is subtle, but
it becomes important when subcodes are included.
Content-Language: zh-CN
says, by the protocol, that the document is intended for people who
understand Chinese in the form used in China. It is beyond my
competence to decide whether this would be adequate, but I think I do
know that it would be incorrect to send, for example,
Content-Language: en-GB
except perhaps in very special cases. Surely people who speak, for
example, the "standard" US version of English would reasonably
understand British English as well.
So normally the Content-Language header, if used, should only specify
the major language code, such as zh or en.
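To make that last point concrete, here is a small sketch in Python (my choice of language for illustration, since the question concerns no particular server-side code). The function names are made up for this example. It simply cuts a tag down to its primary subtag before building the header line, on the assumption that the subcode adds nothing useful for the intended audience.

```python
def primary_language(tag: str) -> str:
    """Return the primary language subtag of a tag such as 'zh-CN'."""
    # The primary subtag is everything before the first hyphen.
    return tag.split("-", 1)[0].lower()


def content_language_header(tag: str) -> str:
    """Build a Content-Language header line naming only the major language."""
    return "Content-Language: " + primary_language(tag)


print(content_language_header("zh-CN"))
print(content_language_header("en-GB"))
```

Whether dropping the subcode is right for a given document is still a judgment call, as discussed above; the code only shows the mechanics.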
> But
> mozilla seems to refuse it:
>
> body:lang(zh-CN) { font-size: 14pt}
That construct is a CSS rule, not HTML or HTTP at all, though it may
relate to both:
"If the document language specifies how the human language of an
element is determined, it is possible to write selectors in CSS that
match an element based on its language. For example, in HTML [HTML40],
the language is determined by a combination of the "lang" attribute,
the META element, and possibly by information from the protocol (such
as HTTP headers). XML uses an attribute called xml:lang, and there may
be other document language-specific methods for determining the
language."
http://www.w3.org/TR/REC-CSS2/selector.html#lang
This is rather vague - it does not really define whether a browser
_shall_, _should_, or _may_ use HTTP headers to define what elements
the :lang(...) selector matches, but clearly it is meant that the
language specified in lang="..." attributes in HTML takes precedence.
(Besides, the rule tries to enforce a fixed font size for the body of a
document, which is almost always a poor idea on the Web. But this is a
different can of worms.)
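What CSS2 does define fairly clearly is the comparison itself: :lang(C) is matched the same way as the |= attribute operator, i.e. the element's language either equals C or begins with C immediately followed by a hyphen. The following Python sketch shows just that comparison. The function name is hypothetical, the case folding is my assumption based on language tags being case-insensitive, and how the element's language was determined (attribute, META, HTTP header) is left out, since that is the vague part.

```python
def lang_selector_matches(selector_lang: str, element_lang: str) -> bool:
    """Sketch of the :lang(C) comparison: exact match or hyphen-delimited prefix."""
    selector = selector_lang.lower()
    element = element_lang.lower()
    # "zh" matches "zh" and "zh-CN", but "zh-CN" does not match plain "zh".
    return element == selector or element.startswith(selector + "-")


print(lang_selector_matches("zh", "zh-CN"))
print(lang_selector_matches("zh-CN", "en"))
```

By this rule, body:lang(zh-CN) should not match a document whose language is "en", and body:lang(zh) would match both zh-CN and zh-TW.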
> This line always make the body font 14pt even if lang in http
> header is set to "en".
That sounds like a Mozilla bug. It's really off-topic in this group,
since it's about a browser's CSS implementation.
> But
>
> body:lang(cn) { font-size: 14pt}
>
> is recognized by mozilla. This line only make the body font 14pt
> when the header is lang=cn.
Header lang=cn? Is this about HTTP headers, or about the lang="..."
attribute in HTML? In either case though, the code "cn" is definitely
incorrect. There is no language code "cn" assigned in ISO 639, and it
must not be used by private agreement either, since all two-letter
language codes are reserved for allocation by the ISO.
But a browser is not expected to know such things in practice. It
treats the language code just as a string, though it may recognize some
codes and do something sensible based on its knowing what the language
of the text is (although this is mostly just wishful thinking).
> I happen to know a few
> others, say "zh_CN" "zh_EUC" and "zh-EUC".
Welcome to the meta-Babel of language codes. Language codes are well
standardized: every group has its own standards or "standards".
Compared to that, HTML and HTTP have fairly fixed rules on what the
codes mean and which codes shall be used. Too bad so few programs behave
by the rules.
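If you need to turn such locale-style names into something usable in lang="..." or Content-Language, a rough sketch in Python might look like the following. It rests on my guesses about what the names mean: that an underscore stands in for the hyphen, that anything after a "." or "@" is an encoding or modifier, and that a second part is a country code only if it is exactly two letters (so "EUC", which looks like an encoding name, is dropped). The function name is invented for the example.

```python
def locale_to_language_tag(locale_name: str) -> str:
    """Guess a language tag (e.g. 'zh-CN') from a locale-style name."""
    # Drop an encoding suffix such as ".UTF-8" and a modifier such as "@euro".
    base = locale_name.split(".", 1)[0].split("@", 1)[0]
    parts = base.replace("_", "-").split("-")
    language = parts[0].lower()
    # Keep a second part only if it looks like a two-letter country code.
    if len(parts) > 1:
        region = parts[1]
        if len(region) == 2 and region.isascii() and region.isalpha():
            return language + "-" + region.upper()
    return language


for name in ("zh_CN", "zh_CN.UTF-8", "zh_EUC", "zh-EUC"):
    print(name, "->", locale_to_language_tag(name))
```

This does not check that the result is a real ISO 639 code; "cn_CN" would pass through as "cn-CN", which is still wrong.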
> Today I read w3c's
> suggestion:
> http://www.w3.org/International/questions/qa-css-lang.html
Beware that it's mostly just wishful thinking. For example, IE, the
dominant browser, knows nothing about any CSS selectors used there, and
this situation should not be expected to change in the next few years.
Note that the document is descriptive, not normative. As far as I can
see, it complies with W3C recommendations, which is not surprising.
And it seems to paint a correct picture about the situation with
Chinese, except that it does not quite explicitly say this:
zh-CN and zh-TW are not the correct way to indicate writing system,
but they are what some browsers recognize, whereas the correct way
is ignored by browsers.
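The trade-off can be written down as a small lookup, sketched here in Python. The table and function names are made up for illustration. "zh-Hant" does not appear in the discussion above; I include it as the traditional-script counterpart of "zh-Hans" in the same IANA registry. The default is the country-based tag, on the assumption that you care more about what browsers react to than about theoretical correctness.

```python
# Tags per writing system: the registered script-based tag, and the
# country-based tag that some browsers actually react to.
CHINESE_TAGS = {
    "simplified": {"registered": "zh-Hans", "pragmatic": "zh-CN"},
    "traditional": {"registered": "zh-Hant", "pragmatic": "zh-TW"},
}


def html_open_tag(script: str, style: str = "pragmatic") -> str:
    """Return an <html> start tag for the given script ('simplified' or 'traditional')."""
    tag = CHINESE_TAGS[script][style]
    return '<html lang="%s">' % tag


print(html_open_tag("simplified"))
print(html_open_tag("traditional", style="registered"))
```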
> "zh-Hans" is new to me: neither apache nor mozilla seems to know
> it.
It is registered by IANA. But as you see, browser (and server) vendors
didn't notice this any more than most of us have.
> So I'm puzzled: is there something I can rely on?
No.
> Should I use "cn" all the way or not?
No, since "cn" is not a language code at all. Using <html lang="zh"> is
correct if your document is in Chinese. It won't help much (if anything)
at present, but neither should it cause problems.
If you use <html lang="zh-CN">, then some browsers will select
simplified Chinese glyphs, which is probably what you want then,
although this is hardly the theoretically correct way to go.
Similarly, to get traditional Chinese glyphs, you could use
<html lang="zh-TW">, and this would work on some browsers.
All the rest is probably just futile at present.
--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html