On Sat, Jul 26, Stan Brown inscribed on the eternal scroll:
> Under the Hall of Shame you mention "(characters 128 through 159 are
> undefined in Unicode)." While that's true,
It's good enough for Government work, I suppose; but to be
pedantically accurate, those "characters" are perfectly well _defined_
in Unicode - but as _control functions_, not as displayable characters.
The key feature here is that for HTML there is an SGML declaration
which says that those characters are excluded. So it's the rules of
SGML in conjunction with the definition of HTML which say that those
code points in the Document Character Set are undefined (and XML and
XHTML go further by making them illegal).
> you might want to make a stronger statement like "under Unicode or
> any other non-proprietary standard".
This is all rather confusing. It's specifically the &#number; values
in the range 128 to 159 inclusive which are ruled out by the
specifications.
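For anyone who wants to check their own pages against that rule, here
is a minimal Python sketch. The function name find_excluded_refs and
the regular expression are my own illustration, not anything taken from
the specifications, and the sketch assumes every reference is
terminated by a semicolon, which real-world markup doesn't guarantee.

```python
import re

# Matches decimal (&#147;) and hexadecimal (&#x93;) numeric character
# references. Assumption: the reference is closed with a semicolon.
NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def find_excluded_refs(html):
    """Return (reference_text, value) pairs whose value is 128 to 159 inclusive."""
    found = []
    for match in NUMERIC_REF.finditer(html):
        hex_digits, dec_digits = match.groups()
        value = int(hex_digits, 16) if hex_digits else int(dec_digits)
        if 128 <= value <= 159:
            found.append((match.group(0), value))
    return found
```

Run on the string 'He said &#147;hi&#148; and &#8220;bye&#8221;', it
reports only &#147; and &#148;; the references 8220 and 8221 are left
alone.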
If you have a document whose external coding is Windows-1252, i.e. one
which contains 8-bit characters whose coded values are in this range,
then it's legal enough to send it out advertised as "text/html;
charset=windows-1252". There's no mandate on client agents to accept
this particular proprietary coding, of course, but the document format
is entirely legal and proper nevertheless according to the relevant
interworking rules (he says grudgingly) - at least it has been since
MS finally got around to registering windows-1252 in the IANA register
of charset values.
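To make the distinction concrete, here is a small Python sketch of the
legal route: the typographic quotes go out as actual 8-bit bytes, and
the charset is advertised alongside. The sample text and the variable
names are only for illustration; how the header actually gets sent
depends on your server, which I'm not assuming anything about.

```python
# Curly quotes written as Unicode escapes so the source stays ASCII.
text = "\u201cquoted\u201d"

# Encode the document body in the Windows-1252 external coding.
# The two quote marks become the single bytes 0x93 and 0x94.
body = text.encode("windows-1252")

# The header that has to accompany those bytes.
content_type = "text/html; charset=windows-1252"

# A client that honours the advertised charset recovers the same text.
round_trip = body.decode("windows-1252")
```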
The strange thing is that the vast majority of folks who are trying to
send out these "8-bit characters" attempt it by representing them as
the said undefined &#number; references in the range 128 to 159,
instead of sending them as actual 8-bit characters, where it would be
technically legal. Of course the correct &#number; references for
those Windows displayable characters are quite elsewhere: you can read
their hex values off from the official Unicode mapping tables for
Windows-1252 at
http://www.unicode.org/Public/MAPPIN...OWS/CP1252.TXT
and as you see, every single one of the characters assigned in this
range corresponds to a Unicode character above 0x00FF.
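If you'd rather not read the table by eye, the lookup can be sketched
in Python, whose built-in "cp1252" codec follows the same mapping as
far as I know. The function name correct_reference is my own, and
returning None for the handful of byte values the codec leaves
unassigned is just one way of handling them.

```python
def correct_reference(byte_value):
    """Return the decimal &#number; reference for one Windows-1252 byte.

    Returns None when Python's cp1252 codec has no character assigned
    to that byte value.
    """
    try:
        char = bytes([byte_value]).decode("cp1252")
    except UnicodeDecodeError:
        return None
    return "&#%d;" % ord(char)


# Build the whole table for the troublesome range 128 to 159 inclusive.
reference_table = {b: correct_reference(b) for b in range(128, 160)}
```

So the Windows "smart quote" at 147 should be written as &#8220;, and
the euro sign at 128 as &#8364;.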
And the same goes for all the other Windows-125x codings - indeed it's
even more true for those, since those were registered as valid
Internet codings under IETF procedures _years_ before MS finally got
around to registering 1252 itself.
OK, so what's the bottom line? I think a necessary and sufficient
statement which avoids even pedantic inaccuracies, instead of

    characters 128 through 159 are undefined in Unicode

would be[1]:

    '&#number;' references 128 through 159 are undefined in HTML

or maybe

    Character references '&#number;' 128 through 159 are undefined in HTML

(to which one could add "and illegal in XHTML").
cheers
[1] I left the original US idiom in place - Britspeak doesn't use
"through" in this sense, although we understand it. We'd say
"128 to 159 inclusive" if we wanted to make it clear, or just "128 to
159".
Incidentally 127 is also excluded, but I haven't seen anyone trying to
use it, so I guess that's OK.