
July 20th, 2005, 08:21 PM
| | | Named vs. numerical entities
I recently read the claim somewhere that numerical entities (such as
&#8212;) have a speed advantage over the equivalent named entities
(such as &mdash;) because the numerical entity requires just a single
byte to be downloaded to the browser, while the named entity requires
one byte for each letter. (So in this case, it would presumably be one
byte vs. seven bytes.) I found this claim a little surprising -- I
would have thought *each* numeral in the numerical entity would require
one byte. Does the Web server really send the entire numerical entity
as a single... character or whatever... I don't even know how to phrase
this question correctly!
Also, which form of the entity enjoys wider browser support? They both
seem to work with modern browsers... but what about older or very buggy
browsers? | 
July 20th, 2005, 08:21 PM
| | | Re: Named vs. numerical entities
Jonas Smithson wrote:
> I recently read the claim somewhere that numerical entities (such
> as &#8212;) have a speed advantage over the equivalent named
> entities (such as &mdash;) because the numerical entity requires
> just a single byte to be downloaded to the browser, while the named
> entity requires one byte for each letter.
My, that was a load of poppycock you were told.

> I found this claim a little surprising
That's being too kind.

> I would have thought *each* numeral in the numerical entity would
> require one byte.
That depends on the encoding. You'd best consult the guides if you
want to know more. I wish I understood it all better. I don't, despite
reading **numerous** posts from folks here who are quite well-versed.
If you're interested, Google the group for "Alan Flavell encoding" or
"Andreas Prilop charset". That'll turn up lots of posts. I'd suggest
you read what they say carefully; read those who argue with them, at
least on character encoding issues, with a grain of salt.

> Also, which form of the entity enjoys wider browser support? They
> both seem to work with modern browsers... but what about older or
> very buggy browsers?
Again, A. Flavell is your man. Brace yourself for some heavy reading: http://ppewww.ph.gla.ac.uk/~flavell/charset/checklist
--
Brian (remove ".invalid" to email me) http://www.tsmchughs.com/ | 
July 20th, 2005, 08:21 PM
| | | Re: Named vs. numerical entities
"Jonas Smithson" <smithsonNOSPAM@REMOVETHISboardermail.com> wrote in
comp.infosystems.www.authoring.html:
>I recently read the claim somewhere that numerical entities (such as
>&#8212;) have a speed advantage over the equivalent named entities
>(such as &mdash;) because the numerical entity requires just a single
>byte to be downloaded to the browser, while the named entity requires
>one byte for each letter. (So in this case, it would presumably be one
>byte vs. seven bytes.) I found this claim a little surprising -- I
>would have thought *each* numeral in the numerical entity would require
>one byte.
It does.
Where the difference arises is if you actually create your document
in Unicode instead of an 8-bit character set. If the document is
actually composed in Unicode, and transmitted in Unicode, then there
is an advantage of the actual 8212 character because it needs only
two bytes whereas &#8212; is 7 characters. (I can't remember whether
that's 7*2=14 bytes or some compression goes on, but it's certainly
more than 2 bytes.)
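
The byte counts under discussion can be checked directly. What follows is a minimal Python sketch, not something from the original post; the dictionary name and labels are illustrative. Under these assumptions it suggests that a literal em dash (U+2014) costs 3 bytes in UTF-8 and 2 bytes in UTF-16, while either character reference costs 7 bytes in an ASCII-compatible encoding and 14 bytes in UTF-16.

```python
# Byte cost of an em dash (U+2014) written literally vs. as a reference.
EM_DASH = "\u2014"

sizes = {
    "literal, UTF-8": len(EM_DASH.encode("utf-8")),
    "literal, UTF-16": len(EM_DASH.encode("utf-16-be")),  # no BOM counted
    "&#8212; in ASCII": len("&#8212;".encode("ascii")),
    "&mdash; in ASCII": len("&mdash;".encode("ascii")),
    "&#8212; in UTF-16": len("&#8212;".encode("utf-16-be")),
}

for label, n in sizes.items():
    print(f"{label}: {n} bytes")
```

So the numeric and named forms cost the same here; the only saving comes from sending the character itself in an encoding that can represent it.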
--
Stan Brown, Oak Road Systems, Tompkins County, New York, USA http://OakRoadSystems.com/
HTML 4.01 spec: http://www.w3.org/TR/html401/
validator: http://validator.w3.org/
CSS 2 spec: http://www.w3.org/TR/REC-CSS2/
2.1 changes: http://www.w3.org/TR/CSS21/changes.html
validator: http://jigsaw.w3.org/css-validator/ | 
July 20th, 2005, 08:21 PM
| | | Re: Named vs. numerical entities
Jonas Smithson wrote:

> I recently read the claim somewhere that numerical entities (such
> as &#8212;) have a speed advantage over the equivalent named
> entities (such as &mdash;) because the numerical entity requires
> just a single byte to be downloaded to the browser, while the named
> entity requires one byte for each letter. (So in this case, it
> would presumably be one byte vs. seven bytes.)
BTW, did the person whose work you read actually claim that there
would be a noticeable difference in 2 documents, where document (a)
had 6 (or 12, or, heck, even 60) bytes more than document (b)?
--
Brian (remove ".invalid" to email me) http://www.tsmchughs.com/ | 
July 20th, 2005, 08:21 PM
| | | Re: Named vs. numerical entities
Stan Brown wrote:

> Where the difference arises is if you actually create your document
> in Unicode
I'm not sure what you mean by this. Unicode is a character set, not an
encoding. AIUI, all HTML documents are presumed to be written in
Unicode, although that's an awkward thing to say.

> instead of an 8-bit character set. If the document is actually
> composed in Unicode, and transmitted in Unicode,
There's no such thing as "transmitted in Unicode". You mean
encoded in UTF-8? But UTF-8 is an 8-bit character set (hence the name).

> then there is an advantage of the actual 8212 character because it
> needs only two bytes whereas &#8212; is 7 characters.
The only sense I can make of this is that if you use an encoding that
permits a direct representation of a character instead of requiring an
entity you'll save a few bytes. So, in UTF-8, the letter A requires 1
byte where &#65; would require 5. Is that what you meant?
--
Brian (remove ".invalid" to email me) http://www.tsmchughs.com/ | 
July 20th, 2005, 08:21 PM
| | | Re: Named vs. numerical entities
Brian wrote:

> BTW, did the person whose work you read actually claim that there
> would be a noticeable difference in 2 documents, where document (a)
> had 6 (or 12, or, heck, even 60) bytes more than document (b)?
No, he didn't put the remark in context, as I recall... although I
don't even remember whether I read it online or in some computer book,
and the whole subject of encodings is totally confusing to me so I
probably misunderstood whatever context there may have been.
However, some of my pages have numerous character entities on them...
let's say up to fifty on a page, perhaps; if they each entailed an
extra six bytes (for example) over some alternate method, then that
might add up to an extra 300 bytes. What does that equal in download
time? How many bytes of difference do *you* think would make a
"noticeable difference" between two documents... say, to a user on a
56K modem? | 
July 20th, 2005, 08:21 PM
| | | Re: Named vs. numerical entities
On Fri, 16 Jul 2004, Brian wrote:

> There's no such thing as "transmitted in Unicode".
Agreed.

> You mean encoded in UTF-8? But UTF-8 is an 8-bit character set
No, utf-8 isn't a "character set" at all (that MIME "charset"
parameter denotes what we nowadays call a "character encoding
scheme").

> (hence the name).
The utf-8 scheme is built with 8-bit units, indeed, but characters are
represented by variable numbers of those units. (As you obviously
know).
cheers | 
July 20th, 2005, 08:21 PM
| | | Re: Named vs. numerical entities
On Thu, 15 Jul 2004, Brian wrote:

> > I found this claim a little surprising
>
> That's being too kind.
;-)
If the hon Usenaut is worried about the size of their HTML documents,
it may be worth noting that most current browsers are happy to accept
gzip-compressed HTML. At least for documents which are in a Latin
base-language, this can make far more difference to total size than
worrying about the difference between a few &-notations and utf-8
encoding.
But it's probably not worth doing this until the individual HTML items
are significantly larger than the amount of HTTP red-tape involved in
retrieving the item. More than a few kBytes each, let's say.
For extra brownie points, the server can be set to honour the
browser's Accept-encoding header, sending gzip-compressed format to
those who say they accept it, and straight HTML to any who don't.
There are third-party Apache modules which take care of this "on the
fly", but it can be done more simply (i.e with MultiViews) if one is
willing to store both versions on the server. Disk space is cheap
nowadays, after all.
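
To get a feel for the relative sizes, here is a small Python sketch using the standard `gzip` module. The sample page is hypothetical and deliberately repetitive, so the compression ratio it prints is far better than a real page would see; only the direction of the comparison is meant to carry over.

```python
import gzip

# Hypothetical sample page: 50 short paragraphs, each with one reference.
paragraph = "<p>An example sentence &#8212; with a dash in the middle.</p>\n"
page = "<html><body>\n" + paragraph * 50 + "</body></html>\n"

variants = {
    "references": page.encode("ascii"),
    "utf-8 literals": page.replace("&#8212;", "\u2014").encode("utf-8"),
}

# Map each label to (raw size, gzip-compressed size) in bytes.
results = {}
for label, raw in variants.items():
    results[label] = (len(raw), len(gzip.compress(raw)))
    print(f"{label}: {results[label][0]} raw, {results[label][1]} gzipped")
```

Swapping the 50 references for literal characters saves 4 bytes each in the raw file, while compressing either variant removes a much larger share of the total.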
good luck | 
July 20th, 2005, 08:21 PM
| | | Re: Named vs. numerical entities
On Fri, 16 Jul 2004, Jonas Smithson wrote:

> I recently read the claim somewhere that numerical entities (such as
> &#8212;) have a speed advantage over the equivalent named entities
Others have rightly explained what nonsense that is...

> Also, which form of the entity enjoys wider browser support?
You've been given the URL of my checklist for the wider picture, but
to summarise the relevant points:
- utf-8 encoding is widely supported and a compact representation; its
problem is more the possibility of mishandling in the hands of authors
who are not yet familiar with it.
- The Latin-1 named entities (those proposed in the appendix to
RFC1866/HTML2.0) are very well supported
- Generally speaking the entities introduced in HTML4 are now
supported, but there are still browsers around (e.g NN4.*) that don't
understand them. For almost all of these characters, I'd still say
that the &#number; representation is somewhat more widely supported.
It's best, of course, if your HTML authoring software takes care of
the details for you, according to some options which you can set.
&euro; is widely recognised, and at least still comprehensible in
browsers which don't implement it (since browsers usually display
character entities literally if they don't understand them).
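
The equivalence of the three notations can be illustrated with Python's standard `html.unescape`, which follows HTML5 parsing rules. This is only a sketch of how a conforming parser treats the references; it says nothing about the older browsers discussed here, and `&zzunknown;` is a made-up name used to show the fall-through case.

```python
import html

# Named, decimal and hexadecimal references for the same character.
forms = ["&mdash;", "&#8212;", "&#x2014;"]
decoded = [html.unescape(form) for form in forms]
print([f"U+{ord(ch):04X}" for ch in decoded])

# A name the parser does not know is passed through as literal text.
unknown = html.unescape("&zzunknown;")
print(unknown)
```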

> They both seem to work with modern browsers... but what about older
> or very buggy browsers?
The checklist does its best to take that into account and choose best
compromises depending on the character repertoire which you need.
WebTV seemed to be hopeless with anything outside of a subset of
Windows-1252 repertoire. If you have anything more challenging as
your content, then you'd basically have to write it off. I hear that
they're working on it. | 
July 20th, 2005, 08:21 PM
| | | Re: Named vs. numerical entities
"Alan J. Flavell" <flavell@ph.gla.ac.uk> wrote:

>On Fri, 16 Jul 2004, Brian wrote:
>
>> There's no such thing as "transmitted in Unicode".
>
>Agreed.
Why can't a document be encoded (and transmitted) in Unicode? If
Windows Notepad lets you save a text file as Unicode (big- or
little-endian), isn't that the same thing?
--
Harlan Messinger
Remove the first dot from my e-mail address.
Veuillez ôter le premier point de mon adresse de courriel. | 
July 20th, 2005, 08:21 PM
| | | Re: Named vs. numerical entities
On Fri, 16 Jul 2004, Harlan Messinger wrote:

> Why can't a document be encoded (and transmitted) in Unicode?
It cannot be "in Unicode" but UTF-8, UTF-16, or UTF-32;
and in addition in different byte order for UTF-16 and UTF-32.
<http://www.unicode.org/unicode/faq/utf_bom.html>

> If
> Windows Notepad lets you save a text file as Unicode (big- or
> little-endian), isn't that the same thing?
"Big- or little-endian" rules out UTF-8, so probably it's UTF-16.
UTF-32 isn't used in MS Windows AFAIK.
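
The byte-order point can be seen in a few lines of Python. This is an illustrative sketch with made-up variable names: the same two characters produce different byte sequences in the two UTF-16 orders, and a byte order mark (BOM) at the start tells a decoder which order was used.

```python
import codecs

text = "A\u2014"  # letter A followed by an em dash

big = text.encode("utf-16-be")     # most significant byte first
little = text.encode("utf-16-le")  # least significant byte first
print(big.hex(" "))
print(little.hex(" "))

# With a BOM in front, the generic "utf-16" decoder picks the right order.
from_little = (codecs.BOM_UTF16_LE + little).decode("utf-16")
from_big = (codecs.BOM_UTF16_BE + big).decode("utf-16")
print(from_little == from_big == text)
```

UTF-8 is a sequence of single bytes, so the byte-order question does not arise for it.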
--
Top-posting.
What's the most irritating thing on Usenet? | 
July 20th, 2005, 08:21 PM
| | | Re: Named vs. numerical entities
On Fri, 16 Jul 2004, Jonas Smithson wrote:

> I recently read the claim somewhere that numerical entities (such as
> &#8212;) have a speed advantage over the equivalent named entities
> (such as &mdash;) because the numerical entity requires just a single
> byte to be downloaded to the browser, while the named entity requires
> one byte for each letter.
Others told you already that isn't true. But even if it were true,
a single image is usually bigger than your source text. So length
doesn't really matter. [ Oops, what did I write :-) ]
But as <http://ppewww.ph.gla.ac.uk/~flavell/charset/checklist.html#s6>
explains, decimal references are somewhat better supported among
(older) browsers than hexadecimal references or entities.
--
Top-posting.
What's the most irritating thing on Usenet? | 
July 20th, 2005, 08:21 PM
| | | Re: Named vs. numerical entities
On Fri, 16 Jul 2004 05:34:11 GMT, Jonas Smithson
<smithsonNOSPAM@REMOVETHISboardermail.com> wrote:

> Brian wrote:
>
>> BTW, did the person whose work you read actually claim that there
>> would be a noticeable difference in 2 documents, where document (a)
>> had 6 (or 12, or, heck, even 60) bytes more than document (b)?
>
> No, he didn't put the remark in context, as I recall... although I
> don't even remember whether I read it online or in some computer book,
> and the whole subject of encodings is totally confusing to me so I
> probably misunderstood whatever context there may have been.
>
> However, some of my pages have numerous character entities on them...
> let's say up to fifty on a page, perhaps; if they each entailed an
> extra six bytes (for example) over some alternate method, then that
> might add up to an extra 300 bytes. What does that equal in download
> time? How many bytes of difference do *you* think would make a
> "noticeable difference" between two documents... say, to a user on a
> 56K modem?
Negligible. Probably most pages have that much deletable/editable crap in
them plus some... | 
July 20th, 2005, 08:21 PM
| | | Re: Named vs. numerical entities
On Fri, 16 Jul 2004, Harlan Messinger wrote:

> Why can't a document be encoded (and transmitted) in Unicode?
Because "Unicode" is not the name of an encoding scheme.

> If Windows Notepad lets you save a text file as Unicode (big- or
> little-endian), isn't that the same thing?
You're talking about just two of the possible encoding schemes for
Unicode. MS using baby-talk is maybe "good enough for government
work", but this here is a technical forum. What MS's terms are
denoting are utf-16LE and utf-16BE encoding schemes.
And in any case, probably the best choice (if no other constraints
apply) of Unicode encoding scheme for HTML used in a WWW context is
utf-8, not utf-16LE/BE. | 
July 20th, 2005, 08:21 PM
| | | Re: Named vs. numerical entities
"Alan J. Flavell" <flavell@ph.gla.ac.uk> wrote in message
news:Pine.LNX.4.53.0407161334000.7123@ppepc56.ph.gla.ac.uk
> And in any case, probably the best choice (if no other constraints
> apply) of Unicode encoding scheme for HTML used in a WWW context is
> utf-8, not utf-16LE/BE.
Do you mean, when using a vast majority of Latin characters?
If not, wouldn't the file get very large? Wouldn't it be better to use
UTF-16? | 
July 20th, 2005, 08:21 PM
| | | Re: Named vs. numerical entities
On Fri, 16 Jul 2004, Pierre Goiffon wrote:

>> And in any case, probably the best choice (if no other constraints
>> apply) of Unicode encoding scheme for HTML used in a WWW context is
>> utf-8, not utf-16LE/BE.
>
> Do you mean, when using a vast majority of Latin characters?
> If not, wouldn't the file get very large?
Not bigger than a simple image.

> Wouldn't it be better to use UTF-16?
Only if you prefer not to be indexed by Google correctly.
<http://www.google.com/search?q=%22UTF-1+6%22>
--
Top-posting.
What's the most irritating thing on Usenet? | 
July 20th, 2005, 08:21 PM
| | | Re: Named vs. numerical entities
In article <Pine.LNX.4.53.0407160944450.7114@ppepc56.ph.gla.ac.uk>,
"Alan J. Flavell" <flavell@ph.gla.ac.uk> writes:

> There are third-party Apache modules which take care of this "on the
> fly",
mod_deflate is standard. No need for third-party modules.
--
Nick Kew | 
July 20th, 2005, 08:21 PM
| | | Re: Named vs. numerical entities
On Fri, 16 Jul 2004, Pierre Goiffon wrote:

> "Alan J. Flavell" <flavell@ph.gla.ac.uk> wrote in message
> news:Pine.LNX.4.53.0407161334000.7123@ppepc56.ph.gla.ac.uk
> > And in any case, probably the best choice (if no other constraints
> > apply) of Unicode encoding scheme for HTML used in a WWW context is
> > utf-8, not utf-16LE/BE.
>
> Do you mean, when using a vast majority of Latin characters?
Not necessarily: Greek, Cyrillic, Arabic, Hebrew are all represented
by 2 octets in utf-8. Armenian, Syriac and Coptic too, hmmm. The
cutoff (IINM) is U+07FF.
CJK scripts are a different matter, but AFAICS they are still usually
represented in one of their traditional encodings, rather than in a
Unicode-based scheme.
Indic scripts will also need 3 octets per character in utf-8 (and in
this case AIUI the use of unicode-based encodings is very beneficial,
since there /was/ no widely accepted pre-unicode scheme: I'm told that
in order to read Indian newspapers on the web, pretty much each
newspaper needed a different "font" i.e in effect was using its own
private character encoding. But I'm no expert in that field, so the
information is only second-hand).

> If not, wouldn't the file get very large? Wouldn't it be
> better to use UTF-16?
I haven't widely tested browser compatibility for utf-16 encodings, so
I can't comment on that aspect. But keep in mind that the markup,
styles, etc. etc. are expressed by ASCII characters, and by using
utf-16 you're going to double the size of *those* as compared with
utf-8.
But yes, if your material is such that most of the data characters
need 3 octets in utf-8, and you've decided to use a unicode scheme,
then utf-16 could well be more-compact, you're right. | 
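
The trade-off described above can be sketched in Python. The sample characters and the two miniature "pages" are illustrative assumptions, not measurements of real sites: one page is mostly markup, the other mostly Devanagari text.

```python
def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 size, UTF-16 size without BOM) in bytes."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

# One sample character per script, given by code point.
samples = {
    "Latin a": "\u0061",
    "Greek alpha": "\u03b1",
    "Cyrillic zhe": "\u0436",
    "Hebrew alef": "\u05d0",
    "last 2-octet code point": "\u07ff",
    "first 3-octet code point": "\u0800",
    "Devanagari ka": "\u0915",
}
for name, ch in samples.items():
    print(name, encoded_sizes(ch))

# ASCII markup doubles in UTF-16; 3-octet text shrinks by a third.
markup_heavy = '<p class="headline">' + "\u0915" * 5 + "</p>"
text_heavy = "<p>" + "\u0915" * 1000 + "</p>"
print("markup-heavy:", encoded_sizes(markup_heavy))
print("text-heavy:", encoded_sizes(text_heavy))
```

Which encoding wins therefore depends on the ratio of markup to text, which is the point being made about ASCII markup doubling in size.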
July 20th, 2005, 08:21 PM
| | | Re: Named vs. numerical entities
On Fri, 16 Jul 2004, Nick Kew wrote:

> "Alan J. Flavell" <flavell@ph.gla.ac.uk> writes:
>
> > There are third-party Apache modules which take care of this "on the
> > fly",
>
> mod_deflate is standard. No need for third-party modules.
Thanks for the information! | 
July 20th, 2005, 08:21 PM
| | | Re: Named vs. numerical entities
On Fri, 16 Jul 2004, Alan J. Flavell wrote:

> Indic scripts will also need 3 octets per character in utf-8 (and in
> this case AIUI the use of unicode-based encodings is very beneficial,
> since there /was/ no widely accepted pre-unicode scheme: I'm told that
> in order to read Indian newspapers on the web, pretty much each
> newspaper needed a different "font" i.e in effect was using its own
> private character encoding.
But there's also <http://www.bbc.co.uk/hindi/>
and <http://www.bbc.co.uk/tamil/> .
--
Top-posting.
What's the most irritating thing on Usenet? | 
July 20th, 2005, 08:21 PM
| | | Re: Named vs. numerical entities
Jonas Smithson wrote:

> However, some of my pages have numerous character entities on
> them... let's say up to fifty on a page, perhaps; if they each
> entailed an extra six bytes (for example) over some alternate
> method, then that might add up to an extra 300 bytes. What does
> that equal in download time? How many bytes of difference do *you*
> think would make a "noticeable difference" between two documents...
> say, to a user on a 56K modem?
Well, do the math. 300/56000 is not very significant. I suppose,
300/~33000 is more accurate a comparison, but even there, it's nothing
to worry about. Spending time tuning one image on a page will likely
have a greater impact than encoding will.
You should only worry about encoding if it causes rendering problems.
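
For the arithmetic itself, here is a rough Python sketch. It assumes the modem rate is in bits per second, so the byte count is multiplied by 8, and it ignores framing, modem compression and protocol overhead; the 33,600 bit/s figure is an assumed stand-in for the "~33000" mentioned above.

```python
def transfer_seconds(extra_bytes: int, bits_per_second: int) -> float:
    """Idealised time to send extra_bytes over a link of the given rate."""
    return extra_bytes * 8 / bits_per_second

for rate in (56_000, 33_600):
    millis = transfer_seconds(300, rate) * 1000
    print(f"{rate} bit/s: about {millis:.0f} ms for 300 extra bytes")
```

Even on the slower assumption the extra 300 bytes cost well under a tenth of a second.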
--
Brian (remove ".invalid" to email me) http://www.tsmchughs.com/ | 
July 20th, 2005, 08:21 PM
| | | Re: Named vs. numerical entities
Alan J. Flavell wrote:

> On Fri, 16 Jul 2004, Brian wrote:
>
>> UTF-8 is an 8-bit character set
>
> No, utf-8 isn't a "character set" at all (that MIME "charset"
> parameter denotes what we nowadays call a "character encoding
> scheme").
Cripes, I cannot keep the terminology straight. I wish they had called
that thing by its name, charenc or something. Yes, utf-8 is an encoding.
--
Brian (remove ".invalid" to email me) http://www.tsmchughs.com/ | 
July 20th, 2005, 08:21 PM
| | | Re: Named vs. numerical entities
"Andreas Prilop" <nhtcapri@rrzn-user.uni-hannover.de> wrote in message
news:Pine.GSO.4.44.0407161425010.9642-100000@s5b003...
> On Fri, 16 Jul 2004, Harlan Messinger wrote:
>
> > Why can't a document be encoded (and transmitted) in Unicode?
>
> It cannot be "in Unicode" but UTF-8, UTF-16, or UTF-32;
> and in addition in different byte order for UTF-16 and UTF-32.
> <http://www.unicode.org/unicode/faq/utf_bom.html>
>
> > If
> > Windows Notepad lets you save a text file as Unicode (big- or
> > little-endian), isn't that the same thing?
>
> "Big- or little-endian" rules out UTF-8, so probably it's UTF-16.
> UTF-32 isn't used in MS Windows AFAIK.
I'm really interested in what the distinction is. I admit I don't know what
UTF-16 is or why it's different from what I would call "Unicode encoding", but
why wouldn't a fixed 16-bit encoding scheme where "A" is encoded as 0041, an
em-dash is encoded as 2014, a katakana "pu" is encoded as 30D7, and so forth
not be "Unicode encoding"?
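
One part of the answer is that a fixed 16-bit scheme cannot reach every Unicode code point. The Python sketch below is illustrative only: it counts the 16-bit units UTF-16 needs for a few characters, using U+1D11E (a musical clef symbol) as an arbitrary example of a code point above U+FFFF.

```python
def utf16_units(ch: str) -> list[str]:
    """Return the 16-bit code units of ch in UTF-16, as hex strings."""
    data = ch.encode("utf-16-be")
    return [data[i:i + 2].hex() for i in range(0, len(data), 2)]

for ch in ("A", "\u2014", "\u30d7", "\U0001D11E"):
    print(f"U+{ord(ch):04X} -> {utf16_units(ch)}")
```

Characters up to U+FFFF map to a single unit equal to their code point, while anything beyond that takes a pair of units, so UTF-16 is itself a variable-length encoding.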
Is it that this encoding scheme already existed and had the name "UTF-16"
before the term "Unicode" was coined? So that the reason we don't call it
"Unicode encoding" is simply that it already has another name? | 
July 20th, 2005, 08:21 PM
| | | Re: Named vs. numerical entities
On Fri, 16 Jul 2004, Harlan Messinger wrote:

>> <http://www.unicode.org/unicode/faq/utf_bom.html>
>
> I'm really interested in what the distinction is. I admit I don't know what
> UTF-16 is or why it's different from what I would call "Unicode encoding", [...]
Err, did you read the page above, which I cited with reason?
--
Top-posting.
What's the most irritating thing on Usenet? | 
July 20th, 2005, 08:21 PM
| | | Re: Named vs. numerical entities
"Andreas Prilop" <nhtcapri@rrzn-user.uni-hannover.de> wrote in message
news:Pine.GSO.4.44.0407161705220.11169-100000@s5b003...
> On Fri, 16 Jul 2004, Harlan Messinger wrote:
>
> >> <http://www.unicode.org/unicode/faq/utf_bom.html>
> >
> > I'm really interested in what the distinction is. I admit I don't know what
> > UTF-16 is or why it's different from what I would call "Unicode encoding", [...]
>
> Err, did you read the page above, which I cited with reason?
Sorry, I missed it somehow. I intend to read it later, but from glancing at
it, I have the following thoughts:
1. There's nothing any more nonsensical about the concept of a Unicode
encoding for the Unicode character set than there is about ASCII encoding
for the ASCII character set, but for whatever reasons (I assume efficiency
has something to do with it) it's not *used*.
2. EBCDIC and ASCII define the same characters, IIRC; but as character sets
they just number them differently. A document could be encoded in EBCDIC
just as easily as in ASCII. It wouldn't make any sense to speak of an EBCDIC
encoding of an ASCII document or an ASCII encoding of an EBCDIC document:
each is a separate encoding of a document based on the representations of
the document's characters in the respective character sets.
So why are the UTF-* encodings "encodings of the Unicode character set"? Is
it because they are closely related to the Unicode character set by virtue
of the fact that there is a mapping from UCS to UTF-* produced by applying a
small set of simple functions? | 
July 20th, 2005, 08:21 PM
| | | Re: Named vs. numerical entities
On Fri, 16 Jul 2004, Harlan Messinger wrote:

> 1. There's nothing any more nonsensical about the concept of a Unicode
> encoding for the Unicode character set than there is about ASCII encoding
> for the ASCII character set,
Maybe I could understand this sentence with fewer negatives :-)

> 2. EBCDIC and ASCII define the same characters, IIRC;
ASCII is a coded character set of 128 characters defined in ANSI X3.4
and ISO 646.
EBCDIC is a generic term for several (many?) coded character sets of
256 characters defined by IBM. Just four of them are listed here:
<http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/EBCDIC/>

> So why are the UTF-* encodings "encodings of the Unicode character set"?
Think of Unicode as assigning characters to natural numbers - currently
from 0 to x10FFFF = 1114111. For example, number 945 = x3B1 means
the Greek small letter alpha.
The UTFs define how these numbers are represented by _byte_ sequences
(in a computer or on the Internet).
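
The two layers can be seen side by side in Python. This is a small illustrative sketch: `ord` gives the number Unicode assigns to the character, and each UTF then turns that one number into a different byte sequence.

```python
alpha = "\u03b1"  # Greek small letter alpha

code_point = ord(alpha)
print(code_point, hex(code_point))  # the abstract number

# The same number as bytes under three encoding schemes.
as_bytes = {
    name: alpha.encode(name).hex(" ")
    for name in ("utf-8", "utf-16-be", "utf-32-be")
}
for name, hex_bytes in as_bytes.items():
    print(f"{name}: {hex_bytes}")

# The upper limit of the code space quoted above.
print(0x10FFFF)
```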
--
Top-posting.
What's the most irritating thing on Usenet? | 
July 20th, 2005, 08:21 PM
| | | Re: Named vs. numerical entities
> ASCII is a coded character set of 128 characters defined in ANSI X3.4
> and ISO 646.
Not quite. You are thinking of US-ASCII. There are a variety of national
ASCII character sets. | 
July 20th, 2005, 08:21 PM
| | | Re: Named vs. numerical entities
On Fri, 16 Jul 2004, Harlan Messinger wrote:

> Sorry, I missed it somehow. I intend to read it later,
Call back here when you have done?

> 1. There's nothing any more nonsensical about the concept of a Unicode
> encoding for the Unicode character set than there is about ASCII encoding
> for the ASCII character set,
Actually there are substantial differences. And you see this also
with that MIME parameter which is (mis)named "charset" - but specifies
what we now would call a "character encoding scheme".
Back when 7 or 8 bits were sufficient to represent all of the
characters of a repertoire, it was quasi-obvious that the "coded
character set" was defined by assigning numbers (0-127 or 0-255 as the
case may be) to the characters of the repertoire, and then to lay out
the fonts according to that scheme, and to transmit the characters by
means of bytes having that value.
Consequently, back then it looked as if the things that we now call
"coded character set", "character encoding" and "font arrangement"
were just different names for the same thing. Of course, you needed a
different font for each "charset" (i.e character encoding), which got
to be a considerable drag.
Nowadays these concepts have to be disambiguated. Unicode characters
are designated by a code point which can, in principle, go up to 2**31
(it hasn't got that far yet). Those numbers then have to be
represented in a way which is convenient for transmission and/or
storage (different design criteria apply for different purposes).

> 2. EBCDIC and ASCII define the same characters, IIRC;
Actually not. But discussing that would be a pointless digression, so
let's move on.

> So why are the UTF-* encodings "encodings of the Unicode character set"?
It's not practical, for various reasons, to transmit characters as
32-bit units. For one thing, it's very wasteful. For another,
there's no unique byte-ordering, hence all this fuss about endian-ness
when units of 16 or 32 bits are involved.
There's also the question of representing unicode characters in a
mail-safe context (hence utf-7). That will fade with time, but even
8-bit-safe mail formats ban null bytes, which means that utf-16 or
utf-32/ucs-4 representations cannot be used without a further layer of
encoding.
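
The null-byte and mail-safety points can be checked with Python's built-in codecs, which include a `utf-7` codec. The sample string is an arbitrary assumption; the sketch only reports, for each scheme, whether the encoded bytes contain a null byte and whether every byte fits in 7 bits.

```python
text = "caf\u00e9 \u2014 ok"  # e-acute and an em dash among ASCII letters

# Map each scheme to (contains a null byte, every byte below 0x80).
report = {}
for name in ("utf-8", "utf-16-le", "utf-32-le", "utf-7"):
    data = text.encode(name)
    report[name] = (b"\x00" in data, all(byte < 0x80 for byte in data))
    print(f"{name}: null bytes={report[name][0]}, 7-bit clean={report[name][1]}")

# utf-7 survives a round trip despite using only 7-bit bytes.
round_trip = text.encode("utf-7").decode("utf-7")
```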

> Is it because they are closely related to the Unicode character set
Is it because you won't read the tutorial before asking further
questions?
ttfn | 
July 20th, 2005, 08:21 PM
| | | Re: Named vs. numerical entities
In our last episode,
<BjTJc.185939$rCA1.116992@news01.bloor.is.net.cable.rogers.com>,
the lovely and talented C A Upsdell
broadcast on comp.infosystems.www.authoring.html:

>> ASCII is a coded character set of 128 characters defined in ANSI X3.4
>> and ISO 646.

> Not quite. You are thinking of US-ASCII. There are a variety of national
> ASCII character sets.
No. There is only one ASCII. It is a 7-bit code with 128 characters.
Think: what does the A in ASCII stand for?
--
Lars Eighner -finger for geek code- eighner@io.com http://www.io.com/~eighner/
If it wasn't for muscle spasms, I wouldn't get any exercise at all. | 
July 20th, 2005, 08:21 PM
| | | Re: Named vs. numerical entities
On Fri, 16 Jul 2004, C A Upsdell wrote:

>> ASCII is a coded character set of 128 characters defined in ANSI X3.4
>> and ISO 646.
>
> Not quite. You are thinking of US-ASCII.
ASCII and US-ASCII are synonyms.
<http://www.iana.org/assignments/character-sets>

> There are a variety of national ASCII character sets.
No, they are called "7-bit codes" or "7-bit coded character sets"
as defined in ISO 646. <http://www.itscj.ipsj.or.jp/ISO-IR/> | 
July 20th, 2005, 08:21 PM
| | | Re: Named vs. numerical entities
"Alan J. Flavell" <flavell@ph.gla.ac.uk> wrote in message
news:Pine.LNX.4.53.0407161707500.7333@ppepc56.ph.gla.ac.uk...
> On Fri, 16 Jul 2004, Harlan Messinger wrote:
>
> > Sorry, I missed it somehow. I intend to read it later,
>
> Call back here when you have done?
>
> > 1. There's nothing any more nonsensical about the concept of a Unicode
> > encoding for the Unicode character set than there is about ASCII encoding
> > for the ASCII character set,
>
> Actually there are substantial differences. And you see this also
> with that MIME parameter which is (mis)named "charset" - but specifies
> what we now would call a "character encoding scheme".
>
> Back when 7 or 8 bits were sufficient to represent all of the
> characters of a repertoire, it was quasi-obvious that the "coded
> character set" was defined by assigning numbers (0-127 or 0-255 as the
> case may be) to the characters of the repertoire, and then to lay out
> the fonts according to that scheme, and to transmit the characters by
> means of bytes having that value.
>
> Consequently, back then it looked as if the things that we now call
> "coded character set", "character encoding" and "font arrangement"
> were just different names for the same thing. Of course, you needed a
> different font for each "charset" (i.e character encoding), which got
> to be a considerable drag.
>
> Nowadays these concepts have to be disambiguated. Unicode characters
> are designated by a code point which can, in principle, go up to 2**31
> (it hasn't got that far yet). Those numbers then have to be
> represented in a way which is convenient for transmission and/or
> storage (different design criteria apply for different purposes).
>
> > 2. EBCDIC and ASCII define the same characters, IIRC;
>
> Actually not. But discussing that would be a pointless digression, so
> let's move on.
>
> > So why are the UTF-* encodings "encodings of the Unicode character set"?
>
> It's not practical, for various reasons, to transmit characters as
> 32-bit units. For one thing, it's very wasteful. For another,
> there's no unique byte-ordering, hence all this fuss about endian-ness
> when units of 16 or 32 bits are involved.
>
> There's also the question of representing unicode characters in a
> mail-safe context (hence utf-7). That will fade with time, but even
> 8-bit-safe mail formats ban null bytes, which means that utf-16 or
> utf-32/ucs-4 representations cannot be used without a further layer of
> encoding.
>
> > Is it because they are closely related to the Unicode character set
>
> Is it because you won't read the tutorial before asking further
> questions?
No, it's because sometimes questions can be satisfied by relatively simple
answers without requiring one to read a whole tutorial (though sometimes
not). Sometimes a tutorial or textbook will tell you the way things are
without explaining why they aren't some other way (though sometimes not). | 
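The quoted points about byte ordering and null bytes can be checked directly. What follows is only a minimal sketch in Python (my choice of language; nobody in the thread used it), and the helper name byte_layouts is made up for illustration. It encodes one character under several Unicode encoding schemes and counts the null bytes in each result.

```python
# Hypothetical helper, for illustration only: show how one character is laid
# out in bytes under several encoding schemes of the same character set.
CODECS = ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le")

def byte_layouts(ch):
    """Return {codec name: encoded bytes} for a single character."""
    return {codec: ch.encode(codec) for codec in CODECS}

for codec, data in byte_layouts("A").items():
    hex_bytes = " ".join(f"{b:02x}" for b in data)
    # Null bytes are what make UTF-16 and UTF-32 awkward in mail transports.
    print(f"{codec:10} {hex_bytes:12} null bytes: {data.count(0)}")
```

For the letter A, the 16- and 32-bit forms differ only in byte order and all contain null bytes, while UTF-8 needs a single non-null byte.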
July 20th, 2005, 08:21 PM
| | | Re: Named vs. numerical entities
On Fri, 16 Jul 2004, C A Upsdell wrote:
[color=blue]
> Not quite. You are thinking of US-ASCII. There are a variety of
> national ASCII character sets.[/color]
That's sloppy terminology. There's a variety of 7-bit national
character sets which are patterned on ASCII (US-ASCII is a more
accurate name, since - contrary to widespread belief amongst some
parties - America doesn't consist solely of the United States).
But those national character sets were mostly codified under ISO-646.
I give you my old page http://ppewww.ph.gla.ac.uk/~flavell/....html#national
and particularly the links to http://kanji.zinbun.kyoto-u.ac.jp/~yasuoka/CJK.html and http://www.terena.nl/library/multili...section04.html
This was a relevant topic in the early days of the WWW, since the code
positions which were set aside for national variations in iso-646 were
for example the basis for some of the "unsafe character" exclusions in
URLs.
Btw, I see there's a lovely comment in that Terena web page:
It will be clear that so-called "de facto standards" are related to
those discussed above as Monopoly banknotes to real money, valuable
as long as the game goes on. | 
July 20th, 2005, 08:21 PM
| | | Re: Named vs. numerical entities
"Andreas Prilop" <nhtcapri@rrzn-user.uni-hannover.de> wrote in message
news:Pine.GSO.4.44.0407161833140.11334-100000@s5b003...[color=blue]
> On Fri, 16 Jul 2004, C A Upsdell wrote:
>[color=green][color=darkred]
> >> ASCII is a coded character set of 128 characters defined in ANSI X3.4
> >> and ISO 646.[/color]
> >
> > Not quite. You are thinking of US-ASCII.[/color]
>
> ASCII and US-ASCII are synonyms.
> <http://www.iana.org/assignments/character-sets>[/color]
NOT TRUE!!!! Read the IANA page: "These names are expressed in
ANSI_X3.4-1968 which is commonly called US-ASCII or simply ASCII. The
character set most commonly use in the Internet and used especially in
protocol standards is US-ASCII, this is strongly encouraged. The use of the
name US-ASCII is also encouraged." This says that US-ASCII is commonly
called ASCII. It does not say that US-ASCII is ASCII.
Also, when I started developing software in the early 1970's -- before the
Internet, before PCs, before microprocessors -- I routinely worked with
various 7- and 8-bit ASCII character sets (in addition to EBCDIC and Gray
codes). I find many Internet references denying the existence of 8-bit
ASCII, but I can attest that, in the early 1970s, multiple 7- and 8-bit sets
were alive and well. | 
July 20th, 2005, 08:21 PM
| | | Re: Named vs. numerical entities
"C A Upsdell" <cupsdell0311XXX@-@-@XXXrogers.com> wrote in message
news:LzUJc.1$CmC1.0@news04.bloor.is.net.cable.rogers.com...[color=blue]
> "Andreas Prilop" <nhtcapri@rrzn-user.uni-hannover.de> wrote in message
> news:Pine.GSO.4.44.0407161833140.11334-100000@s5b003...[color=green]
> > On Fri, 16 Jul 2004, C A Upsdell wrote:
> >[color=darkred]
> > >> ASCII is a coded character set of 128 characters defined in ANSI X3.4
> > >> and ISO 646.
> > >
> > > Not quite. You are thinking of US-ASCII.[/color]
> >
> > ASCII and US-ASCII are synonyms.
> > <http://www.iana.org/assignments/character-sets>[/color]
>
> NOT TRUE!!!! Read the IANA page: "These names are expressed in
> ANSI_X3.4-1968 which is commonly called US-ASCII or simply ASCII. The
> character set most commonly use in the Internet and used especially in
> protocol standards is US-ASCII, this is strongly encouraged. The use of[/color]
the[color=blue]
> name US-ASCII is also encouraged." This says that US-ASCII is commonly
> called ASCII. It does not say that US-ASCII is ASCII.[/color]
Uh, yeah, it does, unless the implication is along the lines of "... called
US-ASCII, or often simply ASCII, although this is technically incorrect
because ASCII properly refers to a different character set". But that isn't
the implication and the statement is saying that US-ASCII, ASCII, and
ANSI_X3.4-1968 are all names for the same thing--which is the same as saying
that each of them is also each of the others.
[color=blue]
>
> Also, when I started developing software in the early 1970's -- before the
> Internet, before PCs, before microprocessors -- I routinely worked with
> various 7- and 8-bit ASCII character sets (in addition to EBCDIC and Gray
> codes). I find many Internet references denying the existence of 8-bit
> ASCII, but I can attest that, in the early 1970s, multiple 7- and 8-bit[/color]
sets[color=blue]
> were alive and well.[/color]
Multiple 7- and 8-bit sets were alive and well. But they were not ASCII.
They may have been ASCII extensions, but they were not ASCII. | 
July 20th, 2005, 08:21 PM
| | | Re: Named vs. numerical entities
"Harlan Messinger" <h.messinger@comcast.net> wrote in message
news:2lqjqsFfb5rcU1@uni-berlin.de...[color=blue][color=green]
> > Also, when I started developing software in the early 1970's -- before[/color][/color]
the[color=blue][color=green]
> > Internet, before PCs, before microprocessors -- I routinely worked with
> > various 7- and 8-bit ASCII character sets (in addition to EBCDIC and[/color][/color]
Gray[color=blue][color=green]
> > codes). I find many Internet references denying the existence of 8-bit
> > ASCII, but I can attest that, in the early 1970s, multiple 7- and 8-bit[/color]
> sets[color=green]
> > were alive and well.[/color]
>
> Multiple 7- and 8-bit sets were alive and well. But they were not ASCII.
> They may have been ASCII extensions, but they were not ASCII.[/color]
Indeed they were ASCII. Standards written later appear to have
disassociated the term ASCII from the national variants and extended sets,
preferring to give them numbered ANSI designations, but in the early 1970s
they were ASCII. National variants which I personally worked with included
French, German, and Italian ASCII sets, and one of my co-workers worked with
the Portuguese set. US-ASCII is the preferred term now to avoid confusion
with the other ASCII sets. | 
July 20th, 2005, 08:21 PM
| | | Re: Named vs. numerical entities
On Fri, 16 Jul 2004, Harlan Messinger wrote:
[after a bout of over-enthusiastic quoting]
[color=blue][color=green]
> > Is it because you won't read the tutorial before asking further
> > questions?[/color]
>
> No, it's because sometimes questions can be satisfied by relatively
> simple answers without requiring one to read a whole tutorial[/color]
And it's because often, the relatively simple answers don't make any
sense until you've done the groundwork first so that you can
understand the answers (or even better - ask the right questions).
Your attention was directed to the tutorial for a constructive reason:
someone who knew the subject believed that it would be of genuine
benefit to you, it would position you better for the subsequent
discussion. As it happens, that is also my own opinion.
[color=blue]
> Sometimes a tutorial or textbook will tell you the way things are
> without explaining why they aren't some other way (though sometimes not).[/color]
You'll be able to tell us how it was when you've tried it, OK? That
is, if I haven't lost patience by then and put you back into the
killfile... | 
July 20th, 2005, 08:21 PM
| | | Re: Named vs. numerical entities
"C A Upsdell" <cupsdell0311XXX@-@-@XXXrogers.com> wrote in message
news:Z7VJc.1$DgD1.0@news04.bloor.is.net.cable.rogers.com...[color=blue]
> "Harlan Messinger" <h.messinger@comcast.net> wrote in message
> news:2lqjqsFfb5rcU1@uni-berlin.de...[color=green][color=darkred]
> > > Also, when I started developing software in the early 1970's -- before[/color][/color]
> the[color=green][color=darkred]
> > > Internet, before PCs, before microprocessors -- I routinely worked[/color][/color][/color]
with[color=blue][color=green][color=darkred]
> > > various 7- and 8-bit ASCII character sets (in addition to EBCDIC and[/color][/color]
> Gray[color=green][color=darkred]
> > > codes). I find many Internet references denying the existence of[/color][/color][/color]
8-bit[color=blue][color=green][color=darkred]
> > > ASCII, but I can attest that, in the early 1970s, multiple 7- and[/color][/color][/color]
8-bit[color=blue][color=green]
> > sets[color=darkred]
> > > were alive and well.[/color]
> >
> > Multiple 7- and 8-bit sets were alive and well. But they were not ASCII.
> > They may have been ASCII extensions, but they were not ASCII.[/color]
>
> Indeed they were ASCII. Standards written later appear to have
> disassociated the term ASCII from the national variants and extended sets,
> preferring to give them numbered ANSI designations, but in the early 1970s
> they were ASCII. National variants which I personally worked with[/color]
included[color=blue]
> French, German, and Italian ASCII sets, and one of my co-workers worked[/color]
with[color=blue]
> the Portuguese set.[/color]
You and they called them "ASCII" informally, or do you have a citation to
show that ASCII was officially regarded as the proper name for these sets?
[color=blue]
> US-ASCII is the preferred term now to avoid confusion
> with the other ASCII sets.[/color] | 
July 20th, 2005, 08:21 PM
| | | Re: Named vs. numerical entities
"Alan J. Flavell" <flavell@ph.gla.ac.uk> wrote in message
news:Pine.LNX.4.53.0407161933330.7123@ppepc56.ph.gla.ac.uk...[color=blue]
> On Fri, 16 Jul 2004, Harlan Messinger wrote:
>
> [after a bout of over-enthusiastic quoting]
>[color=green][color=darkred]
> > > Is it because you won't read the tutorial before asking further
> > > questions?[/color]
> >
> > No, it's because sometimes questions can be satisfied by relatively
> > simple answers without requiring one to read a whole tutorial[/color]
>
> And it's because often, the relatively simple answers don't make any
> sense until you've done the groundwork first so that you can
> understand the answers (or even better - ask the right questions).
>
> Your attention was directed to the tutorial for a constructive reason:
> someone who knew the subject believed that it would be of genuine
> benefit to you, it would position you better for the subsequent
> discussion. As it happens, that is also my own opinion.
>[color=green]
> > Sometimes a tutorial or textbook will tell you the way things are
> > without explaining why they aren't some other way (though sometimes[/color][/color]
not).[color=blue]
>
> You'll be able to tell us how it was when you've tried it, OK? That
> is, if I haven't lost patience by then and put you back into the
> killfile...[/color]
Oh, good grief, go ahead and get it over with. One would think I'd said
something simply terrible to you, instead of just asking questions and then
saying why I thought it was reasonable to do so. | 
July 20th, 2005, 08:21 PM
| | | Re: Named vs. numerical entities
> >[color=blue][color=green]
> > Indeed they were ASCII. Standards written later appear to have
> > disassociated the term ASCII from the national variants and extended[/color][/color]
sets,[color=blue][color=green]
> > preferring to give them numbered ANSI designations, but in the early[/color][/color]
1970s[color=blue][color=green]
> > they were ASCII. National variants which I personally worked with[/color]
> included[color=green]
> > French, German, and Italian ASCII sets, and one of my co-workers worked[/color]
> with[color=green]
> > the Portuguese set.[/color]
>
> You and they called them "ASCII" informally, or do you have a citation to
> show that ASCII was officially regarded as the proper name for these sets?[/color]
I do not have any of the manuals etc. that I used 30+ years ago. Otherwise
I could show you. | 
July 20th, 2005, 08:21 PM
| | | Re: Named vs. numerical entities
On Fri, 16 Jul 2004, C A Upsdell wrote:
[color=blue]
> Indeed they were ASCII.[/color]
No, they may have been *based* on ASCII, they may have been informally
referred to as "national ASCII", but they were not literally the
"American Standard Code for Information Interchange".
[color=blue]
> Standards written later appear to have disassociated the term ASCII
> from the national variants[/color]
Uh-uh, it's an international conspiracy to hide the origin of these
codes, is it? You don't seriously believe that the US American
national standards body would go making national character codes for
other countries, do you?
[color=blue]
> and extended sets,[/color]
At this point nobody's arguing about "extended sets". It's about
national variants based on the 7-bit code called ASCII.
[color=blue]
> preferring to give them numbered ANSI designations,[/color]
There you go again. ANSI (the later name of the US American national
standards body) had no jurisdiction over other national variants; only
over the (US-)American one. The British national variant based on
ASCII was a British Standard designation, BS4370; other national
variants would have had designations under their respective standards
bodies (DIN in Germany, and so on).
Later these 7-bit codes were codified into ISO-646 under the auspices
of the international standards body.
[color=blue]
> but in the early 1970s they were ASCII.[/color]
I've been interested in character coding issues since before then, and
I say you are mistaken, or confusing loose everyday terms and formal
specifications. Not that any of this is relevant to authoring HTML for
the WWW, so I shan't keep this sub-thread going. | 
July 20th, 2005, 08:21 PM
| | | Re: Named vs. numerical entities
My thanks to all the respondents. I've been sitting here reading this
thread with my jaw dropped open -- people not only discussing the
arcane nuances of encoding methods, but flaming each other over it!
This thread was so far over my head that (for my purposes) it might as
well have been written in ancient Greek (say, is that a possible
encoding method?). But I got the core information I needed: there's no
speed advantage of &#8212; over &mdash;. I wish I could remember where
I read that nonsense so that (if it was in a book, which I suspect it
was) I could warn people about the title.
I guess now my decision comes down to this: named entities are more
intuitive (I can remember them while I type without looking at a
chart), but Netscape 4 doesn't understand them, and makes the text look
like junk -- but it does understand numerical entities, which I can't
remember. So which do I care more about, my convenience in writing code
or the <0.5% of NS4 users? (That's a subjective question to myself, of
course; I don't expect an answer here.) Or maybe I'll type the named
entities and then do a bulk search & replace to numeric ones before
uploading the pages...
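The bulk search & replace idea could be scripted along these lines. This is only a sketch in Python (my assumption; any scripting language would do), and the function name named_to_numeric and the sample text are invented for illustration. It uses the entity table from Python's standard html.entities module and leaves unknown names and the basic markup escapes untouched.

```python
import re
from html.entities import name2codepoint

# The basic markup escapes are understood everywhere, so leave them as typed.
KEEP = {"amp", "lt", "gt", "quot"}

ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

def named_to_numeric(html):
    """Replace named entities such as &mdash; with decimal references."""
    def swap(match):
        name = match.group(1)
        if name in KEEP or name not in name2codepoint:
            return match.group(0)  # kept on purpose, or not a known entity
        return f"&#{name2codepoint[name]};"
    return ENTITY.sub(swap, html)

sample = "Wait &mdash; really? Caf&eacute; &amp; bar, &bogus; stays."
print(named_to_numeric(sample))

# Both spellings of the em dash cost the same seven bytes on the wire.
print(len("&#8212;".encode("ascii")), len("&mdash;".encode("ascii")))
```

Running it over a copy of the pages just before upload would keep the readable named forms in the working files.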
Alan Flavell wrote:[color=blue]
> It's best, of course, if your HTML authoring software takes
> care of the details for you, according to some options which
> you can set.[/color]
My "HTML authoring software" is a simple text editor; I don't care for
the so-called WYSIWYG editors so I have to make decisions like this for
myself.
[color=blue]
> utf-8 encoding is widely supported and a compact representation; its
> problem is more the possibility of mishandling in the hands of
> authors who are not yet familiar with it.[/color]
How would I, for example, type an emdash in utf-8 code? (I'm pretty
sure I just asked something totally clueless, like "which hand does a
cow use to play the accordion?" Oh, well... in for a dime, in for a
dollar, as they say...)
By the way, I occasionally see garbage characters even on the big news
sites -- where it looks like they meant to insert some kind of
punctuation mark but instead I see something that looks like a Chinese
character. I'm pretty sure they're not seeing that on their end, or
they would have fixed it; and I've searched through my preference
settings (in Internet Explorer 6) but couldn't find anything that seemed
relevant in terms of character encodings. Any guess as to why I'm
seeing scattered Chinese characters (it happens fairly rarely,
actually) and the site coders (presumably) aren't? | 
July 20th, 2005, 08:21 PM
| | | Re: Named vs. numerical entities
Jonas Smithson wrote:
[color=blue]
> My thanks to all the respondents. I've been sitting here reading this
> thread with my jaw dropped open -- people not only discussing the
> arcane nuances of encoding methods, but flaming each other over it![/color]
I prefer the term "heated discussion" :)
[color=blue]
> This thread was so far over my head that (for my purposes) it might as
> well have been written in ancient Greek (say, is that a possible
> encoding method?).[/color]
Use a Greek encoding or Unicode :). AIUI, and I never took Ancient Greek
very far at school, it uses letters all found in modern Greek.
[color=blue][color=green]
>> utf-8 encoding is widely supported and a compact representation; its
>> problem is more the possibility of mishandling in the hands of
>> authors who are not yet familiar with it.[/color]
>
> How would I, for example, type an emdash in utf-8 code? (I'm pretty
> sure I just asked something totally clueless, like "which hand does a
> cow use to play the accordion?" Oh, well... in for a dime, in for a
> dollar, as they say...)[/color]
Set your text editor to UTF-8 encoding, and input the character. You can
copy/paste it from anywhere (e.g. character map, a handy web page) or use
your keyboard -- I edited my keyboard layout to give me lots of useful
symbols. For instance, ndash – and mdash —
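What "an emdash in utf-8" amounts to on the wire, and why it sometimes shows up as garbage, can be seen with a small experiment. This is a rough sketch in Python (an assumption on my part, not something the posters used), with invented helper names. It encodes the em dash as UTF-8 and then decodes the same bytes as Windows-1252, as a browser might when the page's encoding is mislabelled.

```python
def utf8_bytes(ch):
    """Bytes actually stored or transmitted for ch in a UTF-8 document."""
    return ch.encode("utf-8")

def misread_as_cp1252(ch):
    """What a reader sees if those UTF-8 bytes are decoded as Windows-1252."""
    # errors="replace" covers the few byte values Windows-1252 leaves undefined.
    return ch.encode("utf-8").decode("cp1252", errors="replace")

em_dash = "\u2014"
data = utf8_bytes(em_dash)
print(" ".join(f"{b:02x}" for b in data))  # three bytes, not one
print(repr(misread_as_cp1252(em_dash)))    # three unrelated characters
```

One character turning into several unrelated ones is the usual sign that the declared encoding and the real one disagree; with a wrongly guessed East Asian encoding the same bytes can come out looking like Chinese characters.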