
Named vs. numerical entities

I recently read the claim somewhere that numerical entities (such as
&#8212;) have a speed advantage over the equivalent named entities
(such as &mdash;) because the numerical entity requires just a single
byte to be downloaded to the browser, while the named entity requires
one byte for each letter. (So in this case, it would presumably be one
byte vs. seven bytes.) I found this claim a little surprising -- I
would have thought *each* numeral in the numerical entity would require
one byte. Does the Web server really send the entire numerical entity
as a single... character or whatever... I don't even know how to phrase
this question correctly!

Also, which form of the entity enjoys wider browser support? They both
seem to work with modern browsers... but what about older or very buggy
browsers?
Jul 20 '05 #1
Jonas Smithson wrote:
I recently read the claim somewhere that numerical entities (such
as &#8212;) have a speed advantage over the equivalent named
entities (such as &mdash;) because the numerical entity requires
just a single byte to be downloaded to the browser, while the named
entity requires one byte for each letter.
My, that was a load of poppycock you were told.
I found this claim a little surprising
That's being too kind.
I would have thought *each* numeral in the numerical entity would
require one byte.
That depends on the encoding. You'd best consult the guides if you
want to know more. I wish I understood it all better. I don't, despite
reading **numerous** posts from folks here who are quite well-versed.
If you're interested, Google the group for "Alan Flavell encoding" or
"Andreas Prilop charset". That'll turn up lots of posts. I'd suggest
you read what they say carefully; read those who argue with them, at
least on character encoding issues, with a grain of salt.
Also, which form of the entity enjoys wider browser support? They
both seem to work with modern browsers... but what about older or
very buggy browsers?


Again, A. Flavell is your man. Brace yourself for some heavy reading:

http://ppewww.ph.gla.ac.uk/~flavell/charset/checklist

--
Brian (remove ".invalid" to email me)
http://www.tsmchughs.com/
Jul 20 '05 #2
"Jonas Smithson" <sm************@REMOVETHISboardermail.com> wrote in
comp.infosystems.www.authoring.html:
I recently read the claim somewhere that numerical entities (such as
&#8212;) have a speed advantage over the equivalent named entities
(such as &mdash;) because the numerical entity requires just a single
byte to be downloaded to the browser, while the named entity requires
one byte for each letter. (So in this case, it would presumably be one
byte vs. seven bytes.) I found this claim a little surprising -- I
would have thought *each* numeral in the numerical entity would require
one byte.


It does.

Where the difference arises is if you actually create your document
in Unicode instead of an 8-bit character set. If the document is
actually composed in Unicode, and transmitted in Unicode, then there
is an advantage of the actual 8212 character because it needs only
two bytes whereas &mdash; is 7 characters. (I can't remember whether
that's 7*2=14 bytes or some compression goes on, but it's certainly
more than 2 bytes.)
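
(If you want to check the byte counts for yourself, here's a quick
Python sketch -- my own illustration, nothing from any spec:)

    # byte counts for an em dash under different representations
    named   = "&mdash;"
    numeric = "&#8212;"
    emdash  = "\u2014"   # the character itself, code point 8212

    print(len(named.encode("ascii")))       # 7 bytes
    print(len(numeric.encode("ascii")))     # 7 bytes
    print(len(emdash.encode("utf-8")))      # 3 bytes
    print(len(emdash.encode("utf-16-be")))  # 2 bytes
    print(len(named.encode("utf-16-be")))   # 14 bytes: no compression goes on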

--
Stan Brown, Oak Road Systems, Tompkins County, New York, USA
http://OakRoadSystems.com/
HTML 4.01 spec: http://www.w3.org/TR/html401/
validator: http://validator.w3.org/
CSS 2 spec: http://www.w3.org/TR/REC-CSS2/
2.1 changes: http://www.w3.org/TR/CSS21/changes.html
validator: http://jigsaw.w3.org/css-validator/
Jul 20 '05 #3
Jonas Smithson wrote:
I recently read the claim somewhere that numerical entities (such
as &#8212;) have a speed advantage over the equivalent named
entities (such as &mdash;) because the numerical entity requires
just a single byte to be downloaded to the browser, while the named
entity requires one byte for each letter. (So in this case, it
would presumably be one byte vs. seven bytes.)


BTW, did the person whose work you read actually claim that there
would be a noticeable difference in 2 documents, where document (a)
had 6 (or 12, or, heck, even 60) bytes more than document (b)?

--
Brian (remove ".invalid" to email me)
http://www.tsmchughs.com/
Jul 20 '05 #4
Stan Brown wrote:
Where the difference arises is if you actually create your document
in Unicode
I'm not sure what you mean by this. Unicode is a character set, not an
encoding. AIUI, all HTML documents are presumed to be written in
Unicode, although that's an awkward thing to say.
instead of an 8-bit character set. If the document is actually
composed in Unicode, and transmitted in Unicode,
There's no such thing as "transmitted in Unicode". You mean
encoded in UTF-8? But UTF-8 is an 8-bit character set (hence the name).
then there is an advantage of the actual 8212 character because it
needs only two bytes whereas &mdash; is 7 characters.


The only sense I can make of this is that if you use an encoding that
permits a direct representation of a character instead of requiring an
entity, you'll save a few bytes. So, in UTF-8, the letter A requires 1
byte where &#65; would require 5. Is that what you meant?
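
(A one-line check in Python, assuming utf-8 -- just my illustration:)

    print(len("A".encode("utf-8")))      # 1 byte for the character itself
    print(len("&#65;".encode("utf-8")))  # 5 bytes for the numeric reference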

--
Brian (remove ".invalid" to email me)
http://www.tsmchughs.com/
Jul 20 '05 #5
Brian wrote:
BTW, did the person whose work you read actually claim that there
would be a noticeable difference in 2 documents, where document (a)
had 6 (or 12, or, heck, even 60) bytes more than document (b)?


No, he didn't put the remark in context, as I recall... although I
don't even remember whether I read it online or in some computer book,
and the whole subject of encodings is totally confusing to me so I
probably misunderstood whatever context there may have been.

However, some of my pages have numerous character entities on them...
let's say up to fifty on a page, perhaps; if they each entailed an
extra six bytes (for example) over some alternate method, then that
might add up to an extra 300 bytes. What does that equal in download
time? How many bytes of difference do *you* think would make a
"noticeable difference" between two documents... say, to a user on a
56K modem?
Jul 20 '05 #6
On Fri, 16 Jul 2004, Brian wrote:
There's no such thing as "transmitted in Unicode".
Agreed.
You mean encoded in UTF-8? But UTF-8 is an 8-bit character set
No, utf-8 isn't a "character set" at all (that MIME "charset"
parameter denotes what we nowadays call a "character encoding
scheme").
(hence the name).


The utf-8 scheme is built with 8-bit units, indeed, but characters are
represented by variable numbers of those units. (As you obviously
know).

cheers
Jul 20 '05 #7
On Thu, 15 Jul 2004, Brian wrote:
I found this claim a little surprising


That's being too kind.


;-)

If the hon Usenaut is worried about the size of their HTML documents,
it may be worth noting that most current browsers are happy to accept
gzip-compressed HTML. At least for documents which are in a Latin
base-language, this can make far more difference to total size than
worrying about the difference between a few &-notations and utf-8
encoding.
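
(A rough Python sketch with a made-up page, just to show the scale of
the gzip saving -- real ratios vary with the content:)

    import gzip

    page = ("<p>Some typical, repetitive HTML with a few &#8212; and "
            "&mdash; references in it.</p>\n" * 200).encode("utf-8")
    print(len(page), len(gzip.compress(page)))
    # the compressed copy comes out at a small fraction of the original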

But it's probably not worth doing this until the individual HTML items
are significantly larger than the amount of HTTP red-tape involved in
retrieving the item. More than a few kBytes each, let's say.

For extra brownie points, the server can be set to honour the
browser's Accept-encoding header, sending gzip-compressed format to
those who say they accept it, and straight HTML to any who don't.

There are third-party Apache modules which take care of this "on the
fly", but it can be done more simply (i.e with MultiViews) if one is
willing to store both versions on the server. Disk space is cheap
nowadays, after all.

good luck
Jul 20 '05 #8
On Fri, 16 Jul 2004, Jonas Smithson wrote:
I recently read the claim somewhere that numerical entities (such as
&#8212;) have a speed advantage over the equivalent named entities
Others have rightly explained what nonsense that is...
Also, which form of the entity enjoys wider browser support?
You've been given the URL of my checklist for the wider picture, but
to summarise the relevant points:

- utf-8 encoding is widely supported and a compact representation; its
problem is more the possibility of mishandling in the hands of authors
who are not yet familiar with it.

- The Latin-1 named entities (those proposed in the appendix to
RFC1866/HTML2.0) are very well supported

- Generally speaking the entities introduced in HTML4 are now
supported, but there are still browsers around (e.g NN4.*) that don't
understand them. For almost all of these characters, I'd still say
that the &#number; representation is somewhat more widely supported.

It's best, of course, if your HTML authoring software takes care of
the details for you, according to some options which you can set.

&euro; is widely recognised, and at least still comprehensible in
browsers which don't implement it (since browsers usually display
character entities literally if they don't understand them).
They both seem to work with modern browsers... but what about older
or very buggy browsers?


The checklist does its best to take that into account and choose best
compromises depending on the character repertoire which you need.

WebTV seemed to be hopeless with anything outside of a subset of
Windows-1252 repertoire. If you have anything more challenging as
your content, then you'd basically have to write it off. I hear that
they're working on it.
Jul 20 '05 #9
"Alan J. Flavell" <fl*****@ph.gla.ac.uk> wrote:
On Fri, 16 Jul 2004, Brian wrote:
There's no such thing as "transmitted in Unicode".


Agreed.


Why can't a document be encoded (and transmitted) in Unicode? If
Windows Notepad lets you save a text file as Unicode (big- or
little-endian), isn't that the same thing?
--
Harlan Messinger
Remove the first dot from my e-mail address.
Veuillez ôter le premier point de mon adresse de courriel.
Jul 20 '05 #10
On Fri, 16 Jul 2004, Harlan Messinger wrote:
Why can't a document be encoded (and transmitted) in Unicode?
It cannot be "in Unicode", but only in UTF-8, UTF-16, or UTF-32;
and in addition in different byte order for UTF-16 and UTF-32.
<http://www.unicode.org/unicode/faq/utf_bom.html>
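
(The difference is easy to see -- a minimal Python sketch:)

    s = "\u2014"                        # em dash, code point 8212
    print(s.encode("utf-8").hex())      # e28094 - one fixed byte sequence
    print(s.encode("utf-16-be").hex())  # 2014   - big-endian
    print(s.encode("utf-16-le").hex())  # 1420   - little-endian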
If
Windows Notepad lets you save a text file as Unicode (big- or
little-endian), isn't that the same thing?


"Big- or little-endian" rules out UTF-8, so probably it's UTF-16.
UTF-32 isn't used in MS Windows AFAIK.

--
Top-posting.
What's the most irritating thing on Usenet?

Jul 20 '05 #11
On Fri, 16 Jul 2004, Jonas Smithson wrote:
I recently read the claim somewhere that numerical entities (such as
&#8212;) have a speed advantage over the equivalent named entities
(such as &mdash;) because the numerical entity requires just a single
byte to be downloaded to the browser, while the named entity requires
one byte for each letter.


Others told you already that isn't true. But even if it were true,
a single image is usually bigger than your source text. So length
doesn't really matter. [ Oops, what did I write :-) ]

But as <http://ppewww.ph.gla.ac.uk/~flavell/charset/checklist.html#s6>
explains, decimal references are somewhat better supported among
(older) browsers than hexadecimal references or entities.

--
Top-posting.
What's the most irritating thing on Usenet?

Jul 20 '05 #12
On Fri, 16 Jul 2004 05:34:11 GMT, Jonas Smithson
<sm************@REMOVETHISboardermail.com> wrote:
However, some of my pages have numerous character entities on them...
let's say up to fifty on a page, perhaps; if they each entailed an
extra six bytes (for example) over some alternate method, then that
might add up to an extra 300 bytes. What does that equal in download
time? How many bytes of difference do *you* think would make a
"noticeable difference" between two documents... say, to a user on a
56K modem?


Negligible. Probably most pages have that much deletable/editable crap in
them plus some...
Jul 20 '05 #13
On Fri, 16 Jul 2004, Harlan Messinger wrote:
Why can't a document be encoded (and transmitted) in Unicode?
Because "Unicode" is not the name of an encoding scheme.
If Windows Notepad lets you save a text file as Unicode (big- or
little-endian), isn't that the same thing?


You're talking about just two of the possible encoding schemes for
Unicode. MS using baby-talk is maybe "good enough for government
work", but this here is a technical forum. What MS's terms are
denoting are utf-16LE and utf-16BE encoding schemes.

And in any case, probably the best choice (if no other constraints
apply) of Unicode encoding scheme for HTML used in a WWW context is
utf-8, not utf-16LE/BE.

Jul 20 '05 #14
"Alan J. Flavell" <fl*****@ph.gla.ac.uk> a écrit dans le message de
news:Pi******************************@ppepc56.ph.g la.ac.uk
And in any case, probably the best choice (if no other constraints
apply) of Unicode encoding scheme for HTML used in a WWW context is
utf-8, not utf-16LE/BE.


Do you mean, when using a vast majority of Latin characters?
If not, wouldn't the file get very large? Wouldn't it be better to use
UTF-16?

Jul 20 '05 #15
On Fri, 16 Jul 2004, Pierre Goiffon wrote:
And in any case, probably the best choice (if no other constraints
apply) of Unicode encoding scheme for HTML used in a WWW context is
utf-8, not utf-16LE/BE.
Do you mean, when using a vast majority of Latin characters?
If not, wouldn't the file get very large?


Not bigger than a simple image.
Wouldn't it be better to use UTF-16 ?


Only if you prefer not to be indexed by Google correctly.
<http://www.google.com/search?q=%22UTF-1+6%22>

--
Top-posting.
What's the most irritating thing on Usenet?
Jul 20 '05 #16
In article <Pi******************************@ppepc56.ph.gla.ac.uk>,
"Alan J. Flavell" <fl*****@ph.gla.ac.uk> writes:
There are third-party Apache modules which take care of this "on the
fly",


mod_deflate is standard. No need for third-party modules.

--
Nick Kew
Jul 20 '05 #17
On Fri, 16 Jul 2004, Pierre Goiffon wrote:
"Alan J. Flavell" <fl*****@ph.gla.ac.uk> a écrit dans le message de
news:Pi******************************@ppepc56.ph.g la.ac.uk
And in any case, probably the best choice (if no other constraints
apply) of Unicode encoding scheme for HTML used in a WWW context is
utf-8, not utf-16LE/BE.
Do you mean, when using a vast majority of Latin characters?


Not necessarily: Greek, Cyrillic, Arabic, Hebrew are all represented
by 2 octets in utf-8. Armenian, Syriac and Coptic too, hmmm. The
cutoff (IINM) is U+07FF.
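
(Easy to verify -- a little Python sketch of the octet counts:)

    for ch in ("A", "\u03b1", "\u05d0", "\u07ff", "\u0800", "\u4e2d"):
        print(hex(ord(ch)), len(ch.encode("utf-8")))
    # U+0000..U+007F take 1 octet, U+0080..U+07FF take 2,
    # and U+0800 upwards (including CJK) take 3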

CJK scripts are a different matter, but AFAICS they are still usually
represented in one of their traditional encodings, rather than in a
Unicode-based scheme.

Indic scripts will also need 3 octets per character in utf-8 (and in
this case AIUI the use of unicode-based encodings is very beneficial,
since there /was/ no widely accepted pre-unicode scheme: I'm told that
in order to read Indian newspapers on the web, pretty much each
newspaper needed a different "font" i.e in effect was using its own
private character encoding. But I'm no expert in that field, so the
information is only second-hand).
If not, wouldn't the file get very large? Wouldn't it be
better to use UTF-16?


I haven't widely tested browser compatibility for utf-16 encodings, so
I can't comment on that aspect. But keep in mind that the markup,
styles, etc. etc. are expressed by ASCII characters, and by using
utf-16 you're going to double the size of *those* as compared with
utf-8.
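
(To put rough numbers on it -- a sketch with a made-up scrap of markup:)

    # mostly-ASCII markup around a little Devanagari text
    scrap = "<p class='x'>" + "\u0928\u092e\u0938" + "</p>"
    print(len(scrap.encode("utf-8")))      # 26: ASCII at 1 octet, Devanagari at 3
    print(len(scrap.encode("utf-16-be")))  # 40: everything at 2 octets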

But yes, if your material is such that most of the data characters
need 3 octets in utf-8, and you've decided to use a unicode scheme,
then utf-16 could well be more-compact, you're right.
Jul 20 '05 #18
On Fri, 16 Jul 2004, Nick Kew wrote:
"Alan J. Flavell" <fl*****@ph.gla.ac.uk> writes:
There are third-party Apache modules which take care of this "on the
fly",


mod_deflate is standard. No need for third-party modules.


Thanks for the information!
Jul 20 '05 #19
On Fri, 16 Jul 2004, Alan J. Flavell wrote:
Indic scripts will also need 3 octets per character in utf-8 (and in
this case AIUI the use of unicode-based encodings is very beneficial,
since there /was/ no widely accepted pre-unicode scheme: I'm told that
in order to read Indian newspapers on the web, pretty much each
newspaper needed a different "font" i.e in effect was using its own
private character encoding.


But there's also <http://www.bbc.co.uk/hindi/>
and <http://www.bbc.co.uk/tamil/> .

--
Top-posting.
What's the most irritating thing on Usenet?

Jul 20 '05 #20
Jonas Smithson wrote:
However, some of my pages have numerous character entities on
them... let's say up to fifty on a page, perhaps; if they each
entailed an extra six bytes (for example) over some alternate
method, then that might add up to an extra 300 bytes. What does
that equal in download time? How many bytes of difference do *you*
think would make a "noticeable difference" between two documents...
say, to a user on a 56K modem?


Well, do the math. 300/56000 is not very significant. I suppose,
300/~33000 is more accurate a comparison, but even there, it's nothing
to worry about. Spending time tuning one image on a page will likely
have a greater impact than encoding will.
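
(In concrete terms -- back-of-envelope, taking ~33.6 kbit/s as what a
"56K" modem actually delivers:)

    extra_bytes = 300
    line_bits_per_second = 33600
    print(extra_bytes * 8 / line_bits_per_second)  # about 0.07 seconds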

You should only worry about encoding if it causes rendering problems.

--
Brian (remove ".invalid" to email me)
http://www.tsmchughs.com/
Jul 20 '05 #21
Alan J. Flavell wrote:
On Fri, 16 Jul 2004, Brian wrote:
UTF-8 is an 8-bit character set


No, utf-8 isn't a "character set" at all (that MIME "charset"
parameter denotes what we nowadays call a "character encoding
scheme").


Cripes, I cannot keep the terminology straight. I wish they had called
that thing by its name, charenc or something. Yes, utf-8 is an encoding.

--
Brian (remove ".invalid" to email me)
http://www.tsmchughs.com/
Jul 20 '05 #22

"Andreas Prilop" <nh******@rrzn-user.uni-hannover.de> wrote in message
news:Pine.GSO.4.44.0407161425010.9642-100000@s5b003...
On Fri, 16 Jul 2004, Harlan Messinger wrote:
Why can't a document be encoded (and transmitted) in Unicode?


It cannot be "in Unicode", but only in UTF-8, UTF-16, or UTF-32;
and in addition in different byte order for UTF-16 and UTF-32.
<http://www.unicode.org/unicode/faq/utf_bom.html>
If
Windows Notepad lets you save a text file as Unicode (big- or
little-endian), isn't that the same thing?


"Big- or little-endian" rules out UTF-8, so probably it's UTF-16.
UTF-32 isn't used in MS Windows AFAIK.


I'm really interested in what the distinction is. I admit I don't know what
UTF-16 is or why it's different from what I would call "Unicode encoding", but
why would a fixed 16-bit encoding scheme where "A" is encoded as 0041, an
em-dash is encoded as 2014, a katakana "pu" is encoded as 30D7, and so forth
not be "Unicode encoding"?

Is it that this encoding scheme already existed and had the name "UTF-16"
before the term "Unicode" was coined? So that the reason we don't call it
"Unicode encoding" is simply that it already has another name?

Jul 20 '05 #23
On Fri, 16 Jul 2004, Harlan Messinger wrote:
<http://www.unicode.org/unicode/faq/utf_bom.html>


I'm really interested in what the distinction is. I admit I don't know what
UTF-16 is or why it's different from what I would call "Unicode encoding", [...]


Err, did you read the page above, which I cited with reason?

--
Top-posting.
What's the most irritating thing on Usenet?

Jul 20 '05 #24

"Andreas Prilop" <nh******@rrzn-user.uni-hannover.de> wrote in message
news:Pine.GSO.4.44.0407161705220.11169-100000@s5b003...
On Fri, 16 Jul 2004, Harlan Messinger wrote:
<http://www.unicode.org/unicode/faq/utf_bom.html>
I'm really interested in what the distinction is. I admit I don't know what UTF-16 is or why it's different from what I would call "Unicode encoding",

[...]
Err, did you read the page above, which I cited with reason?


Sorry, I missed it somehow. I intend to read it later, but from glancing at
it, I have the following thoughts:

1. There's nothing any more nonsensical about the concept of a Unicode
encoding for the Unicode character set than there is about ASCII encoding
for the ASCII character set, but for whatever reasons (I assume efficiency
has something to do with it) it's not *used*.

2. EBCDIC and ASCII define the same characters, IIRC; but as character sets
they just number them differently. A document could be encoded in EBCDIC
just as easily as in ASCII. It wouldn't make any sense to speak of an EBCDIC
encoding of an ASCII document or an ASCII encoding of an EBCDIC document:
each is a separate encoding of a document based on the representations of
the document's characters in the respective character sets.

So why are the UTF-* encodings "encodings of the Unicode character set"? Is
it because they are closely related to the Unicode character set by virtue
of the fact that there is a mapping from UCS to UTF-* produced by applying a
small set of simple functions?


Jul 20 '05 #25
On Fri, 16 Jul 2004, Harlan Messinger wrote:
1. There's nothing any more nonsensical about the concept of a Unicode
encoding for the Unicode character set than there is about ASCII encoding
for the ASCII character set,
Maybe I could understand this sentence with fewer negatives :-)
2. EBCDIC and ASCII define the same characters, IIRC;
ASCII is a coded character set of 128 characters defined in ANSI X3.4
and ISO 646.
EBCDIC is a generic term for several (many?) coded character sets of
256 characters defined by IBM. Just four of them are listed here:
<http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/EBCDIC/>
So why are the UTF-* encodings "encodings of the Unicode character set"?


Think of Unicode as assigning characters to natural numbers - currently
from 0 to x10FFFF = 1114111. For example, number 945 = x3B1 means
the Greek small letter alpha.

The UTFs define how these numbers are represented by _byte_ sequences
(in a computer or on the Internet).
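
(In Python terms, roughly:)

    alpha = chr(945)                        # number 945 = x3B1
    print(alpha)                            # Greek small letter alpha
    print(alpha.encode("utf-8").hex())      # ceb1     - 2 bytes
    print(alpha.encode("utf-16-be").hex())  # 03b1     - 2 bytes
    print(alpha.encode("utf-32-be").hex())  # 000003b1 - 4 bytes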

--
Top-posting.
What's the most irritating thing on Usenet?

Jul 20 '05 #26
> ASCII is a coded character set of 128 characters defined in ANSI X3.4
and ISO 646.


Not quite. You are thinking of US-ASCII. There are a variety of national
ASCII character sets.


Jul 20 '05 #27
On Fri, 16 Jul 2004, Harlan Messinger wrote:
Sorry, I missed it somehow. I intend to read it later,
Call back here when you have done?
1. There's nothing any more nonsensical about the concept of a Unicode
encoding for the Unicode character set than there is about ASCII encoding
for the ASCII character set,
Actually there are substantial differences. And you see this also
with that MIME parameter which is (mis)named "charset" - but specifies
what we now would call a "character encoding scheme".

Back when 7 or 8 bits were sufficient to represent all of the
characters of a repertoire, it was quasi-obvious that the "coded
character set" was defined by assigning numbers (0-127 or 0-255 as the
case may be) to the characters of the repertoire, and then to lay out
the fonts according to that scheme, and to transmit the characters by
means of bytes having that value.

Consequently, back then it looked as if the things that we now call
"coded character set", "character encoding" and "font arrangement"
were just different names for the same thing. Of course, you needed a
different font for each "charset" (i.e character encoding), which got
to be a considerable drag.

Nowadays these concepts have to be disambiguated. Unicode characters
are designated by a code point which can, in principle, go up to 2**31
(it hasn't got that far yet). Those numbers then have to be
represented in a way which is convenient for transmission and/or
storage (different design criteria apply for different purposes).
2. EBCDIC and ASCII define the same characters, IIRC;
Actually not. But discussing that would be a pointless digression, so
let's move on.
So why are the UTF-* encodings "encodings of the Unicode character set"?
It's not practical, for various reasons, to transmit characters as
32-bit units. For one thing, it's very wasteful. For another,
there's no unique byte-ordering, hence all this fuss about endian-ness
when units of 16 or 32 bits are involved.

There's also the question of representing unicode characters in a
mail-safe context (hence utf-7). That will fade with time, but even
8-bit-safe mail formats ban null bytes, which means that utf-16 or
utf-32/ucs-4 representations cannot be used without a further layer of
encoding.
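
(The null bytes are easy to see -- a quick Python sketch:)

    print("Hi".encode("utf-8"))      # b'Hi'         - no null bytes
    print("Hi".encode("utf-16-be"))  # b'\x00H\x00i' - a null before each letter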
Is it because they are closely related to the Unicode character set


Is it because you won't read the tutorial before asking further
questions?

ttfn
Jul 20 '05 #28
In our last episode,
<Bj**********************@news01.bloor.is.net.cable.rogers.com>,
the lovely and talented C A Upsdell
broadcast on comp.infosystems.www.authoring.html:
ASCII is a coded character set of 128 characters defined in ANSI X3.4
and ISO 646.
Not quite. You are thinking of US-ASCII. There are a variety of national
ASCII character sets.

No. There is only one ASCII. It is a 7-bit code with 128 characters.
Think: what does the A in ASCII stand for?

--
Lars Eighner -finger for geek code- ei*****@io.com http://www.io.com/~eighner/
If it wasn't for muscle spasms, I wouldn't get any exercise at all.
Jul 20 '05 #29
On Fri, 16 Jul 2004, C A Upsdell wrote:
ASCII is a coded character set of 128 characters defined in ANSI X3.4
and ISO 646.
Not quite. You are thinking of US-ASCII.


ASCII and US-ASCII are synonyms.
<http://www.iana.org/assignments/character-sets>
There are a variety of national ASCII character sets.


No, they are called "7-bit codes" or "7-bit coded character sets"
as defined in ISO 646. <http://www.itscj.ipsj.or.jp/ISO-IR/>

Jul 20 '05 #30

"Alan J. Flavell" <fl*****@ph.gla.ac.uk> wrote in message
news:Pi******************************@ppepc56.ph.gla.ac.uk...
Is it because you won't read the tutorial before asking further
questions?


No, it's because sometimes questions can be satisfied by relatively simple
answers without requiring one to read a whole tutorial (though sometimes
not). Sometimes a tutorial or textbook will tell you the way things are
without explaining why they aren't some other way (though sometimes not).

Jul 20 '05 #31
On Fri, 16 Jul 2004, C A Upsdell wrote:
Not quite. You are thinking of US-ASCII. There are a variety of
national ASCII character sets.


That's sloppy terminology. There's a variety of 7-bit national
character sets which are patterned on ASCII (US-ASCII is a more
accurate name, since - contrary to widespread belief amongst some
parties - America doesn't consist solely of the United States).

But those national character sets were mostly codified under ISO-646.

I give you my old page
http://ppewww.ph.gla.ac.uk/~flavell/....html#national

and particularly the links to
http://kanji.zinbun.kyoto-u.ac.jp/~yasuoka/CJK.html and
http://www.terena.nl/library/multili...section04.html

This was a relevant topic in the early days of the WWW, since the code
positions which were set aside for national variations in iso-646 were
for example the basis for some of the "unsafe character" exclusions in
URLs.

Btw, I see there's a lovely comment in that Terena web page:

It will be clear that so-called "de facto standards" are related to
those discussed above as Monopoly banknotes to real money, valuable
as long as the game goes on.
Jul 20 '05 #32
"Andreas Prilop" <nh******@rrzn-user.uni-hannover.de> wrote in message
news:Pine.GSO.4.44.0407161833140.11334-100000@s5b003...
On Fri, 16 Jul 2004, C A Upsdell wrote:
ASCII is a coded character set of 128 characters defined in ANSI X3.4
and ISO 646.


Not quite. You are thinking of US-ASCII.


ASCII and US-ASCII are synonyms.
<http://www.iana.org/assignments/character-sets>


NOT TRUE!!!! Read the IANA page: "These names are expressed in
ANSI_X3.4-1968 which is commonly called US-ASCII or simply ASCII. The
character set most commonly use in the Internet and used especially in
protocol standards is US-ASCII, this is strongly encouraged. The use of the
name US-ASCII is also encouraged." This says that US-ASCII is commonly
called ASCII. It does not say that US-ASCII is ASCII.

Also, when I started developing software in the early 1970's -- before the
Internet, before PCs, before microprocessors -- I routinely worked with
various 7- and 8-bit ASCII character sets (in addition to EBCDIC and Gray
codes). I find many Internet references denying the existence of 8-bit
ASCII, but I can attest that, in the early 1970s, multiple 7- and 8-bit sets
were alive and well.

Jul 20 '05 #33

"C A Upsdell" <cupsdell0311XXX@-@-@XXXrogers.com> wrote in message
news:Lz************@news04.bloor.is.net.cable.rogers.com...
"Andreas Prilop" <nh******@rrzn-user.uni-hannover.de> wrote in message
news:Pine.GSO.4.44.0407161833140.11334-100000@s5b003...
On Fri, 16 Jul 2004, C A Upsdell wrote:
> ASCII is a coded character set of 128 characters defined in ANSI X3.4
> and ISO 646.

Not quite. You are thinking of US-ASCII.
ASCII and US-ASCII are synonyms.
<http://www.iana.org/assignments/character-sets>


NOT TRUE!!!! Read the IANA page: "These names are expressed in
ANSI_X3.4-1968 which is commonly called US-ASCII or simply ASCII. The
character set most commonly use in the Internet and used especially in
protocol standards is US-ASCII, this is strongly encouraged. The use of

the name US-ASCII is also encouraged." This says that US-ASCII is commonly
called ASCII. It does not say that US-ASCII is ASCII.
Uh, yeah, it does, unless the implication is along the lines of "... called
US-ASCII, or often simply ASCII, although this is technically incorrect
because ASCII properly refers to a different character set". But that isn't
the implication and the statement is saying that US-ASCII, ASCII, and
ANSI_X3.4-1968 are all names for the same thing--which is the same as saying
that each of them is also each of the others.

Also, when I started developing software in the early 1970's -- before the
Internet, before PCs, before microprocessors -- I routinely worked with
various 7- and 8-bit ASCII character sets (in addition to EBCDIC and Gray
codes). I find many Internet references denying the existence of 8-bit
ASCII, but I can attest that, in the early 1970s, multiple 7- and 8-bit sets were alive and well.


Multiple 7- and 8-bit sets were alive and well. But they were not ASCII.
They may have been ASCII extensions, but they were not ASCII.

Jul 20 '05 #34
"Harlan Messinger" <h.*********@comcast.net> wrote in message
news:2l************@uni-berlin.de...
Multiple 7- and 8-bit sets were alive and well. But they were not ASCII.
They may have been ASCII extensions, but they were not ASCII.


Indeed they were ASCII. Standards written later appear to have
disassociated the term ASCII from the national variants and extended sets,
preferring to give them numbered ANSI designations, but in the early 1970s
they were ASCII. National variants which I personally worked with included
French, German, and Italian ASCII sets, and one of my co-workers worked with
the Portuguese set. US-ASCII is the preferred term now to avoid confusion
with the other ASCII sets.

Jul 20 '05 #35
On Fri, 16 Jul 2004, Harlan Messinger wrote:

[after a bout of over-enthusiastic quoting]
Is it because you won't read the tutorial before asking further
questions?
No, it's because sometimes questions can be satisfied by relatively
simple answers without requiring one to read a whole tutorial


And it's because often, the relatively simple answers don't make any
sense until you've done the groundwork first so that you can
understand the answers (or even better - ask the right questions).

Your attention was directed to the tutorial for a constructive reason:
someone who knew the subject believed that it would be of genuine
benefit to you, it would position you better for the subsequent
discussion. As it happens, that is also my own opinion.
Sometimes a tutorial or textbook will tell you the way things are
without explaining why they aren't some other way (though sometimes not).


You'll be able to tell us how it was when you've tried it, OK? That
is, if I haven't lost patience by then and put you back into the
killfile...
Jul 20 '05 #36

"C A Upsdell" <cupsdell0311XXX@-@-@XXXrogers.com> wrote in message
news:Z7************@news04.bloor.is.net.cable.rogers.com...
"Harlan Messinger" <h.*********@comcast.net> wrote in message
news:2l************@uni-berlin.de...
Also, when I started developing software in the early 1970's -- before the Internet, before PCs, before microprocessors -- I routinely worked with various 7- and 8-bit ASCII character sets (in addition to EBCDIC and Gray codes). I find many Internet references denying the existence of 8-bit ASCII, but I can attest that, in the early 1970s, multiple 7- and
8-bit sets
were alive and well.
Multiple 7- and 8-bit sets were alive and well. But they were not ASCII.
They may have been ASCII extensions, but they were not ASCII.


Indeed they were ASCII. Standards written later appear to have
disassociated the term ASCII from the national variants and extended sets,
preferring to give them numbered ANSI designations, but in the early 1970s
they were ASCII. National variants which I personally worked with
included French, German, and Italian ASCII sets, and one of my
co-workers worked with the Portuguese set.
You and they called them "ASCII" informally, or do you have a citation to
show that ASCII was officially regarded as the proper name for these sets?


Jul 20 '05 #37

"Alan J. Flavell" <fl*****@ph.gla.ac.uk> wrote in message
news:Pi******************************@ppepc56.ph.gla.ac.uk...
You'll be able to tell us how it was when you've tried it, OK? That
is, if I haven't lost patience by then and put you back into the
killfile...


Oh, good grief, go ahead and get it over with. One would think I'd said
something simply terrible to you, instead of just asking questions and then
saying why I thought it was reasonable to do so.

Jul 20 '05 #38
You and they called them "ASCII" informally, or do you have a citation to
show that ASCII was officially regarded as the proper name for these sets?


I do not have any of the manuals etc. that I used 30+ years ago. Otherwise
I could show you.
Jul 20 '05 #39
On Fri, 16 Jul 2004, C A Upsdell wrote:
Indeed they were ASCII.
No, they may have been *based* on ASCII, they may have been informally
referred to as "national ASCII", but they were not literally the
"American Standard Code for Information Interchange".
Standards written later appear to have disassociated the term ASCII
from the national variants
Uh-uh, it's an international conspiracy to hide the origin of these
codes, is it? You don't seriously believe that the US American
national standards body would go making national character codes for
other countries, do you?
and extended sets,
At this point nobody's arguing about "extended sets". It's about
national variants based on the 7-bit code called ASCII.
preferring to give them numbered ANSI designations,
There you go again. ANSI (the later name of the US American national
standards body) had no jurisdiction over other national variants; only
over the (US-)American one. The British national variant based on
ASCII was a British Standard designation, BS4730; other national
variants would have had designations under their respective standards
bodies (DIN in Germany, and so on).

Later these 7-bit codes were codified into ISO-646 under the auspices
of the international standards body.
but in the early 1970s they were ASCII.


I've been interested in character coding issues since before then, and
I say you are mistaken, or confusing loose everyday terms and formal
specifications. Not that any of this is relevant to authoring HTML for
the WWW, so I shan't keep this sub-thread going.
Jul 20 '05 #40
My thanks to all the respondents. I've been sitting here reading this
thread with my jaw dropped open -- people not only discussing the
arcane nuances of encoding methods, but flaming each other over it!
This thread was so far over my head that (for my purposes) it might as
well have been written in ancient Greek (say, is that a possible
encoding method?). But I got the core information I needed: there's no
speed advantage of &#8212; over &mdash;. I wish I could remember where
I read that nonsense so that (if it was in a book, which I suspect it
was) I could warn people about the title.

I guess now my decision comes down to this: named entities are more
intuitive (I can remember them while I type without looking at a
chart), but Netscape 4 doesn't understand them, and makes the text look
like junk -- but it does understand numerical entities, which I can't
remember. So which do I care more about, my convenience in writing code
or the <0.5% of NS4 users? (That's a subjective question to myself, of
course; I don't expect an answer here.) Or maybe I'll type the named
entities and then do a bulk search & replace to numeric ones before
uploading the pages...
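
(Something like this little Python filter would do that bulk
replacement -- a sketch covering just the few entities I use; the
code-point table is my own, so double-check the numbers:)

    import re

    # decimal code points for the named entities I actually use
    NAMED = {"mdash": 8212, "ndash": 8211, "hellip": 8230, "copy": 169}

    def to_numeric(html):
        return re.sub(r"&([a-zA-Z]+);",
                      lambda m: "&#%d;" % NAMED[m.group(1)]
                                if m.group(1) in NAMED else m.group(0),
                      html)

    print(to_numeric("A dash&mdash;and more&hellip;"))
    # A dash&#8212;and more&#8230;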

Alan Flavell wrote:
It's best, of course, if your HTML authoring software takes
care of the details for you, according to some options which
you can set.
My "HTML authoring software" is a simple text editor; I don't care for
the so-called WYSIWYG editors so I have to make decisions like this for
myself.
utf-8 encoding is widely supported and a compact representation; its
problem is more the possibility of mishandling in the hands of
authors who are not yet familiar with it.


How would I, for example, type an emdash in utf-8 code? (I'm pretty
sure I just asked something totally clueless, like "which hand does a
cow use to play the accordion?" Oh, well... in for a dime, in for a
dollar, as they say...)

By the way, I occasionally see garbage characters even on the big news
sites -- where it looks like they meant to insert some kind of
punctuation mark but instead I see something that looks like a Chinese
character. I'm pretty sure they're not seeing that on their end, or
they would have fixed it; and I've searched through my preference
settings (in Internet Explorer 6) but couldn't find anything that seemed
relevant in terms of character encodings. Any guess as to why I'm
seeing scattered Chinese characters (it happens fairly rarely,
actually) and the site coders (presumably) aren't?
Jul 20 '05 #41
Jonas Smithson wrote:
My thanks to all the respondents. I've been sitting here reading this
thread with my jaw dropped open -- people not only discussing the
arcane nuances of encoding methods, but flaming each other over it!
I prefer the term "heated discussion" :)
This thread was so far over my head that (for my purposes) it might as
well have been written in ancient Greek (say, is that a possible
encoding method?).
Use a Greek encoding or Unicode :). AIUI, and I never took Ancient Greek
very far at school, it uses letters all found in modern Greek.
utf-8 encoding is widely supported and a compact representation; its
problem is more the possibility of mishandling in the hands of
authors who are not yet familiar with it.


How would I, for example, type an emdash in utf-8 code? (I'm pretty
sure I just asked something totally clueless, like "which hand does a
cow use to play the accordian?" Oh, well... in for a dime, in for a
dollar, as they say...)


Set your text editor to UTF-8 encoding, and input the character. You can
copy/paste it from anywhere (e.g. character map, a handy web page) or use
your keyboard -- I edited my keyboard layout to give me lots of useful
symbols. For instance, ndash – and mdash — are AltGr+hyphen and
Shift+AltGr+hyphen now.[1]
By the way, I occasionally see garbage characters even on the big news
sites -- where it looks like they meant to insert some kind of
punctuation mark but instead I see something that looks like a Chinese
character. I'm pretty sure they're not seeing that on their end, or
they would have fixed it; and I've searched through my preference
settings (in Windows Explorer 6) but couldn't find anything that seemed
relevant in terms of character encodings. Any guess as to why I'm
seeing scattered Chinese characters (it happens fairly rarely,
actually) and the site coders (presumably) aren't?


Someone's character encoding is not set correctly. If you've set the
encoding selection in IE (View, Encoding) to auto-select, maybe theirs is
set wrongly.

[1] US layouts don't have AltGr, so I'd have to use Ctrl+Shift. You can
make a keyboard layout for Windows 2k,XP,2003 with this tool:
<http://www.microsoft.com/downloads/details.aspx?FamilyID=fb7b3dcd-d4c1-4943-9c74-d8df57ef19d7&displaylang=en>
Much, much faster for typing things like ‘ ’ “ ” – — · ¼ ½ ¾ ©

--
Matt
Jul 20 '05 #42
Jonas Smithson <sm************@REMOVETHISboardermail.com> writes:
[...] people not only discussing [...] but flaming each other over it! [...] numerical entities, [...]


If you call *character references* 'numerical entities' one more time,
you ain't seen nothing yet! ;-)

Entity references are an entirely different syntactical construct.
You are excused because the WWW is cluttered with disinformation, but
before you go to sleep you really gotta write down 100 times:

'&#' is _*/NOT/*_ an ERO delimiter

Append exclamation marks in amounts you see fit.
--
| ) 111010111011 | http://bednarz.nl/
-(
| ) Distribute me: http://binaries.bednarz.nl/mp3/aicha
Jul 20 '05 #43
On Fri, 16 Jul 2004, Jonas Smithson wrote:
My thanks to all the respondents. I've been sitting here reading this
thread with my jaw dropped open -- people not only discussing the
arcane nuances of encoding methods, but flaming each other over it!
Welcome to usenet. Gene Spafford had already said it in 1992
(google for usenet and "herd of performing elephants").
My "HTML authoring software" is a simple text editor;
But /how/ simple? Come back to that in a moment...
I don't care for the so-called WYSIWYG editors
I'm right with you there. But it isn't a binary choice between
type-every-character-by-hand or point-and-drool-and-never-see-any-HTML
How would I, for example, type an emdash in utf-8 code?
That's a non-sequitur: your keyboard doesn't generate "in" us-ascii or
iso-8859-1 or utf-8 code, it generates keyboard codes: it's the job of
input methods to turn keypresses into actual stored characters.

If your editor is sufficiently unicode-aware, then you can type-in an
emdash character (by some combination of keypressings), and when
you're done authoring, you can say save-As and tell the dialog to save
in utf-8 format. Or you can copy/paste characters from a menu, or use
a character picker utility or whatever. The key issue is that the
editor can store and work with these characters, and save them to file
in an encoding that you like (probably utf-8).

Recent versions of even such a "simple" editor as Notepad can do this
(in win2k, xp). Older ones can't, so you'd need to look for a
unicode-capable editor.

You could use the source-view mode of Mozilla Composer, for that
matter. A good choice, as it offers an immediate preview and various
other conveniences, such as translating &-notation to and from coded
characters.
(I'm pretty sure I just asked something totally clueless, like
"which hand does a cow use to play the accordian?" Oh, well... in
for a dime, in for a dollar, as they say...)
You recognise the problem, and that's well over half way to a
solution. Believe me, it's much harder to explain anything to people
who are convinced they already understand 90% of it (just that what
they think they understand is wrong!).

You could try Alan Wood's overview at
http://www.alanwood.net/unicode/utilities_editors.html
although it's a bit of a mix of text editors, word processors and
web-page extruders all in the same bucket, so be selective.

Or google for unicode editors (and related terms) and see if you care
for anything you get.
By the way, I occasionally see garbage characters even on the big news
sites -- where it looks like they meant to insert some kind of
punctuation mark but instead I see something that looks like a Chinese
character. I'm pretty sure they're not seeing that on their end,


This can happen if they fail to specify a character encoding, and the
browser is set to auto-guess the encoding. Or various related errors.
I don't think there's a single right answer to your question. Given a
specific instance, it might be possible to deduce what had gone wrong.
Sometimes they got a news feed in one encoding, and accidentally
incorporated it into a page in a different encoding (news sites are
done from content management systems, the pages aren't produced
individually by hand).

hope this helps.
Jul 20 '05 #44
Alan J. Flavell wrote:
If your editor is sufficiently unicode-aware, then you can type-in an
emdash character (by some combination of keypressings), and when
you're done authoring, you can say save-As and tell the dialog to save
in utf-8 format....


The editor (an old version of BBEdit) gives me two options for the
document while I'm working on it: "Encode as Unicode" and, if that's
enabled, the option to "Swap Bytes". Whether or not I chose those
options, when I go to save the document, I have the further options to
"Save as Unicode" and, if that's enabled, to "Swap Bytes". (It also
gives me a choice of Macintosh, Unix, or DOS line breaks, which I
assume wouldn't affect the HTML display.) The "unicode/swap bytes"
choices, of course, mean nothing to me, and I've always left them off
(the default). I can't find anything in the editor's preferences or
dialogs about "utf-8". When they say "Save as Unicode", is it likely
they mean the same thing you mean by "save in utf-8 format"?

If I were working and saving in unicode, would that mean (for example)
that I could type an emdash the way we Mac users do it
(command-option-hyphen) and that would actually work in the HTML
document on other platforms, without my using any character entity (or
character reference or whatever it's called)? (I have a PC too so I
guess I could test that.) And would the emdash character then be more
"compact" (smaller download) than the character reference (—)
I've been using? But...um... didn't I read somewhere that unicode
documents are much larger than... the other kind... (what's a
'non-Unicode' document called?) and so should only be used if you need
support for large character sets like Chinese etc...? Or maybe they
were referring to something else... wait, I think it was called
"double-byte encoding" or something. Excuse me, my brain is exploding.
:)

And then, of course, there's the whole other issue that my FTP program
automatically converts code to iso-8859-1 charset when you upload it,
unless you tell it not to, and when BBEdit talks directly to the FTP
server I don't know what it does.

And if I did save a text file as unicode, when I opened it later in a
text editor (perhaps even a different one), would I be able to tell
what it was saved as?

(That's a lot of questions, and I'm sure I phrased this all wrong, but
maybe you can guess what I mean or what the stuff I've been reading
meant?)
Jul 20 '05 #45
"Alan J. Flavell" <fl*****@ph.gla.ac.uk> wrote in message
news:Pi******************************@ppepc56.ph.gla.ac.uk...
On Fri, 16 Jul 2004, C A Upsdell wrote:
Standards written later appear to have disassociated the term ASCII
from the national variants


Uh-uh, it's an international conspiracy to hide the origin of these
codes, is it? You don't seriously believe that the US American
national standards body would go making national character codes for
other countries, do you?


I generally respect what you say, even when I disagree with you. But a
paragraph like this is unworthy of you. International conspiracy? ISO an
American standards body? Standards being set by one national standards body
without consulting with other nations? You speak as if the US were the only
legitimate country in the world! Surely you are not (gasp!) a US
Republican!
and extended sets,


At this point nobody's arguing about "extended sets". It's about
national variants based on the 7-bit code called ASCII.

And as I said before, there were 8-bit ASCII sets, sometimes called extended
ASCII: 7 bits are not adequate to code characters for most European
languages, or for specialized character sets.

I do wish I had never discarded the manuals I used 3 decades ago. And I
wish that people would refuse to believe that information does not exist if
it does not make its way to the Internet. I have used computers, languages,
operating systems, tools, and manuals that have long been extinct. E.g.,
how many remember 8080 assembly programming using Intel MDS Development
Systems running the ISIS-II operating system. Or my favourite programmer's
editor, the Sage Professional Editor for Windows and OS/2? Or how to
program Intel's 8259A UART for either 7- or 8-bit serial communications?
Sigh.

Jul 20 '05 #46
"Jonas Smithson" <sm************@REMOVETHISboardermail.com> wrote in
comp.infosystems.www.authoring.html:
But I got the core information I needed: there's no
speed advantage of &#8212; over &mdash;.


It's true that there's no speed advantage.

There is another advantage, however, one that I have not seen
mentioned in this thread: Netscape 4 understands &#8212; but does
not understand &mdash;. That might weigh in your decision.

--
Stan Brown, Oak Road Systems, Tompkins County, New York, USA
http://OakRoadSystems.com/
HTML 4.01 spec: http://www.w3.org/TR/html401/
validator: http://validator.w3.org/
CSS 2 spec: http://www.w3.org/TR/REC-CSS2/
2.1 changes: http://www.w3.org/TR/CSS21/changes.html
validator: http://jigsaw.w3.org/css-validator/
Jul 20 '05 #47
Tim
On Fri, 16 Jul 2004, Pierre Goiffon wrote:
If not, wouldn't the file get very large? Wouldn't it be
better to use UTF-16?


"Alan J. Flavell" <fl*****@ph.gla.ac.uk> posted:
I haven't widely tested browser compatibility for utf-16 encodings, so
I can't comment on that aspect.


Not that long ago I tried utf-16 on several different (and *current*
versions of) web browsers. Only some could use it.

I know that's vague, and I'm not inclined to run all the tests before I
post this response. But it was enough to convince *me* that it was a bad
idea.

--
If you insist on e-mailing me, use the reply-to address (it's real but
temporary). But please reply to the group, like you're supposed to.

This message was sent without a virus, please delete some files yourself.
Jul 20 '05 #48
Jonas Smithson wrote:
I can't find anything in the editor's preferences or
dialogs about "utf-8". When they say "Save as Unicode", is it likely
they mean the same thing you mean by "save in utf-8 format"?
I have way too much time on my hands, so I think I'll write a
(hopefully) easy-to-understand explanation of this stuff. I'm not an
expert, and I'm sure one will correct me on some of the finer points,
but I should at least be able to give you a good enough idea of this stuff.

Computers store things in bytes, which are numbers between 0 and 255.
This system works great for numbers, since you can use multiple bytes to
store numbers larger than 255, but text is a bit problematic when all
you have to work with is numbers.

Enter character sets and encodings. A character set is just that: a set
of characters. An encoding is a way to convert characters in a character
set into a series of bytes. Some simple character sets which define 256
characters or less can also be considered encodings, since nothing
special is required to convert them into bytes.

The first character set, which was also an encoding because it defined
only 128 characters, was called ASCII. It was fine for early computers,
but there was a problem: it only defined the Latin alphabet, digits, and
a few simple symbols. Countries which needed accented letters had
trouble, and countries which had entirely different alphabets couldn't
use ASCII at all.

In an attempt to fix all of those problems, the International
Organization for Standardization and others defined encodings which kept
the 128 ASCII characters, but also used the other 128 integers in a byte
for other characters. Unfortunately, there were more than 128 characters
needed for other alphabets, so several incompatible encodings defining
different characters were created instead of just one. That worked for a
while, but the incompatibility of the different encodings stopped
characters from different alphabets from being used in the same
document, which some people needed to do.
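
A minimal Python 3 illustration of that incompatibility (my own
example): the very same byte value decodes to a different character
under different ISO 8859 variants.

    # The same byte means different characters in different 8-bit encodings.
    b = bytes([0xE9])
    print(b.decode("iso-8859-1"))   # 'é' (Latin)
    print(b.decode("iso-8859-5"))   # 'щ' (Cyrillic)
    print(b.decode("iso-8859-7"))   # 'ι' (Greek)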

The most important character set today is called Unicode. It currently
defines about 96,000 characters, and its code space leaves room for
1,114,112 code points in total. It has Latin, Greek, Chinese, and
everything in between; hopefully enough for anyone.

Note that I said Unicode is a character set, not an encoding. It has
three principal encodings: UTF-8, UTF-16, and UTF-32. UTF-8 is probably
the most used; it uses a different number of bytes (between 1 and 4) for
different characters, and all ASCII text is also valid UTF-8 text.
UTF-16 also uses a variable number of bytes, either 2 or 4 per
character. UTF-32 is the simplest for programs to process; it uses 4
bytes for every character.
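
If you want to see those widths for yourself, here's a small Python 3
sketch (again my own, with characters picked purely for illustration)
printing the byte count of each character in the three encodings:

    # Bytes per character in UTF-8, UTF-16, and UTF-32.
    # U+1D11E lies outside the 16-bit range, so even UTF-16 needs
    # four bytes (a surrogate pair) for it.
    for ch in ("A", "\u00e9", "\u2014", "\U0001D11E"):
        print(hex(ord(ch)),
              len(ch.encode("utf-8")),      # 1, 2, 3, 4
              len(ch.encode("utf-16-be")),  # 2, 2, 2, 4
              len(ch.encode("utf-32-be")))  # 4, 4, 4, 4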

As to whether your editor means UTF-8 by Unicode, I'm not sure. Strictly
speaking, "Unicode" alone doesn't name an encoding, so whether it means
UTF-8, UTF-16, or UTF-32 is difficult to say.

If I were working and saving in unicode, would that mean (for example)
that I could type an emdash the way we Mac users do it
(command-option-hyphen) and that would actually work in the HTML
document on other platforms, without my using any character entity (or
character reference or whatever it's called)?
Yes. I believe Mac OS X handles these things very nicely, so you
shouldn't have any trouble.
And would the emdash character then be more
"compact" (smaller download) than the character reference (—)
I've been using?
Yes. — is 7 bytes in UTF-8, but the emdash encoded in UTF-8 is
only two bytes.
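
A quick way to check those counts, if you have Python 3 handy (a
throwaway sketch, not anything BBEdit-specific):

    # The entity is seven ASCII bytes; the raw em dash is three UTF-8
    # bytes, or two bytes in UTF-16.
    print(len("&#8212;".encode("utf-8")))     # 7
    print(len("\u2014".encode("utf-8")))      # 3
    print(len("\u2014".encode("utf-16-be")))  # 2
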
But...um... didn't I read somewhere that unicode
documents are much larger than... the other kind... (what's a
'non-Unicode' document called?) and so should only be used if you need
support for large character sets like Chinese etc...?
Yes and no. UTF-8 documents are the same size as iso-8859-1 documents as
long as they stick to ASCII characters (accented Latin-1 characters take
two bytes each in UTF-8), but UTF-16 and UTF-32 documents are larger.
And then, of course, there's the whole other issue that my FTP program
automatically converts code to iso-8859-1 charset when you upload it,
unless you tell it not to, and when BBEdit talks directly to the FTP
server I don't know what it does.
My advice would be to replace your FTP client if it's that broken, but
you might be able to fix it by uploading in binary mode instead of text.
As for what BBEdit does, my guess would be that it does the right
thing if it has an option for Unicode when saving.
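
For what it's worth, here's a sketch of the binary-mode idea using
Python 3's standard ftplib; the host, credentials, and file name are
placeholders, not details from this thread:

    # Upload in binary ('image') mode so the client can't re-encode
    # the file's bytes in transit.
    from ftplib import FTP

    with FTP("ftp.example.com") as ftp:   # placeholder host
        ftp.login("user", "password")     # placeholder credentials
        with open("page.html", "rb") as f:
            ftp.storbinary("STOR page.html", f)
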
And if I did save a text file as unicode, when I opened it later in a
text editor (perhaps even a different one), would I be able to tell
what it was saved as?


Not unless your text editor told you, which it might.
Jul 20 '05 #49
On Sat, 17 Jul 2004, Jonas Smithson wrote:
The editor (an old version of BBEdit) gives me two options for the
document while I'm working on it: "Encode as Unicode" and, if that's
enabled, the option to "Swap Bytes".
Feel free to play around with this stuff and see what happens. E.g.,
put some interesting characters into a file, save it with the various
options, open the file in a unicode-capable web browser, and play with
its View > Character Encoding options (whatever it calls them) till the
result makes sense. Then you'll have a better idea of what you've
got. View the source to make sure you're getting coded characters
instead of &-notations.
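
If it helps, a little Python 3 script along those lines (the sample
text and file names are mine, purely for experimenting):

    # Write the same sample in several encodings, then open each file
    # in the browser and switch its encoding menu until it reads right.
    sample = "<p>em dash: \u2014 e-acute: \u00e9</p>\n"
    for enc in ("utf-8", "utf-16", "iso-8859-1"):
        # The em dash isn't in iso-8859-1, so it comes out as '?' there.
        with open("test-%s.html" % enc, "w", encoding=enc, errors="replace") as f:
            f.write(sample)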

My hunch is that your editor is talking about the forerunner of utf-16
which was called ucs-2, back when the Unicode range could all be
represented in two bytes. For this subset of characters, you may be
able to treat utf-16 and ucs-2 as effectively synonymous.
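
A small check of that equivalence (my sketch, in Python 3): for a
character inside the old 16-bit range, the UTF-16 bytes are just the
code point itself, which is exactly what UCS-2 would have produced.

    # For Basic Multilingual Plane characters, UTF-16 and UCS-2 coincide:
    # the encoded bytes are simply the 16-bit code point.
    print("\u2014".encode("utf-16-be").hex())  # '2014'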

My reading of Alan Wood's pages on editors (please consult them) is
that current versions of BBEdit support utf-8:

http://www.alanwood.net/unicode/util...ac.html#bbedit
(It also gives me a choice of Macintosh, Unix, or DOS line breaks,
which I assume wouldn't affect the HTML display.)
Agreed
If I were working and saving in unicode, would that mean (for example)
that I could type an emdash the way we Mac users do it
(command-option-hyphen) and that would actually work in the HTML
document on other platforms, without my using any character entity
Right
And would the emdash character then be more "compact" (smaller
download) than the character reference (—) I've been using?
Yes
But...um... didn't I read somewhere that unicode
documents are much larger than... the other kind...
utf-8 is a good compromise for western writing systems. We've
discussed some of the issues elsewhere on this thread.
And then, of course, there's the whole other issue that my FTP program
automatically converts code to iso-8859-1 charset when you upload it,
unless you tell it not to, and when BBEdit talks directly to the FTP
server I don't know what it does.


This is a detail which you'd need to get a grasp on, right.

But play around a bit, and read around a bit, so that competence and
understanding stay reasonably in step. In the end, it's all much
simpler and more straightforward than it might have seemed at the outset.
But if your software doesn't properly support what you're trying to do,
then you're confronted with extra difficulties. So do take a look at
Alan Wood's overview as it relates to your particular platform(s) and
pick something that appeals to you, at least for the first steps.
Then you'd be able to assess whether the software that you're already
using is actually capable of what you need.
Jul 20 '05 #50
