
Using character entities in us-ascii

P: n/a
Ian
I'm using the following meta tag with my documents:

<meta http-equiv="Content-Type" content=
"text/html; charset=us-ascii" />

and yet using character entities like &rsquo; and &mdash;

It validates at W3C and WDG, and runs in standards compliance mode
in Firefox 0.9. What I'm wondering is, is this a good practice? I
assume my pages will load faster if declared as using the
"us-ascii" character code, but it seems odd to refer to characters
not actually in the ASCII character code.

As an aside, I have yet to figure out how to tell Apache to use
us-ascii for all my pages, as I assume that's what I'm supposed to
do instead of using the <meta> tag.

Ian
--
http://www.bookstacks.org/
Jul 20 '05 #1
19 Replies


P: n/a
Ian wrote:
I'm using the following meta tag with my documents:

<meta http-equiv="Content-Type" content=
"text/html; charset=us-ascii" />

and yet using character entities like &rsquo; and &mdash;

<snip> but it seems odd to refer to characters
not actually in the ASCII character code.
You're specifying the US-ASCII encoding for this document, but HTML as a
whole uses the Unicode character set (there's a difference between
character sets and encodings, by the way). If Unicode's first 128
characters were different from ASCII's (they aren't), &#123; would mean
something different from a real 123 byte in the document.
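
A minimal sketch of the point above, assuming Python's standard html module resolves these references the way a browser would. The sample sentence is made up for illustration: the bytes are pure US-ASCII, yet the references resolve to Unicode characters.

```python
import html

# Source text as it might sit in a file declared as US-ASCII:
# only ASCII bytes, but it refers to characters outside ASCII.
source_bytes = b"It&rsquo;s fine &mdash; really. &#123;"

# Decoding as ASCII succeeds because every byte is below 128.
source_text = source_bytes.decode("ascii")

# Resolving the references yields Unicode characters, whatever
# encoding the bytes happened to arrive in.
rendered = html.unescape(source_text)

# Code points of the resolved characters that lie outside ASCII.
code_points = [f"U+{ord(ch):04X}" for ch in rendered if ord(ch) > 127]
```

Here &#123; resolves to the same left brace that a literal byte 123 would give, because Unicode and ASCII agree on the first 128 characters.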

If you want a basic introduction to encodings and character sets, I
think this Usenet post of mine is decent: <http://tinyurl.com/6dx4p>.
As an aside, I have yet to figure out how to tell Apache to use
us-ascii for all my pages, as I assume that's what I'm supposed to
do instead of using the <meta> tag.


http://www.w3.org/International/ques...access-charset
Jul 20 '05 #2

P: n/a
Ian
On Mon, 26 Jul 2004 01:42:05 -0400, Leif K-Brooks
<eu*****@ecritters.biz> wrote:
You're specifying the US-ASCII encoding for this document, but HTML as a
whole uses the Unicode character set
Okay, I think I get it. I was looking at the W3C list of character
entities earlier, which looked like it was showing part of the HTML
4.01 DTD. The named entities were being converted into numeric
Unicode. This was initially what worried me, but I think I see
what you mean. I always thought a character entity "stood for" the
equivalent in whatever encoding you're using, so if you state
you're using an encoding that doesn't support that character
entity, it wouldn't get rendered in the browser. This doesn't seem
to be the case, though. It looks like character entities are part
of HTML in general, just like you can use &amp; in an XML document
without declaring it as an entity, since XML has that as a
default. I may be on the wrong track here, though.
If you want a basic introduction to encodings and character sets, I
think this Usenet post of mine is decent: <http://tinyurl.com/6dx4p>.
Thank you. That's what I've been looking for for months. The only
other thing I found useful was a page by Jukka. But I think
there are some aspects of this subject that are so obvious they
don't get stated. It's helpful, for instance, to know that
&#number; (since I'm bad with the terminology :), if it's below
255, will be the same thing in ASCII, ISO 8859-n, or UTF-8 (I
think). It's also helpful to know that US-ASCII is what people
mean by ASCII, and stuff like that. I may even have these wrong,
but they don't seem to get addressed, or at least it's hard to
find information about stuff like this. I think it's an important
topic, especially if it's true that pages load faster if the
encoding is switched from UTF-8 to US-ASCII.
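
A small sketch of the &#number; point, assuming Python's standard html module treats the reference as a browser would. The number names a Unicode code point, so the resulting character does not depend on the declared encoding; what differs is how that character is stored as bytes.

```python
import html

# A numeric character reference names a Unicode code point.
char = html.unescape("&#233;")  # LATIN SMALL LETTER E WITH ACUTE

# The character is the same, but the bytes used to store it differ.
latin1_bytes = char.encode("iso-8859-1")  # one byte
utf8_bytes = char.encode("utf-8")         # two bytes

# US-ASCII has no byte for it at all, which is why a reference is needed.
try:
    char.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```
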
http://www.w3.org/International/ques...access-charset


Oh man. It was just .htaccess, eh? :-)

Ian
--
http://www.bookstacks.org/
Jul 20 '05 #3

P: n/a
"Ian" <bl***@blank.com> wrote in message
news:0a********************************@4ax.com
I assume my pages will load faster if declared as using the
"us-ascii" character code


Why do you think that? I can't see why it would be true, and I have never
seen any difference in loading time between pages with different encodings.

Jul 20 '05 #4

P: n/a
In our last episode, <0a********************************@4ax.com>,
the lovely and talented Ian broadcast on
comp.infosystems.www.authoring.html:
I'm using the following meta tag with my documents:

<meta http-equiv="Content-Type" content=
"text/html; charset=us-ascii" />

and yet using character entities like &rsquo; and &mdash;
So?

With the exception of a few like &amp;, &lt;, and &gt;, pretty
much the whole point of character entities is to allow you to
express characters in the object (i.e. rendered) document that
are not in the source (i.e. html) document's character set.

Content-Type says what character set the html document is
using. And sure enough "&mdash;" is composed entirely of
ascii characters.
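
One rough way to check this for a given page, sketched in Python. The helper name fits_in_ascii and the sample strings are hypothetical, chosen only for illustration.

```python
# The markup as typed: every character is in the ASCII repertoire.
as_reference = "Wait &mdash; what?"
# The same sentence with the dash typed directly as a character.
as_literal = "Wait \u2014 what?"


def fits_in_ascii(text: str) -> bool:
    """Return True if the text can be stored as US-ASCII bytes."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False
```

The version written with the reference passes the check; the version with the literal dash does not, so it could not honestly be labelled us-ascii.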
It validates at W3C and WDG, and runs in standards compliance mode
in Firefox 0.9. What I'm wondering is, is this a good practice? I
assume my pages will load faster if declared as using the
"us-ascii" character code,
Why would you think that? Pretty much the default assumption
is ISO-8859-1.
but it seems odd to refer to characters
not actually in the ASCII character code.
The idea is to express (i.e. markup) real-world documents
so that they can be transmitted by http and interpreted
by browsers. Real-world documents may lapse into words or
passages in Greek or have math symbols or use dashes or
distinguish the various quotation marks.
As an aside, I have yet to figure out how to tell Apache to use
us-ascii for all my pages, as I assume that's what I'm supposed to
do instead of using the <meta> tag.


You can configure this for your own directory using AddDefaultCharset
in your .htaccess file (if AllowOverride FileInfo is enabled for user
directories in the server configuration), or if you administer the
server you can set the default character set in the server
configuration file. Out of the box, this is set at ISO-8859-1.

Your assumption that ascii will load faster is incorrect. As I have
said, ISO-8859-1 is the default out-of-the-box for Apache, and if you
can (and do) use it, you probably won't have to make any adjustments
to the server or to your .htaccess file.
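
For context, AddDefaultCharset takes effect by adding a charset parameter to the Content-Type response header. The sketch below shows roughly what a client reads from such a header value. It uses Python's standard email.message module as a stand-in parser, which is an assumption; browsers use their own parsers. The function name is hypothetical.

```python
from email.message import Message


def charset_from_content_type(header_value: str):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    # Returns the charset in lower case, or None when none is given.
    return msg.get_content_charset()
```
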

--
Lars Eighner -finger for geek code- ei*****@io.com http://www.io.com/~eighner/
If it wasn't for muscle spasms, I wouldn't get any exercise at all.
Jul 20 '05 #5

P: n/a
Tim
Leif K-Brooks <eu*****@ecritters.biz> wrote:
You're specifying the US-ASCII encoding for this document, but HTML as a
whole uses the Unicode character set

Ian <bl***@blank.com> posted:
Okay, I think I get it. I was looking at the W3C list of character
entities earlier, which looked like it was showing part of the HTML
4.01 DTD. The named entities were being converted into numeric
Unicode. This was initially what worried me, but I think I see
what you mean. I always thought a character entity "stood for" the
equivalent in whatever encoding you're using, so if you state
you're using an encoding that doesn't support that character
entity, it wouldn't get rendered in the browser. This doesn't seem
to be the case, though. It looks like character entities are part
of HTML in general, just like you can use &amp; in an XML document
without declaring it as an entity, since XML has that as a
default. I may be on the wrong track here, though.


When you encode the letter "A" as character number 65 in ASCII (I'm doing
that from memory, so it might not be 65), it's a *representation* of what you
want in the document (as is everything that you type). When you use &copy;
it's a representation of what you want in the document (a copyright
symbol). These are, if you like, *input* encodings.

What the browser is capable of outputting depends on what it can support,
separate from what you gave it as a source document. It'll depend on the
browser's understanding, the fonts available, and the system.

As a general rule, few browsers produce ASCII output. They produce their
output in the native format for the system they're running on. So, unless
the source document was already using the same scheme, there's at least one
translation involved.

e.g. Input --> browser's internal handling --> Displayed output

The input encoding *only* limits what can be directly typed into it (as
individual characters). You can type in references to use characters that
can't be typed directly (e.g. there's no copyright symbol in ASCII, but the
reference for it uses characters that are available in ASCII - the letters
in & c o p y ;)

--
If you insist on e-mailing me, use the reply-to address (it's real but
temporary). But please reply to the group, like you're supposed to.

This message was sent without a virus, please delete some files yourself.
Jul 20 '05 #6

P: n/a
In our last episode,
<2m************@uni-berlin.de>,
the lovely and talented Leif K-Brooks
broadcast on comp.infosystems.www.authoring.html:
Ian wrote:
I'm using the following meta tag with my documents:

<meta http-equiv="Content-Type" content=
"text/html; charset=us-ascii" />

and yet using character entities like &rsquo; and &mdash;

<snip> but it seems odd to refer to characters
not actually in the ASCII character code.
You're specifying the US-ASCII encoding for this document, but HTML as a
whole uses the Unicode character set (there's a difference between
character sets and encodings, by the way).
Maybe you meant something other than what you wrote, but what you
wrote isn't true.
If Unicode's first 128
characters were different from ASCII's (they aren't), &#123; would mean
something different from a real 123 byte in the document.

If you want a basic introduction to encodings and character sets, I
think this Usenet post of mine is decent: <http://tinyurl.com/6dx4p>.

As an aside, I have yet to figure out how to tell Apache to use
us-ascii for all my pages, as I assume that's what I'm supposed to
do instead of using the <meta> tag.

http://www.w3.org/International/ques...access-charset


Not acceptable.

<http://www.w3.org/International/questions/qa-htaccess-charset.html>

--
Lars Eighner -finger for geek code- ei*****@io.com http://www.io.com/~eighner/
If it wasn't for muscle spasms, I wouldn't get any exercise at all.
Jul 20 '05 #7

P: n/a
Ian
On Mon, 26 Jul 2004 09:40:43 +0200, Pierre Goiffon
<pg******@nowhere.invalid> wrote:
"Ian" <bl***@blank.com> wrote in message
news:0a********************************@4ax.com
I assume my pages will load faster if declared as using the
"us-ascii" character code


Why do you think that?


I ran across a perl program a while back designed to test whether or not
us-ascii could be safely used. The author had converted all his pages to
us-ascii to save downloading time.

Ian
--
http://www.bookstacks.org/
Jul 20 '05 #8

P: n/a
Ian
On Mon, 26 Jul 2004 06:27:34 -0500, Lars Eighner <ei*****@io.com> wrote:
Your assumption that ascii will load faster is incorrect.


Thanks, Lars. I wasn't sure about it, so coming here was a good idea.

Ian
--
http://www.bookstacks.org/
Jul 20 '05 #9

P: n/a
On Mon, 26 Jul 2004, Lars Eighner wrote:
Leif K-Brooks broadcast on comp.infosystems.www.authoring.html:
You're specifying the US-ASCII encoding for this document, but HTML as a
whole uses the Unicode character set (there's a difference between
character sets and encodings, by the way).


Maybe you meant something other than what you wrote, but what you
wrote isn't true.


If you're referring to the part which you quoted, I have to say it
looks OK to me. If you have some issue with it, you might care to be
more specific about what you reckon "isn't true".
http://www.w3.org/International/ques...access-charset


Not acceptable.


Was that meant to be your summary of a 406 response? Then it sounds
as if you've got your client agent configured to reject some dimension
of their content negotiation. (It works OK for me, on more than one
browser.)
Jul 20 '05 #10

P: n/a
On Mon, 26 Jul 2004, Leif K-Brooks wrote:
If Unicode's first 128
characters were different from ASCII's (they aren't), [...]


¿Qué?
"If Christopher Columbus were shot in 1491 (he wasn't) ..."

I don't see any use for your speculation.

--
Top-posting.
What's the most irritating thing on Usenet?
Jul 20 '05 #11

P: n/a
On Mon, 26 Jul 2004, Ian wrote:
I'm using the following meta tag with my documents:
<meta http-equiv="Content-Type" content=
"text/html; charset=us-ascii" />
Don't! Specify the encoding (charset) in the HTTP header:
<http://www.w3.org/International/O-HTTP-charset.html>
<http://ppewww.ph.gla.ac.uk/~flavell/charset/ns-burp.html>
and yet using character entities like &rsquo; and &mdash;
Don't! Use decimal character references:
<http://ppewww.ph.gla.ac.uk/~flavell/charset/checklist.html#s6>
I assume my pages will load faster if declared as using the
"us-ascii" character code,
Huh?
As an aside, I have yet to figure out how to tell Apache to use
us-ascii for all my pages, as I assume that's what I'm supposed to
do instead of using the <meta> tag.


<http://www.w3.org/International/questions/qa-htaccess-charset>

--
Top-posting.
What's the most irritating thing on Usenet?

Jul 20 '05 #12

P: n/a
Andreas Prilop wrote:
On Mon, 26 Jul 2004, Leif K-Brooks wrote:
If Unicode's first 128
characters were different from ASCII's (they aren't), [...]

¿Qué?
"If Christopher Columbus were shot in 1491 (he wasn't) ..."

I don't see any use for your speculation.


"If Christopher Columbus were shot in 1491 (he wasn't), America would
probably still have been discovered. Pure random chance suggests that
anything which is possible will happen eventually, although it may take
a while." It's an example.
Jul 20 '05 #13

P: n/a
"Ian" <id*******@hotmail.com> wrote in message
news:opsbqvc40rs2ohee@home-nmjjofzmt4
The author had converted all his
pages to us-ascii to save downloading time.


Between us-ascii and any 8-bit character repertoire, the size of the document
will be exactly the same: 8 bits per character.
UTF-8, for example, could lead to bigger files, but this could become
a problem only when you reach millions of page views a month. If so,
there are also solutions such as on-the-fly gzip compression.

Jul 20 '05 #14

P: n/a
On Mon, 26 Jul 2004, Leif K-Brooks wrote:
"If Christopher Columbus were shot in 1491 (he wasn't), America would
probably still have been discovered.


Or as someone perceptively commented:

"The Vikings' greatest gift to humanity was that they discovered
America first, *and didn't tell anybody about it*".

SCNR.
Jul 20 '05 #15

P: n/a
On Mon, 26 Jul 2004, Pierre Goiffon wrote:
Between us-ascii and any 8-bit character repertoire, the size of the document
will be exactly the same: 8 bits per character.
That depends on the content. To represent e.g. a copyright character
in an HTML source document coded in us-ascii, it needs several bytes
(e.g. "&copy;" = 6 bytes). To do it in utf-8 as a coded character
needs only two bytes. To do it in iso-8859-something, it needs only
one byte.
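
The byte counts above can be checked with a short Python sketch. The dictionary labels are only illustrative.

```python
copyright_sign = "\u00a9"

# Number of bytes needed to represent one copyright sign in the source.
sizes = {
    "us-ascii, written as &copy;": len("&copy;".encode("ascii")),
    "utf-8, typed directly": len(copyright_sign.encode("utf-8")),
    "iso-8859-1, typed directly": len(copyright_sign.encode("iso-8859-1")),
}
```
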
UTF-8, for example, could lead to bigger files,
Not if the content consists entirely of us-ascii characters, no.

So again, it depends on the content.
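
As a quick illustration, assuming a made-up snippet of markup that contains only ASCII characters: the encoded bytes come out identical under all three encodings, so the declared charset makes no difference to the file size.

```python
text = "Plain ASCII markup: <p>Hello, world!</p>"

# Encode the same text three ways and compare the resulting bytes.
encoded = {name: text.encode(name) for name in ("ascii", "iso-8859-1", "utf-8")}
all_identical = len(set(encoded.values())) == 1
```
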
but this could become
a problem only when you reach millions of page views a month. If so,
there are also solutions such as on-the-fly gzip compression.


Right, although there's no necessity for it to be done "on the fly".
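
A rough sketch of the alternative, using Python's standard gzip module: compress the page once ahead of time and hand out the stored result. The page content and the helper body_for are hypothetical, and how a real server decides whether the client accepts gzip is not shown.

```python
import gzip

# Hypothetical page content; repetitive markup compresses well.
page = (
    "<p>Real-world documents may lapse into Greek &mdash; or maths.</p>\n" * 200
).encode("ascii")

# Compress once, ahead of time, and keep the result alongside the original.
precompressed = gzip.compress(page)


def body_for(accepts_gzip: bool) -> bytes:
    """Pick the stored variant instead of compressing on every request."""
    return precompressed if accepts_gzip else page
```
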
Jul 20 '05 #16

P: n/a
"Alan J. Flavell" <fl*****@ph.gla.ac.uk> wrote in message
news:Pi******************************@ppepc56.ph.gla.ac.uk
Between us-ascii and any 8-bit character repertoire, the size of the
document will be exactly the same: 8 bits per character.


That depends on the content. To represent e.g. a copyright character
in an HTML source document coded in us-ascii, it needs several bytes
(e.g. "&copy;" = 6 bytes).


Oh right, a very pertinent remark. I forgot to mention the extra space used by
entities when needed.
UTF-8, for example, could lead to bigger files,


Not if (...)


That was why I wrote "could" :)

Anyway, thanks for the extra info. Sorry I didn't give it in my first
post.

Jul 20 '05 #17

P: n/a
Alan J. Flavell wrote:
as someone perceptively commented:

"The Vikings' greatest gift to humanity was that they discovered
America first, *and didn't tell anybody about it*".


A great gift to the people living here, that's certain!

--
Brian (remove ".invalid" to email me)
http://www.tsmchughs.com/
Jul 20 '05 #18

P: n/a
Ian
On Mon, 26 Jul 2004 15:01:47 +0200, Andreas Prilop
<nh******@rrzn-user.uni-hannover.de> wrote:
Don't! Use decimal character references:
<http://ppewww.ph.gla.ac.uk/~flavell/charset/checklist.html#s6>


Okay, that's good advice. The odd thing is I have it bookmarked,
but for some reason have misplaced it in my massive bookmark menu.

The pertinent bit is:

<quote>
For Latin-1 characters, &entityname; is recommended, whereas for
non-Latin-1 characters, &#bignumber; (unicode values) is
preferred, even where an HTML4 entity is defined, since browser
support is more widespread.
</quote>

Now, I really, really don't want to start a fight ... :-) ... so I
will just ask, in a general, vague way, if this "widespread
browser support" involves NN4 or IE4 not supporting it, or if some
newer browsers choke as well.
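
For anyone wanting to apply the quoted advice mechanically, here is a minimal sketch in Python. It assumes the entity table in the standard html.entities module matches the HTML 4 set; the function name is hypothetical. Named entities for Latin-1 characters are left alone, and the rest are rewritten as decimal references.

```python
import re
from html.entities import name2codepoint

_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def prefer_decimal_references(markup: str) -> str:
    """Rewrite named entities outside Latin-1 as decimal references."""

    def replace(match: re.Match) -> str:
        code_point = name2codepoint.get(match.group(1))
        if code_point is None or code_point <= 0xFF:
            return match.group(0)  # unknown or Latin-1: leave as written
        return f"&#{code_point};"

    return _ENTITY.sub(replace, markup)
```

With this, &mdash; becomes &#8212; and &rsquo; becomes &#8217;, while &copy; and &amp; stay as they are.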

<ducking, running>

Ian
--
http://www.bookstacks.org/
Jul 20 '05 #19

P: n/a
On Mon, 26 Jul 2004, Ian wrote:
<http://ppewww.ph.gla.ac.uk/~flavell/charset/checklist.html#s6>


Okay, that's good advice.

Now, I really, really don't want to start a fight ... :-) ... so I
will just ask, in a general, vague way, if this "widespread
browser support" involves NN4 or IE4 not supporting it,


The page says "See Note C" and refers to
<http://ppewww.ph.gla.ac.uk/~flavell/charset/checklist.html#NoteUTF>
<http://ppewww.ph.gla.ac.uk/~flavell/charset/quick#cons>
where you can read that Netscape 4 in particular benefits from
the &#number; expressions.

You can test your browser(s) here:
<http://www.unics.uni-hannover.de/nhtcapri/multilingual2.html>

--
Top-posting.
What's the most irritating thing on Usenet?

Jul 20 '05 #20
