
Understanding simplest HTML page

P: n/a
I have been trying to get a better understanding of simple HTML, but I
am finding that conflicting information is very common. Not only that, even
in what seemed elementary and without any possibility of getting it wrong,
it seems I am on very shaky ground.

For example, pretty much every book and web course on html that I have
read tells me I must include <html>, <head> and <body> tag pairs.

I have always done that, and never questioned it. It seemed perfectly
reasonable to me (and still does) to split meta information from
presented content, and indeed to require that browsers be told the
content was html. Although I guess having a server present mime type
text/html covers whether contents are html, as does having a doctype.

However on reading http://www.w3.org/TR/html401/struct/global.html I
noticed that the html, head and body tags were optional (although the
title tag is required). So I did a test page

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<title>Test whether required in head</title>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">
<p>Paragraph of text

This validates without any warning.

If you leave out either the title or some body content it will not
validate. So the validator at least is making an assumption about what
is head and what is body. I would imagine most user agent parsers would
also.
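
For comparison, here is the same test page with the optional tags
written out explicitly; as far as I understand it, this is simply what
a parser infers from the short version anyway:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<title>Test whether required in head</title>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">
</head>
<body>
<p>Paragraph of text
</body>
</html>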

Does anyone have any suggestions about good tutorial texts about html
that get everything correct? At the moment I am gradually going through
the W3C documentation, but I tend to find myself missing some of the
implications.

--
http://www.ericlindsay.com
Nov 23 '05 #1
82 Replies


P: n/a
In our last episode,
<NO********************************@freenews.iinet.net.au>, the
lovely and talented Eric Lindsay broadcast on
comp.infosystems.www.authoring.html:
I have been trying to get a better understanding of simple HTML, but I
am finding that conflicting information is very common. Not only that, even
in what seemed elementary and without any possibility of getting it wrong,
it seems I am on very shaky ground.

For example, pretty much every book and web course on html that I have
read tells me I must include <html>, <head> and <body> tag pairs.
It would be a lot easier just to learn to read DTDs and to read
the DTD you are using.

In many html DTDs the only tags actually required are
<title></title>. The html, head, and body tags can be implied.
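
For instance, the relevant declarations in the HTML 4.01 Strict DTD
read roughly like this (the two characters after the element name say
whether the start-tag and end-tag may be omitted: O means omissible,
a hyphen means the tag is required):

<!ELEMENT HTML O O (%html.content;)    -- document root element -->
<!ELEMENT HEAD O O (%head.content;) +(%head.misc;) -- document head -->
<!ELEMENT TITLE - - (#PCDATA) -(%head.misc;) -- document title -->
<!ELEMENT BODY O O (%block;|SCRIPT)+ +(INS|DEL) -- document body -->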

The first thing in a document that cannot be a head element is
assumed to close the head element and open the body element. If
you have a bit of stray plain text in what you think is the head
element, it closes the head element and opens the body element,
and thus if you have </head><body> tags later in the document,
validators or lints will complain.
I have always done that, and never questioned it. It seemed perfectly
reasonable to me (and still does) to split meta information from
presented content, and indeed to require that browsers be told the
content was html. Although I guess having a server present mime type
text/html covers whether contents are html, as does having a doctype.

However on reading http://www.w3.org/TR/html401/struct/global.html I
noticed that the html, head and body tags were optional (although the
title tag is required). So I did a test page

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<title>Test whether required in head</title>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">
<p>Paragraph of text

This validates without any warning.

Right. Just as it should. <p> causes </head><body> to be
implied. (Inferred, for English majors.)
If you leave out either the title or some body content it will not
validate. So the validator at least is making an assumption about what
is head and what is body. I would imagine most user agent parsers would
also.

Does anyone have any suggestions about good tutorial texts about html
that get everything correct? At the moment I am gradually going through
the W3C documentation, but I tend to find myself missing some of the
implications.


As I have said, these things are much easier if you learn to
read the DTDs. The DTD you declare for your document is the
law, and everything else is commentary -- and some of it is
unreliable commentary. It is good advice to form the habit of
properly nesting explicit tags, even when some of them could be
implied. But it makes it a lot easier to understand validator
output if you know what tags might have been implied. For
example, if the validator tells you that you tried to put a meta
in the body of your document, you might go over your document
for a long time checking that all the metas were before the
</head> without realizing that the validator is trying to tell you
that you have some stray body element (which might be plain
text) inside <head></head> before a meta tag. That caused the
validator to infer </head> so the following meta tag looked like
it was in the body.

For example, you get something like this:

onsgmls:joe.tmp:7:78:E: document type does not allow element "LINK" here
onsgmls:joe.tmp:8:6:E: end tag for element "HEAD" which is not open
onsgmls:joe.tmp:9:5:E: document type does not allow element "BODY" here

The document is something like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html lang="en-us">
<head>
<title>An Example</title>Hi, world!
<link rel="start" href="index.html" title="My Index"type="text/html" lang="en">
</head>
<body>

<p>Hey, y'all</p>

</body>
</html>

Well, it is easy enough to spot the Hi, world! in this simple
example. That closed the head element and opened the body
element. As a result the link is in the body element, a no-no.
There is a </head> after the head is closed and <body> when the
body is already open (and another <body> is not allowed). In a
forest of link, meta, and possibly script elements, a couple of
loose plain-text characters might be hard to spot, and if you
didn't know what the validator was trying to tell you, you
wouldn't even know what to look for.
Information on how to read DTDs is here:

<http://www.w3.org/TR/REC-html40/intro/sgmltut.html>
--
Lars Eighner us****@larseighner.com http://www.larseighner.com/
I have not seen as far as others because giants were standing on my shoulders.
Nov 23 '05 #2

P: n/a
Lars Eighner wrote:

Right. Just as it should. <p> causes </head><body> to be
implied. (Inferred, for English majors.)

"Implied" is correct. The rendering engine would infer it.

--
jmm (hyphen) list (at) sohnen-moe (dot) com
(Remove .AXSPAMGN for email)
Nov 23 '05 #3

P: n/a
In article <sl********************@goodwill.io.com>,
Lars Eighner <us****@larseighner.com> wrote:
Information on how to read DTDs is here:

<http://www.w3.org/TR/REC-html40/intro/sgmltut.html>


Thanks for the examples Lars. I had actually been to the page on
reading the DTDs, but by then my head was already full, so I only
skimmed it and bookmarked it for later. I've left it open in the
browser and I'll read it carefully tomorrow.

I would still like an easier text, if one exists. Especially as writing
a valid and correct web page does not seem a sufficient skill to ensure
all browsers will cope.

--
http://www.ericlindsay.com
Nov 23 '05 #4

P: n/a
On Thu, 17 Nov 2005, Eric Lindsay wrote:

[ reading an SGML DTD ]
I would still like an easier text, if one exists.
The one that I used, years ago, has disappeared now, but
you could find it at the wayback machine http://www.archive.org/
The url to look for is http://www.awpa.asn.au/html/dtdtrees/sgml.html
Especially as writing a valid and correct web page does not seem a
sufficient skill to ensure all browsers will cope.


I'm a great supporter of the idea of writing syntactically valid HTML,
but no-one should hold unrealistic expectations of its benefits for
browsers to "cope". After all, *most* of what they're presented with
is tag soup - much of it beyond redemption. Given that browser
developers feel it unwise to just say "invalid page" (the web might
have been a better place if they had done that from the outset, but
it's too late to start now), they have to make some effort to guess
what the author intended, and much of the browser development effort
must have gone into that.

Having said that, it doesn't persuade me to rely on browser fixups for
my own content, and I'm certainly not recommending it to anyone else.
I'm only saying one should not have *unrealistic* expectations of the
benefits of valid code. I can show you at least one place where the
HTML will almost certainly "work" when the validator says it's
invalid, but will likely fail (to do what the author intended) when
it's reported to be valid:
http://ppewww.ph.gla.ac.uk/~flavell/...url.html#tests

(The reason being that the "valid" HTML is indeed valid, but does not
mean what the author supposed it to mean.)
Nov 23 '05 #5

P: n/a
On Thu, 17 Nov 2005, Eric Lindsay wrote:
So I did a test page

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">


You should not get into the habit of specifying the encoding (charset)
of your page through such a META thingy. Rather, set the HTTP
charset parameter in your server software.
http://www.w3.org/International/O-HTTP-charset.html
http://ppewww.ph.gla.ac.uk/~flavell/...t/ns-burp.html
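
For example, the header the server should send looks something like
this (an illustration of the HTTP response header itself, not anything
you put inside the page):

Content-Type: text/html; charset=ISO-8859-1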

--
Netscape 3.04 does everything I need, and it's utterly reliable.
Why should I switch? Peter T. Daniels in <news:sci.lang>

Nov 23 '05 #6

P: n/a
In article
<Pi*************************************@s5b004.rrzn.uni-hannover.de>,
Andreas Prilop <nh******@rrzn-user.uni-hannover.de> wrote:
On Thu, 17 Nov 2005, Eric Lindsay wrote:
So I did a test page

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">

However if you don't set it, the W3C validator gives you a warning, even
though I believed that particular charset was considered a default value
for web pages that didn't otherwise specify what they were using.

I have been using ISO-8859-1 for ages because I thought it was a safe
fallback for a person authoring in English, especially since almost all
my pages were done on an old Windows PC with a text editor. Certainly
not set up to do much in the way of foreign characters. Now I have a
Mac, UTF-8 seems the default and it does handle foreign characters.
You should not get into the habit to specify the encoding (charset)
of your page through such a META thingy. Rather set the HTTP
charset parameter in your server software.
http://www.w3.org/International/O-HTTP-charset.html
http://ppewww.ph.gla.ac.uk/~flavell/...t/ns-burp.html


I had read Alan's page about setting that in the server options, and I'm
still trying to understand what I should be doing about it.

As far as I know I have no direct control over what HTTP headers are
provided by any of the servers involved. I don't have access to the
.htaccess file, in the cases where it exists. Indeed, how likely is the
average web page writer to know (or care) anything about it?

Also, what are you to do in the event that your web page may actually
contain multiple languages? I guess you declare UTF-8 (in the http
headers if you find out how). However again I was under the impression
there were browser problems with this also.

So I guess what I should do is write a bunch of test pages, in HTML 4.01
Strict and XHTML 1 Strict, with and without the 8859-1 meta, and
possibly another set with and without UTF-8, and put them on each of the
three servers I can get at, and see if I can find out what they serve?

--
http://www.ericlindsay.com
Nov 23 '05 #7

P: n/a
Eric Lindsay <NO**********@ericlindsay.com> wrote:
I have been trying to get a better understanding of simple HTML, but I
am finding that conflicting information is very common.
Indeed, but you _do_ know where to find the authoritative specifications, do
you not? ( http://www.w3.org/Markup )
For example, pretty much every book and web course on html that I have
read tells me I must include <html>, <head> and <body> tag pairs.
None of the books or web pages I've written, except, of course, when they
discuss XHTML. Up to and including HTML 4.01, those tags are optional, though
often recommended for clarity. The particular reason to use <html> is that
it's the place where you should put the lang attribute that specifies the
human language used on the page, e.g. <html lang="en"> for English.
It seemed perfectly
reasonable to me (and still does) to split meta information from
presented content,
They are still in separate elements; the tags just make this explicit.
and indeed to require that browsers be told the
content was html.
They are supposed to know it from the media type (MIME type) announced by the
server. In fact, one of the _problems_ is that Internet Explorer sniffs at
the content, looking for tags, instead of honoring the protocol, so that you
cannot send an HTML document as text/plain and expect IE to do the right
thing (display it as such).
Although I guess having a server present mime type
text/html covers whether contents are html,
Indeed.
as does having a doctype.
The DOCTYPE declaration should have no effect on this, though in practice it
may, due to IE's sniffing (mis)behavior.
However on reading http://www.w3.org/TR/html401/struct/global.html I
noticed that the html, head and body tags were optional (although the
title tag is required).
Right.
If you leave out either the title or some body content it will not
validate. So the validator at least is making an assumption about what
is head and what is body.
You could put it that way. A validator is required to infer start and end
tags under certain conditions.
I would imagine most user agent parsers would also.


They do, even though they are rather different from correct parsers in
general. (Browsers don't really parse HTML correctly; HTML as an SGML
application is largely just theory.)

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html

Nov 23 '05 #8

P: n/a
Eric Lindsay wrote:
As far as I know I have no direct control over what HTTP headers are
provided by any of the servers involved. I don't have access to the
.htaccess file, in the cases where it exists.
You create the .htaccess file yourself and place it in your public_html
directory (it can go in sub directories too, if you want to limit its
scope). All good web hosts that use Apache enable .htaccess for this
purpose. Just create a new text file, name it .htaccess, add this
directive and upload it to your server.

AddDefaultCharset ISO-8859-1

You can also set that to UTF-8 if you like, or any other encoding, but
you mentioned earlier that you were using ISO-8859-1. Look up .htaccess
files and the AddDefaultCharset and AddCharset directives. The
documentation on apache.org will be the best source.
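
A slightly fuller sketch of such a .htaccess file, just to illustrate
the directives mentioned above (the .utf8 extension is only an example;
adjust names and encodings to whatever you actually use):

# send charset=ISO-8859-1 with anything that has no more specific charset
AddDefaultCharset ISO-8859-1
# send charset=UTF-8 with files that carry a .utf8 extension
AddCharset UTF-8 .utf8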
Indeed, how likely is the average web page writer to know (or care)
anything about it?
Most don't; everyone should.
Also, what are you to do in the event that your web page may actually
contain multiple languages? I guess you declare UTF-8 (in the http
headers if you find out how).
Multiple languages have little to do with the encoding of the document,
although using an encoding that supports all the characters in use
within the document is advisable as it avoids the need for character
references. UTF-8 is, of course, usually the best option.
So I guess what I should do is write a bunch of test pages, in HTML 4.01
Strict and XHTML 1 Strict, with and without the 8859-1 meta,


The meta element is completely useless in XHTML documents when they are
served properly. Assuming you're serving it with an XML MIME type, the
XML rec defines how to determine the character encoding from the HTTP
headers (or other transport protocol info), the XML declaration or
default to UTF-8/UTF-16 depending on the BOM. If they're served as
text/html, there's absolutely no practical difference between HTML4 and
XHTML1; both will be parsed with tag soup rules.
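
For instance, when an XHTML document really is served as XML, the
encoding can be given in an XML declaration on the very first line,
something like:

<?xml version="1.0" encoding="UTF-8"?>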

--
Lachlan Hunt
http://lachy.id.au/
http://GetFirefox.com/ Rediscover the Web
http://GetThunderbird.com/ Reclaim your Inbox
Nov 23 '05 #9

P: n/a
On Thu, 17 Nov 2005 09:08:22 +1000, Eric Lindsay
<NO**********@ericlindsay.com> wrote:
For example, pretty much every book and web course on html that I have
read tells me I must include <html>, <head> and <body> tag pairs.


These tags are required, but they're required by external good-practice
coding standards, not by the specification itself.

A correct HTML parser is based around SGML practice and the HTML DTD
(either referenced by the doctype or assumed). This parser "knows" that
a HTML document must have a <html> element at the root, <head> and
<body> children within that, and a number of tags that can only appear
within the <body> (and much else besides).

Note also that your terminology of "tag pairs" is useful, because pairs
of tags in a document are only "tag pairs" - they're not an "element"
until _after_ they've been parsed. An element is still produced by the
parser whether the tags are paired and valid or not (if the parser
doesn't give up altogether).

A HTML document containing a <p> start tag alone _MUST_ place this
within the <html> and <body> elements, because that's the only place
it's allowed to be -- even if you never use a <html> or <body> tag.

There are a few problems with this.

- it's not human friendly.

- it relies on correct parser behaviour

- if the document is invalid HTML, then it's hard to recover when so
much is based on inference. Is the sequence <p><link> a <head> with a
spurious <p> in it, or is it a <body> with a spurious <link> in it ?

So missing out <body> just isn't a good thing to do. Missing out </p> is
much less critical.

--
Cats have nine lives, which is why they rarely post to Usenet.
Nov 23 '05 #10

P: n/a
In article <Pi******************************@ppepc56.ph.gla.ac.uk>,
"Alan J. Flavell" <fl*****@ph.gla.ac.uk> wrote:
On Thu, 17 Nov 2005, Eric Lindsay wrote:

[ reading an SGML DTD ]
I would still like an easier text, if one exists.
The one that I used, years ago, has disappeared now, but
you could find it at the wayback machine http://www.archive.org/
The url to look for is http://www.awpa.asn.au/html/dtdtrees/sgml.html


Thanks Alan. Reading it now.
I'm a great supporter of the idea of writing syntactically valid HTML,
but no-one should hold unrealistic expectations of its benefits for
browsers to "cope".
I think I have recovered from the unrealistic expectations, but then I
went into a keep-it-so-simple-nothing-can-break mode. That in its way
is equally unrealistic, because all I've been doing is pages with really
simple text and images (which was often all that was needed).

I think what I need to start looking at is classifying CSS into ...
Well, I was about to say works and doesn't work. However works and
doesn't work is too sharp a distinction, and part of the reason I have
never pushed CSS. I guess what I really mean is "works (in recent
browsers) and does NOT cause catastrophic problems in IE" vs "causes all
manner of problems if IE sees it."

That example (in another thread) of using CSS3 and a background image to
show which links are external to a site is exactly what I should be
seeking. It works in modern browsers like Safari and Firefox. Doesn't
work at all in IE6, but all that means is the IE experience isn't as
good. The links still work as links. I need to find more like that,
and start using them.
HTML will almost certainly "work" when the validator says it's
invalid, but will likely fail (to do what the author intended) when
it's reported to be valid:
http://ppewww.ph.gla.ac.uk/~flavell/...url.html#tests


Yes, I think I ran across that page when I read through a bunch of your
web pages about web matters yesterday. Thanks

--
http://www.ericlindsay.com
Nov 23 '05 #11

P: n/a
Andy Dingley wrote:
A HTML document containing a <p> start tag alone _MUST_ place this
within the <html> and <body> elements, because that's the only place
it's allowed to be -- even if you never use a <html> or <body> tag.

There are a few problems with this.

- it's not human friendly.
That's a matter of opinion; some people prefer not having unnecessary
tags in the source.
- it relies on correct parser behaviour
Indeed, and that's the best practical reason to always include them. I
know at least IE has serious problems with this:

<!DOCTYPE html ... >
<title>Test</title>
<form ... >
<p>test
</form>

IE will put the form element within the head and the p element within
the body. Using at least the body start-tag does fix that erroneous
behaviour.
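
In other words, a minimal sketch of the fix is just the same fragment
with an explicit body start-tag before the form:

<!DOCTYPE html ... >
<title>Test</title>
<body>
<form ... >
<p>test
</form>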
- if the document is invalid HTML, then it's hard to recover when so
much is based on inference. Is the sequence <p><link> a <head> with a
spurious <p> in it, or is it a <body> with a spurious <link> in it ?


The head element wouldn't be implied, unless it was a valid child of the
p element, which it is not. As for the link implying the end of the p
element, I'm not sure of the exact SGML rules in this case, but the
validator shows (in the parse tree) that it remains a child of the p
element.

--
Lachlan Hunt
http://lachy.id.au/
http://GetFirefox.com/ Rediscover the Web
http://GetThunderbird.com/ Reclaim your Inbox
Nov 23 '05 #12

P: n/a
On Fri, 18 Nov 2005, Eric Lindsay wrote:
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">

However if you don't set it, the W3C validator gives you a warning,


No, it doesn't. Example:
http://www.unics.uni-hannover.de/nht...european.html1
I have been using ISO-8859-1 for ages because I thought it was a safe
fallback for a person authoring in English, especially since almost all
my pages were done on an old Windows PC with a text editor.
That's perfectly okay and I said nothing against ISO-8859-1.
As far as I know I have no direct control over what HTTP headers are
provided by any of the servers involved. I don't have access to the
.htaccess file, in the cases where it exists.
First you need to know your actual server software. For configuration,
you might then post to
<news:comp.infosystems.www.servers.ms-windows>
<news:comp.infosystems.www.servers.unix>
Also, what are you to do in the event that your web page may actually
contain multiple languages? I guess you declare UTF-8 (in the http
headers if you find out how). However again I was under the impression
there were browser problems with this also.


If some browser has problems with "charset=UTF-8", adding a
<META ... charset=UTF-8> doesn't help at all.

--
Netscape 3.04 does everything I need, and it's utterly reliable.
Why should I switch? Peter T. Daniels in <news:sci.lang>

Nov 23 '05 #13

P: n/a
On Fri, 18 Nov 2005, Eric Lindsay wrote:
Andreas Prilop <nh******@rrzn-user.uni-hannover.de> wrote:
On Thu, 17 Nov 2005, Eric Lindsay wrote:
So I did a test page

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">
However if you don't set it, the W3C validator gives you a warning,
"If you don't set" what? The validator correctly requires you to
specify the character encoding, by at least one route.

There are theoretical reasons which favour specifying it on the
real HTTP header. Once that happens, there is no point (as far as the
recipient is concerned) in also specifying it via meta http-equiv,
since the real HTTP header has priority.

There are a number of possible strategies: some people will promote
one, some will promote another. All of them have some kind of snag,
but what Andreas is promoting (and I do too) has a consistency to it.

(These arguments are for text/* content types: none of the arguments
transfer unmodified to application/xml+whatever, however, where
different considerations apply. There are discussion documents
about it at the W3C.)
even though I believed that particular charset was considered a
default value for web pages that didn't otherwise specify what they
were using.
That's a complex story, but you do better to put no reliance on that.
I have been using ISO-8859-1 for ages because I thought it was a safe
fallback for a person authoring in English,
Certainly is! Netscape 4.* would handle it wrongly when you included
&#bignumber; references to Unicode characters, but otherwise all
widely-used browsers know how to deal with it.
http://www.w3.org/International/O-HTTP-charset.html
http://ppewww.ph.gla.ac.uk/~flavell/...t/ns-burp.html


I had read Alan's page about setting that in the server options, and I'm
still trying to understand what I should be doing about it.


Server-dependent. The w3's page includes hints for popular server
software, but you may also have to deal with whatever configuration
options have been selected by your service provider.
Also, what are you to do in the event that your web page may actually
contain multiple languages?
Don't confuse language with character encoding! English is still
English, even when transcribed into Japanese characters.
I guess you declare UTF-8 (in the http headers if you find out how).
However again I was under the impression there were browser problems
with this also.
Not with any halfways web-compatible browser. utf-8 was even handled
correctly by that old Netscape 4.* thing, and older versions of MSIE
for quite some years now.
So I guess what I should do is write a bunch of test pages, in HTML
4.01 Strict and XHTML 1 Strict, with and without the 8859-1 meta,
and possibly another set with and without UTF-8, and put them on each
of the three servers I can get at, and see if I can find out what
they serve?


It's good to play with the options and find out what happens, for
one's own education, indeed. We can offer you advice on which options
to use, but until you've played with them and developed confidence in
how they behave, the concepts probably won't settle in.

For maximal compatibility for different character repertoires, start
here http://ppewww.ph.gla.ac.uk/~flavell/charset/checklist

But if you're confident of being able to author in, and handle, utf-8,
feel free to go right ahead and use it.

Nov 23 '05 #14

P: n/a
Eric Lindsay wrote:
In article
<Pi*************************************@s5b004.rr zn.uni-hannover.de>,
Andreas Prilop <nh******@rrzn-user.uni-hannover.de> wrote:
On Thu, 17 Nov 2005, Eric Lindsay wrote:
> So I did a test page
>
> <META HTTP-EQUIV="Content-Type" CONTENT="text/html;
> charset=ISO-8859-1">


However if you don't set it, the W3C validator gives you a warning, even
though I believed that particular charset was considered a default value
for web pages that didn't otherwise specify what they were using.

I have been using ISO-8859-1 for ages because I thought it was a safe
fallback for a person authoring in English, especially since almost all
my pages were done on an old Windows PC with a text editor. Certainly
not set up to do much in the way of foreign characters. Now I have a
Mac, UTF-8 seems the default and it does handle foreign characters.
You should not get into the habit to specify the encoding (charset)
of your page through such a META thingy. Rather set the HTTP
charset parameter in your server software.
http://www.w3.org/International/O-HTTP-charset.html
http://ppewww.ph.gla.ac.uk/~flavell/...t/ns-burp.html


I had read Alan's page about setting that in the server options, and I'm
still trying to understand what I should be doing about it.

As far as I know I have no direct control over what HTTP headers are
provided by any of the servers involved. I don't have access to the
.htaccess file, in the cases where it exists. Indeed, how likely is the
average web page writer to know (or care) anything about it?

Also, what are you to do in the event that your web page may actually
contain multiple languages? I guess you declare UTF-8 (in the http
headers if you find out how). However again I was under the impression
there were browser problems with this also.

So I guess what I should do is write a bunch of test pages, in HTML 4.01
Strict and XHTML 1 Strict, with and without the 8859-1 meta, and
possibly another set with and without UTF-8, and put them on each of the
three servers I can get at, and see if I can find out what they serve?

You can create your own .htaccess files. Windows won't (or wouldn't) let
you create a file with a name that is "all extension" so I once created a
file with a name that kept Windows happy, then changed it once it was on
the server. I am now running Linux, and have no problems. The problem
then was that dot-files were invisible.

Doug.
--
Registered Linux User No. 277548. My true email address has hotkey for
myaccess.
A man's feet should be planted in his country, but his eyes should survey
the world.
- George Santayana.

Nov 23 '05 #15

P: n/a
.htaccess reference:

http://www.javascriptkit.com/howto/htaccess.shtml

--
James Pickering
http://jp29.org/

Nov 23 '05 #16

P: n/a
In article <Xn****************************@193.229.4.246>,
"Jukka K. Korpela" <jk******@cs.tut.fi> wrote:
Eric Lindsay <NO**********@ericlindsay.com> wrote:
I have been trying to get a better understanding of simple HTML, but I
am finding conflicting information is very common.
Indeed, but you _do_ know where to find the authoritative specifications, do
you not? ( http://www.w3.org/Markup )


Thanks for that pointer. There are so many pages on the W3C site that I
don't believe I had read that one. I mostly had pointers to specific
topics, plus bits of the site I stumbled upon. I am finding it hard to
absorb everything available on the topic.
The particular reason to use <html> is that
it's the place where you should put the lang attribute that specifies the
human language used on the page, e.g. <html lang="en"> for English.


Thank you for that Jukka. I had totally missed incorporating that in
the notes I took when visiting W3C and your own site.
and indeed to require that browsers be told the
content was html.


They are supposed to know it from the media type (MIME type) announced by the
server. In fact, one of the _problems_ is that Internet Explorer sniffs at
the content, looking for tags, instead of honoring the protocol, so that you
cannot send an HTML document as text/plain and expect IE to do the right
thing (display it as such).


I must admit that whole "MIME type announced by the server" idea worries
me. Most people putting web pages, whether HTML or XHTML, on a server,
are likely to have very little control over whether html is served as
text/html (although it mostly seems to be done), or whether xhtml 1.0 or
1.1 is served as text/html or application/xhtml-xml. Given Internet
Explorer actions, that area seems very messy. Deciding to use HTML 4.01
rather than XHTML was a result of learning (via this newsgroup) about
that problem.

--
http://www.ericlindsay.com
Nov 23 '05 #17

P: n/a
In article <Pi*******************************@ppepc56.ph.gla.ac.uk>,
"Alan J. Flavell" <fl*****@ph.gla.ac.uk> wrote:
On Fri, 18 Nov 2005, Eric Lindsay wrote:
> <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">
However if you don't set it, the W3C validator gives you a warning,


"If you don't set" what? The validator correctly requires you to
specify the character encoding, by at least one route.


OK, let me see if I have this. Obviously if I can guarantee that every
server I have a web page on sends a character specification in the HTTP
header, then I should never need to use <meta http-equiv= etc. Plus the
HTTP header takes precedence anyway, as you point out.
There are theoretical reasons which favour specifying it on the
real HTTP header. Once that happens, there is no point (as far as the
recipient is concerned) in also specifying it via meta http-equiv,
since the real HTTP header has priority.
However the HTTP header applies only to pages served from a web server,
and the person doing the web page may not have managed to organise the
correct header. In particular, if a server claims all web pages are,
say ISO 8859-1 but the author has actually written something else, then
the page may make no sense. So if that is at all likely, serving using
UTF-8 seems a better choice.

The transition from an old web site to new also seems likely to be a
problem area. Say all my old pages are ISO-8859-1 and a rewrite isn't
likely, and all the new ones are UTF-8. If the server HTTP headers all
claim ISO-8859-1, but the pages are UTF-8 ... well, I guess people
writing in English have some advantages from the equivalence of the
first 128 characters. However surely this could cause problems for
non-English pages?

If the file is copied to someplace else and served from an alternative
server (Google or Wayback Machine cache?), then even if the original
server did the correct HTTP header, the cached version may not.

Equally if a copy of the web page on a local file system is opened with
a browser, then there is no HTTP header available. To me, that seems a
strong argument for still including <meta http-equiv= in HTML files.
(These arguments are for text/* content types: none of the arguments
transfer unmodified to application/xml+whatever, however, where
different considerations apply. There are discussion documents
about it at the W3C.)
Yes. I read through a 50+ slide tutorial, plus a bunch of other
documents on just that point. I am not sure I am ever going to make
sense of my notes on the topic. Luckily I am only planning on using
HTML, so that makes it simpler for me.
I have been using ISO-8859-1 for ages because I thought it was a safe
fallback for a person authoring in English,


Certainly is!

Also, what are you to do in the event that your web page may actually
contain multiple languages?


Don't confuse language with character encoding! English is still
English, even when transcribed into Japanese characters.


OK. I was not clearly seeing the distinction between the character
encoding and the language specified. Re-reading those notes.
For maximal compatibility for different character repertoires, start
here http://ppewww.ph.gla.ac.uk/~flavell/charset/checklist
Yes, I recall reading that a while ago, but not being confident at the
time of being able to author in UTF-8.
But if you're confident of being able to author in, and handle, utf-8,
feel free to go right ahead and use it.


Seems I may as well make the move entirely to UTF-8, and advertise my
pages as such. Using &entityname; for Latin-1 and &#bignumber; seems
easier on new pages than the other UTF-8 alternative. May have a few old
pages that break (I think I may have Pound symbols as &#smallnumber; in
a few pages), but most should turn up in a search (I hope). I don't
believe I need to specially support NN4 or IE3 at this point. That
would have advantages if quoting other languages.

Thanks Alan (and for doing your fine pages).

--
http://www.ericlindsay.com
Nov 23 '05 #18

P: n/a
In article <Z9******************@news-server.bigpond.net.au>,
Lachlan Hunt <sp***********@gmail.com> wrote:
You create the .htaccess file yourself and place it in your public_html
directory (it can go in sub directories too, if you want to limit its
scope). All good web hosts that use Apache enable .htaccess for this
purpose. Just create a new text file, name it .htaccess, add this
directive and upload it to your server.

AddDefaultCharset ISO-8859-1
Thanks Lachlan. Two out of three of the servers seem to be running
Apache. One won't tell me what it is using as a server (maybe it thinks
that is a security risk).
UTF-8 is, of course, usually the best option.
I will change to it. More for the sake of potential future material.
So I guess what I should do is write a bunch of test pages, in HTML 4.01
Strict and XHTML 1 Strict, with and without the 8859-1 meta,


The meta element is completely useless in XHTML documents when they are
served properly. Assuming you're serving it with an XML MIME type, the
XML rec defines how to determine the character encoding from the HTTP
headers (or other transport protocol info), the XML declaration or
default to UTF-8/UTF-16 depending on the BOM.


But if I have understood the matter, if XHTML is served properly that
means application/html-xml and then IE doesn't like it? Luckily I have
decided to stick with HTML.
If they're served as
text/html, there's absolutely no practical difference between HTML4 and
XHTML1; both will be parsed with tag soup rules.


Have I misunderstood this? If I declare HTML 4.01 Strict, I was under
the impression that all recent browsers, including IE, would NOT use
quirks mode. However IE would use quirks mode on XHTML served as
text/html.

--
http://www.ericlindsay.com
Nov 23 '05 #19

P: n/a
Eric Lindsay <NO**********@ericlindsay.com> wrote:
I must admit that whole "MIME type announced by the server" idea worries
me. Most people putting web pages, whether HTML or XHTML, on a server,
are likely to have very little control over whether html is served as
text/html (although it mostly seems to be done), or whether xhtml 1.0 or
1.1 is served as text/html or application/xhtml-xml.


Things might have gone that way for various reasons, but they didn't use to
be that way. It was - and often still is - normal that an author can set such
things on a per-directory per-filename-extension basis, typically using
a .htaccess file. The problem used to be ignorance of such matters.

Logically, the media type of data needs to be expressed outside the data
itself, unless we impose artificial restrictions on data. It is grossly
illogical (Münchhausenian, I would say) to have stuff like
<meta http-equiv="Content-Type" content="text/html;charset=iso-8859-1">
inside an HTML document, since by the time that a program has parsed and
recognized the tag, it has already decided to treat the data as HTML
(and in iso-8859-1 encoding). There are practical reasons why the stuff
may be useful for specifying the _character encoding_ (charset), but
the media type text/html is there just because that's the way to
make the attribute conform to HTTP header value format.

--
Yucca, http://www.cs.tut.fi/~jkorpela/
Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html

Nov 23 '05 #20

P: n/a
On Sat, 19 Nov 2005, Jukka K. Korpela wrote:
Logically, the media type of data needs to be expressed outside the
data itself, unless we impose artificial restrictions on data. It is
grossly illogical (Münchhausenian, I would say) to have stuff like
<meta http-equiv="Content-Type"
content="text/html;charset=iso-8859-1"> inside an HTML document,
since by the time that a program has parsed and recognized the tag,
it has already decided to treat the data as HTML
right
(and in iso-8859-1 encoding).


Well, at least in /some/ encoding which has us-ascii as a subset,
since it needs nothing more to read and parse that initial part of the
document.

Where the logic goes wrong is where there is on-the-fly transcoding
of document content. Few people (statistically) ever meet this
situation in practice, but it represents an important theoretical
principle nevertheless.

As an example, Windows' native internal coding for i18n purposes is
utf-16. On some platforms, as another example, the native internal
coding is still EBCDIC. Consider an implementation which decided to
store web documents in the native encoding, and to serve them out with
on-the-fly transcoding into a suitable encoding for HTTP transmission.
What should the "meta http-equiv" say then? The one in the file would
be unsuitable for HTTP transmission, and vice versa.

This also used to be (and maybe still is?) an issue for the Russian
version of Apache, since several different 8-bit encodings existed for
Russian and had been used alongside each other, with some recipients
preferring one and some preferring another. This actual page
http://apache.lexa.ru/english/internals.html might be outdated, but
gives an idea of what they did. See also
http://apache.lexa.ru/english/meta-http-eng.html about
misbehaviour by now-outdated Netscape and IE versions.

A plain text file (text/plain) can be transcoded at the character
level only, without any concern for the data it contains, and the
resulting encoding advertised to the recipient via the real HTTP
header. With HTML, on the other hand, if it contains "meta charset"
(or XML contains the ?xml encoding= thing), then one would want also
to parse the data looking for that inappropriate internal definition,
and to modify or remove it, to avoid confusion. (Yes, this is one of
the reasons why the real HTTP charset is defined to overrule the meta
http-equiv charset; but it's clearly inadvisable to retain a meta
charset within the document when it advertises a charset which has
nothing to do with the current properties of the document instance,
and was only a left-over from the document's internal file storage).

--

Nov 23 '05 #21

P: n/a
Eric Lindsay wrote:
But if I have understood the matter, if XHTML is served properly that
means application/html-xml
You need to get the maths right. You add xml, not subtract it, and it's
xhtml, not html: application/xhtml+xml
and then IE doesn't like it?


application/xhtml+xml + IE = Save As... dialog
application/xhtml+xml + Google = File Format: Unrecognised.
If they're served as
text/html, there's absolutely no practical difference between HTML4 and
XHTML1; both will be parsed with tag soup rules.


Have I misunderstood this? If I declare HTML 4.01 Strict, I was under
the impression that all recent browsers, including IE, would NOT use
quirks mode. However IE would use quirks mode on XHTML served as
text/html.


No, you misunderstood me. I was not referring to the differences
between quirks mode and standards mode, I was referring to the
differences between XML and tag-soup parsing rules.

While, for text/html, the parsing rules in standards mode are slightly
different from quirks mode, browsers will generally accept whatever
rubbish you throw at them and try to make some sense out of it,
regardless of the mode. As far as parsing is concerned, standards and quirks
mode are essentially just different error recovery techniques.

--
Lachlan Hunt
http://lachy.id.au/
http://GetFirefox.com/ Rediscover the Web
http://GetThunderbird.com/ Reclaim your Inbox
Nov 23 '05 #22

P: n/a
Tim
On Sat, 19 Nov 2005 19:22:50 +1000, Eric Lindsay sent:
I must admit that whole "MIME type announced by the server" idea worries
me. Most people putting web pages, whether HTML or XHTML, on a server,
are likely to have very little control over whether html is served as
text/html (although it mostly seems to be done), or whether xhtml 1.0 or
1.1 is served as text/html or application/xhtml-xml.


There are ways and means of the user doing so, even when they can't
configure the server: They prepare their data in a specific way (whether
that be certain types of data in certain directories, certain filenaming
schemes, or something else).

Of course, that relies on configuration of the server by someone, and
publication of how to use the server, but then that's the case for any
number of HTTP serving parameters (expiry times, etc.), many of which
cannot be expressed within certain types of data (e.g. JPEGs), at all.

--
If you insist on e-mailing me, use the reply-to address (it's real but
temporary). But please reply to the group, like you're supposed to.

This message was sent without a virus, please destroy some files yourself.

Nov 23 '05 #23

P: n/a
Tim
On Sat, 19 Nov 2005 20:15:18 +1000, Eric Lindsay sent:
If the file is copied to someplace else and served from an alternative
server (Google or Wayback Machine cache?), then even if the original
server did the correct HTTP header, the cached version may not.
That's an issue of original serving and storage, firstly. It's got to be
identified, then stored appropriately (described in a manner handled by
the server). Then, it's an issue of subsequent serving.

Really that's a job that's best done when properly described in the first
place; from then on it doesn't matter how it originated. It's most
sensible to transcribe the content to the current system's default storage
method, and serve all pages the same way.
Equally if a copy of the web page on a local file system is opened with
a browser, then there is no HTTP header available. To me, that seems a
strong argument for still including <meta http-equiv= in HTML files.


If a browser saves a page as a file, it's doing so to be read on that
system. It's most sensible if it saves it in the local system's data
format, no matter how it originated. The methodology of HTML supports
that quite fine (there's an HTML character set, that's how the browser
interprets the content, and it may have to transform the input from its
original transmission medium into the standard form, and can do the same
to the output when displaying locally, and when making a local store). At
this instance, there's no point in having to say what it's written in
within the file, if you always store it in the local system, and use
appropriate tools when passing it on to others.

If the browser's going to be a simpleton, and just store the file as-is,
then it's sensible for it to insert appropriate identification with the
data, somewhere (but better as a file system descriptor, than contents
within the file). And that wouldn't help, very much, if the original data
content was done in a manner that's incompatible with the local file
system.

--
If you insist on e-mailing me, use the reply-to address (it's real but
temporary). But please reply to the group, like you're supposed to.

This message was sent without a virus, please destroy some files yourself.

Nov 23 '05 #24

P: n/a
On Sat, 19 Nov 2005, Eric Lindsay wrote:
"Alan J. Flavell" <fl*****@ph.gla.ac.uk> wrote:
"If you don't set" what? The validator correctly requires you to
specify the character encoding, by at least one route.
OK, let me see if I have this. Obviously if I can guarantee that
every server I have a web page on sends a character specification in
the HTTP header, then I should never need to use <meta http-equiv=
etc. Plus the HTTP header takes precedence anyway, as you point
out.


If you can guarantee that the server sends a *correct* character
coding specification in the HTTP header, then you don't need to use
<meta http-equiv=...>, indeed.

The server sending a *wrong* character coding specification is worse
than anything else, since it overrules any other source of that
information, short of the recipient manually overriding it (which is
of course unacceptable for any production purposes).
There are theoretical reasons which favour specifying it on the
real HTTP header. Once that happens, there is no point (as far as
the recipient is concerned) in also specifying it via meta
http-equiv, since the real HTTP header has priority.


However the HTTP header applies only to pages served from a web
server,


That's correct. If they merely download the HTML page to disk, and
then try to view it from there, they will have problems. But that's
the case in other ways too (a downloaded page will not be able to
resolve relative URL references to stylesheets, images, relative
references to index pages via href="./" , etc.). If they want a
successful local copy on disk, then they must expect to massage the
downloaded HTML page in various ways (some browsers can do this for
them, maybe even transcoding the page into a character encoding of
their choice as well as inserting that into a meta http-equiv before
saving it on disk).
and the person doing the web page may not have managed to organise the
correct header.
RFC2616 takes it for granted that anyone who is doing serious web page
publishing can generate whatever relevant HTTP headers apply, for any
of the resources they are offering, whether they be HTML pages or not.
*How* they do that is their own business, but they *need* some viable
way of doing it. There are many different ways implemented, so it's
hard to offer specific advice without knowing the circumstances.
In particular, if a server claims all web pages are, say ISO 8859-1
but the author has actually written something else, then the page
may make no sense.
That's the reality. If the questioner is stuck with some inadequate
web service provider who forces charset=iso-8859-1 on HTML pages and
offers no possibility for the page owner to choose anything different,
then they have no alternative than to provide their pages in that
encoding. It's as simple as that. But for any kind of serious work,
the page publisher *needs* a way to take control of that, and if the
web service provider won't implement it somehow and tell them what it
is, they have the option to take their web pages elsewhere.
So if that is at all likely, serving using UTF-8 seems a better
choice.
It's an option which has much to commend it, but the author must be
capable of handling it. Even the BBC managed to put invalid character
data into their utf-8-encoded pages sometimes.
The transition from an old web site to new also seems likely to be a
problem area. Say all my old pages are ISO-8859-1 and a rewrite isn't
likely, and all the new ones are UTF-8.
Then you have to choose one of the ways of advertising the coding
correctly, for each page! Or turning off the actual HTTP coding and
using meta, but I've already made it clear that I consider that a
second-line option.
If the file is copied to someplace else and served from an
alternative server (Google or Wayback Machine cache?), then even if
the original server did the correct HTTP header, the cached version
may not.
It's the responsibility of whatever caches the content to maintain any
relevant HTTP headers (or to transcode the content into whatever
canonical character encoding it chooses to use). This is out of your
hands - either they get it right or they don't. (This is a known bug
in Bugzilla, for example, and some of the past discussions and
demonstrations of incorrect character handling in Mozilla are
practically incomprehensible, due to different contributors having
submitted material in different encodings into the bug database and
getting them all displayed in the same web page).
Equally if a copy of the web page on a local file system is opened
with a browser, then there is no HTTP header available.
That's not the only problem with browsing a local file system
directly...

That's only one of the reasons that I consistently recommend to
authors that they should run their own local previewing web server -
for example, one that only accepts access from localhost - and use
that to preview their web pages. Its configuration vis a vis charset
etc. should of course reflect the configuration used on their
production web server.

Apache is available to run on Windows, or Mac OS X, for example, and
can do a pretty fine job of reflecting what the pages are going to
find when they're uploaded to the production web server. Including
any SSI or PHP processing that you might be doing, for example. And
correctly handling things like href="./" , which direct file system
access does not.
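
As a rough sketch, the interesting lines of such a local-only Apache
configuration might look like the following (the port and path are only
placeholders for illustration; the directives themselves are standard):

Listen 127.0.0.1:8080
DocumentRoot "/home/me/www"
AddDefaultCharset UTF-8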
Seems I may as well make the move entirely to UTF-8, and advertise my
pages as such. Using &entityname; for Latin-1 and &#bignumber; seems
easier on new pages than the other UTF-8 alternative. May have a few old
pages that break (I think I may have Pound symbols as &#smallnumber; in
a few pages),


Hang on. Pound sterling is correctly &#163; - what you have to beware
of are the characters of the Windows-specific repertoire 128-159
decimal. For example the euro character is *not* &#128; (but I'd
recommend coding that as &euro; anyway).

--

Nov 23 '05 #25

P: n/a
Alan J. Flavell wrote:
On Sat, 19 Nov 2005, Eric Lindsay wrote:
So if that is at all likely, serving using UTF-8 seems a better
choice.
It's an option which has much to commend it, but the author must be
capable of handling it. Even the BBC managed to put invalid character
data into their utf-8-encoded pages sometimes.


Ideally, the content author should not need to be aware of all the
technical details of using a particular encoding, such issues should be
handled by the CMS/authoring tool programmers and the system administrators.

The content authors should be able to enter whatever characters they
like and the CMS/authoring tool should ensure that they are encoded
correctly. It's extremely easy to validate UTF-8 or ISO-8859-1 input
(other encodings may be harder, but still possible) and, IMHO, there's
no excuse (beyond ignorance) for any CMS to not validate the encoding of
the input, and thus no excuse for invalid character data to appear.
Hang on. Pound sterling is correctly &#163; - what you have to beware
of are the characters of the Windows-specific repertoire 128-159
decimal. For example the euro character is *not* &#128; (but I'd
recommend coding that as &euro; anyway).


For HTML, &euro; is acceptable; it will be handled just fine in all
modern browsers. However, for XHTML, the numeric or hex character
reference is better because entity references require a validating
parser. Besides, I'd recommend just using UTF-8 and entering €.
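
To spell out the alternatives for that one character (all of which
should produce the same euro sign in a modern browser):

&euro;      named entity reference
&#8364;     decimal numeric character reference (U+20AC)
&#x20AC;    hexadecimal character reference
€           the character itself, saved in a UTF-8 encoded file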

--
Lachlan Hunt
http://lachy.id.au/
http://GetFirefox.com/ Rediscover the Web
http://GetThunderbird.com/ Reclaim your Inbox
Nov 23 '05 #26

P: n/a
On Sun, 20 Nov 2005, Lachlan Hunt wrote:
Alan J. Flavell wrote:
On Sat, 19 Nov 2005, Eric Lindsay wrote:
So if that is at all likely, serving using UTF-8 seems a better
choice.
It's an option which has much to commend it, but the author must
be capable of handling it. Even the BBC managed to put invalid
character data into their utf-8-encoded pages sometimes.


Ideally, the content author should not need to be aware of all the
technical details of using a particular encoding, such issues should
be handled by the CMS/authoring tool programmers and the system
administrators.


Ideally, you're right. However, at the discussion level of this
group, I suggest it's still useful to understand the underlying
mechanics, seeing that there are plenty of ways things can go wrong,
as my example from the BBC showed (they had seemingly concatenated
some iso-8859-1 data into their pages for readers from the
subcontinent - which represent Urdu, Bengali and so on using utf-8
encoding).
The content authors should be able to enter whatever characters they
like and the CMS/authoring tool should ensure that they are encoded
correctly.
It's a fine idea, certainly.
It's extremely easy to validate UTF-8 or ISO-8859-1 input
If you mean "on knowing that some non-trivial input must be either
utf-8 or iso-8859-1, it's easy to decide which", then I agree with
you.

utf-8 can be formally validated, although there's some small
probability of faking it. iso-8859-1 is just a sequence of octets:
it can only be verified by some kind of sanity check on its content.
(other encodings may be harder, but still possible)
Mozilla has routines for automatically guessing at character
encodings, which are useful when no other information is available:
but they can go sadly wrong.
and, IMHO, there's no excuse (beyond ignorance) for any CMS to not
validate the encoding of the input, and thus no excuse for invalid
character data to appear.
I'll give you an example. The web is awash with windows-1252 data
purporting to be iso-8859-1. Its authors claim that "it works", and I
can only admit "yes, it gives a fair impression of working" since most
browser developers have felt it necessary to accommodate this misuse.
By rights it should be declared invalid, but that's not going to
happen in this age of the world.

Let's hope for better in the future.

utf-8 is different. There's even a security mandate against
attempting to process invalid coding.
beware of are the characters of the Windows-specific repertoire
128-159 decimal. For example the euro character is *not* &#128;
(but I'd recommend coding that as &euro; anyway).


For HTML, &euro; is acceptable; it will be handled just fine in all
modern browsers.


And will be recognisable even in HTML browsers which for some reason
don't interpret it. Which was the point of recommending it.
However, for XHTML, the numeric or hex character reference is better
because entity references require a validating parser.
Fair comment.
Besides, I'd recommend just using UTF-8 and entering .


For web authoring[1], it's a perfectly fine option, for those who are
comfortable with doing it. As I keep repeating. My point here was to
warn-off anyone who *was* using &-notation, from trying to reference
code points from Windows-1252 (as happens in so much HTML-extruding
software from a certain dominant vendor, until quite recently).

regards

[1] For posting to Usenet, on the other hand, it's not without its
gotchas, as we see here. SCNR.
Nov 23 '05 #27


Alan J. Flavell wrote:
Apache is available to run on Windows, or Mac OS X, for example, and
can do a pretty fine job of reflecting what the pages are going to
find when they're uploaded to the production web server. Including
any SSI or PHP processing ......


As does Zeus

--
James Pickering
http://jp29.org/

Nov 23 '05 #28



Eric Lindsay wrote:
I must admit that whole "MIME type announced by the server" idea worries
me. Most people putting web pages, whether HTML or XHTML, on a server,
are likely to have very little control over whether html is served as
text/html (although it mostly seems to be done), or whether xhtml 1.0 or
1.1 is served as text/html or application/xhtml-xml.


You shouldn't care about what most people do. You should only care
about what *you* do. If you care enough to validate and to post to
comp.infosystems.www.authoring.html you should care enough to host
your pages on a server that lets you control such things. The good
news is that very few servers running Apache don't let you control how
your pages are served.
Nov 23 '05 #29



Eric Lindsay wrote:
Have I misunderstood this? If I declare HTML 4.01 Strict, I was under
the impression that all recent browsers, including IE, would NOT use
quirks mode. However IE would use quirks mode on XHTML served as
text/html.


Check out the "standards mode, quirks mode, almost standards mode"
section of http://en.wikipedia.org/wiki/Compari..._engines_(HTML)

Also see the charts at http://hsivonen.iki.fi/doctype/ and especially
the discussion at http://www.webstandards.org/learn/askw3c/sep2003.html

--
Guy Macon <http://www.guymacon.com/>

Nov 23 '05 #30

On Sun, 20 Nov 2005, Guy Macon wrote:
the discussion at http://www.webstandards.org/learn/askw3c/sep2003.html


Hmmm.

I would tend to take issue with the first part of their statement:

The MIME type indicates to the user-agent (as it receives the
document) how to handle and treat it accordingly, thereby allowing
you to associate a particular application or behavior to the
particular media type in your browser.

The MIME type is /supposed/ to indicate to the user-agent *what kind
of content this is*; it's meant to be up to the *user* to decide how
to associate different kinds of content with different applications or
behaviours in the browser (as, indeed, the latter part of the sentence
goes on to indicate).

Sure - in the majority of cases, the user wants the same as the
content provider wants; but in the event of a disagreement, there
needs to be a clear statement of whose responsibility it is meant to
be to do what. When a content provider sends me e.g MS Word content
(assuming for the moment that I am willing to accept that format) but
describes it as e.g application/download, in an attempt to force it to
be saved to file instead of being opened in my preferred viewer[1], I
then have an authoritative basis for saying that's rude.

[1] whatever that viewer might be (it's none of their business what it
is!).

The way the first part of the quoted sentence is formulated, it almost
seems as if they're inciting content providers to choose a MIME type
according to how the content provider *wishes the browser to handle
it*, instead of using the one that represents an honest declaration of
what kind of content it is.

Maybe it's just an unfortunate choice of wording on their part, but it
seemed to me to convey an inappropriate message.

Furthermore, their well-intentioned idea of having XHTML sent out with
different MIME types to different browsers seems to me to raise more
issues than it solves. If the material is sufficiently back-level (to
XHTML/1.0 Appendix C) then it can be processed as text/html by *any*
browser: sure, it'll be processed by them all as tag soup, but can
still be rendered in Standards Mode rather than quirks mode. Sending
it to e.g Mozilla as full-blown XHTML doesn't gain any clear
advantage, and actually brings some disadvantage for the recipient.
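
For the avoidance of doubt, a minimal Appendix-C-style page looks
something like this (the title and content are placeholders, of course):

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
<title>Appendix C example</title>
</head>
<body>
<p>Empty elements take a space before the slash: <br /></p>
</body>
</html>

No XML declaration (which can tip some browsers into quirks mode), both
lang and xml:lang, and every element explicitly closed - which is what
lets the tag-slurpers digest it as text/html.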

On the other hand, if anything worthwhile is being done with XML (i.e
over and above what's feasible on HTML/4) then it wouldn't be any use
to HTML tag-slurpers anyway, and would have to be served out with a proper
MIME type.

Nov 23 '05 #31

In article <11*************@corp.supernews.com>,
Guy Macon <http://www.guymacon.com/> wrote:
Eric Lindsay wrote:
I must admit that whole "MIME type announced by the server" idea worries
me. Most people putting web pages, whether HTML or XHTML, on a server,
are likely to have very little control over whether html is served as
text/html (although it mostly seems to be done), or whether xhtml 1.0 or
1.1 is served as text/html or application/xhtml-xml.


You shouldn't care about what most people do. You should only care
about what *you* do. If you care enough to validate and to post to
comp.infosystems.www.authoring.html you should care enough to host
your pages on a server that lets you control such things. The good
news is that very few servers running Apache don't let you control how
your pages are served.


Well, I have made some test pages at http://www.ericlindsay.com/testthis
and will add them to the other servers I have access to for further
tests. Because I am trying to understand all this, I am trying to get
back to the simplest possible cases. So I have pages containing plain
text (named test1), HTML 4.01 Strict (test2), XHTML 1.0 (test3) and
XHTML 1.1 (test 4). That server returns everything except .html files
as text/plain. It returns files with the .html (or .htm) extension as
text/html

Since I don't have any XHTML files at this stage I don't need to rush to
change this. However I certainly intend to try it in a test directory
on all servers I can access.

However I am not certain how I should even be checking the HTTP header
from a server. I used curl --head for lack of any better idea. However
I can't see that I could (or should) suggest that to anyone who is
interested. I have no idea whether curl exists on a Windows system but
probably not (I use a Macintosh). In any case, anything that can not be
done with a GUI will not be done by most of the people I talk to. So
would anyone care to suggest better alternatives for Windows and
Macintosh?

Plus I notice another problem with curl. I can see no indication in
HTTP headers supplied via curl that my server is providing a character
specification. I thought something about that should be in the HTTP
header, so I now don't know whether it isn't being supplied by the
server, or whether curl is the wrong thing to use to find out. On the
other hand, I have seen it in an HTTP header from a web site randomly
chosen for being likely not to use English.
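
For example, a session against a made-up host looks like this (the
header values are purely illustrative):

curl --head http://www.example.org/test2.html

HTTP/1.1 200 OK
Date: Mon, 21 Nov 2005 03:15:00 GMT
Server: Apache
Content-Type: text/html

A server that had been told about the character encoding would instead
end with something like "Content-Type: text/html; charset=UTF-8"; mine
just sends the bare type.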

--
http://www.ericlindsay.com
Nov 23 '05 #32

In our last episode, Eric Lindsay <NO**********@ericlindsay.com>
pronounced to comp.infosystems.www.authoring.html:
However I am not certain how I should even be checking the HTTP header
from a server.


If you use Firefox:
http://livehttpheaders.mozdev.org/

Or you can do it online with Rex Swain's HTTP viewer:
http://www.rexswain.com/httpview.html

--
Mark Parnell
http://clarkecomputers.com.au
Nov 23 '05 #33

On Mon, 21 Nov 2005, Eric Lindsay wrote:

[about http://www.ericlindsay.com/testthis/ ]
However I am not certain how I should even be checking the HTTP
header from a server. I used curl --head for lack of any better
idea. However I can't see that I could (or should) suggest that to
anyone who is interested.
I would recommend Mozilla/Firefox (whichever the questioner prefers),
with Chris Pederick's web developer toolbar. An indispensable utility
for so many different reasons - even though, personally, I tend to use
lynx -head -dump for this purpose, out of sheer long-term habit.
Plus I notice another problem with curl. I can see no indication in
HTTP headers supplied via curl that my server is providing a character
specification.
It isn't sending any.
I thought something about that should be in the HTTP header,


Depends on server configuration, as we've all been saying.

Sooner or later you'll want to try uploading a simple .htaccess file
with variations of AddType and AddCharset in it. Just to verify
whether the server honours it. Once you know one way or the other,
you'll be in a better position to discuss options.
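
Something along these lines, say (a minimal sketch - whether it takes
effect depends on the server admin permitting such overrides):

# .htaccess in the directory under test
AddType text/html .html .htm
AddCharset UTF-8 .html .htm

If the Content-Type header changes accordingly on your next check, you
know the .htaccess is being honoured.
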
Nov 23 '05 #34

In article <pa***************************@mail.localhost.invalid>,
Tim <ti*@mail.localhost.invalid> wrote:
On Sat, 19 Nov 2005 19:22:50 +1000, Eric Lindsay sent:
I must admit that whole "MIME type announced by the server" idea worries
me. Most people putting web pages, whether HTML or XHTML, on a server,
are likely to have very little control over whether html is served as
text/html (although it mostly seems to be done), or whether xhtml 1.0 or
1.1 is served as text/html or application/xhtml-xml.


There are ways and means of the user doing so, even when they can't
configure the server: They prepare their data in a specific way (whether
that be certain types of data in certain directories, certain filenaming
schemes, or something else).


I was actually wondering about whether I would have to make changes in
how a web site was prepared if I ever went to XHTML 1.1 and tried to
serve it as application/xhtml-xml. Luckily I am not planning on moving
from HTML 4.01 Strict.

It seemed to me that a website would likely consist of legacy .html
files (which if old like my site would all be labelled .htm). Then
there would be maybe XHTML 1.0, maybe generated from some program, and
being served as text/html or text/plain (as on my server). Only the
final layer would be XHTML 1.1, to be served as application/xhtml-xml.
I was imagining that new content would need to be in new directories
with appropriate .htaccess file or some equivalent.

--
http://www.ericlindsay.com
Nov 23 '05 #35

In article <Iu*******************@news-server.bigpond.net.au>,
Lachlan Hunt <sp***********@gmail.com> wrote:
Eric Lindsay wrote:
You need to get the maths right. You add xml, not subtract it, and it's
xhtml, not html: application/xhtml+xml
Ten years in a Math school, and still can't do arithmetic. No wonder I
have problems with web pages.
and then IE doesn't like it?


application/xhtml+xml + IE = Save As... dialog
application/xhtml+xml + Google = File Format: Unrecognised.


Thanks for that. I love the summary of what happens. I don't have IE,
so I wasn't sure precisely how it reacted badly, just knew it did.
If they're served as
text/html, there's absolutely no practical difference between HTML4 and
XHTML1; both will be parsed with tag soup rules.


Have I misunderstood this? If I declare HTML 4.01 Strict, I was under
the impression that all recent browsers, including IE, would NOT use
quirks mode. However IE would use quirks mode on XHTML served as
text/html.


No, you misunderstood me. I was not referring to the differences
between quirks mode and standards mode, I was referring to the
differences between XML and tag-soup parsing rules.


OK, despite all my recent reading I didn't really catch on to the idea
of calling "normal" browser parsing "tag-soup parsing", to indicate the
difference between it and XML parsing. Thanks for that.
While, for text/html, the parsing rules in standards mode are slightly
different from quirks mode, browsers will generally accept whatever
rubbish you throw at them and try to make some sense out of it,
regardless of the mode. As far as parsing is concerned, standards and quirks
mode are essentially just different error recovery techniques.


--
http://www.ericlindsay.com
Nov 23 '05 #36

In article <Pi******************************@ppepc56.ph.gla.ac.uk>,
"Alan J. Flavell" <fl*****@ph.gla.ac.uk> wrote:
Seems I may as well make the move entirely to UTF-8, and advertise my
pages as such. Using &entityname; for Latin-1 and &#bignumber; seems
easier on new pages than the other UTF-8 alternative. May have a few old
pages that break (I think I may have Pound symbols as &#smallnumber; in
a few pages),


Hang on. Pound sterling is correctly &#163; - what you have to beware
of are the characters of the Windows-specific repertoire 128-159
decimal. For example the euro character is *not* &#128; (but I'd
recommend coding that as &euro; anyway).


Thanks for that correction to my understanding Alan.

You guys are aware that I spend much of my time feeling like a person
sinking in quicksand, while you stand on the side making learned and
interesting comments on my swimming style, with the odd aside about how
best to use the springboard while diving in?

My little collection of essential notes about doing web pages now has 77
files in it, and I am only on about line 4 of my web page!

--
http://www.ericlindsay.com
Nov 23 '05 #37

Alan J. Flavell wrote:
It's extremely easy to validate UTF-8 or ISO-8859-1 input
If you mean "on knowing that some non-trivial input must be either
utf-8 or iso-8859-1, it's easy to decide which", then I agree with
you.


Yes, it is easy to decide which, but I was actually meaning that if the
system is expecting UTF-8 input, it's easy to formally validate it and
also ensure that there are no unwanted control characters (eg. most
characters less than U+0020, or in the range from U+0080 to U+009F,
etc.) that are invalid for use in HTML even if they're encoded
correctly. Similarly for ISO-8859-1 or other expected encoding.
utf-8 is different. There's even a security mandate against
attempting to process invalid coding.


Such issues should be carefully handled while validating the input.

--
Lachlan Hunt
http://lachy.id.au/
http://GetFirefox.com/ Rediscover the Web
http://GetThunderbird.com/ Reclaim your Inbox
Nov 23 '05 #38

In article <Pi*******************************@ppepc56.ph.gla.ac.uk>,
"Alan J. Flavell" <fl*****@ph.gla.ac.uk> wrote:
On Mon, 21 Nov 2005, Eric Lindsay wrote:

[about http://www.ericlindsay.com/testthis/ ]
I would recommend Mozilla/Firefox (whichever the questioner prefers),
with Chris Pederick's web developer toolbar.


http://chrispederick.com/work/webdeveloper/

It certainly seems very nice. I mostly use Safari, but those extensions
for Firefox are just what I needed for these tests.

I also found an article about using those tools, in the context of
checking for accessibility, but similar checks would be helpful for
more general purposes.
http://www.ariadne.ac.uk/issue44/lauke/
character specification.


It isn't sending any.
I thought something about that should be in the HTTP header,


Depends on server configuration, as we've all been saying.

Sooner or later you'll want to try uploading a simple .htaccess file
with variations of AddType and AddCharset in it. Just to verify
whether the server honours it. Once you know one way or the other,
you'll be in a better position to discuss options.


I certainly will check out adding my own .htaccess files, but let me get
through finding out how various pages act before I try changing how the
server works.

I don't think I am likely to have any problems at present with any of
the servers being unchanged for a while. I am writing in HTML 4.01
Strict, not in XHTML, and the servers seem to declare any .html or .htm as
text/html. Plus my character set (mostly) isn't straying outside the
ASCII range, so it will probably be displayed correctly regardless.
There is one server in particular that I suspect may never have been
changed from its defaults. There is another where I'd like them to
complete processing my new domain names (don't ask how long that has
been) before I get into a position where I might have to ask about
.htaccess files.

Rex Swain's HTTP viewer, which Mark Parnell suggested, was also a very nice
tool. Particularly for spotting which sites were redirecting without my
realising it.

Thanks to everyone for these tools.

--
http://www.ericlindsay.com
Nov 23 '05 #39

Also visit http://web-sniffer.net/ to view HTTP request and response
headers.

--
James Pickering
http://jp29.org/

Nov 23 '05 #40

P: n/a
Alan J. Flavell wrote:
Apache is available to run on Windows, or Mac OS X, for example, and
can do a pretty fine job of reflecting what the pages are going to
find when they're uploaded to the production web server. Including
any SSI or PHP processing that you might be doing, for example.


Indeed Apache is bundled with Mac OS X. It doesn't include PHP by default,
but a nice Mac package for PHP 4.x can be downloaded here:
http://www.entropy.ch/software/macosx/

--
Toby A Inkster BSc (Hons) ARCS
Contact Me ~ http://tobyinkster.co.uk/contact

Nov 23 '05 #41

Tim
On Mon, 21 Nov 2005 09:57:44 +1000, Eric Lindsay sent:
It seemed to me that a website would likely consist of legacy .html files
(which if old like my site would all be labelled .htm). Then there would
be maybe XHTML 1.0, maybe generated from some program, and being served as
text/html or text/plain (as on my server). Only the final layer would be
XHTML 1.1, to be served as application/xhtml-xml. I was imagining that new
content would need to be in new directories with appropriate .htaccess
file or some equivalent.


It's a simple task for a webmaster to say that the server will be
configured to serve HTML as text/html if the filename ends with .html, to
be served as XHTML if ending in .xhtml, XHTML bastardised as HTML if
ending in .xhtm (or any other letter sequence), and so on. Or the
determination could be done that all files in a certain directory are
HTML, etc.

The filenaming is entirely inconsequential as far as the end user is
concerned, you only need to use specific ones if the server expects you
to. And then, the ones it expects. You could, just as easily, have all
your HTML pages named as .page files.

As far as generated content goes, it generates appropriate descriptions
at the same time as it generates the content.
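
For instance, in an Apache .htaccess (a sketch only; the extensions
are arbitrary):

AddType text/html .page
AddType application/xhtml+xml .xhtml

# or, scoped per directory: an .htaccess containing just
#   ForceType application/xhtml+xml
# marks every file in that directory as XHTML, whatever its name.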

--
If you insist on e-mailing me, use the reply-to address (it's real but
temporary). But please reply to the group, like you're supposed to.

This message was sent without a virus, please destroy some files yourself.

Nov 23 '05 #42


Eric Lindsay wrote:
Well, I have made some test pages at http://www.ericlindsay.com/testthis


There are a few minor DNS improvements that you might wish to consider:

http://www.dnsreport.com/tools/dnsre...ww.ericlindsay.com

Compare with:

http://www.dnsreport.com/tools/dnsre...ww.guymacon.com

Nov 23 '05 #43

In article <11*************@corp.supernews.com>,
Guy Macon <http://www.guymacon.com/> wrote:
There are a few minor DNS improvements that you might wish to consider:

http://www.dnsreport.com/tools/dnsre...riclindsay.com

Compare with:

http://www.dnsreport.com/tools/dnsre...w.guymacon.com


That is a really nice reporting tool. At least I only got warnings,
rather than a fail. I'll test all the other servers available to me as
well. Looks like it would be a good way to get an idea of what stuff an
ISP handles well before signing up. Thanks for that Guy.

--
http://www.ericlindsay.com
Nov 23 '05 #44

I just found another interesting problem while testing the three servers
to which I have access. This relates to .css files, which I didn't even
think of including in my test files. Two servers use HTTP headers that
say .css files are text/css; however, one returns text/html.

The interesting bit comes when you read the Mozilla page at
http://developer.mozilla.org/en/docs..._Mode_Behavior
which says "Stylesheets linked in the document with an advisory mime
type of text/css will still be treated as CSS even if the server gives a
Content-Type header other than text/css.". However if you attempt to
force a standards mode DTD, here is what they say "Another often-noticed
change is that, in standards mode, we reject CSS stylesheets that have a
MIME type other than text/css."

I wonder how many people have a server that by default serves .css as
text/html? I wonder what other browsers do something similar in quirks
vs standards mode? Guess that can be my project for tomorrow. I think
I'm getting more and more impressed by the idea of having .htaccess
available. However I also think it is unrealistic to expect most web
page writers to know about these problems.
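
If the server honours .htaccess, the cure for the odd one out would be
a one-liner (same caveat as before about the host allowing overrides):

AddType text/css .css

After which a standards-mode Mozilla should stop rejecting the
stylesheets.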

--
http://www.ericlindsay.com
Nov 23 '05 #45

In article <NO********************************@freenews.iinet.net.au>,
Eric Lindsay <NO**********@ericlindsay.com> wrote:
I wonder how many people have a server that by default serves .css as
text/html?
Less than there used to be. Call your non-conforming ISP and complain.
Or email them if they respond to that. That's their trivial one-time job.
I think
I'm getting more and more impressed by the idea of having .htaccess
available.
It all goes away if you switch to IIS.
However I also think it is unrealistic to expect most web
page writers to know about these problems.


True. But that can be an advantage if you can sell it.

leo

--
<http://web0.greatbasin.net/~leo/>
Nov 23 '05 #46

On Tue, 22 Nov 2005, Eric Lindsay wrote:
I just found another interesting problem while testing the three
servers to which I have access. This relates to .css files, which I
didn't even think of including in my test files. Two servers use
HTTP headers that say .css files are text/css, however one returns
text/html.
I've seen other wrong content-types for CSS files. It's definitively
wrong for a server to send the wrong content-type, for any resource
(RFC2616 defines the HTTP protocol).
The interesting bit comes when you read the Mozilla
http://developer.mozilla.org/en/docs..._Mode_Behavior
which says "Stylesheets linked in the document with an advisory mime
type of text/css will still be treated as CSS even if the server
gives a Content-Type header other than text/css.".
On a strict reading of RFC2616, this browser behaviour is illegal.
Admittedly this (mis)usage is quite common: but if a server sends a
wrong content type then (on my reading of the "if and only if" clause
in the RFC), the options available to the client are either to reject
the content entirely - as being unfit for purpose; or to have a user
dialogue asking for the user's consent to a potentially
security-relevant overriding of the protocol rules.
However if you attempt to force a standards mode DTD, here is what
they say "Another often-noticed change is that, in standards mode,
we reject CSS stylesheets that have a MIME type other than
text/css."
Right. This behaviour is correct. It's analogous to the mandate in
the CSS specification, to disregard stuff which the client does not
understand.
I wonder how many people have a server that by default serves .css
as text/html? I wonder what other browsers do something similar in
quirks vs standards mode?
But not a concern to a content provider who is doing their job
correctly!
I think I'm getting more and more impressed by the idea of having
.htaccess available.
Agreed, although it would better meet the principle of least
astonishment if server admins would include the customary MIME types
in their configuration. There's a file of them distributed with
Apache, for example, which makes a good starting point.
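
That file (conf/mime.types in a typical Apache installation) is just a
plain mapping of types to extensions; the relevant entries look
something like:

text/css                        css
text/html                       html htm
application/xhtml+xml           xhtml xht
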
However I also think it is unrealistic to expect most web
page writers to know about these problems.


The mandate of RFC2616 is clear, albeit some software developers
(YKWIM) who are more interested in faking a pretence of working (than
in maintaining security by conforming with published interworking
rules) seem to have driven a coach and horses through the web.

The question of whose responsibility it is to get this right (the
server admin or the content provider) is something one could discuss,
but the bottom line is - if it doesn't get the right answer, then it
is failing.
Nov 23 '05 #47

Eric Lindsay wrote:
I wonder how many people have a server that by default serves .css as
text/html?


CSS files are very often sent as text/html when they are dynamically
generated - lots of developers unfortunately still don't know about MIME
types and don't care about HTTP headers at all.
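
When the stylesheet comes out of a script, the fix is to set the header
explicitly before any output is sent - for example in PHP (a minimal
sketch; the file name and rule are invented):

<?php
// style.css.php - header() must run before anything is echoed
header('Content-Type: text/css; charset=utf-8');
?>
body { background-color: #ffffff; }
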
Nov 23 '05 #48

Eric Lindsay wrote:
I wonder how many people have a server that by default serves .css as
text/html?


Heh, my webmail pages serve css files as: "application/x-pointplus"
I've been reporting this error to them for several years!

--
-bts
-Warning: I brake for lawn deer
Nov 23 '05 #49

In our last episode,
<1p****************************@40tude.net>,
the lovely and talented Beauregard T. Shagnasty
broadcast on comp.infosystems.www.authoring.html:
Eric Lindsay wrote:
I wonder how many people have a server that by default serves .css as
text/html?

Heh, my webmail pages serve css files as: "application/x-pointplus"
I've been reporting this error to them for several years!


So, does the browser believe:

1) The server
2) The type attribute in the LINK element
3) What it can deduce by looking at the file itself
?

I really don't know.

--
Lars Eighner us****@larseighner.com http://www.larseighner.com/
If it wasn't for muscle spasms, I wouldn't get any exercise at all.
Nov 23 '05 #50
