
Understanding simplest HTML page

I have been trying to get a better understanding of simple HTML, but I
am finding that conflicting information is very common. Not only that,
even on points that seemed elementary and impossible to get wrong, it
turns out I am on very shaky ground.

For example, pretty much every book and web course on html that I have
read tells me I must include <html>, <head> and <body> tag pairs.

I have always done that, and never questioned it. It seemed perfectly
reasonable to me (and still does) to split meta information from
presented content, and indeed to require that browsers be told the
content was html. Although I guess having a server present mime type
text/html covers whether contents are html, as does having a doctype.

However on reading http://www.w3.org/TR/html401/struct/global.html I
noticed that the html, head and body tags were optional (although the
title tag is required). So I did a test page

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<title>Test whether required in head</title>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">
<p>Paragraph of text

This validates without any warning.

If you leave out either the title or some body content it will not
validate. So the validator at least is making an assumption about what
is head and what is body. I would imagine most user agent parsers would
also.

Does anyone have any suggestions about good tutorial texts about html
that get everything correct? At the moment I am gradually going through
the W3C documentation, but I tend to find myself missing some of the
implications.

--
http://www.ericlindsay.com
Nov 23 '05
In article <Pi*********************************@ppepc56.ph.gla.ac.uk>,
"Alan J. Flavell" <fl*****@ph.gla.ac.uk> wrote:
On Thu, 17 Nov 2005, Eric Lindsay wrote:

[ reading an SGML DTD ]
I would still like an easier text, if one exists.
The one that I used, years ago, has disappeared now, but
you could find it at the wayback machine http://www.archive.org/
The url to look for is http://www.awpa.asn.au/html/dtdtrees/sgml.html


Thanks Alan. Reading it now.
I'm a great supporter of the idea of writing syntactically valid HTML,
but no-one should hold unrealistic expectations of its benefits for
browsers to "cope".
I think I have recovered from the unrealistic expectations, but then I
went into a keep-it-so-simple-nothing-can-break mode. That in its way
is equally unrealistic, because all I've been doing is pages with really
simple text and images (which was often all that was needed).

I think what I need to start looking at is classifying CSS into ...
Well, I was about to say works and doesn't work. However works and
doesn't work is too sharp a distinction, and part of the reason I have
never pushed CSS. I guess what I really mean is "works (in recent
browsers) and does NOT cause catastrophic problems in IE" vs "causes all
manner of problems if IE sees it."

That example (in another thread) of using CSS3 and a background image to
show which links are external to a site is exactly what I should be
seeking. It works in modern browsers like Safari and Firefox. Doesn't
work at all in IE6, but all that means is the IE experience isn't as
good. The links still work as links. I need to find more like that,
and start using them.
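Presumably something along these lines (a rough sketch of the idea from
that thread, with a made-up image name, not the exact rule that was
posted):

a[href^="http://"] {
    background: url(external-link.png) no-repeat right center;
    padding-right: 14px;
}

IE6 ignores the attribute selector, so it simply shows a plain link.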
HTML will almost certainly "work" when the validator says it's
invalid, but will likely fail (to do what the author intended) when
it's reported to be valid:
http://ppewww.ph.gla.ac.uk/~flavell/...url.html#tests


Yes, I think I ran across that page when I read through a bunch of your
web pages about web matters yesterday. Thanks

--
http://www.ericlindsay.com
Nov 23 '05 #11
Andy Dingley wrote:
An HTML document containing a <p> start tag alone _MUST_ place this
within the <html> and <body> elements, because that's the only place
it's allowed to be -- even if you never use an <html> or <body> tag.

There are a few problems with this.

- it's not human friendly.
That's a matter of opinion; some people prefer not having unnecessary
tags in the source.
- it relies on correct parser behaviour
Indeed, and that's the best practical reason to always include them. I
know at least IE has serious problems with this:

<!DOCTYPE html ... >
<title>Test</title>
<form ... >
<p>test
</form>

IE will put the form element within the head and the p element within
the body. Using at least the body start-tag does fix that erroneous
behaviour.
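That is, something like this keeps IE from mis-parenting the form (the
same sketch with just the body start-tag added):

<!DOCTYPE html ... >
<title>Test</title>
<body>
<form ... >
<p>test
</form>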
- if the document is invalid HTML, then it's hard to recover when so
much is based on inference. Is the sequence <p><link> a <head> with a
spurious <p> in it, or is it a <body> with a spurious <link> in it ?


The head element wouldn't be implied, unless it was a valid child of the
p element, which it is not. As for the link implying the end of the p
element, I'm not sure of the exact SGML rules in this case, but the
validator shows (in the parse tree) that it remains a child of the p
element.

--
Lachlan Hunt
http://lachy.id.au/
http://GetFirefox.com/ Rediscover the Web
http://GetThunderbird.com/ Reclaim your Inbox
Nov 23 '05 #12
On Fri, 18 Nov 2005, Eric Lindsay wrote:
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">

However if you don't set it, the W3C validator gives you a warning,


No, it doesn't. Example:
http://www.unics.uni-hannover.de/nht...european.html1
I have been using ISO-8859-1 for ages because I thought it was a safe
fallback for a person authoring in English, especially since almost all
my pages were done on an old Windows PC with a text editor.
That's perfectly okay and I said nothing against ISO-8859-1.
As far as I know I have no direct control of what HTTP headers are
provided by any of the servers involved. I don't have access to the
.htaccess file, in the cases where it exists.
First you need to know your actual server software. For configuration,
you might then post to
<news:comp.infosystems.www.servers.ms-windows>
<news:comp.infosystems.www.servers.unix>
Also, what are you to do in the event that your web page may actually
contain multiple languages? I guess you declare UTF-8 (in the http
headers if you find out how). However again I was under the impression
there were browser problems with this also.


If some browser has problems with "charset=UTF-8", adding a
<META ... charset=UTF-8> doesn't help at all.

--
Netscape 3.04 does everything I need, and it's utterly reliable.
Why should I switch? Peter T. Daniels in <news:sci.lang>

Nov 23 '05 #13
On Fri, 18 Nov 2005, Eric Lindsay wrote:
Andreas Prilop <nh******@rrzn-user.uni-hannover.de> wrote:
On Thu, 17 Nov 2005, Eric Lindsay wrote:
So I did a test page

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">
However if you don't set it, the W3C validator gives you a warning,
"If you don't set" what? The validator correctly requires you to
specify the character encoding, by at least one route.

There are theoretical reasons which favour specifying it on the
real HTTP header. Once that happens, there is no point (as far as the
recipient is concerned) in also specifying it via meta http-equiv,
since the real HTTP header has priority.
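(By the "real HTTP header" I mean the Content-Type line in the server's
response, i.e. something of this form:

HTTP/1.1 200 OK
Content-Type: text/html; charset=ISO-8859-1

as opposed to a meta element buried inside the document itself.)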

There are a number of possible strategies: some people will promote
one, some will promote another. All of them have some kind of snag,
but what Andreas is promoting (and I do too) has a consistency to it.

(These arguments are for text/* content types: none of the arguments
transfer unmodified to application/xml+whatever, however, where
different considerations apply. There are discussion documents
about it at the W3C.)
even though I believed that particular charset was considered a
default value for web pages that didn't otherwise specify what they
were using.
That's a complex story, but you do better to put no reliance on that.
I have been using ISO-8859-1 for ages because I thought it was a safe
fallback for a person authoring in English,
Certainly is! Netscape 4.* would handle it wrongly when you included
&#bignumber; references to Unicode characters, but otherwise all
widely-used browsers know how to deal with it.
http://www.w3.org/International/O-HTTP-charset.html
http://ppewww.ph.gla.ac.uk/~flavell/...t/ns-burp.html


I had read Alan's page about setting that in the server options, and I'm
still trying to understand what I should be doing about it.


Server-dependent. The w3's page includes hints for popular server
software, but you may also have to deal with whatever configuration
options have been selected by your service provider.
Also, what are you to do in the event that your web page may actually
contain multiple languages?
Don't confuse language with character encoding! English is still
English, even when transcribed into Japanese characters.
I guess you declare UTF-8 (in the http headers if you find out how).
However again I was under the impression there were browser problems
with this also.
Not with any halfway web-compatible browser. utf-8 was handled
correctly even by that old Netscape 4.* thing, and by MSIE versions
going back quite some years now.
So I guess what I should do is write a bunch of test pages, in HTML
4.01 Strict and XHTML 1 Strict, with and without the 8859-1 meta,
and possible another set with and without UTF-8 and put them on each
of the three servers I can get at, and see if I can find out what
they serve?


It's good to play with the options and find out what happens, for
one's own education, indeed. We can offer you advice on which options
to use, but until you've played with them and developed confidence in
how they behave, it probably doesn't settle into one's concepts.

For maximal compatibility for different character repertoires, start
here http://ppewww.ph.gla.ac.uk/~flavell/charset/checklist

But if you're confident of being able to author in, and handle, utf-8,
feel free to go right ahead and use it.

Nov 23 '05 #14
Eric Lindsay wrote:
In article
<Pi*********************************@s5b004.rrzn.uni-hannover.de>,
Andreas Prilop <nh******@rrzn-user.uni-hannover.de> wrote:
On Thu, 17 Nov 2005, Eric Lindsay wrote:
> So I did a test page
>
> <META HTTP-EQUIV="Content-Type" CONTENT="text/html;
> charset=ISO-8859-1">


However if you don't set it, the W3C validator gives you a warning, even
though I believed that particular charset was considered a default value
for web pages that didn't otherwise specify what they were using.

I have been using ISO-8859-1 for ages because I thought it was a safe
fallback for a person authoring in English, especially since almost all
my pages were done on an old Windows PC with a text editor. Certainly
not set up to do much in the way of foreign characters. Now I have a
Mac, UTF-8 seems the default and it does handle foreign characters.
You should not get into the habit of specifying the encoding (charset)
of your page through such a META thingy. Rather, set the HTTP
charset parameter in your server software.
http://www.w3.org/International/O-HTTP-charset.html
http://ppewww.ph.gla.ac.uk/~flavell/...t/ns-burp.html


I had read Alan's page about setting that in the server options, and I'm
still trying to understand what I should be doing about it.

As far as I know I have no direct control of what HTTP headers are
provided by any of the servers involved. I don't have access to the
.htaccess file, in the cases where it exists. Indeed, how likely is the
average web page writer to know (or care) anything about it?

Also, what are you to do in the event that your web page may actually
contain multiple languages? I guess you declare UTF-8 (in the http
headers if you find out how). However again I was under the impression
there were browser problems with this also.

So I guess what I should do is write a bunch of test pages, in HTML 4.01
Strict and XHTML 1 Strict, with and without the 8859-1 meta, and
possible another set with and without UTF-8 and put them on each of the
three servers I can get at, and see if I can find out what they serve?

You can create your own .htaccess files. Windows won't (or wouldn't) let
you create a file with a name that is "all extension" so I once created a
file with a name that kept Windows happy, then changed it once it was on
the server. I am now running Linux, and have no problems. The problem
then was that dot-files were invisible.
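For what it's worth, the .htaccess file being discussed can be as short
as a single directive (a sketch, assuming an Apache server that allows
.htaccess overrides):

# .htaccess -- label pages as ISO-8859-1 by default
AddDefaultCharset ISO-8859-1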

Doug.
--
Registered Linux User No. 277548. My true email address has hotkey for
myaccess.
A man's feet should be planted in his country, but his eyes should survey
the world.
- George Santayana.

Nov 23 '05 #15
.htaccess reference:

http://www.javascriptkit.com/howto/htaccess.shtml

--
James Pickering
http://jp29.org/

Nov 23 '05 #16
In article <Xn*****************************@193.229.4.246>,
"Jukka K. Korpela" <jk******@cs.tut.fi> wrote:
Eric Lindsay <NO**********@ericlindsay.com> wrote:
I have been trying to get a better understanding of simple HTML, but I
am finding that conflicting information is very common.
Indeed, but you _do_ know where to find the authoritative specifications, do
you not? ( http://www.w3.org/Markup )


Thanks for that pointer. There are so many pages on the W3C site that I
don't believe I had read that one. I mostly had pointers to specific
topics, plus bits of the site I stumbled upon. I am finding it hard to
absorb everything available on the topic.
The particular reason to use <html> is that
it's the place where you should put the lang attribute that specifies the
human language used on the page, e.g. <html lang="en"> for English.


Thank you for that Jukka. I had totally missed incorporating that in
the notes I took when visiting W3C and your own site.
and indeed to require that browsers be told the
content was html.


They are supposed to know it from the media type (MIME type) announced by the
server. In fact, one of the _problems_ is that Internet Explorer sniffs at
the content, looking for tags, instead of honoring the protocol, so that you
cannot send an HTML document as text/plain and expect IE to do the right
thing (display it as such).


I must admit that whole "MIME type announced by the server" idea worries
me. Most people putting web pages, whether HTML or XHTML, on a server
are likely to have very little control over whether HTML is served as
text/html (although it mostly seems to be done), or whether XHTML 1.0 or
1.1 is served as text/html or application/xhtml+xml. Given Internet
Explorer's behaviour, that area seems very messy. Deciding to use HTML 4.01
rather than XHTML was a result of learning (via this newsgroup) about
that problem.

--
http://www.ericlindsay.com
Nov 23 '05 #17
In article <Pi***********************************@ppepc56.ph.gla.ac.uk>,
"Alan J. Flavell" <fl*****@ph.gla.ac.uk> wrote:
On Fri, 18 Nov 2005, Eric Lindsay wrote:
> <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">
However if you don't set it, the W3C validator gives you a warning,


"If you don't set" what? The validator correctly requires you to
specify the character encoding, by at least one route.


OK, let me see if I have this. Obviously if I can guarantee that every
server I have a web page on sends a character specification in the HTTP
header, then I should never need to use <meta http-equiv= etc. Plus the
HTTP header takes precedence anyway, as you point out.
There are theoretical reasons which favour specifying it on the
real HTTP header. Once that happens, there is no point (as far as the
recipient is concerned) in also specifying it via meta http-equiv,
since the real HTTP header has priority.
However the HTTP header applies only to pages served from a web server,
and the person doing the web page may not have managed to organise the
correct header. In particular, if a server claims all web pages are,
say, ISO-8859-1 but the author has actually written something else, then
the page may make no sense. So if that is at all likely, serving using
UTF-8 seems a better choice.

The transition from an old web site to a new one also seems likely to be
a problem area. Say all my old pages are ISO-8859-1 and a rewrite isn't
likely, and all the new ones are UTF-8. If the server HTTP headers all
claim ISO-8859-1, but the pages are UTF-8 ... well, I guess people
writing in English have some advantage from the equivalence of the
first 128 characters. However, surely this could cause problems for
non-English pages?

If the file is copied to someplace else and served from an alternative
server (Google or Wayback Machine cache?), then even if the original
server did the correct HTTP header, the cached version may not.

Equally if a copy of the web page on a local file system is opened with
a browser, then there is no HTTP header available. To me, that seems a
strong argument for still including <meta http-equiv= in HTML files.
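(In practice I suppose that means keeping a line such as

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

in each file, purely as a fallback for the cases where no HTTP header is
available.)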
(These arguments are for text/* content types: none of the arguments
transfer unmodified to application/xml+whatever, however, where
different considerations apply. There are discussion documents
about it at the W3C.)
Yes. I read through a 50+ slide tutorial, plus a bunch of other
documents on just that point. I am not sure I am ever going to make
sense of my notes on the topic. Luckily I am only planning on using
HTML, so that makes it simpler for me.
I have been using ISO-8859-1 for ages because I thought it was a safe
fallback for a person authoring in English,


Certainly is!

Also, what are you to do in the event that your web page may actually
contain multiple languages?


Don't confuse language with character encoding! English is still
English, even when transcribed into Japanese characters.


OK. I was not clearly seeing the distinction between the character
encoding and the language specified. Re-reading those notes.
For maximal compatibility for different character repertoires, start
here http://ppewww.ph.gla.ac.uk/~flavell/charset/checklist
Yes, I recall reading that a while ago, but not being confident at the
time of being able to author in UTF-8.
But if you're confident of being able to author in, and handle, utf-8,
feel free to go right ahead and use it.


Seems I may as well make the move entirely to UTF-8, and advertise my
pages as such. Using &entityname; for Latin-1 and &#bignumber; seems
easier on new pages than the other UTF-8 alternative. May have a few old
pages that break (I think I may have Pound symbols as &#smallnumber; in
a few pages), but most should turn up in a search (I hope). I don't
believe I need to specially support NN4 or IE3 at this point. That
would have advantages if quoting other languages.
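(Taking the Pound symbol as an example of what I mean:

&pound;   named entity, safe in any encoding
&#163;    numeric character reference, also safe anywhere
£         literal character, fine in a file actually served as UTF-8

The old pages are the ones where I'm not sure which form I used.)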

Thanks Alan (and for doing your fine pages).

--
http://www.ericlindsay.com
Nov 23 '05 #18
In article <Z9******************@news-server.bigpond.net.au>,
Lachlan Hunt <sp***********@gmail.com> wrote:
You create the .htaccess file yourself and place it in your public_html
directory (it can go in sub directories too, if you want to limit its
scope). All good web hosts that use Apache enable .htaccess for this
purpose. Just create a new text file, name it .htaccess, add this
directive and upload it to your server.

AddDefaultCharset ISO-8859-1
Thanks Lachlan. Two out of three of the servers seem to be running
Apache. One won't tell me what it is using as a server (maybe it thinks
that is a security risk).
UTF-8 is, of course, usually the best option.
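(With Apache that would simply be the UTF-8 form of the same directive,
e.g.

AddDefaultCharset UTF-8

in the .htaccess file.)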
I will change to it. More for the sake of potential future material.
So I guess what I should do is write a bunch of test pages, in HTML 4.01
Strict and XHTML 1 Strict, with and without the 8859-1 meta,


The meta element is completely useless in XHTML documents when they are
served properly. Assuming you're serving it with an XML MIME type, the
XML rec defines how to determine the character encoding from the HTTP
headers (or other transport protocol info), from the XML declaration, or
by defaulting to UTF-8/UTF-16 depending on the BOM.
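That is, when the transport headers say nothing, an XML declaration at
the very start of the file does the job, for example:

<?xml version="1.0" encoding="ISO-8859-1"?>

and with neither of those present, UTF-8 or UTF-16 is assumed according
to the BOM.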


But if I have understood the matter, if XHTML is served properly that
means application/xhtml+xml, and then IE doesn't like it? Luckily I have
decided to stick with HTML.
If they're served as
text/html, there's absolutely no practical difference between HTML4 and
XHTML1; both will be parsed with tag soup rules.


Have I misunderstood this? If I declare HTML 4.01 Strict, I was under
the impression that all recent browsers, including IE, would NOT use
quirks mode. However IE would use quirks mode on XHTML served as
text/html.

--
http://www.ericlindsay.com
Nov 23 '05 #19
On Sat, 19 Nov 2005, Jukka K. Korpela wrote:
Logically, the media type of data needs to be expressed outside the
data itself, unless we impose artificial restrictions on data. It is
grossly illogical (Münchhausenian, I would say) to have stuff like
<meta http-equiv="Content-Type"
content="text/html;charset=iso-8859-1"> inside an HTML document,
since by the time that a program has parsed and recognized the tag,
it has already decided to treat the data as HTML
right
(and in iso-8859-1 encoding).


Well, at least in /some/ encoding which has us-ascii as a subset,
since it needs nothing more to read and parse that initial part of the
document.

Where the logic goes wrong is where there is on-the-fly transcoding
of document content. Few people (statistically) ever meet this
situation in practice, but it represents an important theoretical
principle nevertheless.

As an example, Windows' native internal coding for i18n purposes is
utf-16. On some platforms, as another example, the native internal
coding is still EBCDIC. Consider an implementation which decided to
store web documents in the native encoding, and to serve them out with
on-the-fly transcoding into a suitable encoding for HTTP transmission.
What should the "meta http-equiv" say then? The one in the file would
be unsuitable for HTTP transmission, and vice versa.

This also used to be (and maybe still is?) an issue for the Russian
version of Apache, since several different 8-bit encodings existed for
Russian and had been used alongside each other, with some recipients
preferring one and some preferring another. This actual page
http://apache.lexa.ru/english/internals.html might be outdated, but
gives an idea of what they did. See also
http://apache.lexa.ru/english/meta-http-eng.html about
misbehaviour by now-outdated Netscape and IE versions.

A plain text file (text/plain) can be transcoded at the character
level only, without any concern for the data it contains, and the
resulting encoding advertised to the recipient via the real HTTP
header. With HTML, on the other hand, if it contains "meta charset"
(or XML contains the ?xml encoding= thing), then one would want also
to parse the data looking for that inappropriate internal definition,
and to modify or remove it, to avoid confusion. (Yes, this is one of
the reasons why the real HTTP charset is defined to overrule the meta
http-equiv charset; but it's clearly inadvisable to retain a meta
charset within the document when it advertises a charset which has
nothing to do with the current properties of the document instance,
and was only a left-over from the document's internal file storage).

--

Nov 23 '05 #20
