On Tue, 13 Jan 2004 16:21:49 -0600, Terry
<go**************@REMOVE.yahoo.com> wrote:
Are you referring to the parsing of HTML, or its display once it's been
parsed?
Mainly parsing. Thanks Mike!
Be aware that I've never actually written a browser or an HTML parser. I
toyed with the idea a while ago, but by the end of this post, you'll
probably see why I decided against it. What I describe below would be my
approach, but as it hasn't been implemented, it could be seriously flawed.
If every HTML document on the WWW followed the W3C's HTML Recommendations
(the Standard) to the letter (very easy to do with a little effort and
some reading), parsing HTML would be really quite easy. One could follow
the DOM (Document Object Model) and build a hierarchical tree that
represented the structure of the document. Each node would be of a certain
type (table, paragraph, button, etc), and that gives the basic display
characteristics of the element. Each node would also hold the attributes
and stylesheet information that then modified those basic characteristics.
The tree would be simple to build. Every time a new HTML element is
encountered, it is added to the tree at the current 'depth'. If the new
element is a container of any kind[1], all subsequent elements are added
as children (they sit at a greater depth). When a closing tag is
encountered, the parser moves back up a level, so subsequent elements are
added at the previous, shallower depth. Take the following document, for
example (attributes and the DTD have been omitted):
<HTML>
<HEAD>
<TITLE>A HTML Page</TITLE>
</HEAD>
<BODY>
<P>This is a demonstration page.
<TABLE>
<TR>
<TD>Cell 1
<TR>
<TD>Cell 2
</TABLE>
</BODY>
</HTML>
The tree representation would then look like this:
Document
+- Head
| +- Title
| +- (Text) A HTML Page
|
+- Body
+- P
| +- (Text) This is a demonstration page.
|
+- Table
+- TR
| +- TD
| +- (Text) Cell 1
|
+- TR
+- TD
+- (Text) Cell 2
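The stack-walking approach above can be sketched in Python using the
standard library's html.parser module. This is a hypothetical, minimal
builder, and it assumes every container has an explicit closing tag - it
does NOT implement the implied end tags (P, TR, TD and so on) that real
HTML allows, so the example document above would need its closing tags
filled in before feeding it through.

```python
from html.parser import HTMLParser

class Node:
    """A tree node: element name, attributes, children and a parent link."""
    def __init__(self, name, attrs=None, parent=None):
        self.name = name
        self.attrs = dict(attrs or [])
        self.children = []
        self.parent = parent

class TreeBuilder(HTMLParser):
    """Builds a naive DOM-like tree from well-formed markup."""
    def __init__(self):
        super().__init__()
        self.root = Node("Document")
        self.current = self.root          # the current 'depth' in the tree

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        self.current = node               # descend: what follows is a child

    def handle_endtag(self, tag):
        if self.current.parent is not None:
            self.current = self.current.parent   # climb back up one level

    def handle_data(self, data):
        text = data.strip()
        if text:                          # ignore inter-tag whitespace
            self.current.children.append(
                Node("(Text) " + text, parent=self.current))
```

Feeding it the example document (with end tags added) yields the same
Document/head/title shape as the diagram above; note that HTMLParser
lowercases tag names.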
Once built, the tree could be traversed from the top down, building each
element as you go. Once enough information is available to begin rendering
(text in a paragraph, first complete cell or row in a table, etc), the
element can be displayed.
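That top-down pass is just a depth-first walk. A toy sketch (using plain
(name, children) pairs rather than a full node type, and print standing in
for actual rendering):

```python
def render(node, depth=0):
    # Depth-first, top-down walk: 'display' an element as soon as it is
    # reached, then recurse into its children.
    name, children = node
    print("  " * depth + name)
    for child in children:
        render(child, depth + 1)

# The Head branch of the example tree above:
render(("Head", [("Title", [("(Text) A HTML Page", [])])]))
```

A real renderer would stream this incrementally - displaying a table row as
soon as the row's subtree is complete, rather than waiting for the whole
document - but the traversal order is the same.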
Unfortunately, the caveat that I mentioned at the beginning is very
important (and particularly relevant). From the point of view of the
Standard, the majority of HTML documents on the WWW are, at best,
non-conforming and, at worst, complete gibberish. This has led to the rise
of quirks-mode parsing, where the browser essentially guesses what the
author wanted[2]. The simple treatment given to parsing above doesn't work with
invalid documents. One has to anticipate every little thing that a
would-be author might try to do, even if it makes absolutely no sense[3].
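As one small illustration of the guesswork involved, here is a hypothetical
recovery rule for mismatched closing tags. This is a sketch only - both
choices below are guesses at author intent, not anything from the Standard,
and real browsers' recovery algorithms are far more elaborate:

```python
def close_tag(open_stack, tag):
    """Tolerantly handle </tag> against a stack of currently open elements.

    If the tag is open somewhere, implicitly close everything opened after
    it; if it was never opened, ignore the stray end tag entirely.
    """
    if tag in open_stack:
        while open_stack[-1] != tag:
            open_stack.pop()      # implicitly close abandoned elements
        open_stack.pop()          # close the matching element itself
    # else: unmatched </tag>; silently drop it

stack = ["html", "body", "p", "b"]
close_tag(stack, "p")    # author forgot </b>: both b and p get closed
close_tag(stack, "div")  # never opened: ignored
```

Multiply that by every element, every nesting mistake and every typo an
author might produce, and the size of the problem becomes apparent.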
I don't know if that was what you were looking for, or if you wanted
something more specific (like how a particular browser parses documents).
In any case, it was an interesting thought exercise for me. :)
Mike
[1] Examples would be tables, forms and links (anything with an opening
and closing tag). The first two are block-level elements, so they are
containers by nature. A link is still a container of sorts, because it
holds text. Elements like HR (horizontal rule), INPUT or BR (line break)
aren't containers because they are defined to be empty.
[2] This is probably learned from experience; amateur webmasters might
e-mail browser developers asking why their pages don't render as they
wanted. This way developers can build a picture of common mistakes and
compensate for them automatically. Another approach would be to build
"what if" scenarios and decide what the best course of action would be if
an author missed some important information. A common example would be the
script language - the general response is to assume JavaScript.
[3] **Beware: rant** That's the general approach with most end-user
applications, but it's more complicated, in my mind, with parsing, and it's
why I can't understand the attitude of some authors. I'm referring to the
ones who think that because browsers can parse rubbish, authors should
write rubbish and just let the browser sort it out. I'm certain that if browsers
didn't have to innately cope with ridiculous errors, they'd be faster and
a lot more stable. I realise that browsers would have to have basic
tolerance to handle typos and a lapse in concentration on the author's
part. However, a browser could simply ignore the erroneous mark-up and
flag an error (a pop-up or in-line, bold error text). When the author
checks the document with the browser (though a validator should be used to
check for errors), the problem is obvious and is then corrected. That's
what happens with most applications, so why not browsers?
--
Michael Winter
M.******@blueyonder.co.invalid (replace ".invalid" with ".uk" to reply)