This question is fairly theoretical (even for me), but it started to
puzzle me:
According to the SGML declaration for HTML 4.01, at
http://www.w3.org/TR/REC-html40/sgml...cl.html#h-20.1
the Form Feed character, U+000C (12 in decimal), is UNUSED, i.e. forbidden:
    DESCSET  0   9   UNUSED
             9   2   9
             11  2   UNUSED
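To make the reading of those three lines concrete, here is a minimal Python sketch of how I interpret them: each line is (first code point, count, mapping), and UNUSED means the whole range is non-SGML. The names DESCSET_EXCERPT and lookup are my own, purely for illustration, and the table covers only the lines quoted above, not the full declaration.

```python
# The three DESCSET lines quoted above, as (first code point, count, mapping).
# A mapping of None stands for UNUSED. Only the quoted excerpt is modelled.
DESCSET_EXCERPT = [
    (0, 9, None),   # 0..8   UNUSED
    (9, 2, 9),      # 9..10  mapped (tab, line feed)
    (11, 2, None),  # 11..12 UNUSED (vertical tab, form feed)
]


def lookup(code_point):
    """Return 'UNUSED', 'USED', or 'not covered' for the excerpt above."""
    for first, count, mapping in DESCSET_EXCERPT:
        if first <= code_point < first + count:
            return "UNUSED" if mapping is None else "USED"
    return "not covered"


print("U+000C:", lookup(0x0C))  # form feed
print("U+0009:", lookup(0x09))  # tab
```

Read this way, form feed falls into the second UNUSED range together with vertical tab (11).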
Yet, the prose of the specification discusses it as if it were an
allowed character. Section 9.1 White space says:
"In HTML, only the following characters are defined as white space
characters:
- ASCII space (&#x0020;)
- ASCII tab (&#x0009;)
- ASCII form feed (&#x000C;)
- Zero-width space (&#x200B;)"
( http://www.w3.org/TR/REC-html40/struct/text.html#h-9.1 )
Is this just a slip in the SGML declaration, or in the prose? I'd
suppose the latter, since the formal rule was the same in HTML 3.2,
which did not mention U+000C at all in the prose. Presumably the people
who wrote the HTML 4.01 prose just didn't check what the formal
declaration says.
The W3C validator and the WDG validator seem to report U+000C as an
error ("Non-SGML character number 12"), apparently playing by the SGML
declaration for HTML 4.01.
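For anyone who wants to check their own documents without a validator, here is a rough Python sketch of that kind of check, limited to the control characters below 32. It is not how either validator actually works; it only borrows the wording of the message. The allowed set assumes the declaration also maps 13 (its line "13 1 13", not quoted above) and leaves 14 to 31 UNUSED. The names ALLOWED_C0 and non_sgml_c0 are made up for this example.

```python
# Code points below 32 that the HTML 4.01 SGML declaration maps to a
# character: tab (9), line feed (10), carriage return (13). The entry for 13
# is an assumption taken from the declaration line "13 1 13", which is not
# among the lines quoted in this post.
ALLOWED_C0 = {9, 10, 13}


def non_sgml_c0(text):
    """Yield (offset, code point) for each control character below 32
    that the declaration leaves UNUSED."""
    for offset, ch in enumerate(text):
        cp = ord(ch)
        if cp < 32 and cp not in ALLOWED_C0:
            yield offset, cp


sample = "<p>page one\x0c</p>\n"
for offset, cp in non_sgml_c0(sample):
    print(f"offset {offset}: Non-SGML character number {cp}")
```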
(XHTML, like XML in general, forbids U+000C explicitly. And U+000C is not
useful in HTML: it's just another whitespace character, not a page eject
character, as one might naively expect.)