On Tue, 13 Oct 2003, S. suddenly blurted out:
oops i meant is there a formula to map unicode to sgml notation
If you're aiming to participate in big-8 newsgroups... (well, see
footnote [1])
&#123; was an example.
Sure, no problem with that.
i have finished my site using sgml notation for the special
chars.out of curiosity what i wanted to know was that if sgml is
related to unicode.
The SGML notation &#number; (technically this is a "numeric character
reference") refers to the decimal value of the character position in
the chosen "Document Character Set" (a technical term from SGML).
In versions of HTML starting with RFC2070 and continuing through
HTML4.01 and into XHTML, there is only one "Document Character Set" in
HTML, and that is Unicode.
In SGML itself, you could define any character set to be a "Document
Character Set". (This would be done in the "SGML Declaration", as I
understand it). But for HTML there is a non-negotiable "SGML
Declaration" which sets the document character set (for the particular
version of HTML under discussion): and that, for good reason, was
chosen to be Unicode.
Note that the Document Character Set has *no* relationship to the
external document character coding: by using the &#number; notation
it's possible, if you wish, to represent the entire Document Character
Set (i.e. Unicode) using nothing more challenging than US-ASCII
character coding. But that's only one possible option - there are
many possible choices[2].
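To illustrate the US-ASCII option just described, here is a minimal sketch, assuming Python 3 and its standard "xmlcharrefreplace" codec error handler. The sample string is only an assumption chosen for the illustration.

```python
# Sketch: write Unicode text using only US-ASCII bytes by replacing every
# non-ASCII character with a decimal numeric character reference.
text = "\u04d2 and \u00e9"  # U+04D2 and U+00E9, sample characters only

# "xmlcharrefreplace" substitutes &#number; for anything ASCII cannot encode.
ascii_bytes = text.encode("ascii", errors="xmlcharrefreplace")

print(ascii_bytes)  # b'&#1234; and &#233;'
```

The result contains no byte above 127, yet it still names both characters unambiguously by their position in the Document Character Set.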
The external document coding is specified in MIME notation by the
so-called "charset" parameter, a name which is very confusing in this
context, since it has nothing whatever to do with the Document
Character Set. In current versions of HTML, many different "charset"
values are used, according to the locale and writing system in use and
other considerations; but there is only one Document Character Set,
namely Unicode.
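The distinction can be seen in a small sketch, assuming Python 3. The two "charset" values are just examples; the point is that the bytes on the wire change while the numeric character reference does not.

```python
# Sketch: one character, two external codings, one code point.
ch = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

latin1_bytes = ch.encode("iso-8859-1")  # one byte:  b'\xe9'
utf8_bytes = ch.encode("utf-8")         # two bytes: b'\xc3\xa9'

# The reference is derived from the code point alone, so it is the same
# whichever "charset" the document is sent in.
reference = f"&#{ord(ch)};"

print(latin1_bytes, utf8_bytes, reference)
```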
i.e. is a sgml value &#1234; decimal for some unicode char
In HTML, this is certainly so.
or did sgml define its own table of char to decimal values.
No, SGML doesn't have a specific "table" of such values: it can use
whatever Document Character Set the SGML user cares to put in their
declaration, I believe. Of course, there are good practical reasons
for choosing Unicode.
is there a mathematical formula to calculate a unicode value
Unicode publications represent their characters using a hexadecimal
notation, e.g. U+04D2 for "CYRILLIC CAPITAL LETTER A WITH DIAERESIS",
which you would represent as your example &#1234; (hex 04D2 is
decimal 1234).
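The only "formula" involved here is a change of number base. A minimal sketch, assuming Python 3; the helper name decimal_reference is hypothetical and not part of any library.

```python
def decimal_reference(u_notation):
    """Turn Unicode's 'U+04D2' notation into a decimal reference.

    Hypothetical helper for illustration only.
    """
    if not u_notation.upper().startswith("U+"):
        raise ValueError("expected a value written like U+04D2")
    code_point = int(u_notation[2:], 16)  # hexadecimal digits -> integer
    return f"&#{code_point};"


print(decimal_reference("U+04D2"))  # &#1234;
```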
Later, SGML adopted a syntax for hexadecimal numeric character
references, e.g. &#x4D2; if you don't want to do the conversion - but
as far as its use in HTML for the WWW, you get slightly better support
across browsers if you use the decimal syntax instead.
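That the two syntaxes name the same character can be checked with a short sketch, assuming Python 3 and its standard html module, which resolves both forms.

```python
import html

# Decimal and hexadecimal references for the same code point, U+04D2.
from_decimal = html.unescape("&#1234;")
from_hex = html.unescape("&#x4D2;")

print(from_decimal == from_hex)        # True
print(f"U+{ord(from_decimal):04X}")    # U+04D2
```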
given its utf8 value?
As I said before, utf-8 is an encoding of Unicode. The details are
published, but you'd be better advised to use some available library
or module which supports this encoding, and the other encodings of
Unicode, for you.
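As a rough sketch of both routes, assuming Python 3: the first part lets the built-in codec do the work, and the second spells out the arithmetic for the two-byte case only. The function name two_byte_code_point is hypothetical, and it does not handle one-, three- or four-byte sequences or reject overlong forms, which is one reason a library is the safer choice.

```python
# UTF-8 bytes for U+04D2 (CYRILLIC CAPITAL LETTER A WITH DIAERESIS).
utf8_bytes = b"\xd3\x92"

# Route 1: let the library decode, then ask for the code point.
code_point = ord(utf8_bytes.decode("utf-8"))


def two_byte_code_point(data):
    """Decode a two-byte UTF-8 sequence by hand (illustration only)."""
    lead, trail = data  # iterating over bytes yields integers
    # A two-byte sequence looks like 110xxxxx 10xxxxxx.
    if (lead & 0xE0) != 0xC0 or (trail & 0xC0) != 0x80:
        raise ValueError("not a two-byte UTF-8 sequence")
    # Keep 5 payload bits from the lead byte and 6 from the trail byte.
    return ((lead & 0x1F) << 6) | (trail & 0x3F)


print(code_point, two_byte_code_point(utf8_bytes))  # 1234 1234
```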
good luck
[1] If you're aiming to participate in big-8 newsgroups, you'd be
strongly advised to catch up with the netiquette conventions. In
particular, about not posting the same question separately to
different groups (yes, some of us read more than one group, and we
spot these things), and following the accepted rules of quotation: one
quotes, with attribution, the specific part of the previous thread
which sets the context for your followup - one puts one's comment
below the context-setting quote - and one snips all extraneous matter,
signatures etc. from what one is quoting.
[2] I have an overview aimed at optimising the choice, at
http://ppewww.ph.gla.ac.uk/~flavell/charset/checklist
(this now cross-posted to c.t.sgml, and followups suggested back at
c.i.w.a.html).