
Can an HTML source file be specified in unicode ?

Hello,

I have the following problem of principle:
in writing HTML pages containing ancient Greek, there are two
possibilities: one is to write the Unicode characters directly
(encoded as two bytes) into the HTML source, and save this source not
as an ASCII text, but as a Unicode text file (using 16 bits per
character, also for the Western ASCII characters, which are usually
encoded as 0x00XX with XX the ASCII code); or to write a pure ASCII
HTML source, where the Greek characters are all encoded with
&#XXXX; references. I even have a small computer program that converts
the former into the latter.
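The conversion such a program performs can be sketched in a few lines (Python here, purely for illustration; the `xmlcharrefreplace` error handler does exactly this substitution):

```python
def to_ascii_refs(text: str) -> str:
    # 'xmlcharrefreplace' substitutes a &#NNNN; reference for every
    # character that cannot be represented in ASCII
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")

greek = "\u03bb\u03cc\u03b3\u03bf\u03c2"  # "λόγος"
print(to_ascii_refs(greek))  # → &#955;&#972;&#947;&#959;&#962;
```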
The funny thing is that a browser such as Netscape 7.2 seems to have
no problems accepting a Unicode-encoded source file and displays
everything all right.
Now, the discussion I'm having with other people is the following:
as it is easier to type the Unicode HTML source directly, is this, in
general, an acceptable thing to do, or is it (that's my viewpoint) a
totally unethical thing to do that simply works because of some
sloppiness in Netscape, given that HTML source code was never
intended to be anything but ASCII text in the first place? I would like
them to see that I should run their source files through my program,
which converts a Unicode file into an ASCII file with the true Unicode
characters (in this case, ancient Greek symbols) replaced by &#XXXX;
ASCII character sequences; their point of view is that this is
bullshit, and given the fact that it works in Netscape, it is a
correct thing to do.

So, what should be the outcome of this (academic) discussion ?
Must HTML source code be an ASCII code, or is it now allowed to be
UNICODE encoded text ?
thanks for any learned enlightenment,

Patrick.
Jul 23 '05 #1
11 Replies


Patrick Van Esch wrote:
So, what should be the outcome of this (academic) discussion ?
Must HTML source code be an ASCII code, or is it now allowed to be
UNICODE encoded text ?


HTML uses unicode.
Jul 23 '05 #2

In article <c2**************************@posting.google.com >,
va*****@ill.fr (Patrick Van Esch) wrote:
So, what should be the outcome of this (academic) discussion ?


Edit as straight characters (no &#...;) and save as UTF-8 (as
opposed to UTF-16 like you were doing; UTF-8 is safer than UTF-16).
Microsoft et al. call UTF-16 "Unicode".

--
Henri Sivonen
hs******@iki.fi
http://hsivonen.iki.fi/
Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html
Jul 23 '05 #3

In article <c2**************************@posting.google.com >,
Must HTML source code be an ASCII code, or is it now allowed to be
UNICODE encoded text ?


The Horse's Mouth (tm) is at: http://www.w3.org/TR/html4/charset.html
--
Peter Greenwood pe****@pgid.co.uk
http://www.pgid.co.uk
+44 1253 821678
Jul 23 '05 #4

On Sun, 13 Mar 2005, C A Upsdell wrote:
Patrick Van Esch wrote:
So, what should be the outcome of this (academic) discussion ?
Must HTML source code be an ASCII code, or is it now allowed to be
UNICODE encoded text ?


HTML uses unicode.


Anyone who *understood* what that cryptic answer meant, would not have
needed to ask the question in the first place!!!

I see that Henri Sivonen has offered a more constructive answer.

I might add that writing &#number; notations with ASCII characters
certainly produces a rather bullet-proof source, which can calmly be
passed through cross-platform transfers and so forth that might
otherwise trash utf-8-encoded source. But frankly, if you
have a means to author documents using utf-8 encoding, and a proven
way to upload them to the server and serve them out properly, then
there's really nothing to gain from resorting to &#number; notations
in ASCII instead.

My offering on this topic would be the charset checklist -
http://ppewww.ph.gla.ac.uk/~flavell/charset/checklist -
which offers a number of scenarios. But as time goes by, the earlier
techniques (coding in ascii and using &-notations) become less and
less /necessary/ to use, even though they continue to be entirely
/valid/ if you have some other reason to want to use them.

Bottom line: if the questioner's authoring software supports it, then
follow scenario 7 in the checklist - actual utf-8 coded characters.

As Henri says, Windows's internal representation of Unicode uses utf-16
(little-endian, if I'm not mistaken), but for use on the WWW I
would definitely prefer utf-8, which has been in use for quite a while
(it's even supported by that old dog Netscape 4.*, at least to a
degree). But read also the current "which charset" thread for remarks
about forms submission.
Jul 23 '05 #5

> Must HTML source code be an ASCII code, or is it now allowed to be
UNICODE encoded text ?


Your web server specifies the character set in the headers of the HTTP
response that precede the actual HTML. For example, if it sends:

Content-Type: text/html; charset=iso-8859-1

then it is Latin-1, whereas if it sends

Content-Type: text/html; charset=UTF-16

then it is 16-bit Unicode.

So if you set up your web server appropriately you can certainly send
the greek in Unicode, and browsers will understand it.

If the server doesn't specify a character set you may be able to use a
META tag at the start of the document, but generally this will only
work to distinguish between character sets like UTF-8 and iso-8859-1
whose "ASCII" characters overlap; a META tag will not help if you
are sending UTF-16 (I think).
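The reason the META fallback needs an ASCII-compatible encoding can be shown in a couple of lines (a Python sketch, for illustration): the parser must find the `<meta` bytes before it knows the charset, and in UTF-16 even plain ASCII markup is not stored as ASCII bytes.

```python
# A <meta> charset declaration only works if the parser can locate it
# while scanning the raw bytes. In UTF-8 (or ISO-8859-x) the markup is
# plain ASCII bytes; in UTF-16 every character occupies two bytes.

meta = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'

assert b"<meta" in meta.encode("utf-8")       # ASCII bytes survive in UTF-8
assert b"<meta" not in meta.encode("utf-16")  # ...but not in UTF-16
```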

Do read http://www.w3.org/TR/REC-html40/charset.html

--Phil.

Jul 23 '05 #6

Thanks already for all the answers here, they are very enlightening!
I'm beginning to see a bit more clearly in this character jungle.
Patrick.
Jul 23 '05 #7

P: n/a
"Alan J. Flavell" <fl*****@ph.gla.ac.uk> wrote in message news:<Pi*******************************@ppepc56.ph .gla.ac.uk>...
I might add that writing &#number; notations with ASCII characters
certainly produces a rather bullet-proof source, that can be calmly
passed through cross-platform transfers and so forth, in ways that
might result in trashed utf-8-encoded source. But frankly, if you
have a means to author documents using utf-8 encoding, and a proven
way to upload them to the server and serve them out properly, then
there's really nothing to gain from resorting to &#number; notations
in ASCII instead.
Ah, that's very sensible. The problem was that we were using
Dreamweaver 4, which doesn't support any Unicode scheme, and that for
the pages using ancient Greek, we simply switched to the composer of
Netscape (7.2), which does, but (on a Win XP machine) generates
Unicode (which is indeed encoded as UTF-16, as I understand it
now). If you open that code with anything that expects ASCII (such as
a Basic program reading it as a text file) you get a "funny"
file which has a 255 code as its first byte and a 254 code as its
second, and then all true ASCII is indeed encoded as a byte containing
the ASCII code followed by a 0 byte, and the Greek characters are
simply encoded as "first byte value" + 256 x "second byte value".
So I wrote a small Reality Basic program that detects this 255 - 254
initial two-byte sequence, and then replaces each "XX and 00" sequence
simply by XX, and each "XX and YY" sequence by "&#(value of XX
+ 256 x YY);", to make an ASCII file out of it.
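The same byte-pair logic can be sketched in Python (a toy version which, like the Basic original, ignores surrogate pairs):

```python
# Toy re-implementation of the described converter: read UTF-16LE bytes
# (0xFF 0xFE byte-order mark first), emit ASCII with &#NNNN; references.

def utf16le_to_ascii_refs(data: bytes) -> str:
    assert data[:2] == b"\xff\xfe", "expected little-endian BOM"
    out = []
    for i in range(2, len(data) - 1, 2):
        code = data[i] + 256 * data[i + 1]   # "XX + 256 x YY"
        out.append(chr(code) if code < 128 else "&#%d;" % code)
    return "".join(out)

sample = b"\xff\xfe" + "a\u03b1".encode("utf-16-le")  # "aα" with BOM
print(utf16le_to_ascii_refs(sample))  # → a&#945;
```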
However, I discovered yesterday that Dreamweaver MX DOES have Unicode
support.
So I'll see if it can generate true UTF-8 encoded files instead of
the UTF-16 encoded files, which seem to give problems in certain
circumstances, but not in all (and which was the first reason for me
to write this conversion program).

My offering on this topic would be the charset checklist -
http://ppewww.ph.gla.ac.uk/~flavell/charset/checklist -
which offers a number of scenarios. But as time goes by, the earlier
techniques (coding in ascii and using &-notations) become less and
less /necessary/ to use, even though they continue to be entirely
/valid/ if you have some other reason to want to use them.
Ah, thank you. As I said here before, I (mistakenly) thought that
only pure ASCII was allowed in HTML code and that other encodings
slipped through the mazes of sloppiness within a browser. But
apparently this IS a valid way of doing things, *if you know what you
are doing* (I'm - I think - in the process of learning that :-)

Bottom line: if the questioner's authoring software supports it, then
follow scenario 7 in the checklist - actual utf-8 coded characters.

As Henri says, Windows's internal representation of Unicode uses utf-16
(little-endian, if I'm not mistaken), but for use on the WWW I
would definitely prefer utf-8, which has been in use for quite a while
(it's even supported by that old dog Netscape 4.*, at least to a
degree). But read also the current "which charset" thread for remarks
about forms submission.


ok, thanks,
Patrick.
Jul 23 '05 #8

On 13 Mar 2005, Patrick Van Esch wrote:
in writing HTML pages containing ancient greek,


Ancient Greek was written without any accents - so the characters
on http://www.unics.uni-hannover.de/nhtcapri/greek.html7
should be sufficient. If Euripides could do without accents,
you can, too.

Of course, you meant "polytonic Greek" - but "polytonic Greek"
is not the same as "ancient Greek", and "monotonic Greek" is not
the same as "modern Greek".

Jul 23 '05 #9

On 14 Mar 2005, Patrick Van Esch wrote:
So I'll see if this can generate true UTF-8 encoded files instead of
the UTF-16 encoded files, which seem to give problems in certain
circumstances, but not in all (and which was the first reason for me
to write this conversion program).
http://ppewww.ph.gla.ac.uk/~flavell/charset/checklist -

UTF-16 is not recommended at present for the web, due to shortcomings
in browsers and, especially, search engines.
As I told here before, I (mistakenly) thought that
only pure ASCII was allowed in the HTML code but that other encodings
slipped through the mazes of sloppiness within a browser.


It's a good idea to restrict HTML markup to ASCII, i.e. to use only
ISO-8859-x or UTF-8 but not UTF-16 or UTF-32. Otherwise, you may
end up like these in Google:
http://www.google.com/search?q=%22UTF+1+6%22

Jul 23 '05 #10

In <c2**************************@posting.google.com >, on 03/13/2005
at 11:29 AM, va*****@ill.fr (Patrick Van Esch) said:
in writing HTML pages containing ancient greek, there are two
possibilities:
No. You can also encode Unicode data as UTF-8, and that is what I
would recommend. You can also use the standard entity definitions
instead of cryptic numeric values.
Now, the discussion I'm having with other people is the following:
as it is easier to type directly the unicode HTML source, is this,
in general, an acceptable thing to do,
Yes, as long as you're not concerned with browsers that don't support
Unicode.
or is this (that's my viewpoint) a totally unethical thing to do
I can see why you might consider it inadvisable, but not why you would
consider it unethical.
that simply works because of some
sloppiness in Netscape,
Why do you think that it is sloppiness? From your description,
Netscape is doing what it should be doing.
but that HTML source code was never intentioned not to be ASCII text
in the first place ?
HTML is based on SGML, and has evolved considerably since the original
version. It's been a long time since HTML was ASCII-only, and the
current W3C standards specify Unicode.
I would like them to see that I should run their source files through
my program that converts a unicode file into an ASCII file with the
true unicode characters (in casu ancient greek symbols) replaced by
&#XXX ascii character sequences ; their point of view is that this is
bullshit,
I'm inclined to agree with them; the only transformation that makes
sense is UTF-8.
So, what should be the outcome of this (academic) discussion ?
Use UTF-8 for the characters that your editor supports. Use standard
entity names for characters that your editor doesn't support. Only
use numeric values where unavoidable.
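To illustrate, the three notations all name the same character (a Python sketch; `html.unescape` resolves both named entities and numeric references):

```python
import html

raw     = "\u03b1"                  # the character itself, saved as UTF-8
named   = html.unescape("&alpha;")  # standard entity name
numeric = html.unescape("&#945;")   # numeric character reference

assert raw == named == numeric  # all denote GREEK SMALL LETTER ALPHA
```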
Must HTML source code be an ASCII code,
Not for years.
or is it now allowed to be UNICODE encoded text ?


Yes.

--
Shmuel (Seymour J.) Metz, SysProg and JOAT <http://patriot.net/~shmuel>

Unsolicited bulk E-mail subject to legal action. I reserve the
right to publicly post or ridicule any abusive E-mail. Reply to
domain Patriot dot net user shmuel+news to contact me. Do not
reply to sp******@library.lspace.org

Jul 23 '05 #11

On Mon, 14 Mar 2005, Patrick Van Esch wrote:
[...] but (on an Win XP machine) generates
unicode (which is indeed encoded under UTF-16, as I understand it
now). If you open that code with anything that expects ASCII (such as
a basic program or so reading it as a text file) you get a "funny"
file which has as first byte a 255 code, and as second byte a 254
code, and then all true ascii is indeed encoded by a 0 byte preceded
by a byte containing the ascii code, and the greek characters are
simply encoded by "first byte value" + 256 x "second byte value".
If you want to understand what this is, it could be useful to read the
appropriate section in the Unicode standard.

I know that many readers react to this by saying "this is more
complicated than I want to know", but in fact by taking a little extra
time to understand this additional complication, it can save a lot of
confusion later; whereas people who insist on inventing
over-simplified versions of the story inevitably waste time later when
it leads to confusion.

My recommendation would be to read chapter 2
http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf
at least from section 2.4 down to 2.6 inclusive, and in particular
to study table 2-3.

The "character encoding form" utf-16 (see section 2.5) defines the
way to represent any Unicode character by means of 16-bit-wide data
unit(s).

However, in practice (since computer architectures are typically
byte-oriented) you need a way to be more precise about how these
16-bit units will be stored in a computer or transmitted on a
communications channel.

For this reason the "Character Encoding Form" breaks up into a number
of "Character Encoding Schemes". For utf-16 we first break the
categories down by whether the byte-ordering is defined internally or
externally to the data stream.

With utf-16BE and utf-16LE, the byte-ordering (big-endian or
little-endian) is specified externally to the data stream, by the name
of the encoding scheme.

With the utf-16 Encoding Scheme, the byte-ordering is specified by
means of the "byte order mark" at the start of the data (and this is
what you saw on Windows). There are of course two flavours of this
encoding /scheme/, i.e. big and little endian, but they both have the
same /name/, since the BOM itself is sufficient to distinguish between
them.
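A quick sketch of the difference (Python, for illustration): the plain utf-16 codec reads the byte order from the BOM, while the BE/LE codecs take it from the name.

```python
# The utf-16 encoding *scheme* carries its byte order in the data
# itself, via the byte-order mark; utf-16BE / utf-16LE carry it in
# the name of the scheme.

bom_le = b"\xff\xfe" + "A".encode("utf-16-le")   # FF FE 41 00
bom_be = b"\xfe\xff" + "A".encode("utf-16-be")   # FE FF 00 41

# The plain "utf-16" decoder reads the BOM and picks the right order:
assert bom_le.decode("utf-16") == "A"
assert bom_be.decode("utf-16") == "A"

# With the BE/LE variants the order comes from the name; no BOM is
# written, and "A" is just the two bytes 00 41 in big-endian order:
assert "A".encode("utf-16-be") == b"\x00\x41"
```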

And thus the Unicode specification has just three Encoding Scheme
names for what are, in a sense, four different encoding schemes for
the one utf-16 "encoding form".

Notice that the name utf-16 appears both as a Character Encoding Form
(which comprises the Character Encoding Schemes utf-16BE, utf-16LE and
both kinds of utf-16), as well as appearing as a Character Encoding
Scheme. This can be a bit confusing.

The Unicode FAQ also has an informative article on the BOM.

So I wrote a small Reality Basic program that detects this 255 - 254
initial two-byte sequence, and then replaces each "XX and 00" sequence
simply by XX, and if it is "XX and YY" replaces it by "&#(value of XX
+ 256 * YY)", to make an ascii file out of it.


This is good fun for exploring the issues, of course, but you don't
want to do that in practice. There are libraries that are guaranteed
to give the right answers (including surrogates and all that stuff).

Admittedly at the moment you aren't trying to use Byzantine musical
symbols, Linear B, or any of the other stuff that would need
surrogates, but that's no reason to dig oneself into a hole when
there's no good cause to do so.
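To illustrate what surrogates involve: a Linear B syllable lies beyond the 16-bit range, so UTF-16 stores it as two surrogate units, which a proper codec handles transparently (a Python sketch), whereas the simple "XX + 256 x YY" arithmetic described earlier would emit two bogus references for it.

```python
# LINEAR B SYLLABLE B008 A is U+10000, the first character outside the
# Basic Multilingual Plane; UTF-16 represents it as a surrogate pair.

linear_b = "\U00010000"
encoded = linear_b.encode("utf-16-be")

assert encoded == b"\xd8\x00\xdc\x00"          # high surrogate D800, low DC00
assert encoded.decode("utf-16-be") == linear_b  # the codec reassembles it
assert ord(linear_b) == 0x10000
```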

None of this discussion about utf-16 should distract, of course, from
the general recommendations you've been getting in favour of using
utf-8.

best regards
Jul 23 '05 #12
