Can an HTML source file be specified in Unicode?

Patrick Van Esch

Hello,

I have the following problem of principle:
in writing HTML pages containing ancient Greek, there are two
possibilities: one is to write the Unicode characters directly
(encoded as two bytes each) into the HTML source, and save this source not
as an ASCII text, but as a UNICODE text file (using 16 bits per
character, also for the Western ASCII characters, which are usually
encoded as 0x00XX with XX the ASCII code); or to write a pure ASCII
HTML source, where the Greek characters are all encoded as
&#XXXX; references. I even have a small computer program that converts
the former into the latter.
The funny thing is that a browser such as Netscape 7.2 seems to have
no problems accepting a Unicode-encoded source file and displays
everything all right.
Now, the discussion I'm having with other people is the following:
as it is easier to type the Unicode HTML source directly, is this, in
general, an acceptable thing to do, or is it (that's my viewpoint) a
totally unethical thing to do that simply works because of some
sloppiness in Netscape, given that HTML source code was never
intended to be anything other than ASCII text in the first place? I
would like them to see that I should run their source files through my
program, which converts a Unicode file into an ASCII file with the true
Unicode characters (in casu ancient Greek symbols) replaced by &#XXXX;
ASCII character sequences; their point of view is that this is
bullshit, and that the fact that it works in Netscape means that it is
a correct thing to do.

So, what should be the outcome of this (academic) discussion ?
Must HTML source code be an ASCII code, or is it now allowed to be
UNICODE encoded text ?
Thanks for any learned enlightenment,

Patrick.

Jul 23 '05 #1

C A Upsdell

Patrick Van Esch wrote:

So, what should be the outcome of this (academic) discussion ?
Must HTML source code be an ASCII code, or is it now allowed to be
UNICODE encoded text ?

HTML uses unicode.

Jul 23 '05 #2

Henri Sivonen

In article <c2************ **************@ posting.google. com>,
va*****@ill.fr (Patrick Van Esch) wrote:

So, what should be the outcome of this (academic) discussion ?

Editing as straight characters (no &#...;) and saving as UTF-8 (as
opposed to UTF-16 like you were doing; UTF-8 is safer than UTF-16).
Note that Microsoft et al. call UTF-16 "Unicode".

--
Henri Sivonen
hs******@iki.fi
http://hsivonen.iki.fi/
Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html

Jul 23 '05 #3

Peter Greenwood

In article <c2************ **************@ posting.google. com>,

Must HTML source code be an ASCII code, or is it now allowed to be
UNICODE encoded text ?

The Horse's Mouth (tm) is at: http://www.w3.org/TR/html4/charset.html
--
Peter Greenwood pe****@pgid.co. uk
http://www.pgid.co.uk
+44 1253 821678

Jul 23 '05 #4

Alan J. Flavell

On Sun, 13 Mar 2005, C A Upsdell wrote:

Patrick Van Esch wrote:
So, what should be the outcome of this (academic) discussion ?
Must HTML source code be an ASCII code, or is it now allowed to be
UNICODE encoded text ?

HTML uses unicode.

Anyone who *understood* what that cryptic answer meant, would not have
needed to ask the question in the first place!!!

I see that Henri Sivonen has offered a more constructive answer.

I might add that writing &#number; notations with ASCII characters
certainly produces a rather bullet-proof source, that can be calmly
passed through cross-platform transfers and so forth, in ways that
might result in trashed utf-8-encoded source. But frankly, if you
have a means to author documents using utf-8 encoding, and a proven
way to upload them to the server and serve them out properly, then
there's really nothing to gain from resorting to &#number; notations
in ASCII instead.
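
To make the two notations concrete, here is a minimal Python sketch (not anything from the thread itself). The sample word is just an illustration, written with \u escapes so the snippet is itself pure ASCII; `xmlcharrefreplace` is Python's standard error handler for producing decimal &#number; references.

```python
import html

# Illustrative sample only: the Greek word "logos" (lambda, omicron with tonos,
# gamma, omicron, final sigma), spelled with escapes to keep this file ASCII.
text = "\u03bb\u03cc\u03b3\u03bf\u03c2"

# Notation 1: pure-ASCII source, every non-ASCII character written as &#number;
ascii_source = text.encode("ascii", errors="xmlcharrefreplace")

# Notation 2: the same characters stored directly as UTF-8 bytes.
utf8_source = text.encode("utf-8")

print(ascii_source)       # only bytes below 128, safe through ASCII-only tools
print(len(utf8_source))   # two bytes per Greek letter here

# Both notations describe the same characters once a browser has parsed them.
assert html.unescape(ascii_source.decode("ascii")) == utf8_source.decode("utf-8")
```

Either form is valid HTML; the difference is only in how robust the file is against tools that mishandle non-ASCII bytes.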

My offering on this topic would be the charset checklist -
http://ppewww.ph.gla.ac.uk/~flavell/charset/checklist -
which offers a number of scenarios. But as time goes by, the earlier
techniques (coding in ascii and using &-notations) become less and
less /necessary/ to use, even though they continue to be entirely
/valid/ if you have some other reason to want to use them.

Bottom line: if the questioner's authoring software supports it, then
follow scenario 7 in the checklist - actual utf-8 coded characters.

As Henri says, Windows's internal representation of unicode uses utf-16
(little-endian, if I'm not mistaken), but for use on the WWW I
would definitely prefer utf-8, which has been in use for quite a while
(it's even supported by that old dog Netscape 4.*, at least to a
degree). But read also the current "which charset" thread for remarks
about forms submission.

Jul 23 '05 #5

phil_gg04

> Must HTML source code be an ASCII code, or is it now allowed to be
> UNICODE encoded text ?

Your web server specifies the character set in the headers of the HTTP
response that precede the actual HTML. For example, if it sends:

Content-Type: text/html; charset=iso-8859-1
then it is Latin-1, whereas if it sends
Content-Type: text/html; charset=UTF-16
then it is 16-bit Unicode.

So if you set up your web server appropriately you can certainly send
the greek in Unicode, and browsers will understand it.
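
As a rough illustration of what a client does with that header, here is a small Python sketch. The function name `charset_from_content_type` and the iso-8859-1 fallback are my own assumptions for the example, not something a particular browser is known to do; `email.message.Message` is only borrowed from the standard library as a convenient header parser.

```python
from email.message import Message


def charset_from_content_type(header_value: str, default: str = "iso-8859-1") -> str:
    """Return the charset parameter of a Content-Type value (hypothetical helper)."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lower-cases the value and returns None if absent.
    return msg.get_content_charset() or default


# Pretend response body: three Greek letters saved as UTF-16 with a byte-order mark.
body = "\u03b1\u03b2\u03b3".encode("utf-16")

charset = charset_from_content_type("text/html; charset=UTF-16")
print(charset)
print(body.decode(charset) == "\u03b1\u03b2\u03b3")
```

The point is only that the bytes on the wire mean nothing until the declared charset says how to read them.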

If the server doesn't specify a character set you may be able to use a
META tag at the start of the document, but generally this will only
work to distinguish between character sets like UTF-8 and iso-8859-1
where the "ASCII" characters overlap; a META tag will not help if you
are sending UTF-16 (I think).
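
The byte-level reason behind that caveat can be seen with a short Python sketch. The META tag below is just a typical example, and the check says nothing about what any given browser actually does to sniff encodings; it only shows which encodings leave the tag readable as plain ASCII bytes.

```python
# A typical charset declaration, used here purely as sample markup.
markup = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
ascii_bytes = markup.encode("ascii")

# For each encoding, is the encoded tag byte-for-byte identical to ASCII?
same_as_ascii = {
    encoding: markup.encode(encoding) == ascii_bytes
    for encoding in ("iso-8859-1", "utf-8", "utf-16-le")
}
print(same_as_ascii)

# In UTF-16 every ASCII character is followed by a zero byte, so a parser
# scanning for the ASCII bytes of "<meta" will not find them.
print(b"<meta" in markup.encode("utf-16-le"))
```

So a reader that has not yet been told the encoding can still find the tag in iso-8859-1 or UTF-8, but not in UTF-16 without first guessing the encoding some other way.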

Do read http://www.w3.org/TR/REC-html40/charset.html

--Phil.

Jul 23 '05 #6

Patrick Van Esch

Thanks already for all the answers here, they are very enlightening!
I'm beginning to see a bit more clearly in this character jungle.
Patrick.

Jul 23 '05 #7

Patrick Van Esch

"Alan J. Flavell" <fl*****@ph.gla .ac.uk> wrote in message news:<Pi******* *************** *********@ppepc 56.ph.gla.ac.uk >...

I might add that writing &#number; notations with ASCII characters
certainly produces a rather bullet-proof source, that can be calmly
passed through cross-platform transfers and so forth, in ways that
might result in trashed utf-8-encoded source. But frankly, if you
have a means to author documents using utf-8 encoding, and a proven
way to upload them to the server and serve them out properly, then
there's really nothing to gain from resorting to &#number; notations
in ASCII instead.

Ah, that's very sensible. The problem was that we were using
Dreamweaver 4, which doesn't support any Unicode scheme, so for
the pages using ancient Greek we simply switched to the Composer of
Netscape (7.2), which does, but which (on a Win XP machine) generates
"Unicode" (which is in fact UTF-16, as I understand it
now). If you open that file with anything that expects ASCII (such as
a Basic program reading it as a text file) you get a "funny"
file which has as first byte a 255 code, and as second byte a 254
code; after that, each true ASCII character is encoded as a byte
containing the ASCII code followed by a 0 byte, and each Greek
character is simply encoded as "first byte value" + 256 x "second byte
value".
So I wrote a small Reality Basic program that detects this 255 - 254
initial two-byte sequence, and then replaces each "XX and 00" sequence
simply by XX, and if it is "XX and YY" replaces it by "&#(value of XX
+ 256 * YY);", to make an ASCII file out of it.
However, I discovered yesterday that Dreamweaver MX DOES have Unicode
support.
So I'll see if this can generate true UTF-8 encoded files instead of
the UTF-16 encoded files, which seem to give problems in certain
circumstances, but not in all (and which was the first reason for me
to write this conversion program).
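
For anyone wanting to try the same conversion, here is a minimal Python sketch of the idea described above. It is not the original Basic program: the function name `utf16le_to_ascii` and the sample bytes are made up for illustration, and instead of looping over byte pairs by hand it leans on Python's own UTF-16 decoder, which also copes with surrogate pairs that a plain two-byte loop would split.

```python
import codecs


def utf16le_to_ascii(data: bytes) -> bytes:
    """Turn BOM-prefixed UTF-16LE bytes into pure ASCII with &#number; references."""
    bom = codecs.BOM_UTF16_LE  # the two bytes 255, 254 (0xFF 0xFE)
    if not data.startswith(bom):
        raise ValueError("no UTF-16 little-endian byte-order mark found")
    # Decoding combines each byte pair as first byte + 256 * second byte.
    text = data[len(bom):].decode("utf-16-le")
    # ASCII characters pass through; everything else becomes &#number;
    return text.encode("ascii", errors="xmlcharrefreplace")


# Hypothetical sample: "<p>", alpha, beta, "</p>" saved as UTF-16LE with a BOM.
sample = codecs.BOM_UTF16_LE + "<p>\u03b1\u03b2</p>".encode("utf-16-le")

print(list(sample[:4]))          # BOM bytes, then "<" followed by a zero byte
print(utf16le_to_ascii(sample))
```

For alpha the two stored bytes are 177 and 3, and 177 + 256 x 3 = 945, which is the number that ends up in the reference.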

My offering on this topic would be the charset checklist -
http://ppewww.ph.gla.ac.uk/~flavell/charset/checklist -
which offers a number of scenarios. But as time goes by, the earlier
techniques (coding in ascii and using &-notations) become less and
less /necessary/ to use, even though they continue to be entirely
/valid/ if you have some other reason to want to use them.

Ah, thank you. As I told here before, I (mistakenly) thought that
only pure ASCII was allowed in the HTML code but that other encodings
slipped through the mazes of sloppiness within a browser. But
apparently this IS a valid way of doing things, *if you know what you
are doing* (I'm - I think - in the process of learning that :-)

Bottom line: if the questioner's authoring software supports it, then
follow scenario 7 in the checklist - actual utf-8 coded characters.

As Henri says, Windows's internal representation of unicode uses utf-16
(little-endian, if I'm not mistaken), but for use on the WWW I
would definitely prefer utf-8, which has been in use for quite a while
(it's even supported by that old dog Netscape 4.*, at least to a
degree). But read also the current "which charset" thread for remarks
about forms submission.

ok, thanks,
Patrick.

Jul 23 '05 #8

Andreas Prilop

On 13 Mar 2005, Patrick Van Esch wrote:

in writing HTML pages containing ancient greek,

Ancient Greek was written without any accents - so the characters
on http://www.unics.uni-hannover.de/nhtcapri/greek.html7
should be sufficient. If Euripides could do without accents,
you can do, too.

Of course, you meant "polytonic Greek" - but "polytonic Greek"
is not the same as "ancient Greek" and "monotonic Greek" is not
the same as "modern Greek".

Jul 23 '05 #9

Andreas Prilop

On 14 Mar 2005, Patrick Van Esch wrote:

So I'll see if this can generate true UTF-8 encoded files instead of
the UTF-16 encoded files, which seem to give problems in certain
circumstances, but not in all (and which was the first reason for me
to write this conversion program).

http://ppewww.ph.gla.ac.uk/~flavell/charset/checklist -

UTF-16 is not recommended at present for the web due to browser and,
especially, search engine shortcomings.

As I told here before, I (mistakenly) thought that
only pure ASCII was allowed in the HTML code but that other encodings
slipped through the mazes of sloppiness within a browser.

It's a good idea to keep the HTML markup itself ASCII-compatible,
i.e. to use only ISO-8859-x or UTF-8 but not UTF-16 or UTF-32.
Otherwise, you may end up like these pages in Google:
http://www.google.com/search?q=%22UTF+1+6%22

Jul 23 '05 #10
