Can an HTML source file be specified in unicode ?

Hello,

I have the following question of principle:
when writing HTML pages containing ancient Greek, there are two
possibilities: one is to write the Unicode characters directly
(encoded as two bytes) into the HTML source, and save this source not
as an ASCII text, but as a UNICODE text file (using 16 bits per
character, also for the Western ASCII characters, which are usually
encoded as 0x00XX with XX the ASCII code); or to write a pure ASCII
HTML source, where the Greek characters are all encoded with the
&#XXXX; notation. I even have a small computer program that converts the
former into the latter.
The funny thing is that a browser such as Netscape 7.2 seems to have
no problem accepting a Unicode-encoded source file and displays
everything all right.
Now, the discussion I'm having with other people is the following:
as it is easier to type the Unicode HTML source directly, is this, in
general, an acceptable thing to do, or is this (that's my viewpoint) a
totally unethical thing to do that simply works because of some
sloppiness in Netscape, while HTML source code was never
intended to be anything other than ASCII text in the first place? I would like
them to see that I should run their source files through my program
that converts a Unicode file into an ASCII file with the true Unicode
characters (in casu ancient Greek symbols) replaced by &#XXXX; ASCII
character sequences; their point of view is that this is bullshit,
and that the fact that it works in Netscape means that it is a
correct thing to do.

So, what should be the outcome of this (academic) discussion ?
Must HTML source code be an ASCII code, or is it now allowed to be
UNICODE encoded text ?
Thanks for any learned enlightenment,

Patrick.
Jul 23 '05 #1
Patrick Van Esch wrote:
So, what should be the outcome of this (academic) discussion ?
Must HTML source code be an ASCII code, or is it now allowed to be
UNICODE encoded text ?


HTML uses unicode.
Jul 23 '05 #2
In article <c2************ **************@ posting.google. com>,
va*****@ill.fr (Patrick Van Esch) wrote:
So, what should be the outcome of this (academic) discussion ?


Editing as straight characters (no &#...;) and saving as UTF-8 (as
opposed to UTF-16 like you were doing; UTF-8 is safer than UTF-16).
Microsoft et al. call UTF-16 Unicode.
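
To make the "UTF-8 is safer" remark concrete, here is a minimal sketch, assuming Python 3. The sample markup is invented for illustration. It encodes the same string both ways and checks whether the ASCII tags survive as plain ASCII bytes.

```python
# Sample markup mixing ASCII tags with Greek text (illustrative only).
sample = "<p>\u03bb\u03cc\u03b3\u03bf\u03c2</p>"  # <p>λόγος</p>

utf8_bytes = sample.encode("utf-8")
utf16_bytes = sample.encode("utf-16-le")  # little-endian, no byte-order mark

# In UTF-8 the ASCII markup is stored byte-for-byte as ASCII.
ascii_tags_intact_utf8 = utf8_bytes.startswith(b"<p>") and utf8_bytes.endswith(b"</p>")
# In UTF-16 every ASCII character is followed by a zero byte,
# so a tool expecting ASCII no longer sees "<p>".
ascii_tags_intact_utf16 = utf16_bytes.startswith(b"<p>")

print(len(utf8_bytes), ascii_tags_intact_utf8)
print(len(utf16_bytes), ascii_tags_intact_utf16)
```

Any tool that only understands ASCII can still find the tags in the UTF-8 file, which is one plausible reading of "safer" here.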

--
Henri Sivonen
hs******@iki.fi
http://hsivonen.iki.fi/
Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html
Jul 23 '05 #3
In article <c2************ **************@ posting.google. com>,
Must HTML source code be an ASCII code, or is it now allowed to be
UNICODE encoded text ?


The Horse's Mouth (tm) is at: http://www.w3.org/TR/html4/charset.html
--
Peter Greenwood pe****@pgid.co. uk
http://www.pgid.co.uk
+44 1253 821678
Jul 23 '05 #4
On Sun, 13 Mar 2005, C A Upsdell wrote:
Patrick Van Esch wrote:
So, what should be the outcome of this (academic) discussion ?
Must HTML source code be an ASCII code, or is it now allowed to be
UNICODE encoded text ?


HTML uses unicode.


Anyone who *understood* what that cryptic answer meant, would not have
needed to ask the question in the first place!!!

I see that Henri Sivonen has offered a more constructive answer.

I might add that writing &#number; notations with ASCII characters
certainly produces a rather bullet-proof source, that can be calmly
passed through cross-platform transfers and so forth, in ways that
might result in trashed utf-8-encoded source. But frankly, if you
have a means to author documents using utf-8 encoding, and a proven
way to upload them to the server and serve them out properly, then
there's really nothing to gain from resorting to &#number; notations
in ASCII instead.
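
The equivalence of the two notations can be checked with a small sketch, assuming Python 3 and its standard `html` module. The Greek word is an arbitrary example, not taken from the questioner's pages.

```python
import html

# "λόγος" written as numeric character references: pure ASCII source.
ncr_source = "&#955;&#972;&#947;&#959;&#962;"
# The same word as literal characters, as it would sit in a UTF-8 file.
literal_source = "\u03bb\u03cc\u03b3\u03bf\u03c2"

# Resolving the references yields exactly the literal characters.
decoded = html.unescape(ncr_source)
print(decoded == literal_source)
```

Both sources describe the same document text; the difference is only in how robustly the bytes survive transfer and how readable the source is.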

My offering on this topic would be the charset checklist -
http://ppewww.ph.gla.ac.uk/~flavell/charset/checklist -
which offers a number of scenarios. But as time goes by, the earlier
techniques (coding in ascii and using &-notations) become less and
less /necessary/ to use, even though they continue to be entirely
/valid/ if you have some other reason to want to use them.

Bottom line: if the questioner's authoring software supports it, then
follow scenario 7 in the checklist - actual utf-8 coded characters.

As Henri says, Windows's internal representation of unicode uses utf16
(little-endian, if I'm not mistaken), but for use on the WWW I
would definitely prefer utf-8, which has been in use for quite a while
(it's even supported by that old dog Netscape 4.*, at least to a
degree). But read also the current "which charset" thread for remarks
about forms submission.
Jul 23 '05 #5
> Must HTML source code be an ASCII code, or is it now allowed to be
UNICODE encoded text ?


Your web server specifies the character set in the headers of the HTTP
response that precede the actual HTML. For example, if it sends:

Content-Type: text/html; charset=iso-8859-1
then it is latin 1, whereas if it sends
Content-Type: text/html; charset=UTF-16
then it is 16-bit unicode.

So if you set up your web server appropriately you can certainly send
the greek in Unicode, and browsers will understand it.
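
As a rough sketch of what a client does with that header, assuming Python 3: the helper name `charset_from_content_type` and the fallback value are assumptions made for this example, not part of any browser's actual code. The header is parsed with the standard `email.message.Message` class.

```python
from email.message import Message

def charset_from_content_type(header_value, default="iso-8859-1"):
    """Return the charset parameter of a Content-Type header value.

    The fallback is an assumption for this sketch; real clients apply
    their own defaulting and sniffing rules.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset() or default

# Pretend response body: Greek text encoded as UTF-16 with a byte-order mark.
body = "<p>\u03b1\u03b2\u03b3</p>".encode("utf-16")

charset = charset_from_content_type("text/html; charset=UTF-16")
text = body.decode(charset)
print(charset, text == "<p>\u03b1\u03b2\u03b3</p>")
```

The point is that the bytes alone are ambiguous; the declared charset tells the receiver how to turn them back into characters.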

If the server doesn't specify a character set you may be able to use a
META tag in the start of the document, but generally this will only
work to distinguish between character sets like UTF-8 and iso-8859-1
where the "ASCII" characters overlap; a META tag will not help if you
are sending UTF-16 (I think).
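
The byte-level reason behind that caveat can be sketched as follows, assuming Python 3. The functions `make_page` and `ascii_scan_finds_meta` are hypothetical helpers for this example; real browsers use more elaborate detection, including looking at a byte-order mark.

```python
def make_page(charset):
    """Build a tiny page whose META tag declares the given charset."""
    return ('<html><head><meta http-equiv="Content-Type" '
            'content="text/html; charset=%s"></head></html>' % charset)

def ascii_scan_finds_meta(raw_bytes):
    """Look for the declaration the way a naive ASCII byte scan would."""
    return b"charset=" in raw_bytes.lower()

# Encode each page in the encoding it declares, then scan the raw bytes.
found = {
    enc: ascii_scan_finds_meta(make_page(enc).encode(enc))
    for enc in ("iso-8859-1", "utf-8", "utf-16")
}
print(found)
```

In the ASCII-compatible encodings the declaration is readable before the encoding is known; in UTF-16 the interleaved zero bytes hide it from such a scan.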

Do read http://www.w3.org/TR/REC-html40/charset.html

--Phil.

Jul 23 '05 #6
Thanks already for all answers here, they are very enlightening!
I'm beginning to see a bit more clearly in this character jungle.
Patrick.
Jul 23 '05 #7
"Alan J. Flavell" <fl*****@ph.gla .ac.uk> wrote in message news:<Pi******* *************** *********@ppepc 56.ph.gla.ac.uk >...
I might add that writing &#number; notations with ASCII characters
certainly produces a rather bullet-proof source, that can be calmly
passed through cross-platform transfers and so forth, in ways that
might result in trashed utf-8-encoded source. But frankly, if you
have a means to author documents using utf-8 encoding, and a proven
way to upload them to the server and serve them out properly, then
there's really nothing to gain from resorting to &#number; notations
in ASCII instead.
Ah, that's very sensible. The problem (was) that we were using
Dreamweaver 4, which doesn't support any unicode scheme, and that for
the pages using ancient greek, we switched simply to the composer of
Netscape (7.2), which does so, but (on a Win XP machine) generates
unicode (which is indeed encoded under UTF-16, as I understand it
now). If you open that code with anything that expects ASCII (such as
a basic program or so reading it as a text file) you get a "funny"
file which has as first byte a 255 code, and as second byte a 254
code, and then all true ascii is indeed encoded by a 0 byte preceded
by a byte containing the ascii code, and the greek characters are
simply encoded by "first byte value" + 256 x "second byte value".
So I wrote a small Reality Basic program that detects this 255 - 254
initial two-byte sequence, and then replaces each "XX and 00" sequence
simply by XX, and if it is "XX and YY" replaces it by "&#(value of XX
+ 256 * YY);", to make an ASCII file out of it.
However, I discovered yesterday that Dreamweaver MX DOES have unicode
support.
So I'll see if this can generate true UTF-8 encoded files instead of
the UTF-16 encoded files, which seem to give problems in certain
circumstances, but not in all (and which was the first reason for me
to write this conversion program).
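
For comparison, the conversion described above can be sketched in a few lines, assuming Python 3. This is not the poster's program: the function name `utf16_to_ascii_ncr` is made up for this example, and it relies on Python's built-in codecs instead of byte arithmetic.

```python
import codecs

def utf16_to_ascii_ncr(raw):
    """Turn UTF-16 bytes (with a byte-order mark) into ASCII bytes,
    writing every non-ASCII character as a &#N; reference."""
    if not raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        raise ValueError("no UTF-16 byte-order mark found")
    text = raw.decode("utf-16")  # the codec reads and drops the BOM
    return text.encode("ascii", errors="xmlcharrefreplace")

# Bytes 255, 254 (the little-endian BOM) followed by "<p>α</p>" in UTF-16LE.
raw = b"\xff\xfe" + "<p>\u03b1</p>".encode("utf-16-le")
converted = utf16_to_ascii_ncr(raw)
print(converted)

# The alpha is stored as bytes 0xB1, 0x03: first byte + 256 * second byte.
alpha_value = 0xB1 + 256 * 0x03
```

Going through the codec also copes with characters outside the 16-bit range, which are stored as two 16-bit units and which simple per-pair arithmetic would mis-number.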

My offering on this topic would be the charset checklist -
http://ppewww.ph.gla.ac.uk/~flavell/charset/checklist -
which offers a number of scenarios. But as time goes by, the earlier
techniques (coding in ascii and using &-notations) become less and
less /necessary/ to use, even though they continue to be entirely
/valid/ if you have some other reason to want to use them.
Ah, thank you. As I told here before, I (mistakenly) thought that
only pure ASCII was allowed in the HTML code but that other encodings
slipped through the mazes of sloppiness within a browser. But
apparently this IS a valid way of doing things, *if you know what you
are doing* (I'm - I think - in the process of learning that :-)

Bottom line: if the questioner's authoring software supports it, then
follow scenario 7 in the checklist - actual utf-8 coded characters.

As Henri says, Windows's internal representation of unicode uses utf16
(little-endian, if I'm not mistaken), but for use on the WWW I
would definitely prefer utf-8, which has been in use for quite a while
(it's even supported by that old dog Netscape 4.*, at least to a
degree). But read also the current "which charset" thread for remarks
about forms submission.


ok, thanks,
Patrick.
Jul 23 '05 #8
On 13 Mar 2005, Patrick Van Esch wrote:
in writing HTML pages containing ancient greek,


Ancient Greek was written without any accents - so the characters
on http://www.unics.uni-hannover.de/nhtcapri/greek.html7
should be sufficient. If Euripides could do without accents,
you can do, too.

Of course, you meant "polytonic Greek" - but "polytonic Greek"
is not the same as "ancient Greek" and "monotonic Greek" is not
the same as "modern Greek".

Jul 23 '05 #9
On 14 Mar 2005, Patrick Van Esch wrote:
So I'll see if this can generate true UTF-8 encoded files instead of
the UTF-16 encoded files, which seem to give problems in certain
circumstances, but not in all (and which was the first reason for me
to write this conversion program).
http://ppewww.ph.gla.ac.uk/~flavell/charset/checklist -

UTF-16 is not recommended at present for the web due to browser and,
especially, search-engine shortcomings.
As I told here before, I (mistakenly) thought that
only pure ASCII was allowed in the HTML code but that other encodings
slipped through the mazes of sloppiness within a browser.


It's a good idea to keep HTML markup in an ASCII-compatible encoding, i.e. using only
ISO-8859-x or UTF-8 but not UTF-16 or UTF-32. Otherwise, you may
end up like these in Google:
http://www.google.com/search?q=%22UTF+1+6%22

Jul 23 '05 #10
