Hello,
I have the following problem of principle:
in writing HTML pages containing ancient Greek, there are two
possibilities: one is to write the Unicode characters directly
(encoded as two bytes) into the HTML source, and save this source not
as an ASCII text, but as a Unicode text file (using 16 bits per
character, also for the Western ASCII characters, which are usually
encoded as 0x00XX with XX the ASCII code); or to write a pure ASCII
HTML source, where the Greek characters are all encoded as
&#XXXX; references. I even have a small computer program that converts
the former into the latter.
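As a rough illustration of the second possibility (pure ASCII with &#XXXX; references), here is a minimal Python sketch. It is not the poster's program; the helper name `to_ascii_ncr` is made up for the example, and it assumes the text is already available as a decoded string.

```python
def to_ascii_ncr(text: str) -> str:
    """Replace every non-ASCII character by a decimal numeric character
    reference such as &#955; (hypothetical helper, for illustration only)."""
    # 'xmlcharrefreplace' is a standard Python codec error handler: any
    # character that ASCII cannot represent is written as &#N; instead.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


sample = "<p>\u03bb\u03cc\u03b3\u03bf\u03c2</p>"  # the Greek word "logos"
print(to_ascii_ncr(sample))  # <p>&#955;&#972;&#947;&#959;&#962;</p>
```

Plain ASCII markup passes through unchanged, so the result can be saved as an ordinary ASCII file.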
The funny thing is that a browser such as Netscape 7.2 seems to have
no problems accepting a Unicode-encoded source file and displays
everything all right.
Now, the discussion I'm having with other people is the following:
as it is easier to type the Unicode HTML source directly, is this, in
general, an acceptable thing to do, or is this (that's my viewpoint) a
totally unethical thing to do that simply works because of some
sloppiness in Netscape, while HTML source code was never
intended to be anything but ASCII text in the first place? I would like
them to see that I should run their source files through my program
that converts a Unicode file into an ASCII file with the true Unicode
characters (in this case ancient Greek symbols) replaced by &#XXXX; ASCII
character sequences; their point of view is that this is bullshit,
and that the fact that it works in Netscape means that it is a
correct thing to do.
So, what should be the outcome of this (academic) discussion ?
Must HTML source code be an ASCII code, or is it now allowed to be
UNICODE encoded text ?
thanks for any learned enlightenment,
Patrick.
Patrick Van Esch wrote:
> So, what should be the outcome of this (academic) discussion ?
> Must HTML source code be an ASCII code, or is it now allowed to be
> UNICODE encoded text ?
HTML uses unicode.
On Sun, 13 Mar 2005, C A Upsdell wrote:
> Patrick Van Esch wrote:
>> So, what should be the outcome of this (academic) discussion ?
>> Must HTML source code be an ASCII code, or is it now allowed to be
>> UNICODE encoded text ?
> HTML uses unicode.
Anyone who *understood* what that cryptic answer meant, would not have
needed to ask the question in the first place!!!
I see that Henri Sivonen has offered a more constructive answer.
I might add that writing &#number; notations with ASCII characters
certainly produces a rather bullet-proof source, that can be calmly
passed through cross-platform transfers and so forth, in ways that
might result in trashed utf-8-encoded source. But frankly, if you
have a means to author documents using utf-8 encoding, and a proven
way to upload them to the server and serve them out properly, then
there's really nothing to gain from resorting to &#number; notations
in ASCII instead.
My offering on this topic would be the charset checklist - http://ppewww.ph.gla.ac.uk/~flavell/charset/checklist -
which offers a number of scenarios. But as time goes by, the earlier
techniques (coding in ascii and using &-notations) become less and
less /necessary/ to use, even though they continue to be entirely
/valid/ if you have some other reason to want to use them.
Bottom line: if the questioner's authoring software supports it, then
follow scenario 7 in the checklist - actual utf-8 coded characters.
As Henri says, Windows's internal representation of unicode uses utf16
(little-endian, if I'm not mistaken), but for use on the WWW I
would definitely prefer utf-8, which has been in use for quite a while
(it's even supported by that old dog Netscape 4.*, at least to a
degree). But read also the current "which charset" thread for remarks
about forms submission.
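For anyone whose editor only saves UTF-16, re-encoding to utf-8 is a short step. The following is a minimal Python sketch, assuming the input starts with a byte order mark; the function name `utf16_to_utf8` is invented for the example.

```python
def utf16_to_utf8(data: bytes) -> bytes:
    """Re-encode UTF-16 bytes (starting with a byte order mark) as UTF-8."""
    # The "utf-16" codec reads the byte order mark to choose the byte
    # order and does not keep it in the decoded text.
    return data.decode("utf-16").encode("utf-8")


# Byte order mark (255, 254) followed by a little-endian Greek alpha.
utf16 = b"\xff\xfe" + "\u03b1".encode("utf-16-le")
print(utf16_to_utf8(utf16))  # b'\xce\xb1'
```

The server still has to announce the result as utf-8 for browsers to read it correctly.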
> Must HTML source code be an ASCII code, or is it now allowed to be UNICODE encoded text ?
Your web server specifies the character set in the headers of the HTTP
response that precede the actual HTML. For example, if it sends:
Content-Type: text/html; charset=iso-8859-1
then it is latin 1, whereas if it sends
Content-Type: text/html; charset=UTF-16
then it is 16-bit unicode.
So if you set up your web server appropriately you can certainly send
the greek in Unicode, and browsers will understand it.
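To make the role of the header concrete, here is a small Python sketch of how a client might pick the decoder from the Content-Type value. The helper name `charset_from_content_type` and the iso-8859-1 fallback are assumptions made for this example, not a description of what any particular browser does.

```python
from email.message import Message


def charset_from_content_type(header_value: str,
                              default: str = "iso-8859-1") -> str:
    """Return the charset parameter of a Content-Type header value.

    The fallback default is an assumption made for this sketch.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the charset in lower case, or None
    # when the header carries no charset parameter.
    return msg.get_content_charset() or default


# The same three Greek letters, sent as UTF-16 with a byte order mark.
body = "\u03b1\u03b2\u03b3".encode("utf-16")
charset = charset_from_content_type("text/html; charset=UTF-16")
text = body.decode(charset)
print(charset, len(text))  # utf-16 3
```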
If the server doesn't specify a character set you may be able to use a
META tag in the start of the document, but generally this will only
work to distinguish between character sets like UTF-8 and iso-8859-1
where the "ASCII" characters overlap; a META tag will not help if you
are sending UTF-16 (I think).
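One way to see the difficulty with UTF-16: a parser that has not yet been told the encoding can only look for the ASCII bytes of "<meta", and in UTF-16 those bytes are separated by zero bytes. The Python sketch below only demonstrates that byte-level difference; it simplifies what real browsers do (they may, for instance, look at a byte order mark), and the function name is made up.

```python
def ascii_scan_finds_meta(raw: bytes) -> bool:
    """Return True if the ASCII byte sequence for '<meta' occurs in raw."""
    return b"<meta" in raw.lower()


# Example META declaration; its charset value does not matter for this test.
markup = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'

for encoding in ("iso-8859-1", "utf-8", "utf-16-le"):
    raw = markup.encode(encoding)
    print(encoding, "-> '<meta' found:", ascii_scan_finds_meta(raw))
```

The scan succeeds for iso-8859-1 and utf-8, where the ASCII range is encoded identically, and fails for utf-16-le.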
Do read http://www.w3.org/TR/REC-html40/charset.html
--Phil.
Thanks already for all answers here, they are very enlightening!
I'm beginning to see a bit more clearly in this character jungle.
Patrick.
"Alan J. Flavell" wrote:
> I might add that writing &#number; notations with ASCII characters
> certainly produces a rather bullet-proof source, that can be calmly
> passed through cross-platform transfers and so forth, in ways that
> might result in trashed utf-8-encoded source. But frankly, if you
> have a means to author documents using utf-8 encoding, and a proven
> way to upload them to the server and serve them out properly, then
> there's really nothing to gain from resorting to &#number; notations
> in ASCII instead.
Ah, that's very sensible. The problem (was) that we were using
Dreamweaver 4, which doesn't support any unicode scheme, and that for
the pages using ancient greek, we switched simply to the composer of
Netscape (7.2), which does so, but (on a Win XP machine) generates
unicode (which is indeed encoded under UTF-16, as I understand it
now). If you open that code with anything that expects ASCII (such as
a basic program or so reading it as a text file) you get a "funny"
file which has as first byte a 255 code, and as second byte a 254
code, and then all true ascii is indeed encoded by a 0 byte preceded
by a byte containing the ascii code, and the greek characters are
simply encoded by "first byte value" + 256 x "second byte value".
So I wrote a small Reality Basic program that detects this 255 - 254
initial two-byte sequence, and then replaces each "XX and 00" sequence
simply by XX, and if it is "XX and YY" replaces it by "&#(value of XX
+ 256 * YY);", to make an ASCII file out of it.
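The byte-level procedure described above can be sketched in Python roughly as follows. This is not the original program: the function name is invented, and the sketch deviates in one respect by also writing values from 128 to 255 as references so that the output stays pure ASCII. It does not handle characters outside the 16-bit range.

```python
def utf16le_to_ascii_html(data: bytes) -> str:
    """Convert little-endian UTF-16 bytes with a 255, 254 byte order mark
    into ASCII text with numeric character references (illustration only)."""
    if data[:2] != b"\xff\xfe":
        raise ValueError("expected the 255, 254 byte order mark")
    body = data[2:]
    out = []
    # Walk through the file two bytes at a time: low byte, then high byte.
    for i in range(0, len(body) - 1, 2):
        low, high = body[i], body[i + 1]
        code = low + 256 * high
        if code < 128:
            out.append(chr(code))     # true ASCII: keep the character
        else:
            out.append(f"&#{code};")  # anything else: numeric reference
    return "".join(out)


sample = b"\xff\xfe" + "<b>\u03b1</b>".encode("utf-16-le")
print(utf16le_to_ascii_html(sample))  # <b>&#945;</b>
```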
However, I discovered yesterday that Dreamweaver MX DOES have unicode
support.
So I'll see if this can generate true UTF-8 encoded files instead of
the UTF-16 encoded files, which seem to give problems in certain
circumstances, but not in all (and which was the first reason for me
to write this conversion program).
> My offering on this topic would be the charset checklist -
> http://ppewww.ph.gla.ac.uk/~flavell/charset/checklist -
> which offers a number of scenarios. But as time goes by, the earlier
> techniques (coding in ascii and using &-notations) become less and
> less /necessary/ to use, even though they continue to be entirely
> /valid/ if you have some other reason to want to use them.
Ah, thank you. As I told here before, I (mistakenly) thought that
only pure ASCII was allowed in the HTML code but that other encodings
slipped through the mazes of sloppiness within a browser. But
apparently this IS a valid way of doing things, *if you know what you
are doing* (I'm - I think - in the process of learning that :-)
> Bottom line: if the questioner's authoring software supports it, then
> follow scenario 7 in the checklist - actual utf-8 coded characters.
> As Henri says, Windows's internal representation of unicode uses utf16
> (little-endian, if I'm not mistaken), but for use on the WWW I
> would definitely prefer utf-8, which has been in use for quite a while
> (it's even supported by that old dog Netscape 4.*, at least to a
> degree). But read also the current "which charset" thread for remarks
> about forms submission.
ok, thanks,
Patrick.
On 13 Mar 2005, Patrick Van Esch wrote:
> in writing HTML pages containing ancient greek,
Ancient Greek was written without any accents - so the characters
on http://www.unics.uni-hannover.de/nhtcapri/greek.html7
should be sufficient. If Euripides could do without accents,
you can do, too.
Of course, you meant "polytonic Greek" - but "polytonic Greek"
is not the same as "ancient Greek" and "monotonic Greek" is not
the same as "modern Greek".
On 14 Mar 2005, Patrick Van Esch wrote:
> So I'll see if this can generate true UTF-8 encoded files instead of
> the UTF-16 encoded files, which seem to give problems in certain
> circumstances, but not in all (and which was the first reason for me
> to write this conversion program).
http://ppewww.ph.gla.ac.uk/~flavell/charset/checklist -
UTF-16 is not recommended at present for the web due to browser and,
especially, search engine shortcomings.
> As I told here before, I (mistakenly) thought that only pure ASCII was
> allowed in the HTML code but that other encodings slipped through the
> mazes of sloppiness within a browser.
It's a good idea to keep the HTML markup itself readable as ASCII,
i.e. to use only ASCII-compatible encodings such as ISO-8859-x or
UTF-8, but not UTF-16 or UTF-32. Otherwise, you may
end up like these in Google: http://www.google.com/search?q=%22UTF+1+6%22