Unicode and html - help for simple web site


I have a home-made website that provides a free
1100-page physics textbook. It is written in HTML and
CSS. I recently added some Chinese text, and
since then there have been problems.

The entry page has two Chinese characters,
but these are not displayed in all browsers, even
though the page passes
the W3C validator.
( http://www.motionmountain.net/welcome.html)
(1) Why not?

Other pages do not pass the W3C validator
( http://www.motionmountain.net/contents.html)
(2) What is wrong here?

Since I plan to add more languages, and the Unicode
issues are so tough:

(3) When uploading Unicode files via FTP,
which line endings have to be used (Mac, Unix, other)?
ASCII mode or binary mode? (I have Mac OS X)

(4) To get IE to read the pages, is it best to use UTF-8
or UTF-16?

Thank you for any help!

Christoph

Aug 25 '05 #1
ch***********@yahoo.com wrote:
The entry page has two chinese characters,
but these are not seen on all browsers, even
though the page is validated by
the w3c validator.
( http://www.motionmountain.net/welcome.html)
(1) Why not?
It validates just because the validator is so permissive and does not
care about the conflict between the encoding you declare in the meta tag
(ISO-8859-1) and the encoding you actually use. In fact, I would say
that the validator is in error here: the encoding is specified as
ISO-8859-1 (the meta tag takes effect when no charset is specified in
HTTP headers), so the first two octets of the data _must_ be interpreted
as þÿ (Latin letter small thorn and Latin letter small y with
diaeresis), which of course violate HTML syntax when appearing before a
DOCTYPE declaration.

The validator incorrectly guesses that þÿ is meant to act as a byte
order mark in UTF-16 encoding and therefore treats the document as
UTF-16 encoded. (The guess is "correct" of course in a pragmatic sense,
but it's still an error.)
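
The ambiguity described above can be reproduced in a few lines of Python. This is only a sketch: the sample bytes are made up for illustration and are not taken from the actual welcome.html.

```python
import codecs

# Hypothetical start of a big-endian UTF-16 file: a byte order mark
# (octets FE FF) followed by the two characters "<!".
raw = codecs.BOM_UTF16_BE + "<!".encode("utf-16-be")

# Read as ISO-8859-1, the two BOM octets turn into the letters
# thorn and y-with-diaeresis, and every ASCII character gains a NUL.
as_latin1 = raw.decode("iso-8859-1")

# Read as UTF-16, the BOM is consumed and only the real text remains.
as_utf16 = raw.decode("utf-16")

print(repr(as_latin1))
print(repr(as_utf16))
```

The same octets give two different documents depending on which declaration the reader trusts, which is why the meta tag and the actual encoding have to agree.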

Browsers may behave in the same incorrect way, or they may correctly
interpret the document as ISO-8859-1 encoded, in which case it is
syntactically wrong and browsers may do what they like. Here's what Lynx
shows (there's a subtle hint to some problems other than encoding
problems in my quoting this):

þÿ

. jpg jpg
jpg
jpg jpg

MOTION MOUNTAIN

THE PHYSICS TEXTBOOK

logo

Welcome Contents Download Search Project Guest Book Links Author
Prizes July 5, 2005
jpg jpg
jpg jpg

Apparently, if you wish to use UTF-16, remove the tag
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
or (perhaps better) replace ISO-8859-1 by UTF-16 (in which case you need
to remember to change it again if you change the document's encoding).
Other pages do not validate in w3c
( http://www.motionmountain.net/contents.html)
(2) What is wrong here?
1695 errors, wow! :-)

I suspect they relate to character encoding problems too; "non SGML
character number 0" sounds like the validator had encountered the NUL
character (U+0000) and got confused, but if I remember correctly, this
cryptic message arises in different situations.

How do you produce and edit your HTML files? It seems that they might
not all be properly UTF-16 encoded.
(3) When uploading unicode files via ftp,
which line endings have to be used (mac, unix, other)?
In HTML, all commonly used line endings are traditionally (and by the
specs) accepted.
Ascii mode or binary mode? (I have Mac OSX)
If you use UTF-16 or UTF-8, binary - you do _not_ want any
Mac-to-something else conversions, since you are using a standard
Unicode encoding already. What matters is whether your editing software
produces correct UTF-something.
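
A rough Python sketch of why a byte-wise line-ending conversion is harmless for UTF-8 but destructive for UTF-16. The helper `naive_lf_to_crlf` is a hypothetical, simplified stand-in for what a text-mode transfer might do; real FTP clients and servers differ in detail.

```python
def naive_lf_to_crlf(data: bytes) -> bytes:
    """Hypothetical stand-in for a byte-wise LF -> CRLF conversion."""
    return data.replace(b"\n", b"\r\n")


text = "line one\nline two"

utf8_after = naive_lf_to_crlf(text.encode("utf-8"))
utf16_after = naive_lf_to_crlf(text.encode("utf-16-le"))

# UTF-8 survives: the result is still valid UTF-8, only with CRLF.
utf8_ok = utf8_after.decode("utf-8") == "line one\r\nline two"

# UTF-16 does not: the single inserted octet shifts every following
# 16-bit code unit and leaves an odd number of octets.
try:
    utf16_after.decode("utf-16-le")
    utf16_ok = True
except UnicodeDecodeError:
    utf16_ok = False

print(utf8_ok, utf16_ok, len(utf16_after))
```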
(4) To get IE to read the pages, is it best to use UTF-8
or UTF-16?


Even IE can handle both, but UTF-8 is surely more efficient, if the
majority of the text is English. In UTF-16, every (BMP) character is two
octets.
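
To put a number on that, here is a small Python check; the sample string is an arbitrary piece of English text chosen for illustration.

```python
sample = "Motion Mountain: the physics textbook"

utf8_size = len(sample.encode("utf-8"))
# "utf-16-le" is used so that no byte order mark is counted.
utf16_size = len(sample.encode("utf-16-le"))

print(utf8_size, utf16_size)
```

For pure ASCII text the UTF-16 form is exactly twice as long as the UTF-8 form.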
Aug 25 '05 #2
Jukka K. Korpela wrote:
( http://www.motionmountain.net/contents.html)
(2) What is wrong here?


1695 errors, wow! :-)

I suspect they relate to character encoding problems too; "non SGML
character number 0" sounds like the validator had encountered the NUL
character (U+0000) and got confused, but if I remember correctly, this
cryptic message arises in different situations.


In this case, the error message is correct: the document contains data like
<td>&nbsp;</td>
so that each of those characters is followed by NUL, U+0000. The
validator reports NUL as an error, since it is a "non SGML character",
which is a technicality I won't dig into now. The problem is apparently
that the data comes, presumably via server-side include (as the comment
before it suggests) from an ASCII file that is converted to Unicode
format too eagerly. If you have ASCII data to be embedded into an UTF-16
encoded document, each octet shall be followed by a zero octet. What has
happened here is that each octet is followed by _three_ zero octets (as
if the encoding were UTF-32), which means in UTF-16 interpretation that
you have NULs all around. Although browsers may skip NULs, NULs are an
error in HTML.

So perhaps there is some simple ASCII to UTF-16 transformation that is
applied _twice_ by mistake, or maybe there is an ASCII to UTF-32
transformation.
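
The "applied twice" hypothesis is easy to simulate in Python. This is a sketch under assumptions: little-endian byte order is used so that the zero octets follow each character, and the second conversion is assumed to treat the first result as single-byte text.

```python
original = "<td>"

# First (correct) conversion: each ASCII octet is followed by one zero octet.
once = original.encode("utf-16-le")

# Simulated mistake: the UTF-16 octets are taken for single-byte
# characters and converted to UTF-16 a second time.
twice = once.decode("iso-8859-1").encode("utf-16-le")

# What a UTF-16 reader such as the validator would then see.
seen = twice.decode("utf-16-le")

print(twice)
print(repr(seen))
```

The doubly converted octets are identical to a UTF-32 encoding of the same text, and a UTF-16 reader sees a NUL after every character.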
Aug 25 '05 #3
On Thu, 25 Aug 2005, Jukka K. Korpela wrote:

[...]
(4) To get IE to read the pages, is it best to use UTF-8
or UTF-16?


Even IE can handle both, but UTF-8 is surely more efficient, if the
majority of the text is English.


Agreed, and utf-8 is, in general, better supported than utf-16
(talking not only about browsers, but also about search engines etc.).

However, *if* the bulk of the content were in Chinese, presumably
utf-16 would be more compact than utf-8. (I have the impression that
currently, most Chinese documents are in one of the specifically
Chinese encodings, rather than Unicode, but that's by the by. Oh, and
when using Unicode, remember to specify the language, to help browsers
to choose a preferred rendering for unified Han characters[1]).

hope this helps

[1] This is not my field at all, but a web search throws up a
wikipedia article which, as far as I can tell, seems to be a
reasonable discussion at a level that I can understand.
http://en.wikipedia.org/wiki/Han_unification

I can't speak personally for any of its technical detail - like any
wikipedia article, who knows what a specialist in the field would have
to say about it? For all I know, it may be spotless, I just can't
tell; but at least it gives the flavour of the issues involved.
Aug 25 '05 #4

ch***********@yahoo.com wrote:
Since I plan to add more languages, and the unicode
issues are so tough:

(3) When uploading unicode files via ftp,
which line endings have to be used (mac, unix, other)?
Ascii mode or binary mode? (I have Mac OSX)

(4) To get IE to read the pages, is it best to use UTF-8
or UTF-16?


Unicode is not tough. Just make sure that you know the encoding of
your files, and ensure that the same encoding is specified in the HTTP
header and in a meta tag.
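
A minimal Python sketch of such a consistency check. The helper names and the sample header and page are hypothetical, and the regular expression is a rough approximation, not a full HTML parser.

```python
import re
from email.message import Message

# Rough pattern for charset=... inside a <meta> tag (an approximation).
META_RE = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE
)


def header_charset(content_type):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lower-cased, or None if absent


def meta_charset(html_bytes):
    """Find the charset named in a meta tag near the start of the page."""
    match = META_RE.search(html_bytes[:1024])
    return match.group(1).decode("ascii").lower() if match else None


def charsets_agree(content_type, html_bytes):
    declared = header_charset(content_type)
    return declared is not None and declared == meta_charset(html_bytes)


page = b'<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">'
print(charsets_agree("text/html; charset=UTF-8", page))
print(charsets_agree("text/html; charset=iso-8859-1", page))
```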

Do NOT use UTF-16. I.E. for Mac does not understand UTF-16.

--
Alan Wood
http://www.alanwood.net (Unicode, special characters, pesticide names)

Aug 25 '05 #5
On 25 Aug 2005, Alan Wood wrote:
Do NOT use UTF-16. I.E. for Mac does not understand UTF-16.


And Google does not understand UTF-16.
http://www.google.com/search?q=%22UTF+1+6%22

Aug 25 '05 #6
On Thu, 25 Aug 2005, Alan J. Flavell wrote:
However, *if* the bulk of the content were in Chinese, presumably
utf-16 would be more compact than utf-8.


For text/plain. However for text/html, even Chinese texts may be more
compact in UTF-8 depending on the amount of your (ASCII!) markup.
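
A quick Python illustration of that trade-off. The Han characters and the markup are invented samples; real pages will have a different ratio of markup to text.

```python
# Two arbitrary Han characters, repeated to stand in for running text.
chinese = "\u7269\u7406" * 20

# The same two characters wrapped in typical ASCII markup.
html_fragment = (
    '<td class="entry"><a href="contents.html">' + "\u7269\u7406" + "</a></td>"
)


def sizes(text):
    """Return (UTF-8 size, UTF-16 size without BOM) in octets."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


plain_utf8, plain_utf16 = sizes(chinese)
html_utf8, html_utf16 = sizes(html_fragment)

print("plain text:", plain_utf8, plain_utf16)
print("with markup:", html_utf8, html_utf16)
```

Each of these Han characters takes three octets in UTF-8 and two in UTF-16, while each ASCII character takes one and two respectively, so UTF-8 wins as soon as the ASCII characters outnumber the Chinese ones.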

Aug 25 '05 #7
On Thu, 25 Aug 2005, Jukka K. Korpela wrote:
Apparently, if you wish to use UTF-16, remove the tag
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
or (perhaps better) replace ISO-8859-1 by UTF-16


Do you mean

< m e t a h t t p - e q u i v = " C o n t e n t - T y p e "
c o n t e n t = " t e x t / h t m l ; c h a r s e t = U T F - 1 6 " >

? This would be pointless. You would need to know the encoding (UTF-16)
in advance before you could even read this.

Aug 25 '05 #8
On 24 Aug 2005 ch***********@yahoo.com wrote:
The entry page has two chinese characters,
The easiest way is to write &#number; for only two characters.
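
For example, the references can be generated with Python's standard library. The two characters below are arbitrary samples, not necessarily the ones on the entry page.

```python
import html

chinese = "\u7269\u7406"  # two sample Han characters (an assumption)

# Replace every non-ASCII character by a decimal character reference.
ncr = chinese.encode("ascii", "xmlcharrefreplace").decode("ascii")

# A browser (here: html.unescape) turns the references back into text.
round_trip = html.unescape(ncr)

print(ncr)
```

The resulting page is pure ASCII, so it displays the same way whatever charset is declared.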
but these are not seen on all browsers,
Of course. You cannot expect that everyone has fonts with
Chinese characters on his computer.
( http://www.motionmountain.net/contents.html)
Please *do not* enclose the URL in parentheses! This might mean
your file is "contents.html)" . Always leave a space on *both*
sides of URLs.
(3) When uploading unicode files via ftp,
which line endings have to be used (mac, unix, other)?
Ascii mode or binary mode? (I have Mac OSX)
The best way is to upload in "text mode" (misnomer: "ASCII mode")
and have your files stored on _your_ computer with local line
endings. You must disable any transcoding "MacRoman <-> ISO-8859-1"
in your FTP program. At least Fetch has such an option.
(4) To get IE to read the pages, is it best to use UTF-8
or UTF-16?


UTF-8. Never use UTF-16 for the WWW.

Aug 25 '05 #9
Andreas Prilop wrote:
On Thu, 25 Aug 2005, Jukka K. Korpela wrote:

Apparently, if you wish to use UTF-16, remove the tag
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
or (perhaps better) replace ISO-8859-1 by UTF-16

Do you mean

< m e t a h t t p - e q u i v = " C o n t e n t - T y p e "
c o n t e n t = " t e x t / h t m l ; c h a r s e t = U T F - 1 6 " >

? This would be pointless. You would need to know the encoding (UTF-16)
in advance before you could even read this.


Would this work?

< m e t a h t t p - e q u i v = " C o n t e n t - T y p e "
c o n t e n t = " t e x t / h t m l ; c h a r s e t = U T F - 1 6 " >
<meta http-equiv="Content-Type"
content="text/html; charset=ISO-8859-1">

(akin to certain CSS hacks that involve trying to accomplish the same
thing in two different ways)
Aug 25 '05 #10
On Thu, 25 Aug 2005, Harlan Messinger wrote:
Would this work?

< m e t a h t t p - e q u i v = " C o n t e n t - T y p e "
c o n t e n t = " t e x t / h t m l ; c h a r s e t = U T F - 1 6 " >
<meta http-equiv="Content-Type"
content="text/html; charset=ISO-8859-1">


Of course not. The encoding (charset) can only be *either* UTF-16
*or* ISO-8859-1 but not both at the same time.

Aug 25 '05 #11
In article
<Pi*************************************@s5b004.rrzn-user.uni-hannover.de>,
Andreas Prilop <nh******@rrzn-user.uni-hannover.de> wrote:
However for text/html, even Chinese texts may be more
compact in UTF-8 depending on the amount of your (ASCII!) markup.


Indeed. Also, gzip is a byte-oriented compression method and works
rather nicely on UTF-8-encoded markup.
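
A small Python sketch of that effect, using an invented and deliberately repetitive table row; actual savings depend on the page.

```python
import gzip

# Hypothetical, highly repetitive markup with a little Chinese text.
row = '<tr><td class="page">\u7269\u7406</td></tr>\n'
markup = row * 200

raw_utf8 = markup.encode("utf-8")
packed = gzip.compress(raw_utf8)

# The compressed stream restores the original text exactly.
restored = gzip.decompress(packed).decode("utf-8")

print(len(raw_utf8), len(packed))
```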

--
Henri Sivonen
hs******@iki.fi
http://hsivonen.iki.fi/
Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html
Aug 25 '05 #12
Andreas Prilop wrote:
On Thu, 25 Aug 2005, Jukka K. Korpela wrote:
Apparently, if you wish to use UTF-16, remove the tag
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
or (perhaps better) replace ISO-8859-1 by UTF-16

Do you mean

< m e t a h t t p - e q u i v = " C o n t e n t - T y p e "
c o n t e n t = " t e x t / h t m l ; c h a r s e t = U T F - 1 6 " >

? This would be pointless. You would need to know the encoding (UTF-16)
in advance before you could even read this.


I stand corrected. Would you believe that the low-order byte of all my
brain cells was disabled when I wrote that?

There's apparently no way to leave the encoding unspecified at other
protocol levels and set it to UTF-16 in a meta tag, logically speaking. I
guess browsers might still get it, by first reading the data in implicit
ISO-8859-1 (or windows-1252) encoding, then realize it's now told to be
UTF-16, and proceed with that, assuming that everything read so far was
correctly interpreted.

Apparently the validator (and many browsers) actually play by XHTML
rules, which allow and mandate the recognition of UTF-16 from the byte
order mark.
Aug 26 '05 #13
On Fri, 26 Aug 2005, Jukka K. Korpela wrote:
There's apparently no way to leave the encoding unspecified at other
protocol levels and set it to UTF-16 in a meta tag, logically
speaking. I guess browsers might still get it, by first reading the
data in implicit ISO-8859-1 (or windows-1252) encoding, then realize
it's now told to be UTF-16, and proceed with that, assuming that
everything read so far was correctly interpreted.


Browsers often have some kind of auto-recognition algorithm for
character coding, and I'd suggest it's more likely that a browser
would auto-recognise utf-16, if that's what it is.

After all, HTML documents can be expected to start (aside from a
possible BOM) with a coded representation of characters from the ASCII
repertoire, even if they then go on to present a document body
containing a wide range of Unicode. There aren't too many different
ways of representing the ASCII repertoire, so a heuristic has a good
chance of recognising what's going on in such a case.

Of course, this kind of browser-specific heuristic does not in
any way replace the proper way of doing things! But it might
rescue a badly-served document that would otherwise be unusable.
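
A minimal Python sketch of such a heuristic, assuming the document starts with ASCII-repertoire characters as described above. The function `sniff_encoding` is hypothetical and far simpler than what any real browser does; in particular it ignores UTF-32.

```python
import codecs


def sniff_encoding(data: bytes):
    """Guess UTF-8/UTF-16 from a BOM or from the NUL pattern of ASCII text."""
    # 1. An explicit byte order mark settles the question.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"

    # 2. Otherwise look at where the zero octets fall in the first
    #    two characters, which are assumed to be ASCII (e.g. "<!").
    head = data[:4]
    if len(head) == 4:
        if head[0] == 0 and head[1] != 0 and head[2] == 0 and head[3] != 0:
            return "utf-16-be"
        if head[0] != 0 and head[1] == 0 and head[2] != 0 and head[3] == 0:
            return "utf-16-le"

    # 3. No evidence either way: leave it to the declared charset.
    return None


print(sniff_encoding(b"\xfe\xff\x00<\x00!"))
print(sniff_encoding("<!DOCTYPE html>".encode("utf-16-le")))
print(sniff_encoding(b"<!DOCTYPE html>"))
```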

On the other hand, CERT CA-2000-02 says that a document served out
without an HTTP charset attribute is a potential security risk!

best regards
Aug 26 '05 #14
On Thu, 25 Aug 2005, I wrote:
And Google does not understand UTF-16.
http://www.google.com/search?q=%22UTF+1+6%22


After reading some of those web pages:
*Really* clueless are the webpupils whose UTF-16-encoded pages
contain only characters from the ASCII repertoire. I call
this "overoverkill".

Aug 26 '05 #15
On Fri, 26 Aug 2005, Alan J. Flavell wrote:
Browsers often have some kind of auto-recognition algorithm for
character coding,


Interestingly, Mozilla identifies the encoding (charset) of
[ Warning: Very slow! Read without images! ]
http://www.apple.com.ge/contacts.html
as Windows-1251 because of the Russian *comments*. The Georgian
characters are written as &#number; - so the charset could be
anything. Non-ASCII characters exist only inside comments:
Russian text in cp1251.

Aug 26 '05 #16

Thank you all for the advice. That was really useful.
Using Mac's Textedit, I took two files, namely

http://www.motionmountain.net/welcome.html
and
http://www.motionmountain.net/index.html

converted them to UTF-8, changed the charset
to UTF-8, used (UNIX) ftp in ASCII mode to upload them,
and put them on the server.
(It is simple handmade html, no server-side-includes)

Are these pages ok now?

Christoph

Aug 26 '05 #17

On the other hand, with Textedit I do not seem to be able to
get rid of the nuls in

http://www.motionmountain.net/contents.html

What is the best way to do this?

Christoph

Aug 26 '05 #18
In article
<Pi*************************************@s5b004.rrzn-user.uni-hannover.de>,
Andreas Prilop <nh******@rrzn-user.uni-hannover.de> wrote:
Interestingly, Mozilla identifies the encoding (charset) of
[ Warning: Very slow! Read without images! ]
http://www.apple.com.ge/contacts.html
as Windows-1251 because of the Russian *comments*.


The detector operates on the byte stream before the parser, so it does
not matter if the text is in comments.
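
A toy Python illustration of that point. The function `looks_like_cp1251` is a hypothetical, very crude heuristic and not Mozilla's actual detector; the sample page is invented.

```python
def looks_like_cp1251(data: bytes) -> bool:
    """Crude guess: do nearly all non-ASCII octets fall in 0xC0-0xFF,
    the range windows-1251 uses for the basic Cyrillic letters?"""
    high = [b for b in data if b >= 0x80]
    if not high:
        return False  # pure ASCII: nothing to go on
    cyrillic = [b for b in high if b >= 0xC0]
    return len(cyrillic) / len(high) > 0.9


# Russian text only inside a comment; the visible text uses &#number;.
comment = "<!-- Контакты -->".encode("cp1251")
body = b"<p>&#4321;&#4304;</p>"
page = comment + body

print(looks_like_cp1251(page))
print(looks_like_cp1251(body))
```

Because the function only ever sees octets, it has no notion of where a comment starts or ends; the comment alone decides the outcome.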

--
Henri Sivonen
hs******@iki.fi
http://hsivonen.iki.fi/
Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html
Aug 26 '05 #19

The files http://www.motionmountain.net/search.html
and http://www.motionmountain.net/project.html
etc are somehow messed up.
I will use an older backup copy and convert them to utf-8 afresh.
Thank you for all the help!

Regards

Christoph

Aug 27 '05 #20
In <11**********************@g47g2000cwa.googlegroups.com>, on
08/24/2005 at 06:44 PM, ch***********@yahoo.com said:
(1) Why not?
Just because you encode your data in Unicode doesn't mean that the
user's browser can render them; he needs to have the appropriate
fonts. If he doesn't have a Chinese font for Unicode, . . .
(3) When uploading unicode files via ftp,
which line endings have to be used (mac, unix, other)?
Whatever is appropriate for the sending and receiving systems.
(4) To get IE to read the pages, is it best to use UTF-8 or UTF-16?


UTF-8. UTF-16 is only appropriate if you are sending 16-bit bytes.

--
Shmuel (Seymour J.) Metz, SysProg and JOAT <http://patriot.net/~shmuel>

Unsolicited bulk E-mail subject to legal action. I reserve the
right to publicly post or ridicule any abusive E-mail. Reply to
domain Patriot dot net user shmuel+news to contact me. Do not
reply to sp******@library.lspace.org

Aug 28 '05 #21
On Sun, 28 Aug 2005, Shmuel (Seymour J.) Metz wrote:
UTF-16 is only appropriate if you are sending 16-bit bytes.


"16-bit bytes"?

Aug 29 '05 #22

(apologies for the delayed response)

On Fri, 26 Aug 2005, Andreas Prilop wrote:
On Fri, 26 Aug 2005, Alan J. Flavell wrote:
Browsers often have some kind of auto-recognition algorithm for
character coding,
Interestingly, Mozilla identifies the encoding (charset) of
[ Warning: Very slow! Read without images! ]
http://www.apple.com.ge/contacts.html
as Windows-1251 because of the Russian *comments*.


So it does!

Worryingly, when auto charset recognition was turned off, the encoding
was reported as utf-8: but surely these strings of cp1251 bytes could
not be valid utf-8? !! When Moz. displays the HTML source under those
conditions, the Russian comments are displayed as strings of ??????
shown in an oblique font.

It should be noted that Unicode lays down rules with the intention of
promoting security, i.e. avoiding spoofing which might otherwise be done
by supplying defective utf-8 sequences. I knew that in principle, but
didn't know the details till I looked just now. Let's see...

http://www.unicode.org/reports/tr36/#UTF-8_Exploit

Oh, this is about the interpretation of non-shortest-form utf-8: that
rule was new at 3.0. For the more general rule on defective utf-8 I
need to look elsewhere...

Aha, I'm getting close: under utf-8 clause D37 at
http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf
it refers back to conformance clause C12a, which is in section 3.2.

It rules-out the interpretation of the ill-formed "code units"
themselves. It doesn't rule out attempting to interpret the remaining
sequences. So (as usual) Mozilla is conforming to the applicable
rules, and "representing the [defective] code unit with a marker", as
the Unicode requirement puts it (using the oblique "?" characters as
its marker), while still processing the remaining "code units" (the
ASCII source code characters).
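
Python's UTF-8 decoder can be used to see the same behaviour. This is a sketch with an invented sample; it shows Python's handling, which is analogous to but not the same code as Mozilla's.

```python
# A comment containing cp1251-encoded Russian, followed by ASCII markup.
data = b"<!-- " + "Привет".encode("cp1251") + b" --><p>ok</p>"

# Strict decoding refuses the ill-formed sequences outright.
try:
    data.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Lenient decoding marks each defective part with U+FFFD and carries on
# with the well-formed remainder (here, the ASCII characters).
lenient = data.decode("utf-8", errors="replace")

# Non-shortest-form input (an over-long encoding of "/") is rejected too.
try:
    b"\xc0\xaf".decode("utf-8")
    overlong_accepted = True
except UnicodeDecodeError:
    overlong_accepted = False

print(strict_ok, overlong_accepted)
print(lenient)
```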

OK, something learned.
The Georgian characters are written as &#number; - so the charset
could be anything.


In a properly-behaved browser, indeed. (That old Netscape 4.* thing
would fail to render this, if it got to know that it was Windows-1251,
I reckon - but right now I don't have access to a copy of any NN4.*
version to try it. Whereas, I reckon, if one put it into its utf-8
mood, and gave it a unicode font with adequate repertoire, I'd expect
it to be able to render Georgian. Can we rate NN4.* as past history,
yet?)

cheers
Aug 29 '05 #23
On Mon, 29 Aug 2005, Alan J. Flavell wrote:
Interestingly, Mozilla identifies the encoding (charset) of
[ Warning: Very slow! Read without images! ]
http://www.apple.com.ge/contacts.html
as Windows-1251 because of the Russian *comments*.
So it does!


Google, which is broken in many ways, thinks it is ISO-8859-5
and puts a <meta ... charset=ISO-8859-5> into the cached version:
http://google.com/search?q=cache:www...s.html&strip=1
Worryingly, when auto charset recognition was turned off, the encoding
was reported as utf-8: but surely these strings of cp1251 bytes could
not be valid utf-8? !!


UTF-8 is _your_ default, it seems.

Sep 2 '05 #24
On Fri, 2 Sep 2005, Andreas Prilop wrote:
On Mon, 29 Aug 2005, Alan J. Flavell wrote:
Worryingly, when auto charset recognition was turned off, the encoding
was reported as utf-8: but surely these strings of cp1251 bytes could
not be valid utf-8? !!


UTF-8 is _your_ default, it seems.


Oh, right. Sorry for any confusion caused.

But it led to some interesting considerations about utf-8 error
handling, nevertheless.

gruesse

Sep 2 '05 #25
