
Unicode and html - help for simple web site

chri_schiller@yahoo.com
#1: Aug 25 '05

I have a home-made website that provides a free
1100 page physics textbook. It is written in html and
css. I recently added some chinese text, and
since that day there are problems.

The entry page has two chinese characters,
but these are not seen on all browsers, even
though the page is validated by
the w3c validator.
( http://www.motionmountain.net/welcome.html)
(1) Why not?

Other pages do not validate in w3c
( http://www.motionmountain.net/contents.html)
(2) What is wrong here?

Since I plan to add more languages, and the unicode
issues are so tough:

(3) When uploading unicode files via ftp,
which line endings have to be used (mac, unix, other)?
Ascii mode or binary mode? (I have Mac OSX)

(4) To get IE to read the pages, is it best to use UTF-8
or UTF-16?

Thank you for any help!

Christoph


Jukka K. Korpela
#2: Aug 25 '05

re: Unicode and html - help for simple web site


chri_schiller@yahoo.com wrote:

> The entry page has two chinese characters,
> but these are not seen on all browsers, even
> though the page is validated by
> the w3c validator.
> ( http://www.motionmountain.net/welcome.html)
> (1) Why not?

It validates just because the validator is so permissive and does not
care about the conflict between the encoding you declare in the meta tag
(ISO-8859-1) and the encoding you actually use. In fact, I would say
that the validator is in error here: the encoding is specified as
ISO-8859-1 (the meta tag takes effect when no charset is specified in
HTTP headers), so the first two octets of the data _must_ be interpreted
as þÿ (Latin letter small thorn and Latin letter small y with
diaeresis), which of course violate HTML syntax when appearing before a
DOCTYPE declaration.

The validator incorrectly guesses that þÿ is meant to act as a byte
order mark in UTF-16 encoding and therefore treats the document as
UTF-16 encoded. (The guess is "correct" of course in a pragmatic sense,
but it's still an error.)
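
The two readings of those first octets can be compared directly. This is a minimal sketch in Python (not a tool anyone in the thread used), assuming the file begins with the big-endian UTF-16 byte order mark; the doctype string is a made-up stand-in for the real page.

```python
import codecs

# Big-endian UTF-16 byte order mark: the two octets FE FF.
bom = codecs.BOM_UTF16_BE

# Read as ISO-8859-1, as the meta tag declares, the octets are two letters:
# U+00FE (small thorn) and U+00FF (small y with diaeresis).
as_latin1 = bom.decode("iso-8859-1")
print([hex(ord(c)) for c in as_latin1])  # ['0xfe', '0xff']

# Hypothetical start of a UTF-16 page: the BOM followed by a doctype.
page_start = bom + "<!DOCTYPE html>".encode("utf-16-be")

# The "utf-16" codec consumes the BOM, so only the markup remains.
as_utf16 = page_start.decode("utf-16")
print(as_utf16)  # <!DOCTYPE html>
```

Under the declared encoding the document starts with two stray letters; under the guessed one it starts with the doctype.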

Browsers may behave in the same incorrect way, or they may correctly
interpret the document as ISO-8859-1 encoded, in which case it is
syntactically wrong and browsers may do what they like. Here's what Lynx
shows (there's a subtle hint to some problems other than encoding
problems in my quoting this):

þÿ

. jpg jpg
jpg
jpg jpg

MOTION MOUNTAIN

THE PHYSICS TEXTBOOK

logo

Welcome Contents Download Search Project Guest Book Links Author
Prizes July 5, 2005
jpg jpg
jpg jpg

Apparently, if you wish to use UTF-16, remove the tag
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
or (perhaps better) replace ISO-8859-1 by UTF-16 (in which case you need
to remember to change it again if you change the document's encoding).

> Other pages do not validate in w3c
> ( http://www.motionmountain.net/contents.html)
> (2) What is wrong here?

1695 errors, wow! :-)

I suspect they relate to character encoding problems too; "non SGML
character number 0" sounds like the validator had encountered the NUL
character (U+0000) and got confused, but if I remember correctly, this
cryptic message arises in different situations.

How do you produce and edit your HTML files? It seems that they might
not all be properly UTF-16 encoded.

> (3) When uploading unicode files via ftp,
> which line endings have to be used (mac, unix, other)?

In HTML, all commonly used line endings are traditionally (and by the
specs) accepted.

> Ascii mode or binary mode? (I have Mac OSX)

If you use UTF-16 or UTF-8, binary - you do _not_ want any
Mac-to-something else conversions, since you are using a standard
Unicode encoding already. What matters is whether your editing software
produces correct UTF-something.
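
Why a text-mode conversion is risky for UTF-16 can be sketched in Python. The function `naive_newline_conversion` is a hypothetical stand-in for what a text-mode transfer might do (turn each LF octet into CR LF); real FTP clients and servers differ in the conversions they apply.

```python
text = "line one\nline two\n"

def naive_newline_conversion(data: bytes) -> bytes:
    """Hypothetical text-mode transfer: each LF octet becomes CR LF."""
    return data.replace(b"\n", b"\r\n")

utf8_after = naive_newline_conversion(text.encode("utf-8"))
utf16_after = naive_newline_conversion(text.encode("utf-16-le"))

expected = "line one\r\nline two\r\n"

# UTF-8 survives: LF is the single octet 0A, so only the line ending changes.
utf8_ok = utf8_after.decode("utf-8") == expected

# UTF-16 does not: each inserted 0D octet shifts the pairing of the octets
# that follow it, so the decoded text is no longer the original.
utf16_ok = utf16_after.decode("utf-16-le", errors="replace") == expected

print(utf8_ok, utf16_ok)  # True False
```

A binary transfer applies no such conversion, so the octets arrive exactly as the editor wrote them.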

> (4) To get IE to read the pages, is it best to use UTF-8
> or UTF-16?

Even IE can handle both, but UTF-8 is surely more efficient, if the
majority of the text is English. In UTF-16, every (BMP) character is two
octets.
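
The octet counts can be checked per character. A small sketch in Python, with three arbitrarily chosen sample characters (an assumption, not taken from the site):

```python
# Sample characters: ASCII, Latin-1 range, and a CJK ideograph (all in the BMP).
samples = ["A", "\u00e9", "\u4e2d"]

def octets(ch: str, encoding: str) -> int:
    """Number of octets one character occupies in the given encoding."""
    return len(ch.encode(encoding))

# "utf-16-be" is used so that no byte order mark is counted.
table = {ch: (octets(ch, "utf-8"), octets(ch, "utf-16-be")) for ch in samples}

for ch, (u8, u16) in table.items():
    print(f"U+{ord(ch):04X}: UTF-8 {u8} octet(s), UTF-16 {u16} octet(s)")
```

ASCII characters take one octet in UTF-8 and two in UTF-16, which is where the saving for mostly English text comes from.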
Jukka K. Korpela
#3: Aug 25 '05

re: Unicode and html - help for simple web site


Jukka K. Korpela wrote:

>> ( http://www.motionmountain.net/contents.html)
>> (2) What is wrong here?
>
> 1695 errors, wow! :-)
>
> I suspect they relate to character encoding problems too; "non SGML
> character number 0" sounds like the validator had encountered the NUL
> character (U+0000) and got confused, but if I remember correctly, this
> cryptic message arises in different situations.

In this case, the error message is correct: the document contains data like
<td>&nbsp;</td>
so that each of those characters is followed by NUL, U+0000. The
validator reports NUL as an error, since it is a "non SGML character",
which is a technicality I won't dig into now. The problem is apparently
that the data comes, presumably via server-side include (as the comment
before it suggests) from an ASCII file that is converted to Unicode
format too eagerly. If you have ASCII data to be embedded into a UTF-16
encoded document, each octet shall be followed by a zero octet. What has
happened here is that each octet is followed by _three_ zero octets (as
if the encoding were UTF-32), which means in UTF-16 interpretation that
you have NULs all around. Although browsers may skip NULs, NULs are an
error in HTML.

So perhaps there is some simple ASCII to UTF-16 transformation that is
applied _twice_ by mistake, or maybe there is an ASCII to UTF-32
transformation.
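
Both hypotheses can be reproduced in a few lines. A minimal sketch in Python, assuming little-endian byte order and using `<td>` as the sample; how the site's files were actually produced is not known from the thread.

```python
cell = "<td>"

# A correct ASCII to UTF-16 conversion: one zero octet after each octet.
once = cell.encode("utf-16-le")

# The same transformation applied again by mistake: every octet of the
# already-converted data is treated as a character and widened once more.
twice = once.decode("iso-8859-1").encode("utf-16-le")

# Each original octet is now followed by three zero octets,
# exactly what an ASCII to UTF-32 conversion would have produced.
same_as_utf32 = twice == cell.encode("utf-32-le")

# A UTF-16 reader sees a NUL (U+0000) after every real character.
seen = twice.decode("utf-16-le")
print(repr(seen), same_as_utf32)
```

The two mistakes yield identical octets, so the served file alone cannot tell them apart.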
Alan J. Flavell
#4: Aug 25 '05

re: Unicode and html - help for simple web site


On Thu, 25 Aug 2005, Jukka K. Korpela wrote:

[...]

> > (4) To get IE to read the pages, is it best to use UTF-8
> > or UTF-16?
>
> Even IE can handle both, but UTF-8 is surely more efficient, if the
> majority of the text is English.

Agreed, and utf-8 is, in general, better supported than utf-16
(talking not only about browsers, but also about search engines etc.).

However, *if* the bulk of the content were in Chinese, presumably
utf-16 would be more compact than utf-8. (I have the impression that
currently, most Chinese documents are in one of the specifically
Chinese encodings, rather than Unicode, but that's by the by. Oh, and
when using Unicode, remember to specify the language, to help browsers
to choose a preferred rendering for unified Han characters[1]).

hope this helps

[1] This is not my field at all, but a web search throws up a
wikipedia article which, as far as I can tell, seems to be a
reasonable discussion at a level that I can understand.
http://en.wikipedia.org/wiki/Han_unification

I can't speak personally for any of its technical detail - like any
wikipedia article, who knows what a specialist in the field would have
to say about it? For all I know, it may be spotless, I just can't
tell; but at least it gives the flavour of the issues involved.
Alan Wood
#5: Aug 25 '05

re: Unicode and html - help for simple web site



chri_schiller@yahoo.com wrote:
> Since I plan to add more languages, and the unicode
> issues are so tough:
>
> (3) When uploading unicode files via ftp,
> which line endings have to be used (mac, unix, other)?
> Ascii mode or binary mode? (I have Mac OSX)
>
> (4) To get IE to read the pages, is it best to use UTF-8
> or UTF-16?

Unicode is not tough. Just make sure that you know the encoding of
your files, and ensure that the same encoding is specified in the HTTP
header and in a meta tag.

Do NOT use UTF-16. I.E. for Mac does not understand UTF-16.

--
Alan Wood
http://www.alanwood.net (Unicode, special characters, pesticide names)

Andreas Prilop
#6: Aug 25 '05

re: Unicode and html - help for simple web site


On Thu, 25 Aug 2005, Alan J. Flavell wrote:

> However, *if* the bulk of the content were in Chinese, presumably
> utf-16 would be more compact than utf-8.

For text/plain. However for text/html, even Chinese texts may be more
compact in UTF-8 depending on the amount of your (ASCII!) markup.
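
The effect of the markup is easy to measure. A rough sketch in Python; the Chinese sample text and the surrounding tag are invented for illustration and are not taken from the site.

```python
# Assumed sample: three CJK characters, alone and wrapped in ASCII markup.
chinese = "物理学"
markup = '<p class="intro" lang="zh">' + chinese + "</p>"

def sizes(s: str):
    """Return (UTF-8 octets, UTF-16 octets); no byte order mark is counted."""
    return len(s.encode("utf-8")), len(s.encode("utf-16-be"))

plain_utf8, plain_utf16 = sizes(chinese)
html_utf8, html_utf16 = sizes(markup)

print("plain text:", plain_utf8, "vs", plain_utf16)
print("with markup:", html_utf8, "vs", html_utf16)
```

For the bare characters UTF-16 is smaller; once the ASCII tags and attributes are counted, UTF-8 wins for this sample. Where the break-even point lies depends on the ratio of markup to text.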

Andreas Prilop
#7: Aug 25 '05

re: Unicode and html - help for simple web site


On 25 Aug 2005, Alan Wood wrote:

> Do NOT use UTF-16. I.E. for Mac does not understand UTF-16.

And Google does not understand UTF-16.
http://www.google.com/search?q=%22UTF+1+6%22

Andreas Prilop
#8: Aug 25 '05

re: Unicode and html - help for simple web site


On Thu, 25 Aug 2005, Jukka K. Korpela wrote:

> Apparently, if you wish to use UTF-16, remove the tag
> <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
> or (perhaps better) replace ISO-8859-1 by UTF-16

Do you mean

< m e t a h t t p - e q u i v = " C o n t e n t - T y p e "
c o n t e n t = " t e x t / h t m l ; c h a r s e t = U T F - 1 6 " >

? This would be pointless. You would need to know the encoding (UTF-16)
in advance before you could even read this.
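
The point can be shown with a byte search. This is a sketch in Python under a simplifying assumption: a reader that has not yet been told the encoding scans the raw octets for the ASCII string `charset=`. Actual browsers use more elaborate logic.

```python
meta = '<meta http-equiv="Content-Type" content="text/html; charset=UTF-16">'

# The same tag as it would be stored in an ASCII-compatible file
# and in a UTF-16 file (big-endian, no byte order mark).
as_ascii = meta.encode("ascii")
as_utf16 = meta.encode("utf-16-be")

# Simplified pre-scan: look for the ASCII octets of "charset=".
needle = b"charset="
found_in_ascii = needle in as_ascii
found_in_utf16 = needle in as_utf16

print(found_in_ascii, found_in_utf16)  # True False
```

In the UTF-16 file a zero octet sits between every pair of letters, so the declaration is invisible to anything that does not already assume UTF-16.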

Andreas Prilop
#9: Aug 25 '05

re: Unicode and html - help for simple web site


On 24 Aug 2005 chri_schiller@yahoo.com wrote:

> The entry page has two chinese characters,

The easiest way is to write &#number; for only two characters.
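
The numbers for such references can be generated rather than looked up. A short sketch in Python; the two characters are an assumed example, since the thread does not say which ones the entry page uses.

```python
import html

text = "中文"  # assumed example characters

# Encoding to ASCII with "xmlcharrefreplace" turns every non-ASCII
# character into a decimal numeric character reference.
ncr = text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")
print(ncr)  # &#20013;&#25991;

# Round trip: the references resolve back to the original characters.
restored = html.unescape(ncr)
```

The result is pure ASCII, so it displays the same whatever encoding the page is declared to use.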

> but these are not seen on all browsers,

Of course. You cannot expect that everyone has fonts with
Chinese characters on his computer.

> ( http://www.motionmountain.net/contents.html)

Please *do not* enclose the URL in parentheses! This might mean
your file is "contents.html)" . Always leave a space on *both*
sides of URLs.

> (3) When uploading unicode files via ftp,
> which line endings have to be used (mac, unix, other)?
> Ascii mode or binary mode? (I have Mac OSX)

The best way is to upload in "text mode" (misnomer: "ASCII mode")
and have your files stored on _your_ computer with local line
endings. You must disable any transcoding "MacRoman <-> ISO-8859-1"
in your FTP program. At least Fetch has such an option.

> (4) To get IE to read the pages, is it best to use UTF-8
> or UTF-16?

UTF-8. Never use UTF-16 for the WWW.

Harlan Messinger
#10: Aug 25 '05

re: Unicode and html - help for simple web site


Andreas Prilop wrote:
> On Thu, 25 Aug 2005, Jukka K. Korpela wrote:
>
>>Apparently, if you wish to use UTF-16, remove the tag
>><meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
>>or (perhaps better) replace ISO-8859-1 by UTF-16
>
> Do you mean
>
> < m e t a h t t p - e q u i v = " C o n t e n t - T y p e "
> c o n t e n t = " t e x t / h t m l ; c h a r s e t = U T F - 1 6 " >
>
> ? This would be pointless. You would need to know the encoding (UTF-16)
> in advance before you could even read this.

Would this work?

< m e t a h t t p - e q u i v = " C o n t e n t - T y p e "
c o n t e n t = " t e x t / h t m l ; c h a r s e t = U T F - 1 6 " >
<meta http-equiv="Content-Type"
content="text/html; charset=ISO-8859-1">

(akin to certain CSS hacks that involve trying to accomplish the same
thing in two different ways)
Andreas Prilop
#11: Aug 25 '05

re: Unicode and html - help for simple web site


On Thu, 25 Aug 2005, Harlan Messinger wrote:

> Would this work?
>
> < m e t a h t t p - e q u i v = " C o n t e n t - T y p e "
> c o n t e n t = " t e x t / h t m l ; c h a r s e t = U T F - 1 6 " >
> <meta http-equiv="Content-Type"
> content="text/html; charset=ISO-8859-1">

Of course not. The encoding (charset) can only be *either* UTF-16
*or* ISO-8859-1 but not both at the same time.

Henri Sivonen
#12: Aug 25 '05

re: Unicode and html - help for simple web site


In article
<Pine.GSO.4.44.0508251506530.5232-100000@s5b004.rrzn-user.uni-hannover.de>,
Andreas Prilop <nhtcapri@rrzn-user.uni-hannover.de> wrote:

> However for text/html, even Chinese texts may be more
> compact in UTF-8 depending on the amount of your (ASCII!) markup.

Indeed. Also, gzip is a byte-oriented compression method and works
rather nicely on UTF-8-encoded markup.
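
Anyone who wants to measure this for their own pages could start from a sketch like the following, in Python. The table row is an invented sample, and repeating it 200 times makes the data far more compressible than a real page, so the printed numbers only indicate how to compare, not what to expect.

```python
import gzip

# Assumed sample page body: one mixed ASCII/CJK table row, repeated.
page = '<tr><td lang="zh">物理学</td><td>physics</td></tr>\n' * 200

def raw_and_gzipped(text: str, encoding: str):
    """Return (octets before compression, octets after gzip)."""
    raw = text.encode(encoding)
    return len(raw), len(gzip.compress(raw))

results = {enc: raw_and_gzipped(page, enc) for enc in ("utf-8", "utf-16-be")}

for enc, (raw_size, packed_size) in results.items():
    print(f"{enc}: {raw_size} octets raw, {packed_size} octets gzipped")
```

Running it on a real file instead of the sample string gives the figures that matter for a particular site.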

--
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html
Jukka K. Korpela
#13: Aug 26 '05

re: Unicode and html - help for simple web site


Andreas Prilop wrote:

> On Thu, 25 Aug 2005, Jukka K. Korpela wrote:
>
>>Apparently, if you wish to use UTF-16, remove the tag
>><meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
>>or (perhaps better) replace ISO-8859-1 by UTF-16
>
>
> Do you mean
>
> < m e t a h t t p - e q u i v = " C o n t e n t - T y p e "
> c o n t e n t = " t e x t / h t m l ; c h a r s e t = U T F - 1 6 " >
>
> ? This would be pointless. You would need to know the encoding (UTF-16)
> in advance before you could even read this.

I stand corrected. Would you believe that the low-order byte of all my
brain cells was disabled when I wrote that?

There's apparently no way to leave the encoding unspecified at other
protocol levels and set it to UTF-16 in a meta tag, logically speaking. I
guess browsers might still get it, by first reading the data in implicit
ISO-8859-1 (or windows-1252) encoding, then realize it's now told to be
UTF-16, and proceed with that, assuming that everything read so far was
correctly interpreted.

Apparently the validator (and many browsers) actually play by XHTML
rules, which allow and mandate the recognition of UTF-16 from the byte
order mark.
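
Recognising an encoding from the byte order mark amounts to a prefix test on the first octets. A minimal sketch in Python; `sniff_bom` is a hypothetical helper that handles only the UTF-8 and UTF-16 marks, not the full detection procedure of any specification or browser.

```python
import codecs

def sniff_bom(data: bytes):
    """Return the encoding implied by a leading byte order mark, or None."""
    if data.startswith(codecs.BOM_UTF8):      # EF BB BF
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "utf-16-le"
    return None

# Hypothetical file starts: one with a BOM, one without.
with_bom = codecs.BOM_UTF16_BE + "<!DOCTYPE html>".encode("utf-16-be")
without_bom = b"<!DOCTYPE html>"

print(sniff_bom(with_bom), sniff_bom(without_bom))  # utf-16-be None
```

When no mark is present the function returns None, and the reader has to fall back on the HTTP header, the meta tag, or guessing.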
Alan J. Flavell
#14: Aug 26 '05

re: Unicode and html - help for simple web site


On Fri, 26 Aug 2005, Jukka K. Korpela wrote:

> There's apparently no way to leave the encoding unspecified at other
> protocol levels and set it to UTF-16 in a meta tag, logically
> speaking. I guess browsers might still get it, by first reading the
> data in implicit ISO-8859-1 (or windows-1252) encoding, then realize
> it's now told to be UTF-16, and proceed with that, assuming that
> everything read so far was correctly interpreted.

Browsers often have some kind of auto-recognition algorithm for
character coding, and I'd suggest it's more likely that a browser
would auto-recognise utf-16, if that's what it is.

After all, HTML documents can be expected to start (aside from a
possible BOM) with a coded representation of characters from the ASCII
repertoire, even if they then go on to present a document body
containing a wide range of Unicode. There aren't too many different
ways of representing the ASCII repertoire, so a heuristic has a good
chance of recognising what's going on in such a case.

Of course, this kind of browser-specific heuristic does not in
any way replace the proper way of doing things! But it might
rescue a badly-served document that would otherwise be unusable.

On the other hand, CERT CA-2000-02 says that a document served out
without an HTTP charset attribute is a potential security risk!

best regards
Andreas Prilop
#15: Aug 26 '05

re: Unicode and html - help for simple web site


On Thu, 25 Aug 2005, I wrote:

> And Google does not understand UTF-16.
> http://www.google.com/search?q=%22UTF+1+6%22

After reading some of those web pages:
*Really* clueless are the webpupils whose UTF-16-encoded pages
contain only characters from the ASCII repertoire. I call
this "overoverkill".

Andreas Prilop
#16: Aug 26 '05

re: Unicode and html - help for simple web site


On Fri, 26 Aug 2005, Alan J. Flavell wrote:

> Browsers often have some kind of auto-recognition algorithm for
> character coding,

Interestingly, Mozilla identifies the encoding (charset) of
[ Warning: Very slow! Read without images! ]
http://www.apple.com.ge/contacts.html
as Windows-1251 because of the Russian *comments*. The Georgian
characters are written as &#number; - so the charset could be
anything. Non-ASCII characters exist only inside comments:
Russian text in cp1251.

chri_schiller@yahoo.com
#17: Aug 26 '05

re: Unicode and html - help for simple web site



Thank you all for the advice. That was really useful.
Using Mac's Textedit, I took two files, namely

http://www.motionmountain.net/welcome.html
and
http://www.motionmountain.net/index.html

converted them to UTF-8, changed the charset
to UTF-8, used (UNIX-) ftp with ascii mode to upload them,
and put them on the server.
(It is simple handmade html, no server-side-includes)

Are these pages ok now?

Christoph

chri_schiller@yahoo.com
#18: Aug 26 '05

re: Unicode and html - help for simple web site



On the other hand, with Textedit I do not seem to be able to
get rid of the nuls in

http://www.motionmountain.net/contents.html

What is the best way to do this?

Christoph

Henri Sivonen
#19: Aug 26 '05

re: Unicode and html - help for simple web site


In article
<Pine.GSO.4.44.0508261541490.7631-100000@s5b004.rrzn-user.uni-hannover.de>,
Andreas Prilop <nhtcapri@rrzn-user.uni-hannover.de> wrote:

> Interestingly, Mozilla identifies the encoding (charset) of
> [ Warning: Very slow! Read without images! ]
> http://www.apple.com.ge/contacts.html
> as Windows-1251 because of the Russian *comments*.

The detector operates on the byte stream before the parser, so it does
not matter if the text is in comments.
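
The difference between looking at octets and looking at parsed content can be sketched in Python. The page fragment below is invented (a Russian comment stored as cp1251, Georgian text written as numeric references), and `has_high_octets` is a crude stand-in for a detector, not Mozilla's algorithm.

```python
import re

# Assumed sample: a cp1251 comment plus Georgian text as &#number; references.
page = '<!-- Контакты -->\n<p>&#4321;&#4304;</p>'.encode("cp1251")

def has_high_octets(data: bytes) -> bool:
    """True if any octet lies outside the ASCII range."""
    return any(b >= 0x80 for b in data)

# What a byte-level detector is given: the whole stream, comments included.
seen_by_detector = has_high_octets(page)

# What would remain if comments were removed first, which a detector
# running before the parser has no way of doing.
without_comments = re.sub(rb"<!--.*?-->", b"", page, flags=re.DOTALL)
seen_after_stripping = has_high_octets(without_comments)

print(seen_by_detector, seen_after_stripping)  # True False
```

The only non-ASCII octets in the stream are inside the comment, and those are all the detector has to go on.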

--
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html
chri_schiller@yahoo.com
#20: Aug 27 '05

re: Unicode and html - help for simple web site



The files http://www.motionmountain.net/search.html
and http://www.motionmountain.net/project.html
etc are somehow messed up.
I will use an older backup copy and convert them to utf-8 afresh.
Thank you for all the help!

Regards

Christoph

Shmuel (Seymour J.) Metz
#21: Aug 28 '05

re: Unicode and html - help for simple web site


In <1124934287.154754.283770@g47g2000cwa.googlegroups.com>, on
08/24/2005
at 06:44 PM, chri_schiller@yahoo.com said:

>(1) Why not?

Just because you encode your data in Unicode doesn't mean that the
user's browser can render them; he needs to have the appropriate
fonts. If he doesn't have a Chinese font for Unicode, . . .

>(3) When uploading unicode files via ftp,
>which line endings have to be used (mac, unix, other)?

Whatever is appropriate for the sending and receiving systems.

>(4) To get IE to read the pages, is it best to use UTF-8 or UTF-16?

UTF-8. UTF-16 is only appropriate if you are sending 16-bit bytes.

--
Shmuel (Seymour J.) Metz, SysProg and JOAT <http://patriot.net/~shmuel>

Unsolicited bulk E-mail subject to legal action. I reserve the
right to publicly post or ridicule any abusive E-mail. Reply to
domain Patriot dot net user shmuel+news to contact me. Do not
reply to spamtrap@library.lspace.org

Andreas Prilop
#22: Aug 29 '05

re: Unicode and html - help for simple web site


On Sun, 28 Aug 2005, Shmuel (Seymour J.) Metz wrote:

> UTF-16 is only appropriate if you are sending 16-bit bytes.

"16-bit bytes"?

Alan J. Flavell
#23: Aug 29 '05

re: Unicode and html - help for simple web site



(apologies for the delayed response)

On Fri, 26 Aug 2005, Andreas Prilop wrote:

> On Fri, 26 Aug 2005, Alan J. Flavell wrote:
>
> > Browsers often have some kind of auto-recognition algorithm for
> > character coding,
>
> Interestingly, Mozilla identifies the encoding (charset) of
> [ Warning: Very slow! Read without images! ]
> http://www.apple.com.ge/contacts.html
> as Windows-1251 because of the Russian *comments*.

So it does!

Worryingly, when auto charset recognition was turned off, the encoding
was reported as utf-8: but surely these strings of cp1251 bytes could
not be valid utf-8? !! When Moz. displays the HTML source under those
conditions, the Russian comments are displayed as strings of ??????
shown in an oblique font.

It should be noted that Unicode lays down rules with the intention of
promoting security, i.e. avoiding spoofing which might otherwise be done
by supplying defective utf-8 sequences. I knew that in principle, but
didn't know the details till I looked just now. Let's see...

http://www.unicode.org/reports/tr36/#UTF-8_Exploit

Oh, this is about the interpretation of non-shortest-form utf-8: that
rule was new at 3.0. For the more general rule on defective utf-8 I
need to look elsewhere...

Aha, I'm getting close: under utf-8 clause D37 at
http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf
it refers back to conformance clause C12a, which is in section 3.2.

It rules out the interpretation of the ill-formed "code units"
themselves. It doesn't rule out attempting to interpret the remaining
sequences. So (as usual) Mozilla is conforming to the applicable
rules, and "representing the [defective] code unit with a marker", as
the Unicode requirement puts it (using the oblique "?" characters as
its marker), while still processing the remaining "code units" (the
ASCII source code characters).
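
The same marker-and-continue behaviour is available in Python's UTF-8 decoder, which makes for a quick illustration. The comment text is an invented sample; the marker here is U+FFFD rather than the oblique "?" that Mozilla's source view showed.

```python
# Assumed sample: an HTML comment whose Russian text is stored as cp1251.
raw = "<!-- привет -->".encode("cp1251")

# Strict decoding refuses the ill-formed sequences outright.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Lenient decoding replaces each ill-formed sequence with U+FFFD
# and carries on with the well-formed ASCII around it.
lenient = raw.decode("utf-8", errors="replace")

print(strict_ok, ascii(lenient))
```

The ASCII comment delimiters come through intact while the cp1251 octets turn into replacement characters.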

OK, something learned.

> The Georgian characters are written as &#number; - so the charset
> could be anything.

In a properly-behaved browser, indeed. (That old Netscape 4.* thing
would fail to render this, if it got to know that it was Windows-1251,
I reckon - but right now I don't have access to a copy of any NN4.*
version to try it. Whereas, I reckon, if one put it into its utf-8
mood, and gave it a unicode font with adequate repertoire, I'd expect
it to be able to render Georgian. Can we rate NN4.* as past history,
yet?)

cheers
Andreas Prilop
#24: Sep 2 '05

re: Unicode and html - help for simple web site


On Mon, 29 Aug 2005, Alan J. Flavell wrote:

>> Interestingly, Mozilla identifies the encoding (charset) of
>> [ Warning: Very slow! Read without images! ]
>> http://www.apple.com.ge/contacts.html
>> as Windows-1251 because of the Russian *comments*.
>
> So it does!

Google, which is broken in many ways, thinks it is ISO-8859-5
and puts a <meta ... charset=ISO-8859-5> into the cached version:
http://google.com/search?q=cache:www...s.html&strip=1

> Worryingly, when auto charset recognition was turned off, the encoding
> was reported as utf-8: but surely these strings of cp1251 bytes could
> not be valid utf-8? !!

UTF-8 is _your_ default, it seems.

Alan J. Flavell
#25: Sep 2 '05

re: Unicode and html - help for simple web site


On Fri, 2 Sep 2005, Andreas Prilop wrote:

> On Mon, 29 Aug 2005, Alan J. Flavell wrote:
>
> > Worryingly, when auto charset recognition was turned off, the encoding
> > was reported as utf-8: but surely these strings of cp1251 bytes could
> > not be valid utf-8? !!
>
> UTF-8 is _your_ default, it seems.

Oh, right. Sorry for any confusion caused.

But it led to some interesting considerations about utf-8 error
handling, nevertheless.

gruesse
