I have a home-made website that provides a free
1100-page physics textbook. It is written in HTML and
CSS. I recently added some Chinese text, and
since that day there have been problems.
The entry page has two Chinese characters,
but these are not seen on all browsers, even
though the page is validated by
the W3C validator.
( http://www.motionmountain.net/welcome.html)
(1) Why not?
Other pages do not validate in the W3C validator
( http://www.motionmountain.net/contents.html)
(2) What is wrong here?
Since I plan to add more languages, and the Unicode
issues are so tough:
(3) When uploading Unicode files via FTP,
which line endings have to be used (Mac, Unix, other)?
ASCII mode or binary mode? (I have Mac OS X)
(4) To get IE to read the pages, is it best to use UTF-8
or UTF-16?
Thank you for any help!
Christoph

ch***********@yahoo.com wrote: The entry page has two chinese characters, but these are not seen on all browsers, even though the page is validated by the w3c validator. ( http://www.motionmountain.net/welcome.html) (1) Why not?
It validates just because the validator is so permissive and does not
care about the conflict between the encoding you declare in the meta tag
(ISO-8859-1) and the encoding you actually use. In fact, I would say
that the validator is in error here: the encoding is specified as
ISO-8859-1 (the meta tag takes effect when no charset is specified in
HTTP headers), so the first two octets of the data _must_ be interpreted
as þÿ (Latin small letter thorn and Latin small letter y with
diaeresis), which of course violate HTML syntax when appearing before a
DOCTYPE declaration.
The validator incorrectly guesses that þÿ is meant to act as a byte
order mark in UTF-16 encoding and therefore treats the document as
UTF-16 encoded. (The guess is "correct" of course in a pragmatic sense,
but it's still an error.)
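The octet-level relationship Jukka describes is easy to check for yourself; here is a quick illustrative sketch in Python (not from the thread, just a demonstration of the same facts):

```python
# The UTF-16BE byte order mark is the octet pair FE FF.
bom = "\ufeff".encode("utf-16-be")
print(bom)                       # b'\xfe\xff'

# Read as ISO-8859-1, those same two octets come out as the characters
# thorn and y-with-diaeresis -- exactly the "þÿ" that then appears
# before the DOCTYPE and breaks HTML syntax.
print(bom.decode("iso-8859-1"))  # þÿ
```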
Browsers may behave in the same incorrect way, or they may correctly
interpret the document as ISO-8859-1 encoded, in which case it is
syntactically wrong and browsers may do what they like. Here's what Lynx
shows (there's a subtle hint to some problems other than encoding
problems in my quoting this):
þÿ
. jpg jpg
jpg
jpg jpg
MOTION MOUNTAIN
THE PHYSICS TEXTBOOK
logo
Welcome Contents Download Search Project Guest Book Links Author
Prizes July 5, 2005
jpg jpg
jpg jpg
Apparently, if you wish to use UTF-16, remove the tag
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
or (perhaps better) replace ISO-8859-1 by UTF-16 (in which case you need
to remember to change it again if you change the document's encoding).
Other pages do not validate in w3c ( http://www.motionmountain.net/contents.html) (2) What is wrong here?
1695 errors, wow! :-)
I suspect they relate to character encoding problems too; "non SGML
character number 0" sounds like the validator had encountered the NUL
character (U+0000) and got confused, but if I remember correctly, this
cryptic message arises in different situations.
How do you produce and edit your HTML files? It seems that they might
not all be properly UTF-16 encoded.
(3) When uploading unicode files via ftp, which line endings have to be used (mac, unix, other)?
In HTML, all commonly used line endings are traditionally (and by the
specs) accepted.
Ascii mode or binary mode? (I have Mac OSX)
If you use UTF-16 or UTF-8, binary - you do _not_ want any
Mac-to-something else conversions, since you are using a standard
Unicode encoding already. What matters is whether your editing software
produces correct UTF-something.
(4) To get IE to read the pages, is it best to use UTF-8 or UTF-16?
Even IE can handle both, but UTF-8 is surely more efficient, if the
majority of the text is English. In UTF-16, every (BMP) character is two
octets.
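The size difference is easy to measure; a small sketch (the sample string is my own invention, not text from the site):

```python
english = "Motion Mountain - The Physics Textbook. " * 25

print(len(english))                      # 1000 characters
print(len(english.encode("utf-8")))      # 1000 octets: 1 per ASCII character
print(len(english.encode("utf-16-le")))  # 2000 octets: 2 per BMP character
```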
Jukka K. Korpela wrote: ( http://www.motionmountain.net/contents.html) (2) What is wrong here?
1695 errors, wow! :-)
I suspect they relate to character encoding problems too; "non SGML character number 0" sounds like the validator had encountered the NUL character (U+0000) and got confused, but if I remember correctly, this cryptic message arises in different situations.
In this case, the error message is correct: the document contains data like
<td> </td>
so that each of those characters is followed by NUL, U+0000. The
validator reports NUL as an error, since it is a "non SGML character",
which is a technicality I won't dig into now. The problem is apparently
that the data comes, presumably via a server-side include (as the comment
before it suggests) from an ASCII file that is converted to Unicode
format too eagerly. If you have ASCII data to be embedded into a UTF-16
encoded document, each octet shall be followed by a zero octet. What has
happened here is that each octet is followed by _three_ zero octets (as
if the encoding were UTF-32), which means in UTF-16 interpretation that
you have NULs all around. Although browsers may skip NULs, NULs are an
error in HTML.
So perhaps there is some simple ASCII to UTF-16 transformation that is
applied _twice_ by mistake, or maybe there is an ASCII to UTF-32
transformation.
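That "applied twice" theory is easy to reproduce; the pipeline below is my guess at the kind of bug involved, not the actual server code:

```python
text = "<td>A</td>"

# Correct single conversion: each ASCII octet followed by one zero octet.
once = text.encode("utf-16-le")
print(once[:4])    # b'<\x00t\x00'

# Buggy second pass: treat the UTF-16 octets as if they were text and
# convert them to UTF-16 again. Every original octet is now followed by
# three zero octets, as in UTF-32 -- i.e. NULs everywhere when the
# result is interpreted as UTF-16.
twice = once.decode("latin-1").encode("utf-16-le")
print(twice[:4])   # b'<\x00\x00\x00'
```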
On Thu, 25 Aug 2005, Jukka K. Korpela wrote:
[...] (4) To get IE to read the pages, is it best to use UTF-8 or UTF-16?
Even IE can handle both, but UTF-8 is surely more efficient, if the majority of the text is English.
Agreed, and utf-8 is, in general, better supported than utf-16
(talking not only about browsers, but also about search engines etc.).
However, *if* the bulk of the content were in Chinese, presumably
utf-16 would be more compact than utf-8. (I have the impression that
currently, most Chinese documents are in one of the specifically
Chinese encodings, rather than Unicode, but that's by the by. Oh, and
when using Unicode, remember to specify the language, to help browsers
to choose a preferred rendering for unified Han characters[1]).
hope this helps
[1] This is not my field at all, but a web search throws up a
wikipedia article which, as far as I can tell, seems to be a
reasonable discussion at a level that I can understand. http://en.wikipedia.org/wiki/Han_unification
I can't speak personally for any of its technical detail - like any
wikipedia article, who knows what a specialist in the field would have
to say about it? For all I know, it may be spotless, I just can't
tell; but at least it gives the flavour of the issues involved. ch***********@yahoo.com wrote: Since I plan to add more languages, and the unicode issues are so tough:
(3) When uploading unicode files via ftp, which line endings have to be used (mac, unix, other)? Ascii mode or binary mode? (I have Mac OSX)
(4) To get IE to read the pages, is it best to use UTF-8 or UTF-16?
Unicode is not tough. Just make sure that you know the encoding of
your files, and ensure that the same encoding is specified in the HTTP
header and in a meta tag.
Do NOT use UTF-16. IE for Mac does not understand UTF-16.
--
Alan Wood http://www.alanwood.net (Unicode, special characters, pesticide names)
On Thu, 25 Aug 2005, Alan J. Flavell wrote: However, *if* the bulk of the content were in Chinese, presumably utf-16 would be more compact than utf-8.
For text/plain. However for text/html, even Chinese texts may be more
compact in UTF-8 depending on the amount of your (ASCII!) markup.
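A rough measurement bears this out; the sample table row below is invented purely for illustration:

```python
row = '<td class="entry">' + "山水力" + "</td>\n"   # 24 ASCII chars of markup, 3 CJK chars
page = row * 100

print(len(page.encode("utf-8")))      # 3300 octets: ASCII costs 1, CJK costs 3
print(len(page.encode("utf-16-le")))  # 5400 octets: everything costs 2
```

Once the ASCII markup dominates, UTF-8 wins even for CJK content.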
On Thu, 25 Aug 2005, Jukka K. Korpela wrote: Apparently, if you wish to use UTF-16, remove the tag <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"> or (perhaps better) replace ISO-8859-1 by UTF-16
Do you mean
< m e t a h t t p - e q u i v = " C o n t e n t - T y p e "
c o n t e n t = " t e x t / h t m l ; c h a r s e t = U T F - 1 6 " >
? This would be pointless. You would need to know the encoding (UTF-16)
in advance before you could even read this.
On 24 Aug 2005 ch***********@yahoo.com wrote: The entry page has two chinese characters,
The easiest way is to write &#number; for only two characters.
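Generating such references is trivial; a sketch (the two characters here are arbitrary examples, not necessarily the ones on the page):

```python
import html

# One &#number; numeric character reference per character.
refs = "".join(f"&#{ord(c)};" for c in "运动")
print(refs)

# The references round-trip back to the original characters.
assert html.unescape(refs) == "运动"
```

With only two non-ASCII characters on the page, this keeps the file pure ASCII and sidesteps the encoding declaration entirely.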
but these are not seen on all browsers,
Of course. You cannot expect that everyone has fonts with
Chinese characters on his computer.
( http://www.motionmountain.net/contents.html)
Please *do not* enclose the URL in parentheses! This might mean
your file is "contents.html)" . Always leave a space on *both*
sides of URLs.
(3) When uploading unicode files via ftp, which line endings have to be used (mac, unix, other)? Ascii mode or binary mode? (I have Mac OSX)
The best way is to upload in "text mode" (misnomer: "ASCII mode")
and have your files stored on _your_ computer with local line
endings. You must disable any transcoding "MacRoman <-> ISO-8859-1"
in your FTP program. At least Fetch has such an option.
(4) To get IE to read the pages, is it best to use UTF-8 or UTF-16?
UTF-8. Never use UTF-16 for the WWW.
Andreas Prilop wrote: On Thu, 25 Aug 2005, Jukka K. Korpela wrote:
Apparently, if you wish to use UTF-16, remove the tag <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"> or (perhaps better) replace ISO-8859-1 by UTF-16
Do you mean
< m e t a h t t p - e q u i v = " C o n t e n t - T y p e " c o n t e n t = " t e x t / h t m l ; c h a r s e t = U T F - 1 6 " >
? This would be pointless. You would need to know the encoding (UTF-16) in advance before you could even read this.
Would this work?
< m e t a h t t p - e q u i v = " C o n t e n t - T y p e "
c o n t e n t = " t e x t / h t m l ; c h a r s e t = U T F - 1 6 " >
<meta http-equiv="Content-Type"
content="text/html; charset=ISO-8859-1">
(akin to certain CSS hacks that involve trying to accomplish the same
thing in two different ways)
On Thu, 25 Aug 2005, Harlan Messinger wrote: Would this work?
< m e t a h t t p - e q u i v = " C o n t e n t - T y p e " c o n t e n t = " t e x t / h t m l ; c h a r s e t = U T F - 1 6 " > <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
Of course not. The encoding (charset) can only be *either* UTF-16
*or* ISO-8859-1 but not both at the same time.
In article
<Pi*************************************@s5b004.rrzn-user.uni-hannover.de>,
Andreas Prilop <nh******@rrzn-user.uni-hannover.de> wrote: However for text/html, even Chinese texts may be more compact in UTF-8 depending on the amount of your (ASCII!) markup.
Indeed. Also, gzip is a byte-oriented compression method and works
rather nicely on UTF-8-encoded markup.
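Henri's point can be sanity-checked with the standard library; a quick sketch (the sample markup is invented):

```python
import gzip

page = ('<p lang="zh">' + "山" * 40 + "</p>\n") * 200
raw = page.encode("utf-8")
packed = gzip.compress(raw)

# Repetitive UTF-8 markup compresses very well.
print(len(raw), len(packed))
assert gzip.decompress(packed) == raw
```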
--
Henri Sivonen hs******@iki.fi http://hsivonen.iki.fi/
Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html
Andreas Prilop wrote: On Thu, 25 Aug 2005, Jukka K. Korpela wrote:
Apparently, if you wish to use UTF-16, remove the tag <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"> or (perhaps better) replace ISO-8859-1 by UTF-16
Do you mean
< m e t a h t t p - e q u i v = " C o n t e n t - T y p e " c o n t e n t = " t e x t / h t m l ; c h a r s e t = U T F - 1 6 " >
? This would be pointless. You would need to know the encoding (UTF-16) in advance before you could even read this.
I stand corrected. Would you believe that the low-order byte of all my
brain cells was disabled when I wrote that?
There's apparently no way to leave the encoding unspecified at other
protocol levels and set it to UTF-16 in a meta tag, logically speaking. I
guess browsers might still get it, by first reading the data in implicit
ISO-8859-1 (or windows-1252) encoding, then realize it's now told to be
UTF-16, and proceed with that, assuming that everything read so far was
correctly interpreted.
Apparently the validator (and many browsers) actually play by XHTML
rules, which allow and mandate the recognition of UTF-16 from the byte
order mark.
On Fri, 26 Aug 2005, Jukka K. Korpela wrote: There's apparently no way to leave the encoding unspecified at other protocol levels and set it to UTF-16 in a meta tag, logically speaking. I guess browsers might still get it, by first reading the data in implicit ISO-8859-1 (or windows-1252) encoding, then realize it's now told to be UTF-16, and proceed with that, assuming that everything read so far was correctly interpreted.
Browsers often have some kind of auto-recognition algorithm for
character coding, and I'd suggest it's more likely that a browser
would auto-recognise utf-16, if that's what it is.
After all, HTML documents can be expected to start (aside from a
possible BOM) with a coded representation of characters from the ASCII
repertoire, even if they then go on to present a document body
containing a wide range of Unicode. There aren't too many different
ways of representing the ASCII repertoire, so a heuristic has a good
chance of recognising what's going on in such a case.
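Alan's heuristic can be sketched in a few lines; this is a toy detector assuming ASCII-dominated markup near the start of the file, not what any particular browser actually does:

```python
def sniff_utf16(data):
    """Guess UTF-16 from a BOM or from the NUL pattern of ASCII-heavy text."""
    if data[:2] == b"\xfe\xff":
        return "utf-16-be"
    if data[:2] == b"\xff\xfe":
        return "utf-16-le"
    head = data[:64]
    # ASCII characters in UTF-16 put a NUL in every other octet position.
    if head and all(b == 0 for b in head[1::2]):
        return "utf-16-le"
    if head and all(b == 0 for b in head[0::2]):
        return "utf-16-be"
    return None

print(sniff_utf16(b"\xff\xfe" + "<html>".encode("utf-16-le")))  # utf-16-le (BOM)
print(sniff_utf16("<!DOCTYPE html>".encode("utf-16-be")))       # utf-16-be (NUL pattern)
print(sniff_utf16(b"<!DOCTYPE html>"))                          # None: plain ASCII
```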
Of course, this kind of browser-specific heuristic does not in
any way replace the proper way of doing things! But it might
rescue a badly-served document that would otherwise be unusable.
On the other hand, CERT CA-2000-02 says that a document served out
without an HTTP charset attribute is a potential security risk!
best regards
On Thu, 25 Aug 2005, I wrote: And Google does not understand UTF-16. http://www.google.com/search?q=%22UTF+1+6%22
After reading some of those web pages:
*Really* clueless are the webpupils whose UTF-16-encoded pages
contain only characters from the ASCII repertoire. I call
this "overoverkill".
On Fri, 26 Aug 2005, Alan J. Flavell wrote: Browsers often have some kind of auto-recognition algorithm for character coding,
Interestingly, Mozilla identifies the encoding (charset) of
[ Warning: Very slow! Read without images! ] http://www.apple.com.ge/contacts.html
as Windows-1251 because of the Russian *comments*. The Georgian
characters are written as &#number; - so the charset could be
anything. Non-ASCII characters exist only inside comments:
Russian text in cp1251.
On the other hand, with TextEdit I do not seem to be able to
get rid of the NULs in http://www.motionmountain.net/contents.html
What is the best way to do this?
Christoph
In <11**********************@g47g2000cwa.googlegroups.com>, on
08/24/2005
at 06:44 PM, ch***********@yahoo.com said: (1) Why not?
Just because you encode your data in Unicode doesn't mean that the
user's browser can render them; he needs to have the appropriate
fonts. If he doesn't have a Chinese font for Unicode, . . .
(3) When uploading unicode files via ftp, which line endings have to be used (mac, unix, other)?
Whatever is appropriate for the sending and receiving systems.
(4) To get IE to read the pages, is it best to use UTF-8 or UTF-16?
UTF-8. UTF-16 is only appropriate if you are sending 16-bit bytes.
--
Shmuel (Seymour J.) Metz, SysProg and JOAT <http://patriot.net/~shmuel>
Unsolicited bulk E-mail subject to legal action. I reserve the
right to publicly post or ridicule any abusive E-mail. Reply to
domain Patriot dot net user shmuel+news to contact me. Do not
reply to sp******@library.lspace.org
On Sun, 28 Aug 2005, Shmuel (Seymour J.) Metz wrote: UTF-16 is only appropriate if you are sending 16-bit bytes.
"16-bit bytes"?
(apologies for the delayed response)
On Fri, 26 Aug 2005, Andreas Prilop wrote: On Fri, 26 Aug 2005, Alan J. Flavell wrote:
Browsers often have some kind of auto-recognition algorithm for character coding, Interestingly, Mozilla identifies the encoding (charset) of [ Warning: Very slow! Read without images! ] http://www.apple.com.ge/contacts.html as Windows-1251 because of the Russian *comments*.
So it does!
Worryingly, when auto charset recognition was turned off, the encoding
was reported as utf-8: but surely these strings of cp1251 bytes could
not be valid utf-8? !! When Moz. displays the HTML source under those
conditions, the Russian comments are displayed as strings of ??????
shown in an oblique font.
It should be noted that Unicode lays down rules with the intention of
promoting security, i.e. avoiding spoofing, which might otherwise be done
by supplying defective utf-8 sequences. I knew that in principle, but
didn't know the details till I looked just now. Let's see... http://www.unicode.org/reports/tr36/#UTF-8_Exploit
Oh, this is about the interpretation of non-shortest-form utf-8: that
rule was new at 3.0. For the more general rule on defective utf-8 I
need to look elsewhere...
Aha, I'm getting close: under utf-8 clause D37 at http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf
it refers back to conformance clause C12a, which is in section 3.2.
It rules out the interpretation of the ill-formed "code units"
themselves. It doesn't rule out attempting to interpret the remaining
sequences. So (as usual) Mozilla is conforming to the applicable
rules, and "representing the [defective] code unit with a marker", as
the Unicode requirement puts it (using the oblique "?" characters as
its marker), while still processing the remaining "code units" (the
ASCII source code characters).
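Python's decoder follows the same conformance rules, which makes the observed Mozilla behaviour easy to mimic; the Russian comment below is an invented stand-in for the ones on the Apple Georgia page:

```python
# cp1251-encoded Cyrillic octets are ill-formed when read as UTF-8.
data = b"<!-- " + "Русский текст".encode("cp1251") + b" --><p>ok</p>"

decoded = data.decode("utf-8", errors="replace")
print(decoded)
# The ill-formed code units become U+FFFD replacement markers, while the
# well-formed ASCII markup around them is still interpreted normally.
assert "\ufffd" in decoded
assert "<p>ok</p>" in decoded
```

(Mozilla's choice of marker glyph differs, but the conforming behaviour is the same: mark the ill-formed units, keep processing the rest.)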
OK, something learned.
The Georgian characters are written as &#number; - so the charset could be anything.
In a properly-behaved browser, indeed. (That old Netscape 4.* thing
would fail to render this, if it got to know that it was Windows-1251,
I reckon - but right now I don't have access to a copy of any NN4.*
version to try it. Whereas, I reckon, if one put it into its utf-8
mood, and gave it a unicode font with adequate repertoire, I'd expect
it to be able to render Georgian. Can we rate NN4.* as past history,
yet?)
cheers
On Mon, 29 Aug 2005, Alan J. Flavell wrote: Interestingly, Mozilla identifies the encoding (charset) of [ Warning: Very slow! Read without images! ] http://www.apple.com.ge/contacts.html as Windows-1251 because of the Russian *comments*. So it does!
Google, which is broken in many ways, thinks it is ISO-8859-5
and puts a <meta ... charset=ISO-8859-5> into the cached version: http://google.com/search?q=cache:www...s.html&strip=1
Worryingly, when auto charset recognition was turned off, the encoding was reported as utf-8: but surely these strings of cp1251 bytes could not be valid utf-8? !!
UTF-8 is _your_ default, it seems.
On Fri, 2 Sep 2005, Andreas Prilop wrote: On Mon, 29 Aug 2005, Alan J. Flavell wrote:
Worryingly, when auto charset recognition was turned off, the encoding was reported as utf-8: but surely these strings of cp1251 bytes could not be valid utf-8? !!
UTF-8 is _your_ default, it seems.
Oh, right. Sorry for any confusion caused.
But it led to some interesting considerations about utf-8 error
handling, nevertheless.
gruesse