Unicode and html - help for simple web site

chri_schiller

I have a home-made website that provides a free
1100-page physics textbook. It is written in HTML and
CSS. I recently added some Chinese text, and
since then there have been problems.

The entry page has two Chinese characters,
but these are not displayed in all browsers, even
though the page is validated by
the W3C validator.
( http://www.motionmountain.net/welcome.html)
(1) Why not?

Other pages do not validate with the W3C validator
( http://www.motionmountain.net/contents.html)
(2) What is wrong here?

Since I plan to add more languages, and the Unicode
issues are so tough:

(3) When uploading Unicode files via FTP,
which line endings have to be used (Mac, Unix, other)?
ASCII mode or binary mode? (I have Mac OS X)

(4) To get IE to read the pages, is it best to use UTF-8
or UTF-16?

Thank you for any help!

Christoph

Aug 25 '05 #1

Jukka K. Korpela

ch***********@yahoo.com wrote:

The entry page has two chinese characters,
but these are not seen on all browsers, even
though the page is validated by
the w3c validator.
( http://www.motionmountain.net/welcome.html)
(1) Why not?

It validates just because the validator is so permissive and does not
care about the conflict between the encoding you declare in the meta tag
(ISO-8859-1) and the encoding you actually use. In fact, I would say
that the validator is in error here: the encoding is specified as
ISO-8859-1 (the meta tag takes effect when no charset is specified in
HTTP headers), so the first two octets of the data _must_ be interpreted
as þÿ (Latin letter small thorn and Latin letter small y with
diaeresis), which of course violate HTML syntax when appearing before a
DOCTYPE declaration.

The validator incorrectly guesses that þÿ is meant to act as a byte
order mark in UTF-16 encoding and therefore treats the document as
UTF-16 encoded. (The guess is "correct" of course in a pragmatic sense,
but it's still an error.)
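
A minimal Python sketch of the conflict described above. The sample bytes are an assumption standing in for the start of the file (a big-endian UTF-16 byte order mark followed by the beginning of a DOCTYPE); the real page was not inspected at byte level here.

```python
import codecs

# Hypothetical start of a file saved as big-endian UTF-16 with a byte order mark.
data = codecs.BOM_UTF16_BE + "<!DOCTYPE".encode("utf-16-be")

# Read as ISO-8859-1 (the declared encoding), the first two octets
# are two ordinary Latin-1 letters.
as_latin1 = data[:2].decode("iso-8859-1")

# Read as UTF-16 (the guessed encoding), the same two octets are consumed
# as the byte order mark and do not appear in the text at all.
as_utf16 = data.decode("utf-16")

print(repr(as_latin1))
print(repr(as_utf16))
```

Under the declared encoding the document starts with two stray letters before the DOCTYPE; under the guessed encoding it starts with the DOCTYPE itself.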

Browsers may behave in the same incorrect way, or they may correctly
interpret the document as ISO-8859-1 encoded, in which case it is
syntactically wrong and browsers may do what they like. Here's what Lynx
shows (there's a subtle hint to some problems other than encoding
problems in my quoting this):

þÿ

. jpg jpg
jpg
jpg jpg

MOTION MOUNTAIN

THE PHYSICS TEXTBOOK

logo

Welcome Contents Download Search Project Guest Book Links Author
Prizes July 5, 2005
jpg jpg
jpg jpg

Apparently, if you wish to use UTF-16, remove the tag
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
or (perhaps better) replace ISO-8859-1 by UTF-16 (in which case you need
to remember to change it again if you change the document's encoding).

Other pages do not validate in w3c
( http://www.motionmountain.net/contents.html)
(2) What is wrong here?

1695 errors, wow! :-)

I suspect they relate to character encoding problems too; "non SGML
character number 0" sounds like the validator had encountered the NUL
character (U+0000) and got confused, but if I remember correctly, this
cryptic message arises in different situations.

How do you produce and edit your HTML files? It seems that they might
not all be properly UTF-16 encoded.

(3) When uploading unicode files via ftp,
which line endings have to be used (mac, unix, other)?

In HTML, all commonly used line endings are traditionally (and by the
specs) accepted.

Ascii mode or binary mode? (I have Mac OSX)

If you use UTF-16 or UTF-8, binary - you do _not_ want any
Mac-to-something else conversions, since you are using a standard
Unicode encoding already. What matters is whether your editing software
produces correct UTF-something.
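
To see why a line-ending conversion is unwelcome for UTF-16 data, here is a rough Python sketch. The byte-level replacement is only a crude stand-in for what a text-mode transfer might do (real FTP clients and servers differ), and the sample text is made up.

```python
# Hypothetical two-line file content, saved as big-endian UTF-16.
text = "line one\nline two\n"
original = text.encode("utf-16-be")

# Assumed behaviour of a naive text-mode transfer:
# every LF octet is rewritten as CR LF, without regard to the encoding.
damaged = original.replace(b"\n", b"\r\n")

# One extra octet per line break shifts the two-octet alignment,
# so everything after the first line break decodes to different characters.
recovered = damaged.decode("utf-16-be", errors="replace")

print(len(original), len(damaged))
print(recovered == text)
```

A binary-mode transfer sends the octets unchanged, so the question of line-ending conversion never arises.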
(4) To get IE to read the pages, is it best to use UTF-8
or UTF-16?

Even IE can handle both, but UTF-8 is surely more efficient, if the
majority of the text is English. In UTF-16, every (BMP) character is two
octets.
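
A quick way to check the size difference in Python, as a sketch with a made-up English sample string:

```python
# Hypothetical sample of mostly-English page text (ASCII characters only).
english = "Motion Mountain: the physics textbook"

utf8_size = len(english.encode("utf-8"))
# "utf-16-be" is used so that no byte order mark is added to the count.
utf16_size = len(english.encode("utf-16-be"))

print("UTF-8: ", utf8_size, "octets")
print("UTF-16:", utf16_size, "octets")
```

For pure ASCII text, the UTF-16 form is exactly twice the size of the UTF-8 form.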

Aug 25 '05 #2

Jukka K. Korpela

Jukka K. Korpela wrote:

( http://www.motionmountain.net/contents.html)
(2) What is wrong here?

1695 errors, wow! :-)

I suspect they relate to character encoding problems too; "non SGML
character number 0" sounds like the validator had encountered the NUL
character (U+0000) and got confused, but if I remember correctly, this
cryptic message arises in different situations.

In this case, the error message is correct: the document contains data like
<td> </td>
so that each of those characters is followed by NUL, U+0000. The
validator reports NUL as an error, since it is a "non SGML character",
which is a technicality I won't dig into now. The problem is apparently
that the data comes, presumably via server-side include (as the comment
before it suggests) from an ASCII file that is converted to Unicode
format too eagerly. If you have ASCII data to be embedded into a UTF-16
encoded document, each octet shall be followed by a zero octet. What has
happened here is that each octet is followed by _three_ zero octets (as
if the encoding were UTF-32), which means in UTF-16 interpretation that
you have NULs all around. Although browsers may skip NULs, NULs are an
error in HTML.

So perhaps there is some simple ASCII to UTF-16 transformation that is
applied _twice_ by mistake, or maybe there is an ASCII to UTF-32
transformation.
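
The double-conversion hypothesis can be reproduced in a few lines of Python. This is only a sketch of one way the mistake could happen; the little-endian byte order and the ISO-8859-1 intermediate step are assumptions made for the illustration, not facts about the site.

```python
fragment = "<td>"  # ASCII markup, like the data in the included file

# One correct ASCII-to-UTF-16 step: each octet is followed by one zero octet.
once = fragment.encode("utf-16-le")

# Assumed mistake: the result is treated as one-octet-per-character text
# and converted a second time.
twice = once.decode("iso-8859-1").encode("utf-16-le")

# After the second step each original octet is followed by three zero octets,
# which is octet-for-octet what UTF-32 would produce.
same_as_utf32 = twice == fragment.encode("utf-32-le")

# A reader that expects UTF-16 sees a NUL after every real character.
seen_by_utf16_reader = twice.decode("utf-16-le")

print(once.hex())
print(twice.hex())
print(same_as_utf32)
print(repr(seen_by_utf16_reader))
```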

Aug 25 '05 #3

Alan J. Flavell

On Thu, 25 Aug 2005, Jukka K. Korpela wrote:

[...]

(4) To get IE to read the pages, is it best to use UTF-8
or UTF-16?

Even IE can handle both, but UTF-8 is surely more efficient, if the
majority of the text is English.

Agreed, and utf-8 is, in general, better supported than utf-16
(talking not only about browsers, but also about search engines etc.).

However, *if* the bulk of the content were in Chinese, presumably
utf-16 would be more compact than utf-8. (I have the impression that
currently, most Chinese documents are in one of the specifically
Chinese encodings, rather than Unicode, but that's by the by. Oh, and
when using Unicode, remember to specify the language, to help browsers
to choose a preferred rendering for unified Han characters[1]).

hope this helps

[1] This is not my field at all, but a web search throws up a
wikipedia article which, as far as I can tell, seems to be a
reasonable discussion at a level that I can understand.
http://en.wikipedia.org/wiki/Han_unification

I can't speak personally for any of its technical detail - like any
wikipedia article, who knows what a specialist in the field would have
to say about it? For all I know, it may be spotless, I just can't
tell; but at least it gives the flavour of the issues involved.

Aug 25 '05 #4

Alan Wood

ch***********@yahoo.com wrote:

Since I plan to add more languages, and the unicode
issues are so tough:

(3) When uploading unicode files via ftp,
which line endings have to be used (mac, unix, other)?
Ascii mode or binary mode? (I have Mac OSX)

(4) To get IE to read the pages, is it best to use UTF-8
or UTF-16?

Unicode is not tough. Just make sure that you know the encoding of
your files, and ensure that the same encoding is specified in the HTTP
header and in a meta tag.
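
A small Python sketch of such a consistency check. The function names and the sample header and markup are hypothetical; the meta tag scan is a simple regular expression, not a full HTML parser.

```python
import re
from email.message import Message
from typing import Optional

def charset_from_header(content_type: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type header value, if any."""
    msg = Message()
    msg["Content-Type"] = content_type
    value = msg.get_param("charset")
    return value.lower() if isinstance(value, str) else None

# Matches charset=... inside a <meta ...> tag near the start of the file.
META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""", re.IGNORECASE
)

def charset_from_meta(html: bytes) -> Optional[str]:
    """Return the charset named in a meta tag within the first 2048 octets, if any."""
    match = META_CHARSET.search(html[:2048])
    return match.group(1).decode("ascii").lower() if match else None

# Hypothetical example: the server says UTF-8, the page says ISO-8859-1.
header = "text/html; charset=UTF-8"
page = b'<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">'

consistent = charset_from_header(header) == charset_from_meta(page)
print(charset_from_header(header), charset_from_meta(page), consistent)
```

Note that this octet-level scan can only find the meta tag when the file is in an ASCII-compatible encoding such as UTF-8 or ISO-8859-1; in a UTF-16 file the tag's octets are interleaved with zero octets and the pattern will not match.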

Do NOT use UTF-16. I.E. for Mac does not understand UTF-16.

--
Alan Wood
http://www.alanwood.net (Unicode, special characters, pesticide names)

Aug 25 '05 #5

Andreas Prilop

On 25 Aug 2005, Alan Wood wrote:

Do NOT use UTF-16. I.E. for Mac does not understand UTF-16.

And Google does not understand UTF-16.
http://www.google.com/search?q=%22UTF+1+6%22

Aug 25 '05 #6

Andreas Prilop

On Thu, 25 Aug 2005, Alan J. Flavell wrote:

However, *if* the bulk of the content were in Chinese, presumably
utf-16 would be more compact than utf-8.

For text/plain. However for text/html, even Chinese texts may be more
compact in UTF-8 depending on the amount of your (ASCII!) markup.
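
A rough Python comparison of the two cases. The Chinese sample text and the surrounding markup are made up for the illustration; the outcome for a real page depends on its actual ratio of markup to text.

```python
# Hypothetical Chinese sample text and some typical ASCII markup around it.
plain = "物理学教科书"
html = '<p class="intro"><a href="welcome.html">' + plain + "</a></p>"

def sizes(s: str) -> tuple:
    """Return (UTF-8 size, UTF-16 size) in octets, without byte order mark."""
    return len(s.encode("utf-8")), len(s.encode("utf-16-be"))

plain_utf8, plain_utf16 = sizes(plain)
html_utf8, html_utf16 = sizes(html)

print("text/plain:", plain_utf8, "vs", plain_utf16)
print("text/html: ", html_utf8, "vs", html_utf16)
```

For the bare text UTF-16 is smaller; once the ASCII markup is included, UTF-8 comes out smaller in this sample.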

Aug 25 '05 #7

Andreas Prilop

On Thu, 25 Aug 2005, Jukka K. Korpela wrote:

Apparently, if you wish to use UTF-16, remove the tag
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
or (perhaps better) replace ISO-8859-1 by UTF-16

Do you mean

< m e t a h t t p - e q u i v = " C o n t e n t - T y p e "
c o n t e n t = " t e x t / h t m l ; c h a r s e t = U T F - 1 6 " >

? This would be pointless. You would need to know the encoding (UTF-16)
in advance before you could even read this.

Aug 25 '05 #8

Andreas Prilop

On 24 Aug 2005 ch***********@yahoo.com wrote:

The entry page has two chinese characters,

The easiest way is to write &#number; for only two characters.

but these are not seen on all browsers,

Of course. You cannot expect that everyone has fonts with
Chinese characters on his computer.

( http://www.motionmountain.net/contents.html)

Please *do not* enclose the URL in parentheses! This might mean
your file is "contents.html)". Always leave a space on *both*
sides of URLs.

(3) When uploading unicode files via ftp,
which line endings have to be used (mac, unix, other)?
Ascii mode or binary mode? (I have Mac OSX)

The best way is to upload in "text mode" (misnomer: "ASCII mode")
and have your files stored on _your_ computer with local line
endings. You must disable any transcoding "MacRoman <-> ISO-8859-1"
in your FTP program. At least Fetch has such an option.

(4) To get IE to read the pages, is it best to use UTF-8
or UTF-16?

UTF-8. Never use UTF-16 for the WWW.
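
As for the &#number; notation mentioned at the top of this reply: the numbers are simply the decimal Unicode code points, and Python can produce them directly. The two characters below are a hypothetical stand-in, since the actual characters on the page are not given in this thread.

```python
import html

# Hypothetical stand-in for the two Chinese characters on the entry page.
chinese = "中文"

# The "xmlcharrefreplace" error handler turns every non-ASCII character
# into a numeric character reference, leaving a pure ASCII string.
as_references = chinese.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(as_references)  # &#20013;&#25991;

# Decoding the references gives back the original characters.
round_trip = html.unescape(as_references)
print(round_trip == chinese)  # True
```

The resulting page source is plain ASCII, so it works the same whether the file is declared as ISO-8859-1 or UTF-8. Whether the characters are displayed still depends on the reader having a suitable font.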

Aug 25 '05 #9

Harlan Messinger

Andreas Prilop wrote:

On Thu, 25 Aug 2005, Jukka K. Korpela wrote:

Apparently, if you wish to use UTF-16, remove the tag
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
or (perhaps better) replace ISO-8859-1 by UTF-16

Do you mean

< m e t a h t t p - e q u i v = " C o n t e n t - T y p e "
c o n t e n t = " t e x t / h t m l ; c h a r s e t = U T F - 1 6 " >

? This would be pointless. You would need to know the encoding (UTF-16)
in advance before you could even read this.

Would this work?

< m e t a h t t p - e q u i v = " C o n t e n t - T y p e "
c o n t e n t = " t e x t / h t m l ; c h a r s e t = U T F - 1 6 " >
<meta http-equiv="Content-Type"
content="text/html; charset=ISO-8859-1">

(akin to certain CSS hacks that involve trying to accomplish the same
thing in two different ways)

Aug 25 '05 #10
