Unicode and html - help for simple web site


I have a home-made website that provides a free
1100-page physics textbook. It is written in HTML and
CSS. I recently added some Chinese text, and
since then there have been problems.

The entry page has two Chinese characters,
but these are not displayed in all browsers, even
though the page passes
the W3C validator
(http://www.motionmountain.net/welcome.html).
(1) Why not?

Other pages do not pass the W3C validator
(http://www.motionmountain.net/contents.html).
(2) What is wrong here?

Since I plan to add more languages, and the Unicode
issues are so tough:

(3) When uploading Unicode files via ftp,
which line endings have to be used (Mac, Unix, other)?
ASCII mode or binary mode? (I have Mac OS X)

(4) To get IE to read the pages, is it best to use UTF-8
or UTF-16?

Thank you for any help!

Christoph

Aug 25 '05
On Thu, 25 Aug 2005, Harlan Messinger wrote:
Would this work?

< m e t a h t t p - e q u i v = " C o n t e n t - T y p e "
c o n t e n t = " t e x t / h t m l ; c h a r s e t = U T F - 1 6 " >
<meta http-equiv="Content-Type"
content="text/html; charset=ISO-8859-1">


Of course not. The encoding (charset) can only be *either* UTF-16
*or* ISO-8859-1 but not both at the same time.

Aug 25 '05 #11
Andreas Prilop wrote:
However for text/html, even Chinese texts may be more
compact in UTF-8 depending on the amount of your (ASCII!) markup.


Indeed. Also, gzip is a byte-oriented compression method and works
rather nicely on UTF-8-encoded markup.

--
Henri Sivonen
hs******@iki.fi
http://hsivonen.iki.fi/
Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html
Aug 25 '05 #12
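
To put rough numbers on the size comparison above, here is a minimal Python sketch. The sample markup is made up for illustration and is not taken from the site; the exact gzip figures depend on the input, so they are printed rather than stated.

```python
import gzip

# Hypothetical sample page: ASCII markup around two Chinese characters,
# repeated to imitate a longer document.
line = '<p class="greeting"><a href="welcome.html">欢迎</a></p>\n'
page = line * 50

utf8_bytes = page.encode("utf-8")
utf16_bytes = page.encode("utf-16")  # Python's "utf-16" codec prepends a BOM

print("UTF-8: ", len(utf8_bytes), "bytes,", len(gzip.compress(utf8_bytes)), "gzipped")
print("UTF-16:", len(utf16_bytes), "bytes,", len(gzip.compress(utf16_bytes)), "gzipped")

# For the Chinese characters alone, UTF-16 is the smaller encoding:
# 3 bytes per character in UTF-8 versus 2 in UTF-16.
chinese_only = "欢迎"
print(len(chinese_only.encode("utf-8")), len(chinese_only.encode("utf-16-le")))
```

With markup-heavy pages the ASCII tags dominate, which is why UTF-8 tends to come out smaller overall even for Chinese text.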
Andreas Prilop wrote:
On Thu, 25 Aug 2005, Jukka K. Korpela wrote:
Apparently, if you wish to use UTF-16, remove the tag
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
or (perhaps better) replace ISO-8859-1 by UTF-16

Do you mean

< m e t a h t t p - e q u i v = " C o n t e n t - T y p e "
c o n t e n t = " t e x t / h t m l ; c h a r s e t = U T F - 1 6 " >

? This would be pointless. You would need to know the encoding (UTF-16)
in advance before you could even read this.


I stand corrected. Would you believe that the low-order byte of all my
brain cells was disabled when I wrote that?

There's apparently no way to leave the encoding unspecified at other
protocol levels and set it to UTF-16 in a meta tag, logically speaking. I
guess browsers might still get it, by first reading the data in implicit
ISO-8859-1 (or windows-1252) encoding, then realizing that it is declared
to be UTF-16, and proceeding with that, assuming that everything read so
far was correctly interpreted.

Apparently the validator (and many browsers) actually play by XHTML
rules, which allow and mandate the recognition of UTF-16 from the byte
order mark.
Aug 26 '05 #13
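
The chicken-and-egg problem discussed above, and the byte order mark that sidesteps it, can be sketched in Python. The function name sniff_bom is hypothetical, and this is only an illustration of the idea, not how any particular browser or the validator is implemented. UTF-32 is deliberately ignored here.

```python
import codecs

META = '<meta http-equiv="Content-Type" content="text/html; charset=UTF-16">'

# Python's "utf-16" codec writes a BOM followed by two bytes per character.
raw = META.encode("utf-16")

# A reader that assumes ISO-8859-1 sees a NUL next to every ASCII character,
# so the tag does not appear as the plain string it was meant to be.
as_latin1 = raw.decode("iso-8859-1")
print(META in as_latin1)       # the tag is not found as-is
print("\x00" in as_latin1)     # NULs are interleaved with the letters


def sniff_bom(data: bytes):
    """Return a codec name if the data starts with a known BOM, else None.

    Hypothetical helper; UTF-32 BOMs are not handled in this sketch.
    """
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    return None


encoding = sniff_bom(raw)
print(encoding, raw.decode(encoding) == META)
```

The BOM is readable before anything else in the file has been interpreted, which is what makes it usable where the meta tag is not.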
On Fri, 26 Aug 2005, Jukka K. Korpela wrote:
There's apparently no way to leave the encoding unspecified at other
protocol levels and set it to UTF-16 in a meta tag, logically
speaking. I guess browsers might still get it, by first reading the
data in implicit ISO-8859-1 (or windows-1252) encoding, then realizing
that it is declared to be UTF-16, and proceeding with that, assuming
that everything read so far was correctly interpreted.


Browsers often have some kind of auto-recognition algorithm for
character coding, and I'd suggest it's more likely that a browser
would auto-recognise utf-16, if that's what it is.

After all, HTML documents can be expected to start (aside from a
possible BOM) with a coded representation of characters from the ASCII
repertoire, even if they then go on to present a document body
containing a wide range of Unicode. There aren't too many different
ways of representing the ASCII repertoire, so a heuristic has a good
chance of recognising what's going on in such a case.

Of course, this kind of browser-specific heuristic does not in
any way replace the proper way of doing things! But it might
rescue a badly-served document that would otherwise be unusable.

On the other hand, CERT CA-2000-02 says that a document served out
without an HTTP charset attribute is a potential security risk!

best regards
Aug 26 '05 #14
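
The kind of heuristic described above can be sketched as follows, assuming the document starts with ASCII-repertoire markup and has no BOM. The function name and the 64-byte sample size are arbitrary choices for illustration; real browser detectors are considerably more elaborate.

```python
def guess_utf16_without_bom(data: bytes, sample_size: int = 64):
    """Guess UTF-16 byte order from the NUL pattern of leading ASCII markup.

    Hypothetical helper: returns "utf-16-le", "utf-16-be", or None when the
    start of the data does not look like UTF-16-encoded ASCII.
    """
    sample = data[:sample_size]
    if len(sample) < 4:
        return None
    even, odd = sample[0::2], sample[1::2]
    # ASCII in UTF-16LE: character byte first, then a NUL.
    if all(b != 0 for b in even) and all(b == 0 for b in odd):
        return "utf-16-le"
    # ASCII in UTF-16BE: NUL first, then the character byte.
    if all(b == 0 for b in even) and all(b != 0 for b in odd):
        return "utf-16-be"
    return None


start = "<html><head><title>x</title>"
print(guess_utf16_without_bom(start.encode("utf-16-le")))
print(guess_utf16_without_bom(start.encode("utf-16-be")))
print(guess_utf16_without_bom(start.encode("utf-8")))
```

As noted above, such guessing is only a rescue measure and does not replace declaring the encoding properly.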
On Thu, 25 Aug 2005, I wrote:
And Google does not understand UTF-16.
http://www.google.com/search?q=%22UTF+1+6%22


After reading some of those web pages:
*Really* clueless are the webpupils whose UTF-16-encoded pages
contain only characters from the ASCII repertoire. I call
this "overoverkill".

Aug 26 '05 #15
On Fri, 26 Aug 2005, Alan J. Flavell wrote:
Browsers often have some kind of auto-recognition algorithm for
character coding,


Interestingly, Mozilla identifies the encoding (charset) of
[ Warning: Very slow! Read without images! ]
http://www.apple.com.ge/contacts.html
as Windows-1251 because of the Russian *comments*. The Georgian
characters are written as &#number; - so the charset could be
anything. Non-ASCII characters exist only inside comments:
Russian text in cp1251.

Aug 26 '05 #16

Thank you all for the advice. That was really useful.
Using Mac's Textedit, I took two files, namely

http://www.motionmountain.net/welcome.html
and
http://www.motionmountain.net/index.html

converted them to UTF-8, changed the charset
to UTF-8, used (Unix) ftp in ASCII mode to upload them,
and put them on the server.
(It is simple handmade HTML, no server-side includes)

Are these pages ok now?

Christoph

Aug 26 '05 #17
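
The conversion steps described above can also be scripted. This is a minimal sketch that assumes the old file really is in the encoding named by source_encoding and that its meta tag contains a charset=... declaration; the function name is hypothetical, and the line-ending normalisation to LF is an added convenience, not something the thread prescribes.

```python
import re


def convert_page_to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode a page from source_encoding and return it re-encoded as UTF-8."""
    text = raw.decode(source_encoding)
    # Point the meta declaration at the new encoding.
    text = re.sub(r"charset=[\w-]+", "charset=UTF-8", text, flags=re.IGNORECASE)
    # Normalise DOS (CRLF) and classic Mac (CR) line endings to Unix LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return text.encode("utf-8")


# Made-up sample input: ISO-8859-1 with classic Mac line endings.
old = (
    '<meta http-equiv="Content-Type" '
    'content="text/html; charset=ISO-8859-1">\r<p>caf\xe9</p>\r'
).encode("iso-8859-1")

new = convert_page_to_utf8(old, "iso-8859-1")
print(new.decode("utf-8"))
```

For real files, read and write in binary mode (open(path, "rb") and open(path, "wb")) so that Python does not reinterpret the bytes on the way in or out.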

On the other hand, with TextEdit I do not seem to be able to
get rid of the NULs in

http://www.motionmountain.net/contents.html

What is the best way to do this?

Christoph

Aug 26 '05 #18
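
One possible way to get rid of such NULs, assuming they are there because the file is stored as UTF-16 with a byte order mark (the thread does not confirm this for contents.html), is to transcode the bytes rather than edit them by hand. The function name below is hypothetical.

```python
import codecs


def utf16_to_utf8(raw: bytes) -> bytes:
    """Re-encode BOM-marked UTF-16 data as UTF-8; return other data unchanged."""
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the BOM, picks the byte order, and drops it.
        return raw.decode("utf-16").encode("utf-8")
    return raw


# Made-up sample standing in for a UTF-16 page.
sample = "<p>目录</p>\n".encode("utf-16")
fixed = utf16_to_utf8(sample)

print(b"\x00" in sample, b"\x00" in fixed)
print(fixed.decode("utf-8"))
```

Simply deleting the NUL bytes would only work for pure ASCII content and would corrupt any non-ASCII characters, which is why the sketch decodes and re-encodes instead.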
Andreas Prilop wrote:
Interestingly, Mozilla identifies the encoding (charset) of
[ Warning: Very slow! Read without images! ]
http://www.apple.com.ge/contacts.html
as Windows-1251 because of the Russian *comments*.


The detector operates on the byte stream before the parser, so it does
not matter if the text is in comments.

--
Henri Sivonen
hs******@iki.fi
http://hsivonen.iki.fi/
Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html
Aug 26 '05 #19
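
The two observations above can be reproduced in a small Python sketch: numeric character references are pure ASCII and survive any ASCII-compatible charset guess, while the raw bytes of a comment are what a byte-level detector actually sees. The page content here is invented for illustration and is not a copy of the page mentioned in the thread.

```python
import html

georgian = "საქართველო"
# Write each character as a decimal numeric character reference (&#number;).
ncr = "".join(f"&#{ord(ch)};" for ch in georgian)
body = f"<p>{ncr}</p>"

# A Russian comment stored as raw Windows-1251 bytes, followed by the
# ASCII-only body.
comment = "<!-- Контакты -->".encode("cp1251")
page = comment + body.encode("ascii")

for charset in ("cp1251", "iso-8859-1", "koi8_r"):
    decoded = page.decode(charset)
    comment_text, body_text = decoded.split("-->", 1)
    # The body resolves to the same Georgian text whatever the charset guess;
    # only the comment changes.
    print(charset, repr(comment_text), html.unescape(body_text))
```

Only the comment bytes carry any information about the charset, so a detector working on the byte stream has nothing else to go on.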

The files http://www.motionmountain.net/search.html
and http://www.motionmountain.net/project.html
etc. are somehow messed up.
I will use an older backup copy and convert them to UTF-8 afresh.
Thank you for all the help!

Regards

Christoph

Aug 27 '05 #20
