Can an HTML source file be specified in Unicode?

Jan Jaap
#1: Oct 12 '06
Hi!

I read the following discussion, which is about charset encoding and HTML:
http://groups.google.com/group/comp....0e36acb6d68931

I am looking for a way to make my online service compatible with any
language. The text is stored in MySQL as UTF-8, it's retrieved from the
database as UTF-8, there is a meta element in the HTML which specifies
charset UTF-8, the PHP script which serves the HTML file sends a
Content-Type header with charset UTF-8, and the server adds a UTF-8 header
to all .html and .php files. Still, the source of the HTML file is
malformed with UTF-8 encoded characters!
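
For illustration only, here is a minimal sketch of what "UTF-8 at every step" means for the response itself. It is written in Python rather than PHP so it can be run on its own, and the function name build_response is made up for this example; it is not taken from the site in question. The point is that the HTTP header, the meta element and the actual bytes of the body all have to agree.

```python
import html


def build_response(text: str) -> tuple[dict[str, str], bytes]:
    """Return (headers, body) where label and bytes are both UTF-8.

    Hypothetical helper for illustration; not the poster's code.
    """
    page = (
        "<!DOCTYPE html>\n"
        "<html><head>\n"
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n'
        "<title>Demo</title>\n"
        "</head><body>\n"
        f"<p>{html.escape(text)}</p>\n"
        "</body></html>\n"
    )
    # The label alone is not enough: the bytes sent must really be UTF-8.
    body = page.encode("utf-8")
    headers = {
        "Content-Type": "text/html; charset=utf-8",
        "Content-Length": str(len(body)),  # length in bytes, not characters
    }
    return headers, body


headers, body = build_response("ë, Привет")
print(headers["Content-Type"])
print(len(body), "bytes")
```

If any one of the three (header, meta element, bytes) disagrees with the others, a viewer may pick the wrong decoding.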

The HTML file itself however displays the characters fine; only the
HTML source is malformed, and this may cause Google to index the site
improperly, as we have seen before.

Does anyone have any suggestion what else I should do to get the HTML
source to display the characters properly?

This is my site: http://testforum.obeyon.com/

Best Regards,
Jan Jaap Hakvoort


Andy Dingley
#2: Oct 12 '06

Jan Jaap wrote:
> I am looking for a way to make my online service compatible for any
> language. The text is stored in MySQL as UTF-8, it's collected from the
> database as UTF-8, there is a Meta header in the HTML which specifies
> charset UTF-8, the PHP script which serves the html file places an
> header with content type UTF-8, the server will add an UTF-8 header to
> all .html and .php files but still the source of the html file is
> malformed with UTF-8 encoded characters!

Sounds like you don't have Unicode in your database. It doesn't matter
what you label it as, it actually has to _be_ Unicode too. If you store
non-ASCII ISO-8859-* bytes in there, then it'll give you bad characters.
ASCII will be OK though, so you might not notice this problem at first
- it's only evident on the non-ASCII characters.
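
A small sketch of that point, in Python for illustration (the helper name is_valid_utf8 is invented here): bytes that were stored as ISO-8859-1 do not become UTF-8 just because they are labelled that way, while pure ASCII happens to be identical in both encodings.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


ascii_bytes = "hello".encode("iso-8859-1")   # same bytes in ASCII and UTF-8
latin1_bytes = "ë".encode("iso-8859-1")      # a single byte, 0xEB
utf8_bytes = "ë".encode("utf-8")             # two bytes, 0xC3 0xAB

print(is_valid_utf8(ascii_bytes))
print(is_valid_utf8(latin1_bytes))
print(is_valid_utf8(utf8_bytes))
```

A check like this on a few rows containing accented letters would show whether the stored data really is UTF-8.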

Andreas Prilop
#3: Oct 12 '06

On 12 Oct 2006, Jan Jaap wrote:
> The HTML file itself however displays the characters fine,

"The HTML file itself" displays nothing - the browser does.

> just the HTML source is malformed

What does this mean, "malformed"?

> This is my site: http://testforum.obeyon.com/

Sorry, I don't know what to do with it. Should I ask a question
in Russian there?

Jan Jaap
#4: Oct 27 '06

Hi!

After a lot of research, it comes down to the following: PHP has a bug:
it doesn't have support for Unicode! Support is expected with the
release of PHP 6.

Zend offers a partial solution for PHP5+ but it doesn't solve the HTML
source problem.

The actual problem is this: Unicode (UTF-8) documents contain something
called a BOM. The BOM is 3 characters at the top of the file which
tell viewers that the document is Unicode, and for UTF-16 it can also
tell the viewer which variant it is.

As you may know, PHP can't send headers once output has already been sent
to the browser. Since PHP does not recognize the BOM, it simply
prints these first 3 characters before the <?php, and sending headers
then fails.

A partial fix for this failure is to encode your PHP script as ANSI ->
UTF-8 w/o BOM.
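
To make this concrete: in UTF-8 the BOM is the byte sequence EF BB BF. The sketch below (Python, for illustration; the function name strip_utf8_bom and the sample script content are invented for this example) detects those bytes at the start of a file's content and removes them, which is roughly what saving "without BOM" achieves.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return `data` without a leading UTF-8 BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Made-up file content: a script saved by an editor that prepends a BOM.
with_bom = codecs.BOM_UTF8 + "<?php echo 'ë';".encode("utf-8")
without_bom = strip_utf8_bom(with_bom)

print(with_bom[:3].hex())      # the three BOM bytes
print(without_bom[:5])         # now starts directly with <?php
```

Anything that sits in the file before the opening <?php is sent as output, so a leading BOM counts as output that precedes the headers.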

It will then work fine, but still viewers of the document source will
not recognize the file as UTF-8!

If anyone has more information on this it would be very welcome!

Best Regards,
Jan Jaap Hakvoort

Pierre Goiffon
#5: Oct 27 '06

Jan Jaap wrote:
> PHP has a bug,
> it doesn't have support for Unicode! This is to be expected with the
> release of PHP6.
(...)
> The actual problem is this: Unicode (UTF-8) documents contain something
> called a BOM
(...)
> A partial fix for this failure, is to encode your PHP script in ANSI ->
> UTF-8 w/o BOM.

You're making some mistakes:

- Adding a BOM at the beginning of UTF-8 encoded data is absolutely not
mandatory.
You should read the very good Unicode.org FAQ, for example:
http://www.unicode.org/faq/utf_bom.html

- PHP can't deal with UTF-8 + BOM encoded files, though everything is OK
with UTF-8 encoded files without a BOM (lots of PHP users deal with
that every day - I do)

- "ANSI" is sometimes used as a nickname for the Windows-specific
encoding (windows-1252 in western Europe).
"ANSI" has nothing to do with Unicode!

Andreas Prilop
#6: Oct 27 '06

On 26 Oct 2006, Jan Jaap wrote:
> The actual problem is this: Unicode (UTF-8) documents contain something
> called a BOM,

No - they *might* contain a BOM but they need not. Use a BOM with
UTF-16 and UTF-32 only.

> A partial fix for this failure, is to encode your PHP script in ANSI ->

ANSI = American National Standards Institute http://www.ansi.org/

What do you mean, "encode in ANSI"?

> It will then work fine, but still viewers of the document source will
> not recognize the file as UTF-8!

Define the encoding of your documents in the HTTP header:
http://www.w3.org/International/O-HT...html#scripting
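
As a rough illustration of what "the encoding in the HTTP header" looks like from the receiving side, the following Python sketch pulls the charset parameter out of a Content-Type value using the standard library. The function name charset_from_content_type is made up for this example, and the header values are samples, not what any particular server sends.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(value: str) -> Optional[str]:
    """Return the lowercased charset parameter of a Content-Type value,
    or None if the header does not declare one."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_content_type("text/html"))
```

When the header carries no charset parameter, the client has to fall back on other hints or on guessing.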


Sample page:
http://www.unics.uni-hannover.de/nht...ilingual1.html
*no* BOM, *no* <meta charset>

Jan Jaap
#7: Nov 4 '06

> Sample page:
> http://www.unics.uni-hannover.de/nht...ilingual1.html
> *no* BOM, *no* <meta charset>

This page shows malformed characters in the source!

Will Google index these malformed characters?

Btw, my editor Notepad++ says "Encode as ANSI" and has some other options
like UTF-8 or ANSI + UTF-8 w/o BOM.

Best Regards,
Jan Jaap Hakvoort

Michael Winter
#8: Nov 5 '06

Jan Jaap wrote:
>> Sample page:
>> http://www.unics.uni-hannover.de/nht...ilingual1.html
>> *no* BOM, *no* <meta charset>
>
> This page shows malformed characters in the source!

Doubtful. Why would Andreas cite it?

I see no obvious problems.

[snip]

> Btw, my editor Notepad++ says "Encode as ANSI" ...

Well that would be your problem, then: the document is encoded using
UTF-8. If your editor is treating it as ASCII (which is what I presume
is meant by "ANSI"), then it's bound not to display it correctly.

[snip]

Mike

Andreas Prilop
#9: Nov 6 '06

On 4 Nov 2006, Jan Jaap wrote:
> This page shows malformed characters in the source!

What are "malformed characters"?

> Will google index these malformed characters?

http://google.com/search?q=cache:www...ilingual1.html

http://google.com/search?ie=ISO-8859...BB%22&filter=0

http://google.com/search?q=%22%E0%A4...A4%22&filter=0

> Btw, my editor Notepad++ says "Encode as ANSI" and some other options

Some people say the earth is flat.

Jan Jaap
#10: Nov 8 '06

What are "malformed characters"?

The special characters look like raw UTF-8 encoded characters, like Ã«
for ë etc.
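
This symptom can be reproduced in a few lines. The sketch below uses Python for illustration and assumes the viewer is decoding as windows-1252, which is a guess about the editor's setting rather than something confirmed here. The bytes are correct UTF-8 throughout; only the decoding step is wrong, and it is reversible.

```python
original = "ë"

# Correct UTF-8 bytes for the character: 0xC3 0xAB.
utf8_bytes = original.encode("utf-8")

# A viewer that assumes windows-1252 turns each byte into its own character.
misread = utf8_bytes.decode("cp1252")
print(misread)

# Nothing was lost: re-encoding and decoding properly recovers the text.
recovered = misread.encode("cp1252").decode("utf-8")
print(recovered)
```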

Andreas Prilop
#11: Nov 8 '06

On 7 Nov 2006, Jan Jaap wrote:
> What are "malformed characters"?

*I* asked this question - not you. Please quote properly!

> raw UTF-8 encoded characters,

What's that? I'm afraid you are mixing everything up. The character
"small a with diaeresis" (ä) remains "small a with diaeresis" (ä)
no matter how it is encoded. You may speak of "UTF-8-encoded
characters" and "ISO-8859-1-encoded characters"; but there is
no such thing as "raw UTF-8-encoded".

You probably confuse the concepts "byte" and "character".
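
The distinction can be shown with a short Python sketch (for illustration only; the variable names are arbitrary): one character, several possible byte sequences depending on the encoding, and the same character again after decoding each of them correctly.

```python
char = "ä"  # one character: U+00E4, small a with diaeresis

as_utf8 = char.encode("utf-8")          # two bytes
as_latin1 = char.encode("iso-8859-1")   # one byte
as_utf16le = char.encode("utf-16-le")   # two bytes, no BOM

print(len(char), "character")
print(as_utf8.hex(), as_latin1.hex(), as_utf16le.hex())

# Decoding each byte sequence with its own encoding gives the same character.
same = (
    as_utf8.decode("utf-8")
    == as_latin1.decode("iso-8859-1")
    == as_utf16le.decode("utf-16-le")
    == char
)
print(same)
```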

I suggest you go to a library and read (or even buy) the book

Unicode explained / Jukka K. Korpela. -
Sebastopol, CA : O'Reilly Media, 2006. -
ISBN 978-0-596-10121-3

to learn more about Unicode.
