Hi,
[Excuse me for a rather lengthy post. I try to explain as well as I can
what I do and do not understand about multibyte encoding.]
Background: I am working on a multilanguage project now, so I decided to
switch to UTF-8 completely to avoid trouble with Unicode characters.
I hope somebody can review my approach and comment on it.
I am working on:
Server: Apache/2.2.3 (Debian) PHP/5.2.0-8+etch11
I am testing on FF2/FF3/IE7.
What I did so far:
Please point out anything that is wrong/vague/stupid. ;-)
1) Every page contains this header:
Content-Type: text/html; charset=UTF-8
and has the following doctype:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
(All HTML is checked against W3C validator, so far so good.)
2) My database (Postgres 8.1) is created using UTF-8 encoding.
(As I didn't override anything for any table or column, all my text-like
fields use UTF-8.)
3) I do NOT specify any character encoding in a META-tag.
(Discouraged by the W3C: they say the header takes precedence over
META tags, and using the META tag may confuse some clients.)
4) Whenever I need strlen($aString) or something similar, I use the
multibyte variant mb_strlen($aString,'UTF-8').
5) When I need to display an arbitrary string (from the database for
example), I use:
htmlspecialchars($someStrFromDB,ENT_QUOTES,'UTF-8');
If I must put a value in a text-element or textarea in a form, I use the
same.
6) I use ADODB5 as database abstraction layer. It has a built-in
qstr method that makes the passed string safe for use in SQL.
7) I get my multibyte characters from here for testing:
http://freenet-homepage.de/prilop/multilingual-1.html
So far, so good (as far as I can tell).
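To make points 4 and 5 concrete, here is a minimal sketch of what I mean. It assumes the mbstring extension is loaded; the value of $someStrFromDB is a made-up test string, not real data from my database.

```php
<?php
// Hypothetical value standing in for a string fetched from the database.
// "\xC3\xA9" is the UTF-8 encoding of "é": two bytes, one character.
$someStrFromDB = "caf\xC3\xA9 <b>\"menu\"</b>";

// strlen() counts bytes; mb_strlen() with an explicit encoding counts characters.
$byteLength = strlen($someStrFromDB);
$charLength = mb_strlen($someStrFromDB, 'UTF-8');

// Escape the string for output in HTML text or in an attribute value.
$safe = htmlspecialchars($someStrFromDB, ENT_QUOTES, 'UTF-8');

echo $byteLength, "\n";
echo $charLength, "\n";
echo $safe, "\n";
```

The byte count comes out one higher than the character count here, because the "é" takes two bytes in UTF-8.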
php.net says the following for mb_strlen:
int mb_strlen ( string $str [, string $encoding ] )
Parameters
str: The string being checked for length.
encoding: The encoding parameter is the character encoding. If it is
omitted, the internal character encoding value will be used.
I do not understand what this 'internal character encoding value' is.
The page points to: mb_internal_encoding()
Which reads:
Set/Get the internal character encoding
Return Values: If encoding is set, then Returns TRUE on success or FALSE
on failure. If encoding is omitted, then the current character encoding
name is returned.
If I echo mb_internal_encoding() it says: ISO-8859-1
I wonder where PHP got that value from.
I tried saving my PHP file in UTF-8, but it stays on ISO-8859-1.
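The two options I am weighing can be sketched like this. It is only a minimal illustration, assuming the mbstring extension is loaded; $aString is a made-up test value.

```php
<?php
// Remember the current internal encoding (ISO-8859-1 on my server).
$previous = mb_internal_encoding();

// "\xC3\xA9" is the UTF-8 encoding of "é": two bytes, one character.
$aString = "\xC3\xA9";

// Option A: pass the encoding explicitly on every call.
// This does not depend on the internal encoding.
$explicit = mb_strlen($aString, 'UTF-8');

// Option B: set the internal encoding once, then omit the parameter.
$ok = mb_internal_encoding('UTF-8');
$implicit = mb_strlen($aString);

var_dump($previous, $ok, $explicit, $implicit);
```

As far as I can tell, both options give the same character count once the internal encoding is UTF-8. The difference is that option A keeps working if some other code changes the internal encoding later.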
My main questions are:
1) What is this mb_internal_encoding exactly?
Is that something set during compilation?
Should I override it to UTF-8, or is using the extra parameter in all
mb_* functions good enough (and set it to UTF-8)?
2) Should I put accept-charset="UTF-8" in all my forms, or is that set
implicitly by my header (which always contains: Content-Type: text/html;
charset=UTF-8)?
3) Is it wise to save all my PHP files in UTF-8?
I hope somebody can enlighten me a little on these issues. :-)
Thanks for your time!
Regards,
Erwin Moller