Bytes IT Community
What is mb_internal_encoding() exactly?

Hi,

[Excuse me for a rather lengthy post. I will try to explain as well as I
can what I do and do not understand about multibyte encoding.]

Background: I am working on a multilanguage project now, so I decided to
switch to UTF-8 completely to avoid trouble with Unicode characters.

I hope somebody can review my approach and comment on it.
I am working on:
Server: Apache/2.2.3 (Debian) PHP/5.2.0-8+etch11
I am testing on FF2/FF3/IE7.
What I did so far:
Please point out anything that is wrong/vague/stupid. ;-)

1) Every page contains this header:
Content-Type: text/html; charset=UTF-8
and has the following doctype:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
(All HTML is checked against W3C validator, so far so good.)

2) My Database (Postgres8.1) is created using UTF-8 encoding.
(As I didn't overrule anything for any table or column, all my text-like
fields use UTF-8)

3) I do NOT specify any character encoding in a META tag.
(The W3C advises against it: they say the header takes precedence over
META tags, and using the META tag may confuse some clients.)

4) Whenever I need strlen($aString) or something similar, I use the
multibyte variant mb_strlen($aString,'UTF-8').

5) When I need to display a random string (from the database for
example), I use:
htmlspecialchars($someStrFromDB,ENT_QUOTES,'UTF-8');
If I must put a value in a text-element or textarea in a form, I use the
same.

6) I use ADODB5 as database abstraction layer. It has a built-in
qstr method that makes the passed string safe for use in SQL.

7) I get my multibyte characters from here for testing:
http://freenet-homepage.de/prilop/multilingual-1.html

So far, so good (as far as I can tell).
php.net says the following for mb_strlen:
int mb_strlen ( string $str [, string $encoding ] )
Parameters
str: The string being checked for length.
encoding : The encoding parameter is the character encoding. If it is
omitted, the internal character encoding value will be used.
--I do not understand what this 'internal character encoding value' is.

The page points to: mb_internal_encoding()
Which reads:
Set/Get the internal character encoding

Return Values: If encoding is set, then Returns TRUE on success or FALSE
on failure. If encoding is omitted, then the current character encoding
name is returned.
If I echo mb_internal_encoding() it says: ISO-8859-1
I wonder where PHP got that value from.

I tried saving my PHP file in UTF-8, but it stays on ISO-8859-1.

My main questions are:
1) What is this mb_internal_encoding exactly?
Is that something set during compilation?
Should I override it to UTF-8, or is passing the extra parameter to all
mb_* functions (set to UTF-8) good enough?

2) Should I put accept-charset="UTF-8" in all my forms, or is that set
implicitly by my header (which always contains: Content-Type: text/html;
charset=UTF-8)?

3) Is it wise to save all my PHP files in UTF-8?

I hope somebody can enlighten me a little on these issues. :-)
Thanks for your time!

Regards,
Erwin Moller
--
============================
Erwin Moller
Now dropping all postings from googlegroups.
Why? http://improve-usenet.org/
============================
Sep 17 '08 #1
4 Replies


I was also investigating this the other day. As for your question of
where PHP gets the internal encoding setting: it comes from the
[mbstring] section of the php.ini config. If the directives are
commented out, it seems to default to ISO-8859-1.
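
A quick way to check this on a given server is sketched below. It is only a sketch: it assumes the mbstring extension is loaded, and the helper name describe_mb_default is made up for illustration. Note that on PHP 5.6 and later the mbstring.internal_encoding directive is deprecated and the default is taken from default_charset (UTF-8 unless changed), so the ISO-8859-1 default seen in this thread belongs to older versions such as 5.2.

```php
<?php
// Hypothetical helper: report where the mbstring default ("internal")
// encoding may come from. Assumes the mbstring extension is loaded.
function describe_mb_default()
{
    return array(
        // The value the mb_* functions use when no encoding is passed.
        'effective' => mb_internal_encoding(),
        // php.ini directives that can influence it (false or '' if unset).
        'mbstring.internal_encoding' => ini_get('mbstring.internal_encoding'),
        'default_charset' => ini_get('default_charset'),
    );
}

print_r(describe_mb_default());
```

If 'effective' is not what you expect, the two ini values usually show which directive is responsible.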

Other than that, I'm just as curious as you. :-)

--
Curtis
Sep 17 '08 #2

AqD
On Sep 17, 5:58 pm, Erwin Moller
<Since_humans_read_this_I_am_spammed_too_m...@spam yourself.com> wrote:
> 1) Every page contains this header:
> Content-Type: text/html; charset=UTF-8
> and has the following doctype:
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
> "http://www.w3.org/TR/html4/strict.dtd">
> (All HTML is checked against W3C validator, so far so good.)

Yes.

> 2) My Database (Postgres8.1) is created using UTF-8 encoding.
> (As I didn't overrule anything for any table or column, all my text-like
> fields use UTF-8)

If you're using MySQL, be careful: you have to set the client encoding
for the connection. If you don't (a lot of 'unicode' projects don't),
MySQL treats your UTF-8 SQL statements as latin1 and converts them
wrongly inside the db.

To set the encoding, you need to call a function such as
mysqli_set_charset. It also affects the string escape method.
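
For MySQL, that step might look like the sketch below. The wrapper name use_utf8_connection is made up for illustration; mysqli_set_charset() (used here in its object form, $db->set_charset()) is the real call. Opening the connection is left to the caller, and none of this applies to Postgres.

```php
<?php
// Hypothetical helper: tell MySQL that this connection sends and expects
// UTF-8. Call it right after connecting, before any escaping or queries.
function use_utf8_connection(mysqli $db)
{
    // set_charset() also informs real_escape_string() about the encoding.
    // 'utf8' matches the thread; on current MySQL servers 'utf8mb4' is the
    // charset that covers all of UTF-8.
    if (!$db->set_charset('utf8')) {
        throw new RuntimeException('Could not set charset: ' . $db->error);
    }
    // Return what the connection now reports, for logging or checking.
    return $db->character_set_name();
}
```

Usage would be along the lines of creating the mysqli object with your own host and credentials and then calling use_utf8_connection($db) once.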

> 3) I do NOT specify any character encoding in a META tag.
> (The W3C advises against it: they say the header takes precedence over
> META tags, and using the META tag may confuse some clients.)

Some clients like IE4? ;) Basically all websites here (mis)use the
meta tag for the charset instead of setting the header. As long as the
encoding is ASCII-compatible (like UTF-8), it should be fine.

I stopped listening to their advice and reading their references a
long time ago. If you want something to work, it's better to test it
with real implementations (i.e. the browsers).

> 4) Whenever I need strlen($aString) or something similar, I use the
> multibyte variant mb_strlen($aString,'UTF-8').

Same for substrings and any other operation on string characters. But
there are performance issues, and I hope you won't run into them ;)
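
To make the difference concrete, here is a small sketch. It assumes the mbstring extension is available; the sample word is written with byte escapes so the result does not depend on how the file itself is saved.

```php
<?php
// "héllo" as UTF-8 bytes: the é takes two bytes (0xC3 0xA9).
$word = "h\xC3\xA9llo";

$byteLength = strlen($word);              // counts bytes
$charLength = mb_strlen($word, 'UTF-8');  // counts characters

$byteCut = substr($word, 0, 2);             // "h" plus only half of the é
$charCut = mb_substr($word, 0, 2, 'UTF-8'); // "h" plus the whole é

echo $byteLength, ' bytes, ', $charLength, " characters\n";
echo bin2hex($byteCut), ' vs ', bin2hex($charCut), "\n";
```

The byte-based cut leaves a broken sequence behind, which is the kind of damage the mb_* variants avoid.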

> 5) When I need to display a random string (from the database for
> example), I use:
> htmlspecialchars($someStrFromDB,ENT_QUOTES,'UTF-8');
> If I must put a value in a text-element or textarea in a form, I use the
> same.

Yes.
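
As a small illustration of point 5, with a made-up value standing in for the database string:

```php
<?php
// Made-up sample: "Zoë's <b>café</b> & bar" written as UTF-8 bytes.
$someStrFromDB = "Zo\xC3\xAB's <b>caf\xC3\xA9</b> & bar";

// ENT_QUOTES escapes both quote styles; 'UTF-8' tells PHP how to read
// the bytes, so the multibyte characters pass through untouched.
$safe = htmlspecialchars($someStrFromDB, ENT_QUOTES, 'UTF-8');

echo $safe, "\n";
```

Only the five HTML-special characters are replaced, so the same result can go into page text, a value attribute or a textarea.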

> 6) I use ADODB5 as database abstraction layer. It has a built-in
> qstr method that makes the passed string safe for use in SQL.

Safe only for the correct encoding. You need to set the encoding as
I wrote above. If ADODB doesn't provide a method to change the encoding,
you can run the query "SET NAMES utf8" after connecting; I'm not sure
how this works with the escape function, though.

> 7) I get my multibyte characters from here for testing:
> http://freenet-homepage.de/prilop/multilingual-1.html
>
> So far, so good (as far as I can tell).
>
> php.net says the following for mb_strlen:
> int mb_strlen ( string $str [, string $encoding ] )
> Parameters
> str: The string being checked for length.
> encoding : The encoding parameter is the character encoding. If it is
> omitted, the internal character encoding value will be used.
>
> --I do not understand what this 'internal character encoding value' is.
>
> The page points to: mb_internal_encoding()
> Which reads:
> Set/Get the internal character encoding

It's the default encoding for certain mbstring functions, not really
"internal". The mbstring extension (except for some regex functions)
can deal with strings in more than one encoding at the same time.

> Return Values: If encoding is set, then Returns TRUE on success or FALSE
> on failure. If encoding is omitted, then the current character encoding
> name is returned.
>
> If I echo mb_internal_encoding() it says: ISO-8859-1
> I wonder where PHP got that value from.
>
> I tried saving my PHP file in UTF-8, but it stays on ISO-8859-1.
>
> My main questions are:
> 1) What is this mb_internal_encoding exactly?
> Is that something set during compilation?
> Should I override it to UTF-8, or is passing the extra parameter to all
> mb_* functions (set to UTF-8) good enough?

It comes from php.ini.

You can also set it at the beginning of your code. Don't use the extra
parameter unless you want to deal with other encodings. As I said, some
regex functions don't have it, because they keep state between
calls and the encoding cannot change in between.
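
Setting it at the beginning of the code could look like the sketch below. It assumes mbstring is loaded; the mb_regex_encoding() call is guarded because the regex part of mbstring is not compiled in everywhere.

```php
<?php
// Run once, early in the request (for example in a common include file).
mb_internal_encoding('UTF-8');

// The mb_ereg* functions keep their own encoding setting.
if (function_exists('mb_regex_encoding')) {
    mb_regex_encoding('UTF-8');
}

// From here on the encoding argument can be left out.
$name = "Bj\xC3\xB6rk";       // "Björk" as UTF-8 bytes
$length = mb_strlen($name);   // counted in characters, not bytes
$upper = mb_strtoupper($name);

echo $length, ' ', $upper, "\n";
```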

> 2) Should I put accept-charset="UTF-8" in all my forms, or is that set
> implicitly by my header (which always contains: Content-Type: text/html;
> charset=UTF-8)?

No need.

> 3) Is it wise to save all my PHP files in UTF-8?

Yes, and do not save them with the UTF-8 signature (the byte order mark).
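
Such a signature is three bytes, EF BB BF, that some editors write at the very start of a UTF-8 file. In a PHP file they sit before the opening tag, so they are sent as output and can break later header() calls. A minimal check, with a function name made up for illustration:

```php
<?php
// Hypothetical helper: does this file content start with the UTF-8 signature?
function starts_with_utf8_bom($bytes)
{
    return strncmp($bytes, "\xEF\xBB\xBF", 3) === 0;
}

// Two made-up file contents, one saved with the signature and one without.
$withBom = "\xEF\xBB\xBF<?php echo 'hi';";
$withoutBom = "<?php echo 'hi';";

var_dump(starts_with_utf8_bom($withBom));
var_dump(starts_with_utf8_bom($withoutBom));
```

For a real file, pass in the result of file_get_contents() on its path.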
Sep 18 '08 #3

On Sep 18, 2:08 am, AqD <aquila.d...@gmail.com> wrote:
> On Sep 17, 5:58 pm, Erwin Moller wrote:
> > 3) I do NOT specify any character encoding in a META tag.
> > (The W3C advises against it: they say the header takes precedence over
> > META tags, and using the META tag may confuse some clients.)
>
> Some clients like IE4? ;) Basically all websites here (mis)use the
> meta tag for the charset instead of setting the header. As long as the
> encoding is ASCII-compatible (like UTF-8), it should be fine.
>
> I stopped listening to their advice and reading their references a
> long time ago. If you want something to work, it's better to test it
> with real implementations (i.e. the browsers).

I think the meta option is provided because in some environments you
don't have full control of the headers being generated (e.g. hosted
solutions). I could be wrong on this.

I don't know why a client would get confused if it got the character
encoding in both the header and a meta tag... perhaps if they were
different?

> > 6) I use ADODB5 as database abstraction layer. It has a built-in
> > qstr method that makes the passed string safe for use in SQL.
>
> Safe only for the correct encoding. You need to set the encoding as
> I wrote above. If ADODB doesn't provide a method to change the encoding,
> you can run the query "SET NAMES utf8" after connecting; I'm not sure
> how this works with the escape function, though.

mysql_real_escape_string takes into account the character encoding
the database is expecting; not sure about your DBAL, though.

> > --I do not understand what this 'internal character encoding value' is.
> > The page points to: mb_internal_encoding()
> > Which reads:
> > Set/Get the internal character encoding
>
> It's the default encoding for certain mbstring functions, not really
> "internal". The mbstring extension (except for some regex functions)
> can deal with strings in more than one encoding at the same time.

That's what I gathered. 'Internal encoding' is a bit misleading; I
tend to think of it more as a 'default' encoding. Many of the mb
functions take a character encoding as an optional parameter. If
you don't supply this parameter, the function assumes that the input
string is in the 'internal' (i.e. default) encoding.
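
The 'default' reading is easy to see by running the same call on the same bytes under two different settings. A sketch, assuming mbstring is loaded:

```php
<?php
$s = "\xC3\xA9"; // the single character "é", stored as two UTF-8 bytes

mb_internal_encoding('ISO-8859-1');
$asLatin1 = mb_strlen($s);   // no encoding given, so the default applies

mb_internal_encoding('UTF-8');
$asUtf8 = mb_strlen($s);     // same call, different default

// An explicit argument overrides whatever the default currently is.
$explicit = mb_strlen($s, 'ISO-8859-1');

echo "$asLatin1 $asUtf8 $explicit\n";
```

Under ISO-8859-1 every byte counts as one character, so the two-byte é is reported as length 2; under UTF-8 it is length 1.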

HTH

Taras
Sep 19 '08 #4

AqD
On Sep 19, 7:41 pm, Taras_96 <taras...@gmail.com> wrote:
> On Sep 18, 2:08 am, AqD <aquila.d...@gmail.com> wrote:
> > On Sep 17, 5:58 pm, Erwin Moller wrote:
> > > 3) I do NOT specify any character encoding in a META tag.
> > > (The W3C advises against it: they say the header takes precedence over
> > > META tags, and using the META tag may confuse some clients.)
> >
> > Some clients like IE4? ;) Basically all websites here (mis)use the
> > meta tag for the charset instead of setting the header. As long as the
> > encoding is ASCII-compatible (like UTF-8), it should be fine.
> >
> > I stopped listening to their advice and reading their references a
> > long time ago. If you want something to work, it's better to test it
> > with real implementations (i.e. the browsers).
>
> I think the meta option is provided because in some environments you
> don't have full control of the headers being generated (e.g. hosted
> solutions). I could be wrong on this.
>
> I don't know why a client would get confused if it got the character
> encoding in both the header and a meta tag... perhaps if they were
> different?

If they differ, the browser should use the encoding from the header (I
tested this before). But the meta tag only works with ASCII/ISO-8859-1
based encodings, not UCS-2 or UCS-4.

> > > 6) I use ADODB5 as database abstraction layer. It has a built-in
> > > qstr method that makes the passed string safe for use in SQL.
> >
> > Safe only for the correct encoding. You need to set the encoding as
> > I wrote above. If ADODB doesn't provide a method to change the encoding,
> > you can run the query "SET NAMES utf8" after connecting; I'm not sure
> > how this works with the escape function, though.
>
> mysql_real_escape_string takes into account the character encoding
> the database is expecting; not sure about your DBAL, though.

True, but most developers only set the database encoding, not the
connection encoding, which MySQL assumes to be latin1. So they end up
storing data in the wrong encoding in the database even though the text
on the web pages looks correct ;) The problem is still very "popular"
now; you can check the code of some open-source projects such as phpBB
and XOOPS.
Sep 22 '08 #5
