What is mb_internal_encoding() exactly?


Hi,

[Excuse me for a rather lengthy post. I try to explain as well as I can
what I do and do not understand about multibyte encoding.]

Background: I am working on a multilanguage project now, so I decided to
switch to UTF-8 completely to avoid trouble with Unicode characters.

I hope somebody can review my approach and comment on it.
I am working on:
Server: Apache/2.2.3 (Debian) PHP/5.2.0-8+etch11
I am testing on FF2/FF3/IE7.
What I did so far:
Please point out anything that is wrong/vague/stupid. ;-)

1) Every page contains this header:
Content-Type: text/html; charset=UTF-8
and has the following doctype:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
(All HTML is checked against W3C validator, so far so good.)

2) My Database (Postgres 8.1) is created using UTF-8 encoding.
(As I didn't override anything for any table or column, all my text-like
fields use UTF-8)

3) I do NOT specify any character encoding in a META-tag.
(The W3C advises against it: they say the header takes precedence over
META-tags, and using the META tag may confuse some clients)

4) Whenever I need strlen($aString) or something similar, I use the
multibyte variant mb_strlen($aString,'UTF-8').

5) When I need to display a random string (from the database for
example), I use:
htmlspecialchars($someStrFromDB,ENT_QUOTES,'UTF-8');
If I must put a value in a text-element or textarea in a form, I use the
same.

6) I use ADODB5 as database abstraction layer. It has a built-in
qstr-method that makes the passed string safe for use in SQL.

7) I get my multibyte characters from here for testing:
http://freenet-homepage.de/prilop/multilingual-1.html

So far, so good (as far as I can tell).
php.net says the following for mb_strlen:
int mb_strlen ( string $str [, string $encoding ] )
Parameters
str: The string being checked for length.
encoding : The encoding parameter is the character encoding. If it is
omitted, the internal character encoding value will be used.
--I do not understand what this 'internal character encoding value' is.

The page points to: mb_internal_encoding()
Which reads:
Set/Get the internal character encoding

Return Values: If encoding is set, then Returns TRUE on success or FALSE
on failure. If encoding is omitted, then the current character encoding
name is returned.
If I echo mb_internal_encoding() it says: ISO-8859-1
I wonder where PHP got that value from.

I tried saving my PHP file in UTF-8, but it stays on ISO-8859-1.

My main questions are:
1) What is this mb_internal_encoding exactly?
Is that something set during compilation?
Should I overwrite it to UTF-8, or is using the extra parameter in all
mb_* functions good enough (and set it to UTF-8)?

2) Should I put in all my forms accept-charset="UTF-8" or is that set
implicitly by my header (which always contains: Content-Type: text/html;
charset=UTF-8)?

3) Is it wise to save all my PHP files in UTF-8?

I hope somebody can enlighten me a little on these issues. :-)
Thanks for your time!

Regards,
Erwin Moller
--
============================
Erwin Moller
Now dropping all postings from googlegroups.
Why? http://improve-usenet.org/
============================
Sep 17 '08 #1
I was also investigating this the other day. As for your question of
where PHP gets the internal encoding setting: it comes from the
[mbstring] section of the php.ini config. If the directives are
commented out, it seems to default to ISO-8859-1.
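
To see which value is in effect, the setting can be read at runtime. This is a minimal sketch, assuming the mbstring extension is loaded. The directive name mbstring.internal_encoding is the one from the [mbstring] section of php.ini; newer PHP releases deprecate it in favour of default_charset, so the printed values depend on the installation.

```php
<?php
// Current default ("internal") encoding used by mb_* functions when the
// encoding argument is omitted. The PHP 5.2 setup in this thread reported
// ISO-8859-1; other installations may report something else.
$current = mb_internal_encoding();
echo "mb_internal_encoding(): ", $current, "\n";

// Raw php.ini value; an empty string or false means the directive is not set.
$iniValue = ini_get('mbstring.internal_encoding');
echo "mbstring.internal_encoding: ", var_export($iniValue, true), "\n";
```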

Other than that, I'm just as curious as you. :-)

--
Curtis
Sep 17 '08 #2
AqD
On Sep 17, 5:58 pm, Erwin Moller wrote:
> 1) Every page contains this header:
> Content-Type: text/html; charset=UTF-8
> and has the following doctype:
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
> "http://www.w3.org/TR/html4/strict.dtd">
> (All HTML is checked against W3C validator, so far so good.)

Yes.

> 2) My Database (Postgres 8.1) is created using UTF-8 encoding.
> (As I didn't override anything for any table or column, all my text-like
> fields use UTF-8)

If you're using MySQL, be careful: you have to set the client encoding
for the connection. If you don't (a lot of 'unicode' projects don't),
it will treat your UTF-8 SQL statements as latin1 and convert them
wrongly inside the db.

To set the encoding, you need to call a function such as
mysqli_set_charset. It also affects the string escape method.

> 3) I do NOT specify any character encoding in a META-tag.
> (The W3C advises against it: they say the header takes precedence over
> META-tags, and using the META tag may confuse some clients)

Some clients like IE4? ;) Basically all websites here (mis-)use the
meta tag for charset instead of setting the header. As long as the
encoding is ASCII-compatible (like UTF-8), it should be fine.

I stopped listening to their advice and reading their references a
long time ago. If you want something to work, it's better to test it with
real implementations (i.e. the browsers).

> 4) Whenever I need strlen($aString) or something similar, I use the
> multibyte variant mb_strlen($aString,'UTF-8').

Same for substrings and any other operations on string characters. But
there are performance issues and I hope you won't run into them ;)
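
The difference between the byte-based and the multibyte functions can be sketched as follows. This assumes the mbstring extension is available; the sample string is written with explicit byte escapes so the result does not depend on how the file itself is saved.

```php
<?php
// "naive" with a diaeresis on the i, as UTF-8 bytes (\xC3\xAF is one character).
$s = "na\xC3\xAFve";

$bytes = strlen($s);               // counts bytes
$chars = mb_strlen($s, 'UTF-8');   // counts characters

$cutBytes = substr($s, 0, 3);              // may cut a multibyte character in half
$cutChars = mb_substr($s, 0, 3, 'UTF-8');  // cuts on character boundaries

echo $bytes, " bytes, ", $chars, " characters\n";           // 6 bytes, 5 characters
echo bin2hex($cutBytes), " vs ", bin2hex($cutChars), "\n";  // 6e61c3 vs 6e61c3af
```

The byte-based substr() leaves a dangling \xC3, which is no longer valid UTF-8; mb_substr() keeps the two-byte character intact.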

> 5) When I need to display a random string (from the database for
> example), I use:
> htmlspecialchars($someStrFromDB,ENT_QUOTES,'UTF-8');
> If I must put a value in a text-element or textarea in a form, I use the
> same.

Yes.
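
A small sketch of what that call does. The sample values are made up for illustration; the second one shows that, with these flags, a byte sequence that is not valid UTF-8 makes htmlspecialchars() return an empty string, which is one way a wrong encoding somewhere in the chain shows up.

```php
<?php
// Hypothetical value as it might come from the database; \xC3\xA9 is "é" in UTF-8.
$fromDb = "<b>Caf\xC3\xA9</b> & 'quotes'";
$safe = htmlspecialchars($fromDb, ENT_QUOTES, 'UTF-8');
echo $safe, "\n";
// &lt;b&gt;Café&lt;/b&gt; &amp; &#039;quotes&#039;

// The same word in ISO-8859-1 (\xE9) is not valid UTF-8.
$invalid = "Caf\xE9";
$result = htmlspecialchars($invalid, ENT_QUOTES, 'UTF-8');
var_dump($result === '');  // bool(true)
```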

> 6) I use ADODB5 as database abstraction layer. It has a built-in
> qstr-method that makes the passed string safe for use in SQL.

Safe only for the correct encoding. You need to set the encoding like
I wrote above. If ADODB doesn't provide a method to change the encoding,
you can run the query "SET NAMES utf8" after connecting - I'm not sure
how this works with the escape function though.

> 7) I get my multibyte characters from here for testing:
> http://freenet-homepage.de/prilop/multilingual-1.html
>
> So far, so good (as far as I can tell).
>
> php.net says the following for mb_strlen:
> int mb_strlen ( string $str [, string $encoding ] )
> Parameters
> str: The string being checked for length.
> encoding : The encoding parameter is the character encoding. If it is
> omitted, the internal character encoding value will be used.
>
> --I do not understand what this 'internal character encoding value' is.
>
> The page points to: mb_internal_encoding()
> Which reads:
> Set/Get the internal character encoding

It's the default encoding for certain mbstring functions. Not
"internal". The mbstring extension (except for some regex functions)
can be used to deal with strings in more than one encoding at the
same time.

> Return Values: If encoding is set, then Returns TRUE on success or FALSE
> on failure. If encoding is omitted, then the current character encoding
> name is returned.
>
> If I echo mb_internal_encoding() it says: ISO-8859-1
> I wonder where PHP got that value from.
>
> I tried saving my PHP file in UTF-8, but it stays on ISO-8859-1.
>
> My main questions are:
> 1) What is this mb_internal_encoding exactly?
> Is that something set during compilation?
> Should I overwrite it to UTF-8, or is using the extra parameter in all
> mb_* functions good enough (and set it to UTF-8)?

It comes from php.ini.

You can also set it at the beginning of your code. Don't use the extra
parameter unless you want to deal with other encodings - as I said, some
regex functions don't have it, because they save state between
different calls and the encoding cannot change in between.
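
Setting the default once at the start of a script might look like the sketch below. It assumes mbstring is loaded; the mb_regex_encoding() call is guarded because the regex part of mbstring can be disabled at build time, and the sample string is only an illustration.

```php
<?php
// Set the default once, early in the request (for example in a bootstrap include).
mb_internal_encoding('UTF-8');

// The mb_ereg* family has its own setting and no per-call encoding argument.
if (function_exists('mb_regex_encoding')) {
    mb_regex_encoding('UTF-8');
}

// Three Japanese characters, written as UTF-8 bytes (3 bytes each).
$s = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";

$length = mb_strlen($s);        // no encoding argument: the default is used
$first  = mb_substr($s, 0, 1);  // first character, again using the default

echo $length, "\n";             // 3
echo bin2hex($first), "\n";     // e697a5
```

After this, the encoding argument only needs to be passed when a string is known to be in some other encoding.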

> 2) Should I put in all my forms accept-charset="UTF-8" or is that set
> implicitly by my header (which always contains: Content-Type: text/html;
> charset=UTF-8)?

No need.

> 3) Is it wise to save all my PHP files in UTF-8?

Yes, and do not save them with the UTF-8 signature (the byte order mark).
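
The UTF-8 signature is the three bytes EF BB BF at the very start of a file. In a PHP file they are sent to the browser before any code runs, which can trigger "headers already sent" problems. A minimal sketch for detecting and removing it from a string follows; strip_utf8_bom is a hypothetical helper name, not a built-in function.

```php
<?php
// Hypothetical helper: remove a leading UTF-8 signature (bytes EF BB BF) if present.
function strip_utf8_bom($text)
{
    if (substr($text, 0, 3) === "\xEF\xBB\xBF") {
        return (string) substr($text, 3);
    }
    return $text;
}

$withBom = "\xEF\xBB\xBFhello";
echo bin2hex(substr($withBom, 0, 3)), "\n";  // efbbbf
echo strip_utf8_bom($withBom), "\n";         // hello
echo strip_utf8_bom('hello'), "\n";          // hello (unchanged)
```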
Sep 18 '08 #3
On Sep 18, 2:08 am, AqD wrote:
> On Sep 17, 5:58 pm, Erwin Moller wrote:
>
>> 3) I do NOT specify any character encoding in a META-tag.
>> (The W3C advises against it: they say the header takes precedence over
>> META-tags, and using the META tag may confuse some clients)
>
> Some clients like IE4? ;) Basically all websites here (mis-)use the
> meta tag for charset instead of setting the header. As long as the
> encoding is ASCII-compatible (like UTF-8), it should be fine.
>
> I stopped listening to their advice and reading their references a
> long time ago. If you want something to work, it's better to test it with
> real implementations (i.e. the browsers).

I think the meta option is provided because in some environments you
don't have full control of the headers being generated (e.g. hosted
solutions). I could be wrong on this.

I don't know why a client would get confused if it got the character
encoding in both the header and a meta tag... perhaps if they were
different?

>> 6) I use ADODB5 as database abstraction layer. It has a built-in
>> qstr-method that makes the passed string safe for use in SQL.
>
> Safe only for the correct encoding. You need to set the encoding like
> I wrote above. If ADODB doesn't provide a method to change the encoding,
> you can run the query "SET NAMES utf8" after connecting - I'm not sure
> how this works with the escape function though.

mysql_real_escape_string takes into account the character encoding
the database is expecting... not sure about your DBAL though.

>> --I do not understand what this 'internal character encoding value' is.
>> The page points to: mb_internal_encoding()
>> Which reads:
>> Set/Get the internal character encoding
>
> It's the default encoding for certain mbstring functions. Not
> "internal". The mbstring extension (except for some regex functions)
> can be used to deal with strings in more than one encoding at the
> same time.

That's what I gathered. 'Internal encoding' is a bit misleading; I
tend to think of it more as a 'default' encoding. Many of the mb
functions take a character encoding as an optional parameter; if
you don't supply this parameter, they assume that the encoding
of the input string is the 'internal' (i.e. default) one.
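
The effect of that default can be sketched with a single two-byte string measured under different assumptions. This assumes mbstring is loaded; the sample string is illustrative only.

```php
<?php
$s = "\xC3\xA9";  // "é" encoded as UTF-8 (two bytes)

// Explicit encoding argument: the default is ignored.
$asUtf8   = mb_strlen($s, 'UTF-8');       // 1
$asLatin1 = mb_strlen($s, 'ISO-8859-1');  // 2

// No argument: the 'internal' (default) encoding decides.
mb_internal_encoding('ISO-8859-1');
$defaultLatin1 = mb_strlen($s);           // 2

mb_internal_encoding('UTF-8');
$defaultUtf8 = mb_strlen($s);             // 1

echo "$asUtf8 $asLatin1 $defaultLatin1 $defaultUtf8\n";  // 1 2 2 1
```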

HTH

Taras
Sep 19 '08 #4
AqD
On Sep 19, 7:41 pm, Taras_96 wrote:
> On Sep 18, 2:08 am, AqD wrote:
>> On Sep 17, 5:58 pm, Erwin Moller wrote:
>>> 3) I do NOT specify any character encoding in a META-tag.
>>> (The W3C advises against it: they say the header takes precedence over
>>> META-tags, and using the META tag may confuse some clients)
>>
>> Some clients like IE4? ;) Basically all websites here (mis-)use the
>> meta tag for charset instead of setting the header. As long as the
>> encoding is ASCII-compatible (like UTF-8), it should be fine.
>>
>> I stopped listening to their advice and reading their references a
>> long time ago. If you want something to work, it's better to test it with
>> real implementations (i.e. the browsers).
>
> I think the meta option is provided because in some environments you
> don't have full control of the headers being generated (e.g. hosted
> solutions). I could be wrong on this.
>
> I don't know why a client would get confused if it got the character
> encoding in both the header and a meta tag... perhaps if they were
> different?

If they are different, the browser should use the encoding from the header (I
tested this before). But the meta tag only works with ASCII/ISO-8859-1
based encodings, not UCS-2 or UCS-4.

>>> 6) I use ADODB5 as database abstraction layer. It has a built-in
>>> qstr-method that makes the passed string safe for use in SQL.
>>
>> Safe only for the correct encoding. You need to set the encoding like
>> I wrote above. If ADODB doesn't provide a method to change the encoding,
>> you can run the query "SET NAMES utf8" after connecting - I'm not sure
>> how this works with the escape function though.
>
> mysql_real_escape_string takes into account the character encoding
> the database is expecting... not sure about your DBAL though.

True, but most developers only set the database encoding, not the connection
encoding, which MySQL assumes to be latin1, so they end up
storing data in the wrong encoding in the database even though the text on
the web pages looks correct ;) The problem is still very "popular" now - you
can check the code of some open-source projects such as phpBB and
XOOPS.
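
The effect described here can be simulated without a database. In this sketch mb_convert_encoding() stands in for the conversion the server would perform when it believes the incoming bytes are latin1; treating ISO-8859-1 as equivalent to the server's latin1 is a simplifying assumption, and no MySQL connection is involved.

```php
<?php
$original = "\xC3\xA9";  // "é" as UTF-8, sent by the application

// Server side: the bytes are taken to be latin1 and converted to UTF-8 for storage.
$stored = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');
echo bin2hex($stored), "\n";    // c383c2a9 (double-encoded, 2 characters)

// Reading back over the same latin1 connection reverses the conversion,
// which is why the web page still looks correct.
$readBack = mb_convert_encoding($stored, 'ISO-8859-1', 'UTF-8');
echo bin2hex($readBack), "\n";  // c3a9

var_dump($readBack === $original);  // bool(true)
```

The stored value is wrong for any other client that connects with the proper encoding, and for sorting or length checks done inside the database.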
Sep 22 '08 #5
