php + mysql multilingual support

P: n/a
We have a CMS written in PHP and MySQL. Recently we received
a request to support multiple languages so that sites in a particular
language can be created. I did some searching on Google and it seems I have
to build multibyte support into PHP and MySQL. Mbstring
(http://us3.php.net/mbstring) claims to support multiple languages, with a
caution saying it might not work properly.

After further research it seems Unicode might be the way to go, since
Unicode can represent all characters (in all languages) with integers,
which in turn can be handled in PHP as it has excellent integer support. But
again, since all the data is stored in MySQL, we need Unicode support in MySQL
too, and it has 2 formats (http://www.mysql.com/doc/en/Charset-Unicode.html):
UCS-2 (for storing data) and UTF-8 (for encoding). Here is where I need
help. Do I opt for UCS-2 or go ahead with UTF-8? What are the advantages and
disadvantages of both?

Now back to our CMS: can we make changes so that this new support is
transparent to the code (that doesn't sound right)? Any suggestions on how I
can minimize the amount of rework we have to do on the code to accommodate
Unicode? Are there any other suggestions on how to approach this
transformation?

--Turi
Jul 17 '05 #1
4 Replies


P: n/a
I would recommend storing the Unicode text as UTF-8 as well, as PHP doesn't
have a function that converts 16-bit Unicode to UTF-8. Also, MySQL can store
UTF-8 as text, while UCS-2/UTF-16 would have to be stored as binary.
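
To make the storage trade-off concrete, here is a small sketch comparing how many bytes the same text needs in UTF-8 and in UTF-16 (which coincides with UCS-2 for characters in the Basic Multilingual Plane). It is written in Python purely as a language-neutral illustration, not PHP, and the helper names `encoded_sizes` and `utf16_to_utf8` as well as the sample strings are made up for this example.

```python
def encoded_sizes(text):
    """Return the number of bytes `text` occupies in UTF-8 and in UTF-16."""
    return {
        "utf-8": len(text.encode("utf-8")),
        # "utf-16-be" is used so that no 2-byte byte-order mark is prepended.
        "utf-16": len(text.encode("utf-16-be")),
    }


def utf16_to_utf8(raw):
    """Convert big-endian UTF-16 bytes to UTF-8 bytes (illustrative helper)."""
    return raw.decode("utf-16-be").encode("utf-8")


if __name__ == "__main__":
    # Hypothetical sample strings: ASCII, Chinese, Hindi.
    for sample in ("hello", "你好", "नमस्ते"):
        print(sample, encoded_sizes(sample))
```

ASCII text takes one byte per character in UTF-8 and two in UTF-16, while common Chinese characters take three bytes in UTF-8 and two in UTF-16, so which encoding is more compact depends on the mix of languages stored.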

As long as your application doesn't perform much text analysis, it would
handle the UTF-8 text more or less transparently. One thing to look out for
is database field lengths. Because UTF-8 is a variable-length encoding, you
can't rely on the maxlength attribute in your <input> tags to limit the
length of user input in bytes. A Hindi or Chinese character, for instance,
takes up 3 bytes in UTF-8. Any text truncation code would have to take that
into account. Bad things happen when you cut off half a character at the end
of a string: the leftover bytes are not valid UTF-8.
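
As a rough sketch of truncation that respects character boundaries, the function below cuts a string to a byte budget and drops any incomplete trailing character. It is Python used for illustration only, and `truncate_utf8` is a hypothetical helper name, not part of any library. It assumes the input is already valid text; it does not keep combining marks together with their base characters.

```python
def truncate_utf8(text, max_bytes):
    """Truncate `text` so its UTF-8 encoding fits in `max_bytes` bytes.

    The encoded bytes are cut at the limit, then decoded with
    errors="ignore", which discards a partial multibyte sequence
    left at the end instead of keeping half a character.
    """
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


if __name__ == "__main__":
    # Each of these two Chinese characters needs 3 bytes, so a 4-byte
    # budget only has room for the first one.
    print(truncate_utf8("你好", 4))
```

A column sized in bytes would then be fed the truncated value, so the stored data never ends in a broken sequence.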

Some languages present special challenges as well. In Chinese there are no
spaces between words, for example. The Arabic script displays from right to
left. And Hindi will only work in Internet Explorer on Windows. Non-Latin
scripts also tend to need to be bigger in order to be legible.

In reply to "Aditya Ivaturi" <ai******@aijalon.net>, message
news:qx*********************@twister.rdc-kc.rr.com

Jul 17 '05 #2

P: n/a
Thanks for your lucid explanation, Chung. As you mentioned "As long as your
application doesn't perform much text analysis...", our application uses
regular expressions and text manipulation heavily. There are arguments for
and against it, but at this point I don't think redoing everything is an
option either. But for the most part there is one hope. Our application evolved
over time and it was not very long ago that we introduced a new DB layer.
Even though efforts were made to change the code to use this new DB layer, I am
afraid there still might be instances where we use the PHP MySQL functions
to access the DB directly.

The obvious option might be to make sure all DB access is done via the new
layer and to introduce code there to handle proper conversion to UTF-8. But
what other options do you feel might be alternatives? Personally, I don't see
a whole lot of room to wiggle around here. Any suggestions are welcome.
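
To illustrate what conversion at the DB layer could look like, here is a minimal sketch in Python (not PHP, and not our actual code). The helper `to_utf8` is hypothetical, and it assumes that any existing data that is not already valid UTF-8 is ISO-8859-1. That assumption would have to be checked against the real data, since some legacy byte sequences also happen to be valid UTF-8.

```python
def to_utf8(raw, legacy_encoding="iso-8859-1"):
    """Return `raw` as UTF-8 bytes.

    Bytes that already decode as UTF-8 are passed through unchanged;
    anything else is assumed to be in `legacy_encoding` (an assumption
    for this sketch) and is re-encoded as UTF-8.
    """
    try:
        raw.decode("utf-8")  # strict decoding: raises on invalid UTF-8
        return raw
    except UnicodeDecodeError:
        return raw.decode(legacy_encoding).encode("utf-8")


if __name__ == "__main__":
    legacy_value = b"caf\xe9"  # "café" encoded as ISO-8859-1
    print(to_utf8(legacy_value))
```

Putting a check like this in one place only helps if every read and write really goes through the layer, which is why the stray direct calls matter.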

--Turi

In reply to "Chung Leong" <ch***********@hotmail.com>, message
news:Gb********************@comcast.com

Jul 17 '05 #3

P: n/a
Newer versions of PHP support regular expression matching on UTF-8 strings,
although it's still far from being useful. There's no support for character
classes, for instance. Using UTF-16 doesn't help much either in that regard.
You might actually want to stick with using just ISO character sets.
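
The underlying problem is that a byte-oriented regex engine and a character-oriented one see the same UTF-8 data differently. The sketch below shows the difference using Python's `re` module, purely as an illustration; it says nothing about what a given PHP version does, and the function names and sample string are made up for this example.

```python
import re


def words_bytewise(data):
    """Find \\w+ runs in raw UTF-8 bytes; only ASCII word characters match."""
    return re.findall(rb"\w+", data)


def words_unicode(text):
    """Find \\w+ runs in decoded text; letters of any script match."""
    return re.findall(r"\w+", text)


def dot_matches_bytewise(data):
    """Count how many times '.' matches when the engine works on bytes."""
    return len(re.findall(rb".", data))


if __name__ == "__main__":
    sample = "日本語 text"  # hypothetical mixed Japanese/English input
    print(words_bytewise(sample.encode("utf-8")))
    print(words_unicode(sample))
    print(dot_matches_bytewise("語".encode("utf-8")))
```

On bytes, the Japanese word is invisible to `\w` and a single character counts as three matches for `.`, which is the kind of breakage to expect from text-manipulation code that was written with single-byte character sets in mind.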

The big question, of course, is which languages do you want to support.
Adding support for languages like Chinese or Arabic is definitely not
trivial.


Jul 17 '05 #4

P: n/a
Our main goal is to add support for Chinese and Japanese. So judging by your
response it seems like a challenging project.

--Turi


Jul 17 '05 #5
