php + mysql multilingual support

We have a CMS that is based on PHP and MySQL. Recently we received
a request to support multiple languages so that sites in a particular
language can be created. I searched on Google and it seems I have
to build in multibyte support for PHP and MySQL. Mbstring
(http://us3.php.net/mbstring) claims to support multiple languages, with a
caution saying it might not work properly.

After further research it seems Unicode might be the way to go, since
Unicode can represent all characters (in all languages) as integers,
which in turn can be handled in PHP as it has excellent integer support. But
since all the data is stored in MySQL, we need Unicode support in MySQL
too, and it has 2 formats (http://www.mysql.com/doc/en/Charset-Unicode.html):
UCS-2 (for storing data) and UTF-8 (for encoding). Here is where I need
help. Do I opt for UCS-2 or go ahead with UTF-8? What are the advantages and
disadvantages of each?

Now back to our CMS: can we make changes so that this new support is
transparent to the code? (That doesn't sound right.) Any suggestions on how I
can minimize the amount of rework we have to do on the code to accommodate
Unicode? Are there any other suggestions on how to approach this
transformation?

--Turi
Jul 17 '05 #1
I would recommend storing the Unicode text as UTF-8 as well, as PHP doesn't
have a core function that converts 16-bit Unicode to UTF-8 (you would need an
extension such as iconv or mbstring for that). Also, MySQL can store
UTF-8 as text, while UCS-2/UTF-16 would have to be stored as binary.
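
To make the storage trade-off concrete, here is a small sketch that compares how many bytes the same text needs in UTF-8 and in UTF-16. It is written in Python purely for illustration (the thread's CMS is PHP), and the helper name `encoded_sizes` and the sample strings are assumptions, not anything from the CMS.

```python
# Compare the storage cost of the same text in UTF-8 and UTF-16.
# "utf-16-le" is used so that no byte-order mark is added to the count.

def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for the given string."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

samples = {
    "English": "hello",
    "Chinese": "你好",
    "Japanese": "こんにちは",
}

for label, text in samples.items():
    utf8_len, utf16_len = encoded_sizes(text)
    print(f"{label}: {len(text)} chars, {utf8_len} bytes UTF-8, {utf16_len} bytes UTF-16")

# Converting between the two encodings is a decode followed by an encode.
utf16_bytes = "你好".encode("utf-16-le")
utf8_bytes = utf16_bytes.decode("utf-16-le").encode("utf-8")
```

ASCII text is half the size in UTF-8, while Chinese and Japanese text is somewhat larger in UTF-8 than in UTF-16, so the choice depends on the mix of content as well as on what the surrounding tools can handle as text.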

As long as your application doesn't perform much text analysis, it would
handle the UTF-8 text more or less transparently. One thing to look out for
is database field lengths. Because UTF-8 is a variable-length encoding, you
can't rely on the maxlength attribute in your <input> tags to limit the
number of bytes of user input. A Hindi or Chinese character, for instance,
typically takes up 3 bytes in UTF-8. Any text truncation code would have to
take that into account. Bad things happen when you drop off half a character
at the end of a string.
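
As an illustration of truncation that respects character boundaries, here is a minimal sketch in Python (not PHP, and not code from the CMS). The function name `truncate_utf8` is hypothetical, and it assumes `max_bytes` is a non-negative byte limit such as a database column size.

```python
def truncate_utf8(text, max_bytes):
    """Shorten text so its UTF-8 encoding fits in max_bytes.

    The cut is made on the encoded bytes, then any incomplete multibyte
    sequence left at the end is discarded, so no half character remains.
    """
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # errors="ignore" drops only the broken trailing sequence here,
    # because everything before the cut is still valid UTF-8.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


# Four Chinese characters are 12 bytes; a 7-byte limit keeps two whole ones.
print(truncate_utf8("你好世界", 7))
```

In PHP, the mbstring function `mb_strcut()` serves a similar purpose, provided that extension is available.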

Some languages present special challenges as well. In Chinese there's no
space between words, for example. The Arabic script displays from right to
left. And Hindi will only work in Internet Explorer on Windows. Non-Latin
scripts also tend to need to be bigger in order to be legible.

Jul 17 '05 #2
Thanks for your lucid explanation, Chung. As you mentioned "As long as your
application doesn't perform much text analysis...", our application uses
regular expressions and text manipulation heavily. There are arguments for
and against it, but at this point I don't think redoing everything is an
option either. But for the most part there is one hope. Our application evolved
over time and it was not very long ago that we introduced a new DB layer.
Even though efforts were made to change code to use this new DB layer, I am
afraid there still might be instances where we use the PHP MySQL functions
directly to access the DB.

The obvious option might be to make sure DB access is done via the new layer
and to introduce code there to handle proper conversion to UTF-8. But what
other options do you feel might be an alternative? Personally, I don't see a
whole lot of room to wiggle here. Any suggestions are welcome.
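
One way to picture the "convert at the DB layer" idea is a single function that every read and write passes through. The sketch below is in Python for illustration only; the name `normalize_to_utf8` and the assumption that legacy data is ISO-8859-1 are both hypothetical and would need checking against the real data.

```python
def normalize_to_utf8(raw, legacy_encoding="iso-8859-1"):
    """Return raw as UTF-8 bytes.

    Bytes that already decode as UTF-8 are passed through unchanged;
    anything else is assumed to be in the legacy encoding and converted.
    """
    try:
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        return raw.decode(legacy_encoding).encode("utf-8")


# A legacy Latin-1 value and an already-converted UTF-8 value.
legacy_value = "café".encode("iso-8859-1")
modern_value = "你好".encode("utf-8")
print(normalize_to_utf8(legacy_value), normalize_to_utf8(modern_value))
```

The detection step is a heuristic: some legacy byte sequences happen to be valid UTF-8, so a one-time migration with a known source encoding is safer than guessing on every access.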

--Turi

Jul 17 '05 #3
Newer versions of PHP support regular expression matching on UTF-8 strings,
although it's still far from being useful. There's no support for character
classes, for instance. Using UTF-16 doesn't help much either in that regard.
You might actually want to stick with using just ISO character sets.

The big question, of course, is which languages you want to support.
Adding support for languages like Chinese or Arabic is definitely not
trivial.
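
To show why byte-oriented regular expressions misbehave on UTF-8 text, here is a short Python sketch (an illustration, not PHP code). It runs the same patterns once against raw bytes and once against decoded text; the sample string is an assumption.

```python
import re

text = "日本語 text"
raw = text.encode("utf-8")

# Byte-oriented matching: "." matches a single byte, so each of the
# three-byte characters is seen as three separate units.
byte_matches = re.findall(rb".", raw)

# Character-oriented matching: "." matches one code point.
char_matches = re.findall(r".", text)

# Word matching on bytes only recognises ASCII letters, so the
# Japanese word is skipped entirely.
byte_words = re.findall(rb"\w+", raw)
char_words = re.findall(r"\w+", text)

print(len(byte_matches), len(char_matches))
print(byte_words, char_words)
```

In PHP's `preg_*` functions the comparable switch is the `u` pattern modifier, which tells PCRE to treat the pattern and subject as UTF-8; how well character classes work with it depends on the PHP and PCRE versions in use.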

Jul 17 '05 #4
Our main goal is to add support for Chinese and Japanese. So judging by your
response it seems like a challenging project.

--Turi

Jul 17 '05 #5
