
php + mysql multilingual support

Aditya Ivaturi
#1: Jul 17 '05
We have a CMS based on PHP and MySQL. Recently we received a request to
support multiple languages so that sites in a particular language can be
created. I did some searching on Google and it seems I have to build
multibyte support into PHP and MySQL. Mbstring
(http://us3.php.net/mbstring) claims to support multiple languages, with a
caution saying it might not work properly.

After further research it seems Unicode might be the way to go, since
Unicode can represent all characters (in all languages) as integers, which
PHP can handle as it has excellent integer support. But since all the data
is stored in MySQL, we need Unicode support in MySQL too, and it has two
formats (http://www.mysql.com/doc/en/Charset-Unicode.html): UCS-2 (for
storing data) and UTF-8 (for encoding). Here is where I need help. Do I opt
for UCS-2 or go ahead with UTF-8? What are the advantages and disadvantages
of each?

Now back to our CMS: can we make changes so that this new support is
transparent to the code (that doesn't sound right)? Any suggestions on how I
can minimize the amount of rework we have to do on the code to accommodate
Unicode? Are there any other suggestions on how to approach this
transformation?

--Turi


Chung Leong
#2: Jul 17 '05


I would recommend storing the Unicode text as UTF-8, as PHP doesn't have a
built-in function that converts 16-bit Unicode to UTF-8 (you would need an
extension such as mbstring or iconv for that). Also, MySQL can store UTF-8
as text, while UCS-2/UTF-16 would have to be stored as binary.
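
To make the size trade-off concrete, here is a minimal sketch in Python
(used only for illustration, since the thread has no code of its own). The
helper names `encoded_sizes` and `utf16_to_utf8` are made up for this
example. It compares byte counts for the same text in UTF-8 and UTF-16, and
shows a 16-bit-to-UTF-8 conversion as a decode followed by an encode.

```python
# Sample strings; the choice of text is arbitrary.
samples = {
    "English": "hello",
    "Chinese": "你好",
    "Japanese": "こんにちは",
}


def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for the given text.

    UTF-16 is encoded big-endian without a byte order mark so the count
    reflects only the characters themselves.
    """
    return len(text.encode("utf-8")), len(text.encode("utf-16-be"))


def utf16_to_utf8(raw):
    """Convert big-endian UTF-16 bytes to UTF-8 bytes."""
    return raw.decode("utf-16-be").encode("utf-8")


for label, text in samples.items():
    utf8_len, utf16_len = encoded_sizes(text)
    print(f"{label}: {len(text)} chars, "
          f"{utf8_len} bytes as UTF-8, {utf16_len} bytes as UTF-16")
```

ASCII text is half the size in UTF-8, while CJK text is roughly 50% larger
than in UTF-16. In PHP the conversion itself would go through an extension
function such as `mb_convert_encoding` rather than anything in the core.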

As long as your application doesn't perform much text analysis, it would
handle the UTF-8 text more or less transparently. One thing to look out for
is database field lengths. Because UTF-8 is a variable-length encoding, you
can't rely on the maxlength attribute in your <input> tags to limit the
length of user input: it counts characters, not bytes. A Hindi or Chinese
character, for instance, typically takes up 3 bytes in UTF-8. Any text
truncation code would have to take that into account. Bad things happen when
you drop off half a character at the end of a string.
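
A byte-limited truncation that never splits a character could look roughly
like the following Python sketch. The function name `truncate_utf8` is
hypothetical. It cuts the encoded bytes at the limit and then discards any
incomplete multi-byte sequence left at the end.

```python
def truncate_utf8(text, max_bytes):
    """Return the longest prefix of text whose UTF-8 form fits in max_bytes."""
    if max_bytes <= 0:
        return ""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # Slicing may cut through a multi-byte character; errors="ignore"
    # drops the dangling partial sequence instead of raising.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


print(truncate_utf8("你好世界", 7))
```

Note that this only respects code point boundaries. It can still separate a
base letter from a following combining mark, which matters for scripts such
as Devanagari. In PHP, `mb_strcut` from the mbstring extension does a
comparable byte-based cut on character boundaries.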

Some languages present special challenges as well. In Chinese there's no
space between words, for example. The Arabic script displays from right to
left. And Hindi will only work in Internet Explorer on Windows. Non-Latin
scripts also tend to need to be bigger in order to be legible.



Aditya Ivaturi
#3: Jul 17 '05


Thanks for your lucid explanation, Chung. You mentioned "As long as your
application doesn't perform much text analysis...", but our application uses
regular expressions and text manipulation heavily. There are arguments for
and against that, but at this point I don't think redoing everything is an
option either. For the most part, though, there is one hope. Our application
evolved over time, and it was not very long ago that we introduced a new DB
layer. Even though efforts were made to change the code to use this new DB
layer, I am afraid there might still be places where we call the PHP mysql
functions directly to access the DB.

The obvious option might be to make sure all DB access goes through the new
layer and to add code there that handles proper conversion to UTF-8. But
what other options do you think there might be? Personally, I don't see a
whole lot of wiggle room here. Any suggestions are welcome.
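
As a rough illustration of what the conversion step in such a layer might
look like, here is a Python sketch. The function name `to_utf8` and the
assumption that legacy data is ISO-8859-1 are hypothetical and not taken
from our actual code. It passes through bytes that are already valid UTF-8
and re-encodes everything else from the legacy character set.

```python
def to_utf8(raw, legacy_encoding="iso-8859-1"):
    """Return raw as UTF-8 bytes, converting from legacy_encoding if needed.

    Assumption: anything that is not valid UTF-8 was written in the legacy
    encoding. Some legacy byte sequences happen to be valid UTF-8 too, so
    this is a heuristic, not a guarantee.
    """
    try:
        raw.decode("utf-8")  # validation only; the result is discarded
        return raw
    except UnicodeDecodeError:
        return raw.decode(legacy_encoding).encode("utf-8")
```

Calling this in one place, on every value read from or written to the
database, is what would keep the change mostly invisible to the rest of the
code.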

--Turi


Chung Leong
#4: Jul 17 '05


Newer versions of PHP support regular expression matching on UTF-8 strings,
although it's still far from being useful. There's no support for character
classes, for instance. Using UTF-16 doesn't help much in that regard either.
You might actually want to stick with using just ISO character sets.
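
Where named character classes aren't available, one workaround is to match
on explicit code point ranges after decoding the text. The Python sketch
below is only an illustration: the name `han_runs` is made up, and the range
U+4E00 to U+9FFF covers the main CJK Unified Ideographs block, not every
Chinese character.

```python
import re

# Explicit range for the main CJK Unified Ideographs block.
HAN_RUN = re.compile("[\u4e00-\u9fff]+")


def han_runs(text):
    """Return every run of consecutive CJK ideographs found in text."""
    return HAN_RUN.findall(text)


print(han_runs("PHP 和 MySQL 的多语言支持"))
```

The important part is that the pattern is applied to characters rather than
raw bytes. In PHP that corresponds to using the `u` modifier on PCRE
patterns.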

The big question, of course, is which languages do you want to support.
Adding support for languages like Chinese or Arabic is definitely not
trivial.



Aditya Ivaturi
#5: Jul 17 '05


Our main goal is to add support for Chinese and Japanese. So judging by your
response it seems like a challenging project.

--Turi

