Our main goal is to add support for Chinese and Japanese, so judging by your
response, this looks like a challenging project.
--Turi
"Chung Leong" <chernyshevsky@hotmail.com> wrote in message
news:muKdnXkvatcYn2OiRVn-jA@comcast.com...
> Newer versions of PHP support regular expression matching on UTF-8 strings,
> although it's still far from being useful. There's no support for character
> classes, for instance. Using UTF-16 doesn't help much either in that regard.
> You might actually want to stick with using just ISO character sets.
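
A minimal sketch of why byte-oriented regular expressions fall short on UTF-8
text. It is written in Python rather than PHP purely for illustration, and the
sample string is an assumption. A pattern applied to raw bytes sees each byte
of a multibyte character separately, while a pattern applied to decoded text
sees whole characters.

```python
import re

text = "日本語"                 # three characters (illustrative sample)
raw = text.encode("utf-8")      # each of these characters takes 3 bytes

# Byte-oriented matching: "." matches one byte at a time.
byte_matches = re.findall(rb".", raw, flags=re.DOTALL)

# Character-oriented matching: "." matches one whole character.
char_matches = re.findall(r".", text, flags=re.DOTALL)

print(len(byte_matches))  # number of bytes matched
print(len(char_matches))  # number of characters matched
```

The same gap applies to character classes: a class built from single bytes
cannot describe a character that spans several bytes.
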
>
> The big question, of course, is which languages do you want to support.
> Adding support for languages like Chinese or Arabic is definitely not
> trivial.
>
> User "Aditya Ivaturi" <aivaturi@aijalon.net> wrote in message
> news:6vVKb.223690$Eq1.105771@twister.rdc-kc.rr.com...
> > Thanks for your lucid explanation, Chung. You mentioned "As long as your
> > application doesn't perform much text analysis...", but our application
> > uses regular expressions and text manipulation heavily. There are arguments
> > for and against it, but at this point I don't think redoing everything is
> > an option either. For the most part, though, there is one hope. Our
> > application evolved over time, and it was not very long ago that we
> > introduced a new DB layer. Even though efforts were made to change the code
> > to use this new DB layer, I am afraid there still might be instances where
> > we use the PHP mysql functions to access the db.
> >
> > The obvious option might be to make sure db access is done via the new
> > layer and to introduce code there to handle proper conversion to UTF-8. But
> > what other options do you feel might be an alternative? Personally, I don't
> > see a whole lot of room to wiggle here. Any suggestions are welcome.
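
One way to picture the "convert at the DB layer" option is a single choke point
that normalizes incoming bytes to UTF-8 before they are stored. The sketch
below is in Python for illustration only. The helper name `to_utf8` and the
assumption that the legacy data is ISO-8859-1 are hypothetical, not something
stated in this thread.

```python
def to_utf8(raw: bytes, source_encoding: str = "iso-8859-1") -> bytes:
    """Re-encode bytes from a known legacy encoding into UTF-8.

    Hypothetical helper: the legacy encoding must be known in advance,
    because it cannot be reliably guessed from the bytes alone.
    """
    return raw.decode(source_encoding).encode("utf-8")


legacy = "café".encode("iso-8859-1")   # 4 bytes in ISO-8859-1
converted = to_utf8(legacy)            # 5 bytes in UTF-8, since "é" takes 2
print(len(legacy), len(converted))
```

Any code path that bypasses such a choke point (for example, direct calls to
the database functions) would skip the conversion, which is why routing all
access through one layer matters.
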
> >
> > --Turi
> >
> > "Chung Leong" <chernyshevsky@hotmail.com> wrote in message
> > news:GbWdnQQR8_WI7GaiRVn-vA@comcast.com...
> > > I would recommend storing the Unicode text as UTF-8 as well, as PHP
> > > doesn't have a function that converts 16-bit Unicode to UTF-8. Also,
> > > MySQL can store UTF-8 as text, while UCS-2/UTF-16 would have to be stored
> > > as binary.
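
For reference, converting between 16-bit Unicode and UTF-8 is a mechanical
re-encoding once the byte order is known. A minimal sketch in Python (used for
illustration only; the little-endian byte order and the sample text are
assumptions):

```python
sample = "A中"

utf16 = sample.encode("utf-16-le")   # 2 bytes per character for these two
utf8 = sample.encode("utf-8")        # 1 byte for "A", 3 bytes for "中"

# Round trip: decode the 16-bit form, then re-encode as UTF-8.
converted = utf16.decode("utf-16-le").encode("utf-8")

print(utf16)               # the ASCII letter is followed by a zero byte
print(converted == utf8)   # the round trip reproduces the UTF-8 bytes
```

The embedded zero byte in the 16-bit form is one reason such encodings tend not
to pass safely through code that treats strings as plain byte text.
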
> > >
> > > As long as your application doesn't perform much text analysis, it would
> > > handle the UTF-8 text more or less transparently. One thing to look out
> > > for is database field lengths. Because UTF-8 is a variable-length
> > > encoding, you can't rely on the maxlength attribute in your <input> tags
> > > to limit the length of user input in bytes. A Hindi or Chinese character,
> > > for instance, takes up 3 bytes in UTF-8. Any text truncation code would
> > > have to take that into account. Bad things happen when you drop off half
> > > a character at the end of a string.
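
The truncation problem described above can be sketched as follows. This is
Python used for illustration only, and `truncate_utf8` is a hypothetical helper
name. The idea is to cut at a byte limit, then back up while the first dropped
byte is a UTF-8 continuation byte (bit pattern 10xxxxxx), so the cut never
lands inside a character.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Truncate text so its UTF-8 encoding fits in max_bytes (assumed >= 0)."""
    data = text.encode("utf-8")
    if len(data) <= max_bytes:
        return text
    cut = max_bytes
    # data[cut] is the first byte that would be dropped. If it is a
    # continuation byte, the cut is mid-character, so move it back.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut].decode("utf-8")


print(truncate_utf8("中文", 4))   # a 4-byte limit only has room for one character
print(truncate_utf8("abc", 10))   # short ASCII text is returned unchanged
```

A field limit expressed in bytes therefore holds fewer characters for Chinese
or Hindi text than for ASCII text.
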
> > >
> > > Some languages present special challenges as well. In Chinese there's no
> > > space between words, for example. The Arabic script displays from right
> > > to left. And Hindi will only work in Internet Explorer on Windows.
> > > Non-Latin scripts also tend to need to be bigger in order to be legible.
> > >
> > > User "Aditya Ivaturi" <aivaturi@aijalon.net> wrote in message
> > > news:qxBKb.217846$Eq1.113132@twister.rdc-kc.rr.com...
> > > > We have a CMS which is based on PHP & MySQL. Recently we received a
> > > > request to support multiple languages so that sites in that particular
> > > > language can be created. I did some searching on Google and it seems I
> > > > have to build in multibyte support for PHP and MySQL. Mbstring
> > > > (http://us3.php.net/mbstring) claims to support multiple languages,
> > > > with a caution saying it might not work properly.
> > > >
> > > > After further research it seems Unicode might be the way to go, since
> > > > Unicode can represent all characters (in all languages) with integers,
> > > > which in turn can be handled in PHP as it has excellent integer
> > > > support. But again, since all the data is stored in MySQL, we need
> > > > Unicode support for MySQL too, and it has 2 formats
> > > > (http://www.mysql.com/doc/en/Charset-Unicode.html): ucs-2 (for storing
> > > > data) and utf-8 (for encoding). Here is where I need help. Do I opt for
> > > > ucs-2 or go ahead with utf-8? What are the advantages and disadvantages
> > > > of both?
> > > >
> > > > Now back to our CMS: can we make changes so that this new support is
> > > > transparent to the code (that doesn't sound right)? Any suggestions on
> > > > how I can minimize the amount of rework we have to do on the code to
> > > > accommodate Unicode? Are there any other suggestions on how to approach
> > > > this transformation?
> > > >
> > > > --Turi