PHP5 and Double Byte (experts wanted)

Dorthe Luebbert
Guest
 
Posts: n/a
#1: Jul 17 '05
Hi,

we have to convert a fairly large ISO application (MySQL 4, PHP5) to
UTF-8. So far we have found that MySQL has no collation that can
sort all character sets correctly, so this still has to be done.
There might also be some problems with string functions, as not all
string functions have multibyte equivalents.

What are your experiences with this topic? Any hints?

If you are an expert in Unicode in combination with MySQL and
PHP, or in sorting algorithms for Asian characters, please drop me a
line at
luebbert-AT-globalpark-Dot-de. We are willing to pay for consultants.

Thanks

Dorthe Luebbert
NSpam
Guest
 
Posts: n/a
#2: Jul 17 '05

re: PHP5 and Double Byte (experts wanted)


Dorthe Luebbert wrote:
> Hi,
>
> we have to convert a fairly large ISO application (MySQL 4, PHP5) to
> UTF-8. So far we have found that MySQL has no collation that can
> sort all character sets correctly, so this still has to be done.
> There might also be some problems with string functions, as not all
> string functions have multibyte equivalents.
>
> What are your experiences with this topic? Any hints?
>
> If you are an expert in Unicode in combination with MySQL and
> PHP, or in sorting algorithms for Asian characters, please drop me a
> line at
> luebbert-AT-globalpark-Dot-de. We are willing to pay for consultants.
>
> Thanks
>
> Dorthe Luebbert
Yup, somewhat of a problem. Essentially, it can't be done: it's
impossible to map multibyte character sets to 8 byte character sets.
NSpam
Guest
 
Posts: n/a
#3: Jul 17 '05

re: PHP5 and Double Byte (experts wanted)


NSpam wrote:
> Dorthe Luebbert wrote:
>
>> Hi,
>>
>> we have to convert a fairly large ISO application (MySQL 4, PHP5) to
>> UTF-8. So far we have found that MySQL has no collation that can
>> sort all character sets correctly, so this still has to be done.
>> There might also be some problems with string functions, as not all
>> string functions have multibyte equivalents.
>>
>> What are your experiences with this topic? Any hints?
>>
>> If you are an expert in Unicode in combination with MySQL and
>> PHP, or in sorting algorithms for Asian characters, please drop me a
>> line at
>> luebbert-AT-globalpark-Dot-de. We are willing to pay for consultants.
>>
>> Thanks
>>
>> Dorthe Luebbert
>
> Yup, somewhat of a problem. Essentially, it can't be done: it's
> impossible to map multibyte character sets to 8 byte character sets.
Whoops, that should read "8 bit character sets".
Malcolm Dew-Jones
Guest
 
Posts: n/a
#4: Jul 17 '05

re: PHP5 and Double Byte (experts wanted)


NSpam (chris.newey@gmail.com) wrote:
: Dorthe Luebbert wrote:
: > Hi,
: >
: > we have to convert a fairly large ISO application (MySQL 4, PHP5) to
: > UTF-8. So far we have found that MySQL has no collation that can
: > sort all character sets correctly, so this still has to be done.
: > There might also be some problems with string functions, as not all
: > string functions have multibyte equivalents.
: >
: > What are your experiences with this topic? Any hints?
: >
: > If you are an expert in Unicode in combination with MySQL and
: > PHP, or in sorting algorithms for Asian characters, please drop me a
: > line at
: > luebbert-AT-globalpark-Dot-de. We are willing to pay for consultants.
: >
: > Thanks
: >
: > Dorthe Luebbert
: Yup, somewhat of a problem. Essentially, it can't be done: it's
: impossible to map multibyte character sets to 8 [bit]byte character
: sets.

Yes, but many parts can be done, just not quite in the straightforward
way.

If you view UTF-8 data as 8 bit characters, then each UTF-8 "character"
becomes a _unique_ string of one to four bytes. This means
that simply handling the data as 8 bit character strings can often work
correctly, as long as you don't chop the string in the middle of a
UTF-8 character.
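A rough illustration of the variable width and of the "don't chop in the middle" rule. It is sketched in Python only because the behaviour is easy to check there; the same byte-level reasoning should carry over to PHP's byte-oriented string functions. The sample characters and the helper name `safe_truncate` are made up for this example, and the helper assumes its input is already valid UTF-8.

```python
# Each code point encodes to between one and four bytes in UTF-8.
samples = ["A", "\u00e9", "\u20ac", "\U0001e240"]  # A, e-acute, euro sign, U+1E240
lengths = [len(ch.encode("utf-8")) for ch in samples]
print(lengths)  # [1, 2, 3, 4]


def safe_truncate(data: bytes, limit: int) -> bytes:
    """Cut data to at most `limit` bytes without splitting a character.

    Hypothetical helper; assumes `data` is valid UTF-8. Continuation
    bytes have the bit pattern 10xxxxxx, so the cut point is moved back
    until it no longer sits directly in front of one.
    """
    if limit >= len(data):
        return data
    cut = limit
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut]


text = "na\u00efve \u20ac5".encode("utf-8")  # 11 bytes for 8 characters
print(text[:3])                # naive cut ends in a dangling lead byte
print(safe_truncate(text, 3))  # b'na'
```

A plain byte slice at offset 3 would leave half of a two-byte character behind, which is the situation to avoid.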

So let's say you want to search for the Unicode character with code point
123,456 (what character that might be I have no idea). That value is
represented as a four byte string. Now use your 8 bit
string routines to search through the UTF-8 string: instead of using
UTF-8 aware routines and looking for a character, you use 8 bit
routines and look for that four byte string. Your program will find it
exactly as if you were using UTF-8 aware string
routines. Likewise, if you do a search and replace, then
replacing that four byte string with a new string does exactly the
same thing as a character replace, as long as you use the correct little
substrings.
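A small sketch of that idea, again in Python for illustration only. Code point 123,456 is U+1E240, which lies above U+FFFF and therefore takes four bytes in UTF-8. The sample haystack is invented; the point is that the byte-oriented `find` and `replace` give the same result as the character-level operations.

```python
# Code point 123,456 (U+1E240) as a UTF-8 byte string.
needle = chr(123456).encode("utf-8")
print(len(needle))  # 4

haystack_text = "abc" + chr(123456) + "d\u00e9f" + chr(123456)
haystack = haystack_text.encode("utf-8")

# Byte-oriented search: nothing here knows about UTF-8.
byte_pos = haystack.find(needle)
hit_count = haystack.count(needle)

# Byte-oriented replace of the four byte string with another character's bytes.
replaced = haystack.replace(needle, "\u20ac".encode("utf-8"))

# Decoding is done only to compare against a character-level replace.
same_result = replaced.decode("utf-8") == haystack_text.replace(chr(123456), "\u20ac")
print(byte_pos, hit_count, same_result)  # 3 2 True
```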

If you are manipulating data from outside the program, and the data comes
in as UTF-8, like from a web form, then for many tasks you can simply ignore
the fact that it is UTF-8 and treat it as 8 bit data, and everything will work.
E.g. if the user fills in a search dialog and the data from the dialog is
UTF-8, then your program simply searches for the string exactly as
received, and the fact that it's UTF-8 whereas you are using 8 bit string
routines will make no difference to whether you find that string in some
other UTF-8 data. And when you send the data back to the client, then as
long as their browser displays it as UTF-8, they will see all the
correct data.

Sorting may not appear to work because the resulting character order
doesn't make sense, but I think you'll find that at a low level even
routines like strcmp "work", in the sense that UTF-8 characters with higher
code points produce 8 bit strings that sort in the same order as if you
were sorting them by their numerical code points.
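This property can be checked with a short sketch (Python, for illustration; the word list is invented). Sorting the encoded byte strings gives the same order as sorting by code point, which is consistent but is not a linguistically meaningful collation: uppercase sorts before lowercase, and accented letters sort after "z".

```python
words = ["zebra", "\u00e9clair", "apple", "\u4e2d\u6587", "\U0001e240", "Zulu", "\u20ac"]

# Python compares str values code point by code point.
by_code_point = sorted(words)

# Comparing the UTF-8 bytes is what a strcmp-style routine would do.
by_bytes = sorted(words, key=lambda w: w.encode("utf-8"))

print(by_bytes == by_code_point)  # True
print(by_bytes[:4])               # ['Zulu', 'apple', 'zebra', 'éclair']
```

So a bytewise sort is stable and predictable, but a proper language-aware collation still has to come from somewhere else.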


--

This space not for rent.
Dana Cartwright
Guest
 
Posts: n/a
#5: Jul 17 '05

re: PHP5 and Double Byte (experts wanted)


"Malcolm Dew-Jones" <yf110@vtn1.victoria.tc.ca> wrote in message
news:422b6caa@news.victoria.tc.ca...
> So let's say you want to search for the Unicode character with code point
> 123,456 (what character that might be I have no idea). That value is
> represented as a four byte string. Now use your 8 bit
> string routines to search through the UTF-8 string: instead of using
> UTF-8 aware routines and looking for a character, you use 8 bit
> routines and look for that four byte string. Your program will find it
> exactly as if you were using UTF-8 aware string
> routines.

Not so. You will match the codepoint, as you say, but you will also match
codepoints that bridge UTF characters.

Let's say that the string "AAA" is a unicode sequence (in other words, it's
some codepoint) and "AAB" is also a codepoint.

Now if you have the two-character unicode string AAA followed by AAB, it
looks like AAAAAB when viewed as 8-bit bytes.

If you naively search for the codepoint AAA in this string you'll get three
matches. Only the first of these is valid; the other two
are bogus because they bridge codepoints.


Chung Leong
Guest
 
Posts: n/a
#6: Jul 17 '05

re: PHP5 and Double Byte (experts wanted)


"Dorthe Luebbert" <dorthe.luebbert@gmx.de> wrote in message
news:278556ae.0503060132.42f74630@posting.google.com...
> Hi,
>
> we have to convert a fairly large ISO application (MySQL 4, PHP5) to
> UTF-8. So far we have found that MySQL has no collation that can
> sort all character sets correctly, so this still has to be done.
> There might also be some problems with string functions, as not all
> string functions have multibyte equivalents.
>
> What are your experiences with this topic? Any hints?
>
> If you are an expert in Unicode in combination with MySQL and
> PHP, or in sorting algorithms for Asian characters, please drop me a
> line at
> luebbert-AT-globalpark-Dot-de. We are willing to pay for consultants.
>
> Thanks
>
> Dorthe Luebbert

Someone was working on a PHP extension for the ICU library. I
don't know what its status is at this point. Worth googling.

If MySQL doesn't support Unicode collation, I guess there's not much you can
do. Perhaps it's time to consider a commercial database? I know MSSQL can handle
sorting in a number of languages. Oracle probably can too.


Brion Vibber
Guest
 
Posts: n/a
#7: Jul 17 '05

re: PHP5 and Double Byte (experts wanted)


Dana Cartwright wrote:
> "Malcolm Dew-Jones" <yf110@vtn1.victoria.tc.ca> wrote in message
> news:422b6caa@news.victoria.tc.ca...
>> So let's say you want to search for the Unicode character with code point
>> 123,456 (what character that might be I have no idea). That value is
>> represented as a four byte string. Now use your 8 bit
>> string routines to search through the UTF-8 string: instead of using
>> UTF-8 aware routines and looking for a character, you use 8 bit
>> routines and look for that four byte string. Your program will find it
>> exactly as if you were using UTF-8 aware string
>> routines.
>
> Not so. You will match the codepoint, as you say, but you will also match
> codepoints that bridge UTF characters.

Since no UTF-8 character's byte sequence is a subsequence of any other
UTF-8 character's byte sequence, that's just not true. A bytewise search
will indeed find only correct matches; this is one of the things that
sets UTF-8 apart from pre-Unicode multibyte encodings.

-- brion vibber (brion @ pobox.com)
Brion Vibber
Guest
 
Posts: n/a
#8: Jul 17 '05

re: PHP5 and Double Byte (experts wanted)


Chung Leong wrote:
> Someone was working on a PHP extension for the ICU library. I
> don't know what its status is at this point. Worth googling.

Can't help much on collation currently, but I've written a partial PHP
extension wrapper for ICU's normalizer and a pure-PHP equivalent for
validation and normalization of UTF-8 input. It's licensed under the GPL and
is bundled with MediaWiki 1.4 (www.mediawiki.org).

I ended up writing that rather than trying to wade through the full ICU
extension; I had a hard enough time trying to track it down, and didn't
want to rely on ICU and a custom PHP extension as they aren't always
available.

> If MySQL doesn't support Unicode collation, I guess there's not much you can
> do. Perhaps it's time to consider a commercial database? I know MSSQL can handle
> sorting in a number of languages. Oracle probably can too.

MySQL 4.1 and higher do have UTF-8 support. I don't know how well it
works at this stage or whether the collation support is suitable for the
original poster's needs.

-- brion vibber (brion @ pobox.com)
Brion Vibber
Guest
 
Posts: n/a
#9: Jul 17 '05

re: PHP5 and Double Byte (experts wanted)


Brion Vibber wrote:
> Dana Cartwright wrote:
>> Not so. You will match the codepoint, as you say, but you will also
>> match codepoints that bridge UTF characters.
>
> Since no UTF-8 character's byte sequence is a subsequence of any other
> UTF-8 character's byte sequence, that's just not true. A bytewise search
> will indeed find only correct matches; this is one of the things that
> sets UTF-8 apart from pre-Unicode multibyte encodings.

Some details; if character encodings bore you, stop reading now. ;)
Having a byte substring that bridges characters is impossible because of
the way a UTF-8 byte sequence is structured. Any UTF-8 character will be
laid out into bytes like one of these sequences:

[ASCII]
[head1][tail]
[head2][tail][tail]
[head3][tail][tail][tail]

The number of tail bytes is determined by the high-order bits of the
head byte; any given head byte will always be followed by the same
number of tail bytes. Note that each category of byte is in a distinct,
non-overlapping range:

ASCII: 0x00-0x7f
tail: 0x80-0xbf
head1: 0xc0-0xdf
head2: 0xe0-0xef
head3: 0xf0-0xf7

No ASCII byte ever appears in a non-ASCII sequence, and no sequence head
byte ever appears in the tail of any other sequence. The only way for a
match to bridge characters is if you're working with corrupt data that's
not actually valid UTF-8.

>> Let's say that the string "AAA" is a unicode sequence (in other words, it's
>> some codepoint) and "AAB" is also a codepoint.
>>
>> Now if you have the two-character unicode string AAA followed by AAB, it
>> looks like AAAAAB when viewed as 8-bit bytes.
>>
>> If you naively search for the codepoint AAA in this string you'll get three
>> matches. Only the first of these is valid; the other two
>> are bogus because they bridge codepoints.

This scenario is clearly impossible due to the distinction between head
and tail bytes; sequences cannot run together or overlap in this way.
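The head/tail argument can be checked mechanically. The following is a minimal sketch in Python (chosen only for illustration; the function names `byte_class`, `char_starts` and `find_all` and the sample text are made up here). It classifies bytes using the ranges listed above and confirms that every bytewise hit for a character falls on a character boundary, even when identical characters are adjacent.

```python
def byte_class(b: int) -> str:
    """Classify one byte using the ranges given in the post."""
    if b <= 0x7F:
        return "ascii"
    if b <= 0xBF:
        return "tail"
    if b <= 0xDF:
        return "head1"
    if b <= 0xEF:
        return "head2"
    if b <= 0xF7:
        return "head3"
    return "invalid"


def char_starts(data: bytes) -> set:
    """Byte offsets where a character begins (every byte that is not a tail)."""
    return {i for i, b in enumerate(data) if byte_class(b) != "tail"}


def find_all(haystack: bytes, needle: bytes) -> list:
    """All offsets, overlapping ones included, where needle occurs bytewise."""
    hits, start = [], haystack.find(needle)
    while start != -1:
        hits.append(start)
        start = haystack.find(needle, start + 1)
    return hits


# Repeated 2-, 3- and 4-byte characters placed next to each other.
text = "\u00e9\u00e9\u00e8 \u20ac\u20ac \U0001e240\U0001e240".encode("utf-8")
starts = char_starts(text)

for ch in ["\u00e9", "\u20ac", "\U0001e240"]:
    hits = find_all(text, ch.encode("utf-8"))
    # No hit bridges two characters: each one starts on a boundary.
    assert all(h in starts for h in hits)
    print(repr(ch), hits)
```

The "AAAAAB" case cannot be built from valid UTF-8, because a multi-byte character never consists of the same kind of byte three times over: it is always one head byte followed only by tail bytes.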

-- brion vibber (brion @ pobox.com)