PHP5 and Double Byte (experts wanted)

Dorthe Luebbert

Hi,

we have to convert a quite large ISO-application (Mysql4, PHP5) to
UTF8. We found out so far that there is no collation for MySQL which
is able to sort all character sets correctly. So this has to be done.
And there might be some problems with string functions as not all
string functions have multibyte aquivalents.

What are your experiences with this topic? Any hints?

In case you are an expert for unicode in combination with MYsql and
PHP or for sorting algorithmn of asian characters please drop me a
line at
luebbert-AT-globalpark-Dot-de. We are willing to pay for consultants.

Thanx

Dorthe Luebbert

Jul 17 '05 #1

Subscribe Reply

2811

NSpam

Dorthe Luebbert wrote:

Hi,

we have to convert a quite large ISO-application (Mysql4, PHP5) to
UTF8. We found out so far that there is no collation for MySQL which
is able to sort all character sets correctly. So this has to be done.
And there might be some problems with string functions as not all
string functions have multibyte aquivalents.

What are your experiences with this topic? Any hints?

In case you are an expert for unicode in combination with MYsql and
PHP or for sorting algorithmn of asian characters please drop me a
line at
luebbert-AT-globalpark-Dot-de. We are willing to pay for consultants.

Thanx

Dorthe Luebbert

yup, somewhat of a problem, essentially, it can't be done. Its
inpossible to map multibyte character sets to 8 byte character sets.

Jul 17 '05 #2

NSpam

NSpam wrote:

Dorthe Luebbert wrote:
Hi,

we have to convert a quite large ISO-application (Mysql4, PHP5) to
UTF8. We found out so far that there is no collation for MySQL which
is able to sort all character sets correctly. So this has to be done.
And there might be some problems with string functions as not all
string functions have multibyte aquivalents.

What are your experiences with this topic? Any hints?
In case you are an expert for unicode in combination with MYsql and
PHP or for sorting algorithmn of asian characters please drop me a
line at
luebbert-AT-globalpark-Dot-de. We are willing to pay for consultants.

Thanx

Dorthe Luebbert

yup, somewhat of a problem, essentially, it can't be done. Its
inpossible to map multibyte character sets to 8 byte character sets.

Whoops, should that read "8 bit character sets"

Jul 17 '05 #3

Malcolm Dew-Jones

NSpam (ch*********@gm ail.com) wrote:
: Dorthe Luebbert wrote:
: > Hi,
: >
: > we have to convert a quite large ISO-application (Mysql4, PHP5) to
: > UTF8. We found out so far that there is no collation for MySQL which
: > is able to sort all character sets correctly. So this has to be done.
: > And there might be some problems with string functions as not all
: > string functions have multibyte aquivalents.
: >
: > What are your experiences with this topic? Any hints?
: >
: > In case you are an expert for unicode in combination with MYsql and
: > PHP or for sorting algorithmn of asian characters please drop me a
: > line at
: > luebbert-AT-globalpark-Dot-de. We are willing to pay for consultants.
: >
: > Thanx
: >
: > Dorthe Luebbert
: yup, somewhat of a problem, essentially, it can't be done. Its
: inpossible to map multibyte character sets to 8 [bit]byte character
sets.

Yes, but many parts can be done, just not quite in the straight forward
way.

If you view utf-8 data as 8 bit characters, then each utf-8 "character"
becomes a _unique_ string of one to three (four?) characters. Which means
that simply handling the data as 8 bit character strings can often work
correctly, as long as you don't chop the string in the middle of a
utf-character.

So lets say you want to search for the unicode character with codepoint
123,456 (what character that might be I have no idea). That value is
represented as something like a three byte string. Now use your 8 bit
string routines to search through the utf-string, but instead of using
utf-8 aware routines and looking for a character, insted you use 8 bit
routines and look for that three byte string. Your program will find it
correctly exactly the same as if you were using utf-8 aware string
routines. And so for example, if you do a search and replace then
replacing that three byte string with a new string will do exactly the
same thing as a character replace, as long as you use the correct little
substrings.

If you are manipulating data from outside the program, and the data comes
in as utf-8, like in a web form, then for many tasks you simply ignore
that fact it is utf-8 as treat it as 8 bit data and everything will work.
E.g. If the user fills in a search dialog and the data from the dialog is
utf-8, then your program simply searches for the string exactly as
received and the fact that it's utf-8 whereas you are using 8 bit string
routines will make no difference to whether you find that string in some
other utf-8 data. And when you send the data back to the client, then as
long as their browser displays it as utf-8, then they will see all the
correct data.

Sorting may not appear to work because the resulting character order
doesn't make sense, but I think you'll find that at a low level even
routines like strcmp "work" in the sense that utf-8 characters with higher
code points create 8 bits strings that sort in the same order as if you
were sorting them using the numerical code points.
--

This space not for rent.

Jul 17 '05 #4

Dana Cartwright

"Malcolm Dew-Jones" <yf***@vtn1.vic toria.tc.ca> wrote in message
news:42******@n ews.victoria.tc .ca...

So lets say you want to search for the unicode character with codepoint
123,456 (what character that might be I have no idea). That value is
represented as something like a three byte string. Now use your 8 bit
string routines to search through the utf-string, but instead of using
utf-8 aware routines and looking for a character, insted you use 8 bit
routines and look for that three byte string. Your program will find it
correctly exactly the same as if you were using utf-8 aware string
routines.

Not so. You will match the codepoint, as you say, but you will also match
codepoints that bridge UTF characters.

Let's say that the string "AAA" is a unicode sequence (in other words, it's
some codepoint) and "AAB" is also a codepoint.

Now if you have the two-character unicode string AAA followed by AAB, it
looks like AAAAAB when viewed as 8-bit bytes.

If you naively search for the codepoint AAA in this string you'll get 3
places it matches. But only the first of these is valid. The second two
matches are bogus because they bridge codepoints.

Jul 17 '05 #5

Chung Leong

"Dorthe Luebbert" <do************ *@gmx.de> wrote in message
news:27******** *************** ***@posting.goo gle.com...

Hi,

we have to convert a quite large ISO-application (Mysql4, PHP5) to
UTF8. We found out so far that there is no collation for MySQL which
is able to sort all character sets correctly. So this has to be done.
And there might be some problems with string functions as not all
string functions have multibyte aquivalents.

What are your experiences with this topic? Any hints?

In case you are an expert for unicode in combination with MYsql and
PHP or for sorting algorithmn of asian characters please drop me a
line at
luebbert-AT-globalpark-Dot-de. We are willing to pay for consultants.

Thanx

Dorthe Luebbert

Some guy was working on creating a PHP extension for the ICU library. I
don't know what the status is at this point. Worth googling.

If MySQL doesn't support Unicode collation, I guess there's no much you can
do. Perhaps time to consider a commercial database? I know MSSQL can handle
sorting in a number of languages. Oracle can too probably.

Jul 17 '05 #6

Brion Vibber

Dana Cartwright wrote:

"Malcolm Dew-Jones" <yf***@vtn1.vic toria.tc.ca> wrote in message
news:42******@n ews.victoria.tc .ca...
So lets say you want to search for the unicode character with codepoint
123,456 (what character that might be I have no idea). That value is
represented as something like a three byte string. Now use your 8 bit
string routines to search through the utf-string, but instead of using
utf-8 aware routines and looking for a character, insted you use 8 bit
routines and look for that three byte string. Your program will find it
correctly exactly the same as if you were using utf-8 aware string
routines.

Not so. You will match the codepoint, as you say, but you will also match
codepoints that bridge UTF characters.

Since no UTF-8 character's byte sequence is a subsequence of any other
UTF-8 character's byte sequence, that's just not true. A bytewise search
will indeed find only correct matches; this is one of the things that
sets UTF-8 apart from pre-Unicode multibyte encodings.

-- brion vibber (brion @ pobox.com)

Jul 17 '05 #7

Brion Vibber

Chung Leong wrote:

Some guy was working on creating a PHP extension for the ICU library. I
don't know what the status is at this point. Worth googling.
Can't help much on collation currently, but I've written a partial PHP
extension wrapper for ICU's normalizer and a pure-PHP equivalent for
validation and normalization of UTF-8 input. It's under GPL license and
is bundled with MediaWiki 1.4 (www.mediawiki.org)

I ended up writing that rather than trying to wade through the full ICU
extension; I had a hard enough time trying to track it down, and didn't
want to rely on ICU and a custom PHP extension as they aren't always
available.
If MySQL doesn't support Unicode collation, I guess there's no much you can
do. Perhaps time to consider a commercial database? I know MSSQL can handle
sorting in a number of languages. Oracle can too probably.

MySQL 4.1 and higher do have UTF-8 support. I don't know how well it
works at this stage or whether the collation support is suitable for the
original poster's needs.

-- brion vibber (brion @ pobox.com)

Jul 17 '05 #8

Brion Vibber

Brion Vibber wrote:

Dana Cartwright wrote:
Not so. You will match the codepoint, as you say, but you will also
match codepoints that bridge UTF characters.

Since no UTF-8 character's byte sequence is a subsequence of any other
UTF-8 character's byte sequence, that's just not true. A bytewise search
will indeed find only correct matches; this is one of the things that
sets UTF-8 apart from pre-Unicode multibyte encodings.

Some details; if character encodings bore you, stop reading now. ;)
Having a byte substring that bridges characters is impossible because of
the way a UTF-8 byte sequence is structured. Any UTF-8 character will be
laid out into bytes like one of these sequences:

[ASCII]
[head1][tail]
[head2][tail][tail]
[head3][tail][tail][tail]

The number of tail bytes is determined by the high-order bits of the
head byte; any given head byte will always be followed by the same
number of tail bytes. Note that each category of byte is in a distinct,
non-overlapping range:

ASCII: 0x00-0x7f
tail: 0x80-0xbf
head1: 0xc0-0xdf
head2: 0xe0-0xef
head3: 0xf0-0xf7

No ASCII byte ever appears in a non-ASCII sequence, and no sequence head
byte ever appears in the tail of any other sequence. The only way for a
match to bridge characters is if you're working with corrupt data that's
not actually valid UTF-8.

Let's say that the string "AAA" is a unicode sequence (in other words, it's
some codepoint) and "AAB" is also a codepoint.

Now if you have the two-character unicode string AAA followed by AAB, it
looks like AAAAAB when viewed as 8-bit bytes.

If you naively search for the codepoint AAA in this string you'll get 3
places it matches. But only the first of these is valid. The second two
matches are bogus because they bridge codepoints.

This scenario is clearly impossible due to the distinction between head
and tail bytes; sequences cannot run together or overlap in this way.

-- brion vibber (brion @ pobox.com)

Jul 17 '05 #9

Similar topics

10586

PHP5 - Installation Error - 400 Bad Request

by: neur0maniak | last post by:

Hi, I've been eager to try out PHP5, so I've dumped it on my little dev machine. It's running WinXP with IIS5. I've put the php-cgi.exe in the "mappings" page as I'm used to doing with PHP4. I've got my php.ini all set in C:\Windows. I created an index.php containing: <?php phpinfo(); ?> When I try to view the page, I get "HTTP 400 - Bad Request".

PHP

12866

reinterpret_cast - to interpret double as long

by: Suzanne Vogel | last post by:

I'd like to convert a double to a binary representation. I can use the "&" bit operation with a bit mask to convert *non* float types to binary representations, but I can't use "&" on doubles. To get around this limitation on double, I'd like to keep the bits of the double the *same* but change its interpretation to long. I can use "&" on longs. I tried to use reinterpret_cast for this purpose, but it returned zero every time. double...

C / C++

6092

When (32-bit) double precision isn't precise enough

by: DAVID SCHULMAN | last post by:

I've been trying to perform a calculation that has been running into an underflow (insufficient precision) problem in Microsoft Excel, which calculates using at most 15 significant digits. For this purpose, that isn't enough. I was reading a book about some of the financial scandals of the 1990s called "Inventing Money: The Story of Long-Term Capital Management and the Legends Behind it" by Nicholas Dunbar. On page 95, he mentions that...

C / C++

3726

Wintel compilers with: C99 tgmath.h? 16-byte "long double"?

by: bq | last post by:

Hello, Two questions related to floating point support: What C compilers for the wintel (MS Windows + x86) platform are C99 compliant as far as <math.h> and <tgmath.h> are concerned? What wintel compilers support a 16-byte "long double" (with 33 decimal digits of accuracy) including proper printf() support. I found some compilers that did support "long double", but theirs was an 8-byte or 10-byte or 12-byte type with accuracy the...

C / C++

2617

align to double or to long double?

by: RoSsIaCrIiLoIA | last post by:

I have rewrote the malloc() function of K&R2 chapter 8.7 typedef long Align; ^^^^ Here, should I write 'long', 'double' or 'long double'? I know that in my pc+compiler sizeof(long)=4, sizeof(double)=8 and sizeof(long double)=10 (if 'long' or 'double' sizeof(Header)=8 if 'long double' sizeof(Header)=12)

C / C++

2916

storing floating point data with using float or double

by: Steve | last post by:

Hi, I've developed a testing application that stores formatted results in a database. Recently it was requested that I add the ability to generate graphs from the raw, un formatted test results (100,000+ float values) I don't intend to store all of the 100,000 datapoints, but rather a subset of say 250. Due to the volume of testing that we need and the volume of results stored, I need to be VERY careful with data size and keep things...

C# / C Sharp

4388

VB5 Long & Double to VB.NET

by: Terry | last post by:

I have some old VB5 functions that specify types of Long and Double that make calls to an unmanaged 3rd party DLL probably just as old. The source for the DLL is not available. I'm getting some warnings "PInvoke..unbalanced stack..." etc. Reading up a bit on this and trying to understand this particular warning, I find that part of the problem could be in the mismatch of allocated space for the variables, e.g. VB5 Long is not the same as...

Visual Basic .NET

3305

A small problem in running PHP4 under PHP5 - $row[fieldname]

by: xhe | last post by:

I have just upgraded my php version form php4 to php5. and I met this problem, and don't know if you know the solution. My site was written in PHP4, and most parts can be running smoothly in PHP5, only that in old version, I can use $row to access the data in database directly, no need to put double quote around fieldname. BUT in PHP5, this is wrong, I got error message "undefined constant". I know this is because PHP5 see the fieldname...

PHP

2971

Code to print each part of double as a separate group of bits

by: Virtual_X | last post by:

As in IEEE754 double consist of sign bit 11 bits for exponent 52 bits for fraction i write this code to print double parts as it explained in ieee754 i want to know if the code contain any bug , i am still c++ beginner

C / C++

9688

What is ONU?

by: marktang | last post by:

ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However, people are often confused as to whether an ONU can Work As a Router. In this blog post, we’ll explore What is ONU, What Is Router, ONU & Router’s main usage, and What is the difference between ONU and Router. Let’s take a closer look ! Part I. Meaning of...

General

10260

Maximizing Business Potential: The Nexus of Website Design and Digital Marketing

by: jinu1996 | last post by:

In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven tapestry of website design and digital marketing. It's not merely about having a website; it's about crafting an immersive digital experience that captivates audiences and drives business growth. The Art of Business Website Design Your website is...

Online Marketing

10030

Discussion: How does Zigbee compare with other wireless protocols in smart home applications?

by: tracyyun | last post by:

Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each protocol has its own unique characteristics and advantages, but as a user who is planning to build a smart home system, I am a bit confused by the choice of these technologies. I'm particularly interested in Zigbee because I've heard it does some...

General

9078

AI Job Threat for Devs

by: agi2029 | last post by:

Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then launch it, all on its own.... Now, this would greatly impact the work of software developers. The idea...

Career Advice

7570

Access Europe - Using VBA to create a class based on a table - Wed 1 May

by: isladogs | last post by:

The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new presenter, Adolph Dupré who will be discussing some powerful techniques for using class modules. He will explain when you may want to use classes instead of User Defined Types (UDT). For example, to manage the data in unbound forms. Adolph will...

Microsoft Access / VBA

5467

Trying to create a lan-to-lan vpn between two differents networks

by: TSSRALBI | last post by:

Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The last exercise I practiced was to create a LAN-to-LAN VPN between two Pfsense firewalls, by using IPSEC protocols. I succeeded, with both firewalls in the same network. But I'm wondering if it's possible to do the same thing, with 2 Pfsense firewalls...

Networking - Hardware / Configuration

5590

Windows Forms - .Net 8.0

by: adsilva | last post by:

A Windows Forms form does not have the event Unload, like VB6. What one acts like?

Visual Basic .NET

4146

transfer the data from one system to another through ip address

by: 6302768590 | last post by:

Hai team i want code for transfer the data from one system to another through IP address by using C# our system has to for every 5mins then we have to update the data what the data is updated we have to send another system

C# / C Sharp

3762

How to add payments to a PHP MySQL app.

by: muto222 | last post by:

How can i add a mobile payment intergratation into php mysql website.

PHP