is any work being done to fix/improve PHP's string handling beyond 8 bits?

Last year I asked a bunch of questions about character encoding on this
newsgroup. All the answers came down to using ord() in creative ways to
try to make guesses about multi-byte characters. I was a little amazed
at this and wondered if I'd somehow misunderstood the situation.

I'm pleased to find that Joel Spolsky shared my amazement and offered
some criticism of PHP on these grounds: "When I discovered that the
popular web development tool PHP has almost complete ignorance of
character encoding issues, blithely using 8 bits for characters, making
it darn near impossible to develop good international web applications,
I thought, enough is enough."

But his essay is a year older than even the questions I had last year.
So I'm left wondering, is any work being done to fix the situation? I
just looked at http://us2.php.net/manual/en/ref.strings.php and saw no
new functions for handling multi-byte characters. Is anything being
done on this front?

And why aren't a lot of people asking these questions? Once again I'm
wondering if perhaps I've misunderstood something, somewhere. Isn't
this an issue that affects pretty much all of us using PHP on the web?
How are any of the people reading this post dealing with their own
character encoding issues?

Joel Spolsky's essay is here:

http://www.joelonsoftware.com/articles/Unicode.html

Jul 17 '05 #1
On 23 May 2005 14:06:21 -0700, lk******@geocities.com wrote:
> Last year I asked a bunch of questions about character encoding on this
> newsgroup. All the answers came down to using ord() in creative ways to
> try to make guesses about multi-byte characters. I was a little amazed
> at this and wondered if I'd somehow misunderstood the situation.
Well - your questions, if I recall, were less about PHP supporting multibyte
strings. Rather, you were receiving strings from external sources with no
well-defined encoding, or worse, they were coming in with an encoding different
from that defined by the originating page (the main current browsers handle
this badly), and so you were forced to try heuristics to identify the unknown
encoding of a series of bytes.

Once you know what encoding a string is in, then PHP has wide support for
character set encodings.
> I'm pleased to find that Joel Spolsky shared my amazement and offered
> some criticism of PHP on these grounds: "When I discovered that the
> popular web development tool PHP has almost complete ignorance of
> character encoding issues, blithely using 8 bits for characters, making
> it darn near impossible to develop good international web applications,
> I thought, enough is enough."
>
> But his essay is a year older than even the questions I had last year.
> So I'm left wondering, is any work being done to fix the situation? I
> just looked at http://us2.php.net/manual/en/ref.strings.php and saw no
> new functions for handling multi-byte characters. Is anything being
> done on this front?
That's because they're all in the Multibyte String section.

http://uk.php.net/mbstring
> And why aren't a lot of people asking these questions? Once again I'm
> wondering if perhaps I've misunderstood something, somewhere. Isn't
> this an issue that affects pretty much all of us using PHP on the web?
> How are any of the people reading this post dealing with their own
> character encoding issues?
>
> Joel Spolsky's essay is here:
>
> http://www.joelonsoftware.com/articles/Unicode.html


The one key sentence in there is:

"It does not make sense to have a string without knowing what encoding it
uses."

Absolutely.

PHP's "string" datatype is a bit of a misnomer; it's more like a "series of
bytes" datatype. The "plain" string functions, as in C, assume a single byte
encoding, and are pretty dumb about the mapping between that and characters.
Where there's any significance, some functions take a character set encoding
parameter, or default to ISO-8859-1. You have to keep track of what encoding
you're storing in strings.

mbstring puts a bit more intelligence into it, since it knows about more
character set encodings, e.g. it can give you counts of characters for
multibyte encoded strings, or convert between encodings. But you still need to
know what encoding each string is in.
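
As a minimal sketch of that difference, assuming the mbstring extension is loaded (the sample string is only an illustration, written with byte escapes so the example does not depend on how the source file is saved):

```php
<?php
// "naïve" in UTF-8: the ï is encoded as two bytes, 0xC3 0xAF.
$utf8 = "na\xC3\xAFve";

echo strlen($utf8), "\n";             // counts bytes: 6
echo mb_strlen($utf8, 'UTF-8'), "\n"; // counts characters: 5

// Convert to ISO-8859-1, where every character is a single byte.
// Argument order is (string, to-encoding, from-encoding).
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
echo strlen($latin1), "\n";           // 5
```

Note that both mbstring calls are told explicitly that the input is UTF-8; nothing in the string itself records that.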

Multibyte strings are still second-class citizens in PHP, but saying it has no
support for them is just wrong; mbstring has been around for ages. There's even
an option (mbstring.func_overload) that replaces the builtin single-byte
functions with multibyte-aware equivalents.

http://uk.php.net/manual/en/ref.mbst...tring.overload

You can still work with UTF-8 strings without mbstring, anyway. It just
depends what operations you perform on them. Concatenation is unaffected, as is
printing. Counting characters requires a multibyte aware function, but if you
never use strlen() on the strings, it doesn't matter what encoding they're in.
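
A rough sketch of both points, using only byte-oriented functions. The helper utf8_char_count() is a made-up name for illustration, not a PHP builtin, and it assumes its input is already valid UTF-8:

```php
<?php
// Hypothetical helper (not a PHP builtin): counts the characters in a string
// that is assumed to be valid UTF-8, without using mbstring.
function utf8_char_count($s)
{
    $count = 0;
    $len = strlen($s);              // length in bytes
    for ($i = 0; $i < $len; $i++) {
        // Continuation bytes look like 10xxxxxx; any other byte
        // starts a new character.
        if ((ord($s[$i]) & 0xC0) !== 0x80) {
            $count++;
        }
    }
    return $count;
}

$a = "caf\xC3\xA9";                 // "café" in UTF-8 (5 bytes)
$b = " \xE2\x82\xAC" . "5";         // " €5" in UTF-8 (5 bytes)
$joined = $a . $b;                  // plain concatenation is safe

echo strlen($joined), "\n";          // 10 (bytes)
echo utf8_char_count($joined), "\n"; // 7 (characters)
```

The concatenation works because it just appends bytes; only the counting step needs to know the encoding.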

If you want regular expressions, then the PCRE regexes have the "u" modifier
that treats the input as UTF-8.
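
For instance, a small sketch of what the "u" modifier changes (the sample string is an assumption chosen for illustration):

```php
<?php
$s = "caf\xC3\xA9";   // "café" in UTF-8: 4 characters, 5 bytes

// Without "u", "." matches one byte at a time.
$bytes = preg_match_all('/./s', $s, $m);

// With "u", the subject is treated as UTF-8 and "." matches one character.
$chars = preg_match_all('/./su', $s, $m);

echo $bytes, "\n";    // 5
echo $chars, "\n";    // 4

// A common idiom: split a UTF-8 string into an array of characters.
$parts = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
echo count($parts), "\n";   // 4
```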

So it all looks pretty well covered.

Perl only recently (in 5.8) finished the transition to natively supporting
utf8 strings (a process that began a long time ago). Strings in Perl are now
either a series of bytes of undefined encoding (i.e. C or PHP-style strings),
or have a utf8 flag set indicating they're UTF-8 encoded, which the builtin
string functions are aware of and so return the correct results in terms of
characters.

That's one step up from PHP, since strings carry around some metadata with
them on their encoding, at least if they're UTF-8.

--
Andy Hassall / <an**@andyh.co.uk> / <http://www.andyh.co.uk>
<http://www.andyhsoftware.co.uk/space> Space: disk usage analysis tool
Jul 17 '05 #2
Thanks again for all the help. You summarized the problem I faced last
year well. I didn't know about the Multibyte String section of the PHP
manual. I'm sad that the extension is optional. I'll spend tonight
reading that section.

Jul 17 '05 #3
If there's one person who's qualified to talk about multilingual
programming and PHP, that person would be me. In the last couple years
I have been working on a content management system dealing with
materials in such languages as Korean, Pashto, Georgian, Ethiopic, and
Chechen. And let me tell you, whether the server-side technology you
use can "natively" support Unicode is the least of your problems.

PHP is basically encoding agnostic. By and large, this is good enough.
Most of the issues you encounter in multilingual application
development are on the display side. For example, how to get the page
layout to look correct when you have to flip it for a right-to-left
language. Only rarely does the server-side application need to
"understand" what it sends or receives. By default, you can't do much
with the text in a multilingual situation, because the scripts behave
so differently. In our application, for instance, we have to ask our
users to enter the word count, because for languages like Chinese where
no spaces appear between words, the computer can't do it automatically.

If you ask me, the 8-bit strings in PHP cut both ways. There are
occasions when I wish I could get the Unicode value of a specific
character (quite difficult in standard PHP). Yet there are also times
when I appreciate the fact that PHP isn't fiddling with the text that's
given.
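
To give an idea of what getting the Unicode value involves with plain byte strings, here is a minimal sketch. The function name utf8_first_code_point() is hypothetical (not a PHP builtin), and it assumes the input is non-empty, valid UTF-8 with no error checking:

```php
<?php
// Hypothetical helper (not a PHP builtin): returns the Unicode code point of
// the first character of $s, assuming $s is non-empty, valid UTF-8.
function utf8_first_code_point($s)
{
    $b0 = ord($s[0]);
    if ($b0 < 0x80) {               // 0xxxxxxx: ASCII, one byte
        return $b0;
    }
    if (($b0 & 0xE0) === 0xC0) {    // 110xxxxx: two-byte sequence
        return (($b0 & 0x1F) << 6) | (ord($s[1]) & 0x3F);
    }
    if (($b0 & 0xF0) === 0xE0) {    // 1110xxxx: three-byte sequence
        return (($b0 & 0x0F) << 12)
             | ((ord($s[1]) & 0x3F) << 6)
             | (ord($s[2]) & 0x3F);
    }
    // 11110xxx: four-byte sequence
    return (($b0 & 0x07) << 18)
         | ((ord($s[1]) & 0x3F) << 12)
         | ((ord($s[2]) & 0x3F) << 6)
         | (ord($s[3]) & 0x3F);
}

printf("U+%04X\n", utf8_first_code_point("A"));            // U+0041
printf("U+%04X\n", utf8_first_code_point("\xC3\xA9"));     // U+00E9 (é)
printf("U+%04X\n", utf8_first_code_point("\xE2\x82\xAC")); // U+20AC (€)
```

Each branch strips the length-marker bits from the lead byte and then appends the low six bits of every continuation byte.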

Jul 17 '05 #4
I don't think the problem is that PHP focuses on 8 bit strings, I think
the problem is the lack of default, built-in functions for dealing with
multibyte strings.

Jul 17 '05 #5
Yeah, a set of functions that treats a regular string as a UTF-16
string would be quite useful.

Jul 17 '05 #6
Chung Leong <ch***********@hotmail.com> wrote:
<snip>
> There are
> occasions when I wish I can get the Unicode value of a specific
> character (quite difficult in standard PHP).


For example? Just curious... (I guess you aren't referring to UTF-8 to
UTF-32 conversion.)

--
<?php echo 'Just another PHP saint'; ?>
Email: rrjanbiah-at-Y!com Blog: http://rajeshanbiah.blogspot.com

Jul 17 '05 #7
