On 21 Jul 2005 00:29:54 -0700,
lkrubner@geocities.com wrote:
[color=blue]
>I output everything from my site as UTF-8. I'd like to check the input
>for characters that are not UTF-8 and then turn the bad ones to an
>ASCII question mark.[/color]
Déjà vu all over again. ;-p
[color=blue]
>I could loop through a string as if it was an
>array and test each character, but what does PHP think a character is?[/color]
PHP's string data type has no knowledge of character encodings. It treats
strings as a meaning-free series of bytes. Not characters.
[color=blue]
>Does PHP understand what a multi-byte character is?[/color]
No. The mbstring extension does, though.
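To illustrate the difference between bytes and characters, here is a minimal sketch. It assumes the mbstring extension is loaded; the variable name $s is just for illustration.

```php
<?php
// "é" (U+00E9) encoded as UTF-8 is the two bytes 0xC3 0xA9.
$s = "\xC3\xA9";

echo strlen($s), "\n";             // 2: strlen() counts bytes
echo bin2hex($s[0]), "\n";         // c3: indexing returns one byte, not one character
echo mb_strlen($s, 'UTF-8'), "\n"; // 1: mb_strlen() counts characters in the named encoding
```

So $string[$i] in a loop hands you half of the "é", which is meaningless on its own.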
[color=blue]
>Would this work?
>
>// the $string is a form input, possibly containing characters
>// written in any of the world's word processors
>$finalString = "";
>for ($i=0; $i < strlen($string); $i++) {
> $char = $string[$i];
> $encoding = mb_detect_encoding($char);
> if ($encoding != "UTF-8") {
> $char = "?";
> }
> $finalString .= $char;
>}[/color]
No. This goes byte-by-byte. There's no reason why mb_detect_encoding should
return UTF-8 for a single byte: anything below 128 could equally be ASCII, and
any other value could belong to some other encoding such as ISO-8859-15.
To find invalid UTF-8 encoded byte sequences you have to consider more than
one byte at a time.
As I believe was covered the previous times you've asked this:
You can tell whether a series of bytes is not a series of UTF-8 encoded
characters, by looking for byte sequences that are not valid UTF-8 - look for
lead bytes and the corresponding numbers of trail bytes.
Therefore, your current request (replace byte sequences that cannot be UTF-8
encodings with a "?" character) is quite possible, but you need to consider
more than one byte at a time and will probably have to backtrack a bit if you
get an invalid sequence.
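As a rough sketch of that approach (this is not the script from the earlier thread; the function name replaceInvalidUtf8 is made up for illustration), something along these lines reads a lead byte, checks the expected number of trail bytes, and backtracks to the byte after the lead if the sequence turns out to be bad. It also rejects overlong forms, surrogates and anything above U+10FFFF, which is an assumption about how strict you want to be.

```php
<?php
// Hypothetical helper: copy valid UTF-8 sequences, replace anything else with "?".
function replaceInvalidUtf8($in)
{
    $out = '';
    $len = strlen($in);
    $i = 0;
    while ($i < $len) {
        $b = ord($in[$i]);
        if ($b < 0x80) {                       // single byte, same as ASCII
            $out .= $in[$i];
            $i++;
            continue;
        }
        // Work out how many trail bytes this lead byte announces, and the
        // smallest code point that legitimately needs that many bytes.
        if ($b >= 0xC2 && $b <= 0xDF) {
            $trail = 1; $min = 0x80;    $cp = $b & 0x1F;
        } elseif ($b >= 0xE0 && $b <= 0xEF) {
            $trail = 2; $min = 0x800;   $cp = $b & 0x0F;
        } elseif ($b >= 0xF0 && $b <= 0xF4) {
            $trail = 3; $min = 0x10000; $cp = $b & 0x07;
        } else {
            $out .= '?';                       // stray trail byte or impossible lead byte
            $i++;
            continue;
        }
        // All announced trail bytes must exist and look like 10xxxxxx.
        $ok = ($i + $trail < $len);
        for ($k = 1; $ok && $k <= $trail; $k++) {
            $t = ord($in[$i + $k]);
            if (($t & 0xC0) !== 0x80) {
                $ok = false;
            } else {
                $cp = ($cp << 6) | ($t & 0x3F);
            }
        }
        // Reject overlong forms, UTF-16 surrogates and values past U+10FFFF.
        if ($ok && ($cp < $min || ($cp >= 0xD800 && $cp <= 0xDFFF) || $cp > 0x10FFFF)) {
            $ok = false;
        }
        if ($ok) {
            $out .= substr($in, $i, $trail + 1);
            $i += $trail + 1;
        } else {
            $out .= '?';                       // backtrack: resume right after the lead byte
            $i++;
        }
    }
    return $out;
}

echo replaceInvalidUtf8("caf\xE9 ok"), "\n";   // caf? ok
```

Resuming one byte after a bad lead byte means a broken multi-byte sequence can produce several "?" in a row; whether that or a single "?" per sequence is preferable is a design choice.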
In one previous incarnation of this thread I posted a script to detect invalid
UTF-8 byte sequences; looks like this could be quite easily adapted to your
current request:
http://groups.google.co.uk/group/com...0075dcf?hl=en&
Just remove the various returns that exit when it finds bad characters, and
instead, whenever $charSize drops to zero, append the completed character to a
string for output, or if it finds a bad encoding, append a "?".
However, you cannot tell whether a byte sequence is actually a series of UTF-8
characters, because it could be encoded in something else that happens to share
the same byte representation.
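A small example of that ambiguity, assuming the mbstring extension is available (the variable names are only for illustration): the same two bytes pass as UTF-8 and as ISO-8859-1, but mean different text in each.

```php
<?php
// Two bytes: "é" if read as UTF-8, "Ã©" if read as ISO-8859-1.
$bytes = "\xC3\xA9";

var_dump(mb_check_encoding($bytes, 'UTF-8'));      // bool(true)
var_dump(mb_check_encoding($bytes, 'ISO-8859-1')); // bool(true)

// Treating the bytes as ISO-8859-1 and converting to UTF-8 gives
// two characters (U+00C3, U+00A9) instead of one.
$asLatin1 = mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');
echo bin2hex($asLatin1), "\n";                     // c383c2a9
```

Validation can only tell you the input is not UTF-8; it cannot confirm that the sender meant UTF-8.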
--
Andy Hassall / <andy@andyh.co.uk> / <http://www.andyh.co.uk>
<http://www.andyhsoftware.co.uk/space> Space: disk usage analysis tool