I'm worried about idiot users that write long essays in Microsoft Word,
then log into their accounts and bring up an HTML form and copy and
paste the essay and hit submit. Or perhaps they do this using
WordPerfect. Or perhaps they use MacWrite.
I output everything from my site as UTF-8. I'd like to check the input
for byte sequences that are not valid UTF-8 and then turn the bad ones
into an ASCII question mark. I could loop through a string as if it were
an array and test each character, but what does PHP think a character is?
Does PHP understand what a multi-byte character is? Would this work?
// $string is a form input, possibly containing characters
// written in any of the world's word processors
$finalString = "";
for ($i = 0; $i < strlen($string); $i++) {
    $char = $string[$i];
    $encoding = mb_detect_encoding($char);
    if ($encoding != "UTF-8") {
        $char = "?";
    }
    $finalString .= $char;
}
They offer the following on www.php.net, in the comments, but, again,
I'm not sure it would work on individual characters, and I'm not good
at reading regex.
===========================
Much simpler UTF-8-ness checker using a regular expression created by
the W3C:
<?php
// Returns true if $string is valid UTF-8 and false otherwise.
function is_utf8($string) {
    // From http://w3.org/International/question...rms-utf-8.html
    // preg_match() returns 1, 0, or false on error, so compare against 1
    // to get a real boolean.
    return preg_match('%^(?:
          [\x09\x0A\x0D\x20-\x7E]            # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )*$%xs', $string) === 1;
} // function is_utf8
?>
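
The checker above only says yes or no for the whole string; it does not
turn the bad bytes into question marks. One possible way to do that part
is sketched below. It is only a sketch under a few assumptions: PHP 7 or
later with the mbstring extension loaded, and a helper name,
replace_invalid_utf8, that is made up for this example and is not part of
the php.net comment. It leans on mb_convert_encoding() converting from
UTF-8 to UTF-8, which substitutes whatever it cannot decode with the
character set through mb_substitute_character().

```php
<?php
// Hypothetical helper (not from the php.net comment): returns $string
// with byte sequences that are not valid UTF-8 replaced by "?".
// Assumes the mbstring extension is available.
function replace_invalid_utf8(string $string): string {
    // Fast path: valid input is returned untouched.
    if (mb_check_encoding($string, 'UTF-8')) {
        return $string;
    }

    // Remember the current substitute character so it can be restored.
    $previous = mb_substitute_character();
    mb_substitute_character(0x3F); // 0x3F is the ASCII question mark

    // Converting UTF-8 to UTF-8 leaves valid sequences alone and swaps
    // undecodable bytes for the substitute character.
    $clean = mb_convert_encoding($string, 'UTF-8', 'UTF-8');

    mb_substitute_character($previous);
    return $clean;
}

// Example: curly quotes pasted from Word as Windows-1252 bytes 0x93/0x94.
$pasted = "\x93quoted\x94 text";
$safe = replace_invalid_utf8($pasted);
```

How many question marks a single broken sequence produces can differ
between PHP versions, so it is safer to rely only on the result being
valid UTF-8 than on an exact count. Note also that this works on the
whole string at once: indexing with $string[$i] yields single bytes, so
a per-index test never sees a complete multi-byte character.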