String Validation With UTF-8 Support

Hello,

I am looking for a way to check whether a string contains only word
characters and the plain space character (not any other whitespace
character), *regardless of the current locale*. In other words, any
character that is a word character in any locale should be allowed.
This check:

preg_match('/^[\w ]*$/', $_GET['whatever']);

in which the $_GET variable contains a UTF-8 encoded string, only
seems to recognize the word characters of whatever locale is currently
set. Of course, I could change the locale using setlocale(), but that
would still limit the check to a subset of all possible input values.
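
For the locale problem described above, one option that does not depend on setlocale() is PCRE's UTF-8 mode with Unicode property classes. The following is only a minimal sketch, assuming PHP 7 or later and a PCRE build with Unicode property support; is_word_string() is a hypothetical helper name, not something from this thread:

```php
<?php
// Sketch: accept only letters, digits, underscores and the plain space
// character (U+0020), independent of the current locale.
// is_word_string() is a hypothetical helper name.
function is_word_string(string $input): bool
{
    // \p{L} = any Unicode letter, \p{N} = any Unicode number.
    // "u" makes PCRE treat pattern and subject as UTF-8.
    // "D" stops "$" from also matching before a trailing newline.
    // With "u", preg_match() returns false for a subject that is not
    // valid UTF-8, so such input is rejected as well.
    return preg_match('/^[\p{L}\p{N}_ ]*$/uD', $input) === 1;
}

// Assumes this file is saved as UTF-8.
var_dump(is_word_string('Grüße aus Köln')); // bool(true)
var_dump(is_word_string("tab\tseparated")); // bool(false)
```

Accented letters written in decomposed form (base letter plus combining mark) would additionally need \p{M} in the character class.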

I also created this function from information that I found on the web:

--------------------------------
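function is_utf8($_string) {
    // Each alternative matches one well-formed UTF-8 byte sequence.
    return preg_match('/^([\x00-\x7f]|'          // 1 byte: ASCII
        . '[\xc2-\xdf][\x80-\xbf]|'              // 2 bytes
        . '\xe0[\xa0-\xbf][\x80-\xbf]|'          // 3 bytes, no overlong forms
        . '[\xe1-\xec][\x80-\xbf]{2}|'           // 3 bytes
        . '\xed[\x80-\x9f][\x80-\xbf]|'          // 3 bytes, no surrogates
        . '[\xee-\xef][\x80-\xbf]{2}|'           // 3 bytes
        . '\xf0[\x90-\xbf][\x80-\xbf]{2}|'       // 4 bytes, no overlong forms
        . '[\xf1-\xf3][\x80-\xbf]{3}|'           // 4 bytes
        . '\xf4[\x80-\x8f][\x80-\xbf]{2})*$/',   // 4 bytes, up to U+10FFFF
        $_string) > 0;
}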
--------------------------------

However, this does not seem to be completely accurate, as it still
allows characters such as these:

http://debain.org/software/tefinch/d...214&forum_id=1
(sorry for the external link, I just don't know how to create such
characters here.)

According to the W3C Validator, those characters are still invalid.
http://validator.w3.org/check?uri=ht...tomatically%29

I know there must be an answer somewhere on the web already, but I have
not found any reference either on Google or in the archives of this
newsgroup.

Any help appreciated.

-Samuel

Oct 6 '05 #1


Hi!

I hope I understood your problem correctly. In the user-contributed
notes of the PHP Manual there's a very good function to validate UTF-8
encoded data.

http://de3.php.net/manual/en/functio...code.php#48160

It works perfectly for me. The function returns false when the given
text contains byte sequences that are not valid UTF-8, for example
ISO-8859-1/ANSI characters with byte values above 127. If your web page
declares the correct meta tag (charset UTF-8) or sends the corresponding
header (see the default_charset setting in php.ini), the browser should
then send you UTF-8 encoded data.
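
As a rough illustration of that kind of validity check, here is a minimal sketch using only built-in functions rather than the function from the manual note. It assumes PHP 7 or later; is_valid_utf8() is a hypothetical helper name:

```php
<?php
// Sketch: test whether a byte string is well-formed UTF-8.
// is_valid_utf8() is a hypothetical helper name, not the function
// from the manual note linked above.
function is_valid_utf8(string $bytes): bool
{
    if (function_exists('mb_check_encoding')) {
        // Provided by the mbstring extension.
        return mb_check_encoding($bytes, 'UTF-8');
    }
    // Fallback: with the "u" modifier, preg_match() returns false
    // (not 0) when the subject is not valid UTF-8.
    return preg_match('//u', $bytes) === 1;
}

var_dump(is_valid_utf8("K\xc3\xb6ln")); // "Köln" encoded as UTF-8
var_dump(is_valid_utf8("K\xf6ln"));     // "Köln" encoded as ISO-8859-1
```

The first call should report true and the second false, because a lone 0xF6 byte is not a complete UTF-8 sequence.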

By the way, have a look at the mbstring extension. It provides a set of
multibyte-aware string functions to use in place of the standard PHP
functions, which don't support multi-byte character strings.
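
To show the difference in practice, here is a short sketch comparing a few byte-oriented functions with their mbstring counterparts. It assumes the mbstring extension is loaded; the sample string is made up for illustration:

```php
<?php
// Sketch: byte-oriented functions versus their mbstring counterparts.
// Requires the mbstring extension.
$city = "K\xc3\xb6ln"; // "Köln" encoded as UTF-8; the "ö" occupies two bytes

echo strlen($city), "\n";                   // 5: counts bytes
echo mb_strlen($city, 'UTF-8'), "\n";       // 4: counts characters
echo bin2hex(substr($city, 0, 2)), "\n";    // 4bc3: the byte cut splits the "ö"
echo mb_substr($city, 0, 2, 'UTF-8'), "\n"; // character cut keeps the "ö" whole
echo mb_strtoupper($city, 'UTF-8'), "\n";   // upper-cases the "ö" as well
```

Passing the encoding explicitly avoids relying on the configured internal encoding.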

Hope that helped you a bit.

Greetings,
Benjamin Wilger

Oct 7 '05 #2
