
Chinese character detection

Hi, I have a website that contains both Chinese and English content,
which is stored in a database. Each record in the DB has an English
field and a Chinese field. When a user enters a search string, I have
to be able to detect which characters are Latin-based and which are
Chinese ideographs.

E.g. a user may enter "hello 新闻网 world".

This is because many Chinese search phrases (especially those involving
technology) may include English words or acronyms. E.g. I think MP3
in Chinese is MP三, as MP is an English acronym with the number 3 after
it, which in Chinese is 三 (I may be wrong, my written Chinese is
non-existent :-) but that's just an example).

To make an effective search on the Chinese field, I cannot just put
Latin characters through the same search process, as that would detract
from the effectiveness of the search.

What I need, given the search string (hello 新闻网 world), is a PHP
function that will give me an array telling me whether each character
in the string is Chinese or not. (I do not need to know whether it is a
punctuation symbol or any other kind of character, just yes, Chinese,
or no, something else.)

All of my DB fields are UTF-8. I looked at finding out the range of
Han characters in UTF-8 encoding, but it seems very complicated. If
anyone can help out, I'd appreciate it.

Regards

Simon
Oct 15 '08 #1
2 Replies


Something like this:
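// Returns true if $str contains at least one non-ASCII character.
function is_non_ascii($str){
    $length = mb_strlen($str, 'UTF-8');
    for($i = 0; $i < $length; ++$i){
        $char = mb_substr($str, $i, 1, 'UTF-8');
        // Compare the value of the character's first byte, not the string itself.
        if(ord($char) > 0x7F)
            return true;
    }
    return false;
}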
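
That check only separates ASCII from everything else. For the per-character yes/no array asked for in the question, here is a minimal sketch, assuming valid UTF-8 input and a PCRE build with Unicode property support (needed for \p{Han}). The name han_flags() is made up for illustration. Note that \p{Han} covers the Han script as a whole, so it also matches kanji in Japanese text.

```php
<?php
// Hypothetical helper: splits a UTF-8 string into characters and flags each
// one as Han (true) or anything else (false).
function han_flags($str) {
    // The "u" modifier makes PCRE treat the subject as UTF-8; an empty
    // pattern with PREG_SPLIT_NO_EMPTY yields one element per character.
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return array(); // preg_split() fails on invalid UTF-8
    }
    $flags = array();
    foreach ($chars as $char) {
        // \p{Han} matches characters belonging to the Unicode Han script.
        $flags[] = preg_match('/^\p{Han}$/u', $char) === 1;
    }
    return $flags;
}

$flags = han_flags("hi 新闻 3");
// $flags is now: false, false, false, true, true, false, false
```

The array is parallel to the characters of the input, so the Chinese and Latin parts of a search string can be picked out by index and sent to different search paths.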
Oct 16 '08 #2

Wassy wrote:
Hi, I have a website that contains both Chinese and English content,
which is stored in a database. Each record in the DB has an English
field and a Chinese field. When a user enters a search string, I have
to be able to detect which characters are Latin-based and which are
Chinese ideographs.
Very dirty tricks I can think of:

1. Convert the input to a non-chinese charset and compare it back with
the original. If they're equal, it's possibly English. You may use
utf8_decode() or iconv().

2. Compare the string length using a Unicode-aware function and a
byte-only function. If they're equal, it's a single-byte string and it's
possibly English. Try mb_strlen() and strlen().

3. I found this in Google Code Search [1], it's from a piece of software
called Mushu:

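// ereg() was removed in PHP 7, so this uses preg_match(). It tests raw bytes
// 0xA1-0xFF (as in EUC-CN/GB2312 text), so it does not fit UTF-8 input.
function is_chinese($str) {
return preg_match('/^[\xa1-\xff]+$/', $str) === 1;
}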
[1] http://www.google.com/codesearch
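
Tricks 1 and 2 can be sketched roughly as follows, assuming valid UTF-8 input and the mbstring extension. The function names are made up for illustration, and trick 1 here uses mb_convert_encoding() instead of utf8_decode() or iconv(), because it replaces unconvertible characters with "?" rather than failing. Both only tell ASCII from non-ASCII; they say nothing about whether a character is actually Chinese.

```php
<?php
// Trick 2: in UTF-8 every non-ASCII character takes more than one byte, so
// equal byte and character counts mean the string is pure ASCII.
function is_single_byte($str) {
    return strlen($str) === mb_strlen($str, 'UTF-8');
}

// Trick 1: convert to a charset with no Chinese characters and compare with
// the original. Characters that do not fit become "?", so the result differs
// whenever the input held anything outside ASCII.
function survives_ascii($str) {
    return mb_convert_encoding($str, 'ASCII', 'UTF-8') === $str;
}

var_dump(is_single_byte("hello world"));        // bool(true)
var_dump(is_single_byte("hello 新闻网 world")); // bool(false)
var_dump(survives_ascii("MP三"));               // bool(false)
```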
--
-- http://alvaro.es - Álvaro G. Vicario - Burgos, Spain
-- My site about web programming: http://bits.demogracia.com
-- My bain-marie humor site: http://www.demogracia.com
--
Oct 16 '08 #3
