
Chinese character detection

Hi, I have a website which contains both Chinese and English content,
stored in a database. Each record in the DB has an English field and a
Chinese field. When a user enters a search string, I have to be able
to detect which characters are Latin-based and which are Chinese
ideographs.

E.g. a user may enter "hello 新闻网 world".

This is because many Chinese search phrases (especially those involving
technology) may include English words or acronyms. E.g. I think MP3
in Chinese is MP三, as MP is an English acronym with the number 3 after
it, which in Chinese is 三 (I may be wrong, my written Chinese is
non-existent :-) but that's just an example).

To make an effective search on the Chinese field, I cannot just put
Latin characters through the same search process, as it would detract
from the effectiveness of the search.

What I need, from the search string (hello 新闻网 world), is a PHP
function that will give me an array telling me whether each character in
the string is Chinese or not (I do not need to know if it is
punctuation or any other kind of character, just yes, Chinese, or no,
something else).

All of my DB fields are UTF-8. I looked at finding out the range of
Han characters in UTF-8 encoding, but it seems very complicated. If
anyone can help out, I'd appreciate it.

Regards

Simon
Oct 15 '08 #1
On Oct 15, 11:45 pm, Wassy <si...@wass1.entadsl.com> wrote:
[original post quoted in full; snipped]

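Something like this:

function is_non_ascii($str) {
    // Assumes $str is UTF-8; the encoding is passed explicitly so the
    // mb_* calls do not depend on the server's default settings.
    $length = mb_strlen($str, 'UTF-8');
    for ($i = 0; $i < $length; ++$i) {
        $char = mb_substr($str, $i, 1, 'UTF-8');
        // ord() returns the first byte; anything above 0x7F is non-ASCII.
        if (ord($char) > 0x7F)
            return true;
    }
    // Note: this separates ASCII from non-ASCII only; a non-ASCII
    // character is not necessarily Chinese.
    return false;
}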
Oct 16 '08 #2
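
For the per-character array the original question asks for, one option is PCRE's Unicode script property instead of hand-written byte or code point ranges. The following is only a sketch under some assumptions: PHP 7 or later, a PCRE build with Unicode property support, valid UTF-8 input, and "Chinese" taken to mean "in the Han script" (which also covers Han characters used in Japanese and Korean text). The function name han_flags() is made up for this example.

```php
<?php
// Hypothetical helper: returns one boolean per character of $str,
// true when the character belongs to the Unicode Han script.
function han_flags(string $str): array {
    // Split into characters; the /u modifier makes PCRE read the input as UTF-8.
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return []; // preg_split() fails on invalid UTF-8
    }
    $flags = [];
    foreach ($chars as $char) {
        // \p{Han} matches any character whose script property is Han.
        $flags[] = preg_match('/^\p{Han}$/u', $char) === 1;
    }
    return $flags;
}

var_dump(han_flags("MP三"));
```

For "MP三" this yields false, false, true. Spaces, punctuation and Latin letters all come back as false, which matches the "yes Chinese or no something else" requirement.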
Wassy wrote:
Hi, I have a website which contains both Chinese and English content,
stored in a database. Each record in the DB has an English field and a
Chinese field. When a user enters a search string, I have to be able
to detect which characters are Latin-based and which are Chinese
ideographs.

Very dirty tricks I can think of:

1. Convert the input to a non-Chinese charset (ISO-8859-1, say),
convert it back, and compare the result with the original. If they're
equal, nothing was lost and it's possibly English. You may use
utf8_decode() (deprecated as of PHP 8.2) or iconv().

2. Compare the string length using a Unicode-aware function and a
byte-only function. If they're equal, it's a single-byte string and it's
possibly English. Try strlen() and mb_strlen().

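3. I found this in Google Code Search [1]; it's from a piece of software
called Mushu. The original used ereg(), which was removed in PHP 7, so
it is shown here with preg_match():

function is_chinese($str) {
    // True when every byte is in 0xA1-0xFF. That fits a double-byte encoding
    // such as GB2312 (EUC-CN); it is not a reliable test for UTF-8 input.
    return preg_match('/^[\xA1-\xFF]+$/', $str) === 1;
}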
[1] http://www.google.com/codesearch
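
Tricks 1 and 2 above could look roughly like this. It is a sketch assuming the mbstring extension is loaded and the input is UTF-8; the function names are invented for the example, and mb_convert_encoding() stands in for utf8_decode()/iconv(). Both checks only say whether the string contains something outside the simpler charset, so Cyrillic or Japanese kana would trip them just as Chinese does.

```php
<?php
// Trick 2: if byte length and character length differ, the string
// contains at least one multibyte (non-ASCII) character.
function has_multibyte(string $str): bool {
    return strlen($str) !== mb_strlen($str, 'UTF-8');
}

// Trick 1: round-trip through ISO-8859-1. Characters that do not exist
// in that charset are replaced (with "?" by default), so the result
// differs from the original when anything was lost.
function survives_latin1(string $str): bool {
    $latin1 = mb_convert_encoding($str, 'ISO-8859-1', 'UTF-8');
    $back   = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
    return $back === $str;
}

var_dump(has_multibyte("hello 新闻网 world"));   // multibyte characters present
var_dump(survives_latin1("hello 新闻网 world")); // Chinese is lost in ISO-8859-1
```

Note the difference between the two: a word like "café" has a multibyte character in UTF-8 but still survives the ISO-8859-1 round trip.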
--
-- http://alvaro.es - Álvaro G. Vicario - Burgos, Spain
-- My site about web programming: http://bits.demogracia.com
-- My humor site (al baño María): http://www.demogracia.com
--
Oct 16 '08 #3
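
Since the goal in the first post is to send the Chinese and the Latin parts of a query through different search processes, it may be more convenient to split the search string into runs rather than flag single characters. A rough sketch, assuming PHP 7 or later, valid UTF-8 input and PCRE with Unicode property support; split_by_script() and its 'han'/'other' keys are hypothetical names, not part of any library.

```php
<?php
// Hypothetical helper: splits a UTF-8 search string into runs of Han
// characters and runs of everything else. Whitespace is dropped.
function split_by_script(string $query): array {
    $parts = ['han' => [], 'other' => []];
    // Each match is a run of Han characters or a run of non-Han, non-space characters.
    if (!preg_match_all('/\p{Han}+|[^\p{Han}\s]+/u', $query, $matches)) {
        return $parts; // nothing matched, or the input was not valid UTF-8
    }
    foreach ($matches[0] as $run) {
        // A run is homogeneous, so checking its first character is enough.
        $key = preg_match('/^\p{Han}/u', $run) === 1 ? 'han' : 'other';
        $parts[$key][] = $run;
    }
    return $parts;
}

print_r(split_by_script("hello 新闻网 world MP三"));
```

With that input, 'han' holds 新闻网 and 三, while 'other' holds hello, world and MP, so each group can go to its own search routine.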
