Bytes IT Community

detect language

Hello
I have a UTF string, how can I detect what language it is?
thanks
from Peter (cm****@hotmail.com)
Aug 30 '08 #1
7 Replies


On 30 Aug, 05:04, Peter <cmk...@hotmail.com> wrote:
> Hello
> I have a UTF string, how can I detect what language it is?
> thanks
> from Peter (cmk...@hotmail.com)

This is a difficult one, since many words are the same in more than
one language. You would also need a large pool of words, each tagged
with its language, so that you could compare each word in your string
against them and make a guess at the language.

The fact that the string is UTF has no real relevance to the language
in which the string is written, unless it uses a non-Latin script.
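
The word-pool idea above can be sketched roughly as follows. This is Python rather than PHP, and the tiny word lists and the name guess_language are made up for illustration; a usable detector would need far larger lists and would still only be guessing.

```python
import re

# Tiny illustrative word lists (assumption); real lists would be much larger.
COMMON_WORDS = {
    "english": {"the", "and", "is", "of", "to", "in"},
    "french": {"le", "la", "et", "est", "de", "les"},
    "german": {"der", "die", "und", "ist", "das", "nicht"},
}

def guess_language(text):
    """Return the language whose word list matches the most words, or None."""
    # [^\W\d_]+ matches runs of letters, including accented ones.
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        lang: sum(word in vocab for word in words)
        for lang, vocab in COMMON_WORDS.items()
    }
    best = max(scores, key=scores.get)
    # No match at all means we cannot even guess.
    return best if scores[best] > 0 else None
```

Short strings and words shared between languages will easily fool a scorer like this, which is the difficulty described above.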
Aug 30 '08 #2

AqD
Peter wrote:
> Hello
> I have a UTF string, how can I detect what language it is?
> thanks
> from Peter (cm****@hotmail.com)

A UTF string can contain characters from more than one language. You
can find out which scripts are used by scanning the string and
checking every character (see the [mbstring] extension) for the
Unicode block it belongs to.
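
As a rough sketch of that character-by-character scan, here is one way to do it in Python (not with mbstring). It uses the first word of each character's Unicode name as an approximate script label, which is a shortcut and an assumption of this example, not an official script property; script_counts is a hypothetical helper name.

```python
import unicodedata
from collections import Counter

def script_counts(text):
    """Count letters per rough script label (first word of the Unicode name)."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, whitespace and punctuation
        name = unicodedata.name(ch, "")  # "" for characters without a name
        if name:
            counts[name.split()[0]] += 1
    return counts
```

For a string mixing Latin letters and Han characters the counter ends up with separate "LATIN" and "CJK" entries, which tells you which scripts appear but still not which language.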

Sep 2 '08 #3

On Aug 30, 5:04 am, Peter <cmk...@hotmail.com> wrote:
> Hello
> I have a UTF string, how can I detect what language it is?
> thanks
> from Peter (cmk...@hotmail.com)

You can't. Only a person reading it could tell for sure what language
it is. If, however, you want to know the language a browser visiting
your site is configured to use you could look at the Accept-Language
header.
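
For the Accept-Language route, a minimal Python sketch of reading the header value might look like this. The function name parse_accept_language is hypothetical, and the parsing is simplified: it only understands the tag;q=weight form and ignores anything more exotic.

```python
def parse_accept_language(header):
    """Return language tags from an Accept-Language value, most preferred first."""
    prefs = []
    for part in header.split(","):
        piece = part.strip()
        if not piece:
            continue
        tag, _, params = piece.partition(";")
        quality = 1.0  # default weight when no q= parameter is given
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0  # treat a malformed weight as least preferred
        prefs.append((tag.strip().lower(), quality))
    # list.sort is stable, so equal weights keep their header order
    prefs.sort(key=lambda pref: pref[1], reverse=True)
    return [tag for tag, _ in prefs]
```

This tells you what the browser is configured to prefer, not what language a given string is written in.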
Sep 2 '08 #4

On Sep 2, 9:32 am, Gordon <gordon.mc...@ntlworld.com> wrote:
> On Aug 30, 5:04 am, Peter <cmk...@hotmail.com> wrote:
>> Hello
>> I have a UTF string, how can I detect what language it is?
>> thanks
>> from Peter (cmk...@hotmail.com)
>
> You can't. Only a person reading it could tell for sure what language
> it is. If, however, you want to know the language a browser visiting
> your site is configured to use you could look at the Accept-Language
> header.

Short version: how do I detect whether a UTF-8 character is Chinese
(CJK) or NOT?

Long version: I have a database which contains articles stored in
UTF-8 format. The website is multilingual, so I have two database
fields, one for Chinese and one for English. The problem is that the
client wants the search to be seamless: if the user enters Chinese
characters it should search the Chinese field, and if there are
English characters it should search the English field. I am aware
that in some cases the Chinese use English (Latin) characters for
things like acronyms, and that my Chinese records may contain Latin
characters. But what I NEED is to run the search string through a
function that can look at each character and tell me whether that
character is Chinese or not (whether 'not' is Latin, punctuation or
anything else). I am simply looking to detect whether the character
is Chinese.
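
A per-character check of that kind could be sketched like this in Python, by comparing the code point against the main Han ideograph blocks. The names is_cjk and contains_cjk are made up for the example, and the range list is deliberately not exhaustive (later extension blocks and CJK punctuation are left out). Note that these ideographs are shared by Chinese, Japanese and Korean, so this detects the script rather than the Chinese language as such.

```python
# Code point ranges for the main Han ideograph blocks (not exhaustive).
CJK_RANGES = (
    (0x3400, 0x4DBF),    # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs
    (0xF900, 0xFAFF),    # CJK Compatibility Ideographs
    (0x20000, 0x2A6DF),  # CJK Unified Ideographs Extension B
)

def is_cjk(ch):
    """True if the single character ch falls in one of the ranges above."""
    code = ord(ch)
    return any(low <= code <= high for low, high in CJK_RANGES)

def contains_cjk(text):
    """True if any character of text is a Han ideograph."""
    return any(is_cjk(ch) for ch in text)
```

In PHP the same test is usually written with a Unicode-aware regular expression instead of explicit ranges.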

regards

Simon
Oct 15 '08 #5

On Oct 15, 8:59 pm, Wassy <si...@wass1.entadsl.com> wrote:
> On Sep 2, 9:32 am, Gordon <gordon.mc...@ntlworld.com> wrote:
>> On Aug 30, 5:04 am, Peter <cmk...@hotmail.com> wrote:
>>> Hello
>>> I have a UTF string, how can I detect what language it is?
>>> thanks
>>> from Peter (cmk...@hotmail.com)
>> You can't. Only a person reading it could tell for sure what language
>> it is. If, however, you want to know the language a browser visiting
>> your site is configured to use you could look at the Accept-Language
>> header.
>
> Short version: how do I detect whether a UTF-8 character is Chinese
> (CJK) or NOT?
>
> Long version: I have a database which contains articles stored in
> UTF-8 format. The website is multilingual, so I have two database
> fields, one for Chinese and one for English. The problem is that the
> client wants the search to be seamless: if the user enters Chinese
> characters it should search the Chinese field, and if there are
> English characters it should search the English field. I am aware
> that in some cases the Chinese use English (Latin) characters for
> things like acronyms, and that my Chinese records may contain Latin
> characters. But what I NEED is to run the search string through a
> function that can look at each character and tell me whether that
> character is Chinese or not (whether 'not' is Latin, punctuation or
> anything else). I am simply looking to detect whether the character
> is Chinese.
>
> regards
>
> Simon

Why not just search the whole database? If the user enters Chinese
characters, then articles containing matching characters will be
returned; if nothing matches, nothing will be.
Oct 16 '08 #6

rf

"Wassy" <si***@wass1.entadsl.com> wrote in message
news:ee**********************************@8g2000hse.googlegroups.com...
> On Sep 2, 9:32 am, Gordon <gordon.mc...@ntlworld.com> wrote:
>> On Aug 30, 5:04 am, Peter <cmk...@hotmail.com> wrote:
>>> Hello
>>> I have a UTF string, how can I detect what language it is?
>>> thanks
>>> from Peter (cmk...@hotmail.com)
>>
>> You can't. Only a person reading it could tell for sure what language
>> it is. If, however, you want to know the language a browser visiting
>> your site is configured to use you could look at the Accept-Language
>> header.
>
> Short version: how do I detect whether a UTF-8 character is Chinese
> (CJK) or NOT?

Go the other way: detect whether it is not English.

If the character code is greater than 127 then it is not "English". It
may be Chinese or Korean or Arabic, but it is most likely not English.
Of course it may also be French or German or even a &nbsp;, but how
fine do you want this to be? It is a search string after all.
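
That inverse test is very short to write. A minimal Python sketch, with hypothetical helper names, assuming the string has already been decoded from UTF-8 so that each character is one code point:

```python
def is_non_ascii(ch):
    """True if the character lies outside the 7-bit ASCII range (0-127)."""
    return ord(ch) > 127

def looks_non_english(text):
    """True if any character in text is outside ASCII."""
    return any(is_non_ascii(ch) for ch in text)
```

As noted above, this also fires on accented Latin letters and on a non-breaking space, so it is a coarse filter rather than a Chinese detector.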

Oct 16 '08 #7

Thanks for the replies.

Gordon: I understand what you are saying, but it's not that simple
because of the difference between Chinese word characters and our
Latin letters. To make the search EFFECTIVE, when the input is Chinese
I have to split the search string up into INDIVIDUAL characters and
search for each character on its own, but when it is English I need to
keep the LETTERS that make up an individual word or acronym together
and search the database for that sequence of letters, i.e. searching
for "cat" is a lot different from searching for "c", then "a", then
"t".

A quick example: a user may enter the search string "how do i turn my
computer on". In Latin languages we can split the string into
individual words at the spaces. However, because of the way Chinese is
written there are no spaces between words, so if the Chinese
equivalent is "howdoiturnmycomputeron", the DB will not return a
result if it tries to match ALL of the characters in sequence. That is
why I need to know which parts of the search entry are Chinese
characters, so that I can split them up, and which are Latin or
punctuation etc., so that I can keep them together. Chinese sometimes
uses Latin for technical acronyms etc., so you cannot guarantee that a
sentence will be entirely Chinese characters rather than a mixture of
Latin letters and characters.
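
The splitting rule described here (each Chinese character on its own, runs of Latin letters kept together) can be sketched with a single regular expression. This is Python, the name search_terms is hypothetical, and the character ranges cover only the main Han blocks in the Basic Multilingual Plane; everything else, such as spaces and punctuation, is simply dropped.

```python
import re

TOKEN_RE = re.compile(
    r"[\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff]"  # one Han character
    r"|[A-Za-z0-9]+"                               # or a run of Latin letters/digits
)

def search_terms(query):
    """Split a query into single Han characters and whole Latin words."""
    return TOKEN_RE.findall(query)
```

A mixed query therefore yields one term per Chinese character while an embedded acronym such as "CPU" stays in one piece.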

(I hope my ramblings make sense.)

RF: thank you, I should have thought of that answer before :-) It is
probably the best way I'm going to be able to separate the two: if I
can separate out all English (and the most common punctuation marks),
then what I am left with should be mostly Chinese. I don't care if the
user enters some uncommon accented characters like in German or
French; the website is English/Chinese, so there should be no need for
them to enter these characters in the normal case. Do you know how
UTF-8 numbering works? I know it is not like ASCII, where characters
go from 0-255. Anyway, thanks for the help; I should at least be able
to do a regexp on the string to find things like a-z, 0-9 and regular
punctuation symbols.
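
On the question of how UTF-8 numbering works: each character has a Unicode code point, and UTF-8 stores that code point in one to four bytes. Plain ASCII characters (code points 0-127) take a single byte, and everything above that takes two or more. A small Python illustration, where describe is just a helper name invented for this example:

```python
def describe(ch):
    """Return (code point, UTF-8 bytes as a hex string) for one character."""
    return ord(ch), ch.encode("utf-8").hex()

# "a", "e with acute accent" and the Han character U+4E2D
for ch in "a\u00e9\u4e2d":
    code, raw = describe(ch)
    print(f"U+{code:04X} -> {len(raw) // 2} byte(s): {raw}")
# prints:
# U+0061 -> 1 byte(s): 61
# U+00E9 -> 2 byte(s): c3a9
# U+4E2D -> 3 byte(s): e4b8ad
```

So a byte-wise check on raw UTF-8 data sees several bytes above 127 for each Chinese character, while a check on decoded characters sees one code point.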

Oct 16 '08 #8
