Determine language of body of text?

Does anyone have a method to determine the language (e.g. en-US, fr-FR,
zh-TW etc) of a body of text?

(In my case the body of text is an email received).

Naturally the text, if not English, may have a little bit of English. I'm
just trying to get the main language used.

Oct 23 '08 #1
http://code.google.com/apis/ajaxlang...tation/#Detect

Oct 23 '08 #2
MC
Each language has a few words that are extremely common, such as "the", "a", "an", "of", "for" in English. You could look for the 5 or 10 most common words in each language, and see which language wins.

To find the most common words, use a word-frequency-table program to analyze some text samples.
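A minimal sketch of the common-word idea, assuming tiny hand-picked word lists (the `COMMON_WORDS` table and the `guess_language` name are made up for illustration; a real table would come from a word-frequency analysis of sample text):

```python
import re
from collections import Counter

# Tiny illustrative word lists (assumed); build real ones from text samples.
COMMON_WORDS = {
    "en": {"the", "a", "an", "of", "for", "and", "to", "in", "is", "that"},
    "fr": {"le", "la", "les", "de", "des", "et", "un", "une", "est", "que"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "ein", "eine", "zu", "mit"},
}

def guess_language(text):
    """Return the language whose common words occur most often, or None."""
    # Split into runs of letters only, so digits and punctuation are ignored.
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(words)
    scores = {
        lang: sum(counts[word] for word in vocab)
        for lang, vocab in COMMON_WORDS.items()
    }
    best = max(scores, key=scores.get)
    # If no common word was seen at all, do not guess.
    return best if scores[best] > 0 else None
```

Because the score is a count over the whole text, a few English words inside a mostly French email should not change the winner.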
Oct 23 '08 #3
As well as the other suggestions, have a look at the characters used:
umlauts or the ß (eszett) character appear in German, and acute or grave
accents in French.

Digram and trigram frequencies can also be a good indicator.

Essentially you are going to have to use some sort of statistical
method.
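One way the trigram idea could look, as a rough sketch: build a trigram count profile per language from sample text and pick the profile closest to the message by cosine similarity. The `SAMPLES` strings and all function names are assumptions for illustration, and real profiles would need far more training text than this.

```python
import math
from collections import Counter

# Toy training samples (assumed); real profiles need much more text.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and this is the way that we write english text",
    "fr": "le renard brun saute par dessus le chien paresseux et voici la façon dont nous écrivons le français",
    "de": "der schnelle braune fuchs springt über den faulen hund und so schreiben wir einen deutschen text",
}

def ngram_profile(text, n=3):
    """Count overlapping character n-grams after normalising whitespace."""
    text = " " + " ".join(text.lower().split()) + " "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

PROFILES = {lang: ngram_profile(sample) for lang, sample in SAMPLES.items()}

def guess_by_trigrams(text):
    """Return the language whose trigram profile is most similar to the text."""
    profile = ngram_profile(text)
    return max(PROFILES, key=lambda lang: cosine(profile, PROFILES[lang]))
```

Passing `n=2` to `ngram_profile` gives the digram variant. Very short inputs will be unreliable either way.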

rossum

Oct 23 '08 #4
JP
There's some code here that writes the internet headers of an email to
a text file, which you could then parse for the language. For example,
some email headers include a "Content-Type" line which indicates the
character set used.

e.g.: Content-Type: text/plain; charset=US-ASCII
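For illustration only, here is how the charset could be pulled out of raw headers once they are in a text file, using Python's standard `email` package. The `RAW_HEADERS` sample and the `declared_charset` name are made up, and the charset only hints at the script in use, not the language itself.

```python
from email import message_from_string

# Made-up sample headers for illustration.
RAW_HEADERS = (
    "From: sender@example.com\r\n"
    "Subject: Bonjour\r\n"
    "Content-Type: text/plain; charset=US-ASCII\r\n"
    "\r\n"
)

def declared_charset(raw_message):
    """Return the charset from the Content-Type header (lower-cased), or None."""
    msg = message_from_string(raw_message)
    # get_content_charset() returns None when no charset parameter is present.
    return msg.get_content_charset()
```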

http://blogs.technet.com/kclemson/ar...n-outlook.aspx

HTH,
JP

Oct 23 '08 #5
Look at the MailItem.InternetCodepage property (corresponds to the
PR_INTERNET_CPID MAPI property).
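As a rough sketch of what could be done with that number: the value is a Windows code page identifier, so it can be mapped to an encoding name and a coarse script hint. The table below covers only a handful of well-known code pages, the `codepage_hint` helper is hypothetical, and the integer is assumed to have already been read from the property elsewhere.

```python
# A few well-known Windows code page identifiers mapped to
# (encoding name, rough script hint). Deliberately incomplete.
CODEPAGE_HINTS = {
    20127: ("us-ascii", "Latin, unaccented"),
    28591: ("iso-8859-1", "Western European"),
    1252: ("windows-1252", "Western European"),
    1251: ("windows-1251", "Cyrillic"),
    932: ("shift_jis", "Japanese"),
    936: ("gb2312", "Simplified Chinese"),
    950: ("big5", "Traditional Chinese"),
    65001: ("utf-8", "any script"),
}

def codepage_hint(codepage):
    """Return (encoding, script hint) for a code page, or unknowns if unmapped."""
    return CODEPAGE_HINTS.get(codepage, ("unknown", "unknown"))
```

A UTF-8 or Western European result narrows things down very little, so this is best treated as one weak signal among several.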

--
Dmitry Streblechenko (MVP)
http://www.dimastr.com/
OutlookSpy - Outlook, CDO
and MAPI Developer Tool
Oct 23 '08 #6
Do all incoming emails have this? Even if they are not originating from an
Outlook client?

(It's for an Outlook 2007 add-in written in C#, BTW.)
Oct 23 '08 #7
Most of them. But that really tells you more about the default code page of
the sender.

--
Dmitry Streblechenko (MVP)
http://www.dimastr.com/
OutlookSpy - Outlook, CDO
and MAPI Developer Tool
Oct 23 '08 #8
If you are willing to write some code, then you can detect
the language (but probably not the regional dialect).

* dictionary with common words
* special letters (acute and grave accents, umlauts etc.)
* distribution of letters
* distribution of pairs of letters
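The special-letters item from the list above could be sketched like this, assuming small hand-picked sets of marker letters per language (the sets and function names are illustrative, not exhaustive, and some letters are shared between languages in practice):

```python
import unicodedata
from collections import Counter

# Letters that are fairly distinctive per language (assumed, incomplete sets).
MARKER_LETTERS = {
    "de": set("äöüß"),
    "fr": set("éèêàçùâîôœ"),
    "es": set("ñáíóú¿¡"),
}

def marker_scores(text):
    """Count how many marker letters of each language occur in the text."""
    # NFC turns 'e' + combining accent into the single precomposed letter.
    counts = Counter(unicodedata.normalize("NFC", text).lower())
    return {
        lang: sum(counts[ch] for ch in letters)
        for lang, letters in MARKER_LETTERS.items()
    }

def guess_by_letters(text):
    """Return the language with the most marker letters, or None if none occur."""
    scores = marker_scores(text)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Plain unaccented text returns `None`, which is the cue to fall back on the word or letter-pair statistics.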

Arne
Oct 26 '08 #9
