How to determine if a part of a text file is in English or not

3
I want to determine if a part of a text file is in English or not. Company rules say that certain texts should be in English, not the local language (in my case Swedish). Environment: Linux and Perl. Any hints?
May 11 '10 #1
5 2398
chaarmann
785 Expert 512MB
1.) Look up the text word by word in an English dictionary automatically. If the words are found, the text is English.
2.) Look up the text word by word in a Swedish dictionary automatically. If the words are found, the text is Swedish.
Non-ASCII characters that are common in Swedish can hint at which dictionary to try first.
3.) If a word is in neither dictionary, it may be misspelled.
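
The lookup steps above could be sketched roughly like this. It is shown in Python for illustration only (the same logic carries over to Perl hashes), and the two tiny word sets are made-up placeholders standing in for real dictionaries; the function names are hypothetical as well.

```python
import re

def tokenize(text):
    """Split text into lowercase words; keeps letters such as å, ä, ö."""
    return [w.lower() for w in re.findall(r"[^\W\d_]+", text)]

def count_hits(text, english_words, swedish_words):
    """Count words found in each dictionary, and words found in neither."""
    counts = {"english": 0, "swedish": 0, "unknown": 0}
    for word in tokenize(text):
        in_en = word in english_words
        in_sv = word in swedish_words
        if in_en:
            counts["english"] += 1
        if in_sv:
            counts["swedish"] += 1
        if not in_en and not in_sv:
            counts["unknown"] += 1  # possibly misspelled
    return counts

def guess_language(text, english_words, swedish_words):
    """Pick the language whose dictionary matched more words."""
    counts = count_hits(text, english_words, swedish_words)
    if counts["english"] > counts["swedish"]:
        return "english"
    if counts["swedish"] > counts["english"]:
        return "swedish"
    return "undecided"

# Placeholder word sets (assumption); real dictionaries would be far larger.
ENGLISH = {"the", "text", "is", "in", "this", "file", "a", "report"}
SWEDISH = {"den", "här", "texten", "är", "på", "svenska", "en", "i", "text"}

print(guess_language("This is a report in the file", ENGLISH, SWEDISH))
print(guess_language("Den här texten är på svenska", ENGLISH, SWEDISH))
```

Words that exist in both languages count for both, so they cancel out rather than bias the result.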

Exclude words with capital letters. Names of Swedish people, companies, etc. can appear in English text, and vice versa.
Define a threshold for misspelled words (e.g. 3% of all words that can't be looked up). The longer the text, the better the recognition rate. If the text consists of only a single misspelled word, it's very hard to say whether it's Swedish or English.
But if you want to go that far and determine that too, compute the Levenshtein distance between the word and the most similar word you can find in either language.
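
For the Levenshtein idea, a minimal sketch (Python for illustration; the helper names are hypothetical, and both word sets are assumed to be non-empty) might look like this. It computes the edit distance to every entry, which is fine for small lists but slow for full dictionaries.

```python
def levenshtein(a, b):
    """Edit distance: fewest single-character inserts, deletes, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]

def closest(word, words):
    """Return (distance, nearest_word) for the nearest entry in a word set."""
    return min((levenshtein(word, w), w) for w in words)

def nearest_language(word, english_words, swedish_words):
    """Guess the language of an unknown word from its nearest dictionary entry."""
    en_dist, _ = closest(word, english_words)
    sv_dist, _ = closest(word, swedish_words)
    if en_dist < sv_dist:
        return "english"
    if sv_dist < en_dist:
        return "swedish"
    return "undecided"

print(levenshtein("kitten", "sitting"))
print(nearest_language("langage", {"language", "word"}, {"språk", "ord"}))
```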
May 12 '10 #2
ovef
3
OK, thanks, good ideas.
What comes to mind is "now why didn't I think of that!"
To be honest I was hoping for something like "use this routine: ..."
It's not that it hasn't been done before. (being a Newbie I dare to ask...)

"look up the text ... in an English dictionary...". Do you have a suggestion on how to do this?

However, it will probably be too time-consuming. I think I will just look for a few words like "the", "a", "an", "this", "do" that don't exist in Swedish, and vice versa. And also look for specific Swedish characters, like you suggest. The problem here is that these characters are encoded differently on PCs and Linux (and Macs if they exist). But solving problems is part of the fun, right? :-).
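
A rough sketch of that marker-word idea, in Python for illustration: the English markers are the ones listed above, while the Swedish marker list and the function names are assumptions. The encoding problem is handled by trying UTF-8 first and falling back to Latin-1, which agrees with the Windows code page for å, ä and ö; files in an old Mac encoding would need their own fallback.

```python
import re

ENGLISH_MARKERS = {"the", "a", "an", "this", "do"}      # words from the post
SWEDISH_MARKERS = {"och", "att", "det", "är", "inte"}   # assumption: common Swedish words
SWEDISH_CHARS = set("åäöÅÄÖ")

def decode_text(raw):
    """Decode bytes as UTF-8, falling back to Latin-1 if that fails."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")

def marker_scores(raw):
    """Return (english_score, swedish_score) for raw bytes read from a file."""
    text = decode_text(raw)
    words = [w.lower() for w in re.findall(r"[^\W\d_]+", text)]
    english = sum(w in ENGLISH_MARKERS for w in words)
    swedish = sum(w in SWEDISH_MARKERS for w in words)
    swedish += sum(ch in SWEDISH_CHARS for ch in text)  # each å/ä/ö counts too
    return english, swedish

def looks_english(raw):
    """True when English markers outnumber Swedish markers and characters."""
    english, swedish = marker_scores(raw)
    return english > swedish

sample = "Det här är inte en rapport och"
print(looks_english(b"This is the report, do not change a thing"))
print(looks_english(sample.encode("latin-1")))
```

Reading the file in binary mode and decoding afterwards keeps the check independent of the system locale.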
May 17 '10 #3
chaarmann
785 Expert 512MB
"look up the text ... in an English dictionary...". Do you have a suggestion on how to do this?
Manually? (Just joking).
There are a lot of tools out there that can access the internet, pass arguments to a web site (the word to test) and return the result (whether or not it can be found in the online dictionary) programmatically. I usually use "wget" or "ant" to do it. But in Perl, it's easy, too. Just google for it. For example, try the script at http://www.emunix.emich.edu/~evett/B...WebAccess.html, in the chapter "Using LWP".
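
To give an idea of the shape of such a lookup, here is a sketch in Python (for illustration; Perl's web modules follow the same pattern). Everything site-specific is an assumption: the URL template is a placeholder, not a real service, and the convention that a known word answers with HTTP 200 while an unknown one answers with an error status is hypothetical. A real dictionary site will have its own URL scheme and response format.

```python
import urllib.error
import urllib.parse
import urllib.request

# Hypothetical URL template (assumption); replace with a real online dictionary.
LOOKUP_URL = "https://dictionary.example/lookup?word={word}"

def http_status(url, timeout=10):
    """Return the HTTP status code of a GET request to url."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 404 when the page does not exist

def word_in_online_dictionary(word, get_status=http_status):
    """Assumes the site returns 200 for known words and another status otherwise."""
    url = LOOKUP_URL.format(word=urllib.parse.quote(word))
    return get_status(url) == 200
```

The `get_status` parameter makes it possible to swap in a fake function for testing, so the logic can be checked without touching the network.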

If you need more performance, you might also consider buying an offline dictionary and using its provided interface for the lookups.
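
As a simple stand-in for an offline dictionary, a plain word list with one word per line can be loaded into a set, which makes each lookup very fast. The sketch below is Python for illustration; the function names are hypothetical, and the path in the comment is only an assumption (many Linux systems ship a word list at /usr/share/dict/words, but it may not be installed, and a Swedish list would have to be obtained separately).

```python
def load_words(path, encoding="utf-8"):
    """Read a one-word-per-line file into a set of lowercase words."""
    with open(path, encoding=encoding) as handle:
        return {line.strip().lower() for line in handle if line.strip()}

def known_fraction(words, dictionary):
    """Fraction of the given words that appear in the dictionary set."""
    if not words:
        return 0.0
    return sum(w.lower() in dictionary for w in words) / len(words)

# Example usage (path is an assumption, adjust for your system):
# english = load_words("/usr/share/dict/words")
# print(known_fraction(["this", "is", "a", "test"], english))
```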
May 18 '10 #5
ovef
3
@chaarmann
Thanks a lot!
May 18 '10 #6
