How to determine if a part of a text file is in English or not

P: 3
I want to determine if a part of a text file is in English or not. Company rules say that certain texts should be in English, not the local language (in my case Swedish). Environment: Linux and Perl. Any hints?
May 11 '10 #1
5 Replies


Expert 100+
P: 785
1.) Automatically look up the text word by word in an English dictionary. If you can find the words, then it's English.
2.) Automatically look up the text word by word in a Swedish dictionary. If you can find the words, then it's Swedish.
You can look for the non-ASCII characters commonly used in Swedish (å, ä, ö) as a hint for which dictionary to try first.
3.) If you can't find a word in either dictionary, then maybe it's misspelled.
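Steps 1 to 3 could be sketched roughly as follows. This is only an illustration in Python (the same logic ports directly to Perl hashes); the two tiny word sets and the function names are made up for the example, and a real check would load full dictionaries instead.

```python
import re

# Tiny illustrative word sets (assumption); real dictionaries are far larger.
ENGLISH_WORDS = {"the", "report", "is", "ready", "for", "review"}
SWEDISH_WORDS = {"rapporten", "är", "klar", "för", "granskning"}

SWEDISH_CHARS = set("åäöÅÄÖ")


def tokenize(text):
    """Split text into words, keeping the Swedish letters å, ä and ö."""
    return re.findall(r"[A-Za-zåäöÅÄÖ]+", text)


def classify_words(text):
    """Count words found in each dictionary, plus words found in neither."""
    # Hint from step 2: Swedish characters suggest trying Swedish first.
    swedish_first = any(ch in SWEDISH_CHARS for ch in text)
    order = ["swedish", "english"] if swedish_first else ["english", "swedish"]
    dictionaries = {"english": ENGLISH_WORDS, "swedish": SWEDISH_WORDS}
    counts = {"english": 0, "swedish": 0, "unknown": 0}
    for word in tokenize(text):
        lower = word.lower()
        for lang in order:
            if lower in dictionaries[lang]:
                counts[lang] += 1
                break
        else:
            # Step 3: in neither dictionary, so possibly misspelled.
            counts["unknown"] += 1
    return counts
```

Calling `classify_words("The report is ready for review")` counts every word as English; the language with the larger count wins.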

Exclude words that start with a capital letter. Names of Swedish people, companies, etc. can also appear inside English text, and the other way around.
Define a threshold for misspelled words (e.g. 3% of all words may fail the lookup). The longer the text, the better the recognition rate. If the text consists of only a single, misspelled word, then it's very hard to say whether it's Swedish or English.
But if you want to go that far and determine that as well, then compute the Levenshtein distance between this word and the most similar word you can find in either language.
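A rough sketch of these two ideas, again in Python but easy to port to Perl. The function names and the shape of the `dictionaries` argument (language name mapped to a set of words) are assumptions made for the example, and the brute-force nearest-word search is only practical for small word lists.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def closest_language(word, dictionaries):
    """Return (language, nearest_word, distance) with the smallest distance."""
    best = None
    for lang, words in dictionaries.items():
        for candidate in words:
            distance = levenshtein(word, candidate)
            if best is None or distance < best[2]:
                best = (lang, candidate, distance)
    return best


def unknown_ratio(words, known):
    """Share of non-capitalized words (non-empty strings) missing from `known`."""
    checked = [w for w in words if not w[0].isupper()]
    if not checked:
        return 0.0
    missing = sum(1 for w in checked if w not in known)
    return missing / len(checked)
```

A text would then pass as English if `unknown_ratio(words, english_words)` stays at or below the chosen threshold, for example 0.03.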
May 12 '10 #2

P: 3
OK, thanks, good ideas.
What comes to mind is "now why didn't I think of that!"
To be honest I was hoping for something like "use this routine: ..."
It's not that it hasn't been done before. (being a Newbie I dare to ask...)

"look up the text ... in an English dictionary...". Do you have a suggestion on how to do this?

However, it will probably be too time-consuming. I think I will just look for a few words like "the", "a", "an", "this", "do" that don't exist in Swedish, and vice versa. And also look for specific Swedish characters, like you suggest. The problem here is that these characters are encoded differently on PCs and Linux (and Macs if they exist). But solving problems is part of the fun, right? :-).
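For what it's worth, the marker-word idea might look something like this minimal sketch (Python here, though the same approach works in Perl). The Swedish marker words are my own illustrative pick, and the encoding handling assumes the files are either UTF-8 or Latin-1; old Mac encodings are not covered.

```python
import re

# English markers from the post above.
ENGLISH_MARKERS = {"the", "a", "an", "this", "do"}
# Common Swedish function words (an illustrative choice, an assumption).
SWEDISH_MARKERS = {"och", "att", "det", "som", "inte", "är", "för"}


def decode_text(raw):
    """Decode bytes as UTF-8, falling back to Latin-1 if that fails."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")


def guess_language(raw):
    """Return 'english', 'swedish' or 'unknown' for a chunk of raw bytes."""
    text = decode_text(raw).lower()
    words = re.findall(r"[a-zåäö]+", text)
    english = sum(1 for w in words if w in ENGLISH_MARKERS)
    swedish = sum(1 for w in words if w in SWEDISH_MARKERS)
    # Each Swedish-specific character counts as extra evidence.
    swedish += sum(1 for ch in text if ch in "åäö")
    if english > swedish:
        return "english"
    if swedish > english:
        return "swedish"
    return "unknown"
```

Reading the file in binary mode and decoding it explicitly keeps the å/ä/ö check independent of the platform the file came from.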
May 17 '10 #3

Expert 100+
P: 785
deleted - deleted
May 18 '10 #4

Expert 100+
P: 785
"look up the text ... in an English dictionary...". Do you have a suggestion on how to do this?
Manually? (Just joking).
There are a lot of tools out there that can access the internet, pass an argument (the word to test) to a web site, and return the result (whether or not the word is in the online dictionary) programmatically. I usually use "wget" or "ant" to do it. But in Perl, it's easy, too. Just google for it. For example, try the script on http://www.emunix.emich.edu/~evett/B...WebAccess.html.
in the chapter "Using LWP"

If you need more performance, you might also think about buying an offline dictionary and using its provided interface to do any lookups.
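As a free stand-in for a purchased dictionary, a plain word-list file with one word per line can be loaded into memory once and queried locally. The sketch below is Python, but a Perl hash works the same way; the function names are made up for the example, and it assumes such a file is available (many Linux systems have one at /usr/share/dict/words when a word-list package is installed).

```python
def load_word_list(path, encoding="utf-8"):
    """Read one word per line into a lowercase set for fast lookups."""
    with open(path, encoding=encoding) as handle:
        return {line.strip().lower() for line in handle if line.strip()}


def is_known(word, words):
    """Case-insensitive membership test against a loaded word set."""
    return word.lower() in words
```

Usage would be along the lines of `english = load_word_list("/usr/share/dict/words")` followed by `is_known("Report", english)`; set lookups avoid one network round trip per word.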
May 18 '10 #5

P: 3
@chaarmann
Thanks a lot!
May 18 '10 #6
