How to determine if a part of a text file is in English or not

I want to determine if a part of a text file is in English or not. Company rules say that certain texts should be in English, not the local language (in my case Swedish). Environment: Linux and Perl. Any hints?

May 11 '10 #1

chaarmann

1.) Automatically look up the text word by word in an English dictionary. If you can find the words, then it's English.
2.) Automatically look up the text word by word in a Swedish dictionary. If you can find the words, then it's Swedish.
You can look for the non-ASCII characters commonly used in Swedish (å, ä, ö) as a hint for which dictionary to try first.
3.) If you can't find a word in either dictionary, then maybe it's misspelled.
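
The three lookup steps above could be sketched roughly like this. It is written in Python rather than Perl purely for illustration, and the two tiny word sets, as well as the names `count_hits` and `guess_language`, are made up for the example; real dictionaries would hold many thousands of entries.

```python
import re

# Tiny stand-in word lists (assumption); real ones would be far larger.
ENGLISH = {"the", "report", "is", "ready", "for", "review"}
SWEDISH = {"rapporten", "är", "klar", "för", "granskning"}


def count_hits(text, english=ENGLISH, swedish=SWEDISH):
    """Count words found in each dictionary, plus words found in neither."""
    counts = {"english": 0, "swedish": 0, "unknown": 0}
    # [^\W\d_]+ matches runs of letters, including å, ä and ö.
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        in_english = word in english
        in_swedish = word in swedish
        if in_english:
            counts["english"] += 1
        if in_swedish:
            counts["swedish"] += 1
        if not in_english and not in_swedish:
            counts["unknown"] += 1  # possibly misspelled, see step 3
    return counts


def guess_language(text):
    """Pick the language with more dictionary hits, or 'undecided' on a tie."""
    counts = count_hits(text)
    if counts["english"] > counts["swedish"]:
        return "english"
    if counts["swedish"] > counts["english"]:
        return "swedish"
    return "undecided"
```

A word that exists in both languages counts for both, so it does not tip the result either way.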

Exclude words with capital letters. Names of Swedish people, companies, etc. can also appear inside English text, and the other way around.
Define a threshold for misspelled words (e.g. 3% of all words can't be looked up). The longer the text, the better the recognition rate. If the text consists of only a single, misspelled word, then it's very hard to say whether it's Swedish or English.
But if you want to go that far and classify such a word too, then compute the Levenshtein distance between this word and the most similar word you can find in either language.
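
A minimal sketch of the Levenshtein idea and of the unknown-word ratio, again in Python only for illustration (the thread itself is about Perl). The helper names `closest` and `unknown_ratio` are hypothetical, and the vocabularies are assumed to be plain sets of lowercase words.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def closest(word, vocabulary):
    """Return (distance, entry) for the vocabulary entry nearest to word."""
    return min((levenshtein(word, entry), entry) for entry in vocabulary)


def unknown_ratio(words, english, swedish):
    """Share of words found in neither dictionary, ignoring words that
    start with a capital letter (likely names)."""
    candidates = [w for w in words if not w[:1].isupper()]
    if not candidates:
        return 0.0
    unknown = [w for w in candidates if w not in english and w not in swedish]
    return len(unknown) / len(candidates)
```

For a single unknown word, you would call `closest` once per language and prefer the language with the smaller distance. Note that `closest` compares against every entry, so it is slow on large dictionaries.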

May 12 '10 #2

ovef

OK, thanks, good ideas.
What comes to mind is "now why didn't I think of that!"
To be honest I was hoping for something like "use this routine: ..."
It's not that it hasn't been done before. (being a Newbie I dare to ask...)

"look up the text ... in an English dictionary...". Do you have a suggestion on how to do this?

However, it will probably be too time-consuming. I think I will just look for a few words like "the", "a", "an", "this", "do" that don't exist in Swedish, and vice versa. And also look for specific Swedish characters, like you suggest. The problem here is that these characters are encoded differently on PCs and Linux (and Macs, if they exist). But solving problems is part of the fun, right? :-)
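
A rough sketch of that quick heuristic, in Python for illustration rather than Perl. The English marker words are the ones listed above; the Swedish marker words are an assumed sample of common Swedish words, and the encoding handling assumes the file is either UTF-8 or Latin-1 (a Mac Roman file would need its own case).

```python
import re

ENGLISH_MARKERS = {"the", "a", "an", "this", "do"}      # from the post above
SWEDISH_MARKERS = {"och", "att", "det", "är", "inte"}   # assumed sample
SWEDISH_LETTERS = set("åäöÅÄÖ")


def decode(raw):
    """Decode bytes as UTF-8, falling back to Latin-1 (which never fails)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")


def quick_guess(raw):
    """Guess the language of raw bytes from marker words and Swedish letters."""
    text = decode(raw)
    words = re.findall(r"[^\W\d_]+", text.lower())
    english = sum(word in ENGLISH_MARKERS for word in words)
    swedish = sum(word in SWEDISH_MARKERS for word in words)
    # Each å, ä or ö counts as one extra point for Swedish.
    swedish += sum(ch in SWEDISH_LETTERS for ch in text)
    if english > swedish:
        return "english"
    if swedish > english:
        return "swedish"
    return "undecided"
```

Trying UTF-8 first works because Latin-1 bytes for å, ä and ö followed by an ASCII letter are not valid UTF-8, so the fallback kicks in for typical Swedish text.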

May 17 '10 #3

chaarmann

deleted - deleted

May 18 '10 #4

chaarmann

"look up the text ... in an English dictionary...". Do you have a suggestion on how to do this?

Manually? (Just joking).
There are a lot of tools out there that can access the internet, pass arguments to a web site (the word to test) and return the result (whether it can be found in the online dictionary or not) programmatically. I usually use "wget" or "ant" to do it. But in Perl, it's easy, too. Just google for it. For example, try the script at http://www.emunix.emich.edu/~evett/B...WebAccess.html, in the chapter "Using LWP".

If you need more performance, you might also think about buying an offline dictionary and using its provided interface to do the lookups.
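
As one possible offline variant, a plain-text word list with one word per line can be loaded into memory once and then queried without any network access. This is only a sketch in Python (not Perl), and it assumes such a file is available; many Linux systems ship one at /usr/share/dict/words, but whether it is installed on your machine is an assumption. The names `load_words` and `is_known` are hypothetical.

```python
import os
import tempfile


def load_words(path, encoding="utf-8"):
    """Read a word list (one entry per line) into a set of lowercase words."""
    with open(path, encoding=encoding) as handle:
        return {line.strip().lower() for line in handle if line.strip()}


def is_known(word, words):
    """Case-insensitive membership test against a loaded word list."""
    return word.lower() in words


# Demo with a throwaway file; a real run would point at an actual word list.
with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, "words.txt")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write("The\nreport\n\nready\n")
    demo_words = load_words(demo_path)
```

Set lookups take roughly constant time, so checking every word of a long text stays fast once the list is loaded.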

May 18 '10 #5

ovef

@chaarmann
Thanks a lot!

May 18 '10 #6
