Counting words from text file

Hi guys,

Very new to Python and was hoping you guys could give me some help.

I have a book about the Great War and want to count how many times each country appears in it. So far I have this:

Accessing the book
    >>> from __future__ import division
    >>> import nltk, re, pprint
    >>> from urllib import urlopen
    >>> url = "http://www.gutenberg.org/files/29270/29270.txt"
    >>> raw = urlopen(url).read()
    >>> type(raw)
    <type 'str'>
    >>> len(raw)
    1067008
    >>> raw[:75]
    'The Project Gutenberg EBook of The Story of the Great War, Volume II (of\r\nV'

Tokenizing
    >>> tokens = nltk.word_tokenize(raw)
    >>> type(tokens)
    <type 'list'>
    >>> len(tokens)
    189743
    >>> tokens[:10]
    ['The', 'Project', 'Gutenberg', 'EBook', 'of', 'The', 'Story', 'of', 'the', 'Great']

Slicing
    >>> text = nltk.Text(tokens)
    >>> type(text)
    <class 'nltk.text.Text'>
    >>> text[1020:1060]
    ['Battles', 'of', 'the', 'Polish', 'Campaign', '462', 'LXXX.', 'Winter', 'Battles', 'in', 'East', 'Prussia', '478', 'LXXXI.', 'Results', 'of', 'First', 'Six', 'Months', 'of', 'Russo-German', 'Campaign', '482', 'PART', 'VIII.--TURKEY', 'AND', 'THE', 'DARDANELLES', 'LXXXII.', 'First', 'Moves', 'of', 'Turkey', '493', 'LXXXIII.', 'The', 'First', 'Blow', 'Against', 'the']
    >>> text.collocations()
    Building collocations list
    General von; Project Gutenberg-tm; East Prussia; Von Kluck; von Kluck;
    General Staff; General Joffre; army corps; General Foch; crown prince;
    Project Gutenberg; von Buelow; Sir John; Third Army; right wing; Crown
    Prince; Field Marshal; Von Buelow; First Army; Army Corps

Correcting the start and ending
    >>> raw.find("PART I")
    2629
    >>> raw.rfind("End of the Project Gutenberg")
    1047663
    >>> raw = raw[2629:1047663]
    >>> raw.find("PART I")

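The trimming step at the end of the session can be wrapped in a small helper. This is only a minimal sketch in Python 3 (the session itself is Python 2): the function name `trim_between` is made up for illustration, the sample string is a tiny stand-in for the downloaded book, and the two marker strings are the ones searched for in the session.

```python
def trim_between(raw, start_marker, end_marker):
    """Return the part of raw from the first start_marker up to the last end_marker.

    Hypothetical helper: if a marker is missing, that side is left untrimmed.
    """
    start = raw.find(start_marker)   # -1 when the marker is absent
    end = raw.rfind(end_marker)      # search from the right for the footer
    if start == -1:
        start = 0
    if end == -1 or end < start:
        end = len(raw)
    return raw[start:end]


# Tiny stand-in for the downloaded book text.
sample = "License header\nPART I\nThe war began.\nEnd of the Project Gutenberg EBook"
body = trim_between(sample, "PART I", "End of the Project Gutenberg")
print(repr(body))
```

Note that in Python 3 `urlopen(url).read()` returns bytes, so the download would need to be decoded to a string before a helper like this is applied.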
For counting words I have this:

    import re

    def getWordFrequencies(text):
        """Return a dict mapping each word in text to how often it occurs."""
        frequencies = {}

        # Split on runs of non-word characters; skip the empty strings that
        # re.split produces at the edges of the text.
        for c in re.split(r'\W+', text):
            if c:
                frequencies[c] = frequencies.get(c, 0) + 1

        return frequencies

    # <HERE THE BOOK SHOULD BE INSERTED, I THINK>

    # Book is not defined yet; it should hold the book's text as one string.
    result = dict([(w, Book.count(w)) for w in Book.split()])

    for i in result.items():
        print("%s\t%d" % i)
----------

Unfortunately, I have no idea how to feed the book into the word count. My ideal outcome would be something like this:

Germany 2000
United Kingdom 1500
USA 1000
Holland 50
Belgium 150

etc.
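One possible way to connect the two pieces is to pass the book's text, as a single string, to the counting function and then look up only the countries of interest. The following is a hedged Python 3 sketch, not the poster's code: `get_word_frequencies`, the `book` sample string, and the `countries` seed list are all illustrative assumptions. In the session above, `book` would be the trimmed `raw` string.

```python
import re
from collections import Counter


def get_word_frequencies(text):
    """Count how often each word occurs in text."""
    # \w+ matches runs of letters, digits and underscores, so punctuation is dropped.
    return Counter(re.findall(r"\w+", text))


# Stand-in for the book; in the session above this would be the trimmed `raw` string.
book = "Germany invaded Belgium. Belgium resisted, and Germany advanced."

# Hypothetical seed list of single-word country names to report on.
countries = ["Germany", "Belgium", "Holland"]

frequencies = get_word_frequencies(book)
for country in countries:
    # A Counter returns 0 for words it never saw.
    print("%s %d" % (country, frequencies[country]))
```

Because the text is split into single words, a multi-word name such as "United Kingdom" would never appear as one key; names like that need phrase matching instead of a per-word lookup.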


Please help!
Jun 5 '12 #1
Glenton
391 Expert 256MB
Hi

Do you have a seed list of countries that you are going to look for? Are you going to look for "USA", "United States", "US of A", "United States of America", "America" as separate items? If so, you need also to be cognisant that "America" would also be found when "United States of America" is found etc.
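The overlap problem can be sketched as follows, assuming a hand-written alias table. Everything here is illustrative: the `ALIASES` mapping, the `count_countries` name, and the sample sentence are assumptions, and the code is written for Python 3. The idea is to try longer spellings first, so that "United States of America" is counted once rather than also being counted as "America".

```python
import re
from collections import Counter

# Hypothetical alias table: each spelling maps to the name it is reported under.
ALIASES = {
    "United States of America": "USA",
    "United States": "USA",
    "USA": "USA",
    "America": "USA",
    "Germany": "Germany",
}


def count_countries(text, aliases=ALIASES):
    """Count alias matches, crediting each match to its canonical country."""
    # Try longer aliases first so "United States of America" wins over "America".
    ordered = sorted(aliases, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(a) for a in ordered) + r")\b")
    counts = Counter()
    for match in pattern.finditer(text):
        counts[aliases[match.group(0)]] += 1
    return counts


sample = "The United States of America and Germany. America waited; the USA armed."
counts = count_countries(sample)
print(counts["USA"], counts["Germany"])
```

The `\b` word boundaries keep "American" from being counted as "America"; whether that is the desired behaviour depends on what you decide counts as a mention.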

There tend to be a lot of messy details in real-life data mining!

Also, please use the code tags - it will make it much clearer to us - especially with Python where indents are important.

Anyway, you seem to have got some results so far - hopefully you can clarify your question a little more...
Jun 7 '12 #2
    def count_words(file_name):
        """Return the total number of whitespace-separated words in a file."""
        num_lines = 0
        num_words = 0

        with open(file_name, 'r') as f:
            for line in f:
                words = line.split()

                num_lines += 1
                num_words += len(words)

        return num_words

    # file_name must hold the file name with its absolute path.
    words_count = count_words(file_name)
Dec 14 '12 #3
kttr
12
I agree with Glenton. Real-life data analysis is complicated, especially when it comes to finding specific words that match a given meaning. However, I will assume that you only want to count how many times "Germany" appears and NOT "Prussia". Either way, this distinction should be made clear.

So far, your code is a little hard to read. Try adding more comments so that readers know what each part of your program is doing.

All criticism aside, here is what I would try, and why:

(Tell me if the code works or not!!!!!)

Before doing anything, just copy and paste the text from the online book into a Notepad document and save it as "ww1.txt" (quotes not included in the name). That way, you can avoid any trouble that might arise from reading the file over the internet.

Once you've done that, here is what your code might look like (I have explained the code using inline comments).

    # Open the saved book for reading only.
    filehandler = open("ww1.txt", "r")

    # Start one counter variable per country.
    # Each counter is increased inside the 'for' loop below.
    germany_counter = 0
    holland_counter = 0
    # add more countries' counters here

    for line in filehandler:
        # Each line is already a string, so no str() conversion is needed.
        # line.count() returns how many times the word occurs in the line, so a
        # line that mentions Germany twice adds 2 (find() would only tell us
        # whether the word appears at all).
        germany_counter = germany_counter + line.count("Germany")
        holland_counter = holland_counter + line.count("Holland")  # same principle as above

    filehandler.close()

    # When it's done reading all of the lines, print each country's name and its counter.
    print("Germany %d" % germany_counter)
    print("Holland %d" % holland_counter)

    # Obviously, you must add more counters and count() calls for other countries.
    # PLEASE NOTE: you will have to change the search string depending on what you're looking for. Do you want Prussia or Germany? Edit the search string to see the difference.
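Instead of one counter variable per country, the same idea can be driven by a list of search strings. This is a minimal Python 3 sketch, not a definitive version: the function name `count_in_file` and the demo file contents are assumptions, and the demo writes a small temporary file rather than reading the real "ww1.txt".

```python
import os
import tempfile


def count_in_file(path, search_strings):
    """Return {search string: number of occurrences} for one text file."""
    counters = dict.fromkeys(search_strings, 0)
    with open(path, "r") as handle:      # 'with' closes the file automatically
        for line in handle:
            for word in search_strings:
                # str.count counts non-overlapping occurrences within the line.
                counters[word] += line.count(word)
    return counters


# Demo on a small temporary file instead of the real "ww1.txt".
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("Germany and Holland.\nGermany again, Germany twice.\n")
    demo_path = tmp.name

result = count_in_file(demo_path, ["Germany", "Holland", "Belgium"])
os.remove(demo_path)

for name, total in result.items():
    print("%s %d" % (name, total))
```

Adding a country then only means adding a string to the list. One limitation of reading line by line is that a multi-word name split across a line break would not be found.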
Dec 19 '12 #4
dwblas
626 Expert 512MB
The original code is from June and the OP has not responded. It was either solved months ago or forgotten.
Dec 19 '12 #5
