Bytes IT Community

Counting words from text file

P: 1
Hi guys,

Very new to Python and was hoping you guys could give me some help.

I have a book about The Great War, and want to count the number of times each country appears in the book. So far I have this:

Accessing the book
>>> from __future__ import division
>>> import nltk, re, pprint
>>> from urllib import urlopen
>>> url = "http://www.gutenberg.org/files/29270/29270.txt"
>>> raw = urlopen(url).read()
>>> type(raw)
<type 'str'>
>>> len(raw)
1067008
>>> raw[:75]
'The Project Gutenberg EBook of The Story of the Great War, Volume II (of\r\nV'

Tokenizing
>>> tokens = nltk.word_tokenize(raw)
>>> type(tokens)
<type 'list'>
>>> len(tokens)
189743
>>> tokens[:10]
['The', 'Project', 'Gutenberg', 'EBook', 'of', 'The', 'Story', 'of', 'the', 'Great']

Slicing
>>> text = nltk.Text(tokens)
>>> type(text)
<class 'nltk.text.Text'>
>>> text[1020:1060]
['Battles', 'of', 'the', 'Polish', 'Campaign', '462', 'LXXX.', 'Winter', 'Battles', 'in', 'East', 'Prussia', '478', 'LXXXI.', 'Results', 'of', 'First', 'Six', 'Months', 'of', 'Russo-German', 'Campaign', '482', 'PART', 'VIII.--TURKEY', 'AND', 'THE', 'DARDANELLES', 'LXXXII.', 'First', 'Moves', 'of', 'Turkey', '493', 'LXXXIII.', 'The', 'First', 'Blow', 'Against', 'the']
>>> text.collocations()
Building collocations list
General von; Project Gutenberg-tm; East Prussia; Von Kluck; von Kluck;
General Staff; General Joffre; army corps; General Foch; crown prince;
Project Gutenberg; von Buelow; Sir John; Third Army; right wing; Crown
Prince; Field Marshal; Von Buelow; First Army; Army Corps

Correcting the start and ending
>>> raw.find("PART I")
2629
>>> raw.rfind("End of the Project Gutenberg")
1047663
>>> raw = raw[2629:1047663]
>>> raw.find("PART I")

For counting words I have this:

import re

def getWordFrequencies(text):
    # Map each word to the number of times it appears in text.
    frequencies = {}

    for c in re.split(r'\W+', text):
        frequencies[c] = frequencies.get(c, 0) + 1

    return frequencies

# <HERE THE BOOK SHOULD BE INSERTED, I THINK>

result = dict([(w, Book.count(w)) for w in Book.split()])

for i in result.items(): print("%s\t%d" % i)
----------

Unfortunately, I have no idea how to feed the book into the word count. My ideal outcome would be something like this:

Germany 2000
United Kingdom 1500
USA 1000
Holland 50
Belgium 150

etc.


Please help!
Jun 5 '12 #1
4 Replies

Expert 100+
P: 391
Hi

Do you have a seed list of countries that you are going to look for? Are you going to look for "USA", "United States", "US of A", "United States of America", "America" as separate items? If so, you need also to be cognisant that "America" would also be found when "United States of America" is found etc.
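
One way to handle that overlap is sketched below. It assumes Python 3 and a hand-made seed list; the alias table `ALIASES` and the function name `count_countries` are made up for illustration and do not come from this thread. Longer aliases are tried first in a single regular-expression alternation, so "United States of America" is consumed as one match and the "America" inside it is not counted a second time.

```python
import re
from collections import Counter

# Hypothetical seed list: each alias maps to the name it is counted under.
ALIASES = {
    "United States of America": "USA",
    "United States": "USA",
    "USA": "USA",
    "America": "USA",
    "Germany": "Germany",
    "Holland": "Holland",
}

# Longest aliases first: the regex engine takes the first alternative that
# matches at a position, so the longest alias wins over its own substrings.
_pattern = re.compile(
    r"\b(?:"
    + "|".join(re.escape(a) for a in sorted(ALIASES, key=len, reverse=True))
    + r")\b"
)

def count_countries(text):
    """Count non-overlapping alias matches, grouped by canonical name."""
    return Counter(ALIASES[m.group(0)] for m in _pattern.finditer(text))
```

The `\b` word boundaries keep "American" from being counted as "America". Whether that is the right call depends on what you decide the seed list should cover.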

There tend to be a lot of messy details in real-life data mining!!

Also, please use the code tags - it will make it much clearer to us - especially with Python where indents are important.

Anyway, you seem to have got some results so far - hopefully you can clarify your question a little more...
Jun 7 '12 #2

P: 6
def count_words(file_name):
    # Return the total number of whitespace-separated words in the file.
    num_lines = 0  # lines read (tallied but not returned)
    num_words = 0

    with open(file_name, 'r') as f:
        for line in f:
            words = line.split()

            num_lines += 1
            num_words += len(words)

    return num_words

words_count = count_words(file_name)  # file_name: file name with absolute path
Dec 14 '12 #3

P: 12
I agree with Glenton. Real-life data analysis, especially finding specific words that match a given meaning, is complicated. However, I will assume that you only want to count the number of times "Germany" is found and NOT "Prussia". Either way, this distinction should be made clear.

So far, your code is a little hard to read. Try adding more comments so that readers know what each part of your program is doing.

All criticism aside, here is what I would try, and why:

(Tell me if the code works or not!!!!!)

Before doing anything, copy and paste the text from the online book into a Notepad document and save it as "ww1.txt" (without the quotes). That way, you avoid any trouble that might arise from reading the file over the internet.
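
If copying and pasting by hand is awkward, the same one-time download can be scripted. This is only a sketch, assuming Python 3 (where `urlopen` lives in `urllib.request`); the helper name `save_book` is made up here, and the URL in the comment is the one from the first post.

```python
from urllib.request import urlopen

def save_book(url, path):
    """Download url once, write the raw bytes to path, return the byte count."""
    with urlopen(url) as response:
        data = response.read()
    with open(path, "wb") as out:
        out.write(data)
    return len(data)

# Example call (needs network access), using the URL from the first post:
# save_book("http://www.gutenberg.org/files/29270/29270.txt", "ww1.txt")
```

After that, every later run can read "ww1.txt" from disk without touching the network.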

Once you've done that, here is what your code might look like (I have explained it using inline comments).

# open the saved book for reading
filehandler = open("ww1.txt", "r")

# start a counter variable for each country.
# Inside the 'for' loop below, the counter increases every time the word is found.
germany_counter = 0
holland_counter = 0
# add more countries' counters here

for line in filehandler:
    # each line is already a string, so count() can be used on it directly.
    # count() returns how many times the word occurs in the line (0 if it is not found AT ALL).
    germany_counter = germany_counter + line.count("Germany")
    holland_counter = holland_counter + line.count("Holland")  # same principle as above applies here

filehandler.close()

# when it's done reading all of the lines, print out the country's name and the counter
print("Germany %d" % germany_counter)
print("Holland %d" % holland_counter)

# obviously, you must add more counters and count() lines to the code for other countries.
# PLEASE NOTE: you will have to change the search string depending on what you're looking for. Do you want Prussia or Germany? Edit the search string to see the difference.
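
Rather than one counter variable per country, the same idea can be driven by a list of names. This is a rough sketch, assuming Python 3; the function name `count_names` and the example list are illustrative only. It tallies every occurrence with `str.count`, so a line that mentions a country twice adds two.

```python
from collections import Counter

def count_names(lines, names):
    """Count occurrences of each name across an iterable of text lines."""
    totals = Counter({name: 0 for name in names})
    for line in lines:
        for name in names:
            # str.count returns the number of non-overlapping occurrences;
            # it matches substrings too ("Germany" inside "Germany's").
            totals[name] += line.count(name)
    return totals

# Usage sketch, assuming the book was saved as ww1.txt:
# with open("ww1.txt") as book:
#     for name, n in count_names(book, ["Germany", "Holland"]).most_common():
#         print(name, n)
```

Adding a country then means adding one string to the list, not a new variable and a new branch.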
Dec 19 '12 #4

Expert 100+
P: 626
The original code is from June and the OP has not responded. It was either solved months ago or forgotten.
Dec 19 '12 #5
