Counting words from text file

Hi guys,

Very new to Python and was hoping you guys could give me some help.

I have a book about The Great War and want to count how many times each country appears in it. So far I have this:

Accessing the book

    >>> from __future__ import division
    >>> import nltk, re, pprint
    >>> from urllib import urlopen
    >>> url = "http://www.gutenberg.org/files/29270/29270.txt"
    >>> raw = urlopen(url).read()
    >>> type(raw)
    <type 'str'>
    >>> len(raw)
    1067008
    >>> raw[:75]
    'The Project Gutenberg EBook of The Story of the Great War, Volume II (of\r\nV'

Tokenizing

    >>> tokens = nltk.word_tokenize(raw)
    >>> type(tokens)
    <type 'list'>
    >>> len(tokens)
    189743
    >>> tokens[:10]
    ['The', 'Project', 'Gutenberg', 'EBook', 'of', 'The', 'Story', 'of', 'the', 'Great']

Slicing

    >>> text = nltk.Text(tokens)
    >>> type(text)
    <class 'nltk.text.Text'>
    >>> text[1020:1060]
    ['Battles', 'of', 'the', 'Polish', 'Campaign', '462', 'LXXX.', 'Winter', 'Battles', 'in', 'East', 'Prussia', '478', 'LXXXI.', 'Results', 'of', 'First', 'Six', 'Months', 'of', 'Russo-German', 'Campaign', '482', 'PART', 'VIII.--TURKEY', 'AND', 'THE', 'DARDANELLES', 'LXXXII.', 'First', 'Moves', 'of', 'Turkey', '493', 'LXXXIII.', 'The', 'First', 'Blow', 'Against', 'the']
    >>> text.collocations()
    Building collocations list
    General von; Project Gutenberg-tm; East Prussia; Von Kluck; von Kluck;
    General Staff; General Joffre; army corps; General Foch; crown prince;
    Project Gutenberg; von Buelow; Sir John; Third Army; right wing; Crown
    Prince; Field Marshal; Von Buelow; First Army; Army Corps

Correcting the start and ending

    >>> raw.find("PART I")
    2629
    >>> raw.rfind("End of the Project Gutenberg")
    1047663
    >>> raw = raw[2629:1047663]
    >>> raw.find("PART I")

For counting words I have this:

    import re

    def getWordFrequencies(text):
        # Map each word to the number of times it occurs in text.
        frequencies = {}
        for c in re.split(r'\W+', text):
            frequencies[c] = frequencies.get(c, 0) + 1
        return frequencies

    # Book must hold the book text as a string; it is not defined above.
    result = dict([(w, Book.count(w)) for w in Book.split()])
    for i in result.items():
        print("%s\t%d" % i)

Unfortunately, I have no idea how to feed the book into the word count. My ideal outcome would be something like this:

Germany 2000
United Kingdom 1500
USA 1000
Holland 50
Belgium 150


Please help!
Jun 5 '12 #1

Do you have a seed list of countries that you are going to look for? Are you going to look for "USA", "United States", "US of A", "United States of America", "America" as separate items? If so, you also need to be aware that "America" will be found whenever "United States of America" is found, and so on.
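
To make the overlap problem concrete, here is a minimal sketch of one way to handle it, written for Python 3 with only the standard library. The alias table, the function name count_countries and the sample sentence are all made up for illustration; the idea is simply to try longer names first so that each stretch of text is counted once.

```python
import re
from collections import Counter

# Hypothetical seed list: each alias maps to the name it is counted under.
ALIASES = {
    "United States of America": "USA",
    "United States": "USA",
    "USA": "USA",
    "America": "USA",
    "Germany": "Germany",
    "Holland": "Holland",
}

def count_countries(text, aliases=ALIASES):
    """Count country mentions, giving each stretch of text to one alias only."""
    # Longer aliases come first in the alternation, so "United States of
    # America" is matched as a whole and its "America" is not counted again.
    ordered = sorted(aliases, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(a) for a in ordered) + r")\b"
    )
    counts = Counter()
    for match in pattern.finditer(text):
        counts[aliases[match.group(0)]] += 1
    return counts

# Made-up sample text, not taken from the book.
sample = "The United States of America joined later. America and Germany were at war."
print(count_countries(sample))
```

In the sample, "United States of America" and the later standalone "America" each count once towards USA. Whether "America" should count towards the USA at all is a decision about the seed list, not something the code can settle.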

There tends to be a lot of messy details with real life data mining!!

Also, please use the code tags - it will make it much clearer to us - especially with Python where indents are important.

Anyway, you seem to have got some results so far - hopefully you can clarify your question a little more...
Jun 7 '12 #2
    def count_words(file_name):
        # Return the number of whitespace-separated words in the file.
        num_lines = 0  # lines read (tracked but not returned)
        num_words = 0
        with open(file_name, 'r') as f:
            for line in f:
                words = line.split()
                num_lines += 1
                num_words += len(words)
        return num_words

    # file_name must be set to the file's absolute path before this call.
    words_count = count_words(file_name)
Dec 14 '12 #3
I agree with Glenton. Real-life data analysis, especially finding specific words that match a given meaning, is complicated. However, I will assume that you only want to count the times "Germany" is found and NOT "Prussia". Either way, this difference should be made clear.
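
As a rough illustration of counting one exact name and nothing else, here is a small Python 3 sketch using word boundaries. The helper name count_exact and the sample sentence are invented for this example.

```python
import re

def count_exact(text, name):
    """Count whole-word occurrences of name in text."""
    # \b keeps "Germany" from matching inside a longer word, although
    # "Germany's" still counts because the apostrophe ends the word.
    return len(re.findall(r"\b" + re.escape(name) + r"\b", text))

# Made-up sample text, not taken from the book.
sample = "Germany mobilized. East Prussia was invaded. The Germans held Germany's border."
print(count_exact(sample, "Germany"))
print(count_exact(sample, "Prussia"))
```

Here "Germans" is not counted as "Germany", and "Prussia" is tallied separately, so you can decide later whether the two belong together.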

So far, your code is a little hard to read. Try adding more comments so that readers know what each part of your program is doing.

All criticism aside, here is what I would try, and why:

(Tell me if the code works or not!!!!!)

Before doing anything, just copy and paste the text from the online book onto a notepad document and save it as "ww1.txt" (quotes not included in name). That way, you can avoid any troubles that might arise by reading the file over the internet.

Once you've done that, here is roughly what your code might look like (I have explained it using inline comments).

    filehandler = open("ww1.txt", "r")

    # Start one counter variable per country before the loop.
    # Each counter grows by the number of times the name occurs in a line.
    germany_counter = 0
    holland_counter = 0
    # add more countries' counters here

    for line in filehandler:
        # line is already a string; count() returns how many times the name occurs in it
        germany_counter = germany_counter + line.count("Germany")
        holland_counter = holland_counter + line.count("Holland")  # same principle as above applies here

    filehandler.close()

    # when it's done reading all of the lines, print out the country's name and the counter
    print("Germany %d" % germany_counter)
    print("Holland %d" % holland_counter)

    # obviously, you must add more counters and count() lines to the code for other countries.
    # PLEASE NOTE: you will have to change the search string depending on what you're looking for. Do you want Prussia or Germany? Edit the search string to see the difference.
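
If the list of countries grows, the separate counter variables above can be folded into one dictionary. The following is only a sketch in Python 3; the function name count_mentions, the country list and the sample lines are assumptions for illustration, and line.count() is used so that a name appearing twice in one line is counted twice.

```python
def count_mentions(lines, countries):
    """Return {country: total occurrences} over an iterable of text lines."""
    counts = dict.fromkeys(countries, 0)
    for line in lines:
        for country in countries:
            counts[country] += line.count(country)
    return counts

# Hypothetical sample lines standing in for the contents of ww1.txt.
sample_lines = [
    "Germany invaded Belgium; Germany then turned east.\n",
    "Holland remained neutral.\n",
]
countries = ["Germany", "Holland", "Belgium"]
for name, total in count_mentions(sample_lines, countries).items():
    print("%s\t%d" % (name, total))
```

An open file can be passed in place of sample_lines, since iterating over a file yields its lines one at a time.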
Dec 19 '12 #4
The original code is from June and the OP has not responded. It was either solved months ago or forgotten.
Dec 19 '12 #5
