Bytes IT Community

How to remove words from text file?

P: 3
I am trying to compute some text statistics: word frequency, average word length, average sentence length, and average paragraph length. I have managed the word frequency and the average sentence and word lengths. Next, I need to preprocess the text file by removing some words, which are listed in another text file, and then compute my statistics. If someone can also tell me how to compute the average paragraph length, please do.
Any help is appreciated.
Nov 5 '10 #1

bvdet
Expert Mod 2.5K+
P: 2,851
What would be the definition of a paragraph? A blank line?

To eliminate words from another file, let's assume you have read the other file and split the words into a list (remove list). Let's also assume you have read in the file that you need statistics for and split the words into a list (stat list). Initialize a new list (keep list), iterate on the stat list, and if a word is not in the remove list, append to the keep list.

    >>> remove_list = ['a','b','c']
    >>> stat_list = ['a','a','1','x','f','t']
    >>> keep_list = []
    >>> for word in stat_list:
    ...     if word not in remove_list:
    ...         keep_list.append(word)
    ...
    >>> keep_list
    ['1', 'x', 'f', 't']
It can also be done with sets, but note that a set discards duplicates and ordering, so this only works when you need the distinct words rather than their counts.
    >>> keep_list = list(set(stat_list)-set(remove_list))
    >>> keep_list
    ['1', 'x', 't', 'f']
Give it a try and post back if you need more help.
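
As a minimal sketch of the list approach above applied to real files: the Python 3 snippet below keeps the removal words in a set purely for fast membership tests while still iterating over the word list, so duplicates and order survive for a later frequency count. The helper names `read_words` and `remove_words` are assumptions made for illustration, not anything from the posts.

```python
from collections import Counter


def read_words(path):
    """Return the lowercased, whitespace-separated words in a text file."""
    with open(path, encoding="utf-8") as f:
        # str.split() with no argument splits on any run of whitespace
        # and never yields empty strings.
        return f.read().lower().split()


def remove_words(stat_list, remove_list):
    """Return the words of stat_list that are not in remove_list.

    Order and duplicates are preserved, which word-frequency counts need.
    """
    remove_set = set(remove_list)  # set used only for fast lookups
    return [word for word in stat_list if word not in remove_set]


keep_list = remove_words(['a', 'a', '1', 'x', 'f', 't', 'x'], ['a', 'b', 'c'])
print(keep_list)                # ['1', 'x', 'f', 't', 'x']
print(Counter(keep_list)['x'])  # 2
```

With files, this would be called as `remove_words(read_words('text.txt'), read_words('remove.txt'))`, where both file names are placeholders.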
Nov 5 '10 #2

P: 3
First, thanks for the fast response. I made these changes:
    filename = 'Jay.txt'
    functionWords = 'function_words.txt'
    processedText=[]
    word_list = re.split('\s+', file(filename).read().lower())
    functionWordList = re.split('\s+', file(functionWords).read().lower())

    for word in word_list:
        if word not in functionWordList:
            processedText.append(word)
Then I got this error:
    Traceback (most recent call last):
      File "F:\Python24\word_count1", line 21, in -toplevel-
        if word not in functionWordList:
    TypeError: iterable argument required
Can you help me with that?
Nov 5 '10 #3

P: 3
I fixed it, but thanks for your help; I wouldn't have found it without you. Now, do you know how to compute the paragraph length? Paragraphs are usually separated by one or two newlines. I appreciate your help.
Nov 5 '10 #4

bvdet
Expert Mod 2.5K+
P: 2,851
Assume a paragraph is separated by a blank line. One way to do it would be to iterate on the file object as in:
    f = open("filename.txt")
    for line in f:
        ....
Strip the line (string method strip(), removing whitespace). If the line has no content, you have reached a new paragraph.
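
One way this could look in full, as a rough Python 3 sketch: it assumes paragraph length is measured in words (the thread does not say which unit is wanted), and the function name `paragraph_lengths` is made up for illustration. Runs of several blank lines are treated as a single separator.

```python
def paragraph_lengths(lines):
    """Return the number of words in each paragraph.

    `lines` is any iterable of strings, such as an open file object.
    A paragraph ends at a line that is empty after strip().
    """
    lengths = []
    current = 0  # words seen so far in the current paragraph
    for line in lines:
        stripped = line.strip()
        if stripped:
            current += len(stripped.split())
        elif current:
            # Blank line after some text: the paragraph is complete.
            lengths.append(current)
            current = 0
    if current:
        # The last paragraph may not be followed by a blank line.
        lengths.append(current)
    return lengths


sample = ["one two three\n", "four\n", "\n", "   \n", "five six\n"]
lengths = paragraph_lengths(sample)
print(lengths)                      # [4, 2]
print(sum(lengths) / len(lengths))  # 3.0
```

For a real file, pass the open file object, e.g. `with open("filename.txt") as f: lengths = paragraph_lengths(f)`, and check that `lengths` is not empty before dividing.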
Nov 5 '10 #5
