Bytes IT Community

How can I find the top word frequencies of combined files?

I need to take two files and print the most frequent words they have in common, along with their combined (summed) frequencies.
    def mostFrequent(word,frequency,n):
        my_list = zip(word,frequency) #combine the two lists
        my_list.sort(key=lambda x:x[1],reverse=True) #sort by freq
        words,freqs = zip(*my_list[:n]) #take the top n entries and split back into separate lists
        return words, freqs #return our most frequent words in order

    from wordFrequencies import * #gives both the word and its frequency in a file
    L1 = wordFrequencies('file1.txt')
    words1 = L1[0]
    freqs1 = L1[1]
    L2 = wordFrequencies('file2.txt')
    words2 = L2[0]
    freqs2 = L2[1]
    #words and freqs are not defined anywhere above; they stand for the common words and their summed frequencies
    print mostFrequent(words,freqs,20)
I've tried
    L1 = WordFrequencies('file1.txt')
    words1 = set(L1[0])
    freqs1 = set(L1[1])
    L2 = WordFrequencies('file2.txt')
    words2 = set(L2[0])
    freqs2 = set(L2[1])
    words3 = words1.intersection(words2)
    freqs3 = freqs1.intersection(freqs2)
    print mostFrequent(words3,freqs3,20)
but it didn't work. It output the wrong words.
Mar 8 '13 #1
1 Reply

We don't have the code for the function WordFrequencies(). It looks like it returns some kind of container holding each word and the number of times it was found. The answer depends on whether "L1" and "L2" (non-descriptive variable names don't tell us anything) are lists, sets, or dictionaries. Whichever it is, combine the two, then sort on the count (converted to an integer if it is not one already) so the container is in order of frequency. Then print however many words you want. As for the code you posted, I would suggest that you print words1 and freqs1, as I don't think they contain what you want.
    L1 = WordFrequencies('file1.txt')
    words1 = set(L1[0])
    freqs1 = set(L1[1])
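To make the "combine the two and sort" idea concrete, here is a minimal sketch in Python 3 (the code posted in this thread is Python 2). It assumes, without having seen WordFrequencies(), that it returns two parallel lists: the words, and a count for each word at the same position. The function name most_frequent_common and the sample data are made up for illustration. Note that set() is unordered and drops duplicates, so once the counts are turned into a set they no longer line up with the words; pairing each word with its count in a dictionary avoids that.

```python
def most_frequent_common(words1, freqs1, words2, freqs2, n):
    """Return the n most frequent words found in both inputs, with summed counts."""
    # Pair each word with its own count so the two never get separated.
    # int() covers the case where the counts were read in as strings.
    counts1 = dict(zip(words1, (int(f) for f in freqs1)))
    counts2 = dict(zip(words2, (int(f) for f in freqs2)))

    # Keep only the words that occur in both files and add their counts together.
    common = counts1.keys() & counts2.keys()
    combined = {w: counts1[w] + counts2[w] for w in common}

    # Sort by combined count, highest first; ties are broken alphabetically.
    ranked = sorted(combined.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]


# Made-up sample data standing in for the output of WordFrequencies().
words1, freqs1 = ["the", "cat", "sat", "mat"], [5, 2, 1, 1]
words2, freqs2 = ["the", "dog", "sat", "cat"], [4, 3, 2, 1]

for word, total in most_frequent_common(words1, freqs1, words2, freqs2, 20):
    print(word, total)
```

If WordFrequencies() turns out to return a dictionary or a list of (word, count) pairs instead, only the two dict(zip(...)) lines should need to change.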
Mar 8 '13 #2
