
Compare multiple files for common entries

P: 1
I apologize in advance: I'm teaching myself Python in my spare time because I was assigned this task. I am working on a way to examine a directory of thousands of files, looking for common entries. We have multiple cases; for each one, I have extracted telephone numbers from thousands of pieces of code and stored them in a separate folder per case. However, the files all have the same name, "telephone_histogram.txt". What I'm trying to do is compare the entire directory and produce a file that tells me whether a number appears in more than one file and, if so, how many files it appears in. The other problem is that each of the txt files has two columns, with the number in the second column. Here's what we have so far, but I've only been able to compare two files, not a whole directory:
    # Read every line from both files into one list called searchlines,
    # then sort it so identical entries end up next to each other.
    with open("folder1/telephone_histogram.txt", "r") as f:
        searchlines = f.readlines()
    with open("folder2/telephone_histogram.txt", "r") as f:
        searchlines = searchlines + f.readlines()
    searchlines.sort()

    # dupe holds the previous line and is compared against the next line.
    # dupe_count is the number of times that line has been seen so far.
    dupe = None
    dupe_count = 0
    # The trailing None is a sentinel so the last run of lines is reported too.
    for line in searchlines + [None]:
        if line == dupe:
            dupe_count += 1
        else:
            if dupe_count == 1:
                # Item is unique
                print(dupe.strip())
            elif dupe_count > 1:
                # Item is duplicated: print it preceded by the number of times it appeared
                # print("%d %s" % (dupe_count, dupe.strip()))
                pass
            dupe = line
            dupe_count = 1
If you can help, thank you so much in advance
Apr 2 '13 #1
1 Reply

bvdet
Expert Mod 2.5K+
P: 2,851
I would approach it like this:
  • Generate a list of files to read. os.walk() is ideal for this.
  • Initialize a dictionary. The phone numbers will be the keys and the counts will be the values.
  • Iterate over the files, updating the dictionary with each entry.
The dictionary methods get() and setdefault() can be used to increment the counts. Example:
    >>> dd = {}
    >>> key = '555-555-5555'
    >>> v = dd.get(key, 0)
    >>> dd[key] = v+1
    >>> key = '555-555-5556'
    >>> v = dd.setdefault(key, 0)
    >>> dd[key] += 1
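Putting the three steps together, a rough sketch might look like the following. It is only one way to do it and rests on a few assumptions: the columns are separated by whitespace, the phone number in the second column contains no spaces, and a number repeated within one file should count once for that file. The function names are made up for illustration.

```python
import os


def find_histograms(root, filename="telephone_histogram.txt"):
    """Yield the path of every file called `filename` under `root`."""
    for dirpath, dirnames, filenames in os.walk(root):
        if filename in filenames:
            yield os.path.join(dirpath, filename)


def count_files_per_number(paths):
    """Return a dict mapping each phone number to the number of files it appears in."""
    counts = {}
    for path in paths:
        seen = set()  # numbers found in this file, so repeats count once per file
        with open(path) as f:
            for line in f:
                columns = line.split()
                if len(columns) < 2:
                    continue  # skip blank or malformed lines
                seen.add(columns[1])  # assumption: number is the second column
        for number in seen:
            counts[number] = counts.get(number, 0) + 1
    return counts


def write_report(counts, out_path):
    """Write one 'number<TAB>file count' line for each number found in more than one file."""
    with open(out_path, "w") as out:
        for number, n in sorted(counts.items()):
            if n > 1:
                out.write("%s\t%d\n" % (number, n))
```

Assuming the case folders live under a directory named "cases" (a placeholder), you could call write_report(count_files_per_number(find_histograms("cases")), "common_numbers.txt") to produce the output file.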
Apr 2 '13 #2
