
Removing duplicate entries in a csv file using a python script

P: 4
I'm a beginner to Python. Could you tell me how I should proceed to remove duplicate rows in a CSV file?
Nov 19 '07 #1
10 Replies


KaezarRex
P: 52
If the order of the information in your CSV file doesn't matter, you could put each line of the file into a list, convert the list into a set, and then write the set back into the file. Converting the list to a set drops all duplicate elements.

    # Read every line of the file into a list.
    reader = open("file.csv", "r")
    lines = reader.read().splitlines()
    reader.close()

    # set() keeps one copy of each line but does not keep the original order.
    writer = open("file.csv", "w")
    for line in set(lines):
        writer.write(line + "\n")
    writer.close()
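For anyone trying this on Python 3, here is a minimal sketch of the same set-based idea, written as a function over a list of lines so it can be tried without touching a file. The name `dedupe_unordered` is made up for illustration, and the result is sorted only to make the output repeatable, since a set has no defined order.

```python
def dedupe_unordered(lines):
    """Return the distinct lines; the original order is not preserved."""
    # set() keeps one copy of each line; sorted() gives a repeatable order.
    return sorted(set(lines))


lines = ["b,2", "a,1", "b,2", "c,3", "a,1"]
print(dedupe_unordered(lines))  # ['a,1', 'b,2', 'c,3']
```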
Nov 19 '07 #2

bvdet
Expert Mod 2.5K+
P: 2,851
This code maintains the order of the data:
    >>> rows = open('data.txt').read().split('\n')
    >>> newrows = []
    >>> for row in rows:
    ...     if row not in newrows:
    ...         newrows.append(row)
    ...
    >>> f = open('data1.txt', 'w')
    >>> f.write('\n'.join(newrows))
    >>> f.close()
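One thing to keep in mind: the `row not in newrows` test rescans the whole list for every row, which can get slow on large files. Below is a sketch of the same order-preserving idea that uses a set for the membership check, assuming Python 3. The name `dedupe_keep_order` and the sample data are only illustrative, not from this thread.

```python
def dedupe_keep_order(lines):
    """Return lines with later duplicates dropped, keeping first-seen order."""
    seen = set()      # fast membership checks
    unique = []       # preserves the order in which lines first appear
    for line in lines:
        if line not in seen:
            seen.add(line)
            unique.append(line)
    return unique


rows = "x,1\ny,2\nx,1\nz,3\ny,2".split("\n")
print("\n".join(dedupe_keep_order(rows)))
```

This works here because each line is a string, and strings can be stored in a set.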
Nov 19 '07 #3

KaezarRex
P: 52
Here is another way to solve your problem using bvdet's method and the csv module.

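    import csv

    # Python 2 idiom: the csv module wants files opened in binary mode.
    infile = open("file.csv", "rb")
    newrows = []
    for row in csv.reader(infile):
        if row not in newrows:  # keep only the first copy of each row
            newrows.append(row)
    infile.close()

    # Reading is finished, so the same file can now be overwritten.
    outfile = open("file.csv", "wb")
    csv.writer(outfile).writerows(newrows)
    outfile.close()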
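On Python 3 the csv module works with text-mode files opened with `newline=""` rather than "rb"/"wb". Here is a rough sketch of the same approach under that assumption. It uses `io.StringIO` objects in place of real files so it runs on its own, and `dedupe_csv` is a hypothetical helper name.

```python
import csv
import io


def dedupe_csv(infile, outfile):
    """Copy CSV rows from infile to outfile, skipping repeated rows."""
    seen = set()
    writer = csv.writer(outfile)
    for row in csv.reader(infile):
        key = tuple(row)  # rows are lists; a tuple copy can go in a set
        if key not in seen:
            seen.add(key)
            writer.writerow(row)


# With real files, open both in text mode with newline="" and write to a
# different file than the one being read.
source = io.StringIO("id,name\n1,ann\n2,bob\n1,ann\n", newline="")
target = io.StringIO(newline="")
dedupe_csv(source, target)
print(target.getvalue())
```

Because rows are written as they are read, the output has to go to a separate file and not back into the input.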
Nov 19 '07 #4

P: 2
From the above code, when I use set(rows), I get the error 'list objects are unhashable'. I thought a list was hashable (by the hashq module). Why am I getting this error? Please explain.
Nov 20 '07 #5

bvdet
Expert Mod 2.5K+
P: 2,851
This error indicates you are attempting to create a set from objects that are not hashable. A set is a collection of unique hashable objects, and Python's built-in mutable containers, such as lists, are not hashable. csv.reader() returns a reader object that yields each row as a list of strings, so set(rows) tries to hash lists and fails. KaezarRex's earlier example in this thread applied set() to a list of strings. A string is immutable and hashable, so that worked.
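A small sketch of the difference, assuming Python 3 and some made-up rows shaped like what csv.reader yields. Converting each row to a tuple is one common workaround, not the only one.

```python
rows = [["1", "ann"], ["2", "bob"], ["1", "ann"]]

try:
    set(rows)  # each row is a list, so this raises TypeError
except TypeError as exc:
    print("set(rows) failed:", exc)

# Tuples are hashable, so converting each row first makes set() work.
unique_rows = set(tuple(row) for row in rows)
print(len(unique_rows))  # 2
```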
Nov 20 '07 #7

P: 4
Hey, I used KaezarRex's csv code and was able to remove the duplicate entries. Thanks. This CSV file is actually generated by Java code. If that code is modified, the output should remain the same. To check this, I found that the files need to be sorted in some fixed order before they can be compared (since the Java code selects the rows in random order). Could you tell me how to sort the contents, for example with priority Column 5, Column 8, Column 1? Is it possible to sort the newrows list before writing it?
Nov 23 '07 #8

P: 4
Hey, thanks for your reply. I used the approach KaezarRex suggested. Could you look at my previous post and tell me what you suggest?
Nov 23 '07 #9

bvdet
Expert Mod 2.5K+
P: 2,851
I am in Python 2.3. Define a comparison function to pass to the list sort method:
    def comp581(a, b):
        # Indices are zero-based: 5, 8 and 1 are the 6th, 9th and 2nd fields.
        x = cmp(a[5], b[5])
        if not x:
            # Tie on index 5, so fall back to index 8, then index 1.
            y = cmp(a[8], b[8])
            if not y:
                return cmp(a[1], b[1])
            return y
        return x

    yourList.sort(comp581)
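Note for Python 3 users: the `cmp` built-in and the positional comparison argument to `sort` were removed there. If a comparison function is still wanted, `functools.cmp_to_key` can wrap it. The sketch below assumes Python 3. It defines its own `cmp` stand-in, and the sample rows are invented for the example.

```python
from functools import cmp_to_key


def cmp(a, b):
    """Stand-in for the Python 2 built-in: returns -1, 0 or 1."""
    return (a > b) - (a < b)


def comp581(a, b):
    # Compare index 5 first, then index 8, then index 1.
    return cmp(a[5], b[5]) or cmp(a[8], b[8]) or cmp(a[1], b[1])


# Ten-field rows made up for this example; only indices 1, 5 and 8 matter.
rows = [
    ["r0", "b", "", "", "", "y", "", "", "m", ""],
    ["r1", "a", "", "", "", "y", "", "", "m", ""],
    ["r2", "z", "", "", "", "x", "", "", "n", ""],
]
rows.sort(key=cmp_to_key(comp581))
print([row[0] for row in rows])  # ['r2', 'r1', 'r0']
```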
In Python 2.4:
    yourList.sort(key=lambda i: (i[5], i[8], i[1]))
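One caveat worth sketching: csv.reader yields every field as a string, so a key like the one above compares text, and "10" sorts before "9". If a column holds numbers, converting it inside the key is one option. The rows below are invented for illustration, and `operator.itemgetter` is just an alternative spelling of the same tuple key. Python 3 is assumed.

```python
from operator import itemgetter

# Ten-field rows made up for this example; index 5 holds numeric text.
rows = [
    ["r0", "b", "", "", "", "10", "", "", "m", ""],
    ["r1", "a", "", "", "", "9", "", "", "m", ""],
    ["r2", "a", "", "", "", "10", "", "", "k", ""],
]

# Plain string comparison on indices 5, 8, 1: "10" sorts before "9".
by_text = sorted(rows, key=itemgetter(5, 8, 1))

# Converting index 5 to int sorts that column numerically instead.
by_number = sorted(rows, key=lambda r: (int(r[5]), r[8], r[1]))

print([r[0] for r in by_text])    # ['r2', 'r0', 'r1']
print([r[0] for r in by_number])  # ['r1', 'r2', 'r0']
```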
Nov 23 '07 #10

P: 4
Thanks a lot. I'm able to sort the list now. (The version I'm using is 2.5.1, so I used the lambda approach.)
Nov 26 '07 #11
