423,850 Members | 1,661 Online
Bytes IT Community
+ Ask a Question
Need help? Post your question and get tips & solutions from a community of 423,850 IT Pros & Developers. It's quick & easy.

Ignoring similar entries in a column of many rows

P: 22
Hi everyone;

I have a file with one column the elements of which are separated by commas (,). The column has many rows, as in below:

45,56,890
67,92,4502
76,367,89
67,92,4502
67,92,4502
92,14,05
02,56,125
25,02,61
02,56,125
.
.
.

I want to go through all the rows in the column ignoring the similar rows and put the result in an output file. For example I want to count 67,92,4502 once and ignore the other similar ones (rows 4 and 5 are the same).

I hope I have explained the problem clearly.
Thanks in advance, Atrisa
Sep 17 '10 #1

✓ answered by bvdet

See if this works for you:
Expand|Select|Wrap|Line Numbers
  1. outfile = open('capture25000-column3.txt', 'w')
  2.  
  3. f = open('capture25000.txt')
  4. output = []
  5. for line in f:
  6.     column3 = line.strip().split()[2].split(".")[-1]
  7.     if column3 not in output:
  8.         output.append(column3)
  9.  
  10. f.close()
  11. outfile.write("\n".join(output))
  12. outfile.close()

Share this Question
Share on Google+
20 Replies


bvdet
Expert Mod 2.5K+
P: 2,851
This can easily be accomplished with the in operator:
Expand|Select|Wrap|Line Numbers
  1. s = '''45,56,890
  2. 67,92,4502
  3. 76,367,89
  4. 67,92,4502
  5. 67,92,4502
  6. 92,14,05
  7. 02,56,125
  8. 25,02,61
  9. 02,56,125'''
  10.  
  11. # create list of lines
  12. lineList = s.split("\n")
  13. output = []
  14. # simulate iteration on file object or list of lines
  15. for line in lineList:
  16.     if line not in output:
  17.         output.append(line)
  18. print "\n".join(output)
Output:
>>> 45,56,890
67,92,4502
76,367,89
92,14,05
02,56,125
25,02,61
>>>
Sep 17 '10 #2

Expert 100+
P: 599
If you have a large amount of data, use sets, i.e each group of three numbers would become a tuple added to a set. Sets don't add duplicates but do alter the order. If you want to retain the original file order, then use an OrderedDict for a large data set. If bvdet's solution runs in a reasonable amount of time then there is no reason to change. Post back if you want more info on either alternate solution.
Sep 17 '10 #3

P: 22
Thanks a lot both of you. I need to use bvdet's code like in the following:

Expand|Select|Wrap|Line Numbers
  1. outfile = open('capture25000-column3.txt', 'w')
  2.  
  3. for (i, eachline) in enumerate(open('capture25000.txt','r')):
  4.  column3 = eachline.split(' ')[2]
  5.  newcolumn = column3.split('.')[-1]
  6.  outfile.write(newcolumn + '\n')
  7.  output = []
  8.  lineList = newcolumn.split("\n")
  9.  for line in lineList:
  10.   if line not in output:
  11.    output.append(line)
  12.  print "\n".join(output)
  13.  
Basically what I have is an input file of 25000 rows, 'capture25000.txt'. I extract the column that I need from this file and put that column in another file 'capture25000-column3.txt'. So how do I use this new output file in bvdet's code? I have done the way above, but it again prints the same values as in the 'newcolumn' list or 'capture25000-column3.txt' file. I want to get rid of the similar lines in the 'capture25000-column3.txt' file.
Sep 20 '10 #4

bvdet
Expert Mod 2.5K+
P: 2,851
Judging from your code, it appears that the file format is not like the sample data in your first post. Could you post a representative sample of your actual data?
Sep 20 '10 #5

P: 22
The actual data which is in a file called 'capture25000-column3.txt' is a set of port numbers like this:

52141
21283
43439
21283
21283
23479
80
10464
80
80
21283
43439
80
10464
10464
21283
.
.
.

and there are 25000 rows of those numbers. If you need all of them, I could send it to your email.
Sep 20 '10 #6

bvdet
Expert Mod 2.5K+
P: 2,851
So that's supposed to be the output? What does the corresponding data look like?
Sep 20 '10 #7

P: 22
I first have an input file that has captured data. I need only the third column from that file, so I did it this way:
Expand|Select|Wrap|Line Numbers
  1. # for (i, eachline) in enumerate(open('capture25000.txt','r')):
  2. #  column3 = eachline.split(' ')[2]
Then I get the list 'column3' which looks like this:

27.20.29.12.21283
27.20.29.12.21283
65.24.112.69.23479
27.120.98.21.80
32.16.45.74.10464
56.176.45.74.10464
80.290.20.12.21283
45.85.178.252.28302
80.91.188.120.1328
27.144.102.81.13696
62.244.260.6.33344
54.57.14.68.80
200.97.17.48.80

There 2500 rows of those. Then I use the following piece of code:
Expand|Select|Wrap|Line Numbers
  1. #  newcolumn = column3.split('.')[-1]
  2.  
to get only the port numbers as in the following:

21283
21283
23479
80
10464
1046
1283
28302
1328
13696
33344
80
80
.
.
.

As you notice there are more than one occurrence of every port number. I want to leave only the first occurrence of any number and get rid of the rest, which your code is supposed to do. But since I have those port numbers in an external file, I didn't know how to use your code for my purpose.

Thanks a lot again for your time
Sep 20 '10 #8

bvdet
Expert Mod 2.5K+
P: 2,851
See if this works for you:
Expand|Select|Wrap|Line Numbers
  1. outfile = open('capture25000-column3.txt', 'w')
  2.  
  3. f = open('capture25000.txt')
  4. output = []
  5. for line in f:
  6.     column3 = line.strip().split()[2].split(".")[-1]
  7.     if column3 not in output:
  8.         output.append(column3)
  9.  
  10. f.close()
  11. outfile.write("\n".join(output))
  12. outfile.close()
Sep 20 '10 #9

P: 22
Thanks a lot bvdet. That worked well:)
Sep 21 '10 #10

P: 22
I also want to count how many times the port numbers appeared in the 'column3' and put the port number and beside it the number of appearances of that port in a file, so that I have something like this:

21283 3
21283 2
23479 1
80 7
10464 2

and so on. This is my code that tries to do that, but it gives me only 1 appearance for each port:

Expand|Select|Wrap|Line Numbers
  1.  
  2. outfile = open('capture25000-column3.txt', 'w')
  3.  
  4. f = open('capture25000.txt')
  5.  
  6. output = []
  7. for line in f:
  8.    column3 = line.strip().split()[2].split(".")[-1].split("\n")
  9.    for val in column3:   
  10.       if not val in output:
  11.          print val, column3.count(val)
  12.          output.append(val)
  13.  
  14. f.close()
  15. outfile.write("\n".join(output))  
  16. outfile.close()
  17.  
  18.  
It seems to count the appearances from the file where I have only the unique numbers, so basically each of those appear for once there. What am I missing in the above code?
Sep 21 '10 #11

bvdet
Expert Mod 2.5K+
P: 2,851
Variable column3 is a string representing a number. Splitting column3 on the "\n" character does nothing except return a one element list. column3 is only added to the output list one time, so it is only counted once.

I would recommend storing the data in a dictionary. The port numbers would be the dictionary keys and the counts would be the values.
Expand|Select|Wrap|Line Numbers
  1. outfile = open('capture25000-column3.txt', 'w')
  2.  
  3. f = open('capture25000.txt')
  4. # initialize a dictionary
  5. dd = {}
  6. # iterate on the file object
  7. for line in f:
  8.     # get the port number from the third item in line
  9.     column3 = line.strip().split()[2].split(".")[-1]
  10.     # if column3 not in dd, add to dd and set quantity to 0
  11.     dd.setdefault(column3, 0)
  12.     # increment dd[port number] by one
  13.     dd[column3] += 1
  14.  
  15. f.close()
  16. # write the dictionary to disk
  17. outfile.write("\n".join(["%s %s" % (key, dd[key]) for key in dd]))
  18. outfile.close()
Sep 21 '10 #12

P: 22
Thanks a lot bvdet for your help and your time. That works perfectly. It's two weeks that I have started learning Python, that's why I am a bit lost.
Sep 21 '10 #13

P: 22
Now I have created the following method in the code that you did earlier. The method creates a dictionary that has the port numbers and their corresponding port names:

Expand|Select|Wrap|Line Numbers
  1. def get_port_name (name, port_table):
  2. # ports.txt is a file where I have put the IANA PORT NUMBERS
  3.     f= open('ports.txt','r')
  4.     content=f.readline()
  5.     keycount=1
  6.     while content:
  7.         key = str(keycount)
  8.         line1=content.split()   
  9.         if len(line1)>3 and line1[0]!="#":
  10.             port=line1[1].split('/')
  11.             if len(port)==2:
  12.                 port_table[int(port[0])]={'keycode':line1[0],'description':line1[2]}
  13.         keycount +=1
  14.         content=f.readline()
  15.  
  16. ports_name={}
  17. get_port_name ('Ports', ports_name)
  18. print ports_name;
  19.  
A sample of the result of the code above is like this:

-----------------------------------------------------
80: {'keycode': 'www-http', 'description': 'World'},
82: {'keycode': 'xfer', 'description': 'XFER'},
83: {'keycode': 'mit-ml-dev', 'description': 'MIT'},
---------------------------------------------------

e.g. the first line shows port 80 with the name being 'www-http'. I want to grab those keycode values and put them in the 3rd column of the output file that you gave earlier and have a result like this for all the port numbers:

80 115 HTTP
443 52 HTTPS

and so on, with the following line:
Expand|Select|Wrap|Line Numbers
  1. outfile.write("\n".join(["%s %s %s" % (key, dd[key], port_table['keycode'][key]) for key in dd]))
  2.  
but it doesn't work. How to use the method here for the output? I am yet well familiar with the syntax of Python methods I guess.
Sep 21 '10 #14

bvdet
Expert Mod 2.5K+
P: 2,851
It may be as simple as:
Expand|Select|Wrap|Line Numbers
  1. port_table[key]['keycode']
Sep 21 '10 #15

P: 22
Unfortunately that didn't work either. I changed few lines in the previously sent code this way:
Expand|Select|Wrap|Line Numbers
  1. .
  2. .
  3. .
  4. port_table[port[0]]=line1[0]
  5.         keycount +=1
  6.         content=f.readline()
  7.  
  8. ports_name={}
  9. get_port_name ('Ports', ports_name)
  10.  
and now the dictionary looks like this:
-----------------------------------------------
{'8732': 'dtp-net', '4026': 'as-debug', '4448': 'asc-slmd', '4024': 'tnp1-port', '4025': 'partimage', '38203': 'agpolicy', '4023': 'esnm-zoning', '4020': 'trap', '4021': 'nexus-portal', '9418': 'git', '4028': 'dtserver-port', '4029': 'ip-qsig'}
-----------------------------------------------
The output line is like this:

Expand|Select|Wrap|Line Numbers
  1. outfile.write("\n".join(["%s %s %s" % (key, port_dic[key], ports_name[key]) for key in port_dic]))
  2.  
but it gives this error message: KeyError: '62084', which doesn't tell me anything.
Sep 22 '10 #16

bvdet
Expert Mod 2.5K+
P: 2,851
"KeyError" indicates that you are attempting to retrieve a dictionary value using a dictionary key that is not in the dictionary. In order for the script to work, both dictionaries must have the same keys. The alternative would be to create one dictionary instead. Assuming the 'keycode' is in column 4 and the 'description' is in column 5:
Expand|Select|Wrap|Line Numbers
  1. outfile = open('capture25000-column3.txt', 'w')
  2.  
  3. f = open('capture25000.txt')
  4. # initialize a dictionary
  5. dd = {}
  6. # iterate on the file object
  7. for line in f:
  8.     # get the port number from the third item in line
  9.     lineList = line.strip().split()
  10.     column3 = lineList[2].split(".")[-1]
  11.     keycode = lineList[3]
  12.     description = lineList[4]
  13.     # if column3 not in dd, add to dd and set quantity to 0
  14.     # only first occurring value of keycode and description will be saved
  15.     dd.setdefault(column3, [0, keycode, description])
  16.     # increment dd[port number][0] by one
  17.     dd[column3][0] += 1
  18.  
  19. f.close()
  20. # write the dictionary to disk
  21. outfile.write("\n".join(["%s %s %s" % (key, dd[key][0], dd[key][1]) for key in dd]))
  22. outfile.close()
Sep 22 '10 #17

P: 22
Now I get the port description in the third column of the output file 'capture25000-column3.txt', but each description is repeated for few times, not just once as the port numbers are. Some output:

25910 1 pangolin-laser
55149 1 pangolin-laser
4024 4 www-http
13734 2 www-http
2451 2 www-http
55617 6 iapp
61510 1 iapp

which is not correct. The few last lines of the following code is supposed to do that, but doesn't work properly. I guess something I have done wrong with looping?

Expand|Select|Wrap|Line Numbers
  1. def get_port_name (name, port_table):
  2.  
  3.     pf= open('ports.txt','r')
  4.     content=pf.readline()
  5.     keycount=1
  6.     while content:
  7.         line1=content.split()  
  8.         if len(line1)>3 and line1[0]!="#":
  9.             port=line1[1].split('/')
  10.             port_table[port[0]]=line1[0]
  11.         keycount +=1
  12.         content=pf.readline()
  13.  
  14. # initialize a dictionary
  15. dd = {}
  16.  
  17. # iterate on the file object
  18. for line in f:
  19.     # get the port number from the third item in line
  20.     column3 = line.strip().split()[2].split(".")[-1]
  21.     # if column3 not in dd, add to dd and set quantity to 0
  22.     dd.setdefault(column3, 0)
  23.     # increment dd[port number] by one
  24.     dd[column3] += 1
  25.  
  26. ports_name={}
  27. get_port_name ('Ports', ports_name)
  28. # if the key in dd equals the key (item) in ports_name, insert the value of that item to the third column in the output file:
  29. for key in dd.keys():
  30.     for item in ports_name.keys():
  31.         if key == item:
  32.         outfile.write("\n".join(["%s %s %s" % (key, dd[key], ports_name[item]) for key in dd]))            
  33.  
  34. #f.close()
  35. #outfile.write("\n".join(["%s %s %s" % (key, dd[key], ports_name[item]) for key in dd]))
  36. #outfile.close()
  37.  
Sep 23 '10 #18

bvdet
Expert Mod 2.5K+
P: 2,851
Your code doesn't look anything like the code I posted. It looks like you are reading the data file twice.
Sep 23 '10 #19

P: 22
Finally these are the line that did it:

Expand|Select|Wrap|Line Numbers
  1. for key in dd.keys():
  2.     for item in ports_name.keys():
  3.         if key == item:
  4.             outfile.write("\n".join(["%s %s %s" % (key, dd[key], ports_name[item])])+ "\n")  
  5.  
  6.  
Thanks bvdet for your help.
Sep 24 '10 #20

bvdet
Expert Mod 2.5K+
P: 2,851
You are welcome, and I'm glad that you got it working!

BV
Sep 24 '10 #21

Post your reply

Sign in to post your reply or Sign up for a free account.