Ignoring similar entries in a column of many rows

Hi everyone;

I have a file with one column the elements of which are separated by commas (,). The column has many rows, as in below:

45,56,890
67,92,4502
76,367,89
67,92,4502
67,92,4502
92,14,05
02,56,125
25,02,61
02,56,125
.
.
.

I want to go through all the rows in the column ignoring the similar rows and put the result in an output file. For example I want to count 67,92,4502 once and ignore the other similar ones (rows 4 and 5 are the same).

I hope I have explained the problem clearly.
Thanks in advance, Atrisa

Sep 17 '10 #1

Subscribe Post Reply

✓ answered by bvdet

See if this works for you:

Expand|Select|Wrap|Line Numbers

 outfile = open('capture25000-column3.txt', 'w')
 
f = open('capture25000.txt')

output = []

for line in f:

    column3 = line.strip().split()[2].split(".")[-1]

    if column3 not in output:

        output.append(column3)
 
f.close()

outfile.write("\n".join(output))

outfile.close()

1641

bvdet

2,851

Expert Mod 2GB

This can easily be accomplished with the in operator:

Expand|Select|Wrap|Line Numbers

 s = '''45,56,890

67,92,4502

76,367,89

67,92,4502

67,92,4502

92,14,05

02,56,125

25,02,61

02,56,125'''
 
# create list of lines

lineList = s.split("\n")

output = []

# simulate iteration on file object or list of lines

for line in lineList:

    if line not in output:

        output.append(line)

print "\n".join(output)

Output:
>>> 45,56,890
67,92,4502
76,367,89
92,14,05
02,56,125
25,02,61
>>>

Sep 17 '10 #2

dwblas

626

Expert 512MB

If you have a large amount of data, use sets, i.e each group of three numbers would become a tuple added to a set. Sets don't add duplicates but do alter the order. If you want to retain the original file order, then use an OrderedDict for a large data set. If bvdet's solution runs in a reasonable amount of time then there is no reason to change. Post back if you want more info on either alternate solution.

Sep 17 '10 #3

Atrisa

Thanks a lot both of you. I need to use bvdet's code like in the following:

Expand|Select|Wrap|Line Numbers

 
outfile = open('capture25000-column3.txt', 'w')
 
for (i, eachline) in enumerate(open('capture25000.txt','r')):

 column3 = eachline.split(' ')[2]

 newcolumn = column3.split('.')[-1]

 outfile.write(newcolumn + '\n')

 output = []

 lineList = newcolumn.split("\n")

 for line in lineList:

  if line not in output:

   output.append(line)

 print "\n".join(output)

Basically what I have is an input file of 25000 rows, 'capture25000.txt'. I extract the column that I need from this file and put that column in another file 'capture25000-column3.txt'. So how do I use this new output file in bvdet's code? I have done the way above, but it again prints the same values as in the 'newcolumn' list or 'capture25000-column3.txt' file. I want to get rid of the similar lines in the 'capture25000-column3.txt' file.

Sep 20 '10 #4

bvdet

2,851

Expert Mod 2GB

Judging from your code, it appears that the file format is not like the sample data in your first post. Could you post a representative sample of your actual data?

Sep 20 '10 #5

Atrisa

The actual data which is in a file called 'capture25000-column3.txt' is a set of port numbers like this:

52141
21283
43439
21283
21283
23479
80
10464
80
80
21283
43439
80
10464
10464
21283
.
.
.

and there are 25000 rows of those numbers. If you need all of them, I could send it to your email.

Sep 20 '10 #6

bvdet

2,851

Expert Mod 2GB

So that's supposed to be the output? What does the corresponding data look like?

Sep 20 '10 #7

Atrisa

I first have an input file that has captured data. I need only the third column from that file, so I did it this way:

Expand|Select|Wrap|Line Numbers

 # for (i, eachline) in enumerate(open('capture25000.txt','r')):

#  column3 = eachline.split(' ')[2]

Then I get the list 'column3' which looks like this:

27.20.29.12.21283
27.20.29.12.21283
65.24.112.69.23479
27.120.98.21.80
32.16.45.74.10464
56.176.45.74.10464
80.290.20.12.21283
45.85.178.252.28302
80.91.188.120.1328
27.144.102.81.13696
62.244.260.6.33344
54.57.14.68.80
200.97.17.48.80

There 2500 rows of those. Then I use the following piece of code:

Expand|Select|Wrap|Line Numbers

# newcolumn = column3.split('.')[-1]

to get only the port numbers as in the following:

21283
21283
23479
80
10464
1046
1283
28302
1328
13696
33344
80
80
.
.
.

As you notice there are more than one occurrence of every port number. I want to leave only the first occurrence of any number and get rid of the rest, which your code is supposed to do. But since I have those port numbers in an external file, I didn't know how to use your code for my purpose.

Thanks a lot again for your time

Sep 20 '10 #8

bvdet

2,851

Expert Mod 2GB

See if this works for you:

Expand|Select|Wrap|Line Numbers

 outfile = open('capture25000-column3.txt', 'w')
 
f = open('capture25000.txt')

output = []

for line in f:

    column3 = line.strip().split()[2].split(".")[-1]

    if column3 not in output:

        output.append(column3)
 
f.close()

outfile.write("\n".join(output))

outfile.close()

Sep 20 '10 #9

Atrisa

Thanks a lot bvdet. That worked well:)

Sep 21 '10 #10

Atrisa

I also want to count how many times the port numbers appeared in the 'column3' and put the port number and beside it the number of appearances of that port in a file, so that I have something like this:

21283 3
21283 2
23479 1
80 7
10464 2

and so on. This is my code that tries to do that, but it gives me only 1 appearance for each port:

Expand|Select|Wrap|Line Numbers

  
outfile = open('capture25000-column3.txt', 'w')
 
f = open('capture25000.txt')
 
output = []

for line in f:

   column3 = line.strip().split()[2].split(".")[-1].split("\n")

   for val in column3:   

      if not val in output:

         print val, column3.count(val)

         output.append(val)
 
f.close()

outfile.write("\n".join(output))  

outfile.close()

It seems to count the appearances from the file where I have only the unique numbers, so basically each of those appear for once there. What am I missing in the above code?

Sep 21 '10 #11

bvdet

2,851

Expert Mod 2GB

Variable column3 is a string representing a number. Splitting column3 on the "\n" character does nothing except return a one element list. column3 is only added to the output list one time, so it is only counted once.

I would recommend storing the data in a dictionary. The port numbers would be the dictionary keys and the counts would be the values.

Expand|Select|Wrap|Line Numbers

 outfile = open('capture25000-column3.txt', 'w')
 
f = open('capture25000.txt')

# initialize a dictionary

dd = {}

# iterate on the file object

for line in f:

    # get the port number from the third item in line

    column3 = line.strip().split()[2].split(".")[-1]

    # if column3 not in dd, add to dd and set quantity to 0

    dd.setdefault(column3, 0)

    # increment dd[port number] by one

    dd[column3] += 1
 
f.close()

# write the dictionary to disk

outfile.write("\n".join(["%s %s" % (key, dd[key]) for key in dd]))

outfile.close()

Sep 21 '10 #12

Atrisa

Thanks a lot bvdet for your help and your time. That works perfectly. It's two weeks that I have started learning Python, that's why I am a bit lost.

Sep 21 '10 #13

Atrisa

Now I have created the following method in the code that you did earlier. The method creates a dictionary that has the port numbers and their corresponding port names:

Expand|Select|Wrap|Line Numbers

 
def get_port_name (name, port_table):

# ports.txt is a file where I have put the IANA PORT NUMBERS

    f= open('ports.txt','r')

    content=f.readline()

    keycount=1

    while content:

        key = str(keycount)

        line1=content.split()   

        if len(line1)>3 and line1[0]!="#":

            port=line1[1].split('/')

            if len(port)==2:

                port_table[int(port[0])]={'keycode':line1[0],'description':line1[2]}

        keycount +=1

        content=f.readline()
 
ports_name={}

get_port_name ('Ports', ports_name)

print ports_name;

A sample of the result of the code above is like this:

-----------------------------------------------------
80: {'keycode': 'www-http', 'description': 'World'},
82: {'keycode': 'xfer', 'description': 'XFER'},
83: {'keycode': 'mit-ml-dev', 'description': 'MIT'},
---------------------------------------------------

e.g. the first line shows port 80 with the name being 'www-http'. I want to grab those keycode values and put them in the 3rd column of the output file that you gave earlier and have a result like this for all the port numbers:

80 115 HTTP
443 52 HTTPS

and so on, with the following line:

Expand|Select|Wrap|Line Numbers

 
outfile.write("\n".join(["%s %s %s" % (key, dd[key], port_table['keycode'][key]) for key in dd]))

but it doesn't work. How to use the method here for the output? I am yet well familiar with the syntax of Python methods I guess.

Sep 21 '10 #14

bvdet

2,851

Expert Mod 2GB

It may be as simple as:

Expand|Select|Wrap|Line Numbers

port_table[key]['keycode']

Sep 21 '10 #15

Atrisa

Unfortunately that didn't work either. I changed few lines in the previously sent code this way:

Expand|Select|Wrap|Line Numbers

 
.

.

.

port_table[port[0]]=line1[0]

        keycount +=1

        content=f.readline()
 
ports_name={}

get_port_name ('Ports', ports_name)

and now the dictionary looks like this:
-----------------------------------------------
{'8732': 'dtp-net', '4026': 'as-debug', '4448': 'asc-slmd', '4024': 'tnp1-port', '4025': 'partimage', '38203': 'agpolicy', '4023': 'esnm-zoning', '4020': 'trap', '4021': 'nexus-portal', '9418': 'git', '4028': 'dtserver-port', '4029': 'ip-qsig'}
-----------------------------------------------
The output line is like this:

Expand|Select|Wrap|Line Numbers

 
outfile.write("\n".join(["%s %s %s" % (key, port_dic[key], ports_name[key]) for key in port_dic]))

but it gives this error message: KeyError: '62084', which doesn't tell me anything.

Sep 22 '10 #16

bvdet

2,851

Expert Mod 2GB

"KeyError" indicates that you are attempting to retrieve a dictionary value using a dictionary key that is not in the dictionary. In order for the script to work, both dictionaries must have the same keys. The alternative would be to create one dictionary instead. Assuming the 'keycode' is in column 4 and the 'description' is in column 5:

Expand|Select|Wrap|Line Numbers

 outfile = open('capture25000-column3.txt', 'w')
 
f = open('capture25000.txt')

# initialize a dictionary

dd = {}

# iterate on the file object

for line in f:

    # get the port number from the third item in line

    lineList = line.strip().split()

    column3 = lineList[2].split(".")[-1]

    keycode = lineList[3]

    description = lineList[4]

    # if column3 not in dd, add to dd and set quantity to 0

    # only first occurring value of keycode and description will be saved

    dd.setdefault(column3, [0, keycode, description])

    # increment dd[port number][0] by one

    dd[column3][0] += 1
 
f.close()

# write the dictionary to disk

outfile.write("\n".join(["%s %s %s" % (key, dd[key][0], dd[key][1]) for key in dd]))

outfile.close()

Sep 22 '10 #17

Atrisa

Now I get the port description in the third column of the output file 'capture25000-column3.txt', but each description is repeated for few times, not just once as the port numbers are. Some output:

25910 1 pangolin-laser
55149 1 pangolin-laser
4024 4 www-http
13734 2 www-http
2451 2 www-http
55617 6 iapp
61510 1 iapp

which is not correct. The few last lines of the following code is supposed to do that, but doesn't work properly. I guess something I have done wrong with looping?

Expand|Select|Wrap|Line Numbers

 
def get_port_name (name, port_table):
 
    pf= open('ports.txt','r')

    content=pf.readline()

    keycount=1

    while content:

        line1=content.split()  

        if len(line1)>3 and line1[0]!="#":

            port=line1[1].split('/')

            port_table[port[0]]=line1[0]

        keycount +=1

        content=pf.readline()
 
# initialize a dictionary

dd = {}
 
# iterate on the file object

for line in f:

    # get the port number from the third item in line

    column3 = line.strip().split()[2].split(".")[-1]

    # if column3 not in dd, add to dd and set quantity to 0

    dd.setdefault(column3, 0)

    # increment dd[port number] by one

    dd[column3] += 1
 
ports_name={}

get_port_name ('Ports', ports_name)

# if the key in dd equals the key (item) in ports_name, insert the value of that item to the third column in the output file:

for key in dd.keys():

    for item in ports_name.keys():

        if key == item:

        outfile.write("\n".join(["%s %s %s" % (key, dd[key], ports_name[item]) for key in dd]))            
 
#f.close()

#outfile.write("\n".join(["%s %s %s" % (key, dd[key], ports_name[item]) for key in dd]))

#outfile.close()

Sep 23 '10 #18

bvdet

2,851

Expert Mod 2GB

Your code doesn't look anything like the code I posted. It looks like you are reading the data file twice.

Sep 23 '10 #19

Atrisa

Finally these are the line that did it:

Expand|Select|Wrap|Line Numbers

 
for key in dd.keys():

    for item in ports_name.keys():

        if key == item:

            outfile.write("\n".join(["%s %s %s" % (key, dd[key], ports_name[item])])+ "\n")

Thanks bvdet for your help.

Sep 24 '10 #20

bvdet

2,851

Expert Mod 2GB

You are welcome, and I'm glad that you got it working!

BV

Sep 24 '10 #21

Ignoring similar entries in a column of many rows

✓ answered by bvdet

Similar topics