Hi everyone;
I have a file with one column the elements of which are separated by commas (,). The column has many rows, as in below:
45,56,890
67,92,4502
76,367,89
67,92,4502
67,92,4502
92,14,05
02,56,125
25,02,61
02,56,125
.
.
.
I want to go through all the rows in the column ignoring the similar rows and put the result in an output file. For example I want to count 67,92,4502 once and ignore the other similar ones (rows 4 and 5 are the same).
I hope I have explained the problem clearly.
Thanks in advance, Atrisa
See if this works for you: - outfile = open('capture25000-column3.txt', 'w')
-
-
f = open('capture25000.txt')
-
output = []
-
for line in f:
-
column3 = line.strip().split()[2].split(".")[-1]
-
if column3 not in output:
-
output.append(column3)
-
-
f.close()
-
outfile.write("\n".join(output))
-
outfile.close()
20 1641 bvdet 2,851
Expert Mod 2GB
This can easily be accomplished with the in operator: - s = '''45,56,890
-
67,92,4502
-
76,367,89
-
67,92,4502
-
67,92,4502
-
92,14,05
-
02,56,125
-
25,02,61
-
02,56,125'''
-
-
# create list of lines
-
lineList = s.split("\n")
-
output = []
-
# simulate iteration on file object or list of lines
-
for line in lineList:
-
if line not in output:
-
output.append(line)
-
print "\n".join(output)
Output:
>>> 45,56,890
67,92,4502
76,367,89
92,14,05
02,56,125
25,02,61
>>>
If you have a large amount of data, use sets, i.e each group of three numbers would become a tuple added to a set. Sets don't add duplicates but do alter the order. If you want to retain the original file order, then use an OrderedDict for a large data set. If bvdet's solution runs in a reasonable amount of time then there is no reason to change. Post back if you want more info on either alternate solution.
Thanks a lot both of you. I need to use bvdet's code like in the following: -
outfile = open('capture25000-column3.txt', 'w')
-
-
for (i, eachline) in enumerate(open('capture25000.txt','r')):
-
column3 = eachline.split(' ')[2]
-
newcolumn = column3.split('.')[-1]
-
outfile.write(newcolumn + '\n')
-
output = []
-
lineList = newcolumn.split("\n")
-
for line in lineList:
-
if line not in output:
-
output.append(line)
-
print "\n".join(output)
-
Basically what I have is an input file of 25000 rows, 'capture25000.txt'. I extract the column that I need from this file and put that column in another file 'capture25000-column3.txt'. So how do I use this new output file in bvdet's code? I have done the way above, but it again prints the same values as in the 'newcolumn' list or 'capture25000-column3.txt' file. I want to get rid of the similar lines in the 'capture25000-column3.txt' file.
bvdet 2,851
Expert Mod 2GB
Judging from your code, it appears that the file format is not like the sample data in your first post. Could you post a representative sample of your actual data?
The actual data which is in a file called 'capture25000-column3.txt' is a set of port numbers like this:
52141
21283
43439
21283
21283
23479
80
10464
80
80
21283
43439
80
10464
10464
21283
.
.
.
and there are 25000 rows of those numbers. If you need all of them, I could send it to your email.
bvdet 2,851
Expert Mod 2GB
So that's supposed to be the output? What does the corresponding data look like?
I first have an input file that has captured data. I need only the third column from that file, so I did it this way: - # for (i, eachline) in enumerate(open('capture25000.txt','r')):
-
# column3 = eachline.split(' ')[2]
Then I get the list 'column3' which looks like this:
27.20.29.12.21283
27.20.29.12.21283
65.24.112.69.23479
27.120.98.21.80
32.16.45.74.10464
56.176.45.74.10464
80.290.20.12.21283
45.85.178.252.28302
80.91.188.120.1328
27.144.102.81.13696
62.244.260.6.33344
54.57.14.68.80
200.97.17.48.80
There 2500 rows of those. Then I use the following piece of code: -
# newcolumn = column3.split('.')[-1]
-
to get only the port numbers as in the following:
21283
21283
23479
80
10464
1046
1283
28302
1328
13696
33344
80
80
.
.
.
As you notice there are more than one occurrence of every port number. I want to leave only the first occurrence of any number and get rid of the rest, which your code is supposed to do. But since I have those port numbers in an external file, I didn't know how to use your code for my purpose.
Thanks a lot again for your time
bvdet 2,851
Expert Mod 2GB
See if this works for you: - outfile = open('capture25000-column3.txt', 'w')
-
-
f = open('capture25000.txt')
-
output = []
-
for line in f:
-
column3 = line.strip().split()[2].split(".")[-1]
-
if column3 not in output:
-
output.append(column3)
-
-
f.close()
-
outfile.write("\n".join(output))
-
outfile.close()
Thanks a lot bvdet. That worked well:)
I also want to count how many times the port numbers appeared in the 'column3' and put the port number and beside it the number of appearances of that port in a file, so that I have something like this:
21283 3
21283 2
23479 1
80 7
10464 2
and so on. This is my code that tries to do that, but it gives me only 1 appearance for each port: -
-
outfile = open('capture25000-column3.txt', 'w')
-
-
f = open('capture25000.txt')
-
-
output = []
-
for line in f:
-
column3 = line.strip().split()[2].split(".")[-1].split("\n")
-
for val in column3:
-
if not val in output:
-
print val, column3.count(val)
-
output.append(val)
-
-
f.close()
-
outfile.write("\n".join(output))
-
outfile.close()
-
-
It seems to count the appearances from the file where I have only the unique numbers, so basically each of those appear for once there. What am I missing in the above code?
bvdet 2,851
Expert Mod 2GB
Variable column3 is a string representing a number. Splitting column3 on the "\n" character does nothing except return a one element list. column3 is only added to the output list one time, so it is only counted once.
I would recommend storing the data in a dictionary. The port numbers would be the dictionary keys and the counts would be the values. - outfile = open('capture25000-column3.txt', 'w')
-
-
f = open('capture25000.txt')
-
# initialize a dictionary
-
dd = {}
-
# iterate on the file object
-
for line in f:
-
# get the port number from the third item in line
-
column3 = line.strip().split()[2].split(".")[-1]
-
# if column3 not in dd, add to dd and set quantity to 0
-
dd.setdefault(column3, 0)
-
# increment dd[port number] by one
-
dd[column3] += 1
-
-
f.close()
-
# write the dictionary to disk
-
outfile.write("\n".join(["%s %s" % (key, dd[key]) for key in dd]))
-
outfile.close()
Thanks a lot bvdet for your help and your time. That works perfectly. It's two weeks that I have started learning Python, that's why I am a bit lost.
Now I have created the following method in the code that you did earlier. The method creates a dictionary that has the port numbers and their corresponding port names: -
def get_port_name (name, port_table):
-
# ports.txt is a file where I have put the IANA PORT NUMBERS
-
f= open('ports.txt','r')
-
content=f.readline()
-
keycount=1
-
while content:
-
key = str(keycount)
-
line1=content.split()
-
if len(line1)>3 and line1[0]!="#":
-
port=line1[1].split('/')
-
if len(port)==2:
-
port_table[int(port[0])]={'keycode':line1[0],'description':line1[2]}
-
keycount +=1
-
content=f.readline()
-
-
ports_name={}
-
get_port_name ('Ports', ports_name)
-
print ports_name;
-
A sample of the result of the code above is like this:
-----------------------------------------------------
80: {'keycode': 'www-http', 'description': 'World'},
82: {'keycode': 'xfer', 'description': 'XFER'},
83: {'keycode': 'mit-ml-dev', 'description': 'MIT'},
---------------------------------------------------
e.g. the first line shows port 80 with the name being 'www-http'. I want to grab those keycode values and put them in the 3rd column of the output file that you gave earlier and have a result like this for all the port numbers:
80 115 HTTP
443 52 HTTPS
and so on, with the following line: -
outfile.write("\n".join(["%s %s %s" % (key, dd[key], port_table['keycode'][key]) for key in dd]))
-
but it doesn't work. How to use the method here for the output? I am yet well familiar with the syntax of Python methods I guess.
bvdet 2,851
Expert Mod 2GB
It may be as simple as: - port_table[key]['keycode']
Unfortunately that didn't work either. I changed few lines in the previously sent code this way: -
.
-
.
-
.
-
port_table[port[0]]=line1[0]
-
keycount +=1
-
content=f.readline()
-
-
ports_name={}
-
get_port_name ('Ports', ports_name)
-
and now the dictionary looks like this:
-----------------------------------------------
{'8732': 'dtp-net', '4026': 'as-debug', '4448': 'asc-slmd', '4024': 'tnp1-port', '4025': 'partimage', '38203': 'agpolicy', '4023': 'esnm-zoning', '4020': 'trap', '4021': 'nexus-portal', '9418': 'git', '4028': 'dtserver-port', '4029': 'ip-qsig'}
-----------------------------------------------
The output line is like this: -
outfile.write("\n".join(["%s %s %s" % (key, port_dic[key], ports_name[key]) for key in port_dic]))
-
but it gives this error message: KeyError: '62084', which doesn't tell me anything.
bvdet 2,851
Expert Mod 2GB
"KeyError" indicates that you are attempting to retrieve a dictionary value using a dictionary key that is not in the dictionary. In order for the script to work, both dictionaries must have the same keys. The alternative would be to create one dictionary instead. Assuming the 'keycode' is in column 4 and the 'description' is in column 5: - outfile = open('capture25000-column3.txt', 'w')
-
-
f = open('capture25000.txt')
-
# initialize a dictionary
-
dd = {}
-
# iterate on the file object
-
for line in f:
-
# get the port number from the third item in line
-
lineList = line.strip().split()
-
column3 = lineList[2].split(".")[-1]
-
keycode = lineList[3]
-
description = lineList[4]
-
# if column3 not in dd, add to dd and set quantity to 0
-
# only first occurring value of keycode and description will be saved
-
dd.setdefault(column3, [0, keycode, description])
-
# increment dd[port number][0] by one
-
dd[column3][0] += 1
-
-
f.close()
-
# write the dictionary to disk
-
outfile.write("\n".join(["%s %s %s" % (key, dd[key][0], dd[key][1]) for key in dd]))
-
outfile.close()
Now I get the port description in the third column of the output file 'capture25000-column3.txt', but each description is repeated for few times, not just once as the port numbers are. Some output:
25910 1 pangolin-laser
55149 1 pangolin-laser
4024 4 www-http
13734 2 www-http
2451 2 www-http
55617 6 iapp
61510 1 iapp
which is not correct. The few last lines of the following code is supposed to do that, but doesn't work properly. I guess something I have done wrong with looping? -
def get_port_name (name, port_table):
-
-
pf= open('ports.txt','r')
-
content=pf.readline()
-
keycount=1
-
while content:
-
line1=content.split()
-
if len(line1)>3 and line1[0]!="#":
-
port=line1[1].split('/')
-
port_table[port[0]]=line1[0]
-
keycount +=1
-
content=pf.readline()
-
-
# initialize a dictionary
-
dd = {}
-
-
# iterate on the file object
-
for line in f:
-
# get the port number from the third item in line
-
column3 = line.strip().split()[2].split(".")[-1]
-
# if column3 not in dd, add to dd and set quantity to 0
-
dd.setdefault(column3, 0)
-
# increment dd[port number] by one
-
dd[column3] += 1
-
-
ports_name={}
-
get_port_name ('Ports', ports_name)
-
# if the key in dd equals the key (item) in ports_name, insert the value of that item to the third column in the output file:
-
for key in dd.keys():
-
for item in ports_name.keys():
-
if key == item:
-
outfile.write("\n".join(["%s %s %s" % (key, dd[key], ports_name[item]) for key in dd]))
-
-
#f.close()
-
#outfile.write("\n".join(["%s %s %s" % (key, dd[key], ports_name[item]) for key in dd]))
-
#outfile.close()
-
bvdet 2,851
Expert Mod 2GB
Your code doesn't look anything like the code I posted. It looks like you are reading the data file twice.
Finally these are the line that did it: -
for key in dd.keys():
-
for item in ports_name.keys():
-
if key == item:
-
outfile.write("\n".join(["%s %s %s" % (key, dd[key], ports_name[item])])+ "\n")
-
-
Thanks bvdet for your help.
bvdet 2,851
Expert Mod 2GB
You are welcome, and I'm glad that you got it working!
BV
Sign in to post your reply or Sign up for a free account.
Similar topics
by: Paolo Alexis Falcone |
last post by:
Whenever I try to access a table with many rows using PgSQL's
fetchall(), this happens:
>>> from pyPgSQL import PgSQL
>>> db = PgSQL.connect("192.168.0.8:5432:whitegold","dondon","dondon")
>>>...
|
by: Good Man |
last post by:
Hi there
I'm in the planning stages of creating a database, and I have two options
here. Which makes more sense, and/or provides better performance for
queries - a single table with hundreds of...
|
by: Krish |
last post by:
How do you find out the number of rows in the array {Array}
I have
.....
thanks
|
by: Steve Atkins |
last post by:
I have a large table (potentially tens or hundreds of millions of rows) and
I need to extract some number of these rows, defined by an integer primary
key.
So, the obvious answer is
select *...
|
by: rcamarda |
last post by:
I have a column that has text delimited by a percent sign that I wish
to turn into rows.
Example:
A column contains ROBERT%CAMARDA, I want to turn that into two rows,
one row with ROBERT and...
|
by: pkirk25 |
last post by:
Assume an array of structs that is having rows added at random. By the
time it reaches your function, you have no idea if it has a few hundred
over over 10000 rows.
When your function recieves...
|
by: bill |
last post by:
I am sure there must be an easy way to determine the number of
rows in a table, but I can't find it.
I appreciate the courtesy and patience ng members have shown this
mysql novice.
bill
|
by: Anca Floria |
last post by:
Hi there!
I am trying to find out what is wrong with the following code:
SELECT Investor.Company, Investor.ID, Investor.Country
FROM Investor
WHERE ((( Investor.Company) In (
SELECT
...
|
by: manish2007 |
last post by:
I have edit sql server 2000 table record by vb or direct enterprise mangafer and i have recived one message "Key column information is insufficient or incorrect. Too many rows were affected by...
|
by: luciegiles |
last post by:
Hi,
I wish to transpose data in an Access table currently in a single column over many rows to many columns over single row:
Current format:
CHI DateIssued
123456 01/01/10...
|
by: Faith0G |
last post by:
I am starting a new it consulting business and it's been a while since I setup a new website. Is wordpress still the best web based software for hosting a 5 page website? The webpages will be...
|
by: ryjfgjl |
last post by:
In our work, we often need to import Excel data into databases (such as MySQL, SQL Server, Oracle) for data analysis and processing. Usually, we use database tools like Navicat or the Excel import...
|
by: Charles Arthur |
last post by:
How do i turn on java script on a villaon, callus and itel keypad mobile phone
|
by: aa123db |
last post by:
Variable and constants
Use var or let for variables and const fror constants.
Var foo ='bar';
Let foo ='bar';const baz ='bar';
Functions
function $name$ ($parameters$) {
}
...
|
by: ryjfgjl |
last post by:
If we have dozens or hundreds of excel to import into the database, if we use the excel import function provided by database editors such as navicat, it will be extremely tedious and time-consuming...
|
by: ryjfgjl |
last post by:
In our work, we often receive Excel tables with data in the same format. If we want to analyze these data, it can be difficult to analyze them because the data is spread across multiple Excel files...
|
by: emmanuelkatto |
last post by:
Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud.
Please let me know.
Thanks!
Emmanuel
|
by: nemocccc |
last post by:
hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?
|
by: Sonnysonu |
last post by:
This is the data of csv file
1 2 3
1 2 3
1 2 3
1 2 3
2 3
2 3
3
the lengths should be different i have to store the data by column-wise with in the specific length.
suppose the i have to...
| |