Bytes | Software Development & Data Engineering Community

Is there a faster way to do this?

I have a csv file containing product information that is 700+ MB in
size. I'm trying to go through and pull out unique product IDs only,
as there are a lot of duplicates. My problem is that I am appending the
ProductID to an array and then searching through that array each time
to see if I've seen the product ID before, so each search takes longer
and longer. I let the script run for 2 hours before killing it and had
only run through less than 1/10 of the file.

Here's the code:
import string

def checkForProduct(product_id, product_list):
    for product in product_list:
        if product == product_id:
            return 1
    return 0

input_file="c:\\input.txt"
output_file="c:\\output.txt"
product_info = []
input_count = 0

input = open(input_file,"r")
output = open(output_file, "w")

for line in input:
    break_down = line.split(",")
    product_number = break_down[2]
    input_count+=1
    if input_count == 1:
        product_info.append(product_number)
        output.write(line)
        output_count = 1
    if not checkForProduct(product_number,product_info):
        product_info.append(product_number)
        output.write(line)
        output_count+=1

output.close()
input.close()
print input_count
print output_count
Aug 5 '08 #1
ro************@gmail.com wrote:
I have a csv file containing product information that is 700+ MB in
size. I'm trying to go through and pull out unique product IDs only,
as there are a lot of duplicates. My problem is that I am appending the
ProductID to an array and then searching through that array each time
to see if I've seen the product ID before, so each search takes longer
and longer. I let the script run for 2 hours before killing it and had
only run through less than 1/10 of the file.
Store your IDs in a dictionary or a set. Then test for the existence
of a new ID in that set. That test will be *much* more efficient than
searching a list. (It uses a hashing scheme.)

IDs = set()
for row in ...:
    ID = extractIdFromRow(row)
    if ID not in IDs:
        IDs.add(ID)
        ... whatever ...
In fact, if *all* you are doing is trying to identify the product IDs
that occur in the file (no matter how many times they occur):

IDs = set()
for row in ...:
    ID = extractIdFromRow(row)
    IDs.add(ID)

and your set will contain *one* copy of each ID added, no matter how
many times it was added.
Better yet, if you can write your ID extraction as a generator
expression or list comprehension...

IDs = set(extractIdFromRow(row) for row in rowsOfTable)

or some such would be the most efficient.
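
Putting the set approach together for this particular file (just a
sketch, assuming the product ID really is the third comma-separated
field and that input_file/output_file are set as in the original code;
the first line seen per ID is written out):

seen = set()
output = open(output_file, "w")
for line in open(input_file, "r"):
    product_number = line.split(",")[2]   # third comma-separated field
    if product_number not in seen:
        seen.add(product_number)          # constant-time membership test
        output.write(line)                # keep only the first line per product ID
output.close()
print "unique product IDs:", len(seen)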
Gary Herron
Here's the code:
import string

def checkForProduct(product_id, product_list):
    for product in product_list:
        if product == product_id:
            return 1
    return 0

input_file="c:\\input.txt"
output_file="c:\\output.txt"
product_info = []
input_count = 0

input = open(input_file,"r")
output = open(output_file, "w")

for line in input:
    break_down = line.split(",")
    product_number = break_down[2]
    input_count+=1
    if input_count == 1:
        product_info.append(product_number)
        output.write(line)
        output_count = 1
    if not checkForProduct(product_number,product_info):
        product_info.append(product_number)
        output.write(line)
        output_count+=1

output.close()
input.close()
print input_count
print output_count
Aug 5 '08 #2
On Aug 5, 2008, at 10:00 PM, ro************@gmail.com wrote:
I have a csv file containing product information that is 700+ MB in
size. I'm trying to go through and pull out unique product IDs only,
as there are a lot of duplicates. My problem is that I am appending the
ProductID to an array and then searching through that array each time
to see if I've seen the product ID before, so each search takes longer
and longer. I let the script run for 2 hours before killing it and had
only run through less than 1/10 of the file.
Why not split the file into more manageable chunks, especially as it
seems to be just plain text?
Here's the code:
import string

def checkForProduct(product_id, product_list):
    for product in product_list:
        if product == product_id:
            return 1
    return 0

input_file="c:\\input.txt"
output_file="c:\\output.txt"
product_info = []
input_count = 0

input = open(input_file,"r")
output = open(output_file, "w")

for line in input:
    break_down = line.split(",")
    product_number = break_down[2]
    input_count+=1
    if input_count == 1:
        product_info.append(product_number)
        output.write(line)
        output_count = 1
This seems redundant.
    if not checkForProduct(product_number,product_info):
        product_info.append(product_number)
        output.write(line)
        output_count+=1
File writing is extremely expensive. In fact, so is reading. Think
about reading the file in whole chunks. Put those chunks into Python
data structures, and make your output information in Python data
structures. If you use a dictionary and search the IDs there, you'll
notice some speed improvements as Python does a dictionary lookup far
quicker than searching a list. Then, output your data all at once at
the end.
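
A rough sketch of that idea (hedged: it assumes the product ID is the
third comma-separated field and input_file/output_file as in the
original script; output lines are buffered in memory and written in a
single call at the end):

seen = {}
buffered_lines = []
for line in open(input_file, "r"):
    product_number = line.split(",")[2]
    if product_number not in seen:        # dictionary lookup, not a list scan
        seen[product_number] = True
        buffered_lines.append(line)       # collect output in memory
output = open(output_file, "w")
output.writelines(buffered_lines)         # one write at the end
output.close()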

--
Avi

Aug 5 '08 #3
Avinash Vora wrote:
On Aug 5, 2008, at 10:00 PM, ro************@gmail.com wrote:
I have a csv file containing product information that is 700+ MB in
size. I'm trying to go through and pull out unique product IDs only,
as there are a lot of duplicates. My problem is that I am appending the
ProductID to an array and then searching through that array each time
to see if I've seen the product ID before, so each search takes longer
and longer. I let the script run for 2 hours before killing it and had
only run through less than 1/10 of the file.

Why not split the file into more manageable chunks, especially as it
seems to be just plain text?
Here's the code:
import string

def checkForProduct(product_id, product_list):
    for product in product_list:
        if product == product_id:
            return 1
    return 0

input_file="c:\\input.txt"
output_file="c:\\output.txt"
product_info = []
input_count = 0

input = open(input_file,"r")
output = open(output_file, "w")

for line in input:
    break_down = line.split(",")
    product_number = break_down[2]
    input_count+=1
    if input_count == 1:
        product_info.append(product_number)
        output.write(line)
        output_count = 1

This seems redundant.
    if not checkForProduct(product_number,product_info):
        product_info.append(product_number)
        output.write(line)
        output_count+=1

File writing is extremely expensive. In fact, so is reading. Think
about reading the file in whole chunks. Put those chunks into Python
data structures, and make your output information in Python data
structures.
Don't bother yourself with this suggestion about reading in chunks --
Python already does this for you, and does so more efficiently than you
could. The code

    for line in open(input_file, "r"):

reads in large chunks (efficiently) and then serves up the contents
line by line.

Gary Herron

If you use a dictionary and search the IDs there, you'll notice some
speed improvements as Python does a dictionary lookup far quicker than
searching a list. Then, output your data all at once at the end.

--
Avi

Aug 5 '08 #4
ro************@gmail.com wrote:
I have a csv file containing product information that is 700+ MB in
size. I'm trying to go through and pull out unique product IDs only,
as there are a lot of duplicates. My problem is that I am appending the
ProductID to an array and then searching through that array each time
to see if I've seen the product ID before, so each search takes longer
and longer. I let the script run for 2 hours before killing it and had
only run through less than 1/10 of the file.
I think you need to learn about Python's dictionary data type.
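
For instance (a made-up illustration, not the original poster's data):
membership tests against a dict take roughly constant time, unlike
scanning a list:

seen = {}
for product_number in ["A123", "B456", "A123"]:
    if product_number not in seen:    # hash lookup instead of a linear search
        seen[product_number] = True
print sorted(seen.keys())             # ['A123', 'B456']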
Aug 6 '08 #5
Why not just use sets?

a = set()
a.add(1)
a.add(2)

On Tue, Aug 5, 2008 at 10:14 PM, RPM1 <rp**********@earthlink.net> wrote:
ro************@gmail.com wrote:
I have a csv file containing product information that is 700+ MB in
size. I'm trying to go through and pull out unique product IDs only,
as there are a lot of duplicates. My problem is that I am appending the
ProductID to an array and then searching through that array each time
to see if I've seen the product ID before, so each search takes longer
and longer. I let the script run for 2 hours before killing it and had
only run through less than 1/10 of the file.

I think you need to learn about Python's dictionary data type.
Aug 6 '08 #6
On Tue, 5 Aug 2008, ro************@gmail.com wrote:
I have a csv file containing product information that is 700+ MB in
size. I'm trying to go through and pull out unique product IDs only,
as there are a lot of duplicates. My problem is that I am appending the
ProductID to an array and then searching through that array each time
to see if I've seen the product ID before, so each search takes longer
and longer. I let the script run for 2 hours before killing it and had
only run through less than 1/10 of the file.
My take:

I assume you have 80 bytes per line; that makes 10 million lines for a 700M
file. To be quite sure, let's round it up to 20 million. Next, I don't want to
trash my disk with 700M+ files, so I assume reading a line, breaking it up
and getting the product id takes roughly the same time as generating a random
id in my code. So, I:

1. read all records line by line (or just make random ids), append the
product id to the list (actually, I preallocate the list with empty space
and fill it up)
2. sort() the list
3. iterate the list, count the unique ids, optionally write to file

The code (most of it is just making random names, which was real fun):

import string

RECORD_NUM = 20*1024*1024 # 700M/80-bytes-per-line = ca. 10M+ records
ALPHA = string.digits + string.ascii_letters
RAND = None

def random_iter ( ) :
    x = 12345678910
    y = 234567891011
    M = 2**61 - 1
    M0 = 2**31 - 1
    pot10 = [ 1, 10, 100, 1000, 10000, 100000 ]
    while 1 :
        p = x * y
        l = pot10[5 - (p % 10)]
        n = (p / l) % M
        d = l * (n % 10)
        p = p % M0
        x1 = y - d + p
        y1 = x + d + p
        x, y = x1, y1
        yield n
        pass
    pass

def next_random ( ) :
    global RAND
    if RAND == None :
        RAND = random_iter()
    return RAND.next()

def num_to_name ( n ) :
    s = []
    len_a = len(ALPHA)
    while n > 0 :
        n1, r = divmod(n, len_a)
        s.append(ALPHA[r])
        n = n1
        pass
    return "".join(s)

def get_product_id ( ) :
    return num_to_name(next_random())

def dry_run ( n ) :
    r = [ 0 ] * n
    while n > 0 :
        n -= 1
        r[n] = get_product_id()
    return r

###

if __name__ == "__main__":
    print "hello"
    for i in range(10) : print get_product_id()
    print
    print "RECORD_NUM: %d" % RECORD_NUM
    products = dry_run(RECORD_NUM)
    print "RECORDS PRODUCED: %d" % len(products)
    products.sort()
    i = 0
    lastp = ""
    count = 0
    while i < len(products) :
        if lastp != products[i] :
            lastp = products[i]
            count += 1
        i += 1
        pass
    print "UNIQUE RECORDS: %d" % count

I may have made some bugs, but it works on my computer. Or seems to ;-/.

For 20 million products, on my Athlon XP @1800 / 1.5 GB RAM, Debian Linux box,
it takes about 13 minutes to go through list generation, about 3 minutes
to sort the list, and a few more seconds to skim it (writing should not take
much longer). All summed up, about 18 minutes of real time, with some
other programs computing a little etc. in the background - so, much less
than 2 hours.
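
(For the counting pass alone, itertools.groupby over the already-sorted
list gives the same answer with less bookkeeping -- just a sketch, using
the same `products` list as above:)

from itertools import groupby

# products is sorted, so equal IDs are adjacent and form one group each
count = sum(1 for _key, _group in groupby(products))
print "UNIQUE RECORDS: %d" % count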

Regards,
Tomasz Rola

--
** A C programmer asked whether computer had Buddha's nature. **
** As the answer, master did "rm -rif" on the programmer's home **
** directory. And then the C programmer became enlightened... **
** **
** Tomasz Rola mailto:to*********@bigfoot.com **
Aug 6 '08 #7
Is your product ID always the 3rd and last item on the line? Otherwise
your output won't separate the IDs.

And how does

output = open(output_file,'w')
for x in set(line.split(',')[2] for line in open(input_file)) :
    output.write(x)
output.close()

behave?
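
(If the ID is not the last field, the IDs written above would run
together with no separator; a hedged tweak of the same idea that writes
one ID per line either way, assuming input_file/output_file as in the
original code:)

output = open(output_file, 'w')
for x in set(line.split(',')[2] for line in open(input_file)):
    output.write(x.strip() + '\n')   # strip any trailing newline, then add one
output.close()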
ro************@gmail.com wrote:
I have a csv file containing product information that is 700+ MB in
size. I'm trying to go through and pull out unique product IDs only,
as there are a lot of duplicates. My problem is that I am appending the
ProductID to an array and then searching through that array each time
to see if I've seen the product ID before, so each search takes longer
and longer. I let the script run for 2 hours before killing it and had
only run through less than 1/10 of the file.

Here's the code:
import string

def checkForProduct(product_id, product_list):
    for product in product_list:
        if product == product_id:
            return 1
    return 0

input_file="c:\\input.txt"
output_file="c:\\output.txt"
product_info = []
input_count = 0

input = open(input_file,"r")
output = open(output_file, "w")

for line in input:
    break_down = line.split(",")
    product_number = break_down[2]
    input_count+=1
    if input_count == 1:
        product_info.append(product_number)
        output.write(line)
        output_count = 1
    if not checkForProduct(product_number,product_info):
        product_info.append(product_number)
        output.write(line)
        output_count+=1

output.close()
input.close()
print input_count
print output_count
Aug 6 '08 #8
