Bytes | Software Development & Data Engineering Community

Is there a faster way to do this?

I have a csv file containing product information that is 700+ MB in
size. I'm trying to go through and pull out unique product IDs only,
as there are a lot of duplicates. My problem is that I am appending the
ProductID to an array and then searching through that array each time
to see if I've seen the product ID before, so each search takes longer
and longer. I let the script run for 2 hours before killing it and had
only run through less than 1/10 of the file.

Here's the code:
import string

def checkForProduct(product_id, product_list):
    for product in product_list:
        if product == product_id:
            return 1
    return 0

input_file="c:\\input.txt"
output_file="c:\\output.txt"
product_info = []
input_count = 0

input = open(input_file,"r")
output = open(output_file, "w")

for line in input:
    break_down = line.split(",")
    product_number = break_down[2]
    input_count+=1
    if input_count == 1:
        product_info.append(product_number)
        output.write(line)
        output_count = 1
    if not checkForProduct(product_number,product_info):
        product_info.append(product_number)
        output.write(line)
        output_count+=1

output.close()
input.close()
print input_count
print output_count
Aug 5 '08 #1
ro************@gmail.com wrote:
I have a csv file containing product information that is 700+ MB in
size. I'm trying to go through and pull out unique product IDs only,
as there are a lot of duplicates. My problem is that I am appending the
ProductID to an array and then searching through that array each time
to see if I've seen the product ID before, so each search takes longer
and longer. I let the script run for 2 hours before killing it and had
only run through less than 1/10 of the file.
Store your IDs in a dictionary or a set. Then test for the existence
of a new ID in that set. That test will be *much* more efficient than
searching a list. (It uses a hashing scheme.)

IDs = set()
for row in ...:
    ID = extractIdFromRow(row)
    if ID not in IDs:
        IDs.add(ID)
        ... whatever ...
In fact, if *all* you are doing is trying to identify the product IDs
that occur in the file (no matter how many times they occur):

IDs = set()
for row in ...:
    ID = extractIdFromRow(row)
    IDs.add(ID)

and your set will contain *one* copy of each ID added, no matter how
many times it was added.
Better yet, if you can write your ID extraction as a generator
expression or list comprehension...

IDs = set(extractIdFromRow(row) for row in rowsOfTable)

or some such would be the most efficient.
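
Putting the set approach together for this particular file (just a
sketch, assuming the product ID really is the third comma-separated
field and that input_file/output_file are set as in the original code;
the first line seen per ID is written out):

seen = set()
output = open(output_file, "w")
for line in open(input_file, "r"):
    product_number = line.split(",")[2]   # third comma-separated field
    if product_number not in seen:
        seen.add(product_number)          # constant-time membership test
        output.write(line)                # keep only the first line per product ID
output.close()
print "unique product IDs:", len(seen)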
Gary Herron
Here's the code:
import string

def checkForProduct(product_id, product_list):
    for product in product_list:
        if product == product_id:
            return 1
    return 0

input_file="c:\\input.txt"
output_file="c:\\output.txt"
product_info = []
input_count = 0

input = open(input_file,"r")
output = open(output_file, "w")

for line in input:
    break_down = line.split(",")
    product_number = break_down[2]
    input_count+=1
    if input_count == 1:
        product_info.append(product_number)
        output.write(line)
        output_count = 1
    if not checkForProduct(product_number,product_info):
        product_info.append(product_number)
        output.write(line)
        output_count+=1

output.close()
input.close()
print input_count
print output_count
Aug 5 '08 #2
On Aug 5, 2008, at 10:00 PM, ro************@gmail.com wrote:
I have a csv file containing product information that is 700+ MB in
size. I'm trying to go through and pull out unique product IDs only,
as there are a lot of duplicates. My problem is that I am appending the
ProductID to an array and then searching through that array each time
to see if I've seen the product ID before, so each search takes longer
and longer. I let the script run for 2 hours before killing it and had
only run through less than 1/10 of the file.
Why not split the file into more manageable chunks, especially as it
seems to be just plain text?
Here's the code:
import string

def checkForProduct(product_id, product_list):
    for product in product_list:
        if product == product_id:
            return 1
    return 0

input_file="c:\\input.txt"
output_file="c:\\output.txt"
product_info = []
input_count = 0

input = open(input_file,"r")
output = open(output_file, "w")

for line in input:
    break_down = line.split(",")
    product_number = break_down[2]
    input_count+=1
    if input_count == 1:
        product_info.append(product_number)
        output.write(line)
        output_count = 1
This seems redundant.
    if not checkForProduct(product_number,product_info):
        product_info.append(product_number)
        output.write(line)
        output_count+=1
File writing is extremely expensive. In fact, so is reading. Think
about reading the file in whole chunks. Put those chunks into Python
data structures, and make your output information in Python data
structures. If you use a dictionary and search the IDs there, you'll
notice some speed improvements as Python does a dictionary lookup far
quicker than searching a list. Then, output your data all at once at
the end.
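
A rough sketch of that idea (hedged: it assumes the product ID is the
third comma-separated field and input_file/output_file as in the
original script; output lines are buffered in memory and written in a
single call at the end):

seen = {}
buffered_lines = []
for line in open(input_file, "r"):
    product_number = line.split(",")[2]
    if product_number not in seen:        # dictionary lookup, not a list scan
        seen[product_number] = True
        buffered_lines.append(line)       # collect output in memory
output = open(output_file, "w")
output.writelines(buffered_lines)         # one write at the end
output.close()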

--
Avi

Aug 5 '08 #3
Avinash Vora wrote:
On Aug 5, 2008, at 10:00 PM, ro************@gmail.com wrote:
I have a csv file containing product information that is 700+ MB in
size. I'm trying to go through and pull out unique product IDs only,
as there are a lot of duplicates. My problem is that I am appending the
ProductID to an array and then searching through that array each time
to see if I've seen the product ID before, so each search takes longer
and longer. I let the script run for 2 hours before killing it and had
only run through less than 1/10 of the file.

Why not split the file into more manageable chunks, especially as it
seems to be just plain text?
Here's the code:
import string

def checkForProduct(product_id, product_list):
    for product in product_list:
        if product == product_id:
            return 1
    return 0

input_file="c:\\input.txt"
output_file="c:\\output.txt"
product_info = []
input_count = 0

input = open(input_file,"r")
output = open(output_file, "w")

for line in input:
    break_down = line.split(",")
    product_number = break_down[2]
    input_count+=1
    if input_count == 1:
        product_info.append(product_number)
        output.write(line)
        output_count = 1

This seems redundant.
    if not checkForProduct(product_number,product_info):
        product_info.append(product_number)
        output.write(line)
        output_count+=1

File writing is extremely expensive. In fact, so is reading. Think
about reading the file in whole chunks. Put those chunks into Python
data structures, and make your output information in Python data
structures.
Don't bother yourself with this suggestion about reading in chunks --
Python already does this for you, and does so more efficiently than you
could. The code

    for line in open(input_file, "r"):

reads in large chunks (efficiently) and then serves up the contents
line by line.

Gary Herron

If you use a dictionary and search the IDs there, you'll notice some
speed improvements as Python does a dictionary lookup far quicker than
searching a list. Then, output your data all at once at the end.

--
Avi

Aug 5 '08 #4
ro************@gmail.com wrote:
I have a csv file containing product information that is 700+ MB in
size. I'm trying to go through and pull out unique product IDs only,
as there are a lot of duplicates. My problem is that I am appending the
ProductID to an array and then searching through that array each time
to see if I've seen the product ID before, so each search takes longer
and longer. I let the script run for 2 hours before killing it and had
only run through less than 1/10 of the file.
I think you need to learn about Python's dictionary data type.
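
For instance (a made-up illustration, not the original poster's data):
membership tests against a dict take roughly constant time, unlike
scanning a list:

seen = {}
for product_number in ["A123", "B456", "A123"]:
    if product_number not in seen:    # hash lookup instead of a linear search
        seen[product_number] = True
print sorted(seen.keys())             # ['A123', 'B456']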
Aug 6 '08 #5
Why not just use sets?

a = set()
a.add(1)
a.add(2)

On Tue, Aug 5, 2008 at 10:14 PM, RPM1 <rp**********@earthlink.net> wrote:
ro************@gmail.com wrote:
I have a csv file containing product information that is 700+ MB in
size. I'm trying to go through and pull out unique product IDs only,
as there are a lot of duplicates. My problem is that I am appending the
ProductID to an array and then searching through that array each time
to see if I've seen the product ID before, so each search takes longer
and longer. I let the script run for 2 hours before killing it and had
only run through less than 1/10 of the file.

I think you need to learn about Python's dictionary data type.
Aug 6 '08 #6
On Tue, 5 Aug 2008, ro************@gmail.com wrote:
I have a csv file containing product information that is 700+ MB in
size. I'm trying to go through and pull out unique product IDs only,
as there are a lot of duplicates. My problem is that I am appending the
ProductID to an array and then searching through that array each time
to see if I've seen the product ID before, so each search takes longer
and longer. I let the script run for 2 hours before killing it and had
only run through less than 1/10 of the file.
My take:

I assume you have 80 bytes per line; that makes 10 million lines for a 700M
file. To be quite sure, let's round it up to 20 million. Next, I don't want to
trash my disk with 700M+ files, so I assume reading a line, breaking it up
and getting the product id takes roughly the same time as generating a random
id in my code. So, I:

1. read all records line by line (or just make random ids), append the
product id to the list (actually, I preallocate the list with empty space
and fill it up)
2. sort() the list
3. iterate the list, count the unique ids, optionally write to file

The code (most of it is just making random names, which was real fun):

import string

RECORD_NUM = 20*1024*1024 # 700M/80-bytes-per-line = ca. 10M+ records
ALPHA = string.digits + string.ascii_letters
RAND = None

def random_iter ( ) :
    x = 12345678910
    y = 234567891011
    M = 2**61 - 1
    M0 = 2**31 - 1
    pot10 = [ 1, 10, 100, 1000, 10000, 100000 ]
    while 1 :
        p = x * y
        l = pot10[5 - (p % 10)]
        n = (p / l) % M
        d = l * (n % 10)
        p = p % M0
        x1 = y - d + p
        y1 = x + d + p
        x, y = x1, y1
        yield n
        pass
    pass

def next_random ( ) :
    global RAND
    if RAND == None :
        RAND = random_iter()
    return RAND.next()

def num_to_name ( n ) :
    s = []
    len_a = len(ALPHA)
    while n > 0 :
        n1, r = divmod(n, len_a)
        s.append(ALPHA[r])
        n = n1
        pass
    return "".join(s)

def get_product_id ( ) :
    return num_to_name(next_random())

def dry_run ( n ) :
    r = [ 0 ] * n
    while n > 0 :
        n -= 1
        r[n] = get_product_id()
    return r

###

if __name__ == "__main__":
    print "hello"
    for i in range(10) : print get_product_id()
    print
    print "RECORD_NUM: %d" % RECORD_NUM
    products = dry_run(RECORD_NUM)
    print "RECORDS PRODUCED: %d" % len(products)
    products.sort()
    i = 0
    lastp = ""
    count = 0
    while i < len(products) :
        if lastp != products[i] :
            lastp = products[i]
            count += 1
        i += 1
        pass
    print "UNIQUE RECORDS: %d" % count

I may have made some bugs, but it works on my computer. Or seems to ;-/.

For 20 million products, on my Athlon XP @1800 / 1.5 GB RAM, Debian Linux box,
it takes about 13 minutes to go through list generation, about 3 minutes
to sort the list, and a few more seconds to skim it (writing should not take
much longer). All summed up, about 18 minutes of real time, with some
other programs computing a little etc. in the background - so, much less
than 2 hours.
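
(For the counting pass alone, itertools.groupby over the already-sorted
list gives the same answer with less bookkeeping -- just a sketch, using
the same `products` list as above:)

from itertools import groupby

# products is sorted, so equal IDs are adjacent and form one group each
count = sum(1 for _key, _group in groupby(products))
print "UNIQUE RECORDS: %d" % count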

Regards,
Tomasz Rola

--
** A C programmer asked whether computer had Buddha's nature. **
** As the answer, master did "rm -rif" on the programmer's home **
** directory. And then the C programmer became enlightened... **
** **
** Tomasz Rola mailto:to*********@bigfoot.com **
Aug 6 '08 #7
Is your product ID always the 3rd and last item on the line? Otherwise
your output won't separate the IDs.

And how does

output = open(output_file,'w')
for x in set(line.split(',')[2] for line in open(input_file)) :
    output.write(x)
output.close()

behave?
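
(If the ID is not the last field, the IDs written above would run
together with no separator; a hedged tweak of the same idea that writes
one ID per line either way, assuming input_file/output_file as in the
original code:)

output = open(output_file, 'w')
for x in set(line.split(',')[2] for line in open(input_file)):
    output.write(x.strip() + '\n')   # strip any trailing newline, then add one
output.close()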
ro************@gmail.com wrote:
I have a csv file containing product information that is 700+ MB in
size. I'm trying to go through and pull out unique product IDs only,
as there are a lot of duplicates. My problem is that I am appending the
ProductID to an array and then searching through that array each time
to see if I've seen the product ID before, so each search takes longer
and longer. I let the script run for 2 hours before killing it and had
only run through less than 1/10 of the file.

Here's the code:
import string

def checkForProduct(product_id, product_list):
    for product in product_list:
        if product == product_id:
            return 1
    return 0

input_file="c:\\input.txt"
output_file="c:\\output.txt"
product_info = []
input_count = 0

input = open(input_file,"r")
output = open(output_file, "w")

for line in input:
    break_down = line.split(",")
    product_number = break_down[2]
    input_count+=1
    if input_count == 1:
        product_info.append(product_number)
        output.write(line)
        output_count = 1
    if not checkForProduct(product_number,product_info):
        product_info.append(product_number)
        output.write(line)
        output_count+=1

output.close()
input.close()
print input_count
print output_count
Aug 6 '08 #8
