Bytes IT Community

extremely slow array indexing?

Hi,

I am writing code to sort the columns of a dataset according to the sum of each
column. The dataset is huge (50k rows x 300k cols), so I need to read it
line by line and do the summation incrementally to avoid the out-of-memory problem.
But I don't know why it runs so slowly; part of the code is as
follows. I suspect it's the array indexing, but I'm not sure. Can anyone
point out what needs to be modified to make it run fast? Thanks in
advance!

....
from numpy import *
....

currSum = zeros(self.componentcount)
currRow = zeros(self.componentcount)
for featureDict in self.featureDictList:
    currRow[:] = 0
    for components in self.componentdict1:
        if featureDict.has_key(components):
            col = self.componentdict1[components]
            value = featureDict[components]
            currRow[col]=value;
    currSum = currSum + row;
....
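Since each row is sparse (only a few components per featureDict), one way to avoid filling a dense temporary row at all is to accumulate directly into currSum. A minimal, self-contained sketch of that idea, with made-up toy data standing in for the poster's featureDictList and componentdict1 (which we only know as dicts mapping component names to column indices and values):

```python
import numpy as np

componentcount = 5
componentdict1 = {"a": 0, "b": 2, "c": 4}   # component name -> column index
featureDictList = [{"a": 1.0, "c": 3.0},    # sparse rows: name -> value
                   {"b": 2.0}]

currSum = np.zeros(componentcount)
for featureDict in featureDictList:
    # Iterate over the (few) entries actually present in this row,
    # instead of scanning every known component.
    for name, value in featureDict.items():
        col = componentdict1.get(name)
        if col is not None:
            currSum[col] += value   # update only the nonzero entries

print(currSum)   # [1. 0. 2. 0. 3.]
```

This does work proportional to the number of nonzero entries per row rather than to the full 300k-column width, and it never allocates a fresh 300k-element array inside the loop.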

Nov 30 '06 #1
4 Replies



Grace Fang wrote:
Hi,

I am writing code to sort the columns of a dataset according to the sum of each
column. The dataset is huge (50k rows x 300k cols), so I need to read it
line by line and do the summation incrementally to avoid the out-of-memory problem.
But I don't know why it runs so slowly; part of the code is as
follows. I suspect it's the array indexing, but I'm not sure. Can anyone
point out what needs to be modified to make it run fast? Thanks in
advance!
Array indexing is unlikely to be the culprit. Could it not just be
slow, because you are processing a lot of data? With numbers those big
I would expect to have enough time to go make a coffee, then drink it.

If you think it is slower than it could be, post more code for
optimization advice...

Will McGugan
--
http://www.willmcgugan.com

Nov 30 '06 #2

Hi Will, thanks for your reply. The simplified code is as follows, and
you can run it if you like. It takes 7 seconds to process 1000 rows,
which is tolerable, but I wonder why it takes so long, because I also
did one for loop through all of the same rows without accessing the array,
which only takes 1 sec per 1000 rows. Isn't a vectorized
operation supposed to run very quickly?

from numpy import *
componentcount = 300000
currSum = zeros(componentcount)
row = zeros(componentcount)  # current row
rowcount = 50000
for i in range(1,rowcount):
    row[:] = 1
    currSum = currSum + row;
Array indexing is unlikely to be the culprit. Could it not just be
slow, because you are processing a lot of data? With numbers those big
I would expect to have enough time to go make a coffee, then drink it.

If you think it is slower than it could be, post more code for
optimization advice...

Will McGugan
--
http://www.willmcgugan.com
Nov 30 '06 #3

Will McGugan wrote:
Grace Fang wrote:
Hi,

I am writing code to sort the columns of a dataset according to the sum of each
column. The dataset is huge (50k rows x 300k cols), so I need to read it
line by line and do the summation incrementally to avoid the out-of-memory problem.
But I don't know why it runs so slowly; part of the code is as
follows. I suspect it's the array indexing, but I'm not sure. Can anyone
point out what needs to be modified to make it run fast? Thanks in
advance!

Array indexing is unlikely to be the culprit. Could it not just be
slow, because you are processing a lot of data? With numbers those big
I would expect to have enough time to go make a coffee, then drink it.

If you think it is slower than it could be, post more code for
optimization advice...

Will McGugan
Hi Grace,
What Will McGugan said, plus:
1. Post *much* more of your code, e.g. all relevant parts :-)
2. Explain "featureDict" and "componentdict1"; note that you seem to
be doing more dictionary access than array indexing.
3. Tell us what "row" (not mentioned elsewhere) is in the last line of
your code snippet. Should it be "currRow"? For your sake and ours,
copy/paste your code; don't re-type it.
4. Tell us what version of Python [why are you using dict.has_key??],
what platform, and how much memory.
5. Tell us what "very slow" means, e.g. how many rows per second.

HTH,
John

Nov 30 '06 #4

John Machin wrote:
Hi Grace,
What Will McGugan said, plus:
1. Post *much* more of your code e.g. all relevant parts :-)
Note that Grace has also posted this to numpy-discussion and with prompting
provided the following snippet as a distillation of the key slow part:
from numpy import *

componentcount = 300000
currSum = zeros(componentcount)
row = zeros(componentcount)  # current row
rowcount = 50000
for i in range(1,rowcount):
    row[:] = 1
    currSum = currSum + row;
As it is, the OP gets through 1000 rows every 7 seconds or so on their machine,
and I get about the same on mine.

Changing the last line to "currSum += row" gets a 3x speedup. Dropping the
"row[:] = 1" line, since it's really just a time-consuming no-op in the example
and probably not an accurate reflection of what's going on in the real code, gets
you another 2x speedup.
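The difference can be seen in a small, self-contained timing sketch (sizes scaled down to 100 iterations; the exact numbers will of course vary by machine):

```python
import numpy as np
import timeit

n = 300_000
row = np.ones(n)

def out_of_place():
    s = np.zeros(n)
    for _ in range(100):
        s = s + row   # allocates a fresh n-element array every iteration
    return s

def in_place():
    s = np.zeros(n)
    for _ in range(100):
        s += row      # reuses s's buffer; no new allocation
    return s

t1 = timeit.timeit(out_of_place, number=3)
t2 = timeit.timeit(in_place, number=3)
print(f"s = s + row: {t1:.2f}s   s += row: {t2:.2f}s")
```

Both loops compute the same result; the in-place form just avoids creating and garbage-collecting a large temporary array on every pass.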

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco

Nov 30 '06 #5
