Bytes IT Community

extremely slow array indexing?

Hi,

I am writing code to sort the columns of a dataset according to the sum of each
column. The dataset is huge (50k rows x 300k cols), so I need to read it
line by line and do the summation incrementally to avoid the out-of-memory problem.
But I don't know why it runs so slowly; part of the code is as
follows. I suspect it's the array indexing, but I'm not sure. Can anyone
point out what needs to be modified to make it run fast? Thanks in
advance!

....
from numpy import *
....

currSum = zeros(self.componentcount)
currRow = zeros(self.componentcount)
for featureDict in self.featureDictList:
    currRow[:] = 0
    for components in self.componentdict1:
        if featureDict.has_key(components):
            col = self.componentdict1[components]
            value = featureDict[components]
            currRow[col]=value;
    currSum = currSum + row;
....
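Since each row is sparse (only a few components per featureDict), one way to avoid filling a dense temporary row at all is to accumulate directly into currSum. A minimal, self-contained sketch of that idea, with made-up toy data standing in for the poster's featureDictList and componentdict1 (which we only know as dicts mapping component names to column indices and values):

```python
import numpy as np

componentcount = 5
componentdict1 = {"a": 0, "b": 2, "c": 4}   # component name -> column index
featureDictList = [{"a": 1.0, "c": 3.0},    # sparse rows: name -> value
                   {"b": 2.0}]

currSum = np.zeros(componentcount)
for featureDict in featureDictList:
    # Iterate over the (few) entries actually present in this row,
    # instead of scanning every known component.
    for name, value in featureDict.items():
        col = componentdict1.get(name)
        if col is not None:
            currSum[col] += value   # update only the nonzero entries

print(currSum)   # [1. 0. 2. 0. 3.]
```

This does work proportional to the number of nonzero entries per row rather than to the full 300k-column width, and it never allocates a fresh 300k-element array inside the loop.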

Nov 30 '06 #1
4 Replies



Grace Fang wrote:
Hi,

I am writing code to sort the columns of a dataset according to the sum of each
column. The dataset is huge (50k rows x 300k cols), so I need to read it
line by line and do the summation incrementally to avoid the out-of-memory problem.
But I don't know why it runs so slowly; part of the code is as
follows. I suspect it's the array indexing, but I'm not sure. Can anyone
point out what needs to be modified to make it run fast? Thanks in
advance!
Array indexing is unlikely to be the culprit. Could it not just be
slow, because you are processing a lot of data? With numbers those big
I would expect to have enough time to go make a coffee, then drink it.

If you think it is slower than it could be, post more code for
optimization advice...

Will McGugan
--
http://www.willmcgugan.com

Nov 30 '06 #2

Hi Will, thanks for your reply. The simplified code is as follows, and
you can run it if you like. It takes 7 seconds to process 1000 rows,
which is tolerable, but I wonder why it takes so long, because I also
did one for loop through all of the same rows without accessing the array,
which only takes 1 sec per 1000 rows. Isn't a vectorized
operation supposed to run very quickly?

from numpy import *
componentcount = 300000
currSum = zeros(componentcount)
row = zeros(componentcount)  # current row
rowcount = 50000
for i in range(1,rowcount):
    row[:] = 1
    currSum = currSum + row;
Array indexing is unlikely to be the culprit. Could it not just be
slow, because you are processing a lot of data? With numbers those big
I would expect to have enough time to go make a coffee, then drink it.

If you think it is slower than it could be, post more code for
optimization advice...

Will McGugan
--
http://www.willmcgugan.com
Nov 30 '06 #3

Will McGugan wrote:
Grace Fang wrote:
Hi,

I am writing code to sort the columns of a dataset according to the sum of each
column. The dataset is huge (50k rows x 300k cols), so I need to read it
line by line and do the summation incrementally to avoid the out-of-memory problem.
But I don't know why it runs so slowly; part of the code is as
follows. I suspect it's the array indexing, but I'm not sure. Can anyone
point out what needs to be modified to make it run fast? Thanks in
advance!

Array indexing is unlikely to be the culprit. Could it not just be
slow, because you are processing a lot of data? With numbers those big
I would expect to have enough time to go make a coffee, then drink it.

If you think it is slower than it could be, post more code for
optimization advice...

Will McGugan
Hi Grace,
What Will McGugan said, plus:
1. Post *much* more of your code, e.g. all relevant parts :-)
2. Explain "featureDict" and "componentdict1"; note that you seem to
be doing more dictionary access than array indexing.
3. Tell us what "row" (not mentioned elsewhere) is in the last line of
your code snippet. Should it be "currRow"? For your sake and ours,
copy/paste your code; don't re-type it.
4. Tell us what version of Python [why are you using dict.has_key??],
what platform, and how much memory.
5. Tell us what "very slow" means, e.g. how many rows per second.

HTH,
John

Nov 30 '06 #4

John Machin wrote:
Hi Grace,
What Will McGugan said, plus:
1. Post *much* more of your code e.g. all relevant parts :-)
Note that Grace has also posted this to numpy-discussion and with prompting
provided the following snippet as a distillation of the key slow part:
from numpy import *

componentcount = 300000
currSum = zeros(componentcount)
row = zeros(componentcount)  # current row
rowcount = 50000
for i in range(1,rowcount):
    row[:] = 1
    currSum = currSum + row;
As it is, the OP gets through 1000 rows every 7 seconds or so on their machine,
and I get about the same on mine.

Changing the last line to "currSum += row" gets a 3x speedup. Dropping the
"row[:] = 1" line, since it's really just a time-consuming no-op in the example
and probably not an accurate reflection of what's going on in the real code, gets
you another 2x speedup.
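The difference can be seen in a small, self-contained timing sketch (sizes scaled down to 100 iterations; the exact numbers will of course vary by machine):

```python
import numpy as np
import timeit

n = 300_000
row = np.ones(n)

def out_of_place():
    s = np.zeros(n)
    for _ in range(100):
        s = s + row   # allocates a fresh n-element array every iteration
    return s

def in_place():
    s = np.zeros(n)
    for _ in range(100):
        s += row      # reuses s's buffer; no new allocation
    return s

t1 = timeit.timeit(out_of_place, number=3)
t2 = timeit.timeit(in_place, number=3)
print(f"s = s + row: {t1:.2f}s   s += row: {t2:.2f}s")
```

Both loops compute the same result; the in-place form just avoids creating and garbage-collecting a large temporary array on every pass.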

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco

Nov 30 '06 #5
