zip() function troubles

Hello all,

I've been debugging the reason for a major slowdown in a piece of
code ... and it turns out that it was the zip function. In the past
the lists that were zipped were reasonably short, but once the size
exceeded 10 million the zip function slowed to a crawl. Note that
there was memory available to store over 100 million items.

Now I know that zip() wastes lots of memory because it copies the
content of the lists; I had used zip to try to trade memory for speed
(heh!), and now that everything has been replaced with izip it works just
fine. What was really surprising is that it works with no issues up
until 1 million items, but for, say, 10 million it pretty much goes
nuts. Does anyone know why? Is there some limit that it reaches, or is
there something about the operating system (Vista in this case) that
makes it behave like this?

I've noticed the same kind of behavior when trying to create very
long lists that should easily fit into memory, yet above a given
threshold I get inexplicable slowdowns. Now that I think about it, is this
something about the way lists grow when they are expanded?

and here is the code:

from itertools import izip

BIGNUM = int(1E7)

# let's make a large list
data = range(BIGNUM)

# this works fine (uses about 200 MB and 4 seconds)
s = 0
for x in data:
    s += x
print s

# this works fine, 4 seconds as well
s = 0
for x1, x2 in izip(data, data):
    s += x1
print s

# this takes over 2 minutes! and uses 600 MB of memory
# the memory usage slowly ticks upwards
s = 0
for x1, x2 in zip(data, data):
    s += x1
print s

Jul 26 '07 #1
Istvan Albert <is***********@gmail.com> writes:
exceeded 10 million the zip function slowed to a crawl. Note that
there was memory available to store over 100 million items.
How many bytes is that? Maybe the items (heap-allocated boxed
integers in your code example) are bigger than you expect.
Jul 26 '07 #2
On Jul 26, 7:44 pm, Paul Rubin <http://phr...@NOSPAM.invalid> wrote:
Istvan Albert <istvan.alb...@gmail.com> writes:
exceeded 10 million the zip function slowed to a crawl. Note that
there was memory available to store over 100 million items.

How many bytes is that? Maybe the items (heap-allocated boxed
integers in your code example) are bigger than you expect.
While I don't have an answer to this, the point that I was trying to
make is that I'm fairly certain that it is not a memory issue (some
sort of swapping), because the overall memory usage with the OS
included is about 1 GB (out of an available 2 GB).

I tested this on a Linux server system with 4 GB of RAM:

a = [ 0 ] * 10**7

takes milliseconds, but

b = zip(a,a)

will take a very long time to finish:

atlas:~$ time python -c "a = [0] * 10**7"

real 0m0.165s
user 0m0.128s
sys 0m0.036s
atlas:~$ time python -c "a = [0] * 10**7; b= zip(a,a)"

real 0m55.150s
user 0m54.919s
sys 0m0.232s

Istvan

Jul 27 '07 #3
Istvan Albert <is***********@gmail.com> writes:
I tested this on a Linux server system with 4 GB of RAM:
a = [ 0 ] * 10**7
takes milliseconds, but
b = zip(a,a)
will take a very long time to finish:
Do a top or vmstat while that is happening and see if you are
swapping. You are allocating 10 million ints and 10 million tuple
nodes = 20 million objects. Although, even at 100 bytes per object
that would be about 2 GB, which should still fit in your machine. Is it
a 64-bit CPU?
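
For what it's worth, on Python 2.6+ the per-object sizes can be checked
directly; a rough sketch (sizes differ between 32-bit and 64-bit builds):

import sys

a = [0] * 10**6
b = zip(a, a)

print sys.getsizeof(b)      # the list object itself (just the array of references)
print sys.getsizeof(b[0])   # one (0, 0) tuple; multiply by len(b) for a rough total
print sys.getsizeof(0)      # a small int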
Jul 27 '07 #4
On Jul 26, 9:33 pm, Paul Rubin <http://phr...@NOSPAM.invalid> wrote:
Do a top or vmstat while that is happening and see if you are
swapping. You are allocating 10 million ints and 10 million tuple
nodes = 20 million objects. Although, even at 100 bytes per object
that would be about 2 GB, which should still fit in your machine. Is it
a 64-bit CPU?
We can safely drop the memory limit as being the cause and think about
something else.

If you try it yourself you'll see that it is very easy to generate 10
million tuples; on my system it takes 3 (!!!) seconds to do the following:

size = 10**7
data = []
for i in range(10):
    x = [ (0,1) ] * size
    data.append( x )

Now it takes over two minutes to do this:

size = 10**7
a = [ 0 ] * size
b = zip(a,a)

The only explanation I can come up with is that the internal
implementation of zip must have some flaws.
Jul 27 '07 #5
Istvan Albert <is***********@gmail.com> writes:
Now it takes over two minutes to do this:

size = 10**7
a = [ 0 ] * size
b = zip(a,a)
OK, I'm getting similar results under 64-bit Python 2.4.4c1 and also
under 2.5. About 103 seconds for 10**7 and 26 seconds for 5*10**6.
So it looks like zip is using quadratic time. I suggest entering a
bug report.
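
One rough way to see the scaling is to time zip at a few sizes (a sketch;
the absolute numbers will of course differ per machine):

import time

for size in (10**6, 2 * 10**6, 4 * 10**6, 8 * 10**6):
    a = [0] * size
    start = time.time()
    b = zip(a, a)
    elapsed = time.time() - start
    # with the default gc settings the time grows much faster than the size
    print size, "%.2f s" % elapsed
    del a, b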
Jul 27 '07 #6
Istvan Albert wrote:
I've been debugging the reason for a major slowdown in a piece of
code ... and it turns out that it was the zip function. [...]

# this takes over 2 minutes! and uses 600 MB of memory
# the memory usage slowly ticks upwards
s = 0
for x1, x2 in zip(data, data):
    s += x1
print s
When you are allocating a lot of objects without releasing them the garbage
collector kicks in to look for cycles. Try switching it off:

import gc
gc.disable()
try:
    b = zip(data, data)   # do the zipping
finally:
    gc.enable()
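
For example, a rough before/after timing sketch (the numbers are
illustrative and will vary by machine and Python version):

import gc, time

a = [0] * 10**7

start = time.time()
b = zip(a, a)                  # gc enabled: the collector runs repeatedly
print "with gc: %.1f s" % (time.time() - start)

del b
gc.disable()
start = time.time()
b = zip(a, a)                  # gc disabled: back to roughly linear time
print "without gc: %.1f s" % (time.time() - start)
gc.enable()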

Peter
Jul 27 '07 #7

"Istvan Albert" <is***********@gmail.comwrote in message
news:11**********************@l70g2000hse.googlegr oups.com...
| if you try it yourself you'll see that it is very easy to generate 10
| million tuples,

No it is not on most machines.

| on my system it takes 3 (!!!) seconds to do the following:
|
| size = 10**7
| data = []
| for i in range(10):
|     x = [ (0,1) ] * size

x has 10**7 references (4 bytes each) to the same tuple. Use id() to
check. 40 megs is manageable.

|     data.append( x )
|
| Now it takes over two minutes to do this:
|
| size = 10**7
| a = [ 0 ] * size
| b = zip(a,a)

b has 40 megs of references to 10 million *different* tuples. Each is 20 to 40
bytes, so 200-400 megs more. Try
[(i,i) for i in xrange(5000000)]
for comparison (it also makes 10000000 objects plus a large list).

| the only explanation I can come up with is that the internal
| implementation of zip must have some flaws

References are not objects.
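
A small sketch makes the distinction visible with id() (illustrative only):

size = 10**6

x = [ (0,1) ] * size                   # `size` references to ONE tuple object
print len(set(id(t) for t in x))       # -> 1

a = [0] * size
b = zip(a, a)                          # one NEW tuple object per list slot
print len(set(id(t) for t in b))       # -> 1000000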
Terry Jan Reedy

Jul 27 '07 #8
Peter Otten <__*******@web.de> writes:
When you are allocating a lot of objects without releasing them the garbage
collector kicks in to look for cycles. Try switching it off:
I think that is the answer. The zip took almost 2 minutes without
turning gc off, but takes 1.25 seconds with gc off. It turned a
linear-time algorithm into a quadratic one. I think something is
broken about a design where that can happen. Maybe Pypy will have
a generational GC someday.
Jul 27 '07 #9
On Jul 27, 1:24 am, Peter Otten <__pete...@web.de> wrote:
When you are allocating a lot of objects without releasing them the garbage
collector kicks in to look for cycles. Try switching it off:

import gc
gc.disable()
Yes, this solves the problem I was experiencing. Thanks.

Istvan

Jul 27 '07 #10
On Jul 27, 2:16 am, "Terry Reedy" <tjre...@udel.edu> wrote:
References are not objects.
Yes, this is a valid objection, but in all fairness the example in the
original post operates on comparably sized objects and also exhibited
unexpected performance degradation.

As it turns out, garbage collection is the culprit. I never used to
have to care about gc (and that's great!), but now that I'm
shuffling 1 GB chunks I have to be more careful.

best,

Istvan

Jul 27 '07 #11
On 26 Jul 2007 23:35:44 -0700, Paul Rubin
<"http://phr.cx"@nospam.invalidwrote:
Peter Otten <__*******@web.dewrites:
When you are allocating a lot of objects without releasing them the garbage
collector kicks in to look for cycles. Try switching it off:

I think that is the answer. The zip took almost 2 minutes without
turning gc off, but takes 1.25 seconds with gc off. It turned a
linear-time algorithm into a quadratic one. I think something is
broken about a design where that can happen. Maybe Pypy will have
a generational GC someday.
--
Better hand in your computer, then. You're never going to find a
situation where the environment won't affect the running time of your
algorithms.

For the record, the python GC is generational. This is a case of a
default heuristic giving pathological behavior in a corner case, not
anything broken about the design of the python GC.
Jul 27 '07 #12
On 7/27/07, Istvan Albert <is***********@gmail.com> wrote:
as it turns out, garbage collection is the culprit. I never used to
have to care about gc (and that's great!), but now that I'm
shuffling 1 GB chunks I have to be more careful.
You might be interested in the gc.set_threshold function.
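
For example (700, 10, 10 are the CPython defaults; this sketch just raises
the first threshold so the cycle detector runs far less often during a
large batch of allocations):

import gc

print gc.get_threshold()          # (700, 10, 10) by default

# raise the generation-0 threshold before doing the big allocation
gc.set_threshold(100000, 10, 10)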
Jul 27 '07 #13
On Jul 26, 4:25 pm, Istvan Albert <istvan.alb...@gmail.com> wrote:
Now I know that zip () wastes lots of memory because it copies the
content of the lists, I had used zip to try to trade memory for speed
(heh!) , and now that everything was replaced with izip it works just
fine. What was really surprising is that it works with no issues up
until 1 million items, but for say 10 million it pretty much goes
nuts. Does anyone know why?
There's nothing in izip() that holds memory, tracks indices, or is
sensitive to the length of its inputs (it simply calls the next method
on the input iterators and returns the tuple result). So, if you're
seeing behavior changes at 10 millions items, the cause almost
certainly lies elsewhere. One possible candidate could be memory
consumed by immortal integer objects.
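
Roughly speaking, izip behaves like this pure-Python sketch (the real
implementation is in C):

def izip_sketch(*iterables):
    # keeps only the iterators around; each result tuple is built on demand
    iterables = map(iter, iterables)
    while iterables:
        result = [it.next() for it in iterables]
        yield tuple(result)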
Raymond Hettinger

Jul 27 '07 #14
Raymond Hettinger <py****@rcn.com> writes:
What was really surprising is that it works with no issues up
until 1 million items, but for say 10 million it pretty much goes
nuts. Does anyone know why?

There's nothing in izip() that holds memory, tracks indices, or ...
The issue was with zip, not izip. It was caused by gc running too often.
Jul 27 '07 #15
On Jul 27, 2:18 pm, Raymond Hettinger <pyt...@rcn.com> wrote:
>What was really surprising is that it works
>with no issues up until 1 million items

Later editing made the sentence more difficult to read. I should have
said: "What was really surprising is that zip works with no issues up
until 1 million items."

It was the zip function (and the garbage collection that it repeatedly
triggers) that caused the problem.

best,

Istvan

Jul 27 '07 #16

"Istvan Albert" <is***********@gmail.comwrote in message
news:11**********************@r34g2000hsd.googlegr oups.com...
| On Jul 27, 2:18 pm, Raymond Hettinger <pyt...@rcn.comwrote:
|
| >What was really surprising is that it works
| >with no issues up until 1 million items"
|
| later editing made the sentence more difficult to read
| I should have said: "What was really surprising is that zip works
| with no issues up until 1 million items"
|
| It was the zip function (and the garbage collection that it repeatedly
| triggers) that cause the problem

It is not the zip function that caused a problem. It was the creation of
so many tuples without any deletions. The list comp example I gave does
the same thing and has the same problem. Nothing to do with zipping per
se.

tjr

Jul 27 '07 #17
"Chris Mellon" <ar*****@gmail.comwrites:
Better hand in your computer, then. You're never going to find a
situation where the environment won't affect the running time of your
algorithms.
The environment may affect the running time by an additive or linear
multiplicative constant but it should never turn an O(n) algorithm
into an O(n**2) one.
For the record, the python GC is generational. This is a case of a
default heuristic giving pathological behavior in a corner case, not
anything broken about the design of the python GC.
No, it is broken, per discussion on a
comp.compilers/comp.lang.functional thread this week. The notion of
using a generational collector was to collect less and less frequently
for older and older generations (e.g. doubling the amount of
allocation between generations) but the usual solution is apparently
to increase the heap size by some multiplicative factor when GC fails
to free enough memory.
Jul 31 '07 #18
On 31 Jul 2007 12:57:13 -0700, Paul Rubin
<"http://phr.cx"@nospam.invalidwrote:
"Chris Mellon" <ar*****@gmail.comwrites:
Better hand in your computer, then. You're never going to find a
situation where the environment won't affect the running time of your
algorithms.

The environment may affect the running time by an additive or linear
multiplicative constant but it should never turn an O(n) algorithm
into an O(n**2) one.
Hardly true. When you start to factor real world issues like memory
allocations (and all the complexities thereof) and cache misses into
algorithmic performance an algorithm certainly can change from O(n) to
O(n**2). This is one reason why real world performance is tested and
compared using benchmarks and not Big O notation.
For the record, the python GC is generational. This is a case of a
default heuristic giving pathological behavior in a corner case, not
anything broken about the design of the python GC.

No, it is broken, per discussion on a
comp.compilers/comp.lang.functional thread this week. The notion of
using a generational collector was to collect less and less frequently
for older and older generations (e.g. doubling the amount of
allocation between generations) but the usual solution is apparently
to increase the heap size by some multiplicative factor when GC fails
to free enough memory.
--
The issue at hand has nothing to do with the generational nature of
the Python GC. It's caused by a heuristic that aggressively attempts
to reclaim objects when there are large numbers of allocations without
releases. This is a tuneable GC parameter.
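
The counter involved can be inspected directly (gc.get_count() needs
Python 2.5+; a small sketch):

import gc

# (allocations - deallocations) since the last collection, per generation
print gc.get_count()

# generation 0 is collected once the first counter exceeds the first
# threshold (700 by default)
print gc.get_threshold()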
Aug 1 '07 #19
