zip() function troubles

Hello all,

I've been debugging the reason for a major slowdown in a piece of
code ... and it turns out that it was the zip function. In the past
the lists that were zipped were reasonably short, but once the size
exceeded 10 million the zip function slowed to a crawl. Note that
there was memory available to store over 100 million items.

Now I know that zip() wastes lots of memory because it copies the
content of the lists. I had used zip to try to trade memory for speed
(heh!), and now that everything has been replaced with izip it works just
fine. What was really surprising is that it works with no issues up
to 1 million items, but for, say, 10 million it pretty much goes
nuts. Does anyone know why? Is there some limit that it reaches, or is
there something about the operating system (Vista in this case) that
makes it behave like this?

I've noticed the same kind of behavior when trying to create very
long lists that should easily fit into memory, yet above a given
threshold I get inexplicable slowdowns. Now that I think about it, is this
something about the way lists grow when expanding them?

and here is the code:

from itertools import izip

BIGNUM = int(1E7)

# let's make a large list
data = range(BIGNUM)

# this works fine (uses about 200 MB and 4 seconds)
s = 0
for x in data:
    s += x
print s

# this works fine, 4 seconds as well
s = 0
for x1, x2 in izip(data, data):
    s += x1
print s

# this takes over 2 minutes! and uses 600 MB of memory
# the memory usage slowly ticks upwards
s = 0
for x1, x2 in zip(data, data):
    s += x1
print s

Jul 26 '07 #1
Istvan Albert <is***********@gmail.com> writes:
> exceeded 10 million the zip function slowed to a crawl. Note that
> there was memory available to store over 100 million items.

How many bytes is that? Maybe the items (heap-allocated boxed
integers in your code example) are bigger than you expect.
Jul 26 '07 #2
On Jul 26, 7:44 pm, Paul Rubin <http://phr...@NOSPAM.invalid> wrote:
> Istvan Albert <istvan.alb...@gmail.com> writes:
> > exceeded 10 million the zip function slowed to a crawl. Note that
> > there was memory available to store over 100 million items.
>
> How many bytes is that? Maybe the items (heap-allocated boxed
> integers in your code example) are bigger than you expect.

While I don't have an answer to this,
the point that I was trying to make is that I'm fairly certain that it
is not a memory issue (some sort of swapping), because the overall
memory usage with the OS included is about 1 GB (out of an available 2 GB).

I tested this on a Linux server system with 4 GB of RAM.

a = [ 0 ] * 10**7

takes milliseconds, but, say,

b = zip(a,a)

will take a very long time to finish:

atlas:~$ time python -c "a = [0] * 10**7"

real 0m0.165s
user 0m0.128s
sys 0m0.036s
atlas:~$ time python -c "a = [0] * 10**7; b= zip(a,a)"

real 0m55.150s
user 0m54.919s
sys 0m0.232s

Istvan

Jul 27 '07 #3
Istvan Albert <is***********@gmail.com> writes:
> I tested this on a linux server system with 4Gb of RAM
> a = [ 0 ] * 10**7
> takes milliseconds, but say the
> b = zip(a,a)
> will take a very long time to finish:

Do a top or vmstat while that is happening and see if you are
swapping. You are allocating 10 million ints and 10 million tuple
nodes, i.e. 20 million objects. Although, even at 100 bytes per object
that would be 1 GB, which would fit in your machine easily. Is it
a 64-bit CPU?
Jul 27 '07 #4
On Jul 26, 9:33 pm, Paul Rubin <http://phr...@NOSPAM.invalid> wrote:
> Do a top or vmstat while that is happening and see if you are
> swapping. You are allocating 10 million ints and 10 million tuple
> nodes, = 20 million objects. Although, even at 100 bytes per object
> that would be 1GB which would fit in your machine easily. Is it
> a 64 bit cpu?

We can safely drop the memory limit as being the cause and think about
something else.

If you try it yourself you'll see that it is very easy to generate 10
million tuples;
on my system it takes 3 (!!!) seconds to do the following:

size = 10**7
data = []
for i in range(10):
    x = [ (0,1) ] * size
    data.append( x )

Now it takes over two minutes to do this:

size = 10**7
a = [ 0 ] * size
b = zip(a,a)

The only explanation I can come up with is that the internal
implementation of zip must have some flaws.
Jul 27 '07 #5
Istvan Albert <is***********@gmail.com> writes:
> Now it takes over two minutes to do this:
>
> size = 10**7
> a = [ 0 ] * size
> b = zip(a,a)

OK, I'm getting similar results under 64-bit Python 2.4.4c1 and also
under 2.5. About 103 seconds for 10**7 and 26 seconds for 5*10**6.
So it looks like zip is using quadratic time. I suggest entering a
bug report.
Jul 27 '07 #6
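
The quadratic-time suspicion above rests on a doubling argument: 26 seconds at 5*10**6 against 103 seconds at 10**7 is a ratio of about 4, where a linear algorithm would give about 2. A minimal sketch of that measurement follows. It is written for Python 3, where zip is lazy, so list() is used to force the full result the way Python 2's zip did. The names time_zip and scaling_table are illustrative only, and the sizes are kept small; later CPython versions changed the collector's behavior, so the slowdown reported in this thread may not reproduce.

```python
import time


def time_zip(n):
    """Return the seconds taken to build a list of n tuples with zip."""
    a = [0] * n
    start = time.perf_counter()
    b = list(zip(a, a))  # list() forces the full result, like Python 2's zip
    elapsed = time.perf_counter() - start
    assert len(b) == n
    return elapsed


def scaling_table(sizes):
    """Time each size and report the ratio to the previous timing."""
    rows = []
    previous = None
    for n in sizes:
        t = time_zip(n)
        # A ratio near 2 when n doubles suggests linear time, near 4 quadratic.
        ratio = (t / previous) if previous else None
        rows.append((n, t, ratio))
        previous = t
    return rows


if __name__ == "__main__":
    for n, t, ratio in scaling_table([10**5, 2 * 10**5, 4 * 10**5]):
        shown = "n/a" if ratio is None else "%.2f" % ratio
        print("n=%d  seconds=%.4f  ratio=%s" % (n, t, shown))
```

Timings at these sizes are noisy, so the ratios are only indicative; repeating each measurement and taking the minimum would give steadier numbers.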
When you are allocating a lot of objects without releasing them, the garbage
collector kicks in to look for cycles. Try switching it off:

import gc
gc.disable()
try:
    pass  # do the zipping here
finally:
    gc.enable()

Peter
Jul 27 '07 #7
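
The disable/enable pattern above can be wrapped so that it is harder to forget the finally clause. The following is a sketch only, written for Python 3; gc_paused and zip_to_list are hypothetical helper names, not part of the standard library. Unlike the snippet above, it re-enables the collector only if it was enabled on entry.

```python
import gc
from contextlib import contextmanager


@contextmanager
def gc_paused():
    """Temporarily disable the cyclic garbage collector.

    The collector is re-enabled on exit only if it was enabled on entry,
    so nested use does not switch it on unexpectedly.
    """
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        if was_enabled:
            gc.enable()


def zip_to_list(a, b):
    """Build the full list of pairs with the collector paused."""
    with gc_paused():
        return list(zip(a, b))


if __name__ == "__main__":
    data = list(range(10**6))
    pairs = zip_to_list(data, data)
    print(len(pairs), gc.isenabled())
```

Reference counting still frees objects while the collector is paused; only cycle detection is suspended, which is why this is reasonable for a bulk allocation that creates no cycles.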

"Istvan Albert" <is***********@gmail.com> wrote in message
news:11**********************@l70g2000hse.googlegroups.com...
| if you try it yourself you'll see that it is very easy to generate 10
| million tuples,

No it is not on most machines.

| on my system it takes 3 (!!!) seconds to do the following:
|
| size = 10**7
| data = []
| for i in range(10):
|     x = [ (0,1) ] * size

x has 10**7 references (4 bytes each) to the same tuple. Use id() to
check. 40 megs is manageable.

|     data.append( x )
|
| Now it takes over two minutes to do this:
|
| size = 10**7
| a = [ 0 ] * size
| b = zip(a,a)

b is a 40-meg list that references 10 million *different* tuples. Each is 20 to 40
bytes, so 200-400 megs more. Try
[(i,i) for i in xrange(5000000)]
for comparison (it also makes 10000000 objects plus a large list).

| the only explanation I can come up with is that the internal
| implementation of zip must have some flaws

References are not objects.
Terry Jan Reedy

Jul 27 '07 #8
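
The id() check suggested above can be sketched as follows, assuming Python 3 (hence list(zip(...)) rather than a bare zip). The helper count_distinct_objects is an illustrative name, and the size is kept small so the check runs instantly.

```python
def count_distinct_objects(items):
    """Count the distinct objects (by identity) referenced from a list."""
    return len({id(item) for item in items})


size = 1000

# Repeating a list repeats the reference: one tuple, referenced `size` times.
shared = [(0, 1)] * size

# zip builds a fresh tuple for every position.
a = [0] * size
zipped = list(zip(a, a))

print(count_distinct_objects(shared))  # → 1
print(count_distinct_objects(zipped))  # → 1000
```

So the first loop in the quoted code allocates ten tuples' worth of references in total, while the zip call allocates ten million separate tuple objects.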
Peter Otten <__*******@web.de> writes:
> When you are allocating a lot of objects without releasing them the garbage
> collector kicks in to look for cycles. Try switching it off:

I think that is the answer. The zip took almost 2 minutes without
turning gc off, but takes 1.25 seconds with gc off. The gc turned a
linear-time algorithm into a quadratic one. I think something is
broken about a design where that can happen. Maybe PyPy will have
a generational GC someday.
Jul 27 '07 #9
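
To see how often the collector actually runs during a bulk allocation, the gc module exposes its thresholds and, in Python 3.3 and later, a callbacks list. The sketch below assumes Python 3; count_collections is a hypothetical helper name. A collection is started when allocations minus deallocations of container objects exceed the first threshold, so a loop that only allocates triggers it repeatedly while the set of live objects keeps growing. The exact counts depend on the interpreter version, so none are shown here.

```python
import gc
from collections import Counter


def count_collections(work):
    """Run work() and count collector runs per generation while it executes."""
    runs = Counter()

    def on_gc(phase, info):
        # The collector calls this with phase "start" or "stop".
        if phase == "start":
            runs[info["generation"]] += 1

    gc.callbacks.append(on_gc)
    try:
        result = work()
    finally:
        gc.callbacks.remove(on_gc)
    return runs, result


if __name__ == "__main__":
    print("thresholds:", gc.get_threshold())
    # Tuples that contain lists are tracked by the collector.
    runs, pairs = count_collections(lambda: [(i, [i]) for i in range(200000)])
    print("tuples built:", len(pairs))
    print("collector runs by generation:", dict(runs))
```

Running the same call inside a gc.disable() / gc.enable() pair should report no runs, which is the effect described in the posts above.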
On Jul 27, 1:24 am, Peter Otten <__pete...@web.de> wrote:
> When you are allocating a lot of objects without releasing them the garbage
> collector kicks in to look for cycles. Try switching it off:
>
> import gc
> gc.disable()

Yes, this solves the problem I was experiencing. Thanks.

Istvan

Jul 27 '07 #10
