Initializing the number of slots in a dictionary

Is there some way to tell a dictionary object that I am going to load 1M
objects into it and have it pre-allocate enough slots to hold all of the
entries, thus avoiding many thousand memory allocations?

Jon Smirl
jo******@gmail.com

Aug 6 '06 #1
Jon Smirl wrote:
> Is there some way to tell a dictionary object that I am going to load 1M
> objects into it and have it pre-allocate enough slots to hold all of the
> entries?

Not according to the manual.

Not according to the source [as at 2.4.3]. In any case, if there were a
back-door undocumented arg for the dict constructor, somebody would
have read the source and spread the news.

> Thus avoiding many thousand memory allocations.

What gives you the impression that "many thousand memory allocations"
are involved? [My impression is that the dict would be resized about 11
times on its trip from size 8 to size 2M, and each resize would involve
one allocation.]
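
For the curious, here is a rough sketch that simulates that growth. The
policy is my reading of CPython 2.4's dictobject.c, so treat it as an
assumption: the table resizes when it is two-thirds full, to the first
power of 2 above 4*used (2*used once used exceeds 50000).

    # Sketch only: simulates the growth policy described above, not
    # the actual CPython implementation.
    size = 8
    resizes = 0
    for used in xrange(1, 1000001):
        if used * 3 >= size * 2:        # table is two-thirds full
            if used <= 50000:
                target = 4 * used
            else:
                target = 2 * used
            while size <= target:
                size <<= 1              # grow to the next power of 2
            resizes += 1
    print resizes, size                 # -> 11 2097152

Which agrees with the estimate above: 11 resizes, ending at 2M slots.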

Do you have an application with a performance problem? If so, what
makes you think inserting 1M items into a Python dict is contributing
to the problem?

Have you read the thread "Large Dictionaries" which started on/about
2006-05-15?

In general, what is the background to your question?

Cheers,
John

Aug 6 '06 #2
On Sun, 06 Aug 2006 15:33:30 -0700, John Machin wrote:
> Jon Smirl wrote:
>> Is there some way to tell a dictionary object that I am going to load 1M
>> objects into it and have it pre-allocate enough slots to hold all of
>> the entries?
>
> Not according to the manual.
>
> Not according to the source [as at 2.4.3]. In any case, if there were a
> back-door undocumented arg for the dict constructor, somebody would have
> read the source and spread the news.
>
>> Thus avoiding many thousand memory allocations.
>
> What gives you the impression that "many thousand memory allocations" are
> involved? [My impression is that the dict would be resized about 11 times
> on its trip from size 8 to size 2M, and each resize would involve one
> allocation.]
>
> Do you have an application with a performance problem? If so, what makes
> you think inserting 1M items into a Python dict is contributing to the
> problem?

I know in advance how many items will be added to the dictionary. Most
dictionary implementations I have previously worked with are more
efficient if they know ahead of time how big to make their tables.

In this case I only need to do a presence test of the key; there is no
actual data associated with the key. The table is used for detecting
duplicate entries. Is there a more efficient way to do this test than
sticking an empty string into a dict? The keys are sha1.digest().
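
The pattern I'm using looks roughly like this (a sketch with stand-in
data, not the actual cvs2svn code):

    # Duplicate detection via a dict presence test.
    import sha

    records = ["alpha", "beta", "alpha"]    # stand-in for the real entries
    seen = {}
    for record in records:
        key = sha.new(record).digest()
        if key in seen:
            continue                        # duplicate entry: skip it
        seen[key] = ""                      # empty string as a dummy value
        print record                        # stand-in for the real work
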
> Have you read the thread "Large Dictionaries" which started on/about
> 2006-05-15?

I'll look at this.

> In general, what is the background to your question?

I am in the process of modifying cvs2svn to do cvs2git. The purpose of
this is to convert Mozilla CVS (4GB) into git format. Currently a
conversion takes 5 hours but the code isn't finished yet.

Other programs that attempt to process the Mozilla CVS repository take
several days to run. But none of the other programs can successfully
convert the Mozilla repository without encountering errors. cvs2svn can
successfully convert the repository to Subversion, but it takes about a
week. It is
obvious that a lot of effort has been put into teaching cvs2svn how to
deal with broken CVS repositories.

I do practice on smaller repositories, but some of the problems I am
dealing with only occur when processing the full dataset (it contains
weird CVS errors). In general the conversion process is completely IO
bound. Since I am rerunning the conversion over and over, anything
simple that speeds it up is helpful. I already have 4GB of RAM; it
would take around 30GB to get everything in memory.

Dictionaries are not a big problem for me, but there are many in use and
they have millions of items in them. But first I have to get the code for
converting to git working; then I can worry about how long it runs. Two
or three days is OK; a month is not.

Jon Smirl
jo******@gmail.com

Aug 7 '06 #3
...

[Jon Smirl]
> I know in advance how many items will be added to the dictionary. Most
> dictionary implementations I have previously worked with are more
> efficient if they know ahead of time how big to make their tables.

Richard Jones spent considerable time investigating whether
"pre-sizing" lists and dicts in CPython could help, at the "Need For
Speed" sprint earlier this year. He didn't find a win worth getting;
e.g., read the section "List pre-allocation" at:

http://wiki.python.org/moin/NeedForSpeed/Failures

Trying it for dicts was also part of what he did, but I don't think
specifics about that were recorded on the Wiki. I was at the sprint,
and hoots of triumph from Richard's direction were conspicuous by
absence during his dict time ;-)

> In this case I only need to do a presence test of the key; there is no
> actual data associated with the key. The table is used for detecting
> duplicate entries. Is there a more efficient way to do this test than
> sticking an empty string into a dict? The keys are sha1.digest().

It's probably more common to set the value to None or 1, but it
doesn't really matter in reality. In theory, using 1 or an empty
string relies on the implementation accident that CPython stores those
uniquely, while it's guaranteed that None is a singleton object.

BTW, is there a reason to use SHA instead of MD5? I ask because the
latter is 4 bytes shorter, and you apparently have a /lot/ of keys.
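
A quick check with the 2.4-era md5 and sha modules (hashlib arrived
later) shows the difference:

    # MD5 digests are 16 bytes; SHA-1 digests are 20 bytes.
    import md5, sha

    data = "some record"
    print len(md5.new(data).digest())   # 16
    print len(sha.new(data).digest())   # 20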

If you're using a current version of Python, you can save some memory
by using a builtin set object instead. The CPython implementation of
sets is very much like its implementation of dicts, but doesn't
consume any memory for the (non-existent) associated values. You also
get a number of operations useful on sets (like intersection and
union).
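
A minimal sketch of the duplicate test with a builtin set (stand-in
data again; sets are builtin as of Python 2.4):

    # Same presence test as with a dict, but a set stores only keys.
    import sha

    records = ["alpha", "beta", "alpha"]    # stand-in data
    seen = set()
    for record in records:
        key = sha.new(record).digest()
        if key not in seen:
            seen.add(key)
            print record                    # stand-in for the real work
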
...

> Since I am rerunning the conversion over and over, anything simple that
> speeds it up is helpful.
>
> I already have 4GB of RAM; it would take around 30GB to get everything
> in memory.
>
> Dictionaries are not a big problem for me, but there are many in use and
> they have millions of items in them.

A peculiar suggestion: if you don't need to release system resources
cleanly at the end of a run, try doing:

    import os
    os._exit(0)

at the end. If you have dicts with millions of entries swapped to
disk, it /can/ consume major time just to page them all in again to
decrement the key & value refcounts if you let "clean shutdown" code
determine they've become trash. Bailing ungracefully can skip all
that work (but also skips other work, like letting the platform C I/O
library close still-open files gracefully).
Aug 7 '06 #4
On Mon, 07 Aug 2006 00:33:33 -0400, Tim Peters wrote:
...

> [Jon Smirl]
>> I know in advance how many items will be added to the dictionary. Most
>> dictionary implementations I have previously worked with are more
>> efficient if they know ahead of time how big to make their tables.
>
> Richard Jones spent considerable time investigating whether "pre-sizing"
> lists and dicts in CPython could help, at the "Need For Speed" sprint
> earlier this year. He didn't find a win worth getting; e.g., read the
> section "List pre-allocation" at:
>
> http://wiki.python.org/moin/NeedForSpeed/Failures
>
> Trying it for dicts was also part of what he did, but I don't think
> specifics about that were recorded on the Wiki. I was at the sprint, and
> hoots of triumph from Richard's direction were conspicuous by absence
> during his dict time ;-)
>
>> In this case I only need to do a presence test of the key; there is no
>> actual data associated with the key. The table is used for detecting
>> duplicate entries. Is there a more efficient way to do this test than
>> sticking an empty string into a dict? The keys are sha1.digest().
>
> It's probably more common to set the value to None or 1, but it doesn't
> really matter in reality. In theory, using 1 or an empty string relies on
> the implementation accident that CPython stores those uniquely, while it's
> guaranteed that None is a singleton object.
>
> BTW, is there a reason to use SHA instead of MD5? I ask because the
> latter is 4 bytes shorter, and you apparently have a /lot/ of keys.

http://git.or.cz/index.html
git is the source code control tool used by the Linux kernel. It is
optimized for distributed development. git is what the kernel developers
wrote to replace Bitkeeper after the Bitkeeper license change.

git identifies its changesets with SHA-1 hashes.

Distributed SCM is very important when working on large projects. With a
distributed SCM nobody needs commit privs - everyone has their own
complete copy of the repository. To get a change into a distribution you
need to convince someone up the hierarchy to pull your changes into
their repository. In the Linux world Linus makes the distributions. There
are about 10 people in the next circle and about 100 in the circle after
that. You need to convince one of them to accept your changes, but that's
not very hard if your code makes sense.

Distributed also means that you can pull changes from anyone else into
your repository. So if you want to run Reiser4 and it isn't in the main
Linus tree yet, just pull a copy from the namesys git tree into your local
tree and everything will get merged.

Python is using Subversion, which is way better than CVS, but Subversion
is still organized around a central repository with commit privs. If that
repository disappears in an earthquake, Python may be in trouble.

If you give git a try you will also notice that it is way faster than
Subversion.

> If you're using a current version of Python, you can save some memory by
> using a builtin set object instead. The CPython implementation of sets is
> very much like its implementation of dicts, but doesn't consume any memory
> for the (non-existent) associated values. You also get a number of
> operations useful on sets (like intersection and union).

Sets may be a good option; I'll adjust the code for the next run.

> ...
>> Since I am rerunning the conversion over and over, anything simple
>> that speeds it up is helpful.
>>
>> I already have 4GB of RAM; it would take around 30GB to get everything
>> in memory.
>>
>> Dictionaries are not a big problem for me, but there are many in use
>> and they have millions of items in them.

> A peculiar suggestion: if you don't need to release system resources
> cleanly at the end of a run, try doing:
>
>     import os
>     os._exit(0)
>
> at the end. If you have dicts with millions of entries swapped to disk,
> it /can/ consume major time just to page them all in again to decrement
> the key & value refcounts if you let "clean shutdown" code determine
> they've become trash. Bailing ungracefully can skip all that work (but
> also skips other work, like letting the platform C I/O library close
> still-open files gracefully).

Jon Smirl
jo******@gmail.com

Aug 7 '06 #5

Jon Smirl wrote:
> On Sun, 06 Aug 2006 15:33:30 -0700, John Machin wrote:
> [snip...]
>
>> Do you have an application with a performance problem? If so, what makes
>> you think inserting 1M items into a Python dict is contributing to the
>> problem?
>
> I know in advance how many items will be added to the dictionary. Most
> dictionary implementations I have previously worked with are more
> efficient if they know ahead of time how big to make their tables.
>
> In this case I only need to do a presence test of the key; there is no
> actual data associated with the key. The table is used for detecting
> duplicate entries. Is there a more efficient way to do this test than
> sticking an empty string into a dict? The keys are sha1.digest().

Possibly a naive question, but would using sets be more efficient?

They are generally used for the sort of job you are describing (testing
for duplicates rather than storing data associated with a key).

We did some limited performance tests at Resolver Systems and found use
of sets and dictionaries to be almost exactly the same speed, but there
could be memory advantages for you. Performance is also likely to be
different for vast data sets, but I understand the hashing algorithm to
be very similar for both...
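
For the curious, a quick-and-dirty membership benchmark along those
lines (a sketch, not our actual test harness; timings vary by machine
and Python version):

    # Rough membership-test timing for dict vs. set.
    import timeit

    setup = (
        "import os\n"
        "keys = [os.urandom(20) for i in range(100000)]\n"
        "d = dict((k, '') for k in keys)\n"
        "s = set(keys)\n"
        "probe = keys[12345]\n"
    )
    print timeit.Timer("probe in d", setup).timeit(1000000)
    print timeit.Timer("probe in s", setup).timeit(1000000)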

Fuzzyman
http://www.voidspace.org.uk/python/index.shtml

Aug 7 '06 #6
