Sorry Carl Banks for the delayed answer; there are problems with Google Groups.

> This is not to study graph theory; I'm using the graph to represent a
> problem domain. The graphs could be arbitrarily large, and could
> easily have millions of nodes, and most nodes have a substantial
> amount of data associated with them. Obviously I don't want a whole
> such graph in memory at once, so libraries that just deal with
> in-memory graphs are out.

That doesn't sound like a problem for modern PCs.

I think you can keep the whole graph topology in RAM and the node data on disk (for example in a file or in a DB). The topology is represented by the arcs (you have said nothing regarding arc data, so I assume it's absent) and the nodes: in RAM you just need a 32-bit unsigned integer per node, the index of that node's data on disk. If memory becomes tight you can use just 3 bytes per index (2^24 = about 16.8 million distinct nodes), but generally the memory required for the nodes is very little compared to the memory necessary to store the arcs.
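As a minimal sketch of that split (the shelve file name, the payload format, and the helper names are mine, purely for illustration):

import shelve

adjacency = {}                            # node index -> list of successor indexes
node_store = shelve.open("node_data.db")  # on-disk index -> payload map

def add_node(idx, payload):
    adjacency.setdefault(idx, [])
    node_store[str(idx)] = payload        # shelve keys must be strings

def add_arc(src, dst):
    adjacency[src].append(dst)

def node_payload(idx):
    # Only this call touches the disk; traversals use `adjacency` alone.
    return node_store[str(idx)]

add_node(0, {"label": "start"})
add_node(1, {"label": "end"})
add_arc(0, 1)
print(node_payload(1))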

You haven't said how many arcs there are, in total or on average per node, nor whether the arcs are directed or undirected.

Anyway, using my Graph class (which stores each arc twice), this takes about 1 minute and 1.3 GB of RAM (1 million nodes, 10 arcs per node):

from graph import Graph
from random import randrange

g = Graph()
N = 1000000
g.addNodes(xrange(N))                   # create the N nodes first
for i in xrange(N * 10):
    # on average 10 random outbound arcs per node
    g.addArc(randrange(N), randrange(N))

You have said the graph "could easily have millions of nodes", and every node may have tens of arcs or more ("arbitrarily large" is an unsolvable requirement, because every program has limits; no program can work on a truly arbitrarily large dataset), so Python data structures become too costly in RAM. With a lower-level language like D/C/C++ you can manage a bigger graph. You can use Boost Graph, but a homemade graph structure may suffice for your purposes too.

For the problem you have explained I think a very simple graph representation may suffice: an array of integer pairs (starting_index, len), where len is the number of outbound arcs of node n and starting_index points into a second array of indexes/pointers that lists the outbound arcs (starting_index can be a pointer too). If memory gets tight you can split the first array in two and use an array of bytes for the lengths (if a node has more than 255 outbound arcs you may need a short). Note that if you use indexes then a Python array.array (or NumPy) suffices.
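Here is a rough sketch of that representation (the function names are mine; it builds the arrays from a plain list of (src, dst) pairs, using "i" items that are typically 4 bytes each):

from array import array

def build_compact_graph(n_nodes, arcs):
    # arcs: list of (src, dst) pairs.
    # Returns (starts, lengths, targets): the outbound arcs of node n
    # are targets[starts[n] : starts[n] + lengths[n]].
    lengths = array("i", [0] * n_nodes)
    for src, dst in arcs:
        lengths[src] += 1
    starts = array("i", [0] * n_nodes)
    acc = 0
    for n in range(n_nodes):
        starts[n] = acc
        acc += lengths[n]
    targets = array("i", [0] * acc)
    fill = array("i", starts)          # next free slot for each node
    for src, dst in arcs:
        targets[fill[src]] = dst
        fill[src] += 1
    return starts, lengths, targets

starts, lengths, targets = build_compact_graph(4, [(0, 1), (0, 2), (2, 3)])
print(list(targets[starts[0] : starts[0] + lengths[0]]))   # [1, 2]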

In this situation, if nnodes = 10_000_000 and narcs/node = 40 (total arcs = 40 * 10_000_000):

nnodes * narcs * 4 + nnodes * (4 + 4) = 1_680_000_000 bytes, which is often available on modern PCs.

On a 64-bit machine the indexes take the same memory, but pointers take twice that.

In "memory saving" mode:
nnodes * narcs * 3 + nnodes * (3 + 1) = 1_240_000_000 bytes.

A handier compromise is:
nnodes * narcs * 3 + nnodes * (4 + 4) = 1_280_000_000 bytes.
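If it helps, here are the same estimates as a few lines of Python (the variable names are mine), so you can plug in your own numbers:

nnodes = 10 * 1000 * 1000    # 10_000_000 nodes
narcs = 40                   # average outbound arcs per node

plain = nnodes * narcs * 4 + nnodes * (4 + 4)   # 4-byte indexes everywhere
tight = nnodes * narcs * 3 + nnodes * (3 + 1)   # 3-byte indexes, 1-byte lengths
mixed = nnodes * narcs * 3 + nnodes * (4 + 4)   # 3-byte arcs, 4-byte pairs
print(plain, tight, mixed)   # 1680000000 1240000000 1280000000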

> But then what happens if something changes elsewhere?

Regarding the data structure: if you use arrays as I have explained and your updates aren't very frequent, you can allocate small extra arrays to store the additional arcs coming out of a node (though to do this you may prefer pointers over indexes). When you have accumulated a lot of updates you can save everything to disk and then regenerate the whole data structure, as in the sketch below.
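A sketch of that update scheme (the names and the rebuild policy are mine; it reuses the build_compact_graph helper from the sketch above): the compact arrays stay read-only, new arcs go into small per-node overflow lists, and everything is merged and regenerated in one pass.

overflow = {}                      # node -> arcs added since the last rebuild

def add_arc_later(src, dst):
    overflow.setdefault(src, []).append(dst)

def out_arcs(starts, lengths, targets, n):
    # Arcs from the read-only arrays plus any pending updates.
    base = list(targets[starts[n] : starts[n] + lengths[n]])
    return base + overflow.get(n, [])

def rebuild(n_nodes, starts, lengths, targets):
    # Regenerate the compact arrays with the overflow merged in;
    # a real program would also save the result to disk here.
    arcs = [(n, dst)
            for n in range(n_nodes)
            for dst in out_arcs(starts, lengths, targets, n)]
    overflow.clear()
    return build_compact_graph(n_nodes, arcs)   # helper from the sketch above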

Bye,
bearophile