473,386 Members | 1,726 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,386 software developers and data experts.

Database specialized in storing directed graphs?

I was wondering if anyone had any advice on this.

This is not to study graph theory; I'm using the graph to represent a
problem domain. The graphs could be arbitrarily large, and could
easily have millions of nodes, and most nodes have a substantial
amount of data associated with them. Obviously I don't want a whole
such graph in memory at once, so libraries the just deal with in-
memory graphs are out.

I know I could implement this with a relational DB, and I'd be fine
with a library that was built on top of one. But I'm hoping for
specialzed behavior useful for graphs.

For example, suppose I have a set of nodes called A. It would be
useful to know if any of these nodes are connected by paths that
include no nodes in A. I could, of course, do that by reading from
the database and following the paths, but I clearly don't want to do
that. I would want instead to somehow cache the outside connectedness
relationship between different nodes of A. But then what happens if
something changes elsewhere. How is the cache for A notified, and can
it be updated efficiently using graph theorems, rather than
regenerated?

It's very tricky; that's why I hope someone else has done it.

I'm guessing no.
Carl Banks
Oct 28 '08 #1
6 2889
On Mon, Oct 27, 2008 at 5:32 PM, Carl Banks <pa************@gmail.comwrote:
I was wondering if anyone had any advice on this.

This is not to study graph theory; I'm using the graph to represent a
problem domain. The graphs could be arbitrarily large, and could
easily have millions of nodes, and most nodes have a substantial
amount of data associated with them. Obviously I don't want a whole
such graph in memory at once, so libraries the just deal with in-
memory graphs are out.

I know I could implement this with a relational DB, and I'd be fine
with a library that was built on top of one. But I'm hoping for
specialzed behavior useful for graphs.

For example, suppose I have a set of nodes called A. It would be
useful to know if any of these nodes are connected by paths that
include no nodes in A. I could, of course, do that by reading from
the database and following the paths, but I clearly don't want to do
that. I would want instead to somehow cache the outside connectedness
relationship between different nodes of A. But then what happens if
something changes elsewhere. How is the cache for A notified, and can
it be updated efficiently using graph theorems, rather than
regenerated?

It's very tricky; that's why I hope someone else has done it.

I'm guessing no.
By sacrificing a goat at the altar of the Almighty Google, I was able
to locate a project I came upon a long while ago but couldn't remember
the name of that's vaguely like what you want, in that it's a "graph
database": Neo4j - http://neo4j.org/ (and yes, it's in Java; sigh)
Not sure it's exactly what you're looking for, but anyway....

Cheers,
Chris
--
Follow the path of the Iguana...
http://rebertia.com
>

Carl Banks
--
http://mail.python.org/mailman/listinfo/python-list
Oct 28 '08 #2
On 2008-10-28 01:32, Carl Banks wrote:
I was wondering if anyone had any advice on this.

This is not to study graph theory; I'm using the graph to represent a
problem domain. The graphs could be arbitrarily large, and could
easily have millions of nodes, and most nodes have a substantial
amount of data associated with them. Obviously I don't want a whole
such graph in memory at once, so libraries the just deal with in-
memory graphs are out.

I know I could implement this with a relational DB, and I'd be fine
with a library that was built on top of one. But I'm hoping for
specialzed behavior useful for graphs.

For example, suppose I have a set of nodes called A. It would be
useful to know if any of these nodes are connected by paths that
include no nodes in A. I could, of course, do that by reading from
the database and following the paths, but I clearly don't want to do
that. I would want instead to somehow cache the outside connectedness
relationship between different nodes of A. But then what happens if
something changes elsewhere. How is the cache for A notified, and can
it be updated efficiently using graph theorems, rather than
regenerated?

It's very tricky; that's why I hope someone else has done it.

I'm guessing no.
Aaron Watters is an expert on this and has implemented kjbuckets
for doing this in memory:

http://gadfly.sourceforge.net/kjbuckets.html

Gadfly uses the library to implement relational queries (and works
on disk):

http://gadfly.sourceforge.net/

The package is now maintained by Richard Jones.

You might be able to reuse some parts of Gadfly for your
purposes.

Also have a look at Pygr:

http://bioinfo.mbi.ucla.edu/pygr

which is a Python library to build a graph interface on top of
a relational database.

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source (#1, Oct 28 2008)
>>Python/Zope Consulting and Support ... http://www.egenix.com/
mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
__________________________________________________ ______________________

:::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,MacOSX for free ! ::::
eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
Registered at Amtsgericht Duesseldorf: HRB 46611
Oct 28 '08 #3
On Oct 27, 8:32*pm, Carl Banks <pavlovevide...@gmail.comwrote:
I was wondering if anyone had any advice on this.

This is not to study graph theory; I'm using the graph to represent a
problem domain. *The graphs could be arbitrarily large, and could
easily have millions of nodes, and most nodes have a substantial
amount of data associated with them. *Obviously I don't want a whole
such graph in memory at once, so libraries the just deal with in-
memory graphs are out.

I know I could implement this with a relational DB, and I'd be fine
with a library that was built on top of one. *But I'm hoping for
specialzed behavior useful for graphs.
If you're looking for FOSS, the Boost graph library [1] or its
parallel extension [2] is probably your best bet; it also comes with
Python bindings but they are not maintained any more. For commercial
solutions, Star-P [3] seems an interesting platform, with bindings to
Matlab and Python. Freebase [4] is apparently built on a special graph
database but unfortunately only the stored data are available, not the
DB source code.

George

[1] http://www.boost.org/doc/libs/1_36_0...doc/index.html
[2] http://www.osl.iu.edu/research/pbgl/
[3] http://www.interactivesupercomputing...sematrices.php
[4] http://www.freebase.com/help/faq#q7
Oct 28 '08 #4
Sorry Carl Banks for the answering delay, there are problems in Google
Groups.
This is not to study graph theory; I'm using the graph to represent a
problem domain. The graphs could be arbitrarily large, and could
easily have millions of nodes, and most nodes have a substantial
amount of data associated with them. Obviously I don't want a whole
such graph in memory at once, so libraries the just deal with in-
memory graphs are out.
It doesn't sound a problem for modern PCs.
I think you can keep the whole graph topology in RAM, and the node
data on disk (for example in a file or in DB). The topology is
represented by the arcs (you have said nothing regarding arc data, so
I assume it's absent) and the nodes (in RAM you just need a 32 bit
unsigned integer to represent the index of the node that is stored on
disk. If memory becomes tight you can use just 3 bytes (2 ^ 24 = 16
millions different nodes) for the nodes, but generally the memory
required for the nodes is very little compared to the memory necessary
to store the arcs).

You haven't said how many arcs there are, total or average for node,
and if such arcs are directed or undirected.

Anyway, using my Graph class (that stores each arc twice), this takes
about 1 minute and 1.3 GB of RAM (1 million nodes, 10 arcs for node):

from graph import Graph
from random import randrange
g = Graph()
N = 1000000
g.addNodes(xrange(N))
for i in xrange(N * 10):
g.addArc(randrange(N), randrange(N))
You have said "could easily have millions of nodes", and every arc may
have tens of arcs or more.
("arbitrarily large" is an unsolvable problem because there are always
limits in your program, there's no program that's able to work on an
"arbitrarily large" dataset), so Python data structures become too
much costly for the RAM. With a lower level language like D/C/C++ you
can manage a bigger graph. You can use Boost Graph, but a homemade
graph structure may suffice too for your purposes.

For the problem you have explained I think a very simple graph
representation may suffice: an array of integer pairs (starting_index,
len) (starting_index can be a pointer too), where len is the number of
outbound arcs of the node n, plus an array of indexes/pointers that
list the outbound arcs. If memory gets tight you can split this second
array in two, and use an array of bytes for the lengths (if you have
more than 256 outbound arcs you may need a short). Note that if you
use indexes then a Python array.array (or Numpy) suffices.

In this situation if nnodes = 10_000_000 and narcs/node = 40 (total
nodes = 40 * 10_000_000):
nnodes * narcs * 4 + nnodes * (4 + 4) = 1_680_000_000 bytes, that is
often available on modern PCs.

On a 64 bit machine the indexes take the same memory, but pointers
twice that.

In "memory saving" mode:
nnodes * narcs * 3 + nnodes * (3 + 1) = 1_240_000_000 bytes.

A more handy compromise is:
nnodes * narcs * 3 + nnodes * (4 + 4) = 1_280_000_000 bytes.
But then what happens if something changes elsewhere.
Regarding the data structure, if you use arrays like I have explained,
if your updates aren't much frequent then you can allocate small extra
arrays to store more arcs coming out a node (but to do this you
probably may enjoy using pointers instead of indexes). When you have a
lot of updated you can save all to disk, and then regenerate the whole
data structure.

Bye,
bearophile
Oct 28 '08 #5
Don't really know if this will be useful but i'd try pytables:
http://www.pytables.org/moin
it deals very well with every kind of hierarchical data sets, doesn't
matter the size.
It will let you load only significant data, and you'll be able to
query your data.
It's built on top of HDF5 libraries but exposes a very friendly
pythonic interface.
Surely you'll still have to implement all of the graph logic yourself
but this could be a good starting point.
Hope it helps

Marco
Oct 28 '08 #6
On Oct 27, 7:32*pm, Carl Banks <pavlovevide...@gmail.comwrote:
I was wondering if anyone had any advice on this.

This is not to study graph theory; I'm using the graph to represent a
problem domain. *The graphs could be arbitrarily large, and could
easily have millions of nodes, and most nodes have a substantial
amount of data associated with them. *Obviously I don't want a whole
such graph in memory at once, so libraries the just deal with in-
memory graphs are out.

I know I could implement this with a relational DB, and I'd be fine
with a library that was built on top of one. *But I'm hoping for
specialzed behavior useful for graphs.

For example, suppose I have a set of nodes called A. *It would be
useful to know if any of these nodes are connected by paths that
include no nodes in A. *I could, of course, do that by reading from
the database and following the paths, but I clearly don't want to do
that. *I would want instead to somehow cache the outside connectedness
relationship between different nodes of A. *But then what happens if
something changes elsewhere. *How is the cache for A notified, and can
it be updated efficiently using graph theorems, rather than
regenerated?

It's very tricky; that's why I hope someone else has done it.

I'm guessing no.

Carl Banks
Are you just looking for a persistent graph? The usual options are
'shelve', 'sqllite', 'mmap', and 'ctypes.Structure'. Or did you need
a special structure for the 'outside connectedness' properties?

Storing that is the usual time-space tradeoff. (Actually, I think
that term refers to something slightly distinct. This would be more
properly termed read-time/write-time trade off, roughly.)

Assuming you pick an RDB, you'd have a table of vertices and a table
of edges. Did you want 'set A' to be a table and persist between runs
of the program? If not, just keep a set of 2-tuples of nodes that
satisfy that property. I think the set S you're after is: all x,y
such that x is a vertex in A, y is a vertex in A, and there exists a P
such that P is a path, P starts at x, P ends at y, and vertex v in P
implies that v is x, v is y, or v is not in A. I don't know if you
can cache any information about A that gives you any shortcuts in
calculating S, or what the running time or running space of the
fastest/smallest algorithm is. At worst, it's O( |V| * |E| ) on each
change to V or E. Unverified.

You might be able to store the shortest path in a mapping from 2-
tuples to paths. Or better yet, map each node in A to the set of all
nodes reachable from it only entering A once. Then, you might save
time on either adding a node/vertex to A or G, or removing one from A
or G, or both. Maybe mapping each node in G to that relative set.
You didn't say whether the graph was directed and/or cyclic.

A good place to start would be, if you add a node to G, can S
decrease? No, because the original path P still exists. If you
remove one from A, still in G, S can stay the same size, might
decrease. If you remove a node from G, not in A, same, etc.
Oct 28 '08 #7

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

1
by: Roderik | last post by:
Hi, When developing a website in general when should I choose to put content in a database. For menu options, settings, and listings like products it seems to be clear for me to put them in. But...
1
by: Nicola | last post by:
Consider the following implementation of a graph, whose nodes must be of type Node or of a subclass of Node: class Node { public: Node(Data* d) { adjList = new vector<Node*>; data = d; }...
14
by: Dave | last post by:
Hello all, After perusing the Standard, I believe it is true to say that once you insert an element into a std::list<>, its location in memory never changes. This makes a std::list<> ideal for...
1
by: pocm | last post by:
Hi all, I'm at work and I don't have a copy of C Unleashed here with me, I have it at home. However, I'd like someone to confirm me that the Graph code of C Unleashed is not for Directed Graphs....
4
by: Krista Lemieux | last post by:
Hello, I know this is probably a hudge topic to discuss and there are lots of different ways of implementation, but I still would like to ask and hear the most commonly used techniques for this....
6
by: Saket Mundra | last post by:
I have a web application with two forms. After user enters data in first form he is directed to the second form. After Filling the second form as he clicks on save button, the data entered is...
32
by: David Isaac | last post by:
I have no experience with database applications. This database will likely hold only a few hundred items, including both textfiles and binary files. I would like a pure Python solution to the...
10
by: shsandeep | last post by:
The ETL application loaded around 3000 rows in 14 seconds in a Development database while it took 2 hours to load in a UAT database. UAT db is partitioned. Dev db is not partitioned. the...
9
by: TDB | last post by:
Hello, I'm creating an application using C which requires the data structures like trees and graphs to be stored in files and retrieved later ( simply serialization of a data structure ) . Is...
0
by: Charles Arthur | last post by:
How do i turn on java script on a villaon, callus and itel keypad mobile phone
0
by: aa123db | last post by:
Variable and constants Use var or let for variables and const fror constants. Var foo ='bar'; Let foo ='bar';const baz ='bar'; Functions function $name$ ($parameters$) { } ...
0
by: ryjfgjl | last post by:
If we have dozens or hundreds of excel to import into the database, if we use the excel import function provided by database editors such as navicat, it will be extremely tedious and time-consuming...
0
by: emmanuelkatto | last post by:
Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel
0
BarryA
by: BarryA | last post by:
What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...
1
by: nemocccc | last post by:
hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?
1
by: Sonnysonu | last post by:
This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...
0
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
0
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.