472,131 Members | 1,393 Online
Bytes | Software Development & Data Engineering Community
Post +

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 472,131 software developers and data experts.

Speed Q: Storing Large Graphs in DB - Inserting Several Million Rows - Research Proj.

THE PROBLEM
I'm running into performance issues generating and storing a randomly created graph in a SQL Server database. I have a T-SQL script that generates a graph, and then randomly connects the vertices in that graph to each other. I can see that my hard-drive is working very hard throughout the operation. The operation took about 2 hours (I just canceled it at this point, looked like it was about 10% done) with 200,000 vertices and an average of 50 edges per vertex (essentially, it had to insert about 12 million rows).

How can I speed this up? Is there a clever way to get SQL Server to do all the operations in RAM, and then just copy the data from RAM to disk in one contiguous stream after all the inserting is done?

I am developing on a 4gb RAM laptop, and the database should easily fit entirely into RAM several times over--I estimate that my "test" script would generate a 150mbyte database.

THE BACKGROUND
I'm new to database programming and am trying to store a directed graph in a database for a research project. The graph will ultimately contain approximately 1,000,000 vertices with an average of about 150 edges per vertex. There will be more than one graph stored. When this is deployed, it will be deployed on a cluster with (much) more RAM, storage and computing power.

Each edge represents a uni-directional connection between two points (vertices) on the graph). So if point 10 connects to point 23 in a graph, there would be an "edge" entry in the database containing "10" and "23".

If a vertex/point is isolated (it connects to nothing), it is stored as an edge which "connects" to itself: i.e. if point 56 connects to no other points, then there would be an "edge" entry in the database containing "56" and "56".

THE SCHEMA
The table is of the form:
Expand|Select|Wrap|Line Numbers
  1. CREATE TABLE [dbo].[edges](
  2.     [edge_id] [int] IDENTITY(1,1) NOT NULL,
  3.     [graph_id] [int] NOT NULL,
  4.     [primary_vertex_id] [int] NOT NULL,
  5.     [adjacent_vertex_id] [int] NULL CONSTRAINT [DF_edges_adjacent_vertex_id]  DEFAULT (NULL),
  6.  CONSTRAINT [PK_edges] PRIMARY KEY CLUSTERED 
  7. (
  8.     [edge_id] ASC
  9. )WITH (PAD_INDEX  = OFF, IGNORE_DUP_KEY = OFF) ON [PRIMARY]
  10. ) ON [PRIMARY]
  11.  
THE SCRIPT THAT TAKES TOO LONG
This was cobbled together from two functions ("add vertices" and "connect vertices"), I hope it isn't too difficult to read.
Expand|Select|Wrap|Line Numbers
  1. DECLARE @graph_id INT
  2. DECLARE @vertices INT
  3. DECLARE @average_edges INT
  4. SET @graph_id = 35
  5. SET @vertices = 200000 /* 200,000 vertex graph */
  6. SET @average_edges = 50; /* approx 50 connections per vertex */
  7.     BEGIN
  8.         /* get the first new vertex id -- should usually be 1, but just in case there's already some vertices in graph */
  9.         DECLARE @first_vertex_id INT
  10.         SELECT @first_vertex_id = (1 + isnull(max(dbo.edges.primary_vertex_id),0)) FROM dbo.edges WHERE dbo.edges.graph_id = @graph_id;
  11.         BEGIN
  12.             DECLARE @counter INT
  13.             DECLARE @new_vertex_id INT
  14.             SET @counter = 0
  15.             WHILE @counter < @vertices
  16.             BEGIN
  17.                 SET @new_vertex_id = @first_vertex_id + @counter;
  18.                 SET @counter = @counter + 1;
  19.                 /* insert a partial edge with the new vertex id */
  20.                 INSERT INTO dbo.edges(graph_id,
  21.                     primary_vertex_id,
  22.                     adjacent_vertex_id)
  23.                 VALUES(@graph_id, @new_vertex_id, @new_vertex_id);
  24.             END;
  25.             RETURN 0;
  26.         END;
  27.     END;
  28.  
  29.     BEGIN
  30.         DECLARE @lower INT
  31.         DECLARE @upper INT
  32.         SELECT @lower = min(dbo.edges.primary_vertex_id) FROM dbo.edges WHERE dbo.edges.graph_id = @graph_id
  33.         SELECT @upper = max(dbo.edges.primary_vertex_id) FROM dbo.edges WHERE dbo.edges.graph_id = @graph_id
  34.  
  35.         /* note: this algorithm assumes that the vertex_ids are CONTIGUOUS!! and may create duplicate edges */
  36.         BEGIN
  37.             DECLARE @loop_max INT
  38.             DECLARE @counter INT
  39.             DECLARE @rn_one INT
  40.             DECLARE @rn_two INT
  41.             SET @counter = 0
  42.             SELECT @loop_max = (@upper - @lower + 1) * @average_edges
  43.             WHILE @counter < @loop_max
  44.             BEGIN
  45.                 /* pick two random vertex IDs that exist on the graph */
  46.                 SELECT @rn_one = Round(((@Upper - @Lower -1) * Rand() + @Lower), 0);
  47.                 SELECT @rn_two = Round(((@Upper - @Lower -1) * Rand() + @Lower), 0);                
  48.                 /* insert a bi-directional edge for those two vertices */
  49.                 INSERT INTO dbo.edges(graph_id,
  50.                     primary_vertex_id,
  51.                     adjacent_vertex_id)
  52.                 VALUES(@graph_id, @rn_one, @rn_two);
  53.  
  54.                 INSERT INTO dbo.edges(graph_id,
  55.                     primary_vertex_id,
  56.                     adjacent_vertex_id)
  57.                 VALUES(@graph_id, @rn_two, @rn_one);
  58.                 SET @counter = @counter + 1;
  59.             END
  60.         END;
  61.     END;
  62.  
Nov 18 '08 #1
2 3481
Delerna
1,134 Expert 1GB
Is there a clever way to get SQL Server to do all the operations in RAM, and then just copy the data from RAM to disk in one contiguous stream after all the inserting is done
Expand|Select|Wrap|Line Numbers
  1. declare @RamTable table (Field1 int,Field2 varchar(50))
  2. --You now have a table that exists in ram
  3.  
  4. insert into @RamTable select 1,'string value'
  5. insert into @RamTable  1,'string value'
  6. update @RamTable  set Field2=Field2 + convert(varchar(3),Field1)
  7. --Some silly sql to show that you use it like any real table
  8.  
  9. insert into TheRealTable
  10. select * from @RamTable  
  11. --Now you can save the results of all your processing into the actual table
  12.  
Whether this works any faster or not I will leave for you to discover.
Hope it helps?
Nov 19 '08 #2
Your tip works great: this sped it up by MORE THAN AN ORDER OF MAGNITUDE ! Thank you very much!

Details:
A 100,000 vertex graph, with an average of 100 connections per vertex, executed fully in 10:52 (just under 11 minutes!), although it used about 1035MB of RAM. The total # of edges in the DB is now approximately 20 million.
Nov 19 '08 #3

Post your reply

Sign in to post your reply or Sign up for a free account.

Similar topics

3 posts views Thread by Jeremy Howard | last post: by
4 posts views Thread by alwaf | last post: by
18 posts views Thread by Jeff Boes | last post: by
4 posts views Thread by Joachim Klassen | last post: by
3 posts views Thread by Joachim Klassen | last post: by
60 posts views Thread by Neil | last post: by
8 posts views Thread by Josué Maldonado | last post: by
6 posts views Thread by Carl Banks | last post: by
reply views Thread by leo001 | last post: by

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.