473,715 Members | 5,414 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

Speed Q: Storing Large Graphs in DB - Inserting Several Million Rows - Research Proj.

2 New Member
THE PROBLEM
I'm running into performance issues generating and storing a randomly created graph in a SQL Server database. I have a T-SQL script that generates a graph, and then randomly connects the vertices in that graph to each other. I can see that my hard-drive is working very hard throughout the operation. The operation took about 2 hours (I just canceled it at this point, looked like it was about 10% done) with 200,000 vertices and an average of 50 edges per vertex (essentially, it had to insert about 12 million rows).

How can I speed this up? Is there a clever way to get SQL Server to do all the operations in RAM, and then just copy the data from RAM to disk in one contiguous stream after all the inserting is done?

I am developing on a 4gb RAM laptop, and the database should easily fit entirely into RAM several times over--I estimate that my "test" script would generate a 150mbyte database.

THE BACKGROUND
I'm new to database programming and am trying to store a directed graph in a database for a research project. The graph will ultimately contain approximately 1,000,000 vertices with an average of about 150 edges per vertex. There will be more than one graph stored. When this is deployed, it will be deployed on a cluster with (much) more RAM, storage and computing power.

Each edge represents a uni-directional connection between two points (vertices) on the graph). So if point 10 connects to point 23 in a graph, there would be an "edge" entry in the database containing "10" and "23".

If a vertex/point is isolated (it connects to nothing), it is stored as an edge which "connects" to itself: i.e. if point 56 connects to no other points, then there would be an "edge" entry in the database containing "56" and "56".

THE SCHEMA
The table is of the form:
Expand|Select|Wrap|Line Numbers
  1. CREATE TABLE [dbo].[edges](
  2.     [edge_id] [int] IDENTITY(1,1) NOT NULL,
  3.     [graph_id] [int] NOT NULL,
  4.     [primary_vertex_id] [int] NOT NULL,
  5.     [adjacent_vertex_id] [int] NULL CONSTRAINT [DF_edges_adjacent_vertex_id]  DEFAULT (NULL),
  6.  CONSTRAINT [PK_edges] PRIMARY KEY CLUSTERED 
  7. (
  8.     [edge_id] ASC
  9. )WITH (PAD_INDEX  = OFF, IGNORE_DUP_KEY = OFF) ON [PRIMARY]
  10. ) ON [PRIMARY]
  11.  
THE SCRIPT THAT TAKES TOO LONG
This was cobbled together from two functions ("add vertices" and "connect vertices"), I hope it isn't too difficult to read.
Expand|Select|Wrap|Line Numbers
  1. DECLARE @graph_id INT
  2. DECLARE @vertices INT
  3. DECLARE @average_edges INT
  4. SET @graph_id = 35
  5. SET @vertices = 200000 /* 200,000 vertex graph */
  6. SET @average_edges = 50; /* approx 50 connections per vertex */
  7.     BEGIN
  8.         /* get the first new vertex id -- should usually be 1, but just in case there's already some vertices in graph */
  9.         DECLARE @first_vertex_id INT
  10.         SELECT @first_vertex_id = (1 + isnull(max(dbo.edges.primary_vertex_id),0)) FROM dbo.edges WHERE dbo.edges.graph_id = @graph_id;
  11.         BEGIN
  12.             DECLARE @counter INT
  13.             DECLARE @new_vertex_id INT
  14.             SET @counter = 0
  15.             WHILE @counter < @vertices
  16.             BEGIN
  17.                 SET @new_vertex_id = @first_vertex_id + @counter;
  18.                 SET @counter = @counter + 1;
  19.                 /* insert a partial edge with the new vertex id */
  20.                 INSERT INTO dbo.edges(graph_id,
  21.                     primary_vertex_id,
  22.                     adjacent_vertex_id)
  23.                 VALUES(@graph_id, @new_vertex_id, @new_vertex_id);
  24.             END;
  25.             RETURN 0;
  26.         END;
  27.     END;
  28.  
  29.     BEGIN
  30.         DECLARE @lower INT
  31.         DECLARE @upper INT
  32.         SELECT @lower = min(dbo.edges.primary_vertex_id) FROM dbo.edges WHERE dbo.edges.graph_id = @graph_id
  33.         SELECT @upper = max(dbo.edges.primary_vertex_id) FROM dbo.edges WHERE dbo.edges.graph_id = @graph_id
  34.  
  35.         /* note: this algorithm assumes that the vertex_ids are CONTIGUOUS!! and may create duplicate edges */
  36.         BEGIN
  37.             DECLARE @loop_max INT
  38.             DECLARE @counter INT
  39.             DECLARE @rn_one INT
  40.             DECLARE @rn_two INT
  41.             SET @counter = 0
  42.             SELECT @loop_max = (@upper - @lower + 1) * @average_edges
  43.             WHILE @counter < @loop_max
  44.             BEGIN
  45.                 /* pick two random vertex IDs that exist on the graph */
  46.                 SELECT @rn_one = Round(((@Upper - @Lower -1) * Rand() + @Lower), 0);
  47.                 SELECT @rn_two = Round(((@Upper - @Lower -1) * Rand() + @Lower), 0);                
  48.                 /* insert a bi-directional edge for those two vertices */
  49.                 INSERT INTO dbo.edges(graph_id,
  50.                     primary_vertex_id,
  51.                     adjacent_vertex_id)
  52.                 VALUES(@graph_id, @rn_one, @rn_two);
  53.  
  54.                 INSERT INTO dbo.edges(graph_id,
  55.                     primary_vertex_id,
  56.                     adjacent_vertex_id)
  57.                 VALUES(@graph_id, @rn_two, @rn_one);
  58.                 SET @counter = @counter + 1;
  59.             END
  60.         END;
  61.     END;
  62.  
Nov 18 '08 #1
2 3591
Delerna
1,134 Recognized Expert Top Contributor
Is there a clever way to get SQL Server to do all the operations in RAM, and then just copy the data from RAM to disk in one contiguous stream after all the inserting is done
Expand|Select|Wrap|Line Numbers
  1. declare @RamTable table (Field1 int,Field2 varchar(50))
  2. --You now have a table that exists in ram
  3.  
  4. insert into @RamTable select 1,'string value'
  5. insert into @RamTable  1,'string value'
  6. update @RamTable  set Field2=Field2 + convert(varchar(3),Field1)
  7. --Some silly sql to show that you use it like any real table
  8.  
  9. insert into TheRealTable
  10. select * from @RamTable  
  11. --Now you can save the results of all your processing into the actual table
  12.  
Whether this works any faster or not I will leave for you to discover.
Hope it helps?
Nov 19 '08 #2
aeblank
2 New Member
Your tip works great: this sped it up by MORE THAN AN ORDER OF MAGNITUDE ! Thank you very much!

Details:
A 100,000 vertex graph, with an average of 100 connections per vertex, executed fully in 10:52 (just under 11 minutes!), although it used about 1035MB of RAM. The total # of edges in the DB is now approximately 20 million.
Nov 19 '08 #3

Sign in to post your reply or Sign up for a free account.

Similar topics

9
7072
by: rhmd | last post by:
I need to create image files (eg bmp or jpeg) of xy scatter graphs (i.e., graphs in which markers denote individual points; the markers need to be small polygons of various sizes, shapes, colors, shadings, etc. and there are thousands on them on each graph). Have been using MS Excel, but its limitations are unbelievable (only whole number sizes, no way around a 56 color palette, only low quality jpeg files so that when I publish the...
3
1575
by: Jeremy Howard | last post by:
I am finding delete queries on large InnoDB tables very slow - are there ways to speed this up? I have a table with about 100 million rows: I am trying to delete just a few of these rows (the following select takes a couple of seconds): > SELECT count(*) -> FROM UserSnap
4
3376
by: alwaf | last post by:
We are busy designing a generic analytical system at work that will hold multiple analytic types over time. This system is being developed in SQL 2000. Example of table IDENTITY int ItemId int AnalyticType int AnalyticDate DateTime Value numeric(28,15)
18
6359
by: Jeff Boes | last post by:
I'm sure this is a concept that's been explored here. I have a table (fairly simple, just two columns, one of which is a 32-digit checksum) with several million rows (currently, about 7 million). About a million times a day we do select * from my_table where md5 = ? to verify presence or absence of the row, and base further processing on that information.
4
3414
by: Joachim Klassen | last post by:
Hi all, I'm currently investigating the use of MDC Tables for large data warehouse tables. My scenario: A fact table with 1000 Million Rows distributed on 12 Partitions (3 physical hosts with 4 logical partitions each). Overall size of table is 350 GB . Each night 1.5 Million new rows will be added
3
6912
by: Joachim Klassen | last post by:
Hi all, first apologies if this question looks the same as another one I recently posted - its a different thing but for the same szenario:-). We are having performance problems when inserting/deleting rows from a large table. My scenario: Table (lets call it FACT1) with 1000 million rows distributed on 12
60
10140
by: Neil | last post by:
I have a situation with an ODBC linked view in an Access 2000 MDB with a SQL 7 back end. The view is scrolling very slowly. However, if I open the view in an ADP file, it scrolls quickly. I needed to use an ODBC link for the view because it needs to be editable. Otherwise, I would have used a pass-through query. In previous discussions about using an MDB file vs. an ADP file as a front end for SQL Server, the impression I got was that...
8
2563
by: Josué Maldonado | last post by:
Hello List, I'm importing some data from Foxpro to Postgres, there is atable wich contains aprox 4.8 million rows and it size about 830MB. I uploaded it to Postgres using dbf2pg and worked fine, it tooks about 10-15 minutes. Now I'm inserting some data from that table to a brand new table in Postgresql, for that I'm doing insert into ... select from. The point is inserting this data from one table to another table in Postgresql took...
6
2910
by: Carl Banks | last post by:
I was wondering if anyone had any advice on this. This is not to study graph theory; I'm using the graph to represent a problem domain. The graphs could be arbitrarily large, and could easily have millions of nodes, and most nodes have a substantial amount of data associated with them. Obviously I don't want a whole such graph in memory at once, so libraries the just deal with in- memory graphs are out. I know I could implement this...
0
8718
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can effortlessly switch the default language on Windows 10 without reinstalling. I'll walk you through it. First, let's disable language synchronization. With a Microsoft account, language settings sync across devices. To prevent any complications,...
0
9198
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven tapestry of website design and digital marketing. It's not merely about having a website; it's about crafting an immersive digital experience that captivates audiences and drives business growth. The Art of Business Website Design Your website is...
1
9104
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For most users, this new feature is actually very convenient. If you want to control the update process,...
0
9047
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each protocol has its own unique characteristics and advantages, but as a user who is planning to build a smart home system, I am a bit confused by the choice of these technologies. I'm particularly interested in Zigbee because I've heard it does some...
0
7973
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then launch it, all on its own.... Now, this would greatly impact the work of software developers. The idea...
1
6646
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new presenter, Adolph Dupré who will be discussing some powerful techniques for using class modules. He will explain when you may want to use classes instead of User Defined Types (UDT). For example, to manage the data in unbound forms. Adolph will...
0
5967
by: conductexam | last post by:
I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and then checking html paragraph one by one. At the time of converting from word file to html my equations which are in the word document file was convert into image. Globals.ThisAddIn.Application.ActiveDocument.Select();...
1
3175
by: 6302768590 | last post by:
Hai team i want code for transfer the data from one system to another through IP address by using C# our system has to for every 5mins then we have to update the data what the data is updated we have to send another system
3
2119
bsmnconsultancy
by: bsmnconsultancy | last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating effective websites that not only look great but also perform exceptionally well. In this comprehensive...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.