473,545 Members | 2,737 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

very large graph

I need to represent the hyperlinks between a large number of HTML
files as a graph. My non-directed graph will have about 63,000 nodes
and and probably close to 500,000 edges.

I have looked into igraph (http://cneurocvs.rmki.kfki.hu/igraph/doc/
python/index.html) and networkX (https://networkx.lanl.gov/wiki) for
generating a file to store the graph, and I have also looked into
Graphviz for visualization. I'm just not sure which modules are
best. I need to be able to do the following:

1) The names of my nodes are not known ahead of time, so I will
extract the title from all the HTML files to name the nodes prior to
parsing the files for hyperlinks (edges).

2) Every file will be parsed for links and nondirectional connections
will be drawn between the two nodes.

3) The files might link to each other so the graph package needs to
be able to check to see if an edge between two nodes already exists,
or at least not double draw connections between the two nodes when
adding edges.

I'm relatively new to graph theory so I would greatly appreciate any
suggestions for filetypes. I imagine doing this as a python
dictionary with a list for the edges and a node:list paring is out of
the question for such a large graph?
Jun 27 '08 #1
5 4430
chrispoliq...@g mail.com:
My non-directed graph will have about 63,000 nodes
and and probably close to 500,000 edges.
That's large, but today not very large anymore. Today very large
graphs probably have more than millions of nodes...
You have to try, but I think any Python graph lib may be fit for your
purpose.
But I think a list of pairs won't do.

For larger graphs you can use Boost graph too, but it may be overkill
for your purposes.
If your graphs are even larger, you can implement a graph with
DataDraw in C and use the compiled module from Python, this is
probably among the faster possible near-prebuilt data structures.
For even larger graphs you may need the "external" STXXL with C++, and
then custom code.

The list of the operations you have to perform on the graph seems very
simple, so any graph lib may be enough, mine too:
http://sourceforge.net/projects/pynetwork/

For 63,000 nodes and 500,000 edges Graphviz may be too much slow, so
you may need to find a faster visualization tool. Two (or more) of
them are free too.

Ask if you need more information.

Bye,
bearophile
Jun 27 '08 #2
On Jun 24, 1:26*am, chrispoliq...@g mail.com wrote:
I need to represent the hyperlinks between a large number of HTML
files as a graph. *My non-directed graph will have about 63,000 nodes
and and probably close to 500,000 edges.

I have looked into igraph (http://cneurocvs.rmki.kfki.hu/igraph/doc/
python/index.html) and networkX (https://networkx.lanl.gov/wiki) for
generating a file to store the graph, and I have also looked into
Graphviz for visualization. *I'm just not sure which modules are
best. *I need to be able to do the following:

1) *The names of my nodes are not known ahead of time, so I will
extract the title from all the HTML files to name the nodes prior to
parsing the files for hyperlinks (edges).

2) Every file will be parsed for links and nondirectional connections
will be drawn between the two nodes.

3) *The files might link to each other so the graph package needs to
be able to check to see if an edge between two nodes already exists,
or at least not double draw connections between the two nodes when
adding edges.

I'm relatively new to graph theory so I would greatly appreciate any
suggestions for filetypes. *I imagine doing this as a python
dictionary with a list for the edges and a node:list paring is out of
the question for such a large graph?
Perhaps a dictionary where the key is a node and the value is a set of
destination nodes?
Jun 27 '08 #3
On 2008-06-24, MRAB <go****@mrabarn ett.plus.comwro te:
On Jun 24, 1:26*am, chrispoliq...@g mail.com wrote:
>I need to represent the hyperlinks between a large number of HTML
files as a graph. *My non-directed graph will have about 63,000 nodes
and and probably close to 500,000 edges.

I have looked into igraph (http://cneurocvs.rmki.kfki.hu/igraph/doc/
python/index.html) and networkX (https://networkx.lanl.gov/wiki) for
generating a file to store the graph, and I have also looked into
Graphviz for visualization. *I'm just not sure which modules are
best. *I need to be able to do the following:
Afaik Graphviz is not good at abstracting the graph, which you may need here.
A page with 63,000 circles on it, and 500,000 edges will probably come out of
the printer as a black sheet of paper.
(8"x11" paper, 1 circle is 1/5", then you have only 2200 circles at one sheet.
You need a factor 28 more circles which leads to a circle of about 0.007".)

Even when actual paper format would not be a problem, you will need some
abstraction and/or tools, as you will not find anything manually in an ocean of
63,000 elements.

One area you may want to start looking for tools is in state graphs, where the
set of possible states of an entire system is unfolded. These things go up to
over a million states, so you only have a 'small' problem there...
>1) *The names of my nodes are not known ahead of time, so I will
extract the title from all the HTML files to name the nodes prior to
parsing the files for hyperlinks (edges).

2) Every file will be parsed for links and nondirectional connections
will be drawn between the two nodes.

3) *The files might link to each other so the graph package needs to
be able to check to see if an edge between two nodes already exists,
or at least not double draw connections between the two nodes when
adding edges.

I'm relatively new to graph theory so I would greatly appreciate any
suggestions for filetypes. *I imagine doing this as a python
dictionary with a list for the edges and a node:list paring is out of
the question for such a large graph?

Perhaps a dictionary where the key is a node and the value is a set of
destination nodes?
For undirected edges, you could make an Edge class and have a set of Edge's
(where two Edge objects are equal when they connect the same nodes).
I don't expect 500,000 elements in a set to be a problem.

Sincerely,
Albert

Jun 27 '08 #4
On Jun 24, 5:58*am, "A.T.Hofkam p" <h...@se-162.se.wtb.tue. nlwrote:
On 2008-06-24, MRAB <goo...@mrabarn ett.plus.comwro te:
On Jun 24, 1:26*am, chrispoliq...@g mail.com wrote:
I need to represent the hyperlinks between a large number of HTML
files as a graph. *My non-directed graph will have about 63,000 nodes
and and probably close to 500,000 edges.
I have looked into igraph (http://cneurocvs.rmki.kfki.hu/igraph/doc/
python/index.html) and networkX (https://networkx.lanl.gov/wiki) for
generating a file to store the graph, and I have also looked into
Graphviz for visualization. *I'm just not sure which modules are
best. *I need to be able to do the following:

Afaik Graphviz is not good at abstracting the graph, which you may need here.
A page with 63,000 circles on it, and 500,000 edges will probably come out of
the printer as a black sheet of paper.
(8"x11" paper, 1 circle is 1/5", then you have only 2200 circles at one sheet.
You need a factor 28 more circles which leads to a circle of about 0.007"..)

Even when actual paper format would not be a problem, you will need some
abstraction and/or tools, as you will not find anything manually in an ocean of
63,000 elements.

One area you may want to start looking for tools is in state graphs, where the
set of possible states of an entire system is unfolded. These things go up to
over a million states, so you only have a 'small' problem there...


1) *The names of my nodes are not known ahead of time, so I will
extract the title from all the HTML files to name the nodes prior to
parsing the files for hyperlinks (edges).
2) Every file will be parsed for links and nondirectional connections
will be drawn between the two nodes.
3) *The files might link to each other so the graph package needs to
be able to check to see if an edge between two nodes already exists,
or at least not double draw connections between the two nodes when
adding edges.
I'm relatively new to graph theory so I would greatly appreciate any
suggestions for filetypes. *I imagine doing this as a python
dictionary with a list for the edges and a node:list paring is out of
the question for such a large graph?
Perhaps a dictionary where the key is a node and the value is a set of
destination nodes?

For undirected edges, you could make an Edge class and have a set of Edge's
(where two Edge objects are equal when they connect the same nodes).
I don't expect 500,000 elements in a set to be a problem.

Sincerely,
Albert- Hide quoted text -

- Show quoted text -
Readability counts: Do you need to be able to examine the datafile by
hand? If not, investigate the 'shelve' module. You can go two ways.
One is to map each node to a list of nodes. It's probably more
intuitive, but it needs repeated code:

shelf['pageA'].add( pageB )
shelf['pageB']= pageB
shelf['pageB'].add( pageA )

Which is incorrect as is anyway-- updates to shelved objects aren't
committed automatically.

a= shelf['pageA']
a.add( pageB )
shelf['pageA']= a
b= shelf['pageB']
b.add( pageA )
shelf['pageB']= b

is closer. Otherwise, two is to store a 2-tuple of node keys in a
secondary file.

shelfEdges[ frozenset(
shelfVerts[ 'pageA' ], shelfVerts[ 'pageB' ]
) ]= True

Jun 27 '08 #5
Drawing a large graph like this is not
very insightful by itself, and doing this well
is still an art form.
Many cool visualizations, and all
very domain and question dependent,
can be found at
http://www.visualcomplexity.com/vc/

You can also search on flickr for network
and graph drawing.

Much of it is eyecandy though. What do you
really want to know?

Your graph is not overly huge for python, provided
you do not want to compute nonlocal statistics
such as betweenness metrics. (If you have
a laptop running windows and with
less than 1GB RAM, then you have a really large graph.)

Given that you do not know much about graph theory,
I would suggest that networkx applied to a
subset of your large website --- one of the
representative branches --- will allow
for the fastest way to figure out what you
want to do. Python provides access to many other
libraries for html mangling or even webpage
analysis. Imagine writing C code to ask questions
like: how many pdfs and images? how big are they?
when were these pages last updated? etc.
You can come up with 10 questions faster
than you can write C code, and this where
python has its great advantage.

You can learn graph theory while using these libraries,
on one of the smaller branches of your website tree.
Even interactively by using ipython.
This provides at least a feeling of great power while
stumbling in the dark.

Many of your edges are probably repetitions
associated with navigation menus to provide
a standard look-and-feel. See how much of that
you can strip out and find the cross-links that took
effort to insert an manage. (I suggest
doing the analyis on a digraph rather than a graph,
even if you want to draw it as graph.)
For visualization, the new ubigraph is quite
fast compared to graphviz.
See the cool networkx + ubigraph
video at http://ubietylab.net/blog/
Ask on the networkx mailinglist when stuck.

Jun 27 '08 #6

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

9
2887
by: Lilith | last post by:
Is there a python module somewhere (been searching today, no luck) which has efficiently coded various graph-handling routines, such as finding the shortest path through a graph, or the set of all paths through a graph? I'm not a compsci-educated person, so coding my own would be less parsimonious. Thanks for any suggestions! D
4
2949
by: Mike | last post by:
Related to another topic I just posted, I wanted to discuss ways to optimize the validation of very large (>100MB) XML documents. First, I have no idea if something like this already exists; it may even be the typical implementation for all I know. At any rate, it occurs to me that the set of business rules that need to be validated...
1
4373
by: entropy123 | last post by:
Hey all, In an effort to solve a sticky - for me - problem I've picked up Sedgewick's 'Algorithm's in C'. I've tried working through the first few problems but am a little stumped when he refers to #include "Graph.h" in the header file of the first program. The style he uses to make his declarations is a little unfamiliar to me, but here...
2
2870
by: MLH | last post by:
A97 Am having difficulty displaying graph in Form View that I see fine in graph control on form opened in design view. I know I'm doing something wrong. If I open the form in design view - I see the graph in the graph control. If I dbl-clik the graph control, microsoft graph opens and displays the graph fine there too.
11
3014
by: Andreas.Burman | last post by:
Hi What is the best way to implement a undirected weighted graph ADT in javascript?
6
1888
by: semedao | last post by:
Hi All, I had working code that made custom serialization on objects that inherit from queue in the inherited queue I create my own GetObjectData: public void GetObjectData(SerializationInfo info, StreamingContext ctxt) { lock (this.SyncRoot) {
4
1612
by: John Henry | last post by:
I am looking for a ready made simple graph package. I found an extensive one in the piana package but when I try to use just the graph portion, it fails to load because of the line: class Graph(object): ... It seems that their Graph is subclassed from "object" but I couldn't find a "object" class anywhere. So, I abandoned using that...
2
2300
by: Man4ish | last post by:
I have created Graph object without vertex and edge property.It is working fine. #include <boost/config.hpp> #include <iostream> #include <vector> #include <string> #include <boost/graph/adjacency_list.hpp> #include <boost/tuple/tuple.hpp> #include <set> using namespace std;
0
3465
by: eureka2050 | last post by:
Hi all, I am creating a radar chart containing 2 plots using jpgraph. My code is as follows:- include ("./jpgraph/src/jpgraph.php"); include ("./jpgraph/src/jpgraph_radar.php"); // Create the basic radar graph $graph = new RadarGraph(700,500,"auto");
0
7502
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However, people are often confused as to whether an ONU can Work As a Router. In this blog post, we’ll explore What is ONU, What Is Router, ONU & Router’s main...
0
7692
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers, it seems that the internal comparison operator "<=>" tries to promote arguments from unsigned to signed. This is as boiled down as I can make it. ...
0
6026
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then...
1
5360
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new presenter, Adolph Dupré who will be discussing some powerful techniques for using class modules. He will explain when you may want to use classes...
0
5078
by: conductexam | last post by:
I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and then checking html paragraph one by one. At the time of converting from word file to html my equations which are in the word document file was convert...
0
3491
by: TSSRALBI | last post by:
Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The last exercise I practiced was to create a LAN-to-LAN VPN between two Pfsense firewalls, by using IPSEC protocols. I succeeded, with both firewalls in...
1
1921
by: 6302768590 | last post by:
Hai team i want code for transfer the data from one system to another through IP address by using C# our system has to for every 5mins then we have to update the data what the data is updated we have to send another system
1
1045
muto222
by: muto222 | last post by:
How can i add a mobile payment intergratation into php mysql website.
0
744
bsmnconsultancy
by: bsmnconsultancy | last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.