473,401 Members | 2,146 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,401 software developers and data experts.

very large graph

I need to represent the hyperlinks between a large number of HTML
files as a graph. My non-directed graph will have about 63,000 nodes
and and probably close to 500,000 edges.

I have looked into igraph (http://cneurocvs.rmki.kfki.hu/igraph/doc/
python/index.html) and networkX (https://networkx.lanl.gov/wiki) for
generating a file to store the graph, and I have also looked into
Graphviz for visualization. I'm just not sure which modules are
best. I need to be able to do the following:

1) The names of my nodes are not known ahead of time, so I will
extract the title from all the HTML files to name the nodes prior to
parsing the files for hyperlinks (edges).

2) Every file will be parsed for links and nondirectional connections
will be drawn between the two nodes.

3) The files might link to each other so the graph package needs to
be able to check to see if an edge between two nodes already exists,
or at least not double draw connections between the two nodes when
adding edges.

I'm relatively new to graph theory so I would greatly appreciate any
suggestions for filetypes. I imagine doing this as a python
dictionary with a list for the edges and a node:list paring is out of
the question for such a large graph?
Jun 27 '08 #1
5 4417
chrispoliq...@gmail.com:
My non-directed graph will have about 63,000 nodes
and and probably close to 500,000 edges.
That's large, but today not very large anymore. Today very large
graphs probably have more than millions of nodes...
You have to try, but I think any Python graph lib may be fit for your
purpose.
But I think a list of pairs won't do.

For larger graphs you can use Boost graph too, but it may be overkill
for your purposes.
If your graphs are even larger, you can implement a graph with
DataDraw in C and use the compiled module from Python, this is
probably among the faster possible near-prebuilt data structures.
For even larger graphs you may need the "external" STXXL with C++, and
then custom code.

The list of the operations you have to perform on the graph seems very
simple, so any graph lib may be enough, mine too:
http://sourceforge.net/projects/pynetwork/

For 63,000 nodes and 500,000 edges Graphviz may be too much slow, so
you may need to find a faster visualization tool. Two (or more) of
them are free too.

Ask if you need more information.

Bye,
bearophile
Jun 27 '08 #2
On Jun 24, 1:26*am, chrispoliq...@gmail.com wrote:
I need to represent the hyperlinks between a large number of HTML
files as a graph. *My non-directed graph will have about 63,000 nodes
and and probably close to 500,000 edges.

I have looked into igraph (http://cneurocvs.rmki.kfki.hu/igraph/doc/
python/index.html) and networkX (https://networkx.lanl.gov/wiki) for
generating a file to store the graph, and I have also looked into
Graphviz for visualization. *I'm just not sure which modules are
best. *I need to be able to do the following:

1) *The names of my nodes are not known ahead of time, so I will
extract the title from all the HTML files to name the nodes prior to
parsing the files for hyperlinks (edges).

2) Every file will be parsed for links and nondirectional connections
will be drawn between the two nodes.

3) *The files might link to each other so the graph package needs to
be able to check to see if an edge between two nodes already exists,
or at least not double draw connections between the two nodes when
adding edges.

I'm relatively new to graph theory so I would greatly appreciate any
suggestions for filetypes. *I imagine doing this as a python
dictionary with a list for the edges and a node:list paring is out of
the question for such a large graph?
Perhaps a dictionary where the key is a node and the value is a set of
destination nodes?
Jun 27 '08 #3
On 2008-06-24, MRAB <go****@mrabarnett.plus.comwrote:
On Jun 24, 1:26*am, chrispoliq...@gmail.com wrote:
>I need to represent the hyperlinks between a large number of HTML
files as a graph. *My non-directed graph will have about 63,000 nodes
and and probably close to 500,000 edges.

I have looked into igraph (http://cneurocvs.rmki.kfki.hu/igraph/doc/
python/index.html) and networkX (https://networkx.lanl.gov/wiki) for
generating a file to store the graph, and I have also looked into
Graphviz for visualization. *I'm just not sure which modules are
best. *I need to be able to do the following:
Afaik Graphviz is not good at abstracting the graph, which you may need here.
A page with 63,000 circles on it, and 500,000 edges will probably come out of
the printer as a black sheet of paper.
(8"x11" paper, 1 circle is 1/5", then you have only 2200 circles at one sheet.
You need a factor 28 more circles which leads to a circle of about 0.007".)

Even when actual paper format would not be a problem, you will need some
abstraction and/or tools, as you will not find anything manually in an ocean of
63,000 elements.

One area you may want to start looking for tools is in state graphs, where the
set of possible states of an entire system is unfolded. These things go up to
over a million states, so you only have a 'small' problem there...
>1) *The names of my nodes are not known ahead of time, so I will
extract the title from all the HTML files to name the nodes prior to
parsing the files for hyperlinks (edges).

2) Every file will be parsed for links and nondirectional connections
will be drawn between the two nodes.

3) *The files might link to each other so the graph package needs to
be able to check to see if an edge between two nodes already exists,
or at least not double draw connections between the two nodes when
adding edges.

I'm relatively new to graph theory so I would greatly appreciate any
suggestions for filetypes. *I imagine doing this as a python
dictionary with a list for the edges and a node:list paring is out of
the question for such a large graph?

Perhaps a dictionary where the key is a node and the value is a set of
destination nodes?
For undirected edges, you could make an Edge class and have a set of Edge's
(where two Edge objects are equal when they connect the same nodes).
I don't expect 500,000 elements in a set to be a problem.

Sincerely,
Albert

Jun 27 '08 #4
On Jun 24, 5:58*am, "A.T.Hofkamp" <h...@se-162.se.wtb.tue.nlwrote:
On 2008-06-24, MRAB <goo...@mrabarnett.plus.comwrote:
On Jun 24, 1:26*am, chrispoliq...@gmail.com wrote:
I need to represent the hyperlinks between a large number of HTML
files as a graph. *My non-directed graph will have about 63,000 nodes
and and probably close to 500,000 edges.
I have looked into igraph (http://cneurocvs.rmki.kfki.hu/igraph/doc/
python/index.html) and networkX (https://networkx.lanl.gov/wiki) for
generating a file to store the graph, and I have also looked into
Graphviz for visualization. *I'm just not sure which modules are
best. *I need to be able to do the following:

Afaik Graphviz is not good at abstracting the graph, which you may need here.
A page with 63,000 circles on it, and 500,000 edges will probably come out of
the printer as a black sheet of paper.
(8"x11" paper, 1 circle is 1/5", then you have only 2200 circles at one sheet.
You need a factor 28 more circles which leads to a circle of about 0.007"..)

Even when actual paper format would not be a problem, you will need some
abstraction and/or tools, as you will not find anything manually in an ocean of
63,000 elements.

One area you may want to start looking for tools is in state graphs, where the
set of possible states of an entire system is unfolded. These things go up to
over a million states, so you only have a 'small' problem there...


1) *The names of my nodes are not known ahead of time, so I will
extract the title from all the HTML files to name the nodes prior to
parsing the files for hyperlinks (edges).
2) Every file will be parsed for links and nondirectional connections
will be drawn between the two nodes.
3) *The files might link to each other so the graph package needs to
be able to check to see if an edge between two nodes already exists,
or at least not double draw connections between the two nodes when
adding edges.
I'm relatively new to graph theory so I would greatly appreciate any
suggestions for filetypes. *I imagine doing this as a python
dictionary with a list for the edges and a node:list paring is out of
the question for such a large graph?
Perhaps a dictionary where the key is a node and the value is a set of
destination nodes?

For undirected edges, you could make an Edge class and have a set of Edge's
(where two Edge objects are equal when they connect the same nodes).
I don't expect 500,000 elements in a set to be a problem.

Sincerely,
Albert- Hide quoted text -

- Show quoted text -
Readability counts: Do you need to be able to examine the datafile by
hand? If not, investigate the 'shelve' module. You can go two ways.
One is to map each node to a list of nodes. It's probably more
intuitive, but it needs repeated code:

shelf['pageA'].add( pageB )
shelf['pageB']= pageB
shelf['pageB'].add( pageA )

Which is incorrect as is anyway-- updates to shelved objects aren't
committed automatically.

a= shelf['pageA']
a.add( pageB )
shelf['pageA']= a
b= shelf['pageB']
b.add( pageA )
shelf['pageB']= b

is closer. Otherwise, two is to store a 2-tuple of node keys in a
secondary file.

shelfEdges[ frozenset(
shelfVerts[ 'pageA' ], shelfVerts[ 'pageB' ]
) ]= True

Jun 27 '08 #5
Drawing a large graph like this is not
very insightful by itself, and doing this well
is still an art form.
Many cool visualizations, and all
very domain and question dependent,
can be found at
http://www.visualcomplexity.com/vc/

You can also search on flickr for network
and graph drawing.

Much of it is eyecandy though. What do you
really want to know?

Your graph is not overly huge for python, provided
you do not want to compute nonlocal statistics
such as betweenness metrics. (If you have
a laptop running windows and with
less than 1GB RAM, then you have a really large graph.)

Given that you do not know much about graph theory,
I would suggest that networkx applied to a
subset of your large website --- one of the
representative branches --- will allow
for the fastest way to figure out what you
want to do. Python provides access to many other
libraries for html mangling or even webpage
analysis. Imagine writing C code to ask questions
like: how many pdfs and images? how big are they?
when were these pages last updated? etc.
You can come up with 10 questions faster
than you can write C code, and this where
python has its great advantage.

You can learn graph theory while using these libraries,
on one of the smaller branches of your website tree.
Even interactively by using ipython.
This provides at least a feeling of great power while
stumbling in the dark.

Many of your edges are probably repetitions
associated with navigation menus to provide
a standard look-and-feel. See how much of that
you can strip out and find the cross-links that took
effort to insert an manage. (I suggest
doing the analyis on a digraph rather than a graph,
even if you want to draw it as graph.)
For visualization, the new ubigraph is quite
fast compared to graphviz.
See the cool networkx + ubigraph
video at http://ubietylab.net/blog/
Ask on the networkx mailinglist when stuck.

Jun 27 '08 #6

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

9
by: Lilith | last post by:
Is there a python module somewhere (been searching today, no luck) which has efficiently coded various graph-handling routines, such as finding the shortest path through a graph, or the set of all...
4
by: Mike | last post by:
Related to another topic I just posted, I wanted to discuss ways to optimize the validation of very large (>100MB) XML documents. First, I have no idea if something like this already exists; it...
1
by: entropy123 | last post by:
Hey all, In an effort to solve a sticky - for me - problem I've picked up Sedgewick's 'Algorithm's in C'. I've tried working through the first few problems but am a little stumped when he refers...
2
by: MLH | last post by:
A97 Am having difficulty displaying graph in Form View that I see fine in graph control on form opened in design view. I know I'm doing something wrong. If I open the form in design view - I...
11
by: Andreas.Burman | last post by:
Hi What is the best way to implement a undirected weighted graph ADT in javascript?
6
by: semedao | last post by:
Hi All, I had working code that made custom serialization on objects that inherit from queue in the inherited queue I create my own GetObjectData: public void GetObjectData(SerializationInfo info,...
4
by: John Henry | last post by:
I am looking for a ready made simple graph package. I found an extensive one in the piana package but when I try to use just the graph portion, it fails to load because of the line: class...
2
by: Man4ish | last post by:
I have created Graph object without vertex and edge property.It is working fine. #include <boost/config.hpp> #include <iostream> #include <vector> #include <string> #include...
0
by: eureka2050 | last post by:
Hi all, I am creating a radar chart containing 2 plots using jpgraph. My code is as follows:- include ("./jpgraph/src/jpgraph.php"); include ("./jpgraph/src/jpgraph_radar.php"); //...
0
by: Charles Arthur | last post by:
How do i turn on java script on a villaon, callus and itel keypad mobile phone
1
by: nemocccc | last post by:
hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?
1
by: Sonnysonu | last post by:
This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...
0
by: Hystou | last post by:
There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...
0
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
0
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...
0
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows...
0
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing,...
0
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.