
Traversing Friends Graph - How does Friendster do it?

I'm not sure if this forum is the correct place to post this, but I
couldn't think of any other group. I would really appreciate any help
you could give me.

FINAL GOAL OF MY APPLICATION:
Building a Friendster clone for a very large organization (10,000
people). I am using PHP/MySQL.

SETUP OF SYSTEM:
(it may help if you are familiar with how Friendster works)
Mathematically, what I have is a graph with around 10,000 nodes (each
node representing one person). Each node has an average of roughly 40
edges connecting it to other nodes in the graph.

I have implemented this graph in my database using two tables. First, a
Person table assigns each person an ID. Second, a table called "Edge"
defines the edges; every row has two cells, each holding the ID of one
of the two endpoints of an edge.

WHAT I WANT TO DO:
Starting at any node, I want to be able to (as quickly as possible)
calculate which nodes are up to 3 degrees of separation away from it.
I need this because I want to allow someone to search for people
within 3 degrees of themselves.

WHAT I HAVE TRIED:
I wrote a recursive function that starts at a certain node and
traverses all the nodes up to a given distance away. I have added
optimizations to this function, like keeping track of which nodes have
already been visited and not revisiting them. On the site, I was
planning to run this function the first time someone does a search and
then store the results in the session so I wouldn't have to run it
again.
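For what it's worth, here is a rough sketch of the kind of traversal I
mean, written with a queue instead of recursion and with made-up column
names (person_a / person_b) for the two cells of the Edge table:

<?php
// Rough sketch only: one SELECT per expanded node, with a visited set
// so no node is expanded twice. The column names person_a / person_b
// are placeholders for the two "cells" of the Edge table, and $db is
// assumed to be a mysqli connection.
function friendsWithinDegrees(mysqli $db, int $start, int $maxDegree): array
{
    $visited = array($start => 0);   // person id => degree first reached at
    $queue   = array($start);

    while ($queue) {
        $node   = array_shift($queue);
        $degree = $visited[$node];
        if ($degree >= $maxDegree) {
            continue;                // deep enough, don't expand further
        }
        $res = $db->query("SELECT person_a, person_b FROM Edge
                           WHERE person_a = $node OR person_b = $node");
        while ($row = $res->fetch_assoc()) {
            foreach (array((int) $row['person_a'], (int) $row['person_b']) as $id) {
                if (!isset($visited[$id])) {
                    $visited[$id] = $degree + 1;
                    $queue[]      = $id;
                }
            }
        }
    }
    unset($visited[$start]);
    return $visited;                 // person id => degree (1..$maxDegree)
}

Each node that gets expanded costs one SELECT, so the number of queries
grows quickly with the size of the neighborhood.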

The problem is that with a test database where each person has about 40
edges to random nodes, it takes about 60 seconds to calculate the
3rd-degree friends, using about 800 SQL SELECT queries. 60 seconds is
WAY too slow...

Can someone explain, or take a stab at, how Friendster (or any similar
site: Orkut, LinkedIn, hi5, etc.) does it? Their databases are far
larger than mine. Friendster not only keeps track of WHO is up to 3
degrees away from you, it also shows you ALL the paths between you and
any other person within those 3 degrees.

If you can come up with some smart way to solve my problem (maybe
storing everyone's 2nd-degree friends beforehand, or something like
that), I'd love to hear it. I'm really stuck... any help would be
GREATLY appreciated.

Thanks,
Timin

Jul 17 '05 #1
NC
Timin Uram wrote:

WHAT I WANT TO DO:
Starting at any node, I want to be able to (as quickly as possible)
calculate which nodes are up to 3 degrees of separation away from it.
I need this because I want to allow someone to search for people
within 3 degrees of themselves.


OK, here is an idea... Imagine a database table:

CREATE TABLE `knows` (
  `who` int,
  `whom` int,
  KEY `who` (`who`),
  KEY `whom` (`whom`)
) COMMENT='The Who Knows Whom Table';

So if the person whose data is stored in another table under
ID number 1578 knows the person whose data is stored in that
same table under ID number 8690, there would be two records
in the `knows` table:

===========
who whom
===========
1578 8690
8690 1578
===========
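(Just to illustrate the convention, a tiny sketch of how the
application could insert both rows whenever a friendship is created;
the function name is made up and $db is assumed to be a mysqli
connection:)

<?php
// Sketch: store each friendship in both directions so the table can
// be traversed from either side.
function addFriendship(mysqli $db, int $a, int $b): void
{
    $db->query("INSERT INTO knows (who, whom) VALUES ($a, $b), ($b, $a)");
}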

Now we need to figure out, say, who knows ID 6231 firsthand:

SELECT who FROM knows WHERE whom=6231;

Then, we can figure out the first degree of separation (i.e.,
who knows someone who knows 6231):

SELECT
k1.who AS who,
k2.who AS through,
k2.whom AS whom
FROM knows AS k1 LEFT JOIN knows AS k2 ON k1.whom=k2.who
WHERE k2.whom=6231;

Hopefully, this will return something like:

=========================
who through whom
=========================
4572 9421 6231
=========================

meaning, 4572 knows 6231 through 9421.

Now comes the second degree of separation:

SELECT
k1.who AS who,
k2.who AS through1,
k3.who AS through2,
k3.whom AS whom
FROM (knows AS k1 LEFT JOIN knows AS k2 ON k1.whom=k2.who)
LEFT JOIN knows AS k3 ON k2.whom=k3.who
WHERE k3.whom=6231;

And, finally, voila: the third degree:

SELECT
k1.who AS who,
k2.who AS through1,
k3.who AS through2,
k4.who AS through3,
k4.whom AS whom
FROM ((knows AS k1 LEFT JOIN knows AS k2 ON k1.whom=k2.who)
LEFT JOIN knows AS k3 ON k2.whom=k3.who)
LEFT JOIN knows AS k4 ON k3.whom=k4.who
WHERE k4.whom=6231;

Actually, you can write a function that would generate a query
automatically given the number of degrees of separation...
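For instance, a sketch of such a generator in PHP (untested, and
assuming the `knows` table above) could look like this:

<?php
// Sketch: build the N-degrees-of-separation query described above.
// $target is the person everyone should be connected to; $degrees is
// the number of intermediaries (0 = knows the person firsthand).
function buildSeparationQuery(int $target, int $degrees): string
{
    $select = array('k1.who AS who');
    $from   = 'knows AS k1';
    for ($i = 2; $i <= $degrees + 1; $i++) {
        $select[] = sprintf('k%d.who AS through%d', $i, $i - 1);
        $from     = sprintf('(%s LEFT JOIN knows AS k%d ON k%d.whom = k%d.who)',
                            $from, $i, $i - 1, $i);
    }
    $last     = $degrees + 1;
    $select[] = sprintf('k%d.whom AS whom', $last);

    return sprintf('SELECT %s FROM %s WHERE k%d.whom = %d',
                   implode(', ', $select), $from, $last, $target);
}

// buildSeparationQuery(6231, 3) reproduces the third-degree query above
// (with a few extra, harmless parentheses around the joins).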

Cheers,
NC

Jul 17 '05 #2
Hi NC, thanks for your response. Your suggestion definitely cut down on
the time; however, because of multiple joins, the query still ends up
taking about 20 seconds. It also doesn't take loops into account (but
that's OK because we're not infinitely looping).

Are there any other optimizations possible...or something else, like
caching some information somehow? It seems like these friendster sites
do this pretty fast...is there something else possible?

Timin

NC wrote:
<snip>

Jul 17 '05 #3
On 19 Mar 2005 22:14:12 -0800, "Timin Uram" <so*******@hotmail.com>
wrote:
<snip>


Here is what I tried.

Tables
----------

persons
  pid       serial       // int4 with autoincrement
  name      varchar(32)

links
  lid       serial       // int4 with autoincrement (not required,
                         //   but I like to have a unique id in all tables)
  parentid  int4
  childid   int4

Then I filled the persons table with 10000 random names and then
created 400000 random links (this averages to 40 links per person).
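(For illustration, one way such test data could be generated; this is
just a sketch, assuming $db is a pg_connect() connection, the tables
above, and that the serial pids come out as 1..10000. The names are
generated placeholders rather than anything realistic.)

<?php
// Sketch: fill the test tables with 10,000 persons and 400,000 random
// links (about 40 per person).
$numPersons = 10000;
$numLinks   = 400000;

for ($i = 1; $i <= $numPersons; $i++) {
    pg_query($db, "INSERT INTO persons (name) VALUES ('person_$i')");
}

for ($i = 0; $i < $numLinks; $i++) {
    $parent = mt_rand(1, $numPersons);
    do {
        $child = mt_rand(1, $numPersons);
    } while ($child === $parent);        // avoid self-links
    pg_query($db, "INSERT INTO links (parentid, childid) VALUES ($parent, $child)");
}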

I can then select all persons linked to user id 1201 with:

SELECT DISTINCT pid
FROM persons, links
WHERE pid = childid
  AND parentid IN (
        SELECT pid FROM persons, links
        WHERE pid = childid
          AND parentid IN (
                SELECT pid FROM persons, links
                WHERE pid = childid
                  AND parentid = 1201
          )
  );

This takes about 1-2 seconds on my PostgreSQL server, and it could be
even faster with indexes on the links table.

- Allan Savolainen

Jul 17 '05 #4
NC
Timin Uram wrote:

because of multiple joins, the query still ends up
taking about 20 seconds.


???

A similar query I tested on a shared hosting account with a table
of about 7500 records consistently completes in under one
second. Are you sure you have proper indexes in place?

Cheers,
NC

Jul 17 '05 #5
NC
Timin Uram wrote:

Are there any other optimizations possible...or something
else, like caching some information somehow? It seems like
these friendster sites do this pretty fast...is there
something else possible?


Of course there is always something possible. You can get
better hardware (lots of it) and/or hire Jeremy Zawodny to
advise you on performance tuning and replication (the latter,
by the way, is exactly what Friendster did).

Cheers,
NC

Jul 17 '05 #6

NC wrote:
Timin Uram wrote:

because of multiple joins, the query still ends up
taking about 20 seconds.


???

A similar query I tested on a shared hosting with a table
of about 7500 records consistently completes in under one
second. Are you sure you have proper indexes in place?


I have a table of about 100,000 edges (200,000 rows), and it is taking
about 20 seconds. I've tried it on two different servers (a local one,
and one on a shared host).

Another thing is that in a Friendster-type network there is a
clustering effect where, say, 40 friends have most of each other in
each other's friend lists. This leads to a lot of loops (paths such as
8->7->8->9 for a path from 8 to 9). So maybe it's faster when the
connections are completely random... (but I wouldn't think so).

I also have an index on both of the primary key columns (who and whom).

Is there anything else possible? Or do you think there is something
wrong with my setup?

Jul 17 '05 #7
Hey, one thing is that my host runs MySQL 4.0, which doesn't support
subqueries, but I can get around that by running each of those queries
separately and then splicing the resulting IDs, separated by commas,
into the () of the outer query.
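In other words, something like this sketch (assuming mysqli and NC's
`knows` table):

<?php
// Sketch of the MySQL 4.0 workaround: run each level as its own query
// and splice the resulting id list into the IN () of the next one.
// $db is assumed to be a mysqli connection.
function idsKnowing(mysqli $db, array $whom): array
{
    if (!$whom) {
        return array();
    }
    $list = implode(',', array_map('intval', $whom));
    $ids  = array();
    $res  = $db->query("SELECT DISTINCT who FROM knows WHERE whom IN ($list)");
    while ($row = $res->fetch_assoc()) {
        $ids[] = (int) $row['who'];
    }
    return $ids;
}

$level1 = idsKnowing($db, array(6231));   // knows 6231 firsthand
$level2 = idsKnowing($db, $level1);       // one intermediary
$level3 = idsKnowing($db, $level2);       // two intermediaries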

However, I did try the query on my local server with mysql 4.1, and it
still took about 18 seconds. Here is an image of my table's setup from
phpmyadmin: http://raffleforrelief.org/images/knows_table.gif

Is something wrong with my setup or indexes?

Jul 17 '05 #8
Hey Timin,
I read your post. Yeah, I've also been browsing Friendster and hi5, but
right now I'm writing to ask whether you have any idea how their
invitations work. You might be familiar with their invitation feature:
you give them your MSN/Yahoo ID and password, and they grab all the
contacts of that account and send out invitations to them.
Do you know how that works???
Hope to hear from you soon.
Sachin

Jul 17 '05 #9
"NC" <nc@iname.com> wrote in message news:<11**********************@g14g2000cwa.googleg roups.com>...
<snip>
Of course there is always something possible. You can get
better hardware (lots of it) and/or hire Jeremy Zawodny to
advise you on performance tuning and replication (the latter,
by the way, is exactly what Friendster did).


Surprise. I've been reading J.Z.'s blog for some time and might have
missed this news. I couldn't even find it on Google. Just curious, where
did you find this news?

--
<?php echo 'Just another PHP saint'; ?>
Email: rrjanbiah-at-Y!com Blog: http://rajeshanbiah.blogspot.com/
Jul 17 '05 #10
Hey Sachin, I don't know the exact answer to your question, but I'd
guess they are doing the same thing that third-party chat clients like
Trillian and Gaim do.

Gaim is open source, so you could look at its code for doing that.
Friendster or hi5 must have similar code running in the backend.

Also note that Gaim and Trillian are not authorized to do what they do,
but for now, I think the chat networks (MSN, Yahoo, AOL) have mostly
given up trying to fight it and are allowing third-party chat programs
on their networks.

Jul 17 '05 #11
>Another thing is that in a Friendster-type network there is a
clustering effect where, say, 40 friends have most of each other in
each other's friend lists. This leads to a lot of loops (paths such as
8->7->8->9 for a path from 8 to 9). So maybe it's faster when the
connections are completely random... (but I wouldn't think so).
You can modify the queries to avoid loops, or at least minimize them.
Whether the added conditions take more time to check than they save by
cutting down the number of combinations, I don't know. You will still
get many combinations where there are multiple paths between two people
that don't involve loops.
I also have an index on both of the primary key columns (who and whom).

Is there anything else possible? Or do you think there is something
wrong with my setup?


Assuming you are doing a join of 4 instances of the "who knows whom"
table (k1, k2, k3, and k4) for a 3-degrees-of-separation query, and
assuming for the moment that nobody "knows" himself in the table (and
that this is somehow enforced), you might want to put in the
conditions:

k1.who != k3.who
k1.who != k4.who and k2.who != k4.who

which prevents the same person showing up in the chain more than once.

Gordon L. Burditt
Jul 17 '05 #12
NC

R. Rajesh Jeba Anbiah wrote:
NC> hire Jeremy Zawodny to advise you on performance tuning and
NC> replication (the latter, by the way, is exactly what Friendster
NC> did).
Surprise. I've been reading J.Z.'s blog for some time and might have
missed this news. I couldn't even find it on Google. Just curious,
where did you find this news?


It's not on the blog, it's in the "MySQL Stuff" section:

Consulting

I occasionally (as time permits) help out companies with their
MySQL needs. I specialize in performance tuning and replication
but can handle nearly all aspects of MySQL on Linux/Unix systems.
Here are some of my current and past clients.

Quest Software - Foglight group
Plaxo
CafeDVD
Friendster
Feedster LLC
Technorati
Spam Arrest LLC
Cloudmark
Rackspace
LiveJournal
Eight Days Inc.

Contact me for rates and availability.

http://jeremy.zawodny.com/mysql/

Cheers,
NC

Jul 17 '05 #13
NC
Timin Uram wrote:

I did try the query on my local server with mysql 4.1,
and it still took about 18 seconds. Here is an image
of my table's setup from phpmyadmin:
http://raffleforrelief.org/images/knows_table.gif

Is something wrong with my setup or indexes?


Yes. Instead of creating two separate indexes for `who`
and `whom`, you created a single index for both fields.
This makes it difficult to join the table to itself on
copy1.who = copy2.whom.
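For example, something along these lines (a sketch; it assumes the
combined index is the table's primary key and that $db is a mysqli
connection) would replace it with two separate keys:

<?php
// Sketch: drop the combined index and add one index per column.
$db->query("ALTER TABLE knows
            DROP PRIMARY KEY,
            ADD INDEX idx_who (who),
            ADD INDEX idx_whom (whom)");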

Cheers,
NC

Jul 17 '05 #14
NC
Timin Uram wrote:

I've tried it on two different servers (a local one,
and one on a shared host).


OK, let's try this:

1. Change your indexing. Right now, you have a primary
index based on both columns. Replace it with two non-
unique indexes, one for each column.

2. Rewrite the query like this:

SELECT
k1.who AS who,
k2.who AS through1,
k3.who AS through2,
k4.who AS through3,
k4.whom AS whom
FROM ((knows AS k1 LEFT JOIN knows AS k2 ON k1.whom=k2.who)
LEFT JOIN knows AS k3 ON k2.whom=k3.who)
LEFT JOIN knows AS k4 ON k3.whom=k4.who
WHERE k1.who = 6231
AND k3.who <> 6231
AND k4.who <> 6231
AND k4.whom <> 6231

My first post to this topic said "WHERE k4.whom=6231", which
created a lot of completely unnecessary overhead (my apologies
for not thinking about it sooner). This way, it runs MUCH
faster... Also, looping is taken care of by adding extra
WHERE clauses.
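And since you mentioned caching the result in the session, here is a
rough sketch of running it from PHP, trimmed down to just the ids at
the far end (mysqli assumed; $_SESSION['user_id'] is a placeholder for
however you identify the logged-in user):

<?php
// Sketch: run the third-degree query for the logged-in user once and
// cache the distinct ids in the session.
session_start();
$me = (int) $_SESSION['user_id'];

if (!isset($_SESSION['third_degree'])) {
    $sql = "SELECT k4.whom AS whom
            FROM ((knows AS k1 LEFT JOIN knows AS k2 ON k1.whom = k2.who)
                 LEFT JOIN knows AS k3 ON k2.whom = k3.who)
                 LEFT JOIN knows AS k4 ON k3.whom = k4.who
            WHERE k1.who = $me
              AND k3.who <> $me
              AND k4.who <> $me
              AND k4.whom <> $me";

    $people = array();
    $res = $db->query($sql);
    while ($row = $res->fetch_assoc()) {
        $people[(int) $row['whom']] = true;   // keep each person only once
    }
    $_SESSION['third_degree'] = array_keys($people);
}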

Cheers,
NC

Jul 17 '05 #15
NC wrote:
R. Rajesh Jeba Anbiah wrote:
NC> hire Jeremy Zawodny to advise you on performance tuning and
NC> replication (the latter, by the way, is exactly what Friendster
NC> did).

Surprise. I've been reading J.Z.'s blog for some time and might have
missed this news. I couldn't even find it on Google. Just curious, where did you find this news?


It's not on the blog, it's in the "MySQL Stuff" section:

<snip> http://jeremy.zawodny.com/mysql/


Thanks. Interesting, I really did miss it. Now I wonder why Yahoo
hasn't fired him yet ;)

--
<?php echo 'Just another PHP saint'; ?>
Email: rrjanbiah-at-Y!com Blog: http://rajeshanbiah.blogspot.com/

Jul 17 '05 #16
