
Proposal for a cascaded master-slave replication system

Dear community,

for some reason, the post I sent last night still has not shown up on
the mailing lists. I have put up some links on the developers' site at
http://developer.postgresql.org/~wieck/slony1.html

The concept will be the basis for some of my work as a Software Engineer
here at Afilias USA INC. in the near future. Like many of you, Afilias
needs reliable and performant replication solutions for backup and
failover purposes. We started this work a couple of weeks ago by
defining the goals and required features for our usage of PostgreSQL.

Slony-I will be the first of two distinct replication systems designed
with the 24/7 datacenter in mind.

We want to build this system as a community project. The plan from the
beginning was to release the product under the BSD license, and we think
it is best to start it as such and to ask for suggestions already during
the design phase.

I would like to start developing the replication engine itself as soon
as possible, and as a PostgreSQL core developer I will certainly put
some of my spare time into this as well. On the other hand, there is no
design at all for the frontend tools yet other than "they mostly call
some stored procedures", and I think that we need some really good admin
tools in the end.

I look forward to your comments.
Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== Ja******@Yahoo.com #


---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://archives.postgresql.org

Nov 12 '05 #1
9 Replies


Jan Wieck wrote:
> http://developer.postgresql.org/~wieck/slony1.html

Very interesting read. Nice work!

> We want to build this system as a community project. The plan was from
> the beginning to release the product under the BSD license. And we think
> it is best to start it as such and to ask for suggestions during the
> design phase already.


I couldn't quite tell from the design doc -- do you intend to support
conditional replication at a row level?

I'm also curious: with cascaded replication, how do you handle the case
where a second-level slave has a transaction failure for some reason, i.e.:

            M
          /   \
        Sa     Sb
       /  \   /  \
     Sc   Sd Se   Sf

What happens if data is successfully replicated to Sa, Sb, Sc, and Sd,
and then an exception/rollback occurs on Se?

Joe

Nov 12 '05 #2

Joe Conway wrote:
> Jan Wieck wrote:
>> http://developer.postgresql.org/~wieck/slony1.html
>
> Very interesting read. Nice work!
>
>> We want to build this system as a community project. The plan was from
>> the beginning to release the product under the BSD license. And we think
>> it is best to start it as such and to ask for suggestions during the
>> design phase already.
>
> I couldn't quite tell from the design doc -- do you intend to support
> conditional replication at a row level?


If you mean to configure the system to replicate rows to different
destinations (slaves) based on arbitrary qualifications, no. I had
thought about it, but it does not really fit into the "datacenter and
failover" picture, so it is not required to meet the goals and adds
unnecessary complexity.

This sort of feature is much more important for a replication system
designed for hundreds or thousands of sporadic, asynchronous
multi-master systems, the typical "salesman on the street" kind of
replication.

> I'm also curious, with cascaded replication, how do you handle the case
> where a second-level slave has a transaction failure for some reason, i.e.:
>
>             M
>           /   \
>         Sa     Sb
>        /  \   /  \
>      Sc   Sd Se   Sf
>
> What happens if data is successfully replicated to Sa, Sb, Sc, and Sd,
> and then an exception/rollback occurs on Se?


First, it does not replicate single transactions; it replicates batches
of them together. Since the transactions are already committed (and
possibly some others depending on them too), there is no way back - you
lose Se.

If this is only a temporary failure, like a power failure, and the
database recovers fine on restart, including the last confirmed SYNC
event, it will catch up. (SYNC events get confirmed after they commit
locally, but that is before the next checkpoint, so there is actually a
gap where the slave could lose a committed transaction - and then it is
lost for sure.)
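The batch-and-confirm behavior described above can be sketched in a few lines of Python (a toy model, not Slony-I code; all names are made up for illustration): the slave applies a whole batch of changes for a SYNC event inside one local transaction and confirms the SYNC id only after that transaction commits, so a crash before commit simply leaves the SYNC unconfirmed and it is fetched again.

```python
# Toy model (not Slony-I code): a slave applies one batch of changes per
# SYNC event inside a single local transaction, and confirms the SYNC id
# only after that transaction commits.
class Slave:
    def __init__(self):
        self.data = {}             # the replicated table: key -> value
        self.confirmed_sync = 0    # last SYNC applied AND committed locally

    def apply_sync(self, sync_id, changes, crash_before_commit=False):
        staged = dict(self.data)   # stand-in for an open transaction
        for key, value in changes:
            staged[key] = value
        if crash_before_commit:
            return                 # rollback: neither data nor confirmation
        self.data = staged                 # "commit"
        self.confirmed_sync = sync_id      # confirm only after the commit

slave = Slave()
slave.apply_sync(1, [("a", 1), ("b", 2)])
slave.apply_sync(2, [("a", 9)], crash_before_commit=True)
assert slave.confirmed_sync == 1           # SYNC 2 stays unconfirmed ...
slave.apply_sync(2, [("a", 9)])            # ... and is re-applied on restart
assert slave.data == {"a": 9, "b": 2} and slave.confirmed_sync == 2
```

The crash case shows the point of the gap discussion: as long as confirmation happens strictly after the local commit, the worst outcome of a crash is re-fetching an already-available batch.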
Jan


Nov 12 '05 #3

In the last exciting episode, Ja******@Yahoo.com (Jan Wieck) wrote:
> I look forward to your comments.

It is not evident from the paper what approach is taken to dealing
with the duplicate key conflicts.

The example:

UPDATE table SET col1 = 'temp' where col1 = 'A';
UPDATE table SET col1 = 'A' where col1 = 'B';
UPDATE table SET col1 = 'B' where col1 = 'temp';

I can think of several approaches to this:

1. The present eRserv code reads what is in the table at the time of
the 'snapshot', and so tries to pass on:

update table set col1 = 'B' where otherkey = 123;
update table set col1 = 'A' where otherkey = 456;

which breaks because at some point col1 is not unique, irrespective
of the order in which we apply the changes.

2. If the contents as of the time of the COMMIT are stored in the log
table, then we would do all three updates in the destination DB, in
order, as shown above.

Either we have to:
a) Store the updated fields in the replication tables somewhere, or
b) Make the third UPDATE wait for the updates to be stored in a
file somewhere.

3. The replication code requires that any given key only be updated
once in a 'snapshot', so that the updates may be unambiguously
partitioned:

UPDATE table SET col1 = 'temp' where col1 = 'A'; -- and otherkey = 123
UPDATE table SET col1 = 'A' where col1 = 'B'; -- and otherkey = 456
-- Must partition here before hitting #123 again --
UPDATE table SET col1 = 'B' where col1 = 'temp'; -- and otherkey = 123

The third UPDATE may have to be held up until the "partition" is set
up, right?

4. I seem to recall a recent discussion about the possibility of
deferring the UNIQUE constraint until the END of a commit, with the
result that we could simplify to

update table set col1 = 'B' where otherkey = 123;
update table set col1 = 'A' where otherkey = 456;

and discover that the UNIQUE constraint was relaxed just long enough
for us to make the TWO changes that in the end combine to be
unique.

None of these look like they turn out totally happily, or am I missing
an approach?
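For what it's worth, the unique-key swap can be simulated in a few lines of Python (purely illustrative; the per-statement check stands in for a non-deferrable UNIQUE constraint): replaying only the combined final state fails exactly as approach 1 describes, while replaying the three original statements in order goes through.

```python
class UniqueColumn:
    """Toy table with a UNIQUE constraint on col1, checked per statement."""
    def __init__(self, rows):
        self.rows = dict(rows)    # otherkey -> col1

    def update(self, otherkey, value):
        if value in (v for k, v in self.rows.items() if k != otherkey):
            raise ValueError("duplicate key on col1=%r" % value)
        self.rows[otherkey] = value

# Combined "final state" replay (approach 1) breaks:
t = UniqueColumn({123: "A", 456: "B"})
try:
    t.update(123, "B")            # row 456 still holds 'B' -> duplicate key
    failed = False
except ValueError:
    failed = True
assert failed

# Ordered statement-by-statement replay succeeds:
t = UniqueColumn({123: "A", 456: "B"})
t.update(123, "temp")
t.update(456, "A")
t.update(123, "B")
assert t.rows == {123: "B", 456: "A"}
```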
--
wm(X,Y):-write(X),write('@'),write(Y). wm('cbbrowne','ntlug.org').
http://www.ntlug.org/~cbbrowne/languages.html
"Java and C++ make you think that the new ideas are like the old ones.
Java is the most distressing thing to hit computing since MS-DOS."
-- Alan Kay
Nov 12 '05 #4

Jan Wieck wrote:
> If you mean to configure the system to replicate rows to different
> destinations (slaves) based on arbitrary qualifications, no. I had
> thought about it, but it does not really fit into the "datacenter and
> failover" picture, so it is not required to meet the goals and adds
> unnecessary complexity.
>
> This sort of feature is much more important for a replication system
> designed for hundreds or thousands of sporadic, asynchronous
> multi-master systems, the typical "salesman on the street" kind of
> replication.

OK, thanks. This actually fits any kind of distributed application. We
have one that lives in our datacenters, but needs to replicate across
both fast LAN/MAN and slow WAN. It is multimaster in the sense that
individual data rows can be originated anywhere, but they are read-only
in nodes other than where they were originated. Anyway, I'm using a
hacked copy of dbmirror at the moment.
> First, it does not replicate single transactions. It replicates batches
> of them together. Since the transactions are already committed (and
> possibly some other depending on them too), there is no way - you loose Se.


OK, got it. Thanks.

Joe


Nov 12 '05 #5

Hans-Jürgen Schönig wrote:
> Jan,
>
> First of all, we really appreciate that this is going to be an Open
> Source project.
> There is something I wanted to add from a marketing point of view: I
> have given many public talks in the last 2 years or so. There is one
> question people keep asking me: "How about the pgreplication project?"
> In every training course and at every conference, people keep asking
> for synchronous replication. We have offered these people some async
> solutions which are already out there, but nobody seems to be
> interested in having them (my personal impression). People keep asking
> for a sync approach via email, but nobody seems to care about an async
> approach. This does not mean that async is bad, but we can see a strong
> demand for synchronous replication.
>
> Meanwhile, we seem to be in a situation where PostgreSQL is competing
> against Oracle rather than against MySQL. In our case there are more
> people asking for Oracle -> Pg migration than for MySQL -> Pg. MySQL
> does not seem to be the great enemy, because most people know that it
> is an inferior product anyway. What I want to point out is that some
> people want an alternative to Oracle's Real Application Cluster. They
> want load balancing and hot failover. Even data centers asking for
> replication did not want an async approach in the past.


Hans-Jürgen,

we are well aware of the high demand for multi-master replication
addressing load balancing and clustering. We have that need ourselves as
well, and I plan to work on a follow-up project as soon as Slony-I is
released. But as of now, we see a higher priority for a reliable
master-slave system that includes the cascading and backup features
described in my concept. There are a couple of similar products out
there, I know. But show me one of them where you can fail over without
the new master becoming a single point of failure. We have just recently
seen - or better, "were not able to see anything any more" - how
failures tend to ripple through systems, when half of the US East Coast
was dark. So where is the replication system where a slave becomes the
"master", and not just a standalone server? Show me one that has a clear
concept of failback, one that has hot-join as a primary design goal.
These are the features that I expect if something is labeled "Enterprise
Level".

As far as my ideas for multi-master go, it will be a synchronous
solution using group communication. My idea is "group commit" instead of
2-phase commit ... and an early-stage test hack replicated some updates
3 weeks ago. The big challenge will be to integrate the two systems so
that a node can start as an asynchronous Slony-I slave, catch up ... and
switch over to synchronous multi-master without stopping the cluster. I
have no clue yet how to do that, but I refuse to think smaller.
Jan


Nov 12 '05 #6

Jordan Henderson wrote:
> Jan,
>
> I am wondering if you are familiar with the work covered in 'Recovery
> in Parallel Database Systems' by Svein-Olaf Hvasshovd (Vieweg)? The
> book is an excellent, detailed description covering high-availability
> DB implementations.

No, but it sounds like something I always wanted to have.

> I think you're right on by not thinking smaller!!

Thanks

Jan

> Jordan Henderson
> On Wednesday 12 November 2003 10:45, Jan Wieck wrote:
> [full quote of the previous messages trimmed]


Nov 12 '05 #7

Christopher Browne wrote:
> In the last exciting episode, Ja******@Yahoo.com (Jan Wieck) wrote:
>> I look forward to your comments.
>
> It is not evident from the paper what approach is taken to dealing
> with the duplicate key conflicts.
>
> The example:
>
> UPDATE table SET col1 = 'temp' where col1 = 'A';
> UPDATE table SET col1 = 'A' where col1 = 'B';
> UPDATE table SET col1 = 'B' where col1 = 'temp';
>
> I can think of several approaches to this:


One fundamental flaw in eRServer is that it tries to "combine" multiple
updates into one update at snapshot time in the first place. The
application can do these three steps in one single transaction - how do
you split that?

You can develop an automatic recovery for that. At the time you get a
dupkey error, you roll back but remember the _rserv_ts and table_id that
caused the dupkey. In the next sync attempt, you fetch the row with that
_rserv_ts, delete all rows from the slave table with that primary key,
and fake INSERT log rows on the master for the same. Then you prepare
and apply and cross your fingers that nobody touched the same row again
between your last attempt and now ... which was how many hours ago? And
since you can only find one dupkey per round, you might have to do this
a few times with larger and larger lists of (_rserv_ts, table_id).
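The one-dupkey-per-round retry described above can be sketched as a small loop (Python; a toy stand-in for the real _rserv_ts/table_id bookkeeping, with made-up names): each failed sync surfaces one more conflicting row, which is added to a re-copy list before the next attempt, so clearing N conflicts takes N+1 rounds.

```python
# Toy model of the retry loop: apply_batch raises KeyError(bad_row) on the
# first duplicate-key conflict it meets, mirroring "one dupkey per round".
def sync_with_recovery(apply_batch, recopy):
    rounds = 0
    while True:
        rounds += 1
        try:
            apply_batch(recopy)      # attempt the whole sync batch
            return rounds
        except KeyError as err:
            recopy.add(err.args[0])  # remember the offending row, retry

conflicts = {"row1", "row2"}         # rows that conflict until re-copied

def apply_batch(recopy):
    for row in sorted(conflicts - recopy):
        raise KeyError(row)          # only the first conflict surfaces

recopy = set()
rounds = sync_with_recovery(apply_batch, recopy)
assert rounds == 3                   # two failed rounds, then success
assert recopy == {"row1", "row2"}
```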

The idea of not accumulating log forever, but just holding this status
table (the name "log" is misleading in eRServer; it holds flags telling
"the row with _rserv_ts=nnnn got INS|UPD|DEL'd"), has one big advantage:
however long your slave fails to sync, your master will not run out of
space.

But I don't think there is value in the attempt to let a slave catch up
on the last 4 days at once anyway. Drop it and use COPY. If your slave
does not come back up before you have modified half your database, it
will be faster this way anyway.
Jan

[Christopher's list of approaches, quoted in full above in post #4, trimmed]


Nov 12 '05 #8

On Wed, Nov 12, 2003 at 02:08:23PM +0100, Hans-Jürgen Schönig wrote:
> an inferior product anyway. What I want to point out is that some people
> want an alternative to Oracle's Real Application Cluster. They want load
> balancing and hot failover. Even data centers asking for replication did
> not want to have an async approach in the past.


I think Jan has already outlined his more-distant-future idea, but I'd
also like to know whether the people who are asking for a replacement
for RAC are willing to invest in it. You could buy some _awfully_ good
development time for even a year's worth of licensing for RAC. I get the
impression from the Postgres-R list that their biggest obstacle is
development resources.

<rant> People often like to say they need hot-fail-capable, five
nines, 24/7/365 systems. For most applications, I just do not
believe that, and the truth is that the cost of getting from three
nines to four (never mind five) is so great that people cheat: one
paragraph has the "five nines" clause, and the next paragraph talks
about scheduled downtime. In a real "five nines" system (the phone
company, say, or the air traffic control system), the time for
scheduled downtime is just the cumulative possible outage at any node
when it is being switched with its replacement. Five minutes a year
is a pretty high bar to jump, and most people long ago concluded that
you don't actually need it for most applications. </rant>
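As a quick sanity check on the numbers in that rant (plain arithmetic, not tied to any product), the downtime budget implied by "N nines" of availability works out as follows:

```python
# Downtime budget per year implied by "N nines" of availability.
minutes_per_year = 365.25 * 24 * 60        # ~525960 minutes

def downtime_minutes(nines):
    unavailability = 10 ** -nines          # e.g. five nines -> 1e-5
    return minutes_per_year * unavailability

for n in (3, 4, 5):
    print("%d nines: %7.1f minutes/year" % (n, downtime_minutes(n)))

# "Five minutes a year is a pretty high bar": five nines allows ~5.3 min,
# while three nines already allows almost nine hours.
assert 5.2 < downtime_minutes(5) < 5.3
```

The jump from roughly nine hours (three nines) to five minutes (five nines) is the cost cliff the rant is pointing at.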

A
--
----
Andrew Sullivan 204-4141 Yonge Street
Afilias Canada Toronto, Ontario Canada
<an****@libertyrms.info> M2P 2A8
+1 416 646 3304 x110

Nov 12 '05 #9

On Tue, Nov 11, 2003 at 03:38:53PM -0500, Christopher Browne wrote:
> In the last exciting episode, Ja******@Yahoo.com (Jan Wieck) wrote:
>> I look forward to your comments.
>
> It is not evident from the paper what approach is taken to dealing
> with the duplicate key conflicts.
>
> The example:
>
> UPDATE table SET col1 = 'temp' where col1 = 'A';
> UPDATE table SET col1 = 'A' where col1 = 'B';
> UPDATE table SET col1 = 'B' where col1 = 'temp';


It's not a problem, because as the proposal states, the actual SQL
statements are sent to the slave in order. That is, only consistent sets
are sent: you can't create a condition on the slave that could never
have obtained on the master. This means greater overhead for cases where
the same row is altered repeatedly, but it's safe.

A


Nov 12 '05 #10
