Replication Ideas

Hi--

I had been thinking about the issues of multimaster replication and how
to do highly available, load-balanced clustering with PostgreSQL. Here
is my outline; I am looking for comments on the limitations of how this
would work.

Several PostgreSQL servers would share a virtual IP address and would
coordinate among themselves which one acts as "master" for the purposes
of a single transaction (though doing this per connection might be
easier). SELECT statements would be handled exclusively by the
transaction master, while anything that writes to the database would be
sent to all the "masters." At the end of each transaction the systems
would poll each other on whether they were all successful:

1: Any system which succeeds in COMMITting the transaction must ignore
any system which fails the transaction until a recovery can be made.

2: Any system which fails in COMMITting the transaction must cease to
be a master, provided that it receives a signal from any other member of
the cluster indicating that that member succeeded in committing the
transaction.

3: If all nodes fail to commit, then they all remain masters.
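
To make the polling rules concrete, here is a rough sketch (in Python)
of the decision each node would make after the poll. This is purely
illustrative; none of these names exist in any real implementation:

# Illustrative sketch of rules 1-3 above; all names are hypothetical.

def resolve_commit_poll(my_result, peer_results):
    # my_result: did this node COMMIT successfully?
    # peer_results: dict mapping peer name -> its COMMIT result
    if my_result:
        # Rule 1: a node that committed stays master and ignores any
        # failed peer until that peer has been recovered.
        return "master"
    if any(peer_results.values()):
        # Rule 2: we failed but some peer succeeded, so we must cease
        # to be a master and enter recovery.
        return "needs_recovery"
    # Rule 3: all nodes failed to commit, so everyone remains a master.
    return "master"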

Recovery would be done in several steps:

1: The database would be copied to the failed system using pg_dump.
2: The copy would then be brought current from the transaction log.
3: This replay would be repeated to ensure that the database is up to
date.
4: When two successive restores complete with no new additions to the
database, the "All Recovered" signal is sent to the cluster and the node
is ready to start processing again. (I need a better way of doing this.)
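
A sketch of how steps 1-4 might fit together, with step 4's "two quiet
restores" test made explicit. All the helper names here
(restore_from_dump, replay_transaction_log, broadcast, peer) are
hypothetical placeholders, not a real API:

def recover_node(peer):
    restore_from_dump(peer)                  # step 1: base copy via pg_dump
    last_pos = replay_transaction_log(peer)  # step 2: catch up from the log
    quiet_rounds = 0
    while quiet_rounds < 2:                  # step 4: two replays, no new writes
        pos = replay_transaction_log(peer)   # step 3: repeat until current
        quiet_rounds = quiet_rounds + 1 if pos == last_pos else 0
        last_pos = pos
    broadcast("All Recovered")               # node may start processing again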

Note: recovery is the problem, I know. My model is only a starting
point for the purposes of discussion, an attempt to bring something to
the conversation.

Any thoughts or suggestions?

Best Wishes,
Chris Travers

Nov 11 '05 #1
On Sat, 2003-08-23 at 23:27, Chris Travers wrote:
[snip: Chris's full proposal, quoted above]


This is vaguely similar to Two Phase Commit, which is a sine qua
non of distributed transactions, which is the s.q.n. of multi-master
replication.
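
For reference, two-phase commit in its minimal form looks roughly like
this; a sketch assuming a hypothetical participant object with
prepare/commit/abort methods, not any real PostgreSQL API:

# Minimal two-phase commit sketch; the participant interface is hypothetical.

def two_phase_commit(participants):
    # Phase 1 (voting): every participant prepares and votes yes/no.
    votes = [p.prepare() for p in participants]
    if all(votes):
        # Phase 2 (completion): unanimous yes, so commit everywhere.
        for p in participants:
            p.commit()
        return True
    # Any "no" vote aborts the transaction everywhere.
    for p in participants:
        p.abort()
    return False

The blocking problem discussed below lives in phase 1: if a participant
goes silent before voting, the coordinator cannot safely decide either
way.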

--
Ron Johnson, Jr.
Jefferson, LA USA

Nov 11 '05 #2
On Mon, 2003-08-25 at 12:06, Chris Travers wrote:
Ron Johnson wrote:
This is vaguely similar to Two Phase Commit, which is a sine qua
non of distributed transactions, which is the s.q.n. of multi-master
replication.


I may be wrong, but if I recall correctly, one of the problems with
standard two-phase commit is that if one server goes down, the other
masters cannot commit their transactions. A clustered database server
built on it would then have a downtime equivalent to the combined
downtime of all of its nodes, which is a real problem. Of course, my
understanding of two-phase commit may be incorrect, in which case I
would appreciate it if someone could point out where I am wrong.


Note that I didn't mean to imply that 2PC is sufficient to implement
M-M. The DBMS designer(s) must decide what to do (like queue up
changes) if 2PC fails.

--
Ron Johnson, Jr.
Jefferson, LA USA

Nov 11 '05 #3
On Mon, Aug 25, 2003 at 10:06:22AM -0700, Chris Travers wrote:
Ron Johnson wrote:
This is vaguely similar to Two Phase Commit, which is a sine qua
non of distributed transactions, which is the s.q.n. of multi-master
replication.


I may be wrong, but if I recall correctly, one of the problems with a
standard 2-phase commit is that if one server goes down, the other
masters cannot commit their transactions.


Before the discussion goes any further, have you read the work related
to Postgres-r? It's a substantially different animal from 2PC AFAIK.

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)


Nov 11 '05 #4
Alvaro Herrera wrote:
Before the discussion goes any further, have you read the work related
to Postgres-r? It's a substantially different animal from 2PC AFAIK.

Yes I have. Postgres-R is not a high-availability solution capable of
transparent failover, although it is a very useful project in its own
right.

Best Wishes,
Chris Travers.

Nov 11 '05 #5
Tom Lane wrote:
Chris Travers <ch***@travelamericas.com> writes:

Yes I have. Postgres-r is not a high-availability solution which is
capable of transparent failover,


What makes you say that? My understanding is it's supposed to survive
loss of individual servers.

regards, tom lane

My mistake. I must have gotten them confused with another
(asynchronous) replication project.

Best Wishes,
Chris Travers

Nov 11 '05 #6
Chris Travers <ch***@travelamericas.com> writes:
Yes I have. Postgres-r is not a high-availability solution which is
capable of transparent failover,


What makes you say that? My understanding is it's supposed to survive
loss of individual servers.

regards, tom lane


Nov 11 '05 #7
WARNING: This is getting long ...

Postgres-R is a very interesting and inspiring idea. And I've been
kicking that concept around for a while now. What I don't like about it
is that it requires fundamental changes in the lock mechanism and that
it is based on the assumption of very low lock conflict.

<explain-PG-R>
In Postgres-R a committing transaction sends its write set (WS, a list
of all updates done in this transaction) to the group communication
system (GC). The GC guarantees total order, meaning that all nodes will
receive all WSs in the same order, no matter how they have been sent.

If a node receives back its own WS before any error occurred, it goes
ahead and finalizes the commit. If it receives a foreign WS, it has to
apply the whole WS and commit it before it can process anything else. If
a local transaction, in progress or waiting for its WS to come back,
holds a lock that is required to process such a remote WS, the local
transaction needs to be aborted to unlock its resources ... it lost the
total order race.
</explain-PG-R>
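
As a reading aid, here is a toy model of that delivery rule; the node
and write-set objects are hypothetical stand-ins, not Postgres-R code:

# Write sets (WSs) are applied strictly in the total order delivered by
# the GC; local transactions holding conflicting locks lose the race.

def on_ws_delivered(node, ws):
    if ws.origin == node.name:
        node.finalize_commit(ws.xid)   # our own WS came back first: we won
        return
    for xid in node.conflicting_local_transactions(ws):
        node.abort(xid)                # they lost the total order race
    node.apply(ws)                     # apply the whole remote WS ...
    node.commit(ws.xid)                # ... and commit before anything else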

Postgres-R requires that all remote WSs are applied and committed before
a local transaction can commit; otherwise it couldn't correctly detect a
lock conflict. So there will not be any read-ahead. And since the total
order really counts here, it cannot apply any two remote WSs in
parallel: a race condition could let a later WS in the total order run
faster and lock up an earlier one. So we have to squeeze all remote WSs
through one single replication worker process, and all the locally
parallel executed transactions that wait for their WSs to come back have
to wait until that poor little worker is done with the whole pile. Bye
bye concurrency. And I don't know how the GC will deal with the backlog
either. It could well choke on it.

I do not see how this will scale well in a multi-SMP-system cluster. At
the very least, the serialization of WSs will become a horror if there
is significant lock contention, as in a standard TPC-C on the district
row containing the order number counter. I don't know for sure, but I
suspect that with this kind of bottleneck, Postgres-R will have to roll
back more than 50% of its transactions when there are more than 4 nodes
under heavy load (as in a benchmark run). That will suck ...
But ... initially I said that it is an inspiring concept ... soooo ...

I am currently hacking around with some C + PL/TclU + Spread constructs
that might form a crude kind of prototype creature.

My changes to the Postgres-R concept are that there will be as many
replicating slave processes as there are masters in total across the
cluster ... yes, it will try to utilize all the CPUs in the cluster!
For failover reliability, a committing transaction will hold before
finalizing the commit and send its "I'm ready" to the GC. Every
replicator that reaches the same state sends "I'm ready" too. Spread
guarantees in SAFE_MESS mode that messages are delivered to all nodes in
a group, or that at least LEAVE/DISCONNECT messages are delivered first.
So if a node receives "I'm ready" from more than 50% of the nodes, there
is only a very small window in which multiple nodes would have to fail
in the same split second for the majority of nodes NOT to commit. A node
that reported "I'm ready" but lost more than 50% of the cluster before
committing has to roll back and rejoin, or wait for operator
intervention.
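
Counting-wise, the majority rule might reduce to something like this
(illustrative only; the Spread messaging itself is elided and all names
are made up):

# ready     = nodes that reported "I'm ready" for this transaction
# reachable = nodes still connected according to the GC membership
# cluster   = all configured nodes

def quorum_decision(ready, reachable, cluster):
    if len(ready) * 2 > len(cluster):
        return "commit"      # more than 50% ready: finalize the commit
    if len(reachable) * 2 <= len(cluster):
        return "rollback"    # lost more than half the cluster: roll back,
                             # then rejoin or wait for operator intervention
    return "wait"            # otherwise keep collecting "I'm ready" votes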

Now the idea is to split the communication up into GC distribution
groups per transaction. So working master backends and associated
replication backends will join/leave a unique group for every
transaction in the cluster. This way, the per-process communication is
reduced to the required minimum.
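
In code shape, the per-transaction grouping might look like this; gc is
a hypothetical group-communication handle, not the actual Spread API:

# Each transaction gets its own GC group, joined only by the originating
# backend and its replicators, so unrelated backends never see the traffic.

def replicate_transaction(gc, node_name, xid, ws):
    group = "txn-%s" % xid    # one GC group per cluster-wide transaction
    gc.join(group)            # only this backend and its replicators join
    try:
        gc.send(group, ("WS", node_name, ws))
        gc.send(group, ("I'm ready", node_name))
        return gc.wait_for_majority(group)   # majority rule from above
    finally:
        gc.leave(group)       # tear the group down after the commit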
As said, I am hacking on some code ...
Jan

Chris Travers wrote:
[snip: the Postgres-R failover exchange with Tom Lane, quoted above]

--
Jan Wieck

Nov 11 '05 #8
Jan Wieck wrote:
[snip: Jan's full Postgres-R analysis and prototype plan, quoted above]


As my British friends would say, "Bully for you", and I applaud you
playing, struggling, and learning from this for our sakes. Jeez, all I
think about is me, huh?

Nov 11 '05 #9


On Mon, 25 Aug 2003, Tom Lane wrote:
Chris Travers <ch***@travelamericas.com> writes:
Yes I have. Postgres-r is not a high-availability solution which is
capable of transparent failover,


What makes you say that? My understanding is it's supposed to survive
loss of individual servers.


How does it play 'catch up' when a server comes back online?

Note that I did go through the 'docs' on how it works, and am/was quite
impressed by what they were doing ... but say I have a large network,
and one group is connecting to ServerA and another group to ServerB:
what happens when ServerA and ServerB lose network connectivity for any
period of time? How do they re-sync when the network comes back up again?


Nov 11 '05 #10
"Marc G. Fournier" <sc*****@hub.org> writes:
On Mon, 25 Aug 2003, Tom Lane wrote:
What makes you say that? My understanding is it's supposed to survive
loss of individual servers.
How does it play 'catch up' when a server comes back online?


The recovered server has to run through the part of the GCS data stream
that it missed the first time. This is not conceptually different from
recovering using archived WAL logs (or archived trigger-driven
replication data streams). As with using WAL for recovery, you have to
be able to archive the message stream until you don't need it any more.
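
Schematically, the catch-up is just a replay loop over the archived
stream; a sketch with hypothetical archive and node interfaces:

# Replay the archived GCS message stream from the last position this
# node applied, exactly as one would replay archived WAL.

def catch_up(node, archive):
    pos = node.last_applied_position()
    for ws in archive.read_from(pos):   # archived write sets, in total order
        node.apply(ws)
        pos = ws.position
    node.set_last_applied_position(pos)
    node.rejoin_cluster()               # resume live GCS delivery from here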

regards, tom lane


Nov 11 '05 #11
