Feature Request for 7.5

Chris Travers

Hi all;

I have been looking into how to ensure that synchronous replication, etc.
could best be implimented. To date, I see only two options: incorporate
the replication code into the database backend or have a separate "proxy"
which handles the replication.

The main problem with incorporating the system into the backend process is
that it limits the development to the 10-month timeframe between releases.
The main advantage is that the planner could be used to spot queries which
alter tuples and thus could "sort out" select queries. I am not sure if
this could be done using a proxy approach if rules or user-defined functions
were used.

Also the proxy approach could be more flexible in that it could allow for
different types of communications such as IP sharing which may not be
desireable in the database backend.

The easiest way of furthering the development of asynchronous replication
proxies would be to break off the server-side network protocol handler into
a library which would contain functions to bind to ports, listen, and pass
messages back to the calling program. The library could then also be
installed, but also be redistributable, so that developers could build these
solutions.

Also, if the protocol does not provide for "select" queries providing
notification of affected tuples, an extension to handle that would be
helpful and would allow for better load-ballancing.

Best Wishes,
Chris Travers
---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend

Nov 12 '05 #1

Subscribe Post Reply

1991

Peter Eisentraut

Chris Travers writes:

The easiest way of furthering the development of asynchronous replication
proxies would be to break off the server-side network protocol handler into
a library which would contain functions to bind to ports, listen, and pass
messages back to the calling program. The library could then also be
installed, but also be redistributable, so that developers could build these
solutions.

You might be interested in Clustered JDBC (http://c-jdbc.objectweb.org/),
which implements a proxy approach but does not require knowledge of the
database-specific protocol.

--
Peter Eisentraut pe*****@gmx.net
---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend

Nov 12 '05 #2

Jan Wieck

Chris Travers wrote:

Hi all;

I have been looking into how to ensure that synchronous replication, etc.
could best be implimented. To date, I see only two options: incorporate
the replication code into the database backend or have a separate "proxy"
which handles the replication.

There are many problems with a "proxy" solution. One is that you really
don't know if a statement does modify the database or not. A SELECT for
example can call a user defined function somewhere and that can do
whatever the programmer likes it to do. So you would have to "replicate"
all that too. Granted, you can exclude this type of database usage from
your supported list.

Next you don't have control over sequence allocation. Every application
that uses sequence allocated ID's is in danger, because they are not
blocking, you cannot force the order of assignments and they don't roll
back either.

And you get into deadlock problems if you don't guarantee that your
proxy uses the same order to access all databases. And you cannot
guarantee that if your proxy tries to do it parallel. So it has to do
the queries against all databases one by one, that doesn't scale well.

The last thing (I mention for now) is that I cannot imagine any way that
such proxy code allows for a new member to join without stopping the
whole application, creating an identical copy of one member
(dump+restore) and continue. So it is impossible to build 24*7 support
that way.

No, separate proxy code doesn't strike me as the superior solution.
Jan

--
#================================================= =====================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================= = Ja******@Yahoo.com #
---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

Nov 12 '05 #3

Scott Ribe

>> I have been looking into how to ensure that synchronous replication, etc.

could best be implimented. To date, I see only two options: incorporate
the replication code into the database backend or have a separate "proxy"
which handles the replication.

There are many problems with a "proxy" solution. One is that you really
don't know if a statement does modify the database or not. A SELECT for
example can call a user defined function somewhere and that can do
whatever the programmer likes it to do. So you would have to "replicate"
all that too. Granted, you can exclude this type of database usage from
your supported list.

Next you don't have control over sequence allocation. Every application
that uses sequence allocated ID's is in danger, because they are not
blocking, you cannot force the order of assignments and they don't roll
back either.

And you get into deadlock problems if you don't guarantee that your
proxy uses the same order to access all databases. And you cannot
guarantee that if your proxy tries to do it parallel. So it has to do
the queries against all databases one by one, that doesn't scale well.

The last thing (I mention for now) is that I cannot imagine any way that
such proxy code allows for a new member to join without stopping the
whole application, creating an identical copy of one member
(dump+restore) and continue. So it is impossible to build 24*7 support
that way.

No, separate proxy code doesn't strike me as the superior solution.

I incorporate code (triggers/plpgsql) into the backend that creates a
serialized audit trail of changes, then have an external proxy running which
queries the audit trail and applies the changes. I don't have enough
experience with the subject to debate it much, but I took this approach
because:

- It's much easier for the backend to keep straight the changes, for all the
reasons mentioned above.

- It seemed to me to be much easier to write the actual replication stuff as
a C++ app on top of libpq than to try to stuff that into the backend.

Note that I only need master/slave (take that LA County Commission)
replication, and I don't really need to worry about scalability because I'll
only have one or two slaves.
--
Scott Ribe
sc********@killerbytes.com
http://www.killerbytes.com/
(303) 665-7007 voice
---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if your
joining column's datatypes do not match

Nov 12 '05 #4

Chris Travers

Comments inline

From: "Jan Wieck" <Ja******@Yahoo.com>:

There are many problems with a "proxy" solution. One is that you really
don't know if a statement does modify the database or not. A SELECT for
example can call a user defined function somewhere and that can do
whatever the programmer likes it to do. So you would have to "replicate"
all that too. Granted, you can exclude this type of database usage from
your supported list.
That is why it would be nice to be able to check for altered tuples on a
select before deciding whether to replicate... In this case you could have
a query->check for altered tuples-> if altered replicate query routine.

Next you don't have control over sequence allocation. Every application
that uses sequence allocated ID's is in danger, because they are not
blocking, you cannot force the order of assignments and they don't roll
back either.
This is the more serious problem. I will have to think this one over. I
wonder about having cross-proxy sequence generators.

And you get into deadlock problems if you don't guarantee that your
proxy uses the same order to access all databases. And you cannot
guarantee that if your proxy tries to do it parallel. So it has to do
the queries against all databases one by one, that doesn't scale well.
This is true also, but if the sequence of the queries is similar, then I am
having trouble seeing how the deadlocks would happen on a server in a case
where you wouldn't otherwise have one. Since a deadlock on ONE server would
force a restore process (with some performance problems in the beginning of
that process), it would not be too bad.
The last thing (I mention for now) is that I cannot imagine any way that
such proxy code allows for a new member to join without stopping the
whole application, creating an identical copy of one member
(dump+restore) and continue. So it is impossible to build 24*7 support
that way.
Not too hard. Read my comments on restoring from failure for details. The
same proceedure could be used to add a new member. The only performance
drawback is that new transactions would have to be queued up (uncommitted)
while the old ones complete. If you have hanging transactions, this could
be a problem. What we have is the following process:
1) PRepare query queue for storing incoming queries.
2) Request a restore point.
3) At this point, all new queries get queued in the query queue. No new
transactions may be committed.
4) When all transactions which were open at the beginning of step 3 are
closed, give permission to start restore, and address of server to use.
5) Use PGDump to start the restore.
6) New transactions may now be committed.
7) when restore finishes, start committing transactions in the query log in
order of committal.
8) When no closed transactions remain, change status to online.
No, separate proxy code doesn't strike me as the superior solution.
There are advantages and disadvantages to either. The other option is to
use some sort of library to handle the additional clustering protocols.
Either way is limited and difficult. Still working on these problems.
Jan

--
#================================================= =====================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================= = Ja******@Yahoo.com #

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

Nov 12 '05 #5

Richard Huxton

On Wednesday 03 December 2003 11:13, Chris Travers wrote:

Comments inline

From: "Jan Wieck" <Ja******@Yahoo.com>:
There are many problems with a "proxy" solution. One is that you really
don't know if a statement does modify the database or not. A SELECT for
example can call a user defined function somewhere and that can do
whatever the programmer likes it to do. So you would have to "replicate"
all that too. Granted, you can exclude this type of database usage from
your supported list.

That is why it would be nice to be able to check for altered tuples on a
select before deciding whether to replicate... In this case you could have
a query->check for altered tuples-> if altered replicate query routine.
Next you don't have control over sequence allocation. Every application
that uses sequence allocated ID's is in danger, because they are not
blocking, you cannot force the order of assignments and they don't roll
back either.

This is the more serious problem. I will have to think this one over. I
wonder about having cross-proxy sequence generators.

Don't forget you need to lockstep system-clocks too, otherwise SELECT * FROM
log_stats WHERE log_ts > now() - '1 hour' becomes ill-defined.

Hmm - thinking about it, you'll need to serialise queries too. Otherwise you
could issue simultaneous queries to machines A,B and have A complete in order
(1,2) and B in order (2,1). I don't see a way around that one without a
guaranteed scheduling order at the kernel level.

--
Richard Huxton
Archonet Ltd

---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend

Nov 12 '05 #6

Chris Travers

Interesting feedback.

It strikes me that, for many sorts of databases, multimaster synchronous
replication is not the best solution for the reasons that Scott, Jan, et.
al. have raised. I am wondering how commercial RDBMS's get arround this
problem? There are several possibilities that I can think of-- have a write
master, and many read-only slaves (available at the moment, iirc).
Replication could then occur at the tuple level using linked databases,
triggers, etc. Rewrite rules could then allow one to use the slaves to
"funnel" the queries back up to the master. It seems to me that latency
would be a killer on this sort of solution, though everything would
effectively occur on all databases in the same order, but recovering from a
crash of the master could be complicated and result in additional
downtime...

The other solution (still not "guaranteed" to work in all cases) is that
every proxy could be hardwired to attempt to contact databases in a set
order. This would also avoid deadlocks. Note that if sufficient business
logic is built into the database, one would be guaranteed that a single
"consistent" view would be maintained at any given time (conflicts would
result in the minority of up to 50 percent of the servers needing to go
through the recovery process-- not killing uptime, but certainly killing
performance).

However, it seems to me that the only solution for many of these databases
is to have a "cluster in a box" solution where you have a system comprised
entirely of redundent, hot-swapable hardware so that nearly anything can be
swapped out if it breaks. In this case, we should be able to just run
PostgreSQL as is....
---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://archives.postgresql.org

Nov 12 '05 #7

Jan Wieck

Chris Travers wrote:

Comments inline

From: "Jan Wieck" <Ja******@Yahoo.com>:
There are many problems with a "proxy" solution. One is that you really
don't know if a statement does modify the database or not. A SELECT for
example can call a user defined function somewhere and that can do
whatever the programmer likes it to do. So you would have to "replicate"
all that too. Granted, you can exclude this type of database usage from
your supported list.
That is why it would be nice to be able to check for altered tuples on a
select before deciding whether to replicate... In this case you could have
a query->check for altered tuples-> if altered replicate query routine.

Better make that "did dirty any block". There are other things than tuples.

Next you don't have control over sequence allocation. Every application
that uses sequence allocated ID's is in danger, because they are not
blocking, you cannot force the order of assignments and they don't roll
back either.

This is the more serious problem. I will have to think this one over. I
wonder about having cross-proxy sequence generators.

And you get into deadlock problems if you don't guarantee that your
proxy uses the same order to access all databases. And you cannot
guarantee that if your proxy tries to do it parallel. So it has to do
the queries against all databases one by one, that doesn't scale well.

This is true also, but if the sequence of the queries is similar, then I am
having trouble seeing how the deadlocks would happen on a server in a case
where you wouldn't otherwise have one. Since a deadlock on ONE server would
force a restore process (with some performance problems in the beginning of
that process), it would not be too bad.

I think you don't understand completely.

Imagine databases DB1, and DB2 with proxy connections (client->proxy)
PC-A and PC-B having database connections PC-A1, PC-A2 and PC-B1, PC-B2.
Now A does "START TRANSACTION" and "UPDATE tab1 SET data=1 WHERE id=666"
while B does "START TRANSACTION" and "UPDATE tab1 SET data=2 WHERE id=666".

If PC-A does this on PC-A1 first while PC-B forwards the statements on
PC-B2, then both proxies will succeed and A has a lock for the row in
DB1 while B holds it in DB2. Now A tries the UPDATE in DB2 and blocks,
while B does the same in DB1 and blocks. None of the databases can
detect the deadlock, because for them it isn't clear that both are
waiting on each other but in different databases.

You have a couple of choices to solve this problem.

You can let A complete it's mission on DB1, buffer the remaining actions
in B and do vice versa with B on DB2 buffering for A. But you will later
have A stomping over B's update in DB2 and B doing the same in DB1 and
the end result is DB1: data=2 while DB2: data=1 ... out of sync.

You can avoid this by executing all queries on the same database first,
and repeat them on all other databases in the order they finished on
that leader database. This way the whole thing will not be able to be
any faster than one single database. There is no scaling here. All you
have is backup servers for failover.

So in order to scale you need to do your own deadlock detection in the
proxy server. I think real deadlock detection will lead to parsing a
substantial amount of SQL just to have a second level lock management
and is probably out of the scope of your proposal. So you could only
shoot for timeouts. Guess what, PostgreSQL did detect deadlocks that way
once, way back ... people didn't like it.

The last thing (I mention for now) is that I cannot imagine any way that
such proxy code allows for a new member to join without stopping the
whole application, creating an identical copy of one member
(dump+restore) and continue. So it is impossible to build 24*7 support
that way.
Not too hard. Read my comments on restoring from failure for details. The
same proceedure could be used to add a new member. The only performance
drawback is that new transactions would have to be queued up (uncommitted)
while the old ones complete. If you have hanging transactions, this could
be a problem. What we have is the following process:
1) PRepare query queue for storing incoming queries.
2) Request a restore point.
3) At this point, all new queries get queued in the query queue. No new
transactions may be committed.
4) When all transactions which were open at the beginning of step 3 are
closed, give permission to start restore, and address of server to use.
5) Use PGDump to start the restore.
6) New transactions may now be committed.
7) when restore finishes, start committing transactions in the query log in
order of committal.

Since this is one of my remaining problems for Slony-II I am curious how
exactly you determine the commit order. I know that it is not the order
in which your proxies issue the commit, nor must it necessarily be the
order in which they get the response for it either. So how do you do it?
8) When no closed transactions remain, change status to online.

No, separate proxy code doesn't strike me as the superior solution.

There are advantages and disadvantages to either. The other option is to
use some sort of library to handle the additional clustering protocols.
Either way is limited and difficult. Still working on these problems.

The real other option is full synchronous multi-master. I am planning
that, but it will take me a couple more months to get Slony-I done
before I can really start on that.
Jan

--
#================================================= =====================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================= = Ja******@Yahoo.com #
---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to ma*******@postgresql.org so that your
message can get through to the mailing list cleanly

Nov 12 '05 #8

Jan Wieck

The following is more or less a brain-dump ... not finally thought out
and not supposed to be considered a proposal at this time.

The synchronous multi-master solution I have in mind needs a few
currently non existent support features in the backend. One is
non-blocking locks and another one is a callback mechanism just before
marking a transaction in clog as committed.

It will use reliable group communication (GC) that can guarantee total
order. There is an AFTER trigger on all replicated tables. A daemon
started for every database will create a number of threads/subprocesses.
Each of these workers has his separate DB connection and is a member of
a different group in the GC. The number of these groups determines the
maximum number of concurrent UPDATE-transactions, the cluster can handle.

At the first call of the trigger inside of a transaction (this is the
first modifying statement), the trigger allocates one of the replication
groups (possibly waiting for one to become free). It now communicates
with one daemon thread on every database in the cluster. The triggers
now send the replication data into this group. It is not necessary to
wait for the other cluster members as long as the GC guarantees FIFO by
sender.

At the time the transaction commits, it sends a commit message into the
group. This message has another service type level which is total order.
It will wait now for all members in the replication group to reply with
the same. When every member in the group replied, all agreed to commit
and are just before stamping clog.

Since the service type is total order, the GC guarantees that either all
members get the messages in the same order, or if one cannot get a
message a corresponding LEAVE message will be generated. Also, all the
replication threads will use non-blocking locking. If any of them ever
finds a locked row, it will send an ABORT message into the group,
causing the whole group to roll back.

This way, either all members of the group reach the "just before
stamping clog" state together and know that everyone got there, or they
will get an abort or leave message from any of their co-workers and roll
back.

There is a gap between reporting "ready" and really stamping clog in
which a database might crash. This will cause all other cluster members
to go ahead and commit while the crashed DB does not commit. But this is
limited to crashes only and a restarting database must rejoin/resync
with the cluster anyway and doubt its own data. So this is not really a
problem.
With this synchronous model, read only transactions can be handled on
every node independently of replication at all - this is the scaling
part. The total amount of UPDATE transactions is limited by the slowest
cluster member and does not scale, but that is true for all synchronous
solutions.
Jan
Chris Travers wrote:

Interesting feedback.

It strikes me that, for many sorts of databases, multimaster synchronous
replication is not the best solution for the reasons that Scott, Jan, et.
al. have raised. I am wondering how commercial RDBMS's get arround this
problem? There are several possibilities that I can think of-- have a write
master, and many read-only slaves (available at the moment, iirc).
Replication could then occur at the tuple level using linked databases,
triggers, etc. Rewrite rules could then allow one to use the slaves to
"funnel" the queries back up to the master. It seems to me that latency
would be a killer on this sort of solution, though everything would
effectively occur on all databases in the same order, but recovering from a
crash of the master could be complicated and result in additional
downtime...

The other solution (still not "guaranteed" to work in all cases) is that
every proxy could be hardwired to attempt to contact databases in a set
order. This would also avoid deadlocks. Note that if sufficient business
logic is built into the database, one would be guaranteed that a single
"consistent" view would be maintained at any given time (conflicts would
result in the minority of up to 50 percent of the servers needing to go
through the recovery process-- not killing uptime, but certainly killing
performance).

However, it seems to me that the only solution for many of these databases
is to have a "cluster in a box" solution where you have a system comprised
entirely of redundent, hot-swapable hardware so that nearly anything can be
swapped out if it breaks. In this case, we should be able to just run
PostgreSQL as is....
---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://archives.postgresql.org

--
#================================================= =====================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================= = Ja******@Yahoo.com #
---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to ma*******@postgresql.org)

Nov 12 '05 #9

Alvaro Herrera

On Wed, Dec 03, 2003 at 08:41:40PM +0700, Chris Travers wrote:

It strikes me that, for many sorts of databases, multimaster synchronous
replication is not the best solution for the reasons that Scott, Jan, et.
al. have raised. I am wondering how commercial RDBMS's get arround this
problem?

Say, have you looked at the approach taken by postgres-r? It's supposed
to solve these problems, and unless I've missed something they are in
need of some manpower to port it to recent releases.

I don't know the URL. It's on gborg somewhere though.

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"Some men are heterosexual, and some are bisexual, and some
men don't think about sex at all... they become lawyers" (Woody Allen)

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

Nov 12 '05 #10

Peter Childs

On Wed, 3 Dec 2003, Alvaro Herrera wrote:

On Wed, Dec 03, 2003 at 08:41:40PM +0700, Chris Travers wrote:
It strikes me that, for many sorts of databases, multimaster synchronous
replication is not the best solution for the reasons that Scott, Jan, et.
al. have raised. I am wondering how commercial RDBMS's get arround this
problem?

Say, have you looked at the approach taken by postgres-r? It's supposed
to solve these problems, and unless I've missed something they are in
need of some manpower to port it to recent releases.

I don't know the URL. It's on gborg somewhere though.

http://gborg.postgresql.org/project/...rojdisplay.php

The way I under stand the fix is to have a two stage commit...
This is from previous threads mind you.

So its

Client -> Server <Commit Transaction>
Server -> Other Servers <Pre Commit>
Other Servers -> Server <Ready to Commit>
<Server Waits for all other servers to respond>
Server -> Other Servers <Commit>
Server -> Client <Commit Success or Not>

Each query that makes a change needs to tell all other servers and
get a responce.

Oh yes and you need point in time backup to bring new servers up
to date and crashed ones too.

I really need point in time backup at the very least and would
love to see all these features. These features are growing to beyond
urgent now.... Less talk more action.

Peter Childs

---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to ma*******@postgresql.org

Nov 12 '05 #11

Alvaro Herrera

On Wed, Dec 03, 2003 at 04:04:41PM +0000, Peter Childs wrote:

On Wed, 3 Dec 2003, Alvaro Herrera wrote:
On Wed, Dec 03, 2003 at 08:41:40PM +0700, Chris Travers wrote:
It strikes me that, for many sorts of databases, multimaster synchronous
replication is not the best solution for the reasons that Scott, Jan, et.
al. have raised. I am wondering how commercial RDBMS's get arround this
problem?

Say, have you looked at the approach taken by postgres-r? It's supposed
to solve these problems, and unless I've missed something they are in
need of some manpower to port it to recent releases.

I don't know the URL. It's on gborg somewhere though.

http://gborg.postgresql.org/project/...rojdisplay.php

The way I under stand the fix is to have a two stage commit...
This is from previous threads mind you.

No, it is not two phase commit at all. It uses "group communication"
with "total order". What does this mean in detail I don't know.

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
Voy a acabar con todos los humanos / con los humanos yo acabaré
voy a acabar con todos / con todos los humanos acabaré (Bender)

---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if your
joining column's datatypes do not match

Nov 12 '05 #12

Jan Wieck

Peter Childs wrote:

On Wed, 3 Dec 2003, Alvaro Herrera wrote:
On Wed, Dec 03, 2003 at 08:41:40PM +0700, Chris Travers wrote:
> It strikes me that, for many sorts of databases, multimaster synchronous
> replication is not the best solution for the reasons that Scott, Jan, et.
> al. have raised. I am wondering how commercial RDBMS's get arround this
> problem?
Say, have you looked at the approach taken by postgres-r? It's supposed
to solve these problems, and unless I've missed something they are in
need of some manpower to port it to recent releases.

I don't know the URL. It's on gborg somewhere though.

http://gborg.postgresql.org/project/...rojdisplay.php

The way I under stand the fix is to have a two stage commit...
This is from previous threads mind you.

You must have misunderstood something there. The whole idea of
Postgres-R is how to "avoid" the need for 2-phase commit in a
synchronous replication system.
Jan

So its

Client -> Server <Commit Transaction>
Server -> Other Servers <Pre Commit>
Other Servers -> Server <Ready to Commit>
<Server Waits for all other servers to respond>
Server -> Other Servers <Commit>
Server -> Client <Commit Success or Not>

Each query that makes a change needs to tell all other servers and
get a responce.

Oh yes and you need point in time backup to bring new servers up
to date and crashed ones too.

I really need point in time backup at the very least and would
love to see all these features. These features are growing to beyond
urgent now.... Less talk more action.

Peter Childs

---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to ma*******@postgresql.org

--
#================================================= =====================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================= = Ja******@Yahoo.com #
---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://archives.postgresql.org

Nov 12 '05 #13

Bopolissimus Platypus

On Thursday 04 December 2003 00:34, Jan Wieck wrote:

You must have misunderstood something there. The whole idea of
Postgres-R is how to "avoid" the need for 2-phase commit in a
synchronous replication system.

I've been looking at postgres-r and other postgres replication
temporary solutions. does anyone have a good *current* summary
of postgres-r status? from the website it looks like the working model
is based on 6.2. is postgres-r 6.2 supposed to be good enough
for production? that's what i'd assume "working model" means,
but confirmation from people who've actually stress tested it would
be great :).

the 7.2 version is the development version. i don't see much detail
(documentation) though on how stable the 7.2 version is. looking
at the docs in the download itself, everything seems to refer to generic
7.2. i don't see anything that looks like postgres-r specific documentation.
so i can't tell what things are implemented and what things still
need to be implemented (but of course i might just not be looking
hard enough for the data. to be honest, i'm just grepping for
"replication", opening files that look like they might be interesting
and scanning them, it's not like i've read everything in the doc
directory).

in doc/TODO.detail/replication, there's a bunch of emails, but the
last one seems to be from 2001.

there's src/backend/replication/README, but that seems to discuss
only 6.2 work. i could do a diff, i suppose between the working model
and 7.2, but i can't right now. i'm behind a dialup modem and it would
take too long to download 6.2 :). maybe tomorrow at the office :).

but again, if someone has detailed and up-to-date information (summary
or in detail) on the 7.2 postgres-r status, that would be incredibly
cool :).

hmmm, there's a mailing list. archives at:

http://gborg.postgresql.org/pipermai...ation-general/

i didn't see any detailed summary of what's missing in 7.2 though.
oh well, i'll just set up some test servers and just play with
it for now, i guess.

thanks for any information.

tiger

--
Gerald Timothy Quimpo gquimpo*hotmail.com tiger*sni*ph
http://bopolissimus.sni.ph
Public Key: "gpg --keyserver pgp.mit.edu --recv-keys 672F4C78"

for whom self-parody appears to be a form of self-realization
-- http://slate.msn.com/id/2090554/?0si=-
---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if your
joining column's datatypes do not match

Nov 12 '05 #14

Keith C. Perry

Jan,

To continue the brain-dump. I was curious how the GC protocol was going to be
implemented (if you had any ideas thus far).

Several years ago, I started working on a network security and intrusion
detection system for a client where the audit/logging system needed to be
redundant- they wanted 3 servers each on a different LANs in fact.

The work in that design was centered around making sure that the aggregated data
set was exactly the same on each of the 3 servers. Not only were all the
event timestamps the same but the events were ordered the same way in the logs.

The solution I was working on was a multicast IPv4 (possibly IPv6) network where
the "packet" of information had an id of some sort and the event data inside the
datagram had a timestamp (of course).

The obviously problem is that multicasting is not reliable so in order sure all
event were on all servers, there would be a periodic polling that would give a
server with say 2 missing event the chance to "catch-up". This catch-up"
function make sure all events were ordered an had the same last event. This
would be much more of an issue with the server a couple of hops away than with a
server on the same LAN. The client never went ahead with the system so I
apologize for not having some reference examples.

This is totally different from what true replication is amongst a group of
database servers but it seems to me that if the servers are in multicast group,
at least transactions would be theoretically sent to all servers at the same
time. I would think that a homogenous system of servers is already ordering
events the same way so transactions would occur properly unless it was missed.
A "catch-up" function here was be difficult to implement because if the servers
are committing asyncronously then you can't catch-up and one of your datasets
has lost integrity. Syncronously (meaning, "we'll all commit now because we all
agree on the current list of transactions") seems a bit messy and not as scalable.

I didn't mean to get into all that but how the GC is going to work in this
project is something that I'm curious about.
--
Keith C. Perry, MS E.E.
Director of Networks & Applications
VCSN, Inc.
http://vcsn.com

Quoting Jan Wieck <Ja******@Yahoo.com>:

The following is more or less a brain-dump ... not finally thought out
and not supposed to be considered a proposal at this time.

The synchronous multi-master solution I have in mind needs a few
currently non existent support features in the backend. One is
non-blocking locks and another one is a callback mechanism just before
marking a transaction in clog as committed.

It will use reliable group communication (GC) that can guarantee total
order. There is an AFTER trigger on all replicated tables. A daemon
started for every database will create a number of threads/subprocesses.
Each of these workers has his separate DB connection and is a member of
a different group in the GC. The number of these groups determines the
maximum number of concurrent UPDATE-transactions, the cluster can handle.

At the first call of the trigger inside of a transaction (this is the
first modifying statement), the trigger allocates one of the replication
groups (possibly waiting for one to become free). It now communicates
with one daemon thread on every database in the cluster. The triggers
now send the replication data into this group. It is not necessary to
wait for the other cluster members as long as the GC guarantees FIFO by
sender.

At the time the transaction commits, it sends a commit message into the
group. This message has another service type level which is total order.
It will wait now for all members in the replication group to reply with
the same. When every member in the group replied, all agreed to commit
and are just before stamping clog.

Since the service type is total order, the GC guarantees that either all
members get the messages in the same order, or if one cannot get a
message a corresponding LEAVE message will be generated. Also, all the
replication threads will use non-blocking locking. If any of them ever
finds a locked row, it will send an ABORT message into the group,
causing the whole group to roll back.

This way, either all members of the group reach the "just before
stamping clog" state together and know that everyone got there, or they
will get an abort or leave message from any of their co-workers and roll
back.

There is a gap between reporting "ready" and really stamping clog in
which a database might crash. This will cause all other cluster members
to go ahead and commit while the crashed DB does not commit. But this is
limited to crashes only and a restarting database must rejoin/resync
with the cluster anyway and doubt its own data. So this is not really a
problem.
With this synchronous model, read only transactions can be handled on
every node independently of replication at all - this is the scaling
part. The total amount of UPDATE transactions is limited by the slowest
cluster member and does not scale, but that is true for all synchronous
solutions.
Jan
Chris Travers wrote:
Interesting feedback.

It strikes me that, for many sorts of databases, multimaster synchronous
replication is not the best solution for the reasons that Scott, Jan, et.
al. have raised. I am wondering how commercial RDBMS's get arround this
problem? There are several possibilities that I can think of-- have a

write
master, and many read-only slaves (available at the moment, iirc).
Replication could then occur at the tuple level using linked databases,
triggers, etc. Rewrite rules could then allow one to use the slaves to
"funnel" the queries back up to the master. It seems to me that latency
would be a killer on this sort of solution, though everything would
effectively occur on all databases in the same order, but recovering from

a
crash of the master could be complicated and result in additional
downtime...

The other solution (still not "guaranteed" to work in all cases) is that
every proxy could be hardwired to attempt to contact databases in a set
order. This would also avoid deadlocks. Note that if sufficient business
logic is built into the database, one would be guaranteed that a single
"consistent" view would be maintained at any given time (conflicts would
result in the minority of up to 50 percent of the servers needing to go
through the recovery process-- not killing uptime, but certainly killing
performance).

However, it seems to me that the only solution for many of these databases
is to have a "cluster in a box" solution where you have a system comprised
entirely of redundent, hot-swapable hardware so that nearly anything can

be
swapped out if it breaks. In this case, we should be able to just run
PostgreSQL as is....
---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://archives.postgresql.org

--
#================================================= =====================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================= = Ja******@Yahoo.com #
---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to ma*******@postgresql.org)

____________________________________
This email account is being host by:
VCSN, Inc : http://vcsn.com

---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings

Nov 12 '05 #15

Jan Wieck

Keith C. Perry wrote:

Jan,

To continue the brain-dump. I was curious how the GC protocol was going to be
implemented (if you had any ideas thus far).

The stuff I've been kicking around so far is Spread 3.17.1 with a simple
self made Tcl/Tk binding library used from inside PG via PL/TclU. You
can download the Spread toolkit here:

http://www.spread.org

What I have is way too premature to be shown, so don't ask. But it has
replicated one UPDATE once, so it is a bit more than a pure braindump.
Jan

--
#================================================= =====================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================= = Ja******@Yahoo.com #
---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to ma*******@postgresql.org so that your
message can get through to the mailing list cleanly

Nov 12 '05 #16

Chris Travers

The problems with 2-phase-commit have already been mentioned in response my
feature request.

Most notably, 2-phase commit does NOT ensure that the sequences are called
in the same order on all databases. This can cause huge problems! For
example invoices might point to different customers in different databases.
There are other problems that exist as well (deadlocks and the like) but I
am not as worried about them. The sequence issue IS a showstopper.

Best Wishes,
Chris Travers

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

Nov 12 '05 #17

Bruce Momjian

Postgres-R is not ready for production, and development seems to have
stopped., and I don't know what multi-master solutions we can really
offer right now. We do have several master/slave replication solutions.

---------------------------------------------------------------------------

Bopolissimus Platypus wrote:

On Thursday 04 December 2003 00:34, Jan Wieck wrote:
You must have misunderstood something there. The whole idea of
Postgres-R is how to "avoid" the need for 2-phase commit in a
synchronous replication system.

I've been looking at postgres-r and other postgres replication
temporary solutions. does anyone have a good *current* summary
of postgres-r status? from the website it looks like the working model
is based on 6.2. is postgres-r 6.2 supposed to be good enough
for production? that's what i'd assume "working model" means,
but confirmation from people who've actually stress tested it would
be great :).

the 7.2 version is the development version. i don't see much detail
(documentation) though on how stable the 7.2 version is. looking
at the docs in the download itself, everything seems to refer to generic
7.2. i don't see anything that looks like postgres-r specific documentation.
so i can't tell what things are implemented and what things still
need to be implemented (but of course i might just not be looking
hard enough for the data. to be honest, i'm just grepping for
"replication", opening files that look like they might be interesting
and scanning them, it's not like i've read everything in the doc
directory).

in doc/TODO.detail/replication, there's a bunch of emails, but the
last one seems to be from 2001.

there's src/backend/replication/README, but that seems to discuss
only 6.2 work. i could do a diff, i suppose between the working model
and 7.2, but i can't right now. i'm behind a dialup modem and it would
take too long to download 6.2 :). maybe tomorrow at the office :).

but again, if someone has detailed and up-to-date information (summary
or in detail) on the 7.2 postgres-r status, that would be incredibly
cool :).

hmmm, there's a mailing list. archives at:

http://gborg.postgresql.org/pipermai...ation-general/

i didn't see any detailed summary of what's missing in 7.2 though.
oh well, i'll just set up some test servers and just play with
it for now, i guess.

thanks for any information.

tiger

--
Gerald Timothy Quimpo gquimpo*hotmail.com tiger*sni*ph
http://bopolissimus.sni.ph
Public Key: "gpg --keyserver pgp.mit.edu --recv-keys 672F4C78"

for whom self-parody appears to be a form of self-realization
-- http://slate.msn.com/id/2090554/?0si=-
---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if your
joining column's datatypes do not match

--
Bruce Momjian | http://candle.pha.pa.us
pg***@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

Nov 12 '05 #18

Bopolissimus Platypus

On Saturday 06 December 2003 20:50, Bruce Momjian wrote:

Postgres-R is not ready for production, and development seems to have
stopped., and I don't know what multi-master solutions we can really
offer right now. We do have several master/slave replication solutions.

thanks. i'll look at the master/slave solutions. i don't really need
multi-master anyway, it's just that postgres-r got some press and i
looked at it first :). which master/slave solutions do you recommend?

tiger

--
Gerald Timothy Quimpo gquimpo*hotmail.com tiger*sni*ph
http://bopolissimus.sni.ph
Public Key: "gpg --keyserver pgp.mit.edu --recv-keys 672F4C78"

Doveryai no proveryai.

---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faqs/FAQ.html

Nov 12 '05 #19

Similar topics