Bytes IT Community

HADR takeover by force during peer state - why does this end up with split brain ?

Hi all,

if I accidentally issue a TAKEOVER command with the BY FORCE clause while
primary and standby are in peer state, I end up with two primaries
(at least with FP10 on Windows). Is this working as designed, or is it a bug?
The manuals say that the standby will inform the primary about the takeover
but will not wait for acknowledgement, so the primary knows what is
going on. In my eyes the primary should either switch to standby or
shut down immediately in this situation. What do you think?

TIA
Joachim

Jul 26 '06 #1
3 Replies


Hi, Joachim.

If HADR is functioning correctly, a TAKEOVER .. BY FORCE while the two
sites are in Peer state will result in (a) a new viable primary, and
(b) a zombie old primary. The poison pill that the standby sends to
the primary in this case does not itself shut the old primary down.
What it does do is hobble the primary such that it can no longer
generate any new log, and the next time that an agent attempts to do
so, it will bring down the server.
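
For illustration only, here is a toy Python model of the behaviour described above. It is not DB2 code: the class, method, and exception names are all made up for this sketch, and the only thing it tries to capture is that the poison pill sets a flag and the shutdown happens later, on the next attempt to write log.

```python
class ServerDown(Exception):
    """Raised when the toy old primary is down or brings itself down."""


class OldPrimary:
    """Hypothetical stand-in for the old primary; not a DB2 API."""

    def __init__(self):
        self.role = "PRIMARY"
        self.poisoned = False  # in-memory flag set by the poison pill
        self.running = True
        self.log = []

    def receive_poison_pill(self):
        # Only the flag changes; the role and the running state are untouched.
        self.poisoned = True

    def read(self, key):
        # Non-logged operation: still served as long as the server is up.
        if not self.running:
            raise ServerDown("server is down")
        return f"value-of-{key}"

    def write(self, record):
        # Logged operation: the first attempt after the pill brings the server down.
        if not self.running:
            raise ServerDown("server is down")
        if self.poisoned:
            self.running = False
            raise ServerDown("log write attempted after forced takeover")
        self.log.append(record)
```

In this model, a poisoned `OldPrimary` keeps answering `read()` and keeps reporting `role == "PRIMARY"` until something calls `write()`, at which point it goes down without appending anything to its log.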

Note that the poison pill is a secondary mechanism intended to help
prevent split brain. A fundamental premise is that the primary is dead
when a TAKEOVER .. BY FORCE is issued. The poison pill is a backstop
in case of either (a) incorrect operation of the subsystem, or (b)
primary is wedged such that it is alive but not functioning well
(perhaps user can't even cause the primary db to shut down without
impacting a larger-grain entity such as the entire instance or the host
machine).

In this scenario, switching the old primary's role to standby or
immediate shutdown of the old primary sounds nice, but the devil is in
the details. If the user wanted to switch roles, then the non-forced
takeover should have been issued. Since a forced takeover was issued,
we assume the primary is dead or at least in a world of hurt, and the
user has requested a failover. As such we're not at liberty to wait
for a clean shutdown or role transition on the old primary. (Note as
well that we can't be sure such action would be successful were we to
attempt it.)
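
To make the distinction concrete, here is a small Python sketch that builds the two command lines. The helper name and the database alias SAMPLE are assumptions for illustration; the sketch only assembles the argument list and does not run anything.

```python
def build_takeover_command(db_alias, by_force=False):
    """Return the CLP argument list for a takeover, to be run on the standby.

    Hypothetical helper: it only builds the command, it does not execute it.
    """
    cmd = ["db2", "TAKEOVER", "HADR", "ON", "DATABASE", db_alias]
    if by_force:
        # Failover: the primary is presumed dead, so nothing waits on it.
        cmd += ["BY", "FORCE"]
    return cmd


# Role switch while both sides are healthy (the non-forced takeover):
role_switch = build_takeover_command("SAMPLE")

# Failover when the primary is presumed dead:
failover = build_takeover_command("SAMPLE", by_force=True)
```

Keeping `by_force` off by default mirrors the point above: the forced form is for the case where the primary is gone, not for a routine role switch.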

Anyway, you may well see the old primary still reporting its role as
primary after such an event, and it can even perform non-logged
operations, indefinitely. While this is not ideal, it should not be
mistaken for the system being split brained at that time (well, at
least the brain on one side is a read-only brain :-). Feel free to
shut down the old primary at your convenience.

We do have on our list of potential future enhancements an item to try
and shut down the primary more cleanly in this situation. This would
be at best an "attempt" and it would follow after the existing
mechanism. We currently only need to set a flag in memory in response
to the poison pill. To do more involves actions that may or may not
succeed if the old primary is wedged up somehow. (We don't just panic
the instance because it's not a good-citizen kind of thing for a piece
of HA software to increase the scope of a failure.) It is important to
note that this potential enhancement is *not* a high priority for us,
as there are a number of higher-value potential HADR enhancements.
It's hard to make a business case to change the way the old primary
goes away from ugly to maybe more graceful in a scenario which is rare
or involves incorrect operation of the system, and where the change
does not really enhance the availability nor the consistency of the
system.

Finally, note that while the poison pill mechanism is intended to
prevent active/continuing split brain, it does not guarantee that
inconsistency is avoided during the event. The standby (new primary)
waits only very briefly for any last log traffic from the old primary.
We assume that the primary is dead (or ought to be) and that for
availability it is important not to delay failover in case some last
bit of log is struggling to flow across. It is possible that some log does
not make it across before the standby takes over in this case. The
consequence is that it may not be possible to reintegrate the old
primary as the new standby later, due to the divergence, and instead a
reinitialization will be required (back up the new primary and restore it
to the old primary/new standby).
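
As a rough sketch of that divergence, assume each side's log can be reduced to a single increasing position. That is a simplification made for this example, not how DB2 actually tracks its log, and the function names are hypothetical.

```python
def can_reintegrate(old_primary_last_log_pos, takeover_log_pos):
    """True if the old primary wrote no log beyond what the standby held at takeover.

    Both arguments are plain integers standing in for log positions (an assumption).
    """
    return old_primary_last_log_pos <= takeover_log_pos


def recovery_plan(old_primary_last_log_pos, takeover_log_pos):
    """Return the steps for bringing the old primary back as the new standby."""
    if can_reintegrate(old_primary_last_log_pos, takeover_log_pos):
        return ["reintegrate old primary as new standby"]
    # The old primary holds log the new primary never received: the two have diverged.
    return [
        "back up new primary",
        "restore backup onto old primary",
        "bring restored database up as new standby",  # implied by "old primary/new standby"
    ]
```

For example, if the old primary had written up to position 105 but the standby had received only up to 100 when it took over, `can_reintegrate(105, 100)` is false and the plan is the backup-and-restore path.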

Regards,
- Steve P.
--
Steve Pearson, IBM DB2 for Linux, UNIX, and Windows, IBM Software Group
DB2 "Portland" Development Team, IBM Beaverton Lab, Beaverton, OR, USA

Jul 26 '06 #2

Steve,

thanks a lot for that very detailed explanation (again :-) ).

> there are a number of higher-value potential HADR enhancements

I was recently at a DB2 Viper workshop where I was told that there will
be no major HADR enhancements in DB2 V9 GA. Can you comment on this?

thanks again
Joachim

Jul 27 '06 #3

> I was recently at a DB2 Viper workshop where I was told that there will
> be no major HADR enhancements in DB2 V9 GA. Can you comment on this?

There are no HADR-specific features in DB2 9. However, HADR does
support new DB2 9 features such as XML data, compression,
range-partitioned tables, and IPv6.

Regards,
- Steve P.
--
Steve Pearson, IBM DB2 for Linux, UNIX, and Windows, IBM Software Group
DB2 "Portland" Development Team, IBM Beaverton Lab, Beaverton, OR, USA

Jul 27 '06 #4
