HADR takeover by force during peer state - why does this end up with split brain?

Hi all,

if I accidentally use a TAKEOVER command with the BY FORCE clause while
primary and standby are in peer state, I end up with two primaries
(at least with FP10 on Windows). Is this working as designed, or is it a bug?
The manuals say that the standby will inform the primary about the takeover
but will not wait for acknowledgement, so the primary knows what is
going on. In my eyes the primary should either switch to standby or
shut down immediately in this situation. What do you think?

TIA
Joachim

Jul 26 '06 #1
Hi, Joachim.

If HADR is functioning correctly, a TAKEOVER .. BY FORCE while the two
sites are in Peer state will result in (a) a new viable primary, and
(b) a zombie old primary. The poison pill that the standby sends to
the primary in this case does not itself shut the old primary down.
What it does do is hobble the primary such that it can no longer
generate any new log, and the next time that an agent attempts to do
so, it will bring down the server.
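
A toy model may make the mechanism easier to picture. The sketch below is not DB2 code; the class `OldPrimary` and its methods are hypothetical names invented for illustration. It only mirrors the behaviour described above: the poison pill sets an in-memory flag, non-logged work keeps running, and the first attempt to generate log afterwards brings the database down.

```python
class OldPrimary:
    """Toy model of the old primary after a forced takeover (hypothetical, not DB2 internals)."""

    def __init__(self):
        self.role = "PRIMARY"      # role the database reports about itself
        self.running = True
        self.poisoned = False      # in-memory flag set by the poison pill
        self.log_records = []

    def receive_poison_pill(self):
        # The pill only sets a flag; role and running state are untouched.
        self.poisoned = True

    def read_only_query(self):
        # Non-logged work is still served while the database is up.
        if not self.running:
            raise RuntimeError("database is down")
        return "rows"

    def logged_update(self, record):
        # Any attempt to generate new log after the pill brings the database down.
        if not self.running:
            raise RuntimeError("database is down")
        if self.poisoned:
            self.running = False
            raise RuntimeError("log write refused after forced takeover; shutting down")
        self.log_records.append(record)


old = OldPrimary()
old.logged_update("tx1")          # normal operation before the takeover
old.receive_poison_pill()         # standby issued TAKEOVER ... BY FORCE
print(old.role, old.running)      # still reports PRIMARY and is still up
print(old.read_only_query())      # non-logged work still succeeds
try:
    old.logged_update("tx2")      # first log write after the pill
except RuntimeError as exc:
    print("stopped:", exc)
print(old.running)
```

In this model the old primary keeps calling itself the primary until something tries to write log, which is why it can look like a second primary without actually accepting new transactions.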

Note that the poison pill is a secondary mechanism intended to help
prevent split brain. A fundamental premise is that the primary is dead
when a TAKEOVER .. BY FORCE is issued. The poison pill is a backstop
in case of either (a) incorrect operation of the subsystem, or (b)
primary is wedged such that it is alive but not functioning well
(perhaps user can't even cause the primary db to shut down without
impacting a larger-grain entity such as the entire instance or the host
machine).

In this scenario, switching the old primary's role to standby or
immediate shutdown of the old primary sounds nice, but the devil is in
the details. If the user wanted to switch roles, then the non-forced
takeover should have been issued. Since a forced takeover was issued,
we assume the primary is dead or at least in a world of hurt, and the
user has requested a failover. As such we're not at liberty to wait
for a clean shutdown or role transition on the old primary. (Note as
well that we can't be sure such action would be successful were we to
attempt it.)
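
The difference between the two commands can be sketched as a small guard for operator scripts. The helper `takeover_command` and its `confirm_primary_down` parameter are hypothetical, `SAMPLE` is a placeholder database name, and the caller is assumed to have read the standby's HADR state by some other means. Only the `TAKEOVER HADR ON DATABASE ... [BY FORCE]` command text is DB2 syntax. The idea is to refuse a forced takeover in Peer state unless the operator explicitly confirms that the primary is down.

```python
def takeover_command(db_name, hadr_state, by_force=False, confirm_primary_down=False):
    """Build the CLP command to run on the standby (hypothetical helper)."""
    base = f"TAKEOVER HADR ON DATABASE {db_name}"
    if not by_force:
        # Non-forced takeover: a coordinated role switch, primary becomes standby.
        return base
    if hadr_state.lower() == "peer" and not confirm_primary_down:
        # In Peer state the primary is probably alive; a forced takeover would
        # leave it behind as a hobbled "zombie" primary.
        raise ValueError("pair is in Peer state; use a non-forced takeover "
                         "or confirm that the primary is really down")
    return base + " BY FORCE"


# Planned role switch while both sites are healthy.
print(takeover_command("SAMPLE", "Peer"))
# Failover after the connection to the primary has been lost.
print(takeover_command("SAMPLE", "Disconnected", by_force=True))
```

A guard like this does not change what DB2 does; it only makes an accidental BY FORCE in Peer state harder to issue from a script.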

Anyway, you may well see the old primary still reporting its role as
primary after such an event, and it can even perform non-logged
operations, indefinitely. While this is not ideal, it should not be
mistaken for the system being split brained at that time (well, at
least the brain on one side is a read-only brain :-). Feel free to
shut down the old primary at your convenience.

We do have on our list of potential future enhancements an item to try
and shut down the primary more cleanly in this situation. This would
be at best an "attempt" and it would follow after the existing
mechanism. We currently only need to set a flag in memory in response
to the poison pill. To do more involves actions that may or may not
succeed if the old primary is wedged up somehow. (We don't just panic
the instance because it's not a good citizen kind of thing for a piece
of HA software to increase the scope of a failure.) It is important to
note that this potential enhancement is *not* a high priority for us,
as there are a number of higher-value potential HADR enhancements.
It's hard to make a business case to change the way the old primary
goes away from ugly to maybe more graceful in a scenario which is rare
or involves incorrect operation of the system, and where the change
does not really enhance the availability nor the consistency of the
system.

Finally, note that while the poison pill mechanism is intended to
prevent active/continuing split brain, it does not guarantee that
inconsistency is avoided during the event. The standby (new primary)
waits only very briefly for any last log traffic from the old primary.
We assume that the primary is dead (or ought to be) and that for
availability it is important not to delay failover in case some last
bit of log is struggling to flow across. It is possible that some does
not make it across before the standby takes over in this case. The
consequence is that it may not be possible to later reintegrate the old
primary as the new standby due to the divergence, and instead a
reinitialization will be required (back up new primary and restore it
to old primary/new standby).
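
The recovery choice described above can be written down as a small decision sketch. Everything here is a simplification: `recovery_steps` is a hypothetical helper, the log positions are plain integers rather than real LSNs, and the database name and backup directory are placeholders. The command strings are standard DB2 CLP commands, but a real reinitialization also involves moving the backup image between hosts and checking the HADR configuration on the restored copy.

```python
def recovery_steps(db_name, old_primary_log_end, takeover_log_end, backup_dir):
    """Return (host, command) pairs that turn the old primary into the new standby."""
    start_standby = f"START HADR ON DATABASE {db_name} AS STANDBY"
    if old_primary_log_end <= takeover_log_end:
        # No log was written on the old primary that the new primary lacks,
        # so the old primary can be reintegrated as it is.
        return [("old primary", start_standby)]
    # The old primary holds log the new primary never received: the copies
    # have diverged, so rebuild the old primary from the new one.
    return [
        ("new primary", f"BACKUP DATABASE {db_name} TO {backup_dir}"),
        ("old primary", f"RESTORE DATABASE {db_name} FROM {backup_dir}"),
        ("old primary", start_standby),
    ]


for host, command in recovery_steps("SAMPLE", 120, 100, "/backups"):
    print(f"{host}: {command}")
```

In practice the simplest test is to try starting the old primary as a standby and fall back to the backup and restore path only if reintegration is rejected.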

Regards,
- Steve P.
--
Steve Pearson, IBM DB2 for Linux, UNIX, and Windows, IBM Software Group
DB2 "Portland" Development Team, IBM Beaverton Lab, Beaverton, OR, USA

Jul 26 '06 #2

Steve,

thanks a lot for that very detailed explanation (again :-) ).

> there are a number of higher-value potential HADR enhancements

I was recently at a DB2 Viper workshop where I was told that there will
be no major HADR enhancements in DB2 V9 GA. Can you comment on this?

thanks again
Joachim

Jul 27 '06 #3
> I was recently at a DB2 Viper workshop where I was told that there will
> be no major HADR enhancements in DB2 V9 GA. Can you comment on this?

There are no HADR-specific features in DB2 9. However, HADR does
support new DB2 9 features such as XML data, compression,
range-partitioned tables, and IPv6.

Regards,
- Steve P.
--
Steve Pearson, IBM DB2 for Linux, UNIX, and Windows, IBM Software Group
DB2 "Portland" Development Team, IBM Beaverton Lab, Beaverton, OR, USA

Jul 27 '06 #4
