HADR takeover by force during peer state - why does this end up with split brain?

Joachim Klassen

Hi all,

if I accidentally use a TAKEOVER command with the BY FORCE clause while
primary and standby are in peer state, I end up with two primaries
(at least with FP10 on Windows). Is this working as designed, or is it a bug?
The manuals say that the standby will inform the primary about the takeover
but will not wait for an acknowledgement, so the primary knows what is
going on. In my view the primary should either switch to standby or
shut down immediately in this situation. What do you think?

TIA
Joachim

Jul 26 '06 #1


Steve Pearson (news only)

Hi, Joachim.

If HADR is functioning correctly, a TAKEOVER .. BY FORCE while the two
sites are in Peer state will result in (a) a new viable primary, and
(b) a zombie old primary. The poison pill that the standby sends to
the primary in this case does not itself shut the old primary down.
What it does do is hobble the primary such that it can no longer
generate any new log, and the next time that an agent attempts to do
so, it will bring down the server.
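
The behaviour described above can be pictured with a toy model. This is only a sketch in Python, not DB2 code: the class, method, and attribute names (ToyPrimary, receive_poison_pill, fenced) are invented for illustration, and "bringing down the server" is reduced to flipping a flag.

```python
class LogWriteAfterTakeover(Exception):
    """Raised when a fenced primary tries to generate new log."""


class ToyPrimary:
    """Toy stand-in for the old primary; all names here are hypothetical."""

    def __init__(self):
        self.role = "PRIMARY"   # role the database keeps reporting
        self.fenced = False     # in-memory flag set by the poison pill
        self.running = True
        self.log = []

    def receive_poison_pill(self):
        # Only a flag is set: the role is unchanged and nothing shuts down.
        self.fenced = True

    def read(self):
        # Non-logged work keeps succeeding while the database is up.
        if not self.running:
            raise RuntimeError("database is down")
        return list(self.log)

    def write(self, record):
        if not self.running:
            raise RuntimeError("database is down")
        if self.fenced:
            # The first attempt to generate log after the pill takes it down.
            self.running = False
            raise LogWriteAfterTakeover(record)
        self.log.append(record)


old = ToyPrimary()
old.write("txn-1")
old.receive_poison_pill()
print(old.role, old.running)   # role and running state right after the pill
print(old.read())              # reads still work at this point
try:
    old.write("txn-2")
except LogWriteAfterTakeover:
    print("brought down on first log write")
```

In this model the old primary looks healthy until something tries to write, which matches the "zombie" description: it is hobbled, not stopped.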

Note that the poison pill is a secondary mechanism intended to help
prevent split brain. A fundamental premise is that the primary is dead
when a TAKEOVER .. BY FORCE is issued. The poison pill is a backstop
in case of either (a) incorrect operation of the subsystem, or (b)
primary is wedged such that it is alive but not functioning well
(perhaps user can't even cause the primary db to shut down without
impacting a larger-grain entity such as the entire instance or the host
machine).

In this scenario, switching the old primary's role to standby or
immediate shutdown of the old primary sounds nice, but the devil is in
the details. If the user wanted to switch roles, then the non-forced
takeover should have been issued. Since a forced takeover was issued,
we assume the primary is dead or at least in a world of hurt, and the
user has requested a failover. As such we're not at liberty to wait
for a clean shutdown or role transition on the old primary. (Note as
well that we can't be sure such action would be successful were we to
attempt it.)
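
As a rough illustration of that rule of thumb, the helper below picks between the two forms of the command. It is a sketch only: the function name, the primary_is_healthy parameter, and the SAMPLE alias are assumptions made for the example, and the returned string is meant to be run on the standby.

```python
def takeover_command(db_alias, primary_is_healthy):
    """Return the CLP command to issue on the standby (hypothetical helper)."""
    base = f"db2 takeover hadr on database {db_alias}"
    if primary_is_healthy:
        # Planned role switch: the primary cooperates and becomes the standby.
        return base
    # Failover: the primary is assumed dead, so do not wait for it.
    return base + " by force"


print(takeover_command("SAMPLE", primary_is_healthy=True))
print(takeover_command("SAMPLE", primary_is_healthy=False))
```

The point of the sketch is that BY FORCE is a statement about the primary's condition, not a stronger version of the same request.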

Anyway, you may well see the old primary still reporting its role as
primary after such an event, and it can even perform non-logged
operations, indefinitely. While this is not ideal, it should not be
mistaken for the system being split brained at that time (well, at
least the brain on one side is a read-only brain :-). Feel free to
shut down the old primary at your convenience.

We do have on our list of potential future enhancements an item to try
and shut down the primary more cleanly in this situation. This would
be at best an "attempt" and it would follow after the existing
mechanism. We currently only need to set a flag in memory in response
to the poison pill. To do more involves actions that may or may not
succeed if the old primary is wedged up somehow. (We don't just panic
the instance because it's not a good-citizen kind of thing for a piece
of HA software to increase the scope of a failure.) It is important to
note that this potential enhancement is *not* a high priority for us,
as there are a number of higher-value potential HADR enhancements.
It's hard to make a business case to change the way the old primary
goes away from ugly to maybe more graceful in a scenario which is rare
or involves incorrect operation of the system, and where the change
does not really enhance either the availability or the consistency of the
system.

Finally, note that while the poison pill mechanism is intended to
prevent active/continuing split brain, it does not guarantee that
inconsistency is avoided during the event. The standby (new primary)
waits only very briefly for any last log traffic from the old primary.
We assume that the primary is dead (or ought to be) and that for
availability it is important not to delay failover in case some last
bit of log is struggling to flow across. It is possible that some log does
not make it across before the standby takes over in this case. The
consequence is that it may not be possible to reintegrate the old
primary as the new standby later, due to the divergence, and instead a
reinitialization will be required (back up the new primary and restore it
onto the old primary, which then becomes the new standby).
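
The reintegrate-or-reinitialize decision can be sketched as follows. This is a deliberately simplified model, not how DB2 itself decides: it treats log positions as plain integers, the function names and the SAMPLE alias are invented, and the command strings are abbreviated (real invocations need further options, such as where the backup image lives).

```python
def can_reintegrate(old_primary_last_lsn, new_primary_lsn_at_takeover):
    """True if the old primary holds no log the new primary never received."""
    return old_primary_last_lsn <= new_primary_lsn_at_takeover


def recovery_plan(old_primary_last_lsn, new_primary_lsn_at_takeover,
                  db_alias="SAMPLE"):
    """Return the abbreviated steps to turn the old primary into a standby."""
    start_standby = f"db2 start hadr on database {db_alias} as standby"
    if can_reintegrate(old_primary_last_lsn, new_primary_lsn_at_takeover):
        # No divergence: the old primary can rejoin as the standby.
        return [start_standby]
    # Divergence: reinitialize the old primary from the new primary.
    return [
        f"db2 backup database {db_alias}",    # run on the new primary
        f"db2 restore database {db_alias}",   # run on the old primary
        start_standby,                        # run on the old primary
    ]


for step in recovery_plan(old_primary_last_lsn=101,
                          new_primary_lsn_at_takeover=100):
    print(step)
```

With the values shown, the old primary wrote one record that never reached the standby, so the plan falls back to backup and restore.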

Regards,
- Steve P.
--
Steve Pearson, IBM DB2 for Linux, UNIX, and Windows, IBM Software Group
DB2 "Portland" Development Team, IBM Beaverton Lab, Beaverton, OR, USA

Jul 26 '06 #2

Joachim Klassen

Steve,

thanks a lot for that very detailed explanation (again :-) ).

> there are a number of higher-value potential HADR enhancements

I was recently at a DB2 Viper workshop where I was told that there will
be no major HADR enhancements in DB2 V9 GA. Can you comment on this?

thanks again
Joachim

Jul 27 '06 #3

Steve Pearson (news only)

> I was recently at a DB2 Viper workshop where I was told that there will
> be no major HADR enhancements in DB2 V9 GA. Can you comment on this?

There are no HADR-specific features in DB2 9. However, HADR does
support new DB2 9 features such as XML data, compression,
range-partitioned tables, and IPv6.

Regards,
- Steve P.

Jul 27 '06 #4
