HADR takeover by force during peer state - why does this end up with split brain?

Hi all,

if I accidentally use a TAKEOVER command with the BY FORCE clause while
primary and standby are in peer state, I end up with two primaries
(at least with FP10 on Windows). Is this working as designed, or is it a bug?
The manuals say that the standby will inform the primary about the takeover
but will not wait for acknowledgement, so the primary knows what is
going on. In my eyes the primary should either switch to standby or
shut down immediately in this situation - what do you think?

TIA
Joachim

Jul 26 '06 #1
Hi, Joachim.

If HADR is functioning correctly, a TAKEOVER .. BY FORCE while the two
sites are in Peer state will result in (a) a new viable primary, and
(b) a zombie old primary. The poison pill that the standby sends to
the primary in this case does not itself shut the old primary down.
What it does do is hobble the primary such that it can no longer
generate any new log, and the next time that an agent attempts to do
so, it will bring down the server.
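
To make the behaviour described above concrete, here is a toy model in Python. It is only a sketch of the idea (a flag is set in memory, non-logged work continues, and the first attempt to generate log brings the server down); the class and method names are hypothetical and have nothing to do with DB2 internals.

```python
class LogWriteAfterTakeover(RuntimeError):
    """Raised when a poisoned primary tries to generate new log."""


class ToyPrimary:
    """Hypothetical stand-in for an old HADR primary; not DB2 code."""

    def __init__(self):
        self.role = "PRIMARY"
        self.running = True
        self.poisoned = False  # in-memory flag set by the poison pill
        self.log = []

    def receive_poison_pill(self):
        # The pill only sets a flag. It does not stop the database
        # and it does not change the reported role.
        self.poisoned = True

    def read(self):
        # Non-logged work keeps succeeding while the server is up.
        if not self.running:
            raise RuntimeError("server is down")
        return len(self.log)

    def write(self, record):
        if not self.running:
            raise RuntimeError("server is down")
        if self.poisoned:
            # The first attempt to generate log after the pill
            # brings the server down.
            self.running = False
            raise LogWriteAfterTakeover("log write refused, shutting down")
        self.log.append(record)
```

In this model a poisoned ToyPrimary still reports role PRIMARY and still answers read() calls, which matches the "zombie" description; it only goes down on the next write().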

Note that the poison pill is a secondary mechanism intended to help
prevent split brain. A fundamental premise is that the primary is dead
when a TAKEOVER .. BY FORCE is issued. The poison pill is a backstop
in case of either (a) incorrect operation of the subsystem, or (b)
primary is wedged such that it is alive but not functioning well
(perhaps user can't even cause the primary db to shut down without
impacting a larger-grain entity such as the entire instance or the host
machine).

In this scenario, switching the old primary's role to standby or
immediate shutdown of the old primary sounds nice, but the devil is in
the details. If the user wanted to switch roles, then the non-forced
takeover should have been issued. Since a forced takeover was issued,
we assume the primary is dead or at least in a world of hurt, and the
user has requested a failover. As such we're not at liberty to wait
for a clean shutdown or role transition on the old primary. (Note as
well that we can't be sure such action would be successful were we to
attempt it.)
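
As a rough illustration of the choice between the two forms of the command, the following Python sketch builds the command text. The function name and the primary_reachable flag are assumptions made for the example; how an operator decides that the primary is really gone is outside its scope.

```python
def takeover_command(db_alias: str, primary_reachable: bool) -> str:
    """Build the TAKEOVER HADR command text to run on the standby.

    Hypothetical helper: primary_reachable stands in for whatever
    check the operator uses to decide the primary is dead.
    """
    cmd = f"TAKEOVER HADR ON DATABASE {db_alias}"
    if not primary_reachable:
        # Failover: only when the primary is believed dead or wedged.
        cmd += " BY FORCE"
    # Otherwise the non-forced form performs a coordinated role switch.
    return cmd
```

For a planned role switch while both sites are in peer state, this returns the non-forced form, which is what the paragraph above recommends.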

Anyway, you may well see the old primary still reporting its role as
primary after such an event, and it can even perform non-logged
operations, indefinitely. While this is not ideal, it should not be
mistaken for the system being split brained at that time (well, at
least the brain on one side is a read-only brain :-). Feel free to
shut down the old primary at your convenience.

We do have on our list of potential future enhancements an item to try
and shut down the primary more cleanly in this situation. This would
be at best an "attempt" and it would follow after the existing
mechanism. We currently only need to set a flag in memory in response
to the poison pill. To do more involves actions that may or may not
succeed if the old primary is wedged up somehow. (We don't just panic
the instance because it's not a good-citizen kind of thing for a piece
of HA software to increase the scope of a failure.) It is important to
note that this potential enhancement is *not* a high priority for us,
as there are a number of higher-value potential HADR enhancements.
It's hard to make a business case to change the way the old primary
goes away from ugly to maybe more graceful in a scenario which is rare
or involves incorrect operation of the system, and where the change
does not really enhance the availability nor the consistency of the
system.

Finally, note that while the poison pill mechanism is intended to
prevent active/continuing split brain, it does not guarantee that
inconsistency is avoided during the event. The standby (new primary)
waits only very briefly for any last log traffic from the old primary.
We assume that the primary is dead (or ought to be) and that for
availability it is important not to delay failover in case some last
bit of log is struggling to flow across. It is possible that some does
not make it across before the standby takes over in this case. The
consequence is that it may not be possible to later reintegrate the old
primary as the new standby due to the divergence, and instead a
reinitialization will be required (back up the new primary and restore it
to the old primary/new standby).
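
The divergence condition can be pictured with a simplified check like the one below. This is a sketch under the assumption that each side's log position can be reduced to a single number (for example an LSN printed in hex); the function names are hypothetical, and the real reintegration decision is made by DB2 itself, not by a comparison like this.

```python
def parse_lsn(hex_text: str) -> int:
    """Convert an LSN printed as hex text into an integer."""
    return int(hex_text, 16)


def can_reintegrate(old_primary_last_lsn: int, takeover_lsn: int) -> bool:
    """Simplified model of the divergence check.

    old_primary_last_lsn: last log position written on the old primary.
    takeover_lsn: last log position the standby had received when it
    took over.

    If the old primary wrote log that never reached the standby, the
    two log streams have diverged and a reinitialization is needed.
    """
    return old_primary_last_lsn <= takeover_lsn
```

For example, can_reintegrate(parse_lsn("36B0000"), parse_lsn("342D073")) is False in this model, because the old primary holds log past the point where the standby took over.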

Regards,
- Steve P.
--
Steve Pearson, IBM DB2 for Linux, UNIX, and Windows, IBM Software Group
DB2 "Portland" Development Team, IBM Beaverton Lab, Beaverton, OR, USA

Jul 26 '06 #2

Steve,

thanks a lot for that very detailed explanation (again :-) ).

> there are a number of higher-value potential HADR enhancements

I was recently at a DB2 Viper workshop where I was told that there will
be no major HADR enhancements in DB2 V9 GA. Can you comment on this?

thanks again
Joachim

Jul 27 '06 #3
> I was recently at a DB2 Viper workshop where I was told that there will
> be no major HADR enhancements in DB2 V9 GA. Can you comment on this?

There are no HADR-specific features in DB2 9. However, HADR does
support new DB2 9 features such as XML data, compression,
range-partitioned tables, and IPv6.

Regards,
- Steve P.
--
Steve Pearson, IBM DB2 for Linux, UNIX, and Windows, IBM Software Group
DB2 "Portland" Development Team, IBM Beaverton Lab, Beaverton, OR, USA

Jul 27 '06 #4
