HADR split brain question

If server 01 running HADR in the primary role crashes, and the DBA does a
HADR takeover by force on the 02 server to switch roles, then the 02 server
is now the primary.

What happens when the Server 01 is brought back up? It still thinks it is
the primary because that was its role when it crashed and it does not know
about the takeover by force command that was issued. Does the 01 server
check the 02 server to see what role they are in before allowing any
connections?

Jan 23 '06 #1
I have tried exactly that. It will keep its role as primary and will
also serve the clients by default, unless the first and only thing you
do is the following:

db2 "start hadr on database <original_primary_database> as STANDBY"

This will change the state of your original_primary_database from
PRIMARY to STANDBY and will also initiate the log replay from the current
primary server.

If by accident you type db2 stop hadr, then your HADR configuration is
toast, and your only option is to re-initialize your HADR setup (backup
followed by restore).

regards,
dotyet

Jan 23 '06 #2
When the old primary server comes back up, are you saying that it is
immediately accessible by applications even though the other server is now
primary also?

Is there some command that must be issued (such as activate database) that
prevents it from being accessible by applications before you can issue:
db2 "start hadr on database <original_primary_database> as STANDBY"

If not, then how do you prevent the split brain problem?
Jan 24 '06 #3
The failed primary will retain the role of primary if it is simply
restarted (i.e., via application connection *attempt*, activate db, or
restart db command).

However, importantly, it should *not* allow an application connection
to succeed unless the standby is there and successfully re-pairs with
it. Rather, since the original standby took over and is no longer a
standby, the activation or connection to the original primary database
should be delayed for HADR_TIMEOUT (or 30 seconds if that's longer),
then fail with error SQL1768N reason 7 ("The primary database failed to
establish a connection to its standby database within the HADR timeout
interval").
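
As a rough illustration of the delay rule just described, here is a minimal Python sketch. It assumes HADR_TIMEOUT is given in seconds; the function name and the minimum_wait parameter are made up for this example and are not part of DB2.

```python
def activation_wait_seconds(hadr_timeout: int, minimum_wait: int = 30) -> int:
    """Approximate wait before a non-forced activation gives up with SQL1768N rc 7.

    Follows the rule stated above: HADR_TIMEOUT, or 30 seconds if that is longer.
    Both names are illustrative assumptions, not DB2 identifiers.
    """
    return max(hadr_timeout, minimum_wait)


if __name__ == "__main__":
    # A timeout above the floor is used as-is; a shorter one is raised to the floor.
    print(activation_wait_seconds(120))
    print(activation_wait_seconds(10))
```

So with a short HADR_TIMEOUT the restarted primary would still hold off connections for about half a minute before failing.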

If you observe otherwise, please report it to IBM as a defect.

Now, if you *force* the restarting original primary to start in primary
role (it should require the START HADR .. AS PRIMARY BY FORCE command
to do so), then it will oblige. Starting "by force" tells HADR you
want to forget about the requirement for the standby to be there, since
you know there's good reason for it to be gone (maybe both primary and
standby failed concurrently, and the original primary is the first to
be restarted). If you happen to do this while the original standby has
meanwhile taken over as primary, guess what... a self-inflicted split
brain results.

Regarding "STOP HADR", yes, that command will make HADR go away.
Whether or not you can follow it by a successful attempt to restart
HADR depends on whether the database is in a valid initialization
state (because HADR would be starting over from scratch). For example,
the standby should be in rollforward mode and with a database and log
stream that matches well with that of the primary. It is possible that
if you do nothing but stop hadr followed by start hadr, it might just
work. However, issuing stop hadr is not advisable if you really wanted
the current instantiation of the db to play HADR again later w/o
starting over from scratch. If you want to temporarily stop log
shipping, a better approach is to issue the "deactivate db" command at
the standby.

Regards,
- Steve P.
------------------------------------
Steve Pearson
IBM DB2 UDB for LUW Development
Portland, OR, USA

Jan 24 '06 #4
Steve, I appreciate your comments, but let's get back to the question I
raised. For the purposes of this discussion, please assume that I am fairly
knowledgeable about HADR, having worked with it for several months now, so
let's dispense with the fundamentals.

If the original primary server crashes (assume a hardware failure of some
kind), we will do an HADR takeover by force (force is necessary because the
original primary is unreachable) and the original standby is now the primary.
Obviously the databases are no longer in peer state if the original primary
server crashes because of hardware failure. Once the takeover has occurred,
DB2 automatic client reroute (or whatever mechanism one chooses) will point
the applications to the new primary server (which was previously the standby
database). Processing of the application continues normally.

Now, at some subsequent point, we will fix the hardware problem with the
original primary and attempt to bring it online as the standby. After it is
brought online as the standby, it will catch up with the logs, and only then
we can do a HADR takeover (without force) to make it the primary again. I
don't think the timeout is relevant since I am assuming that original
primary will be down for several hours before it can be repaired.
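
The fail-back sequence described in this paragraph can be sketched as an ordered list of commands. This is only an outline under assumptions: the database name SAMPLE and the function name are hypothetical, and the db2pd step is one possible way to check the HADR state before the takeover, not something stated in this thread.

```python
from typing import List, Tuple


def failback_plan(db_name: str) -> List[Tuple[str, str]]:
    """Return (where to run, command) pairs for moving the primary role back.

    All steps run on the repaired original primary. db_name is a placeholder.
    """
    return [
        # 1. Rejoin the pair in the standby role so it replays logs from the new primary.
        ("original primary", f'db2 "start hadr on database {db_name} as standby"'),
        # 2. Check the HADR state and wait for peer (assumed monitoring step).
        ("original primary", f"db2pd -db {db_name} -hadr"),
        # 3. Normal, non-forced takeover once the pair is in peer state.
        ("original primary", f'db2 "takeover hadr on database {db_name}"'),
    ]


if __name__ == "__main__":
    for host, command in failback_plan("SAMPLE"):
        print(f"[{host}] {command}")
```

The point of the ordering is that the forced takeover is only used while the old primary is down; the switch back is a plain takeover.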

However, the problem is how to bring the original primary server back up
after hardware repair as the standby. In its last state before the server
crashed, it thought it was the primary, and since a HADR takeover has now
occurred and the original standby is now the primary, I will have 2
primary databases (split brain) when the original primary is repaired and
brought back up. Any new connections might go to the original primary before
I have a chance to make it the standby by issuing the command:
db2 "start hadr on database <original_primary_database> as STANDBY"

So how do I prevent a split brain (even for a short period) when my primary
server crashes and I bring it back online, before I can designate it as
the standby (I already have a primary running)? This seems like a
fundamental issue that must be solved for HADR to provide a continuous
availability solution.

One of the things that I think DB2 should do is that any database where
HADR is configured should attempt to establish peer state before any
connections are allowed, and if the other database is already in primary
role, and it was activated first, the last database to be activated should
either automatically start as standby, or should not allow connections until
some affirmative action is taken by the DBA (allowing the DBA to designate
it as standby before any connections are allowed).

In the absence of DB2 providing the above capability, perhaps there are some
procedural things that can be done to not allow connections when the server
is brought back up, allowing the DBA to make it standby. But I don't see how
this can be done via SQL statements (such as revoke connection authority)
since the revoke can only be issued on a database that is primary and
available for new connections.
Jan 24 '06 #5

One way of doing it requires a startup script for DB2. First we make
sure that DB2 cannot auto start on either server of the HADR pair. Then,
as part of each startup script, both databases are placed in a standby
role... Then one of the servers is changed to be the primary. In the case
of a hardware crash, we don't have to worry about a split brain once the
hardware problem is fixed. However, now a DBA must be involved on system
reboot, which in some shops is on a schedule. We are still working on a
script to automatically start the databases in the correct mode. One way
we are looking into is reading the db configuration files from both
servers and then determining which database to start as the primary....
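
The role decision such a script would have to make could look roughly like the Python sketch below. This is a guess at the logic, not the script described above: the function name, the role strings, and the rule "when in doubt, do nothing and leave it to the DBA" are all assumptions made for illustration.

```python
from typing import Optional


def choose_start_command(db_name: str, local_role: str,
                         peer_role: Optional[str]) -> Optional[str]:
    """Pick a START HADR command for the local database, or None to wait for a DBA.

    local_role is the role recorded locally before the restart; peer_role is the
    role reported by the other server, or None if it cannot be reached.
    """
    if peer_role == "PRIMARY":
        # The other side already owns the primary role, so rejoin as standby.
        return f'db2 "start hadr on database {db_name} as standby"'
    if peer_role == "STANDBY" and local_role == "PRIMARY":
        # Roles are unchanged since the restart; a normal (non-forced) start is enough.
        return f'db2 "start hadr on database {db_name} as primary"'
    # Peer unreachable or roles ambiguous: never guess and never use BY FORCE.
    return None


if __name__ == "__main__":
    print(choose_start_command("SAMPLE", "PRIMARY", "PRIMARY"))
    print(choose_start_command("SAMPLE", "PRIMARY", None))
```

Returning None in the unclear cases keeps the script from ever issuing a forced start on its own.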

doug
www.db2helpdesk.com

Jan 24 '06 #6
> [...] I will have 2
> primary databases (split brain) when the original primary is repaired and
> brought back up. Any new connections might go to the original primary before
> I have a chance to make it the standby by issuing the command:
> db2 "start hadr on database <original_primary_database> as STANDBY"


That is not entirely correct. Yes, there will be two copies of the
database and both will report the current role as PRIMARY. However,
only the new primary will be able to do work. There is no window of
vulnerability in the described scenario.

As I said above (and yes it is very fundamental to HADR), the
previously failed primary will ** NOT ** allow in new connections until
it establishes a connection with a standby. (With the mentioned
exception that if you force it to start, overriding this
usually-desired behavior by issuing "START HADR .. AS PRIMARY BY
FORCE", then it will start on its own, and in the case where the
original standby already took over as new primary, you'll have split
brain.)

All the means to normally restart the previously failed primary (such
as app connection, restart db command, activate db command, or START
HADR .. AS PRIMARY w/o the "by force" option) will be delayed, and
given there is no standby, will eventually fail with SQL1768N rc 7.
This is expressly designed to prevent just exactly this split brain
problem.
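
To make the cases above easier to compare, here is a toy Python model of the outcomes as described in this thread. It is not DB2 code; the function name, parameters, and result strings are invented for the illustration.

```python
def restart_result(by_force: bool, standby_connects: bool,
                   peer_is_primary: bool) -> str:
    """Outcome of restarting a previously failed primary, per the description above."""
    if by_force:
        # START HADR .. AS PRIMARY BY FORCE skips the wait for a standby.
        if peer_is_primary:
            return "split brain"
        return "primary active without standby check"
    if standby_connects:
        # Normal restart: connections are allowed once the pair is re-established.
        return "primary active and paired"
    # Connection attempt, restart db, activate db, or non-forced START HADR.
    return "SQL1768N rc 7"


if __name__ == "__main__":
    # The scenario from the question: old standby took over, no force on restart.
    print(restart_result(by_force=False, standby_connects=False, peer_is_primary=True))
```

In the scenario from the original question the model lands in the last branch, which is why no application gets connected to the old primary.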

The only exception to this behavior of non-forced START HADR .. AS
PRIMARY is when HADR is started for the very first time. (That is,
when the HADR db role is changed from STANDARD to PRIMARY by the START
HADR command.) In this scenario, the primary may already be active and
connected by applications. We do not force them off (though the start
hadr command will still time out and fail if the standby is not
present). There cannot be a split brain in this scenario because only
the intended primary is present.

Regards,
- Steve P.
------------------------------------
Steve Pearson
IBM DB2 UDB for LUW Development
Portland, OR, USA

Jan 24 '06 #7
Thanks Steve. I do feel better about this now and will try it when I get a
chance.
Jan 24 '06 #8
