Connection hang with HADR takeover by force and old primary server is down

DB2 ESE 8.2.3 (FP10) for Linux

We are experiencing a connection hang of 10 - 15 minutes in the following
HADR and automatic client reroute scenario:

01 server is primary database
02 server is standby database

a. applications connected to database on 01 server
b. shutdown 01 server
c. run takeover db by force on 02 server (force is necessary because
databases are no longer in peer state)
d. a user logged on directly to the 02 server can connect to new primary
database without delay as soon as takeover completed
e. for remote clients it takes about 10-15 minutes to get any response back
(wait time varies each time, and even varies somewhat by app tier blade).
f. after 10-15 minute delay, automatic client reroute on remote clients
reconnects to alternate server 02 after SQL retry.

However, if the following scenario occurs, there is no delay:

a. applications connected to database on 01 server
b. db2 instance stopped with force on 01 server (but 01 server is still up
and can be pinged)
c. run HADR takeover db by force on 02 server (not in peer state)
d. after only a 5-10 second delay, automatic client reroute reconnects to
alternate server 02 after SQL retry

Each of the above scenarios behaves the same way (same delays) with either
the type 2 driver (SQL commands submitted from remote client via CLI) or a
type 4 client (WebSphere 6).

Does anyone know why the connections to the 01 server hang for 10-15
minutes after an HADR takeover by force on 02 only when the 01 server is
completely down, while there is no delay if the 01 server is still reachable
(but the instance is down)?

We tried setting db2tcp_client_rcvtimeout=15 (15 seconds) in the db2 type 2
client's registry. The registry value seems to have helped the waiting issue
(connection released after about 1 minute), but it also seems to have severed
the connection (that is, there was no automatic client reroute retry). The
following error message was received:

SQL30081N A communication error has been detected. Communication protocol
being used: "TCP/IP". Communication API being used: "SOCKETS". Location
where the error was detected: "10.34.9.139". Communication function
detecting the error: "RecvTimeout". Protocol specific error code(s): "4",
"*", "*". SQLSTATE=08001

Then after retry:
Communication function detecting the error: "selectForRecvTimeout".
Protocol specific error code(s): "4", "*", "*". SQLSTATE=08001
Feb 2 '06 #1

You might want to look at your TCP keepalive (system configuration).
We have seen cases where Automatic Client Reroute suffers a response
delay due to the fact that it does not learn of the connection failure
in a timely fashion. This shows up where the socket is broken in the
comms layer (such as when the server host is shut down) but doesn't
show up where the database server connection has an explicit error
returned from DB2; those symptoms seem highly correlated to what you
report.
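
To make the keepalive point concrete, here is a minimal Python sketch, not DB2 configuration. It estimates how long keepalive alone would take to notice a dead peer on an idle connection, and shows how an application that owns its socket can turn keepalive on. The helper names (keepalive_detection_seconds, enable_keepalive) and the 15/5/3 timer values are made up for illustration. The 7200/75/9 figures are the usual Linux defaults for net.ipv4.tcp_keepalive_time, tcp_keepalive_intvl and tcp_keepalive_probes. The DB2 client opens its own sockets, so in practice the system-level settings are the ones to check.

```python
import socket


def keepalive_detection_seconds(idle, interval, probes):
    """Worst-case seconds before keepalive declares an idle peer dead.

    The kernel waits `idle` seconds with no traffic, then sends up to
    `probes` probes spaced `interval` seconds apart.
    """
    return idle + interval * probes


def enable_keepalive(sock, idle=15, interval=5, probes=3):
    """Enable TCP keepalive on one socket (illustrative values).

    SO_KEEPALIVE is portable; the per-socket timer options are not
    available on every platform, so each one is applied only if the
    socket module exposes it.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    for name, value in (
        ("TCP_KEEPIDLE", idle),
        ("TCP_KEEPINTVL", interval),
        ("TCP_KEEPCNT", probes),
    ):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)


if __name__ == "__main__":
    # Usual Linux defaults: 7200 s idle, 75 s between probes, 9 probes.
    print(keepalive_detection_seconds(7200, 75, 9))
    # Illustrative tuned values: 15 s idle, 5 s between probes, 3 probes.
    print(keepalive_detection_seconds(15, 5, 3))

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_keepalive(s)
    print(bool(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)))
    s.close()
```

With the default timers, an idle connection to a host that has been powered off can go unnoticed for a long time, whereas a host that is still up answers immediately with an explicit error once the instance is stopped. That is consistent with the difference between the two scenarios described in the question.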

Regards,
-Steve P.
----------------------------
Steve Pearson
IBM DB2 UDB for LUW Development
Portland, OR, USA

Feb 3 '06 #2
