More HADR thoughts

bwmiller16

Folks -

Again, a three-peat:

RH AS3, UDB 8.1.7 on one pair of x-series, 8.1.8 on 2 i86 test boxes...

We were just about to put everything we had into production, and now
we're unable to get HADR to work consistently; we get congested status
even though we have a very fast gigabit network (we are able to SSH copy
between our production servers at about 30mb/second, encrypted).

For instance, our production scripts fail (see commentary below) but I
can create a similar set of tables and load 10.4 million rows into the
primary-side and watch the data go to the standby-side perfectly with
no problems. There doesn't seem to be any improvement using our test
8.1.8 boxes over the 8.1.7 boxes.

I'm posting here to see if anybody has any great ideas...I'm sure that
I'll be calling IBM support in a few minutes....

------------------------------- Comments from the Developer:

1) The behavior I generally saw with SWG was the typical "hangup" in
the log processing on the standby where the log numbers would not
continue to increase and it would just stop, then eventually report
congestion on the primary. I always noted that the import script on the
primary would hang, typically around the user or user_activity tables.

2) I also tried, via command line as db2admin, manually executing the
steps scripted in import_db.sh. Everything would work up until the
user_activity table at which point the "committed XXX rows" would stop
printing out, at which point I would see the same "hung" behavior on
the standby where the logs would stop progressing and the primary would
eventually get congested.

3) 2 times I saw different behavior, typically when manually doing it
line by line, where the secondary would just suddenly "disconnect" for
no apparent reason. I would start hadr again on it as standby and then
it would re-connect.

4) When I saw the import script output hanging in (2) above, I did a ps
-axf and saw several processes I had not noticed before, basically
db2event (db2detaildeadlock). Not sure if that is related.

5) Unfortunately the only way to get out of these situations of
congestion is to kill db2 processes since everything is hung.

Nov 12 '05 #1

Steve Pearson

I think that contacting support was the right move. We need to find
out what is going on at the standby to cause replay to stall.

Some things that occur to me and may or may not be helpful:

Probably most important, what db operation(s) exactly are happening
when the hang occurs? With all the mention of load and an import
script, is it possible that LOAD with COPY YES is going on? If so, is
there any issue with accessibility of the COPY file from the standby?
If the file is large and/or slow to access, then the standby replay
might come to a screeching halt while the file is retrieved and
applied.

Re (3), the standby might disconnect if it didn't receive anything
(including heartbeat) from the primary for the configured HADR_TIMEOUT.
Btw, in this case, issuing a new START HADR AS STANDBY is not
required; the standby should retry connecting to the primary itself in
a delayed loop.
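
As an illustration of the timeout-and-retry behaviour described above, here is a minimal Python sketch. It is not DB2's internal logic: the function names, the fixed delay, and the attempt limit are all assumptions made for the example. The only idea taken from the post is that a peer is dropped when nothing (not even a heartbeat) arrives within the configured timeout, and that reconnection is then retried in a delayed loop.

```python
import time


def peer_timed_out(last_received, now, timeout_seconds):
    """Return True when nothing (not even a heartbeat) arrived within the timeout."""
    return (now - last_received) > timeout_seconds


def reconnect_with_delay(try_connect, delay_seconds, max_attempts, sleep=time.sleep):
    """Call try_connect() repeatedly, waiting delay_seconds between attempts.

    Returns the 1-based attempt number that succeeded, or None if every
    attempt failed. max_attempts is an assumption of this sketch so that the
    loop always terminates.
    """
    for attempt in range(1, max_attempts + 1):
        if try_connect():
            return attempt
        if attempt < max_attempts:
            sleep(delay_seconds)
    return None
```

With a hypothetical timeout of 120 seconds, `peer_timed_out(0, 121, 120)` is true, and the standby in this toy model would then enter `reconnect_with_delay` rather than waiting for an operator to restart it.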

Re (4) I don't really know if that event monitor means a deadlock
occurred or if it was coincidentally started to look for same. But if
there is a hang affecting the primary, then the chances of a deadlock
could potentially increase due to stalled progress of some transactions
that hold locks.

Regards,
- Steve P.
IBM DB2 UDB for LUW Development
Portland, OR

Nov 12 '05 #2

Steve Pearson

In case anyone else is following this, here's in a nutshell what
happened here:

1. Log replay got hung on the standby as a side effect of a failed
buffer pool creation (insufficient db memory configured). The standby
should be failing rather than hanging. We're looking into this.

2. Once log replay stalls, the receive buffer on the standby soon
fills up. After that, the standby can no longer receive any more log
data from the primary. Because the system is configured with NEARSYNC
mode, the primary is not allowed to progress until it receives an ACK
indicating log data is in the standby's memory. So this causes logged
activity to stall on the primary.

3. Heartbeating continues, however. Since the standby cannot receive
any more data off the wire, it doesn't consume heartbeats. Likely the
primary's heartbeats eventually filled the TCP/IP send/receive buffers
between primary and standby, leading to the "congestion" indication on
the primary.
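
The chain of events in points 1 to 3 can be sketched as a toy simulation in Python. This is only a model of the back-pressure idea, not of DB2 itself: the one-page-per-tick rates, the buffer capacity, and all names are assumptions chosen for illustration. The primary may commit a page only once the standby has accepted it into its receive buffer, so when replay stops draining the buffer, the primary stops shortly afterwards.

```python
from collections import deque


def simulate(ticks, buffer_capacity, replay_stalls_at):
    """Toy model of a primary shipping one log page per tick to a standby.

    Returns (committed, replayed, blocked_since):
      committed     - pages the primary was allowed to commit
      replayed      - pages the standby applied before stalling
      blocked_since - first tick at which the primary could not ship, or None
    """
    receive_buffer = deque()
    committed = 0
    replayed = 0
    blocked_since = None

    for tick in range(ticks):
        # Standby replay drains one page per tick until the stall begins.
        if tick < replay_stalls_at and receive_buffer:
            receive_buffer.popleft()
            replayed += 1

        # The primary commits only if the standby has room to receive the page.
        if len(receive_buffer) < buffer_capacity:
            receive_buffer.append(tick)
            committed += 1
        elif blocked_since is None:
            blocked_since = tick

    return committed, replayed, blocked_since


print(simulate(ticks=20, buffer_capacity=4, replay_stalls_at=5))  # (8, 4, 8)
```

In this run replay stalls at tick 5, the four-page buffer fills over the next few ticks, and from tick 8 onward the primary makes no further progress, which mirrors the stall described above.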

Regards,
- Steve P.
IBM DB2 UDB for LUW Development
Portland, OR

Nov 12 '05 #3

bwmiller16

Steve -

Thanks for posting the end result; we can all learn from these sorts
of things.

When the BPs were created we DID receive an error in the db2diag.log
saying that the create was deferred. We feel our buffer pools were
tiny and that we shouldn't necessarily have received this error, but
we didn't see the error's downstream effects. This was our major
error.
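
Since the warning was in the log all along, a small script that pulls out matching lines can help such messages get noticed. Below is a minimal Python sketch that scans a text log for keywords, case-insensitively. The keyword list and file path are assumptions; the exact wording DB2 uses for a deferred buffer pool creation is not given in this thread, so the keywords would need adjusting to match the real messages.

```python
def find_matching_lines(path, keywords):
    """Return (line_number, text) pairs for lines containing any keyword.

    Matching is case-insensitive. Undecodable bytes are replaced so that a
    stray character in the log does not abort the scan.
    """
    lowered = [keyword.lower() for keyword in keywords]
    matches = []
    with open(path, encoding="utf-8", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            text = line.rstrip("\n")
            if any(keyword in text.lower() for keyword in lowered):
                matches.append((number, text))
    return matches
```

A call such as `find_matching_lines("db2diag.log", ["bufferpool", "buffer pool", "deferred"])` (hypothetical path and keywords) returns the candidate lines to review after each configuration change.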

I'm told that the HADR process should have thrown an error when the
rollforward failed because of the buffer pool issue; I'm also told
that this is being looked into, as you mentioned.

BTW, Jamie Nisbet at IBM support did an absolutely incredible job for
us and we would like to thank him for his dedication to helping us
solve this problem. He was able to track this problem down in an
extremely professional manner and we commend him. It isn't easy to
track down a sync problem between two servers but he did it and did it
wonderfully well.

I don't want to forget Jeffery Dokos at level 2 support, who helped us
immensely and also deserves our praise for a job well done.

An attaboy and our sincere thanks to Jamie and Jeffery.

Nov 12 '05 #4

Jean-Marc Blaise

"Steve Pearson" <st*******@my-deja.com> wrote in message
news:11**********************@g14g2000cwa.googlegroups.com...

In case anyone else is following this, here's in a nutshell what
happened here:

1. Log replay got hung on the standby as a side effect of a failed
buffer pool creation (insufficient db memory configured). The standby
should be failing rather than hanging. We're looking into this.

2. Once log replay stalls, the receive buffer on the standby soon
fills up. After that, the standby can no longer receive any more log
data from the primary. Because the system is configured with NEARSYNC
mode, the primary is not allowed to progress until it receives an ACK
indicating log data is in the standby's memory. So this causes logged
activity to stall on the primary.

3. Heartbeating continues, however. Since the standby cannot receive
any more data off the wire, it doesn't consume heartbeats. Likely the
primary's heartbeats eventually filled the TCP/IP send/receive buffers
between primary and standby, leading to the "congestion" indication on
the primary.

Regards,
- Steve P.
IBM DB2 UDB for LUW Development
Portland, OR

Hi Steve,

Thanks for posting, so we can keep learning about HADR.

Best regards,

Jean-Marc

Nov 12 '05 #5
