More HADR thoughts

Folks -

Again, a three-peat:

RH AS3, UDB 8.1.7 on one pair of x-series, 8.1.8 on 2 i86 test boxes...

We were just about to put all we had into production and now we're
unable to get HADR to work consistently; we get congested status even
though we have a very fast gigabit network (we can copy over SSH
between our production servers at about 30mb/second, encrypted).

For instance, our production scripts fail (see commentary below) but I
can create a similar set of tables and load 10.4 million rows into the
primary-side and watch the data go to the standby-side perfectly with
no problems. There doesn't seem to be any improvement using our test
8.1.8 boxes over the 8.1.7 boxes.

I'm posting here to see if anybody has any great ideas...I'm sure that
I'll be calling IBM support in a few minutes....

------------------------------- Comments from the Developer:

1) The behavior I generally saw with SWG was the typical "hangup" in
the log processing on the standby where the log numbers would not
continue to increase and it would just stop, then eventually report
congestion on the primary. I always noted that the import script on the
primary would hang, typically around the user and/or user_activity steps.

2) I also tried, via command line as db2admin, manually executing the
steps scripted in import_db.sh. Everything would work up until the
user_activity table at which point the "committed XXX rows" would stop
printing out, at which point I would see the same "hung" behavior on
the standby where the logs would stop progressing and the primary would
eventually get congested.

3) Twice I saw different behavior, typically when running the steps
manually line by line, where the standby would just suddenly
"disconnect" for no apparent reason. I would start HADR on it again as
standby and then it would reconnect.

4) When I saw the import script output hanging in (2) above, I ran
ps -axf and saw several processes I had not noticed before, basically
db2event (db2detaildeadlock). Not sure if this is related.

5) Unfortunately the only way to get out of these situations of
congestion is to kill db2 processes since everything is hung.

Nov 12 '05 #1

I think that contacting support was the right move. We need to find
out what is going on at the standby to cause replay to stall.

Some things that occur to me and may or may not be helpful:

Probably most important, what db operation(s) exactly are happening
when the hang occurs? With all the mention of load and an import
script, is it possible that LOAD with COPY YES is going on? If so, is
there any issue with accessibility of the COPY file from the standby?
If the file is large and/or slow to access, then the standby replay
might come to a screeching halt while the file is retrieved and
applied.

Re (3), the standby might disconnect if it didn't receive anything
(including heartbeat) from the primary for the configured HADR_TIMEOUT.
Btw, in this case, issuing a new START HADR AS STANDBY is not
required; the standby should retry connecting to the primary itself in
a delayed loop.
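
As a rough illustration of the timeout-and-retry behavior described in
Re (3), here is a toy Python sketch. It is not DB2 code: the function
names, the ">=" comparison and the fixed retry delay are all assumptions
made for the example; only the idea of an HADR_TIMEOUT-style limit and a
delayed reconnect loop comes from the post above.

```python
def timed_out(last_received, now, timeout):
    """True once nothing (log data or heartbeat) has arrived for `timeout` seconds.

    Hypothetical helper: `timeout` plays the role of HADR_TIMEOUT.
    """
    return now - last_received >= timeout


def reconnect_attempts(disconnected_at, primary_back_at, retry_delay):
    """Times at which a standby retrying every `retry_delay` seconds would try,
    up to and including the first attempt that finds the primary reachable.

    The fixed delay is an assumption; the real retry schedule is not given
    in this thread.
    """
    if retry_delay <= 0:
        raise ValueError("retry_delay must be positive")
    attempts = []
    t = disconnected_at
    while True:
        t += retry_delay
        attempts.append(t)
        if t >= primary_back_at:
            return attempts


# Last message seen at t=100 with a 120-second timeout:
print(timed_out(100, 219, 120))          # False, still within the window
print(timed_out(100, 220, 120))          # True, the standby would disconnect
# Standby disconnects at t=220, primary reachable again at t=250:
print(reconnect_attempts(220, 250, 10))  # [230, 240, 250]
```

The point of the sketch is only that no manual START HADR AS STANDBY
is needed in this model: the loop keeps trying on its own until the
primary answers.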

Re (4) I don't really know if that event monitor means a deadlock
occurred or if it was coincidentally started to look for same. But if
there is a hang affecting the primary, then the chances of a deadlock
could potentially increase due to stalled progress of some transactions
that hold locks.

Regards,
- Steve P.
IBM DB2 UDB for LUW Development
Portland, OR

Nov 12 '05 #2

In case anyone else is following this, here's in a nutshell what
happened here:

1. Log replay got hung on the standby as a side effect of a failed
buffer pool creation (insufficient db memory configured). The standby
should be failing rather than hanging. We're looking into this.

2. Once log replay stalls, the receive buffer on the standby soon
fills up. After that, the standby can no longer receive any more log
data from the primary. Because the system is configured with NEARSYNC
mode, the primary is not allowed to progress until it receives an ACK
indicating log data is in the standby's memory. So this causes logged
activity to stall on the primary.

3. Heartbeating continues, however. Since the standby cannot receive
any more data off the wire, it doesn't consume heartbeats. Likely the
primary's heartbeats eventually filled the TCP/IP send/receive buffers
between primary and standby, leading to the "congestion" indication on
the primary.
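
To make the chain of events in points 1 and 2 concrete, here is a toy
Python model. It is a sketch under stated assumptions, not how DB2 is
implemented: the Standby class, the page-at-a-time shipping and the
4-page buffer are all invented for illustration. It only shows why a
hung replay plus a bounded receive buffer stalls a primary that waits
for an ACK on every log write.

```python
from collections import deque


class Standby:
    """Toy standby: a bounded receive buffer drained by log replay."""

    def __init__(self, buffer_pages, replay_running=True):
        self.buffer = deque()
        self.buffer_pages = buffer_pages      # hypothetical buffer size
        self.replay_running = replay_running  # False models the hung replay

    def receive(self, page):
        """Return True (ACK) if the page fits in memory, else False."""
        if len(self.buffer) >= self.buffer_pages:
            return False  # buffer full: nothing more can be received
        self.buffer.append(page)
        return True

    def replay_one(self):
        """Replay (and free) one buffered page, unless replay is hung."""
        if self.replay_running and self.buffer:
            self.buffer.popleft()


def run_primary(standby, pages_to_ship):
    """Ship log pages one at a time and stop at the first missing ACK.

    Returns how many pages the primary managed to get acknowledged.
    """
    acked = 0
    for page in range(pages_to_ship):
        if not standby.receive(page):
            break  # NEARSYNC-style wait: no ACK, logged work stalls
        acked += 1
        standby.replay_one()
    return acked


print(run_primary(Standby(buffer_pages=4), 100))                        # 100
print(run_primary(Standby(buffer_pages=4, replay_running=False), 100))  # 4
```

With replay running, the buffer never fills and all 100 pages are
acknowledged. With replay hung, the primary gets exactly as far as the
buffer size and then waits, which matches the stall seen on the primary.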

Regards,
- Steve P.
IBM DB2 UDB for LUW Development
Portland, OR

Nov 12 '05 #3
Steve -

Thanks for posting the end result; we can all learn from these sorts
of things.

When the BPs were created we DID receive an error in the db2diag.log
saying that the create was deferred. We feel our buffers were tiny and
that we shouldn't necessarily have received this error, but we didn't
see the error's downstream effects. This was our major error.

I'm told that the HADR process should have thrown an error when the
rollforward failed because of the buffer pool issue; I'm also told
that this is being looked into, as you mentioned.

BTW, Jamie Nisbet at IBM support did an absolutely incredible job for
us and we would like to thank him for his dedication to helping us
solve this problem. He was able to track this problem down in an
extremely professional manner and we commend him. It isn't easy to
track down a sync problem between two servers but he did it and did it
wonderfully well.

I don't want to forget Jeffery Dokos at level 2 support, who helped us
immensely and also deserves our praise for a job well done.

An attaboy and our sincere thanks to Jamie and Jeffery.

Nov 12 '05 #4
"Steve Pearson" <st*******@my-deja.com> wrote in message
news:11**********************@g14g2000cwa.googlegroups.com...

Hi Steve,

Thanks for posting, so we can keep learning about HADR.

Best regards,

Jean-Marc
Nov 12 '05 #5
