Note that other thescripts reports show problems with db2hmon going back to version 8.2.
My advice for high transaction volume databases is to not even think about leaving Health Monitor active. (I also see other autonomous features such as STMM giving undesirable effects, so we have deactivated self tuning memory manager as well.).
Here is more detail, from the doc I sent IBM on 2008.03.14:
======
Summary
We are a DB2 version 9 Fixpack 4 user, 64-bit, running on a Solaris 10 zone on a T2000. The production application is very high transaction volume, 20-90 transactions per second.
We have experienced three separate errors with DB2 Health Monitor that have forced us to deactivate it entirely. To summarize they are:
1. In high transaction environment, every time DB2 Health Monitor runs (every 2 hours) causes heavy new connection slowdown and heavy table locking.
2. In one server, DB2 Health Monitor crashes every time it runs with a sqlerRemoveAllIPCforRow violation. When it crashes it causes heavy connection slowdowns and heavy table locking which result in errors for the heavy concurrency user application.
3. DB2 Health Monitor CALL REORGCHK_IX_STATS fails with a SQLCODE “-2306” even though the table exists.
Detail
Problem #1 Health Monitor causes heavy locking and new connections slowdown
We originally opened this PMR to request IBM’s help in determining why our high transaction volume application was experiencing strong connection slowdowns every 2 hours. IBM's engineer said that it did not seem to be a DB2 process that was causing the problem. I did decide to fix MAXLOCKS and that made a marginal improvement, but the problem continued.
Yesterday, realizing that DB2HMON runs every 2 hours, we deactivated DB2HMON ( UPDATE DBM CFG USING HEALTH_MON OFF ) and the connection slowdown/locking related application failures immediately ended. It is obvious that the problem was entirely caused by Health Monitor.
Problem #2 Health Monitor crashes with IPC Resource Error
On the staging server (NOT the production server), every time DB2HMON attempts to run, it crashes. As a result we have decided to completely prevent DB2HMON from starting, using IBM's advice ( http://www-1.ibm.com/support/docview...id=swg21259046 ) , by setting the DB2_FMP_COMM_HEAPSZ to 0. Despite the fact that we are not using Health Monitor, we would like IBM to fix this crash behavior.
Here is an example from today’s db2diag.log on the staging database:
Expand|Select|Wrap|Line Numbers
- 2008-03-13-00.18.44.059066-300 E876A408 LEVEL: Warning
- PID : 10499 TID : 1 PROC : db2sysc 0
- INSTANCE: db2 NODE : 000
- FUNCTION: DB2 UDB, routine_infrastructure, sqlerReturnFmpToPool, probe:999
- DATA #1 : String, 22 bytes
- Removing FMP from pool
- DATA #2 : Hexdump, 16 bytes
- 0xFFFFFFFF7FFFE160 : 0000 0000 0000 0000 0000 084F 0000 0000 ...........O....
- 2008-03-13-00.18.44.076838-300 E1285A456 LEVEL: Error
- PID : 10499 TID : 1 PROC : db2sysc 0
- INSTANCE: db2 NODE : 000
- FUNCTION: DB2 UDB, routine_infrastructure, sqlerRemoveAllIPCforRow, probe:10
- DATA #1 : String, 32 bytes
- Freeing IPC resource explicitly:
- DATA #2 : SQLO_PID, PD_TYPE_SQLO_PID, 4 bytes
- 2127
- DATA #3 : Hexdump, 4 bytes
- 0x0000000200447FE0 : 0000 0000 ....
- 2008-03-13-00.18.44.077523-300 E1742A345 LEVEL: Error
- PID : 10499 TID : 1 PROC : db2sysc 0
- INSTANCE: db2 NODE : 000
- FUNCTION: DB2 UDB, routine_infrastructure, sqlerRemoveAllIPCforRow, probe:20
- DATA #1 : String, 22 bytes
- IPC resources Address:
- DATA #2 : Pointer, 8 bytes
- 0x0000000210010080
- 2008-03-13-00.18.44.078221-300 E2088A1046 LEVEL: Error
- PID : 10499 TID : 1 PROC : db2sysc 0
- INSTANCE: db2 NODE : 000
- FUNCTION: DB2 UDB, routine_infrastructure, sqlerRemoveAllIPCforRow, probe:30
- DATA #1 : String, 29 bytes
- Number of IPC resource found:
- DATA #2 : signed integer, 4 bytes
- 1
- DATA #3 : String, 29 bytes
- Number of IPC resource freed:
- DATA #4 : signed integer, 4 bytes
- 1
- CALLSTCK:
- [0] 0xFFFFFFFF7C88B804 __1cXsqlerRemoveAllIPCforRow6FpnLsqlerFmpRow_b_i_ + 0x89C
- [1] 0xFFFFFFFF7C881A68 __1cXsqlerRemoveFmpFromTable6FpnLsqlerFmpRow_b_i_ + 0x220
- [2] 0xFFFFFFFF7C884A6C __1cUsqlerReturnFmpToPool6FccpnOsqlerFmpHandle_pnNsqle_agent_cb__i_ + 0x15DC
- [3] 0x000000010000AD9C __1cOsqleRunSysCtlr6F_i_ + 0x3D4
- [4] 0x0000000100006A64 __1cLsqleSysCtlr6F_i_ + 0x384
- [5] 0xFFFFFFFF7AF75CDC __1cHDBGTerm6F_i_ + 0x1F94
- [6] 0xFFFFFFFF7AF76984 sqloRunInstance + 0x5E4
- [7] 0x0000000100006524 main + 0x924
- [8] 0x0000000100004E5C _start + 0x17C
- [9] 0x0000000000000000 ?unknown + 0x0
- 2008-03-13-00.18.44.091581-300 I3135A433 LEVEL: Severe
- PID : 10499 TID : 1 PROC : db2sysc 0
- INSTANCE: db2 NODE : 000
- FUNCTION: DB2 UDB, base sys utilities, sqleChildCrashHandler, probe:15
- DATA #1 : Hexdump, 31 bytes
- 0x000000010000DB20 : 4865 616C 7468 204D 6F6E 6974 6F72 2050 Health Monitor P
- 0x000000010000DB30 : 726F 6365 7373 2063 7261 7368 6564 2E rocess crashed.
- 2008-03-13-00.18.44.092207-300 I3569A340 LEVEL: Severe
- PID : 10499 TID : 1 PROC : db2sysc 0
- INSTANCE: db2 NODE : 000
- FUNCTION: DB2 UDB, base sys utilities, sqleChildCrashHandler, probe:16
- DATA #1 : Hexdump, 4 bytes
- 0xFFFFFFFF7FFFE170 : 0000 084F ...O
- 2008-03-13-00.18.44.092769-300 I3910A340 LEVEL: Severe
- PID : 10499 TID : 1 PROC : db2sysc 0
- INSTANCE: db2 NODE : 000
- FUNCTION: DB2 UDB, base sys utilities, sqleChildCrashHandler, probe:17
- DATA #1 : Hexdump, 4 bytes
- 0xFFFFFFFF7FFFE174 : 0000 0101 ....
- 2008-03-13-00.18.44.093349-300 I4251A340 LEVEL: Severe
- PID : 10499 TID : 1 PROC : db2sysc 0
- INSTANCE: db2 NODE : 000
- FUNCTION: DB2 UDB, base sys utilities, sqleChildCrashHandler, probe:18
- DATA #1 : Hexdump, 4 bytes
- 0xFFFFFFFF7FFFE178 : FFFF FFFF ....
- 2008-03-13-00.18.44.110896-300 I4592A282 LEVEL: Warning
- PID : 10499 TID : 1 PROC : db2sysc 0
- INSTANCE: db2 NODE : 000
- FUNCTION: DB2 UDB, base sys utilities, sqleRunSysCtlr, probe:63
- MESSAGE : Health Monitor Process restarted.
When I was running an event monitor to diagnose “Problem #1” above-referenced in this document, I noticed that Health Monitor was running CALL REORGCHK_IX_STATS with a valid table name, for example:
CALL REORGCHK_IX_STATS ('T', '"NEAT "."LOCATION_SETS_LOG"')
And returning SQLCODE -2306, “The table or index name does not exist” even though the table NEAT.LOCATION_SETS_LOG does exist.
Here is event monitor output illustrating the problem:
Expand|Select|Wrap|Line Numbers
- 89214) Statement Event ...
- Appl Handle: 587
- Appl Id: *LOCAL.db2.080311150558
- Appl Seq number: 00006
- Record is the result of a flush: FALSE
- -------------------------------------------
- Type : Dynamic
- Operation: Execute
- Section : 6
- Creator : NULLID
- Package : SYSSH200
- Consistency Token : SYSLVL01
- Package Version ID :
- Cursor : SQL_CURSH200C6
- Cursor was blocking: FALSE
- Text : CALL REORGCHK_IX_STATS ('T', '"NEAT "."FLIGHT_FARE_LOG"')
- -------------------------------------------
- Start Time: 03/11/2008 10:05:57.563826
- Stop Time: 03/11/2008 10:05:57.713239
- Elapsed Execution Time: 0.149413 seconds
- Number of Agents created: 1
- User CPU: 0.000000 seconds
- System CPU: 0.000000 seconds
- Fetch Count: 0
- Sorts: 1
- Total sort time: 0
- Sort overflows: 0
- Rows read: 0
- Rows written: 0
- Internal rows deleted: 0
- Internal rows updated: 0
- Internal rows inserted: 0
- Bufferpool data logical reads: 0
- Bufferpool data physical reads: 0
- Bufferpool temporary data logical reads: 0
- Bufferpool temporary data physical reads: 0
- Bufferpool index logical reads: 0
- Bufferpool index physical reads: 0
- Bufferpool temporary index logical reads: 0
- Bufferpool temporary index physical reads: 0
- Bufferpool xda logical page reads: 0
- Bufferpool xda physical page reads: 0
- Bufferpool temporary xda logical page reads: 0
- Bufferpool temporary xda physical page reads: 0
- SQLCA:
- sqlcode: -2306
- sqlstate:
Expand|Select|Wrap|Line Numbers
- 2008-03-11-10.05.57.713912-300 I276926A452 LEVEL: Error
- PID : 18727 TID : 26 PROC : db2acd 0
- INSTANCE: db2 NODE : 000
- APPID : *LOCAL.db2.080311150558
- FUNCTION: DB2 UDB, Automatic Table Maintenance, db2HmonEvalReorg, probe:515
- MESSAGE : Failed to evaluate table "NEAT "."FLIGHT_FARE_LOG". Probe 400
- DATA #1 : Hexdump, 4 bytes
- 0xFFFFFFFF6F53F470 : FFFF F6FE ....