I just inherited support of this app so bear with me. The scenario: Oracle
9i replicated databases, same C++ server process runs on both boxes. On a 15
minute schedule, the server that is primary deletes records out of a lock
table based on age. Once a week, the primary box is rebooted and clients are
redirected to the secondary. Once the boot is complete, the primary box
continues to handle record cleanup although clients stay pointed to the
secondary. The next night, the secondary bounces and clients move back to
primary.
The problem that I have is that some (and only some) records that are
inserted by the secondary server are never cleaned up by the primary even
though the primary server logs the existence of these records when they
first age & the fact that a delete was executed. Further, despite the fact
that a command line sqlplus select displays them days later, the primary
server never logs their occurrence again. It is as if once the delete is
executed, they are no longer returned in the select to the server process.
This condition spans reboots such that the primary fails to recognize
records that are several weeks old. Hundreds of records are properly cleaned
each week but those that are not are always generated from the secondary
server. No errors are logged to indicate a problem.
At first blush, the answer is simple: broken code. But a code review could
not find any cause for this behavior. See code below:
<start of pseudo-code>
//agebuf set to '0 00:15:00'
exec sql at :instancename declare c cursor for
select * from lock_tbl where lock_timestamp at local <
(localtimestamp - to_dsinterval(:agebuf));
if (sqlca.sqlcode != 0) {