I am normally the admin for our PostgreSQL servers, but someone else tried
to kill a runaway query while I was out sick. Unfortunately, he tried
killing the runaway query by killing the postmaster. Now we are dealing
with a set of recurring problems that just won't go away:
1) Spontaneous backend deaths.
2) Spontaneous backend disconnects (probably a consequence of 1).
3) pg_clog files disappearing.
The first symptom of the problem was a message I received while within psql.
It was something like this: "Some backend process died abnormally, causing
your session to terminate." I could reconnect to the database afterwards.
The second symptom was another user receiving an error message along the
lines of:
"ERROR: could not access status of transaction 97201274
DETAIL: could not open file "/var/lib/pgsql/data/pg_clog/005C": No such
file or directory".
The query was a simple count(*) operation: "select count(*) from
[sometable]".
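For what it's worth, the segment name in the error can be derived from the
transaction ID: each transaction's status takes 2 bits in the clog, so an
8 kB page holds 32768 statuses, and (if I understand the SLRU layout
correctly) each segment file holds 32 pages. A quick sanity check:

```shell
# Each xid takes 2 bits of clog -> 8192 * 4 = 32768 xids per page,
# and (assumed) 32 pages per segment -> 1048576 xids per segment.
xid=97201274
xacts_per_segment=$((8192 * 4 * 32))
printf "segment %04X\n" $((xid / xacts_per_segment))   # prints: segment 005C
```

So the xid in the error message really does land in segment 005C, which at
least confirms the missing file is the one that query needed.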
I read prior Usenet postings for this error and did the following:
1) touch /var/lib/pgsql/data/pg_clog/005C.temp
2) chown postgres.postgres /var/lib/pgsql/data/pg_clog/005C.temp
3) chmod 0600 /var/lib/pgsql/data/pg_clog/005C.temp
4) dd if=/dev/zero of=/var/lib/pgsql/data/pg_clog/005C.temp bs=8192 count=1
5) mv /var/lib/pgsql/data/pg_clog/005C.temp /var/lib/pgsql/data/pg_clog/005C
Then I reloaded PostgreSQL (the prior postings didn't indicate a need to
reload or restart the database, so I chose the least disruptive option):
/etc/init.d/postgresql reload
I tried the query again, and I got an error message saying that PostgreSQL
couldn't read the new 005C file past a certain offset. So I reran the dd
command above, but used a count of 100 instead of 1. The count query then
worked. I crossed my fingers and hoped the problem was solved.
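If my arithmetic is right, a full clog segment is 32 pages of 8 kB, i.e.
256 kB, so instead of guessing at a count the zero-filled replacement can be
created at full segment size in one step. A sketch (written against /tmp for
illustration; the real target would be the pg_clog path used above):

```shell
# Build a zero-filled clog segment at its full size:
# 32 pages x 8192 bytes = 262144 bytes (256 kB).
dd if=/dev/zero of=/tmp/005C.temp bs=8192 count=32
wc -c < /tmp/005C.temp   # prints: 262144
```

A count of 100 as I used above overshoots that size, which apparently did no
harm, but matching the real segment size seems safer.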
However, the same problem reappeared the next morning. My suspicion was
that PostgreSQL's metadata tables weren't properly updated since I
performed the above operations on a running database. I then brought
PostgreSQL down completely (/etc/init.d/postgresql stop), repeated the
procedure, then restarted the database. Again, the problem seemed to be
fixed.
That was until a user called and reported an error message about a
spontaneous backend disconnection (I viewed the message myself, and it
indeed said that data could not be received from the server). I'm not
certain that was caused by PostgreSQL, because the developer of that
particular program has had long-standing problems caused by his own bugs.
I came into work this morning, tried the same select count(*) query, and got
the same file-not-found error. Following the same procedure fixed the
problem again, but by now I know the fix is only temporary.
My questions are:
1) What kind of database damage can I expect to find?
2) I think my daily pg_dumps are succeeding, so is my only real option to
start a new database and restore from a backup?
3) Why does the clog segment I create keep disappearing? Is it because of
vacuuming?
4) Can this problem be fixed without resorting to creating a new database
cluster? This cluster hosts about a dozen active databases spread across 4
departments.