Bytes IT Community

Postmaster Problem

I am normally the admin for our PostgreSQL servers, but someone else tried
to kill a runaway query while I was out sick. Unfortunately, he tried
killing the runaway query by killing the postmaster. Now we are dealing
with a persistent problem that just won't go away.

1) Spontaneous backend deaths.
2) Spontaneous backend disconnects (which are probably caused by 1).
3) pg_clog files disappearing.

The first symptom of the problem was a message I received while within psql.
It was something like this: "Some backend process died abnormally, causing
your session to terminate." I could reconnect to the database afterwards.

The second symptom was another user receiving an error message along the
lines of:

"ERROR: could not access status of transaction 97201274
DETAIL: could not open file "/var/lib/pgsql/data/pg_clog/005C": No such
file or directory".

The query was a simple count(*) operation: "select count(*) from
[sometable]".

I read prior Usenet postings for this error and did the following:

1) touch /var/lib/pgsql/data/pg_clog/005C.temp
2) chown postgres.postgres /var/lib/pgsql/data/pg_clog/005C.temp
3) chmod 0600 /var/lib/pgsql/data/pg_clog/005C.temp
4) dd if=/dev/zero of=/var/lib/pgsql/data/pg_clog/005C.temp bs=8192 count=1
5) mv /var/lib/pgsql/data/pg_clog/005C.temp /var/lib/pgsql/data/pg_clog/005C

Then I reloaded PostgreSQL (the prior postings didn't indicate a need to
reload or restart the database, so I chose the least disruptive option):

/etc/init.d/postgresql reload

I tried the query again, and I got an error message saying that PostgreSQL
couldn't read the new 005C file past a certain offset. So I reran the dd
command above, but used a count of 100 instead of 1. The count query then
worked. I crossed my fingers and hoped the problem was solved.
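For reference, the steps above can be collected into one sketch, using the 100-page count that finally made the query work. The paths assume the default /var/lib/pgsql/data layout from the error message; here PGDATA defaults to a scratch directory so the sketch can be exercised safely, and the chown step is omitted (run that as root against the real cluster, as in step 2 above).

```shell
# Sketch of the pg_clog segment rebuild described above.
# PGDATA defaults to a scratch directory; point it at the real cluster
# (/var/lib/pgsql/data) in production, and remember to
# chown postgres:postgres the file as in step 2 above.
PGDATA="${PGDATA:-$(mktemp -d)}"
mkdir -p "$PGDATA/pg_clog"

SEG=005C                          # the missing segment named in the error
TMP="$PGDATA/pg_clog/$SEG.temp"

touch "$TMP"
chmod 0600 "$TMP"                 # clog files are owner-readable only
# Fill with 100 zeroed 8 KB pages (count=1 was too short, per above).
dd if=/dev/zero of="$TMP" bs=8192 count=100 2>/dev/null
mv "$TMP" "$PGDATA/pg_clog/$SEG"  # rename into place in one step
```

Building the file under a .temp name and renaming it into place keeps a half-written segment from ever being visible to the server.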

However, the same problem reappeared the next morning. My suspicion was
that PostgreSQL's metadata tables weren't properly updated since I
performed the above operations on a running database. I then brought
PostgreSQL down completely (/etc/init.d/postgresql stop), repeated the
procedure, then restarted the database. Again, the problem seemed to be
fixed.

That was until a user called and reported an error message about a
spontaneous backend disconnection (I viewed the message myself, and it
indeed said that data could not be received from the server). I'm not
certain that was caused by PostgreSQL, because the developer of that
particular program has had long standing problems caused by his own bugs.

I came into work this morning, and tried the same select count(*) query that
I mentioned above, and I got the same file not found error I posted above.
Following the same procedures I outlined above fixed the problem, but now I
know the fix is only temporary.

My questions are:

1) What kind of database damage can I expect to find?

2) I think my daily pg_dumps are succeeding, so is my only real option to
start a new database and restore from a backup?

3) Why do the clog files I create keep disappearing? Is it because of
vacuuming?

4) Can this problem be fixed without resorting to creating a new database
cluster? This cluster hosts about a dozen active databases spread across 4
departments.

Mar 2 '06 #1
2 Replies


Tony O'Bryan wrote:
[snip]


The only quick solution appears to be copying the database cluster.

1) Create a new cluster at a different location using initdb
2) Disable remote access to the old cluster.
3) Dump the cluster.
4) Start a parallel instance of the postmaster at the new location.
5) Import the dump into the new instance.
6) Shut down both instances.
7) Rename the old instance to an archival area.
8) Rename the new instance to that of the old instance.
9) Restart the database.

The new instance is now free of all the problems caused by killing the
postmaster.
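The nine steps above can be sketched as a script. The ports, paths, and init-script name are assumptions (a stock RHEL-style install); the script records each command and defaults to a dry run, so unset DRY_RUN only after adjusting it for your site. Step 2 (disabling remote access in pg_hba.conf) is left as a manual edit.

```shell
# Dry-run sketch of the dump-and-rebuild procedure above.
# All paths, ports, and service names are assumptions; adjust before use.
DRY_RUN="${DRY_RUN:-1}"          # leave set to only print the plan
OLD=/var/lib/pgsql/data          # the existing (damaged) cluster
OLD_PORT=5432
NEW=/var/lib/pgsql/data.new      # hypothetical location for the new cluster
NEW_PORT=5433
ARCHIVE=/var/lib/pgsql/data.broken

PLAN=""
run() {                          # print each command; execute only when live
    PLAN="$PLAN$* ; "
    echo "+ $*"
    if [ -z "$DRY_RUN" ]; then "$@"; fi
}

run initdb -D "$NEW"                                   # 1) new cluster
# 2) disable remote access: edit $OLD/pg_hba.conf by hand first
run pg_ctl -D "$NEW" -o "-p $NEW_PORT" start           # 4) parallel postmaster
run sh -c "pg_dumpall -p $OLD_PORT | psql -p $NEW_PORT template1"  # 3+5) dump, import
run pg_ctl -D "$NEW" stop                              # 6) shut down both
run /etc/init.d/postgresql stop
run mv "$OLD" "$ARCHIVE"                               # 7) archive the old data
run mv "$NEW" "$OLD"                                   # 8) promote the new data
run /etc/init.d/postgresql start                       # 9) restart
```

Running the new postmaster on a second port lets the dump stream straight from the old cluster into the new one without an intermediate file.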

Mar 3 '06 #2

Tony O'Bryan wrote:
[snip]

I'm not an expert with Postgres, but I do use it a lot. After reading your
problem, and before looking at your follow-up, I would have suggested the
same thing: back up your data with pg_dump and re-create the DB cluster.
Interesting how simple and yet effective Postgres is! I love this
database!!!

Schmidty
Apr 2 '06 #3
