467,188 Members | 1,434 Online
Bytes | Developer Community
Ask Question

Home New Posts Topics Members FAQ

Post your question to a community of 467,188 developers. It's quick & easy.

Problems restarting after database crashed (signal 11).

Yesterday, while attempting to access a database, I received errors
saying that the database was innaccessible. After investigating a
little, I found the following in the PostgreSQL log files:

2004-06-30 08:30:19 [24119] LOG: checkpoint process (PID 28423) was
terminated by signal 11
2004-06-30 08:30:19 [24119] LOG: terminating any other active server
processes
2004-06-30 08:30:19 [28383] WARNING: terminating connection because of
crash of another server process
DETAIL: The postmaster has commanded this server process to roll back
the curre nt transaction and exit, because another server process exited
abnormally and po ssibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and
repeat yo ur command.
2004-06-30 08:30:19 [28362] WARNING: terminating connection because of
crash of another server process
DETAIL: The postmaster has commanded this server process to roll back
the curre nt transaction and exit, because another server process exited
abnormally and po ssibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and
repeat yo ur command.

The last bit then repeated a few more times, and then:

2004-06-30 08:30:20 [24119] LOG: all server processes terminated;
reinitializing
2004-06-30 08:30:20 [28424] LOG: database system was interrupted at 2004-06-30
08:22:23 CDT
2004-06-30 08:30:20 [28424] LOG: checkpoint record is at 8/77703F9C
2004-06-30 08:30:20 [28424] LOG: redo record is at 8/775B1D38; undo
record is at 0/0; shutdown FALSE
2004-06-30 08:30:20 [28424] LOG: next transaction ID: 1638554; next
OID: 1058492
2004-06-30 08:30:20 [28424] LOG: database system was not properly shut
down; automatic recovery in progress
2004-06-30 08:30:20 [28424] LOG: redo starts at 8/775B1D38
2004-06-30 08:30:21 [28430] LOG: connection received: host=[local] port=
2004-06-30 08:30:21 [28430] FATAL: the database system is starting up
2004-06-30 08:30:38 [28424] LOG: record with zero length at 8/78855F38
2004-06-30 08:30:38 [28424] LOG: redo done at 8/78853EE0
2004-06-30 08:31:40 [28449] LOG: connection received: host=[local] port=
2004-06-30 08:31:40 [28449] FATAL: the database system is starting up
2004-06-30 08:31:48 [28452] LOG: connection received: host=[local] port=
2004-06-30 08:31:48 [28452] FATAL: the database system is starting up
2004-06-30 08:31:53 [28459] LOG: connection received: host=[local] port=
2004-06-30 08:31:53 [28459] FATAL: the database system is starting up

And this then continues on and on. Even 20 minutes later, attempts to
connect to the database were met with the same FATAL error.

Eventually I attempted to shut it down and restart it, however that
failed too. When I attempted to shut it down, I discovered a hung
'startup subprocess' that can't be killed.

nexus:~# ps aux | grep postgres
postgres 28424 0.0 1.5 16804 3044 pts/313 D 08:35 0:06 postgres:
startup subprocess
nexus:~# kill -9 28424
nexus:~# ps aux | grep postgres
postgres 28424 0.0 1.5 16804 3044 pts/313 D 08:35 0:06 postgres:
startup subprocess
nexus:~#

As soon as I can get physical access to the machine, I'm planning to
reboot it, as I can't think of anything else to do to kill a process
that can't be kill -KILL'ed.

I'm worried that attempting to start the database after rebooting will
fail in the same way, however. Has anyone seen anything like this
before, or have any ideas on how to proceed?

I'm running on an Intel Pentium Pro box, with Debian/GNU Linux, running
'unstable'. I'm using PostgreSQL 7.4.3.

Thank you for your help.

--
| Christopher
+------------------------------------------------+
| Here I stand. I can do no other. |
+------------------------------------------------+
---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to ma*******@postgresql.org)

Nov 23 '05 #1
  • viewed: 2830
Share:
13 Replies
Christopher Cashell <to**********@zyp.org> writes:
Eventually I attempted to shut it down and restart it, however that
failed too. When I attempted to shut it down, I discovered a hung
'startup subprocess' that can't be killed.


This is interesting because it seems just about exactly like this
recent Red Hat bug report:
https://bugzilla.redhat.com/bugzilla....cgi?id=126885

As I commented there, I think that it must be a kernel or hardware
issue --- Postgres itself can surely not make an unkillable process.
However it's common to see processes that don't respond to kill if
they are stuck inside a kernel I/O request. That could mean either
unresponsive hardware or a kernel bug.

I wonder whether you have any similarities in hardware or Linux kernel
to the person who filed the above report?

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if your
joining column's datatypes do not match

Nov 23 '05 #2
At Wed, 30 Jun 04, Unidentified Flying Banana Tom Lane, said:
Christopher Cashell <to**********@zyp.org> writes:
Eventually I attempted to shut it down and restart it, however that
failed too. When I attempted to shut it down, I discovered a hung
'startup subprocess' that can't be killed.
This is interesting because it seems just about exactly like this
recent Red Hat bug report:
https://bugzilla.redhat.com/bugzilla....cgi?id=126885


Hrm. Yes, it does appear to be a very similar, if not identical, issue.
As I commented there, I think that it must be a kernel or hardware
issue --- Postgres itself can surely not make an unkillable process.
However it's common to see processes that don't respond to kill if
they are stuck inside a kernel I/O request. That could mean either
unresponsive hardware or a kernel bug.
That is somewhat along the lines of what I was thinking, although I have
had no problems like this before. The machine has been running for over
100 days, and the database as well, without issue.

28424 postgres 18 0 16804 3044 15m D 0.0 1.6 0:06.72 postmaster

Note that it does have a process status of 'D', or uninterruptible
sleep. That would explain the unkillable part, though I'm curious how
it ended up there. Unless it just happened to be in a really bad spot
when Posgres segfaulted. . . although, I wouldn't expect that would
affect the 'startup subprocess'.
I wonder whether you have any similarities in hardware or Linux kernel
to the person who filed the above report?
Here's all the information I can provide for this machine:

IBM IntelliStation Z Pro
Model: 6899-12U
Dual Pentium Pro 200
192MB RAM
4.5 GB IBM SCSI HDD
9 GB IBM SCSI HDD
6.4 GB WD HDD

The database resides on the 4.5 GB SCSI, with the pg_xlog directory
symlinked from there, and actually existing on the 9GB SCSI.

nexus:~$ uname -a
Linux nexus.zyp.org 2.6.4 #1 SMP Thu Mar 11 14:04:49 CST 2004 i686 GNU/Linux
nexus:~$ uptime
21:15:39 up 107 days, 20:57, 7 users, load average: 2.04, 2.31, 2.38

If there's any other information I can provide, please let me know.

I'm going to reboot the box right now, and cross my fingers, hoping
it'll come back up. ;-)
regards, tom lane


--
| Christopher
+------------------------------------------------+
| Here I stand. I can do no other. |
+------------------------------------------------+
---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to ma*******@postgresql.org)

Nov 23 '05 #3
Christopher Cashell <to**********@zyp.org> writes:
28424 postgres 18 0 16804 3044 15m D 0.0 1.6 0:06.72 postmaster Note that it does have a process status of 'D', or uninterruptible
sleep. That would explain the unkillable part, though I'm curious how
it ended up there.


Perhaps I'm just of an older generation, but I always thought 'D' meant
"disk I/O wait". Which definitely is uninterruptible on most Unixen.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings

Nov 23 '05 #4
On Wed, 2004-06-30 at 18:57, Christopher Cashell wrote:
Yesterday, while attempting to access a database, I received errors
saying that the database was innaccessible. After investigating a
little, I found the following in the PostgreSQL log files:

2004-06-30 08:30:19 [24119] LOG: checkpoint process (PID 28423) was
terminated by signal 11 Eventually I attempted to shut it down and restart it, however that
failed too. When I attempted to shut it down, I discovered a hung
'startup subprocess' that can't be killed.

nexus:~# ps aux | grep postgres
postgres 28424 0.0 1.5 16804 3044 pts/313 D 08:35 0:06 postgres:
startup subprocess
nexus:~# kill -9 28424
nexus:~# ps aux | grep postgres
postgres 28424 0.0 1.5 16804 3044 pts/313 D 08:35 0:06 postgres:
startup subprocess
nexus:~#


The combination of a Sig 11 failure and a process stuck in a D state
makes me lean towards thinking it's bad hardware (CPU or memory). Have
you tested this machine?
---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faqs/FAQ.html

Nov 23 '05 #5
On Wed, 2004-06-30 at 21:41, Scott Marlowe wrote:
On Wed, 2004-06-30 at 18:57, Christopher Cashell wrote:
Yesterday, while attempting to access a database, I received errors
saying that the database was innaccessible. After investigating a
little, I found the following in the PostgreSQL log files:

2004-06-30 08:30:19 [24119] LOG: checkpoint process (PID 28423) was
terminated by signal 11

Eventually I attempted to shut it down and restart it, however that
failed too. When I attempted to shut it down, I discovered a hung
'startup subprocess' that can't be killed.

nexus:~# ps aux | grep postgres
postgres 28424 0.0 1.5 16804 3044 pts/313 D 08:35 0:06 postgres:
startup subprocess
nexus:~# kill -9 28424
nexus:~# ps aux | grep postgres
postgres 28424 0.0 1.5 16804 3044 pts/313 D 08:35 0:06 postgres:
startup subprocess
nexus:~#


The combination of a Sig 11 failure and a process stuck in a D state
makes me lean towards thinking it's bad hardware (CPU or memory). Have
you tested this machine?


Oh, and a possibly buggy kernel or kernel module somewhere as well.
Didn't mean to not say it, and have had problems with some kernels under
heavy parallel loads doing stupid things that look just like this.
---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to ma*******@postgresql.org)

Nov 23 '05 #6
At Wed, 30 Jun 04, Unidentified Flying Banana Scott Marlowe, said:
The combination of a Sig 11 failure and a process stuck in a D state
makes me lean towards thinking it's bad hardware (CPU or memory). Have
you tested this machine?


It's possible that it's bad hardware, as the machine is a little long in
the tooth. However, at the same time, it is a *very* well tested box.
It's been in production use for 4 years, and I have yet to experience a
significant hardware issue.

Up until this, prior to when I rebooted it a few hours ago over this
issue, it had been running for 110 days, with the database running
(under approximately the same load) for that whole time.

If it *is* a hardware problem, I would have expected it to show up
somewhat sooner than this.

--
| Christopher
+------------------------------------------------+
| Here I stand. I can do no other. |
+------------------------------------------------+
---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to ma*******@postgresql.org

Nov 23 '05 #7
At Wed, 30 Jun 04, Unidentified Flying Banana Tom Lane, said:
Christopher Cashell <to**********@zyp.org> writes:
28424 postgres 18 0 16804 3044 15m D 0.0 1.6 0:06.72 postmaster
Note that it does have a process status of 'D', or uninterruptible
sleep. That would explain the unkillable part, though I'm curious how
it ended up there.


Perhaps I'm just of an older generation, but I always thought 'D' meant
"disk I/O wait". Which definitely is uninterruptible on most Unixen.


I've always heard it as 'uninteruptable sleep', which I understood to be
the generic term description, and that 'disk I/O wait' is the most
common cause for it (but not the only possible cause).

Either way, I think you're right, and I think that is what happened
here.
regards, tom lane


--
| Christopher
+------------------------------------------------+
| Here I stand. I can do no other. |
+------------------------------------------------+
---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to ma*******@postgresql.org so that your
message can get through to the mailing list cleanly

Nov 23 '05 #8
Christopher Cashell <to**********@zyp.org> writes:
If it *is* a hardware problem, I would have expected it to show up
somewhat sooner than this.


[sorry, hit Send a bit too early]

Parts do go bad. Sig11 is often an indication of either bad RAM or an
overheating processor. Have you checked your fans, especially since
it's an old machine?

-Doug

---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://archives.postgresql.org

Nov 23 '05 #9
Christopher Cashell <to**********@zyp.org> writes:
If it *is* a hardware problem, I would have expected it to show up
somewhat sooner than this.


Parts do go bad. Sig11 is often an indication of ether bad RAM

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

Nov 23 '05 #10
Christopher Cashell <to**********@zyp.org> writes:
Either way, I think you're right, and I think that is what happened
here.


So did it come up after you rebooted? The other guy wasn't having
any luck that way :-(

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if your
joining column's datatypes do not match

Nov 23 '05 #11
At Thu, 01 Jul 04, Unidentified Flying Banana Tom Lane, said:
Christopher Cashell <to**********@zyp.org> writes:
Either way, I think you're right, and I think that is what happened
here.
So did it come up after you rebooted? The other guy wasn't having
any luck that way :-(


Yep. ;-)

In a testament to PostgreSQL's robustness, after rebooting the machine,
things went mostly[1] well. Postgres started up without issue, programs
made their connections to the database, and queries were happily made.

Everything looks to be working perfectly now.

And luckily, this machine, though it holds critical data, isn't a
time/immediate access critical machine. So having it down for a few
hours wasn't any kind of a problem, other than for my blood pressure.

Thank you Tom, and the rest of you, for your help. It's greatly
appreciated.
regards, tom lane

[1] Upon rebooting, the autovacuum utility kind of harfed. Here is
the log entries that it made:

---
[2004-06-30 08:44:53 AM] Failed connection to database template1
with error: FATAL: the database system is starting up
Nov 23 '05 #12
Christopher Cashell <to**********@zyp.org> writes:
[2004-06-30 08:44:53 AM] Failed connection to database template1
with error: FATAL: the database system is starting up
.
[2004-06-30 08:44:53 AM] Failed connection to database template1
with error: FATAL: the database system is starting up
.
[2004-06-30 08:44:53 AM] Error: Cannot connect to template1,
exiting. So, I shut down Postgres, then restarted Postgres, then restarted
the autovacuum utility, and everything worked just peachy. I'm
guessing that perhaps the autovacuum tool was trying to connect to
Postgres while it was replaying the transaction log from not
having been shut down cleanly, and that's why it choked, but I
don't know that for sure.


Yeah, that's what it looks like to me --- autovacuum just a bit too
quick to give up. You could've just restarted autovacuum once the
database was up.

In 7.5 I think autovacuum will be integrated, and the postmaster won't
bother to start it till the startup sequence is done, so this won't
be an issue.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to ma*******@postgresql.org

Nov 23 '05 #13
Tom Lane <tg*@sss.pgh.pa.us> writes:
Christopher Cashell <to**********@zyp.org> writes:
28424 postgres 18 0 16804 3044 15m D 0.0 1.6 0:06.72 postmaster

Note that it does have a process status of 'D', or uninterruptible
sleep. That would explain the unkillable part, though I'm curious how
it ended up there.


Is there an NFS server involved? If an NFS server disappears any process
waiting on I/O for it enters disk-wait indefinitely until it reappears.

--
greg
---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend

Nov 23 '05 #14

This discussion thread is closed

Replies have been disabled for this discussion.

Similar topics

1 post views Thread by Subhash | last post: by
reply views Thread by Martin Gill | last post: by
8 posts views Thread by BARTKO, Zoltán | last post: by
1 post views Thread by tony.fountaine@gmail.com | last post: by
4 posts views Thread by Dan Stromberg | last post: by
By using this site, you agree to our Privacy Policy and Terms of Use.