Problems restarting after database crashed (signal 11).

Christopher Cashell

Yesterday, while attempting to access a database, I received errors
saying that the database was inaccessible. After investigating a
little, I found the following in the PostgreSQL log files:

2004-06-30 08:30:19 [24119] LOG: checkpoint process (PID 28423) was
terminated by signal 11
2004-06-30 08:30:19 [24119] LOG: terminating any other active server
processes
2004-06-30 08:30:19 [28383] WARNING: terminating connection because of
crash of another server process
DETAIL: The postmaster has commanded this server process to roll back
the current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and
repeat your command.
2004-06-30 08:30:19 [28362] WARNING: terminating connection because of
crash of another server process
DETAIL: The postmaster has commanded this server process to roll back
the current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and
repeat your command.

The last bit then repeated a few more times, and then:

2004-06-30 08:30:20 [24119] LOG: all server processes terminated;
reinitializing
2004-06-30 08:30:20 [28424] LOG: database system was interrupted at 2004-06-30
08:22:23 CDT
2004-06-30 08:30:20 [28424] LOG: checkpoint record is at 8/77703F9C
2004-06-30 08:30:20 [28424] LOG: redo record is at 8/775B1D38; undo
record is at 0/0; shutdown FALSE
2004-06-30 08:30:20 [28424] LOG: next transaction ID: 1638554; next
OID: 1058492
2004-06-30 08:30:20 [28424] LOG: database system was not properly shut
down; automatic recovery in progress
2004-06-30 08:30:20 [28424] LOG: redo starts at 8/775B1D38
2004-06-30 08:30:21 [28430] LOG: connection received: host=[local] port=
2004-06-30 08:30:21 [28430] FATAL: the database system is starting up
2004-06-30 08:30:38 [28424] LOG: record with zero length at 8/78855F38
2004-06-30 08:30:38 [28424] LOG: redo done at 8/78853EE0
2004-06-30 08:31:40 [28449] LOG: connection received: host=[local] port=
2004-06-30 08:31:40 [28449] FATAL: the database system is starting up
2004-06-30 08:31:48 [28452] LOG: connection received: host=[local] port=
2004-06-30 08:31:48 [28452] FATAL: the database system is starting up
2004-06-30 08:31:53 [28459] LOG: connection received: host=[local] port=
2004-06-30 08:31:53 [28459] FATAL: the database system is starting up

And this then continues on and on. Even 20 minutes later, attempts to
connect to the database were met with the same FATAL error.

Eventually I attempted to shut it down and restart it, however that
failed too. When I attempted to shut it down, I discovered a hung
'startup subprocess' that can't be killed.

nexus:~# ps aux | grep postgres
postgres 28424 0.0 1.5 16804 3044 pts/313 D 08:35 0:06 postgres:
startup subprocess
nexus:~# kill -9 28424
nexus:~# ps aux | grep postgres
postgres 28424 0.0 1.5 16804 3044 pts/313 D 08:35 0:06 postgres:
startup subprocess
nexus:~#

As soon as I can get physical access to the machine, I'm planning to
reboot it, as I can't think of anything else to do to kill a process
that can't be kill -KILL'ed.

I'm worried that attempting to start the database after rebooting will
fail in the same way, however. Has anyone seen anything like this
before, or have any ideas on how to proceed?

I'm running on an Intel Pentium Pro box, with Debian GNU/Linux, running
'unstable'. I'm using PostgreSQL 7.4.3.

Thank you for your help.

--
| Christopher
+------------------------------------------------+
| Here I stand. I can do no other. |
+------------------------------------------------+
---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to ma*******@postgresql.org)

Nov 23 '05 #1


Tom Lane

Christopher Cashell <to**********@zyp.org> writes:

> Eventually I attempted to shut it down and restart it, however that
> failed too. When I attempted to shut it down, I discovered a hung
> 'startup subprocess' that can't be killed.

This is interesting because it seems just about exactly like this
recent Red Hat bug report:
https://bugzilla.redhat.com/bugzilla....cgi?id=126885

As I commented there, I think that it must be a kernel or hardware
issue --- Postgres itself can surely not make an unkillable process.
However it's common to see processes that don't respond to kill if
they are stuck inside a kernel I/O request. That could mean either
unresponsive hardware or a kernel bug.
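
On Linux, the state letter that ps and top display can also be read directly from /proc/<pid>/stat, which makes it easy to check whether a process is sitting in uninterruptible sleep. A minimal Python sketch, assuming a Linux /proc filesystem; the function names are hypothetical and not part of PostgreSQL:

```python
def parse_proc_state(stat_line: str) -> str:
    """Return the one-letter state field from a /proc/<pid>/stat line."""
    # The layout is "pid (comm) state ppid ...". The command name may itself
    # contain spaces or parentheses, so split after the LAST ')'.
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]


def is_uninterruptible(stat_line: str) -> bool:
    """True if the process is in state 'D' (uninterruptible sleep)."""
    return parse_proc_state(stat_line) == "D"


def read_stat(pid: int) -> str:
    """Read the raw stat line for a PID (Linux only)."""
    with open(f"/proc/{pid}/stat") as handle:
        return handle.read()
```

For example, is_uninterruptible(read_stat(28424)) would report whether the stuck process is in state 'D'. A process in that state is not scheduled to handle signals until the kernel request it is waiting on returns, which is why kill -9 appears to do nothing.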

I wonder whether you have any similarities in hardware or Linux kernel
to the person who filed the above report?

regards, tom lane


Nov 23 '05 #2

Christopher Cashell

At Wed, 30 Jun 04, Unidentified Flying Banana Tom Lane, said:

> Christopher Cashell <to**********@zyp.org> writes:
> > Eventually I attempted to shut it down and restart it, however that
> > failed too. When I attempted to shut it down, I discovered a hung
> > 'startup subprocess' that can't be killed.
>
> This is interesting because it seems just about exactly like this
> recent Red Hat bug report:
> https://bugzilla.redhat.com/bugzilla....cgi?id=126885

Hrm. Yes, it does appear to be a very similar, if not identical, issue.

> As I commented there, I think that it must be a kernel or hardware
> issue --- Postgres itself can surely not make an unkillable process.
> However it's common to see processes that don't respond to kill if
> they are stuck inside a kernel I/O request. That could mean either
> unresponsive hardware or a kernel bug.

That is somewhat along the lines of what I was thinking, although I have
had no problems like this before. The machine has been running for over
100 days, and the database as well, without issue.

28424 postgres 18 0 16804 3044 15m D 0.0 1.6 0:06.72 postmaster

Note that it does have a process status of 'D', or uninterruptible
sleep. That would explain the unkillable part, though I'm curious how
it ended up there. Unless it just happened to be in a really bad spot
when Postgres segfaulted. . . although, I wouldn't expect that would
affect the 'startup subprocess'.

> I wonder whether you have any similarities in hardware or Linux kernel
> to the person who filed the above report?

Here's all the information I can provide for this machine:

IBM IntelliStation Z Pro
Model: 6899-12U
Dual Pentium Pro 200
192MB RAM
4.5 GB IBM SCSI HDD
9 GB IBM SCSI HDD
6.4 GB WD HDD

The database resides on the 4.5 GB SCSI, with the pg_xlog directory
symlinked from there, and actually existing on the 9GB SCSI.

nexus:~$ uname -a
Linux nexus.zyp.org 2.6.4 #1 SMP Thu Mar 11 14:04:49 CST 2004 i686 GNU/Linux
nexus:~$ uptime
21:15:39 up 107 days, 20:57, 7 users, load average: 2.04, 2.31, 2.38

If there's any other information I can provide, please let me know.

I'm going to reboot the box right now, and cross my fingers, hoping
it'll come back up. ;-)

> regards, tom lane

--
| Christopher
+------------------------------------------------+
| Here I stand. I can do no other. |
+------------------------------------------------+

Nov 23 '05 #3

Tom Lane

Christopher Cashell <to**********@zyp.org> writes:

> 28424 postgres 18 0 16804 3044 15m D 0.0 1.6 0:06.72 postmaster
>
> Note that it does have a process status of 'D', or uninterruptible
> sleep. That would explain the unkillable part, though I'm curious how
> it ended up there.

Perhaps I'm just of an older generation, but I always thought 'D' meant
"disk I/O wait". Which definitely is uninterruptible on most Unixen.

regards, tom lane


Nov 23 '05 #4

Scott Marlowe

On Wed, 2004-06-30 at 18:57, Christopher Cashell wrote:

> Yesterday, while attempting to access a database, I received errors
> saying that the database was inaccessible. After investigating a
> little, I found the following in the PostgreSQL log files:
>
> 2004-06-30 08:30:19 [24119] LOG: checkpoint process (PID 28423) was
> terminated by signal 11
>
> Eventually I attempted to shut it down and restart it, however that
> failed too. When I attempted to shut it down, I discovered a hung
> 'startup subprocess' that can't be killed.
>
> nexus:~# ps aux | grep postgres
> postgres 28424 0.0 1.5 16804 3044 pts/313 D 08:35 0:06 postgres:
> startup subprocess
> nexus:~# kill -9 28424
> nexus:~# ps aux | grep postgres
> postgres 28424 0.0 1.5 16804 3044 pts/313 D 08:35 0:06 postgres:
> startup subprocess
> nexus:~#

The combination of a Sig 11 failure and a process stuck in a D state
makes me lean towards thinking it's bad hardware (CPU or memory). Have
you tested this machine?

Nov 23 '05 #5

Scott Marlowe

On Wed, 2004-06-30 at 21:41, Scott Marlowe wrote:

> On Wed, 2004-06-30 at 18:57, Christopher Cashell wrote:
> > Yesterday, while attempting to access a database, I received errors
> > saying that the database was inaccessible. After investigating a
> > little, I found the following in the PostgreSQL log files:
> >
> > 2004-06-30 08:30:19 [24119] LOG: checkpoint process (PID 28423) was
> > terminated by signal 11
> >
> > Eventually I attempted to shut it down and restart it, however that
> > failed too. When I attempted to shut it down, I discovered a hung
> > 'startup subprocess' that can't be killed.
> >
> > nexus:~# ps aux | grep postgres
> > postgres 28424 0.0 1.5 16804 3044 pts/313 D 08:35 0:06 postgres:
> > startup subprocess
> > nexus:~# kill -9 28424
> > nexus:~# ps aux | grep postgres
> > postgres 28424 0.0 1.5 16804 3044 pts/313 D 08:35 0:06 postgres:
> > startup subprocess
> > nexus:~#
>
> The combination of a Sig 11 failure and a process stuck in a D state
> makes me lean towards thinking it's bad hardware (CPU or memory). Have
> you tested this machine?

Oh, and possibly a buggy kernel or kernel module somewhere as well.
I didn't mean to leave that out; I have had problems with some kernels
under heavy parallel loads doing stupid things that look just like this.

Nov 23 '05 #6

Christopher Cashell

At Wed, 30 Jun 04, Unidentified Flying Banana Scott Marlowe, said:

> The combination of a Sig 11 failure and a process stuck in a D state
> makes me lean towards thinking it's bad hardware (CPU or memory). Have
> you tested this machine?

It's possible that it's bad hardware, as the machine is a little long in
the tooth. However, at the same time, it is a *very* well tested box.
It's been in production use for 4 years, and I have yet to experience a
significant hardware issue.

Until I rebooted it a few hours ago over this issue, it had been
running for 110 days, with the database running (under approximately
the same load) for that whole time.

If it *is* a hardware problem, I would have expected it to show up
somewhat sooner than this.

--
| Christopher
+------------------------------------------------+
| Here I stand. I can do no other. |
+------------------------------------------------+

Nov 23 '05 #7

Christopher Cashell

At Wed, 30 Jun 04, Unidentified Flying Banana Tom Lane, said:

> Christopher Cashell <to**********@zyp.org> writes:
> > 28424 postgres 18 0 16804 3044 15m D 0.0 1.6 0:06.72 postmaster
> >
> > Note that it does have a process status of 'D', or uninterruptible
> > sleep. That would explain the unkillable part, though I'm curious how
> > it ended up there.
>
> Perhaps I'm just of an older generation, but I always thought 'D' meant
> "disk I/O wait". Which definitely is uninterruptible on most Unixen.

I've always heard it as 'uninterruptible sleep', which I understood to be
the generic description, and that 'disk I/O wait' is the most
common cause for it (but not the only possible cause).

Either way, I think you're right, and I think that is what happened
here.

> regards, tom lane

--
| Christopher
+------------------------------------------------+
| Here I stand. I can do no other. |
+------------------------------------------------+

Nov 23 '05 #8

Doug McNaught

Christopher Cashell <to**********@zyp.org> writes:

> If it *is* a hardware problem, I would have expected it to show up
> somewhat sooner than this.

[sorry, hit Send a bit too early]

Parts do go bad. Sig11 is often an indication of either bad RAM or an
overheating processor. Have you checked your fans, especially since
it's an old machine?

-Doug


Nov 23 '05 #9

Doug McNaught

Christopher Cashell <to**********@zyp.org> writes:

> If it *is* a hardware problem, I would have expected it to show up
> somewhat sooner than this.

Parts do go bad. Sig11 is often an indication of either bad RAM


Nov 23 '05 #10

Tom Lane

Christopher Cashell <to**********@zyp.org> writes:

> Either way, I think you're right, and I think that is what happened
> here.

So did it come up after you rebooted? The other guy wasn't having
any luck that way :-(

regards, tom lane


Nov 23 '05 #11

Christopher Cashell

At Thu, 01 Jul 04, Unidentified Flying Banana Tom Lane, said:

> Christopher Cashell <to**********@zyp.org> writes:
> > Either way, I think you're right, and I think that is what happened
> > here.
>
> So did it come up after you rebooted? The other guy wasn't having
> any luck that way :-(

Yep. ;-)

In a testament to PostgreSQL's robustness, after rebooting the machine,
things went mostly[1] well. Postgres started up without issue, programs
made their connections to the database, and queries were happily made.

Everything looks to be working perfectly now.

And luckily, this machine, though it holds critical data, isn't a
time/immediate access critical machine. So having it down for a few
hours wasn't any kind of a problem, other than for my blood pressure.

Thank you Tom, and the rest of you, for your help. It's greatly
appreciated.

> regards, tom lane

[1] Upon rebooting, the autovacuum utility kind of harfed. Here are
the log entries that it made:

---
[2004-06-30 08:44:53 AM] Failed connection to database template1
with error: FATAL: the database system is starting up

Nov 23 '05 #12

Tom Lane

Christopher Cashell <to**********@zyp.org> writes:

> [2004-06-30 08:44:53 AM] Failed connection to database template1
> with error: FATAL: the database system is starting up
> .
> [2004-06-30 08:44:53 AM] Failed connection to database template1
> with error: FATAL: the database system is starting up
> .
> [2004-06-30 08:44:53 AM] Error: Cannot connect to template1,
> exiting.
>
> So, I shut down Postgres, then restarted Postgres, then restarted
> the autovacuum utility, and everything worked just peachy. I'm
> guessing that perhaps the autovacuum tool was trying to connect to
> Postgres while it was replaying the transaction log from not
> having been shut down cleanly, and that's why it choked, but I
> don't know that for sure.

Yeah, that's what it looks like to me --- autovacuum just a bit too
quick to give up. You could've just restarted autovacuum once the
database was up.

In 7.5 I think autovacuum will be integrated, and the postmaster won't
bother to start it till the startup sequence is done, so this won't
be an issue.
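
Until then, a client that starts alongside the server can avoid this by retrying with a pause instead of exiting on the first "starting up" error. A minimal Python sketch of that idea, not how autovacuum itself is written: wait_until_ready and its parameters are hypothetical, and it assumes the supplied connect callable raises ConnectionError while the server is still in recovery (real database drivers raise their own exception types).

```python
import time


def wait_until_ready(connect, attempts=10, delay=1.0, max_delay=30.0,
                     sleep=time.sleep):
    """Call connect() until it succeeds, pausing between failed attempts."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except ConnectionError:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the last error
            sleep(delay)
            delay = min(delay * 2, max_delay)  # back off, up to a ceiling
```

With the defaults the pauses add up to several minutes, which would have covered the short WAL replay seen in the log above. The attempt limit keeps the client from waiting forever if the server never comes up.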

regards, tom lane


Nov 23 '05 #13

Greg Stark

Tom Lane <tg*@sss.pgh.pa.us> writes:

> Christopher Cashell <to**********@zyp.org> writes:
> > 28424 postgres 18 0 16804 3044 15m D 0.0 1.6 0:06.72 postmaster
> >
> > Note that it does have a process status of 'D', or uninterruptible
> > sleep. That would explain the unkillable part, though I'm curious how
> > it ended up there.

Is there an NFS server involved? If an NFS server disappears any process
waiting on I/O for it enters disk-wait indefinitely until it reappears.
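
One quick way to rule this in or out on Linux is to look for NFS entries in /proc/mounts. A minimal Python sketch, assuming the usual "device mountpoint fstype options dump pass" layout of that file; nfs_mounts is a hypothetical helper name:

```python
def nfs_mounts(mounts_text: str) -> list:
    """Return the mount points whose filesystem type is nfs or nfs4."""
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # fields: device, mount point, filesystem type, options, dump, pass
        if len(fields) >= 3 and fields[2] in ("nfs", "nfs4"):
            found.append(fields[1])
    return found


def read_mounts() -> str:
    """Read the kernel's current mount table (Linux only)."""
    with open("/proc/mounts") as handle:
        return handle.read()
```

If nfs_mounts(read_mounts()) comes back empty, no NFS filesystem is mounted and the stall has to come from somewhere else, such as a local disk or controller.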

--
greg

Nov 23 '05 #14
