473,395 Members | 1,584 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,395 software developers and data experts.

Problems restarting after database crashed (signal 11).

Yesterday, while attempting to access a database, I received errors
saying that the database was innaccessible. After investigating a
little, I found the following in the PostgreSQL log files:

2004-06-30 08:30:19 [24119] LOG: checkpoint process (PID 28423) was
terminated by signal 11
2004-06-30 08:30:19 [24119] LOG: terminating any other active server
processes
2004-06-30 08:30:19 [28383] WARNING: terminating connection because of
crash of another server process
DETAIL: The postmaster has commanded this server process to roll back
the curre nt transaction and exit, because another server process exited
abnormally and po ssibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and
repeat yo ur command.
2004-06-30 08:30:19 [28362] WARNING: terminating connection because of
crash of another server process
DETAIL: The postmaster has commanded this server process to roll back
the curre nt transaction and exit, because another server process exited
abnormally and po ssibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and
repeat yo ur command.

The last bit then repeated a few more times, and then:

2004-06-30 08:30:20 [24119] LOG: all server processes terminated;
reinitializing
2004-06-30 08:30:20 [28424] LOG: database system was interrupted at 2004-06-30
08:22:23 CDT
2004-06-30 08:30:20 [28424] LOG: checkpoint record is at 8/77703F9C
2004-06-30 08:30:20 [28424] LOG: redo record is at 8/775B1D38; undo
record is at 0/0; shutdown FALSE
2004-06-30 08:30:20 [28424] LOG: next transaction ID: 1638554; next
OID: 1058492
2004-06-30 08:30:20 [28424] LOG: database system was not properly shut
down; automatic recovery in progress
2004-06-30 08:30:20 [28424] LOG: redo starts at 8/775B1D38
2004-06-30 08:30:21 [28430] LOG: connection received: host=[local] port=
2004-06-30 08:30:21 [28430] FATAL: the database system is starting up
2004-06-30 08:30:38 [28424] LOG: record with zero length at 8/78855F38
2004-06-30 08:30:38 [28424] LOG: redo done at 8/78853EE0
2004-06-30 08:31:40 [28449] LOG: connection received: host=[local] port=
2004-06-30 08:31:40 [28449] FATAL: the database system is starting up
2004-06-30 08:31:48 [28452] LOG: connection received: host=[local] port=
2004-06-30 08:31:48 [28452] FATAL: the database system is starting up
2004-06-30 08:31:53 [28459] LOG: connection received: host=[local] port=
2004-06-30 08:31:53 [28459] FATAL: the database system is starting up

And this then continues on and on. Even 20 minutes later, attempts to
connect to the database were met with the same FATAL error.

Eventually I attempted to shut it down and restart it, however that
failed too. When I attempted to shut it down, I discovered a hung
'startup subprocess' that can't be killed.

nexus:~# ps aux | grep postgres
postgres 28424 0.0 1.5 16804 3044 pts/313 D 08:35 0:06 postgres:
startup subprocess
nexus:~# kill -9 28424
nexus:~# ps aux | grep postgres
postgres 28424 0.0 1.5 16804 3044 pts/313 D 08:35 0:06 postgres:
startup subprocess
nexus:~#

As soon as I can get physical access to the machine, I'm planning to
reboot it, as I can't think of anything else to do to kill a process
that can't be kill -KILL'ed.

I'm worried that attempting to start the database after rebooting will
fail in the same way, however. Has anyone seen anything like this
before, or have any ideas on how to proceed?

I'm running on an Intel Pentium Pro box, with Debian/GNU Linux, running
'unstable'. I'm using PostgreSQL 7.4.3.

Thank you for your help.

--
| Christopher
+------------------------------------------------+
| Here I stand. I can do no other. |
+------------------------------------------------+
---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to ma*******@postgresql.org)

Nov 23 '05 #1
13 3089
Christopher Cashell <to**********@zyp.org> writes:
Eventually I attempted to shut it down and restart it, however that
failed too. When I attempted to shut it down, I discovered a hung
'startup subprocess' that can't be killed.


This is interesting because it seems just about exactly like this
recent Red Hat bug report:
https://bugzilla.redhat.com/bugzilla....cgi?id=126885

As I commented there, I think that it must be a kernel or hardware
issue --- Postgres itself can surely not make an unkillable process.
However it's common to see processes that don't respond to kill if
they are stuck inside a kernel I/O request. That could mean either
unresponsive hardware or a kernel bug.

I wonder whether you have any similarities in hardware or Linux kernel
to the person who filed the above report?

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if your
joining column's datatypes do not match

Nov 23 '05 #2
At Wed, 30 Jun 04, Unidentified Flying Banana Tom Lane, said:
Christopher Cashell <to**********@zyp.org> writes:
Eventually I attempted to shut it down and restart it, however that
failed too. When I attempted to shut it down, I discovered a hung
'startup subprocess' that can't be killed.
This is interesting because it seems just about exactly like this
recent Red Hat bug report:
https://bugzilla.redhat.com/bugzilla....cgi?id=126885


Hrm. Yes, it does appear to be a very similar, if not identical, issue.
As I commented there, I think that it must be a kernel or hardware
issue --- Postgres itself can surely not make an unkillable process.
However it's common to see processes that don't respond to kill if
they are stuck inside a kernel I/O request. That could mean either
unresponsive hardware or a kernel bug.
That is somewhat along the lines of what I was thinking, although I have
had no problems like this before. The machine has been running for over
100 days, and the database as well, without issue.

28424 postgres 18 0 16804 3044 15m D 0.0 1.6 0:06.72 postmaster

Note that it does have a process status of 'D', or uninterruptible
sleep. That would explain the unkillable part, though I'm curious how
it ended up there. Unless it just happened to be in a really bad spot
when Posgres segfaulted. . . although, I wouldn't expect that would
affect the 'startup subprocess'.
I wonder whether you have any similarities in hardware or Linux kernel
to the person who filed the above report?
Here's all the information I can provide for this machine:

IBM IntelliStation Z Pro
Model: 6899-12U
Dual Pentium Pro 200
192MB RAM
4.5 GB IBM SCSI HDD
9 GB IBM SCSI HDD
6.4 GB WD HDD

The database resides on the 4.5 GB SCSI, with the pg_xlog directory
symlinked from there, and actually existing on the 9GB SCSI.

nexus:~$ uname -a
Linux nexus.zyp.org 2.6.4 #1 SMP Thu Mar 11 14:04:49 CST 2004 i686 GNU/Linux
nexus:~$ uptime
21:15:39 up 107 days, 20:57, 7 users, load average: 2.04, 2.31, 2.38

If there's any other information I can provide, please let me know.

I'm going to reboot the box right now, and cross my fingers, hoping
it'll come back up. ;-)
regards, tom lane


--
| Christopher
+------------------------------------------------+
| Here I stand. I can do no other. |
+------------------------------------------------+
---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to ma*******@postgresql.org)

Nov 23 '05 #3
Christopher Cashell <to**********@zyp.org> writes:
28424 postgres 18 0 16804 3044 15m D 0.0 1.6 0:06.72 postmaster Note that it does have a process status of 'D', or uninterruptible
sleep. That would explain the unkillable part, though I'm curious how
it ended up there.


Perhaps I'm just of an older generation, but I always thought 'D' meant
"disk I/O wait". Which definitely is uninterruptible on most Unixen.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings

Nov 23 '05 #4
On Wed, 2004-06-30 at 18:57, Christopher Cashell wrote:
Yesterday, while attempting to access a database, I received errors
saying that the database was innaccessible. After investigating a
little, I found the following in the PostgreSQL log files:

2004-06-30 08:30:19 [24119] LOG: checkpoint process (PID 28423) was
terminated by signal 11 Eventually I attempted to shut it down and restart it, however that
failed too. When I attempted to shut it down, I discovered a hung
'startup subprocess' that can't be killed.

nexus:~# ps aux | grep postgres
postgres 28424 0.0 1.5 16804 3044 pts/313 D 08:35 0:06 postgres:
startup subprocess
nexus:~# kill -9 28424
nexus:~# ps aux | grep postgres
postgres 28424 0.0 1.5 16804 3044 pts/313 D 08:35 0:06 postgres:
startup subprocess
nexus:~#


The combination of a Sig 11 failure and a process stuck in a D state
makes me lean towards thinking it's bad hardware (CPU or memory). Have
you tested this machine?
---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faqs/FAQ.html

Nov 23 '05 #5
On Wed, 2004-06-30 at 21:41, Scott Marlowe wrote:
On Wed, 2004-06-30 at 18:57, Christopher Cashell wrote:
Yesterday, while attempting to access a database, I received errors
saying that the database was innaccessible. After investigating a
little, I found the following in the PostgreSQL log files:

2004-06-30 08:30:19 [24119] LOG: checkpoint process (PID 28423) was
terminated by signal 11

Eventually I attempted to shut it down and restart it, however that
failed too. When I attempted to shut it down, I discovered a hung
'startup subprocess' that can't be killed.

nexus:~# ps aux | grep postgres
postgres 28424 0.0 1.5 16804 3044 pts/313 D 08:35 0:06 postgres:
startup subprocess
nexus:~# kill -9 28424
nexus:~# ps aux | grep postgres
postgres 28424 0.0 1.5 16804 3044 pts/313 D 08:35 0:06 postgres:
startup subprocess
nexus:~#


The combination of a Sig 11 failure and a process stuck in a D state
makes me lean towards thinking it's bad hardware (CPU or memory). Have
you tested this machine?


Oh, and a possibly buggy kernel or kernel module somewhere as well.
Didn't mean to not say it, and have had problems with some kernels under
heavy parallel loads doing stupid things that look just like this.
---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to ma*******@postgresql.org)

Nov 23 '05 #6
At Wed, 30 Jun 04, Unidentified Flying Banana Scott Marlowe, said:
The combination of a Sig 11 failure and a process stuck in a D state
makes me lean towards thinking it's bad hardware (CPU or memory). Have
you tested this machine?


It's possible that it's bad hardware, as the machine is a little long in
the tooth. However, at the same time, it is a *very* well tested box.
It's been in production use for 4 years, and I have yet to experience a
significant hardware issue.

Up until this, prior to when I rebooted it a few hours ago over this
issue, it had been running for 110 days, with the database running
(under approximately the same load) for that whole time.

If it *is* a hardware problem, I would have expected it to show up
somewhat sooner than this.

--
| Christopher
+------------------------------------------------+
| Here I stand. I can do no other. |
+------------------------------------------------+
---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to ma*******@postgresql.org

Nov 23 '05 #7
At Wed, 30 Jun 04, Unidentified Flying Banana Tom Lane, said:
Christopher Cashell <to**********@zyp.org> writes:
28424 postgres 18 0 16804 3044 15m D 0.0 1.6 0:06.72 postmaster
Note that it does have a process status of 'D', or uninterruptible
sleep. That would explain the unkillable part, though I'm curious how
it ended up there.


Perhaps I'm just of an older generation, but I always thought 'D' meant
"disk I/O wait". Which definitely is uninterruptible on most Unixen.


I've always heard it as 'uninteruptable sleep', which I understood to be
the generic term description, and that 'disk I/O wait' is the most
common cause for it (but not the only possible cause).

Either way, I think you're right, and I think that is what happened
here.
regards, tom lane


--
| Christopher
+------------------------------------------------+
| Here I stand. I can do no other. |
+------------------------------------------------+
---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to ma*******@postgresql.org so that your
message can get through to the mailing list cleanly

Nov 23 '05 #8
Christopher Cashell <to**********@zyp.org> writes:
If it *is* a hardware problem, I would have expected it to show up
somewhat sooner than this.


[sorry, hit Send a bit too early]

Parts do go bad. Sig11 is often an indication of either bad RAM or an
overheating processor. Have you checked your fans, especially since
it's an old machine?

-Doug

---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://archives.postgresql.org

Nov 23 '05 #9
Christopher Cashell <to**********@zyp.org> writes:
If it *is* a hardware problem, I would have expected it to show up
somewhat sooner than this.


Parts do go bad. Sig11 is often an indication of ether bad RAM

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

Nov 23 '05 #10
Christopher Cashell <to**********@zyp.org> writes:
Either way, I think you're right, and I think that is what happened
here.


So did it come up after you rebooted? The other guy wasn't having
any luck that way :-(

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if your
joining column's datatypes do not match

Nov 23 '05 #11
At Thu, 01 Jul 04, Unidentified Flying Banana Tom Lane, said:
Christopher Cashell <to**********@zyp.org> writes:
Either way, I think you're right, and I think that is what happened
here.
So did it come up after you rebooted? The other guy wasn't having
any luck that way :-(


Yep. ;-)

In a testament to PostgreSQL's robustness, after rebooting the machine,
things went mostly[1] well. Postgres started up without issue, programs
made their connections to the database, and queries were happily made.

Everything looks to be working perfectly now.

And luckily, this machine, though it holds critical data, isn't a
time/immediate access critical machine. So having it down for a few
hours wasn't any kind of a problem, other than for my blood pressure.

Thank you Tom, and the rest of you, for your help. It's greatly
appreciated.
regards, tom lane

[1] Upon rebooting, the autovacuum utility kind of harfed. Here is
the log entries that it made:

---
[2004-06-30 08:44:53 AM] Failed connection to database template1
with error: FATAL: the database system is starting up
Nov 23 '05 #12
Christopher Cashell <to**********@zyp.org> writes:
[2004-06-30 08:44:53 AM] Failed connection to database template1
with error: FATAL: the database system is starting up
.
[2004-06-30 08:44:53 AM] Failed connection to database template1
with error: FATAL: the database system is starting up
.
[2004-06-30 08:44:53 AM] Error: Cannot connect to template1,
exiting. So, I shut down Postgres, then restarted Postgres, then restarted
the autovacuum utility, and everything worked just peachy. I'm
guessing that perhaps the autovacuum tool was trying to connect to
Postgres while it was replaying the transaction log from not
having been shut down cleanly, and that's why it choked, but I
don't know that for sure.


Yeah, that's what it looks like to me --- autovacuum just a bit too
quick to give up. You could've just restarted autovacuum once the
database was up.

In 7.5 I think autovacuum will be integrated, and the postmaster won't
bother to start it till the startup sequence is done, so this won't
be an issue.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to ma*******@postgresql.org

Nov 23 '05 #13
Tom Lane <tg*@sss.pgh.pa.us> writes:
Christopher Cashell <to**********@zyp.org> writes:
28424 postgres 18 0 16804 3044 15m D 0.0 1.6 0:06.72 postmaster

Note that it does have a process status of 'D', or uninterruptible
sleep. That would explain the unkillable part, though I'm curious how
it ended up there.


Is there an NFS server involved? If an NFS server disappears any process
waiting on I/O for it enters disk-wait indefinitely until it reappears.

--
greg
---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend

Nov 23 '05 #14

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

1
by: Subhash | last post by:
Hello, I am trying to generate dynamic graphics using the GD library and TTF fonts. The resulting image shows up fine. But in the apache error_log file the following lines are appended: child...
1
by: DJTB | last post by:
zodb-dev@zope.org] Hi, I'm having problems storing large amounts of objects in a ZODB. After committing changes to the database, elements are not cleared from memory. Since the number of...
0
by: Martin Gill | last post by:
I am a novice user of MySQL so please excuse my ignorance. I recently installed MySQL and have a third party tool using it to update data in a database. Recently it started creating erorr...
1
by: James | last post by:
I have a VB windows forms application that accesses a Microsoft Access database that has been secured using user-level security. The application is being deployed using No-Touch deployment. The...
2
by: Ron St-Pierre | last post by:
We're developing a java app and are using postgres as the database. On our dev server I started the app, closed it, but the java process was still open so I killed it, which caused the above error....
3
by: Si Chen | last post by:
Hello. It seems that every time I make a change to pg_hba.conf, I have to restart the database server for the new authentication to take effect. Is there a way to have the server use the new...
8
by: BARTKO, Zoltán | last post by:
Hello, folks, I am trying to install pgsql8 on winxp. I tried first to install "as is" with pginstaller beta2-dev3, no luck, it froze, switched off Nod32, froze a little later, ran through the...
1
by: tony.fountaine | last post by:
I am working on a project to read a Bosch Measurement Data File (MDF). The file contains a number of blocks that can be read from the file using a baisc structure. For example the ID BLOCK is as...
4
by: Dan Stromberg | last post by:
I wrote a script(1) replacement in python (http://stromberg.dnsalias.org/ ~dstromberg/pypty/), but I'm encountering a problem in it. I think I know the solution to the problem, but I'd've thought...
0
by: Charles Arthur | last post by:
How do i turn on java script on a villaon, callus and itel keypad mobile phone
0
BarryA
by: BarryA | last post by:
What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...
1
by: Sonnysonu | last post by:
This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...
0
by: Hystou | last post by:
There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...
0
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
0
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...
0
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...
0
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows...
0
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.