473,399 Members | 2,159 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,399 software developers and data experts.

DB failure?

PostgreSQL: 7.4.1

Last week, I had a corrupt index on one table with 2 million rows. On a
specific search, the database would SEGV. I dropped and recreated the index
involved in the search, and did a REINDEX on the primary key. That problem
went away.

Now I'm seeing:

db=> select count(*) from messages;
ERROR: could not access status of transaction 859000513
DETAIL: could not open file "/db/pgsql/data/pg_clog/0333": No such file or
directory

db=> select count(*) from message_recipients;
ERROR: invalid page header in block 1238604 of relation
"message_recipients"
The above commands were successful on 8/21.

There are 240 million rows. Dump/reload is not something that would be an
attractive option. Based on previous timings, it could take 2-3 days (this
is with a dual hyper-threaded 2.4 Ghz with 2GB memory and an 8 drive RAID 5)
on a production system.

So far, this database has been all INSERTs - no deletes or updates.

Is there a way to recover this without a dump/reload? We do nightly
backups, but since we don't know when the problem really started, it would
be rather difficult to restore and reapply several million updates.

Wes
---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings

Nov 23 '05 #1
3 1472
Wes
On 8/30/04 11:07 PM, "Wes Palmer" <we****@syntegra.com> wrote:
db=> select count(*) from messages;
ERROR: could not access status of transaction 859000513
DETAIL: could not open file "/db/pgsql/data/pg_clog/0333": No such file or
directory

db=> select count(*) from message_recipients;
ERROR: invalid page header in block 1238604 of relation
"message_recipients"


Uh, oh.. This would appear to be a big problem... I just tried to do a
pg_dumpall. The server SEGV'd around a Gig into the pg_dumpall:

LOG: server process (PID 12541) was terminated by signal 11
LOG: terminating any other active server processes
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and
repeat your command.
LOG: all server processes terminated; reinitializing
LOG: database system was interrupted at 2004-08-31 07:13:31 MEST
LOG: checkpoint record is at 6E/13A916C
LOG: redo record is at 6E/13A916C; undo record is at 0/0; shutdown FALSE
LOG: next transaction ID: 173895; next OID: 243689524
LOG: database system was not properly shut down; automatic recovery in
progress
LOG: record with zero length at 6E/13A91AC
LOG: redo is not required
LOG: recycled transaction log file "0000006E00000000"
LOG: database system is ready

---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend

Nov 23 '05 #2
Time to test your memory and harddisk....

On Tue, Aug 31, 2004 at 12:24:53AM -0500, Wes wrote:
On 8/30/04 11:07 PM, "Wes Palmer" <we****@syntegra.com> wrote:
db=> select count(*) from messages;
ERROR: could not access status of transaction 859000513
DETAIL: could not open file "/db/pgsql/data/pg_clog/0333": No such file or
directory

db=> select count(*) from message_recipients;
ERROR: invalid page header in block 1238604 of relation
"message_recipients"
Uh, oh.. This would appear to be a big problem... I just tried to do a
pg_dumpall. The server SEGV'd around a Gig into the pg_dumpall:

LOG: server process (PID 12541) was terminated by signal 11
LOG: terminating any other active server processes
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and
repeat your command.
LOG: all server processes terminated; reinitializing
LOG: database system was interrupted at 2004-08-31 07:13:31 MEST
LOG: checkpoint record is at 6E/13A916C
LOG: redo record is at 6E/13A916C; undo record is at 0/0; shutdown FALSE
LOG: next transaction ID: 173895; next OID: 243689524
LOG: database system was not properly shut down; automatic recovery in
progress
LOG: record with zero length at 6E/13A91AC
LOG: redo is not required
LOG: recycled transaction log file "0000006E00000000"
LOG: database system is ready



---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend


--
Martijn van Oosterhout <kl*****@svana.org> http://svana.org/kleptog/ Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
tool for doing 5% of the work and then sitting around waiting for someone
else to do the other 95% so you can sue them.


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.6 (GNU/Linux)
Comment: For info see http://www.gnupg.org

iD8DBQFBNEEPY5Twig3Ge+YRAny9AJ90y4XCC7XQhnzuydKzIc 1czTocyQCg0HIJ
jGX3OCOTrU1JX0IqmBOkrrw=
=7209
-----END PGP SIGNATURE-----

Nov 23 '05 #3
On Mon, 30 Aug 2004, Wes Palmer wrote:
PostgreSQL: 7.4.1

Last week, I had a corrupt index on one table with 2 million rows. On a
specific search, the database would SEGV. I dropped and recreated the index
involved in the search, and did a REINDEX on the primary key. That problem
went away.

Now I'm seeing:

db=> select count(*) from messages;
ERROR: could not access status of transaction 859000513
DETAIL: could not open file "/db/pgsql/data/pg_clog/0333": No such file or
directory


I saw the above 2 types of errors (transaction status, segv) yesterday on
a box that turned out to have issues writing files correctly to disk. I
wrote a tool to write a large file to disk and then reopened the file
to read/verify the contents and it would fail every so often (like 1-5% of
the time.) What puzzles me is that the machine would work at all with an
issue like that. Turning on/off the battery backed cache had no effect.

The machine has ECC memory, but I tested that as well but it turned up
nothing. I am using PostgreSQL 7.4.1.

---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings

Nov 23 '05 #4

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

10
by: x2164 | last post by:
hi all, Linux 2.4.28 Glibc 2.2.5 gcc 2.95.3 I'm new to Python. I've compiled Python 2.4 from tar file.
3
by: Damaji Jambhale | last post by:
COMException: Catostrphic failure When I added a "dll" reference in the web project. I was able to instantiate the class OK. But when I tried to set the properties, it failed with...
2
by: JustaCowboy | last post by:
Greetings, I am seeking information related to this subject. BOL suggests backing up the active transaction log immediately after a failure, so that the backup can be used in a recovery scenario...
4
by: J. Marshall Latham | last post by:
I have written an ASP.NET web app in C# that is trying to connect to a database using OleDb. I put code in a dll that uses another dll to create a connection object (and open it if requested) to...
5
by: Ron Louzon | last post by:
I have some C++ code that uses the CSingleLock( CCriticalSection *) constructor. In visual C++ 6.0, this code compiles and runs fine in both Debug and release modes. However, in Visual Studio...
8
by: Antony | last post by:
compiler£ºVisual Studio.Net 2003 (VC7.1) compile type£ºDebug problem: wanted more information about the "Run-Time Check Failure #n",thanks! Example1: #include "stdafx.h" void malice() {...
0
by: Marty Cruise | last post by:
I successfully deploy my application to 20 domain users. Only one new user is giving me a problem, although he can access all domain resources. When he clicks the installation link on the...
66
by: Johan Tibell | last post by:
I've written a piece of code that uses sockets a lot (I know that sockets aren't portable C, this is not a question about sockets per se). Much of my code ended up looking like this: if...
1
by: vierling | last post by:
As a digibetic I don't know how to repair Jscript failure (run time failure rule 7 and 5): indicating : object is expected. This is what MS Script Editor tells me, but it doesnot tell me how to...
8
by: =?Utf-8?B?TWFyaw==?= | last post by:
We've got a wierd failure happening on just one machine. One part of our product uses a 3rd party search implementation (dtSearch). DtSearch has a native core (dten600.dll), late-bound, and a...
1
by: nemocccc | last post by:
hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?
0
by: Hystou | last post by:
There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...
0
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
0
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...
0
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...
0
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows...
0
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each...
0
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing,...
0
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.