Bytes IT Community

disaster recovery

We are evaluating Postgres and would like some input about disaster recovery. I know MS SQL Server has a feature called transaction
logs that enables a database to be put back together from those logs. Does Postgres do anything like this? I saw transaction logging
mentioned in the documentation, but I don't know if it is the same thing. Where can I find info about disaster recovery in Postgres? Thank you in advance
for any info given.

Jason Tesser
Web/Multimedia Programmer
Northland Ministries Inc.
(715)324-6900 x3050
---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to ma*******@postgresql.org)

Nov 12 '05 #1
18 Replies


Jason Tesser wrote:
We are evaluating Postgres and would like some input about disaster recovery.
I'm going to try to communicate what I understand, and other list
members can correct me at their selected level of vehemence :)
Please send corrections to the list - I may take days to post follow-ups.
I know in MsSQL they have a feature called transactional
logs that would enable a database to be put back together based off those logs.


A roughly parallel concept in PostgreSQL (what's the correct
capitalisation and spelling?) is the "Write Ahead Log" (WAL). There is
also a quite dissimilar concept called the query log, which is useful to
inspect for common queries when tuning the database, but is not replayable.

The theory is that given a PostgreSQL database and the respective WAL,
you can recreate the database to the time that the last entry of the WAL
was written to disk.

Some caveats though:
1) Under Linux, if you have the file system containing the WAL mounted
with asynchronous writes, "all bets are off". The *BSD crowd (that I
know of) take great pleasure in constantly reminding me that if the
power fails, my file system will be in an indeterminate state - things
could be half-written all over the file system.
2) If you're using IDE drives, under any operating system, and have
write caching turned on in the drives themselves, again "all bets
are off".
3) If you're using IDE drives behind a RAID controller, YMMV.

So to play things safe, one recommendation to ensure database robustness
is to:
1) Store the WAL on a separate physical drive
2) Under Linux, mount that file system with synchronous writes (ie:
fsync won't return until the data is actually, really, written to the
interface)
3) If using IDE drives, turn off write caching on the WAL volume so that
you know data is actually written to disk when the drive claims it is.

Note that disabling write caching will impact write performance
significantly. Most people *want* write caching turned on for
throughput-critical file systems, and turned off for mission-critical
file systems.

Note too that SCSI systems tend to have no "write cache" as such, since
they use "tagged command queues". The OS can say to the SCSI drive
something that is effectively, "here are 15 blocks of data to write to
disk, get back to me when the last one is actually written to the
media", and continue on its way. On IDE, the OS can only have one
command outstanding - the purpose of the write cache is to allow
multiple commands to be received and "acknowledged" before any data is
actually written to the media.

When the host is correctly configured, you can recover a PostgreSQL
database from a hardware failure by recovering the database file itself
and "replaying" the WAL to that database.

Read more about WAL here:
http://www.postgresql.org/docs/current/static/wal.html

Regards
Alex
PS: Please send corrections to the list
PPS: Don't forget to include "fire drills" as part of your disaster
recovery plan - get plenty of practice at recovering a database from a
crashed machine so that you don't make mistakes when the time comes that
you actually need to do it!
PPPS: And follow your own advice ;)

Nov 12 '05 #2

Alex Satrapa <al**@lintelsys.com.au> writes:
1) Under Linux, if you have the file system containing the WAL mounted
with asynchronous writes, "all bets are off". The *BSD crowd (that I
know of) take great pleasure in constantly reminding me that if the
power fails, my file system will be in an indeterminate state - things
could be half-written all over the file system.


This is pretty out of date. If you use a journaling filesystem
(there are four solid ones available and modern distros use them)
metadata is consistent and crash recovery is fast.

Even with ext2, WAL files are preallocated and PG calls fsync() after
writing, so in practice it's not likely to cause problems.

-Doug


Nov 12 '05 #3

Alex Satrapa wrote:
Some caveats though:
1) Under Linux, if you have the file system containing the WAL mounted
with asynchronous writes, "all bets are off". The *BSD crowd (that I
know of) take great pleasure in constantly reminding me that if the
power fails, my file system will be in an indeterminate state - things
could be half-written all over the file system.


This is only a problem for ext2. Ext3, Reiser, XFS, JFS are all fine,
though you get better performance from them by mounting them
'writeback'.

--
Bruce Momjian | http://candle.pha.pa.us
pg***@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


Nov 12 '05 #4

Doug McNaught <do**@mcnaught.org> writes:
Alex Satrapa <al**@lintelsys.com.au> writes:
1) Under Linux, if you have the file system containing the WAL mounted
with asynchronous writes, "all bets are off".

...
Even with ext2, WAL files are preallocated and PG calls fsync() after
writing, so in practice it's not likely to cause problems.


Um. I took the reference to "mounted with async write" to mean a
soft-mounted NFS filesystem. It does not matter which OS you think is
the one true OS --- running a database over NFS is the act of someone
with a death wish. But, yeah, soft-mounted NFS is a particularly
malevolent variety ...

regards, tom lane


Nov 12 '05 #5

Tom Lane <tg*@sss.pgh.pa.us> writes:
Doug McNaught <do**@mcnaught.org> writes:
Alex Satrapa <al**@lintelsys.com.au> writes:
1) Under Linux, if you have the file system containing the WAL mounted
with asynchronous writes, "all bets are off".

...
Even with ext2, WAL files are preallocated and PG calls fsync() after
writing, so in practice it's not likely to cause problems.


Um. I took the reference to "mounted with async write" to mean a
soft-mounted NFS filesystem. It does not matter which OS you think is
the one true OS --- running a database over NFS is the act of someone
with a death wish. But, yeah, soft-mounted NFS is a particularly
malevolent variety ...


I took it as a garbled understanding of the "Linux does async metadata
updates" criticism. Which is true for ext2, but was never the
show-stopper some BSD-ers wanted it to be. :)

-Doug


Nov 12 '05 #6

Doug McNaught wrote:
I took it as a garbled understanding of the "Linux does async metadata
updates" criticism. Which is true for ext2, but was never the
show-stopper some BSD-ers wanted it to be. :)


I have on several occasions demonstrated how "bad" asynchronous writes
are to a BSD-bigot by pulling the plug on a mail server (with a
terminal on another machine showing the results of tail -f
/var/log/mail.log), then showing that when the machine comes back up the
most we've ever lost is one message.

From the BSD-bigot's point of view, this is equivalent to the end of
the world as we know it.

From my point of view, it's just support for my demands to have each
mission-critical server supported by a UPS, if not redundant power
supplies and two UPSes.

Alex

Nov 12 '05 #7

> From my point of view, it's just support for my demands to have each
> mission-critical server supported by a UPS, if not redundant power
> supplies and two UPSes.


Never had a kernel panic? I've had a few. Probably flakey hardware. I
feel safer since journalling file systems hit linux.

Nov 12 '05 #8

Craig O'Shannessy wrote:
Never had a kernel panic? I've had a few. Probably flakey hardware. I
feel safer since journalling file systems hit linux.


The only kernel panic I've ever had was when playing with a development
version of the kernel (2.3.x). Never played with development kernels
since then - I'm a user, not a developer.

All the outages I've experienced so far have been due to external
factors such as (in order of frequency):
- Colocation facility technicians repatching panels and
putting my connection "back" into the wrong port
- Colo facility power failure (we were told they had dual
redundant diesel+battery UPS, but they only had one, the
second was being installed "any time now")
- End user's machines crashing
- Client software crashing
- Colo facility techs ripping power cables or network
cables while "cleaning up" cable trays
- Hard drive failure (hard, fast and very real - one
revolution the drive was working, the next it was a
charred blackened mess of fibreglass, silicon and
aluminium)

I have to admit that in none of those cases would synchronous vs
asynchronous, journalling vs non-journalling or *any* file system
decision have made the slightest jot of a difference to the integrity of
my data.

I've yet to experience a CPU failure (touch wood!).

Nov 12 '05 #9

al**@lintelsys.com.au (Alex Satrapa) writes:
Craig O'Shannessy wrote:
Never had a kernel panic? I've had a few. Probably flakey hardware. I
feel safer since journalling file systems hit linux.


The only kernel panic I've ever had was when playing with a
development version of the kernel (2.3.x). Never played with
development kernels since then - I'm a user, not a developer.


You apparently don't "get out enough;" while Linux is certainly a lot
more reliable than systems that need to be rebooted every few days so
that they don't spontaneously reboot, perfection is not to be had:

1. Flakey hardware can _always_ take things down.

A buggy video card and/or X driver can and will take systems down
in a flash. (And this problem shouldn't leave *BSD folk feeling
comfortable; they have no "silver bullet" against this
problem...)

2. Devices that pretend to be SCSI devices have a history of being
troublesome. I have encountered kernel panics as a result of
IDE-CDROMs, USB memory card readers, and the USB Palm interface
going 'flakey.'

3) There's an oft-heavily-loaded system that I have been working with
that has occasionally kernel panicked. I haven't been able to get
enough error messages out of it to track it down.

Note that none of these scenarios have anything to do with
"development kernels;" in ALL these cases, I have experienced the
problems when running "production" kernels.

There have been times when I have tracked "bleeding edge" kernels; I
never, in those times, experienced data loss, although there have,
historically, been experimental versions which did break so badly as
to trash filesystems.

I have seen a LOT more kernel panics in "production" versions than in
"experimental" versions, personally; the notion that avoiding "dev"
kernels will eliminate kernel panics is just fantasy.

Production kernels can't prevent disk hardware from being flakey;
that, alone, is point enough.
--
let name="cbbrowne" and tld="libertyrms.info" in String.concat "@" [name;tld];;
<http://dev6.int.libertyrms.com/>
Christopher Browne
(416) 646 3304 x124 (land)
Nov 12 '05 #10

On Thu, 27 Nov 2003, Doug McNaught wrote:
Tom Lane <tg*@sss.pgh.pa.us> writes:
Doug McNaught <do**@mcnaught.org> writes:
Alex Satrapa <al**@lintelsys.com.au> writes:
1) Under Linux, if you have the file system containing the WAL mounted
with asynchronous writes, "all bets are off".
...
Even with ext2, WAL files are preallocated and PG calls fsync() after
writing, so in practice it's not likely to cause problems.


Um. I took the reference to "mounted with async write" to mean a
soft-mounted NFS filesystem. It does not matter which OS you think is
the one true OS --- running a database over NFS is the act of someone
with a death wish. But, yeah, soft-mounted NFS is a particularly
malevolent variety ...


I took it as a garbled understanding of the "Linux does async metadata
updates" criticism. Which is true for ext2, but was never the
show-stopper some BSD-ers wanted it to be. :)


And it's not file metadata, it's directory data. Metadata (inode
data) is synced, even in ext2, AFAIK.

Quoting the man page:

    fsync copies all in-core parts of a file to disk, and waits
    until the device reports that all parts are on stable storage.
    It also updates metadata stat information. It does not
    necessarily ensure that the entry in the directory containing
    the file has also reached disk. For that an explicit fsync on
    the file descriptor of the directory is also needed.

For WALs, this is perfectly fine. It can be a problem for those
applications that do a lot of renames and rely on those as
sync/locking mechanisms (think of mail spoolers).

..TM.
--
Marco Colombo
Technical Manager, ESI s.r.l.

Nov 12 '05 #11

On Fri, 28 Nov 2003, Alex Satrapa wrote:
Doug McNaught wrote:
I took it as a garbled understanding of the "Linux does async metadata
updates" criticism. Which is true for ext2, but was never the
show-stopper some BSD-ers wanted it to be. :)
I have on several occasions demonstrated how "bad" asynchronous writes
are to a BSD-bigot by pulling the plug on a mail server (having a
terminal on another machine showing the results of tail -f
/var/log/mail.log), then showing that when the machine comes back up the
most we've ever lost is one message


Sorry, I can't resist. Posting this on a PostgreSQL list is too funny.
This is the last place they want to hear about a lost transaction. Even
just one.

The problem is (was) with programs using _directory_ operations as
synchronization primitives, and mail spoolers (say, MTAs) are typical
in that. Beware that in the MTA world losing a single message is
just as bad as losing a committed transaction for DB people. MTAs
are expected to return "OK, I received and stored the message."
only _after_ they have committed it to disk in a safe manner (that's because
the other side is allowed to delete its copy after seeing the "OK").
The only acceptable failure behaviour for MTAs (which is of course
unacceptable in the DB world) is to deliver _two_ copies
of a message, but _never_ zero (message lost). That's what might happen
if something crashes (or the connection is lost) _after_ the MTA committed
the message to disk and _before_ the peer received notification of
that. Later the peer will try to send the message again (the receiving
MTA has enough knowledge to detect the duplication, but usually real-world
MTAs don't do that, AFAIK).

My understanding of the problem is: UNIX fsync(), historically,
used to sync also directory data (filename entries) before returning.
MTAs used to call rename()/fsync() or link()/unlink()/fsync()
sequences to "commit" a message to disk. In Linux, fsync() is
documented _not_ to sync directory data, "just" file data and metadata
(inode). While the UNIX behaviour turned out to be very useful,
personally I don't think Linux fsync() is broken/buggy. A file in
UNIX is just that, data blocks and inode. Syncing directory data
was just a (useful) side-effect of one implementation. In Linux,
an explicit fsync() on the directory itself is needed (and in each
path component if you changed one of them too), if you want to
commit changes to disk. Doing that is just as safe as on any filesystem,
even on ext2 with async writes enabled (it doesn't mean "ignore fsync()"
after all!).

AFAIK, but I might be wrong as I know little of this, PostgreSQL
does not rely on directory operations for commits or WAL writes.
It operates on file _data_ and uses fsync(). That works fine with
ext2 in async writes mode, too, no wonder. No need to mount noasync
or to use chattr -S.

BTW, there's no change in fsync() itself, AFAIK. Some journalled FSes
(maybe _all_ of them) will update directory data on fsync() too,
but that's an implementation detail. In my very personal opinion,
any application relying on that is buggy. A directory and a file
are different "objects" in UNIX, and if you need both synced to disk,
you need to call fsync() twice. Note that syncing a file on
most journalled FSes means syncing the journal - _all_ pending writes on
that FS, even those not related to your file. How could the FS
"partially" sync the journal, to sync just _your_ file data and metadata?
That's why directory data gets synced, too. There's no magic in fsync().
> From the BSD-bigot's point of view, this is equivalent to the end of
> the world as we know it.

From anyone's point of view, losing track of a committed transaction
(and an accepted message is just that) is the end of the world.

> From my point of view, it's just support for my demands to have each
> mission-critical server supported by a UPS, if not redundant power
> supplies and two UPSes.


Of course. The OS can only be sure it delivered the data to the disk.
If the disk lies about having actually stored it on the platters (as IDE
disks do), there's still a window of vulnerability. What I don't
really get is how SCSI disks can avoid lying about writes and at the same
time not show performance degradation on writes compared to their
IDE cousins. How any disk mechanics can perform at the same speed as
DRAM is beyond my understanding (even if those mechanics are three times
as expensive as IDE ones).

..TM.
--
Marco Colombo
Technical Manager, ESI s.r.l.

Nov 12 '05 #12

On Fri, 28 Nov 2003, Craig O'Shannessy wrote:

From my point of view, it's just support for my demands to have each
mission-critical server supported by a UPS, if not redundant power
supplies and two UPSes.


Never had a kernel panic? I've had a few. Probably flakey hardware. I
feel safer since journalling file systems hit linux.


On any hardware flakey enough to cause panics, no FS code will save
you. The FS may "reliably" write total rubbish to disk. It may have been
doing that for hours, trashing the whole FS structure, before something
triggered the panic.
You are no safer with a journal than you are with a plain FAT (or any
other FS technology). Journal files get corrupted themselves.

..TM.
--
Marco Colombo
Technical Manager, ESI s.r.l.

Nov 12 '05 #13

On Fri, 28 Nov 2003, Alex Satrapa wrote:

[...]
I have to admit that in none of those cases would synchronous vs
asynchronous, journalling vs non-journalling or *any* file system
decision have made the slightest jot of a difference to the integrity of
my data.

I've yet to experience a CPU failure (touch wood!).


I have. I have seen memory failures, too. Bits getting flipped at random.
CPUs going mad. Video cards whose text buffer gets overwritten by
"something"... all were HW failures. There's little the SW can do when
the HW fails, just report that, if it gets any chance.
Your data is already (potentially) lost when that happens. Reliably
saving the content of a memory-corrupted buffer to disk will just cause
_more_ damage to your data. That's especially true when the "data" is
filesystem metadata. Horror stories. I still remember the day when
/bin/chmod became of type ? and size +4GB on my home PC (that was
Linux 0.98 on a 100MB HD - with a buggy IDE chipset).

..TM.
--
Marco Colombo
Technical Manager, ESI s.r.l.

Nov 12 '05 #14

On Fri, Nov 28, 2003 at 12:28:25 +0100,
Marco Colombo <ma***@esi.it> wrote:

My understanding of the problem is: UNIX fsync(), historically,
used to sync also directory data (filename entries) before returning.
MTAs used to call rename()/fsync() or link()/unlink()/fsync()
sequences to "commit" a message to disk. In Linux, fsync() is
documented _not_ to sync directory data, "just" file data and metadata
(inode). While the UNIX behaviour turned out to be very useful,
personally I don't think Linux fsync() is broken/buggy. A file in
UNIX is just that, data blocks and inode. Syncing directory data
was just a (useful) side-effect of one implementation. In Linux,
an explicit fsync() on the directory itself is needed (and in each
path component if you changed one of them too), if you want to
commit changes to disk. Doing that is just as safe as on any filesystem,
even on ext2 with async writes enabled (it doesn't mean "ignore fsync()"
after all!).


A new function name should have been used to go along with the new semantics.


Nov 12 '05 #15

> This is only a problem for ext2. Ext3, Reiser, XFS, JFS are all fine,
though you get better performance from them by mounting them
'writeback'.


What does 'writeback' do exactly?


Nov 12 '05 #16

"Rick Gigger" <ri**@alpinenetworking.com> writes:
This is only a problem for ext2. Ext3, Reiser, XFS, JFS are all fine,
though you get better performance from them by mounting them
'writeback'.


What does 'writeback' do exactly?


AFAIK 'writeback' only applies to ext3. The 'data=writeback' setting
journals metadata but not data, so it's faster but may lose file
contents in a crash. For Postgres, which calls fsync() on the
WAL, this is not an issue, since when fsync() returns the file contents
are committed to disk.

AFAIK XFS and JFS are always in 'writeback' mode; I'm not sure about
Reiser.

-Doug


Nov 12 '05 #17

Marco Colombo wrote:
On Fri, 28 Nov 2003, Alex Satrapa wrote:
From the BSD-bigot's point of view, this is equivalent to the end of
the world as we know it.
From anyone's point of view, losing track of a committed transaction
(and an accepted message is just that) is the end of the world.


When hardware fails, you'd be mad to trust the data stored on the
hardware. You can't be sure that the data that's actually on disk is
what was supposed to be there, the whole of what's supposed to be there,
and nothing but what's supposed to be there. You just can't. This
emphasis that some people have on "committing writes to disk" is misplaced.

If the data is really that important, you'd be sending it to three
places at once (one or three, not two - ask any sailor about clocks) -
async or not.
What I don't
really get is how SCSI disks can not lie about writes and at the same
time not show performance degradation on writes compared to their
IDE cousins.
SCSI disks have the advantage of "tagged command queues". A simplified
version of the difference between IDE's single-transaction model and
SCSI's tagged command queue is as follows (this is based on my vague
understanding of SCSI magic):

On an IDE disk, you do this:

PC: here, disk, store this data
Disk: Okay, done
PC: and here's a second block
Disk: Okay, done
.... ad nauseum ...
PC: and here's a ninety-fifth block
Disk: Okay, done.

On a SCSI disk, you do this:
PC: Disk, store these ninety-five blocks, and tell me when you've finished
[time passes]
PC: Oh, can you fetch me some blocks from over there while you're at it?
[time passes]
Disk: Okay, all those writes are done!
[fetching continues]

How any disk mechanics can perform at the same speed of
DRAM is beyond my understanding (even if that mechanics is 3 time
as expensive as IDE one).


It's not the mechanics that are faster; it's just that transferring
stuff to the disk's buffers can be done "asynchronously" - you're not
waiting for previous writes to complete before queuing new writes (or
reads). At the same time, the SCSI disk isn't "lying" to you about
having committed the data to media, since the two stages of request and
confirmation can be separated in time.

So at any time, the disk can have a number of read and write requests
queued up, and it can decide which order to do them in. The OS can
happily go on its way.

At least, that's my understanding.
Alex

Nov 12 '05 #18

On Tue, 2 Dec 2003, Alex Satrapa wrote:
Marco Colombo wrote:
On Fri, 28 Nov 2003, Alex Satrapa wrote:
From the BSD-bigot's point of view, this is equivalent to the end of
the world as we know it.


From anyone's point of view, losing track of a committed transaction
(and an accepted message is just that) is the end of the world.


When hardware fails, you'd be mad to trust the data stored on the
hardware. You can't be sure that the data that's actually on disk is
what was supposed to be there, the whole of what's supposed to be there,
and nothing but what's supposed to be there. You just can't. This
emphasis that some people have on "committing writes to disk" is misplaced.

If the data is really that important, you'd be sending it to three
places at once (one or three, not two - ask any sailor about clocks) -
async or not.


Sure, but we were discussing a 'pull the plug' scenario, not HW failures.
Only RAID (which is a way of sending data to different places)
saves you from a disk failure (if it can be _detected_!), and nothing
saves you from a CPU/RAM failure on a conventional PC (but a second PC
might, if you're lucky). The original problem was ext2 losing _only_ one message after
reboot when someone pulls the plug. The real problem is not the disk, it's
the application returning "OK, COMMITTED" to the other side (which may
be an SMTP client or a PostgreSQL client). IDE tricks these applications
into returning OK _before_ the data hits safe storage (platters). The FS
may play a role too, especially for those applications that use fsync()
on a file to sync directory data too. On many journalled FS, fsync()
triggers a (global) journal write (which can sometimes be a performance
killer), so, as a side effect, a sync of directory data too.

AFAIK, ext2 is safe to use with PostgreSQL, since commits do not involve
any directory operation (if so, I hope PostgreSQL does a fsync() on the
involved directory too). With heavy transaction loads, I guess it will
outperform journalled filesystems, w/o _any_ loss in data safety. I have
no data to back up such a statement, though.

[ ok on the SCSI async behavior ]

..TM.
--
Marco Colombo
Technical Manager, ESI s.r.l.

Nov 12 '05 #19

This discussion thread is closed

Replies have been disabled for this discussion.