
SCSI vs. IDE performance test

http://hardware.devchannel.org/hardw...&tid=38&tid=49

--
-----------------------------------------------------------------
Ron Johnson, Jr. ro***********@cox.net
Jefferson, LA USA

I can't make you have an abortion, but you can *make* me pay
child support for 18 years? However, if I want the child (and
all the expenses that entails) for the *rest*of*my*life*, and you
don't want it for 9 months, tough luck???
---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to ma*******@postgresql.org

Nov 12 '05 #1
36 Replies


The SCSI improvement over IDE seems overrated in the test. I would have
expected at most a 30% improvement. Other reviews seem to point out that IDE
performs just as well or better.

See Tom's Hardware:
http://www20.tomshardware.com/storag...305/index.html

Stephen
"Ron Johnson" <ro***********@cox.net> wrote in message
news:1066837102.12532.176.camel@haggis...
http://hardware.devchannel.org/hardw...9.shtml?tid=20
&tid=38&tid=49

Nov 12 '05 #2

On Wed, 2003-10-22 at 11:01, Stephen wrote:
> The SCSI improvement over IDE seems overrated in the test. I would have
> expected at most a 30% improvement. Other reviews seem to point out that IDE
> performs just as well or better.
>
> See Tom's Hardware:
> http://www20.tomshardware.com/storag...305/index.html
When TCQ becomes a reality in IDE drives, they'll have a fighting
chance, but the slower seek times and rotational speeds will still
do them in.

Also, does an 8MB cache *really* make that much of a difference?
After all, it can only cache 0.0067% of a 120GB drive, and 0.00267%
of the new 300GB disks.
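For scale, the percentages above work out as follows (a quick sketch, using decimal units as drive vendors do):

```python
def cache_fraction(cache_mb, disk_gb):
    # Fraction of the drive that the on-board cache can hold, in percent.
    # Decimal units (1 GB = 1000 MB), as drive vendors advertise capacity.
    return cache_mb / (disk_gb * 1000) * 100

print(round(cache_fraction(8, 120), 4))   # 8MB cache on a 120GB drive
print(round(cache_fraction(8, 300), 5))   # 8MB cache on a 300GB drive
```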

Speaking of which, that 300GB HDD sounds like a dream for near-
line storage, and even for nightly backups, if it is ever put in
SBB-type packaging.
http://www20.tomshardware.com/storag...008/index.html
Imagine a scheme where you rapidly pg_dump to the 300GB drive,
then, at leisure, tar the dump file to tape. Stripe a few together,
and keep a month of backups on-line for quick recovery, along with
the tape archives, in case the stripeset gets wasted, too.
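A minimal sketch of that dump-then-tape scheme; `/mnt/big300/dumps`, `/dev/nst0`, and `mydb` are hypothetical placeholders, not anything from the thread:

```python
import shlex
from datetime import date

# Hypothetical paths -- adjust for your site.
SCRATCH = "/mnt/big300/dumps"   # the 300GB near-line drive (or stripeset)
TAPE = "/dev/nst0"              # non-rewinding tape device

def dump_cmd(dbname, outfile):
    # Step 1: rapidly pg_dump (custom format) to the big scratch drive.
    return ["pg_dump", "-Fc", "-f", outfile, dbname]

def tape_cmd(outfile):
    # Step 2: at leisure, tar the dump file off to tape.
    return ["tar", "-cf", TAPE, outfile]

outfile = f"{SCRATCH}/mydb-{date.today().isoformat()}.dump"
print(shlex.join(dump_cmd("mydb", outfile)))
print(shlex.join(tape_cmd(outfile)))
```

The commands are only printed here; in practice you would hand them to cron or `subprocess.run`.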
"Ron Johnson" <ro***********@cox.net> wrote in message
news:1066837102.12532.176.camel@haggis...

http://hardware.devchannel.org/hardw...9.shtml?tid=20
&tid=38&tid=49


--
-----------------------------------------------------------------
Ron Johnson, Jr. ro***********@cox.net
Jefferson, LA USA

"Adventure is a sign of incompetence"
Stephanson, great polar explorer
---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

Nov 12 '05 #3

I just ran some benchmarks against a 10K SCSI drive and 7200 RPM IDE
drive here:

http://fsbench.netnation.com/

The results vary quite a bit, and it seems the file system you use
can make a huge difference.

SCSI is obviously faster, but a 20% performance gain for 5x the cost is
only worth it for a very small percentage of people, I would think.


--
Best Regards,

Mike Benoit


Nov 12 '05 #4

Mike Benoit wrote:
> I just ran some benchmarks against a 10K SCSI drive and 7200 RPM IDE
> drive here:
>
> http://fsbench.netnation.com/
>
> The results vary quite a bit, and it seems the file system you use
> can make a huge difference.
>
> SCSI is obviously faster, but a 20% performance gain for 5x the cost is
> only worth it for a very small percentage of people, I would think.


Did you turn off the IDE write cache? If not, the SCSI drive is
reliable in case of OS failure, while the IDE is not.

--
Bruce Momjian | http://candle.pha.pa.us
pg***@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


Nov 12 '05 #5

It seems to me file system journaling should fix the whole problem by giving
you a record of what was actually committed to disk and what was not. I must
not understand journaling correctly. Can anyone explain to me how
journaling works?


Nov 12 '05 #6

"Rick Gigger" <ri**@alpinenetworking.com> writes:
> It seems to me file system journaling should fix the whole problem by giving
> you a record of what was actually committed to disk and what was not.


Nope, a journaling FS has exactly the same problem Postgres does
(because the underlying "WAL" concept is the same: write the log entries
before you change the files they describe). If the drive lies about
write order, the FS can be screwed just as badly. Now the FS code might
have a low-level way to force write order that Postgres doesn't have
access to ... but simply uttering the magic incantation "journaling file
system" will not make this problem disappear.

regards, tom lane


Nov 12 '05 #7

ahhh. "lies about write order" is the phrase that I was looking for. That
seemed to make sense but I didn't know if I could go directly from "lying
about fsync" to that. Obviously I don't understand exactly what fsync is
doing. I assume this means that if you were to turn fsync off you would get
considerably better performance but introduce the possibility of corrupting
the files in your database.

Thank you. This makes a lot more sense now.


Nov 12 '05 #8

Tom, this discussion brings up something that's been bugging me about the
recommendations for getting more performance out of PG.. in particular the
one that suggests you put your WAL files on a different physical drive from
the database.

Consider the following scenario:
Database on drive1
WAL on drive2

1. PG write of some sort occurs.
2. PG writes out the WAL.
3. PG writes out the data.
4. PG updates the WAL to reflect data actually written.
5. System crashes/reboots/whatever.

With the DB and the WAL on different drives, it seems possible to me that
drive2 could've fsync()'d or otherwise properly written all of the data
out, but drive1 could have failed somewhere along the way and not actually
written the data to the DB.

The next time PG is brought up, the WAL would indicate the transaction, as
it were, was a success, but the data wouldn't actually be there.

In the case of using only one drive, the rollback (from a FS perspective)
couldn't possibly occur in such a way as to leave step 4 as a success, but
step 3 as a failure -- worst case, the data would be written out but the
WAL wouldn't have been updated (rolled back say by the FS) and thus PG will
roll back the data itself, or use whatever mechanism it uses to insure data
integrity is consistent with the WAL.

Am I smoking something here or is this a real, if rare in practice, risk
that occurs when you have the WAL on a different drive than the data is on?

Nov 12 '05 #9

"Rick Gigger" <ri**@alpinenetworking.com> writes:
> ahhh. "lies about write order" is the phrase that I was looking for. That
> seemed to make sense but I didn't know if I could go directly from "lying
> about fsync" to that. Obviously I don't understand exactly what fsync is
> doing.


What we actually care about is write order: WAL entries have to hit the
platter before the corresponding data-file changes do. Unfortunately we
have no portable means of expressing that exact constraint to the
kernel. We use fsync() (or related constructs) instead: issue the WAL
writes, fsync the WAL file, then issue the data-file writes. This
constrains the write ordering more than is really needed, but it's the
best we can do in a portable Unix application.

The problem is that the kernel thinks fsync is done when the disk drive
reports the writes are complete. When we say a drive lies about this,
we mean it accepts a sector of data into its on-board RAM and then
immediately claims write-complete, when in reality the data hasn't hit
the platter yet and will be lost if power dies before the drive gets
around to writing it.

So we can have a scenario where we think WAL is down to disk and go
ahead with issuing data-file writes. These will also be shoved over to
the drive and stored in its on-board RAM. Now the drive has multiple
sectors pending write in its buffers. If it chooses to write these in
some order other than the order they were given to it, it could write
the data file updates to disk first. If power drops *now*, we lose,
because the data files are inconsistent and there's no WAL entry to tell
us to fix it.

Got it? It's really the combination of "lie about write completion" and
"write pending sectors out of order" that can mess things up.

The reason IDE drives have to do this for reasonable performance is that
the IDE interface is single-threaded: you can only have one read or
write in process at a time, from the point of view of the
kernel-to-drive interface. But in order to schedule reads and writes in
a way that makes sense physically (minimizes seeks), the drive has to
have multiple read and write requests pending that it can pick and
choose from. The only possibility to do that in the IDE world is to
let a write "complete" in interface terms before it's really done ...
that is, lie.

The reason SCSI drives do *not* do this is that the SCSI interface is
logically multi-threaded: you can have multiple reads or writes pending
at once. When you want to write on a SCSI drive, you send over a
command that says "write this data at this sector". Sometime later the
drive sends back a status report "yessir boss, I done did that write".
Similarly, a read consists of a command "read this sector", followed
sometime later by a response that delivers the requested data. But you
can send other commands to read or write other sectors meanwhile, and
the drive is free to reorder them to suit its convenience. So in the
SCSI world, there is no need for the drive to lie in order to do its own
read/write scheduling. The kernel knows the truth about whether a given
sector has hit disk, and so it won't conclude that the WAL file has been
completely fsync'd until it really is all down to the platter.

This is also why SCSI disks shine on the read side when you have lots of
processes doing reads: in an IDE drive, there is no way for the drive to
satisfy read requests in any order but the one they're issued in. If the
kernel guesses wrong about the best ordering for a set of read requests,
then everybody waits for the seeks needed to get the earlier processes'
data. A SCSI drive can fetch the "nearest" data first, and then that
requester is freed to make progress in the CPU while the other guys wait
for their longer seeks. There's no win here with a single active user
process (since it probably wants specific data in a specific order), but
it's a huge win if lots of processes are making unrelated read requests.

Clear now?

(In a previous lifetime I wrote SCSI disk driver code ...)

regards, tom lane
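The write-ordering discipline described above can be sketched in a few lines of Python (a minimal illustration of the WAL-before-data rule, not Postgres's actual code; the file names are hypothetical):

```python
import os

def commit(wal_path, data_path, wal_record, data_page):
    # 1. Append the WAL entry and force it to the platter FIRST.
    wal_fd = os.open(wal_path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    os.write(wal_fd, wal_record)
    os.fsync(wal_fd)          # must complete before any data-file write
    os.close(wal_fd)

    # 2. Only now is it safe to issue the data-file write; if we crash
    #    here, recovery replays the WAL entry and redoes this write.
    data_fd = os.open(data_path, os.O_WRONLY | os.O_CREAT, 0o644)
    os.write(data_fd, data_page)
    os.close(data_fd)

commit("wal.sketch", "table.sketch", b"insert row 42\n", b"row 42\n")
```

If the drive acknowledges the fsync() while the sectors are still only in its on-board cache, step 1's guarantee silently evaporates, which is exactly the failure mode described above.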


Nov 12 '05 #10

On Mon, 2003-10-27 at 12:44, Mike Benoit wrote:
> I just ran some benchmarks against a 10K SCSI drive and 7200 RPM IDE
> drive here:
>
> http://fsbench.netnation.com/
>
> The results vary quite a bit, and it seems the file system you use
> can make a huge difference.
>
> SCSI is obviously faster, but a 20% performance gain for 5x the cost is
> only worth it for a very small percentage of people, I would think.
Running bonnie++ in 4 or 5 parallel runs would be interesting, to
see how IDE & SCSI compare in a multi-user environment.
--
Ron Johnson, Jr. ro***********@cox.net
Jefferson, LA USA

Nov 12 '05 #11

Thanks! Now it is much, much more clear. It leaves me with a few
additional questions though.

Question 1:
"we have no portable means of expressing that exact constraint to the
kernel"
Does this mean that specific operating systems have a better way of dealing
with this? Which ones and how? I'm guessing that it couldn't make too big
of a performance difference or it would probably be implemented already.

Question 2:
Do serial ATA drives suffer from the same issue?


Nov 12 '05 #12

On Mon, 2003-10-27 at 17:18, Rick Gigger wrote:
> ahhh. "lies about write order" is the phrase that I was looking for. That
> seemed to make sense but I didn't know if I could go directly from "lying
> about fsync" to that. Obviously I don't understand exactly what fsync is
> doing. I assume this means that if you were to turn fsync off you would get
> considerably better performance but introduce the possibility of corrupting
> the files in your database.
Yes.

There was a recent thread (in -general or -performance) regarding
putting the WAL files on a different disk, and changing wal_sync_method
to open_sync (or open_datasync, I don't remember which).

This will allow the device(s) that the database is on to
run asynchronously, while the WAL is synchronous, for safety.
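In postgresql.conf terms, the setup described above looks roughly like this (a sketch; relocating pg_xlog to the dedicated disk, typically via a symlink, is a separate step):

```
# postgresql.conf -- sketch of the WAL-on-its-own-disk setup
# (move pg_xlog onto the dedicated drive via a symlink first)
fsync = true                  # keep WAL writes synchronous
wal_sync_method = open_sync   # or open_datasync, where supported
```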
--
Ron Johnson, Jr. ro***********@cox.net
Jefferson, LA USA

Nov 12 '05 #13

"Rick Gigger" <ri**@alpinenetworking.com> writes:
> "we have no portable means of expressing that exact constraint to the
> kernel"
> Does this mean that specific operating systems have a better way of dealing
> with this? Which ones and how?


I'm not aware of any that offer a way of expressing "write these
particular blocks before those particular blocks". It doesn't seem like
it would require rocket scientists to devise such an API, but no one's
got round to it yet. Part of the problem is that the issue would have
to be approached at multiple levels: there is no point in offering an
OS-level API for this when the hardware underlying the bus-level API
(IDE) is doing its level best to sabotage the entire semantics.
> Do serial ATA drives suffer from the same issue?


Um, not an expert, but I think ATA is the same as IDE except for bus
width and transfer rate. If either one allows for multiple concurrent
read/write transactions I'll be very surprised.

regards, tom lane


Nov 12 '05 #14

On Tue, Oct 28, 2003 at 12:17:59AM -0500, Tom Lane wrote:
> "Rick Gigger" <ri**@alpinenetworking.com> writes:
> > Do serial ATA drives suffer from the same issue?
>
> Um, not an expert, but I think ATA is the same as IDE except for bus
> width and transfer rate. If either one allows for multiple concurrent
> read/write transactions I'll be very surprised.


Well, some googling around seems to indicate that Serial ATA I/ATA-6 has
Tagged Command Queueing (TCQ), which adds this feature specifically.
Whether it is a mandatory part of the spec I don't know.

--
Martijn van Oosterhout <kl*****@svana.org> http://svana.org/kleptog/
"All that is needed for the forces of evil to triumph is for enough good
men to do nothing." - Edmund Burke
"The penalty good people pay for not being interested in politics is to be
governed by people worse than themselves." - Plato



Nov 12 '05 #15

Martijn van Oosterhout <kl*****@svana.org> writes:
> Well, some googling around seems to indicate that Serial ATA I/ATA-6 has
> Tagged Command Queueing (TCQ), which adds this feature specifically.
> Whether it is a mandatory part of the spec I don't know.


Yeah? If so, and *if fully implemented* on both sides of the interface,
this would eliminate the architectural advantages I was just sketching
for SCSI. I can't claim to be up on what's happening in the IDE/ATA
world though...

regards, tom lane


Nov 12 '05 #16

Allen Landsidel <al*@biosys.net> writes:
> Tom, this discussion brings up something that's been bugging me about the
> recommendations for getting more performance out of PG.. in particular the
> one that suggests you put your WAL files on a different physical drive from
> the database.
> ...
> With the DB and the WAL on different drives, it seems possible to me that
> drive2 could've fsync()'d or otherwise properly written all of the data
> out, but drive1 could have failed somewhere along the way and not actually
> written the data to the DB.


Drive failure, in terms of losing something the drive claimed it had
written successfully, is not something that we can protect against.
For that, you go to your backup tapes. I don't see that it makes any
difference whether the database is spread across one drive or several;
you could still have a scenario where the claimed-complete write to
a data file failed to happen and then we recorded a checkpoint anyway.

Now, if the data drive fails to write and we can detect that, then we're
OK, because we won't record a checkpoint. We can redo the write based
on the contents of WAL after the problem's been fixed.

This is another reason why the IDE lie-about-write-completion behavior
is a Bad Idea: if the drive accepts data and then later has a problem
writing it, there is no way for it to report that fact --- and it's
too late anyhow since we've already taken other actions on the
assumption that the write is done. I'm not at all sure what IDE drives
do when they have a failure writing out cached buffers; anyone have
experience with that?

regards, tom lane


Nov 12 '05 #17

Martijn van Oosterhout <kl*****@svana.org> writes:
> On Tue, Oct 28, 2003 at 12:17:59AM -0500, Tom Lane wrote:
> > "Rick Gigger" <ri**@alpinenetworking.com> writes:
> > > Do serial ATA drives suffer from the same issue?
> >
> > Um, not an expert, but I think ATA is the same as IDE except for bus
> > width and transfer rate. If either one allows for multiple concurrent
> > read/write transactions I'll be very surprised.
>
> Well, some googling around seems to indicate that Serial ATA I/ATA-6 has
> Tagged Command Queueing (TCQ), which adds this feature specifically.
> Whether it is a mandatory part of the spec I don't know.


The post on linux-kernel from the maxtor guy seemed to indicate we would have
to wait for ATA-7 drives (which are not out in the market yet) before the
features we really need are there.

Currently the linux-kernel folks are talking about how to integrate an IDE
SYNC operation into the world. It looks like filesystems with journals will
issue an IDE SYNC to checkpoint the journal, but it doesn't really look like
they're planning to hook it into fsync unless people speak up and explain what
databases need in that regard. However SYNC flushes the entire cache and means
that all other writes are blocked until the SYNC completes.

Apparently the feature needed to *really* implement fsync is called FUA which
would give real feedback of the status of the write without preventing all
other writes from proceeding. That's what isn't going to appear until ATA-7.
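
At the syscall level, the closest portable analogues today are fsync() (one
coarse flush) and the O_SYNC open flag (per-write durability, roughly the
per-request feedback FUA promises at the drive level). A Python sketch of the
two, assuming a POSIX platform where os.O_SYNC is available:

```python
import os

def write_then_fsync(path, data):
    # Coarse-grained: one flush covering everything buffered for the file,
    # analogous to an IDE cache-flush SYNC (everyone waits for the flush).
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    os.write(fd, data)
    os.fsync(fd)
    os.close(fd)

def write_with_o_sync(path, data):
    # Per-write: O_SYNC asks that each write() not return until the data
    # is stable, without forcing a flush of everyone else's writes.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_SYNC, 0o600)
    os.write(fd, data)
    os.close(fd)
```

Of course, both still depend on the drive telling the truth about completion.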

All this is from a few posts on linux-kernel. e.g.:

http://www.ussg.iu.edu/hypermail/lin...04.1/0450.html

--
greg

Nov 12 '05 #18

Tom Lane <tg*@sss.pgh.pa.us> writes:
I'm not at all sure what IDE drives do when they have a failure writing out
cached buffers; anyone have experience with that?


There's a looooong discussion about this too on linux-kernel, search for
"blockbusting". I think the conclusion is "it depends".

Often write failures aren't detected until the block is subsequently read. In
that case of course there's no hope. What's worse is the drive might not remap
the block on a read, so the problem can stick around even after the error.

If the write failure is caused by a bad block and the drive detects this at
the time it's written then the drive can actually remap that block to one of
its spare blocks. This is invisible to the host.

If it runs out of spare blocks, then you're in trouble. And there's no warning
that you're running low on spare blocks in any particular region unless you
use special utilities to query the drive. Also if the failure is caused by
environmental factors like vibrations or heat then you can be in trouble too.

--
greg

Nov 12 '05 #19

> It seems to me file system journaling should fix the whole problem by giving
you a record of what was actually commited to disk and what was not. I must
not understand journaling correctly. Can anyone explain to me how
journaling works.


Journaling depends, absolutely critically, on the OS knowing what data has
actually been written to disk. It can't be any other way; with an in-disk
write cache the OS has no way to know when the *journal* has been written to
disk, therefore journaling can't work.
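
A toy version of the journal commit protocol makes the dependency obvious
(Python, with an invented file layout; each fsync() below is a point where a
lying write cache silently breaks recovery):

```python
import os

def fsync_write(fd, data):
    os.write(fd, data)
    os.fsync(fd)   # the journal is useless unless this really reaches the disk

def journaled_update(journal_path, data_path, new_contents):
    # 1. Describe the intended change in the journal and flush it.
    jfd = os.open(journal_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    fsync_write(jfd, b"BEGIN " + new_contents + b"\n")
    # 2. Apply the change to the real data file and flush it.
    dfd = os.open(data_path, os.O_WRONLY | os.O_CREAT, 0o600)
    fsync_write(dfd, new_contents)
    os.close(dfd)
    # 3. Mark the journal entry complete.  If the drive cached step 1 or 2
    #    and claimed completion anyway, crash recovery can replay a change
    #    that never hit the platter, or believe a half-applied one is done.
    fsync_write(jfd, b"COMMIT\n")
    os.close(jfd)
```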
--
Scott Ribe
sc********@killerbytes.com
http://www.killerbytes.com/
(303) 665-7007 voice

Nov 12 '05 #20

Greg Stark wrote:
Tom Lane <tg*@sss.pgh.pa.us> writes:
I'm not at all sure what IDE drives do when they have a failure writing out
cached buffers; anyone have experience with that?


There's a looooong discussion about this too on linux-kernel, search for
"blockbusting". I think the conclusion is "it depends".

Often write failures aren't detected until the block is subsequently read. In
that case of course there's no hope. What's worse is the drive might not remap
the block on a read, so the problem can stick around even after the error.

If the write failure is caused by a bad block and the drive detects this at
the time it's written then the drive can actually remap that block to one of
its spare blocks. This is invisible to the host.

If it runs out of spare blocks, then you're in trouble. And there's no warning
that you're running low on spare blocks in any particular region unless you
use special utilities to query the drive. Also if the failure is caused by
environmental factors like vibrations or heat then you can be in trouble too.


My Buslogic/Mylex plain SCSI controller would beep when it hit a bad block ---
I didn't know why my computer was beeping for a while until I figured it out ---
can't beat that service.

--
Bruce Momjian | http://candle.pha.pa.us
pg***@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


Nov 12 '05 #21

> "Rick Gigger" <ri**@alpinenetworking.com> writes:
"we have no portable means of expressing that exact constraint to the
kernel"

Does this mean that specific operating systems have a better way of
dealing with this? Which ones and how?


I'm not aware of any that offer a way of expressing "write these
particular blocks before those particular blocks". It doesn't seem like
it would require rocket scientists to devise such an API, but no one's
got round to it yet. Part of the problem is that the issue would have
to be approached at multiple levels: there is no point in offering an
OS-level API for this when the hardware underlying the bus-level API
(IDE) is doing its level best to sabotage the entire semantics.


But for those of us using SCSI, wouldn't it be possible to get a performance
gain here? Would the gain be worth the effort?

Nov 12 '05 #22

On Tue, Oct 28, 2003 at 01:04:27PM -0500, Greg Stark wrote:
If it runs out of spare blocks, then you're in trouble. And there's no warning
that you're running low on spare blocks in any particular region unless you
use special utilities to query the drive. Also if the failure is caused by
environmental factors like vibrations or heat then you can be in trouble too.
Actually, drives have S.M.A.R.T. for reporting this kind of issue. The idea
being that a counter decrements every time a block is remapped. When it
reaches a declared threshold the drive declares an error and if it's in
warranty that's enough to convince the manufacturer to send you a new disk.

Not that many people use this feature, but it is there.
--
Martijn van Oosterhout <kl*****@svana.org> http://svana.org/kleptog/
"All that is needed for the forces of evil to triumph is for enough good
men to do nothing." - Edmond Burke
"The penalty good people pay for not being interested in politics is to be
governed by people worse than themselves." - Plato



Nov 12 '05 #23

In article <87************@stark.dyndns.tv>,
Greg Stark <gs*****@mit.edu> wrote:
Currently the linux-kernel folks are talking about how to integrate an IDE
SYNC operation into the world. It looks like filesystems with journals will
issue an IDE SYNC to checkpoint the journal, but it doesn't really look like
they're planning to hook it into fsync unless people speak up and explain what
databases need in that regard. However SYNC flushes the entire cache and means
that all other writes are blocked until the SYNC completes.

http://www.ussg.iu.edu/hypermail/lin...04.1/0450.html


Also, if you're interested in this kind of stuff and what's going on
in the Linux kernel development circles, Google for "ide write barrier".

For example http://lkml.org/lkml/2003/10/13/87

Mike.

Nov 12 '05 #24

>>> "we have no portable means of expressing that exact constraint to the
kernel"

Does this mean that specific operating systems have a better way of
dealing with this? Which ones and how?


I'm not aware of any that offer a way of expressing "write these
particular blocks before those particular blocks". It doesn't seem like
it would require rocket scientists to devise such an API, but no one's
got round to it yet. Part of the problem is that the issue would have
to be approached at multiple levels: there is no point in offering an
OS-level API for this when the hardware underlying the bus-level API
(IDE) is doing its level best to sabotage the entire semantics.

[sNip]

Actually, NetWare is one OS that does this, and has been doing so
since the 1980s with version 2 (version 6.5 is the current version today).
They have a Patented caching algorithm called "Elevator Seeking" which both
prolongs the life of the drive by reducing wear-and-tear and improving
read/write performance by minimizing seek operations.

With IDE it seems that this caching algorithm is also beneficial, but
it really shines with SCSI drives.

In all my experience, SCSI drives are much faster and far more
reliable than IDE drives. I've always assumed that it boils down to "you
get what you pay for."

--
Randolf Richardson - rr@8x.ca
Inter-Corporate Computer & Network Services, Inc.
Vancouver, British Columbia, Canada
http://www.8x.ca/

This message originated from within a secure, reliable,
high-performance network ... a Novell NetWare network.

Nov 12 '05 #25


On Wed, Nov 19, 2003 at 09:29:21PM +0000, Randolf Richardson, DevNet SysOp 29 wrote:
Actually, NetWare is one OS that does this, and has been doing so
since the 1980s with version 2 (version 6.5 is the current version today).
They have a Patented caching algorithm called "Elevator Seeking" which both
prolongs the life of the drive by reducing wear-and-tear and improving
read/write performance by minimizing seek operations.


Huh, is this different from your ordinary elevator algorithm? I'd be
surprised if there was an OS which didn't use something like that ...
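
For reference, the textbook elevator (SCAN) discipline amounts to little more
than this (a Python sketch, not anyone's actual scheduler):

```python
def elevator_order(pending, head):
    # Service requests in one sweep upward from the current head position,
    # then sweep back down, instead of first-come-first-served.  This
    # minimizes total seek distance across the pending queue.
    up = sorted(c for c in pending if c >= head)
    down = sorted((c for c in pending if c < head), reverse=True)
    return up + down
```

With the head at cylinder 53, elevator_order([98, 183, 37, 122, 14, 124, 65, 67], 53)
visits [65, 67, 98, 122, 124, 183, 37, 14].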

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"La verdad no siempre es bonita, pero el hambre de ella sí"


Nov 12 '05 #27

"Randolf Richardson, DevNet SysOp 29" <rr@8x.ca> writes:
Actually, NetWare is one OS that does this, and has been doing so
since the 1980s with version 2 (version 6.5 is the current version today).
They have a Patented caching algorithm called "Elevator Seeking" which both


They've managed to patent ye olde elevator algorithm?? The USPTO really
is without a clue, isn't it :-(

regards, tom lane


Nov 12 '05 #28

[sNip]
They've managed to patent ye olde elevator algorithm?? The USPTO really
is without a clue, isn't it :-(


It's not the USPTO's fault -- the problem is that nobody objected to it
while it was in the "Patent Pending" state.

--
Randolf Richardson - rr@8x.ca
Vancouver, British Columbia, Canada

Please do not eMail me directly when responding
to my postings in the newsgroups.
Nov 12 '05 #29

Randolf Richardson <rr@8x.ca> writes:
They've managed to patent ye olde elevator algorithm?? The USPTO really
is without a clue, isn't it :-(
It's not the USPTO's fault -- the problem is that nobody objected to it
while it was in the "Patent Pending" state.


If their examiner had even *minimal* competency in the field, it would
not have gotten to the "Patent Pending" state. Algorithms that are well
documented in the standard textbooks of thirty years ago do not qualify
as something people should have to stand guard against.

Perhaps I should try to patent base-two arithmetic, and hope no one
notices till it goes through ... certainly the USPTO won't notice ...

regards, tom lane


Nov 12 '05 #30

Ben
Base-two arithmetic sounds pretty broad. If only you could come up with a
scheme for division and multiplication by powers of two through
bitshifting.....
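
For the record, the entire "invention" would be (in Python):

```python
def mul_pow2(x, k):
    return x << k          # x * 2**k

def div_pow2(x, k):
    return x >> k          # x // 2**k, for non-negative x
```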

On Wed, 26 Nov 2003, Tom Lane wrote:
Randolf Richardson <rr@8x.ca> writes:
They've managed to patent ye olde elevator algorithm?? The USPTO really
is without a clue, isn't it :-(

It's not the USPTO's fault -- the problem is that nobody objected to it
while it was in the "Patent Pending" state.


If their examiner had even *minimal* competency in the field, it would
not have gotten to the "Patent Pending" state. Algorithms that are well
documented in the standard textbooks of thirty years ago do not qualify
as something people should have to stand guard against.

Perhaps I should try to patent base-two arithmetic, and hope no one
notices till it goes through ... certainly the USPTO won't notice ...

regards, tom lane


Nov 12 '05 #31

Ben wrote:
Base-two arithmetic sounds pretty broad. If only you could come up with a
scheme for division and multiplication by powers of two through
bitshifting.....


I already have that patent! :-)

--
Bruce Momjian | http://candle.pha.pa.us
pg***@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


Nov 12 '05 #32

Martijn van Oosterhout wrote:
On Tue, Oct 28, 2003 at 01:04:27PM -0500, Greg Stark wrote:
If it runs out of spare blocks, then you're in trouble. And there's no warning
that you're running low on spare blocks in any particular region unless you
use special utilities to query the drive. Also if the failure is caused by
environmental factors like vibrations or heat then you can be in trouble too.

Actually, drives have S.M.A.R.T for reporting these kind of issues. The idea
being that a counter decrements every time a block is remapped. When it
reaches a declared threshold the drive declares an error and if it's in
warranty that's enough to convince the manufacturer to send you a new disk.

Not that many people use this feature, but it is there.


I used smartsuite (http://sourceforge.net/projects/smartsuite/) to view
the status of the drives, but the relocated sector count appears only
available on ide drives. Does anyone know if that is the nature of scsi
drives or is it just a limitation of that tool?

Nov 12 '05 #33

Joseph Shraibman wrote:
Actually, drives have S.M.A.R.T for reporting these kind of issues. The idea
being that a counter decrements every time a block is remapped. When it
reaches a declared threshold the drive declares an error and if it's in
warranty that's enough to convince the manufacturer to send you a new disk.

Not that many people use this feature, but it is there.


I used smartsuite (http://sourceforge.net/projects/smartsuite/) to view
the status of the drives, but the relocated sector count appears only
available on ide drives. Does anyone know if that is the nature of scsi
drives or is it just a limitation of that tool?


Do SCSI drives even do relocation? I had a Seagate SCSI drive that
would beep every time I tried to access a bad block, basically telling
me to replace the drive.

--
Bruce Momjian | http://candle.pha.pa.us
pg***@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


Nov 12 '05 #34

>> Base-two arithmetic sounds pretty broad. If only you could come up with a
scheme for division and multiplication by powers of two through
bitshifting.....


I already have that patent! :-)


Please share your licensing agreement with the rest of us so that we
may decide to applaud you or throw tomatoes at you (throwing tomatoes, as far
as I'm aware, is a "process" which hasn't been Patented yet). =D

--
Randolf Richardson - rr@8x.ca
Vancouver, British Columbia, Canada

Please do not eMail me directly when responding
to my postings in the newsgroups.
Nov 12 '05 #35

[sNip]
I used smartsuite (http://sourceforge.net/projects/smartsuite/) to view
the status of the drives, but the relocated sector count appears only
available on ide drives. Does anyone know if that is the nature of
scsi drives or is it just a limitation of that tool?


Do SCSI drives even do relocation? I had a Seagate SCSI drive that
would beep every time I tried to access a bad block, basically telling
me to replace the drive.


Normally this should be handled by the OS since a judgement can be
made on data reliability whereas the hard drive wouldn't know which
algorithm to use (e.g., CRC, etc.).

Perhaps the following would be "food for thought" on future table
space implementation so as to do something that Oracle hasn't thought of...

On NetWare v2.x (c. 1980) through v6.5 (the current version, released
in 2003) a section of each Partition was designated as a "HotFix" area (the
percentage is configurable at the time of formatting) which is
automatically used in place of bad blocks as they are discovered, and error
messages are generated in system logs and on the System Console whenever
one is found.

The default percentage originally started out at 2% but has eventually
been lowered to 0.2% over time due to a number of factors, including the
following:

1. Larger capacity hard drives; and,

2. Fewer defects on new hard drives -- in the old days (20 years
ago definitely qualifies as "old days" in the computer industry) it was
common for new hard drives to come with errors on the drive, but now all
hard drives come with zero bad sectors (I assume this is due to improved
techniques and practices at the manufacturing level).
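
The HotFix idea itself is simple enough to model in a few lines (a toy Python
sketch; the names and sizes are invented and have nothing to do with NetWare's
actual on-disk format):

```python
class HotFixDevice:
    """Writes that fail on their home block are transparently
    redirected to a reserved spare area, and the event is logged."""

    def __init__(self, nblocks, spare_blocks, bad=()):
        self.blocks = {}
        self.bad = set(bad)              # blocks that fail on write
        self.spares = list(range(nblocks, nblocks + spare_blocks))
        self.remap = {}                  # logical block -> spare block

    def write(self, blockno, data):
        target = self.remap.get(blockno, blockno)
        if target in self.bad:
            if not self.spares:
                raise IOError("out of spare blocks")
            target = self.spares.pop(0)
            self.remap[blockno] = target
            print("HotFix: block %d redirected to %d" % (blockno, target))
        self.blocks[target] = data

    def read(self, blockno):
        return self.blocks[self.remap.get(blockno, blockno)]
```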

I'd be quite happy to write the documentation explaining table spaces
in PostgreSQL should it become a feature in a future release. In fact, I
would really enjoy doing this, and so I believe that my contribution could
be very helpful.

--
Randolf Richardson - rr@8x.ca
Vancouver, British Columbia, Canada

Please do not eMail me directly when responding
to my postings in the newsgroups.
Nov 12 '05 #36

On Fri, 2003-11-28 at 21:45, Bruce Momjian wrote:
Do SCSI drives even do relocation? I had a Seagate SCSI drive that
would beep every time I tried to access a bad block, basically telling
me to replace the drive.


I'm pretty sure that SCSI drives, or at least more modern ones, do. The
ones I've used have a list of bad blocks stored internally and will
relocate blocks automatically. The drives allowed you to reset this
list by running a low level format from the scsi controller. The drives
would then clear the bad blocks list and recheck disk blocks again.

--
Suchandra Thapa <s-********@alumni.uchicago.edu>


Nov 12 '05 #37
