
MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware, whatever)

alf
Hi,

Is it possible, due to an OS crash, a crash of mysql itself, or some
failure (e.g. SCSI), to lose all the data stored in a table (let's say a
million 1KB rows)? In other words, what is the worst-case scenario for the
MyISAM backend?
Also, is it possible not to lose data but to get it corrupted?
Thx, Andy
Nov 9 '06
Bill Todd wrote:
Jerry Stuckle wrote:
>Bill Todd wrote:
>>Jerry Stuckle wrote:

...

ZFS is not proof against silent errors - they can still occur.

Of course they can, but they will be caught by the background
verification scrubbing before much time passes (i.e., within a time
window that radically reduces the likelihood that another disk will
fail before the error is caught and corrected), unlike the case with
conventional RAID (where they aren't caught at all, and rise up to
bite you - with non-negligible probability these days - if the good
copy then dies).

And ZFS *is* proof against silent errors in the sense that data thus
mangled will not be returned to an application (i.e., it will be
caught when read if the background integrity validation has not yet
reached it) - again, unlike the case with conventional mirroring,
where there's a good chance that it will be returned to the
application as good.


The same is true with RAID-1 and RAID-10. An error on the disk will
be detected and returned by the hardware to the OS.


I'd think that someone as uninformed as you are would have thought twice
about appending an ad for his services to his Usenet babble. But formal
studies have shown that the least competent individuals seem to be the
most confident of their opinions (because they just don't know enough to
understand how clueless they really are).
How you wish. I suspect I have many years more experience and knowledge
than you. And am more familiar with fault tolerant systems.

Do you even know what a silent error is? It's an error that the disk
does not notice, and hence cannot report.
Yep. And please tell me EXACTLY how this can occur.
Duh.

In some of your other recent drivel you've seemed to suggest that this
simply does not happen. Well, perhaps not in your own extremely limited
experience, but you really shouldn't generalize from that.
I didn't say it CAN'T happen. My statement was that it is SO UNLIKELY
to happen that it can be virtually ignored.
A friend of mine at DEC investigated this about a decade ago and found
that the (high-end) disk subsystems of some (high-end) Alpha platforms
were encountering undetected errors on average every few TB (i.e., what
they read back was, very rarely, not quite what they had written in,
with no indication of error). That may be better today (that's more
like the uncorrectable error rate now), but it still happens. The
causes are well known to people reasonably familiar with the technology:
the biggies are writes that report successful completion but in fact do
nothing, writes that go to the wrong target sector(s) (whether or not
they report success), and errors that the sector checksums just don't
catch (those used to be about three orders of magnitude rarer than
uncorrectable errors, but that was before the rush toward higher density
and longer checksums to catch the significantly-increased raw error
rates - disk manufacturers no longer report the undetected error rate,
but I suspect that it's considerably closer to the uncorrectable error
rate now). There are also a few special cases - e.g., the disk that
completes a sector update while power is failing, not knowing that the
transfer from memory got clamped part-way through and returned zeros
rather than whatever it was supposed to (so as far as the disk knows
they're valid).
That was a decade ago. What are the figures TODAY? Do you even know?
Do you even know why they happen?

IBM, Unisys, NetApp, and EMC (probably not an exhaustive list, but
they're the ones that spring immediately to mind) all use non-standard
disk sector sizes in some of their systems to hold additional validation
information (maintained by software or firmware well above the disk
level) aimed at catching some (but in most cases not all) of these
unreported errors.
That's one way things are done.
Silent errors are certainly rare, but they happen. ZFS catches them.
RAID does not. End of story.
And what about data which is corrupted once it's placed in the ZFS
buffer? ZFS buffers are in RAM, and can be overwritten at any time.

And ZFS itself can be corrupted - telling the disk to write to the wrong
sector, for instance. It is subject to viruses. It runs only on UNIX.

The list goes on. But these are things which can occur much more easily
than the hardware errors you mention. And they are conveniently ignored
by those who have bought the ZFS hype.
...

>The big difference being ZFS is done in software, which requires CPU
cycles and other resources.


Since when was this discussion about use of resources rather than
integrity (not that ZFS's use of resources for implementing its own
RAID-1/RAID-10 facilities is significant anyway)?
It's always about performance. 100% integrity is no good if you need
100% of the system resources to handle it.
> It's also open to corruption.


No more than the data that some other file system gives to a hardware
RAID implementation would be: it all comes from the same place (main
memory).
Ah, but this is much more likely than once it's been passed off to the
hardware, no?
However, because ZFS subsequently checks what it wrote against a
*separate* checksum, if it *was* corrupted below the request-submission
level ZFS is very likely to find out, whereas a conventional RAID
implementation (and the higher layers built on top of it) won't: they
just write what (they think) they're told to, with no additional check.
So? If the buffer is corrupted, the checksum will be, also. And if the
data is written to the wrong sector, the checksum will still be correct.
>RAID-1 and RAID-10 are implemented in hardware/firmware which cannot be
corrupted (read-only memory) and require no CPU cycles.


If your operating system and file system have been corrupted, you've got
problems regardless of how faithfully your disk hardware transfers this
corruption to its platters: this alleged deficiency compared with a
hardware implementation is just not an issue.
The hardware is MUCH MORE RELIABLE than the software.
You've also suggested elsewhere that a hardware implementation is less
likely to contain bugs, which at least in this particular instance is
nonsense: ZFS's RAID-1/10 implementation benefits from the rest of its
design such that it's likely *far* simpler than any high-performance
hardware implementation (with its controller-level cache management and
deferred write-back behavior) is, and hence if anything likely *less*
buggy.
Yea, right. Keep believing it. Because when talking about software
implementations, you have to also consider the OS and other software
running at the time.
>>Plus it is not proof against data decaying after it is written to disk.

No - but, again, it will catch it before long, even in cases where
conventional disk scrubbing would not.

So do RAID-1 and RAID-10.


No, they typically do not: they may scrub to ensure that sectors can be
read successfully (and without checksum errors), but they do not compare
one copy with the other (and even if they did, if they found that the
copies differed they'd have no idea which one was the right one - but
ZFS knows).
No, they don't. But the odds of an incorrect read generating a valid
checksum with current algorithms (assuming high quality drives -
different manufacturers use different techniques) are now so low as to
be negligible. You're more likely to have something overwritten in
memory than a silent error.
>>And, as you note, it doesn't handle a disk crash.

It handles it with resilience comparable to RAID-1, but is more
flexible in that it can then use distributed free space to restore
the previous level of redundancy (whereas RAID-1/RAID-10 cannot
unless the number of configured hot spare disks equals the number of
failed disks).

And for a critical system you have that redundancy and more.


So, at best, RAID-1/10 matches ZFS in this specific regard (though of
course it can't leverage the additional bandwidth and IOPS of its spare
space, unlike ZFS). Whoopee.
And it does it with more integrity and better performance, as indicated
above.
>>
>>>>
But when properly implemented, RAID-1 and RAID-10 will detect and
correct even more errors than ZFS will.


A complete disk crash, for instance. Even Toby admitted ZFS cannot
recover from a disk crash.

ZFS is good. But it's a cheap software implementation of an expensive
hardware recovery system. And there is no way software can do it as
well as hardware does.


You at least got that right: ZFS does it considerably better, not
merely 'as well'. And does so at significantly lower cost (so you got
that part right too).
The fact that you even try to claim that ZFS is better than a RAID-1 or
RAID-10 system shows just how little you understand critical systems,
and how much you've bought into the ZFS hype.
The one advantage that a good hardware RAID-1/10 implementation has over
ZFS relates to performance, primarily small-synchronous-write latency:
while ZFS can group small writes to achieve competitive throughput (in
fact, superior throughput in some cases), it can't safely report
synchronous write completion until the data is on the disk platters,
whereas a good RAID controller will contain mirrored NVRAM that can
guarantee persistence in microseconds rather than milliseconds (and then
destage the writes to the platters lazily).
That's one advantage, yes.
Now, ZFS does have an 'intent log' for small writes, and does have the
capability of placing this log on (mirrored) NVRAM to achieve equivalent
small-synchronous-write latency - but that's a hardware option, not part
and parcel of ZFS itself.
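For concreteness, here is a rough sketch of the intent-log idea under discussion: a small synchronous write is acknowledged once it has been flushed to a fast log device, and the main store is updated lazily afterwards. This is an illustrative toy, not ZFS's actual ZIL code; IntentLog, sync_write and destage are made-up names.

import os

class IntentLog:
    """Toy write-ahead intent log: a small synchronous write counts as
    durable once it has been flushed to the (fast) log device; the main
    data file is updated lazily afterwards."""

    def __init__(self, log_path: str, data_path: str):
        # In the arrangement described above, log_path would live on
        # mirrored NVRAM or another low-latency device; data_path stands
        # in for the slow main pool.
        self.log_fd = os.open(log_path,
                              os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        self.data_path = data_path
        self.pending = []

    def sync_write(self, offset: int, payload: bytes) -> None:
        record = b"%d %d " % (offset, len(payload)) + payload
        os.write(self.log_fd, record)
        os.fsync(self.log_fd)      # durable on the log device: safe to ack now
        self.pending.append((offset, payload))

    def destage(self) -> None:
        # Apply the accumulated writes to the main file at leisure.
        with open(self.data_path, "ab"):
            pass                    # make sure the data file exists
        with open(self.data_path, "r+b") as f:
            for offset, payload in self.pending:
                f.seek(offset)
                f.write(payload)
            f.flush()
            os.fsync(f.fileno())
        self.pending.clear()

The latency argument above is only about where that first fsync lands: on rotating media it costs milliseconds, on battery-backed or mirrored NVRAM it costs microseconds, but in neither case is completion reported before the data is durable somewhere.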
Oh, so you're now saying that synchronous writes may not be truly
synchronous with ZFS? That's something I didn't know. I thought ZFS
was smarter than that.
...
>>Please name even one.


Why am I not surprised that you dodged that challenge?
Because I'm not the one making the claim. You make a claim? Don't
expect me to do your work backing it up for you.
Now, as far as credentials go, some people (who aren't sufficiently
familiar with this subject to know just how incompetent you really are
to discuss it) might find yours impressive (at least you appear to think
they might, since you made some effort to trot them out). I must admit
that I can't match your claim to have been programming since "1867", but
I have been designing and writing system software since 1976 (starting
with 11 years at DEC), and had a stint at EMC designing high-end storage
firmware in the early '90s. I specialize in designing and implementing
high-performance, high-availability distributed file, object, and
database systems, and have personally created significant portions of
several such; in this pursuit, I've kept current on the state of the art
both in academia and in the commercial arena.
That was 1967, obviously a typo.

OK, well, when you get an electronics background, you can start talking
with intelligence about just how all of those hardware problems occur.

And I say you're full of shit. Christ, you've never even heard of
people losing mirrored data at all - not from latent errors only
discovered at rebuild time, not from correlated failures of mirror pairs
from the same batch (or even not from the same batch - with a large
enough RAID-10 array there's a modest probability that some pair won't
recover from a simple power outage, and - though this may be news to you
- even high-end UPSs are *not* infallible)...

Sheesh.

- bill
ROFLMAO! Just like a troll. Jump into the middle of a discussion
uninvited. Doesn't have any real knowledge, but is an expert on
everything. Then makes personal attacks against the other person to
cover for this deficiency.

Hell, you'd probably have to look up Ohm's Law. And you're lecturing me
on how much more reliable software is than hardware?

Go back into your little hole, troll.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================
Nov 12 '06 #51
toby wrote:
toby wrote:
>>Jerry Stuckle wrote:
>>>...
Actually, I understand quite a bit about ZFS. However, unlike you, I
also understand its shortcomings.


This group and I would very much like to hear about those shortcomings,
if you would elucidate.

>>>That's because I started working on
fault-tolerant drive systems starting in 1977 as a hardware CE for IBM,
working on large mainframes. I've watched it grow over the years. And
as an EE major, I also understand the hardware and its strengths and
weaknesses - in detail.

And as a CS major (dual majors) and programmer since 1867, including
working on system software for IBM in the 1980's I have a thorough
understanding of the software end.

And it's obvious from your statements you have no real understanding of
either, other than sales literature.

You would. This group would not. You want to find out, you go to the
relevant groups. Don't bring your garbage here. It is not appropriate
for this group.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================
Nov 12 '06 #52
On 11 Nov 2006 19:30:25 -0800 "toby" <to**@telegraphics.com.au> wrote:
Jerry Stuckle wrote:
>REAL RAID-1 and RAID-10 are implemented in hardware/firmware, not a
software system such as ZFS.
This simple statement shows a fundamental misunderstanding of the basics,
let alone zfs.

-frank
Nov 12 '06 #53
Jerry Stuckle wrote:
Bill Todd wrote:
>Jerry Stuckle wrote:
>>Bill Todd wrote:

Jerry Stuckle wrote:

...

ZFS is not proof against silent errors - they can still occur.

Of course they can, but they will be caught by the background
verification scrubbing before much time passes (i.e., within a time
window that radically reduces the likelihood that another disk will
fail before the error is caught and corrected), unlike the case with
conventional RAID (where they aren't caught at all, and rise up to
bite you - with non-negligible probability these days - if the good
copy then dies).

And ZFS *is* proof against silent errors in the sense that data thus
mangled will not be returned to an application (i.e., it will be
caught when read if the background integrity validation has not yet
reached it) - again, unlike the case with conventional mirroring,
where there's a good chance that it will be returned to the
application as good.

The same is true with RAID-1 and RAID-10. An error on the disk will
be detected and returned by the hardware to the OS.


I'd think that someone as uninformed as you are would have thought
twice about appending an ad for his services to his Usenet babble.
But formal studies have shown that the least competent individuals
seem to be the most confident of their opinions (because they just
don't know enough to understand how clueless they really are).

How you wish. I suspect I have many years more experience and knowledge
than you.
You obviously suspect a great deal. Too bad that you don't have any
real clue. Even more too bad that you insist on parading that fact so
persistently.
And am more familiar with fault tolerant systems.
Exactly how many have you yourself actually architected and built,
rather than simply using the fault-tolerant hardware and software that
others have provided? I've been centrally involved in several.
>
>Do you even know what a silent error is? It's an error that the disk
does not notice, and hence cannot report.

Yep. And please tell me EXACTLY how this can occur.
I already did, but it seems that you need things spelled out in simpler
words: mostly, due to bugs in firmware in seldom-used recovery paths
that the vagaries of handling electro-mechanical devices occasionally
require. The proof is in the observed failures: as I said, end of
story (if you're not aware of the observed failures, it's time you
educated yourself in that area rather than kept babbling on
incompetently about the matter).
>
>Duh.

In some of your other recent drivel you've seemed to suggest that this
simply does not happen. Well, perhaps not in your own extremely
limited experience, but you really shouldn't generalize from that.

I didn't say it CAN'T happen. My statement was that it is SO UNLIKELY
to happen that it can be virtually ignored.
My word - you can't even competently discuss what you yourself have so
recently (and so incorrectly) stated.

For example, in response to my previous statement that ZFS detected
errors that RAID-1/10 did not, you said "The same is true with RAID-1
and RAID-10. An error on the disk will be detected and returned by the
hardware to the OS" - no probabilistic qualification there at all.

And on the subject of "data decaying after it is written to disk" (which
includes erroneous over-writes), when I asserted that ZFS "will catch it
before long, even in cases where conventional disk scrubbing would not"
you responded "So do RAID-1 and RAID-10" - again, no probabilistic
qualification whatsoever (leaving aside your incorrect assertion about
RAID's ability to catch those instances that disk-scrubbing does not
reveal).

You even offered up an example *yourself* of such a firmware failure
mode: "A failing controller can easily overwrite the data at some later
time." Easily, Jerry? That doesn't exactly sound like a failure mode
that 'can be virtually ignored' to me. (And, of course, you accompanied
that pearl of wisdom with another incompetent assertion to the effect
that ZFS would not catch such a failure, when of course that's
*precisely* the kind of failure that ZFS is *designed* to catch.)

Usenet is unfortunately rife with incompetent blowhards like you - so
full of themselves that they can't conceive of someone else knowing more
than they do about anything that they mistakenly think they understand,
and so insistent on preserving that self-image that they'll continue
spewing erroneous statements forever (despite their repeated promises to
stop: "I'm not going to respond to you any further", "I'm also not
going to discuss this any more with you", "I'm finished with this
conversation" all in separate responses to toby last night - yet here
you are this morning responding to him yet again).

I'm not fond of blowhards, nor of their ability to lead others astray
technically if they're not confronted. Besides, sticking pins in such
over-inflated balloons is kind of fun.

....
>IBM, Unisys, NetApp, and EMC (probably not an exhaustive list, but
they're the ones that spring immediately to mind) all use non-standard
disk sector sizes in some of their systems to hold additional
validation information (maintained by software or firmware well above
the disk level) aimed at catching some (but in most cases not all) of
these unreported errors.

That's one way things are done.
You're awfully long on generalizations and short on specifics, Jerry -
typical Usenet loud-mouthed ignorance. But at least you seem to
recognize that some rather significant industry players consider this
kind of error sufficiently important to take steps to catch them - as
ZFS does but RAID per se does not (that being the nub of this
discussion, in case you had forgotten).

Now, by all means tell us some of the *other* ways that such 'things are
done'.
>
>Silent errors are certainly rare, but they happen. ZFS catches them.
RAID does not. End of story.

And what about data which is corrupted once it's placed in the ZFS
buffer? ZFS buffers are in RAM, and can be overwritten at any time.
All file data comes from some RAM buffer, Jerry - even that handed to a
firmware RAID. So if it can be corrupted in system RAM, firmware RAID
is no cure at all.
>
And ZFS itself can be corrupted - telling the disk to write to the wrong
sector, for instance. It is subject to viruses.
All file data comes from some software OS environment, Jerry - same
comment as above.
It runs only on UNIX.
My, you do wander: what does this have to do with a discussion about
the value of different approaches to ensuring data integrity?
>
The list goes on.
Perhaps in your own fevered imagination.

But these are things which can occur much more easily
than the hardware errors you mention.
It really does depend on the environment: some system software
environments are wide-open to corruption, while others are so
well-protected from external attacks and so internally bullet-proof that
they often have up-times of a decade or more (and in those the
likelihood of disk firmware errors is sufficiently higher than the kind
of software problems that you're talking about that, by George, their
vendors find it worthwhile to take the steps I mentioned to guard
against them).

But, once again, in the cases where your OS integrity *is* a significant
problem, then firmware RAID isn't going to save you anyway.

And they are conveniently ignored
by those who have bought the ZFS hype.
I really don't know what you've got against ZFS, Jerry, save for the
fact that discussing it has so clearly highlighted your own
incompetence. The only 'hype' that I've noticed around ZFS involves its
alleged 128-bitness (when its files only reach 64 - or is it 63? - bits
in size, and the need for more than 70 - 80 bits of total file system
size within the next few decades is rather difficult to justify).

But its ability to catch the same errors that far more expensive
products from the likes of IBM, NetApp, and EMC are designed to catch is
not hype: it's simple fact.
>
>...

>>The big difference being ZFS is done in software, which requires CPU
cycles and other resources.


Since when was this discussion about use of resources rather than
integrity (not that ZFS's use of resources for implementing its own
RAID-1/RAID-10 facilities is significant anyway)?

It's always about performance. 100% integrity is no good if you need
100% of the system resources to handle it.
Horseshit. It's only 'about performance' when the performance impact is
significant. In the case of ZFS's mirroring implementation, it isn't of
any significance at all (let alone any *real* drag on the system).
>
>> It's also open to corruption.


No more than the data that some other file system gives to a hardware
RAID implementation would be: it all comes from the same place (main
memory).

Ah, but this is much more likely than once it's been passed off to the
hardware, no?
No: once there's any noticeable likelihood of corruption in system RAM,
then it really doesn't matter how reliable the rest of the system is.
>
>However, because ZFS subsequently checks what it wrote against a
*separate* checksum, if it *was* corrupted below the
request-submission level ZFS is very likely to find out, whereas a
conventional RAID implementation (and the higher layers built on top
of it) won't: they just write what (they think) they're told to, with
no additional check.

So? If the buffer is corrupted, the checksum will be, also.
No: in many (possibly all - I'd have to check the code to make sure)
cases ZFS establishes the checksum when the data is moved *into* the
buffer (and IIRC performs any compression and/or encryption at that
point as well: it's a hell of a lot less expensive to do all these at
once as the data is passing through the CPU cache on the way to the
buffer than to fetch it back again later).

And if the
data is written to the wrong sector, the checksum will still be correct.
No: if the data is written to the wrong sector, any subsequent read
targeting the correct sector will find a checksum mismatch (as will any
read to the sector which was incorrectly written).
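A small sketch may make the point clearer (a toy model, not ZFS source; WriteBuffer and its fields are invented names): if the checksum is computed at the moment the data is copied *into* the buffer, corruption of the buffer after that point no longer takes the checksum with it, so the mismatch is detectable.

import hashlib

class WriteBuffer:
    """Toy buffer that records a checksum at the moment data is copied in,
    so corruption that happens *after* that point can still be detected."""

    def __init__(self, payload: bytes):
        self.buf = bytearray(payload)                      # data headed for disk
        self.cksum = hashlib.sha256(payload).digest()      # captured at copy-in

    def verify(self) -> bool:
        return hashlib.sha256(bytes(self.buf)).digest() == self.cksum

wb = WriteBuffer(b"row data destined for the mirror")
assert wb.verify()

wb.buf[3] ^= 0xFF          # a stray write scribbles on the buffer later
assert not wb.verify()     # mismatch: the damage is caught, not passed off as good

Corruption that happens before the checksum is taken remains invisible to this scheme, which is the residual exposure that any file system, hardware RAID included, shares.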
>
>>RAID-1 and RAID-10 are implemented in hardware/firmware which cannot be
corrupted (read-only memory) and require no CPU cycles.


If your operating system and file system have been corrupted, you've
got problems regardless of how faithfully your disk hardware transfers
this corruption to its platters: this alleged deficiency compared
with a hardware implementation is just not an issue.

The hardware is MUCH MORE RELIABLE than the software.
Once again, nitwit: if the OS-level software is not reliable, it
*doesn't matter* how reliable the hardware is.
>
>You've also suggested elsewhere that a hardware implementation is less
likely to contain bugs, which at least in this particular instance is
nonsense: ZFS's RAID-1/10 implementation benefits from the rest of
its design such that it's likely *far* simpler than any
high-performance hardware implementation (with its controller-level
cache management and deferred write-back behavior) is, and hence if
anything likely *less* buggy.

Yea, right. Keep believing it.
As will anyone else remotely well-acquainted with system hardware and
software. 'Firmware' is just software that someone has committed to
silicon, after all: it is just as prone to bugs as the system-level
software that you keep disparaging - more so, when it's more complex
than the system-software implementation.

....

You're more likely to have something overwritten in
memory than a silent error.
Then (since you seem determined to keep ignoring this point) why on
Earth do you suppose that entirely reputable companies like IBM, NetApp,
and EMC go to such lengths to catch them? If it makes sense for them
(and for their customers), then it's really difficult to see why ZFS's
abilities in that area wouldn't be significant.

....
>>>>But when properly implemented, RAID-1 and RAID-10 will detect and
correct even more errors than ZFS will.

A complete disk crash, for instance. Even Toby admitted ZFS cannot
recover from a disk crash.

ZFS is good. But it's a cheap software implementation of an
expensive hardware recovery system. And there is no way software can
do it as well as hardware does.


You at least got that right: ZFS does it considerably better, not
merely 'as well'. And does so at significantly lower cost (so you got
that part right too).

The fact that you even try to claim that ZFS is better than a RAID-1 or
RAID-10 system shows just how little you understand critical systems,
and how much you've bought into the ZFS hype.
The fact that you so consistently misrepresent ZFS as being something
*different* from RAID-1/10 shows that you don't even understand the
definition of RAID: the ZFS back-end *is* RAID-1/10 - it just leverages
its implementation in software to improve its reliability (because
writing directly from system RAM to disk without an intermediate step
through a common controller buffer significantly improves the odds that
*one* of the copies will be correct - and the checksum enables ZFS to
determine which one that is).
>
>The one advantage that a good hardware RAID-1/10 implementation has
over ZFS relates to performance, primarily small-synchronous-write
latency: while ZFS can group small writes to achieve competitive
throughput (in fact, superior throughput in some cases), it can't
safely report synchronous write completion until the data is on the
disk platters, whereas a good RAID controller will contain mirrored
NVRAM that can guarantee persistence in microseconds rather than
milliseconds (and then destage the writes to the platters lazily).

That's one advantage, yes.
>Now, ZFS does have an 'intent log' for small writes, and does have the
capability of placing this log on (mirrored) NVRAM to achieve
equivalent small-synchronous-write latency - but that's a hardware
option, not part and parcel of ZFS itself.

Oh, so you're now saying that synchronous writes may not be truly
synchronous with ZFS? That's something I didn't know. I thought ZFS
was smarter than that.
ZFS is, of course, smarter than that - too bad that you aren't.

I said nothing whatsoever to suggest that ZFS did not honor requests to
write synchronously: reread what I wrote until you understand it (and
while you're at it, reread what, if anything, you have read about ZFS
until you understand that as well: your ignorant bluster is becoming
more tiresome than amusing by this point).
>
>...
>>>Please name even one.


Why am I not surprised that you dodged that challenge?

Because I'm not the one making the claim.
My - you're either an out-right liar or even more abysmally incompetent
than even I had thought.

Let me refresh your memory: the exchange went

[quote]
But when properly implemented, RAID-1 and RAID-10 will detect and
correct even more errors than ZFS will.

Please name even one.

[end quote]

At the risk of being repetitive (since reading comprehension does not
appear to be your strong suit), the specific claim (yours, quoted above)
was that "when properly implemented, RAID-1 and RAID-10 will detect and
correct even more errors than ZFS will."

I challenged you to name even one such - and I'm still waiting.

....
ROFLMAO! Just like a troll. Jump into the middle of a discussion
uninvited.
I can understand why you'd like to limit the discussion to people who
have even less of a clue than you do (and thus why you keep cropping out
newsgroups where more knowledgeable people might be found), but toby
invited those in comp.arch.storage to participate by cross-posting there.

Since I tend to feel that discussions should continue where they
started, and since it seemed appropriate to respond directly to your
drivel rather than through toby's quoting in his c.a.storage post, I
came over here - hardly uninvited. Rich Teer (who I suspect also
qualifies as significantly more knowledgeable than you) chose to
continue in c.a.storage; people like Jeff Bonwick probably just aren't
interested (as I said, deflating incompetent blowhards is kind of a
hobby of mine, plus something of a minor civic duty - otherwise, I
wouldn't bother with you either).

- bill
Nov 12 '06 #54
>Do you even know what a silent error is? It's an error that the disk
>does not notice, and hence cannot report.

Yep. And please tell me EXACTLY how this can occur.
Some drives will accept a sector write into an on-drive buffer and
indicate completion of the write before even attempting it. This
speeds things up. A subsequent discovery of a problem with the
sector-header would not be reported *on that write*. (I don't know
how stuff like this does get reported, possibly on a later write
by a completely different program, but in any case, it's likely too
late to report it to the caller at the user-program level).

Such drives *might* still be able to write data in the buffer cache
(assuming no bad sectors) even if the power fails: something about
using the momentum of the spinning drive to generate power for a
few milliseconds needed. Or maybe just a big capacitor on the
drive.

Drives like this shouldn't be used in a RAID setup, or the option
to indicate completion should be turned off. In the case of SCSI,
the RAID controller probably knows how to do this. In the case of
IDE, it might be manufacturer-specific.

There's a reason that some RAID setups require drives with modified
firmware.
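For an application-level illustration of the same point (a hedged sketch, not tied to any particular drive or controller): opening the file for synchronous I/O and flushing explicitly is how software avoids treating the drive's early "completed" acknowledgement from its write cache as durability. On Linux the drive-level write cache itself is commonly toggled with something like hdparm -W 0, assuming the drive honours it.

import os

def durable_write(path: str, offset: int, payload: bytes) -> None:
    """Write payload at offset and do not return until the OS has asked the
    device to make it stable (O_SYNC write plus an explicit fsync)."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_SYNC, 0o600)
    try:
        os.pwrite(fd, payload, offset)
        os.fsync(fd)     # on most modern systems this also flushes the drive cache
    finally:
        os.close(fd)

durable_write("journal.dat", 0, b"committed record")

Whether that is enough still depends on firmware that actually honours the flush request, which is exactly why some RAID setups insist on qualified drives or modified firmware.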

>In some of your other recent drivel you've seemed to suggest that this
simply does not happen. Well, perhaps not in your own extremely limited
experience, but you really shouldn't generalize from that.

I didn't say it CAN'T happen. My statement was that it is SO UNLIKELY
to happen that it can be virtually ignored.
Nov 12 '06 #55
Bill Todd wrote:
Jerry Stuckle wrote:
>Bill Todd wrote:
>>Jerry Stuckle wrote:

Bill Todd wrote:

Jerry Stuckle wrote:
>
...
>
>ZFS is not proof against silent errors - they can still occur.
>
>
>
>
Of course they can, but they will be caught by the background
verification scrubbing before much time passes (i.e., within a time
window that radically reduces the likelihood that another disk will
fail before the error is caught and corrected), unlike the case
with conventional RAID (where they aren't caught at all, and rise
up to bite you - with non-negligible probability these days - if
the good copy then dies).
>
And ZFS *is* proof against silent errors in the sense that data
thus mangled will not be returned to an application (i.e., it will
be caught when read if the background integrity validation has not
yet reached it) - again, unlike the case with conventional
mirroring, where there's a good chance that it will be returned to
the application as good.
>
The same is true with RAID-1 and RAID-10. An error on the disk will
be detected and returned by the hardware to the OS.

I'd think that someone as uninformed as you are would have thought
twice about appending an ad for his services to his Usenet babble.
But formal studies have shown that the least competent individuals
seem to be the most confident of their opinions (because they just
don't know enough to understand how clueless they really are).

How you wish. I suspect I have many years more experience and
knowledge than you.


You obviously suspect a great deal. Too bad that you don't have any
real clue. Even more too bad that you insist on parading that fact so
persistently.
> And am more familiar with fault tolerant systems.


Exactly how many have you yourself actually architected and built,
rather than simply using the fault-tolerant hardware and software that
others have provided? I've been centrally involved in several.
Disk drive systems? I admit, none. My design experience has been more
in the digital arena - although I have done some analog design -
balanced amplifiers, etc.

How many disk drive systems have you actually had to troubleshoot?
Locate and replace a failing head, for example? Or a bad op amp in a
read amplifier? Again, none, I suspect. I've done quite a few in my
years.

And from your comments you show absolutely no knowledge of the
underlying electronics, much less the firmware involved. Yet you claim
you've been "centrally involved". Doing what - assembling the pieces?
All you've done is assemble the pieces.
>>
>>Do you even know what a silent error is? It's an error that the disk
does not notice, and hence cannot report.

Yep. And please tell me EXACTLY how this can occur.


I already did, but it seems that you need things spelled out in simpler
words: mostly, due to bugs in firmware in seldom-used recovery paths
that the vagaries of handling electro-mechanical devices occasionally
require. The proof is in the observed failures: as I said, end of
story (if you're not aware of the observed failures, it's time you
educated yourself in that area rather than kept babbling on
incompetently about the matter).
OK, and exactly how many of these bugs are there? Disk drive and
similar firmware is some of the most specialized and most heavily tested
firmware on the planet.

And show me hard facts on the failure. Otherwise you're just spewing
marketing bullshit like all trolls - overstating the weaknesses of
the other methods, while maximizing your product's strengths and ignoring
its weaknesses.

You have made claims about how bad disk drives are without ZFS. It's
amazing that computers work at all with all those errors you claim exist!
>>
>>Duh.

In some of your other recent drivel you've seemed to suggest that
this simply does not happen. Well, perhaps not in your own extremely
limited experience, but you really shouldn't generalize from that.

I didn't say it CAN'T happen. My statement was that it is SO UNLIKELY
to happen that it can be virtually ignored.


My word - you can't even competently discuss what you yourself have so
recently (and so incorrectly) stated.

For example, in response to my previous statement that ZFS detected
errors that RAID-1/10 did not, you said "The same is true with RAID-1
and RAID-10. An error on the disk will be detected and returned by the
hardware to the OS" - no probabilistic qualification there at all.
The probabilities are much higher that you will be killed by a meteor in
the next 10 years.

Drive electronics detect errors all the time. They automatically mark
bad spots on the disk. They correct read errors, and if using
verification, they correct write errors.
And on the subject of "data decaying after it is written to disk" (which
includes erroneous over-writes), when I asserted that ZFS "will catch it
before long, even in cases where conventional disk scrubbing would not"
you responded "So do RAID-1 and RAID-10" - again, no probabilistic
qualification whatsoever (leaving aside your incorrect assertion about
RAID's ability to catch those instances that disk-scrubbing does not
reveal).
You never made any probabilistic qualification, so neither did I. You
want probabilities, you need to supply them.

You even offered up an example *yourself* of such a firmware failure
mode: "A failing controller can easily overwrite the data at some later
time." Easily, Jerry? That doesn't exactly sound like a failure mode
that 'can be virtually ignored' to me. (And, of course, you accompanied
that pearl of wisdom with another incompetent assertion to the effect
that ZFS would not catch such a failure, when of course that's
*precisely* the kind of failure that ZFS is *designed* to catch.)
I didn't say it could be ignored. I did say it could be handled by a
properly configured RAID-1 or RAID-10 array.
Usenet is unfortunately rife with incompetent blowhards like you - so
full of themselves that they can't conceive of someone else knowing more
than they do about anything that they mistakenly think they understand,
and so insistent on preserving that self-image that they'll continue
spewing erroneous statements forever (despite their repeated promises to
stop: "I'm not going to respond to you any further", "I'm also not
going to discuss this any more with you", "I'm finished with this
conversation" all in separate responses to toby last night - yet here
you are this morning responding to him yet again).
And unfortunately, it's full of trolls like you who jump unwanted into
conversations where they have no experience, blow assertions out their
asses, and then attack the other person.

I've seen assholes like you before. You're a dime a dozen.
I'm not fond of blowhards, nor of their ability to lead others astray
technically if they're not confronted. Besides, sticking pins in such
over-inflated balloons is kind of fun.
Then you should learn to keep your fat mouth shut about things you know
nothing about.
...
>>IBM, Unisys, NetApp, and EMC (probably not an exhaustive list, but
they're the ones that spring immediately to mind) all use
non-standard disk sector sizes in some of their systems to hold
additional validation information (maintained by software or firmware
well above the disk level) aimed at catching some (but in most cases
not all) of these unreported errors.

That's one way things are done.


You're awfully long on generalizations and short on specifics, Jerry -
typical Usenet loud-mouthed ignorance. But at least you seem to
recognize that some rather significant industry players consider this
kind of error sufficiently important to take steps to catch them - as
ZFS does but RAID per se does not (that being the nub of this
discussion, in case you had forgotten).

Now, by all means tell us some of the *other* ways that such 'things are
done'.
Get me someone competent in the hardware/firmware end and I can talk all
the specifics you want.

But you've made a hell of a bunch of claims and had no specifics on your
own - other than "ZFS is great and RAID sucks". ROFLMAO!

>>
>>Silent errors are certainly rare, but they happen. ZFS catches them.
RAID does not. End of story.

And what about data which is corrupted once it's placed in the ZFS
buffer? ZFS buffers are in RAM, and can be overwritten at any time.


All file data comes from some RAM buffer, Jerry - even that handed to a
firmware RAID. So if it can be corrupted in system RAM, firmware RAID
is no cure at all.
Ah, but let's see ZFS correct for that! Oh, sorry - I found a failure
mode your beloved file system doesn't handle, didn't I?
>>
And ZFS itself can be corrupted - telling the disk to write to the
wrong sector, for instance. It is subject to viruses.


All file data comes from some software OS environment, Jerry - same
comment as above.
So you admit it can be corrupted.
> It runs only on UNIX.


My, you do wander: what does this have to do with a discussion about
the value of different approaches to ensuring data integrity?
Because your beloved ZFS isn't worth a damn on any other system, that's
why. Let's see it run on MVS/XA, for instance. It doesn't work.
RAID-1/RAID-10 does.
>>
The list goes on.


Perhaps in your own fevered imagination.

But these are things which can occur much more easily
Not at all. I'm just not going to waste my time listing all the
possibilities.
>than the hardware errors you mention.


It really does depend on the environment: some system software
environments are wide-open to corruption, while others are so
well-protected from external attacks and so internally bullet-proof that
they often have up-times of a decade or more (and in those the
likelihood of disk firmware errors is sufficiently higher than the kind
of software problems that you're talking about that, by George, their
vendors find it worthwhile to take the steps I mentioned to guard
against them).

But, once again, in the cases where your OS integrity *is* a significant
problem, then firmware RAID isn't going to save you anyway.

And they are conveniently ignored
And you conveniently ignore how ZFS can be corrupted. In fact, it is
much more easily corrupted than basic file systems using RAID-1/RAID-10
arrays - if for no other reason than it contains a lot more code and
needs to do more work.
>by those who have bought the ZFS hype.


I really don't know what you've got against ZFS, Jerry, save for the
fact that discussing it has so clearly highlighted your own
incompetence. The only 'hype' that I've noticed around ZFS involves its
alleged 128-bitness (when its files only reach 64 - or is it 63? - bits
in size, and the need for more than 70 - 80 bits of total file system
size within the next few decades is rather difficult to justify).

But its ability to catch the same errors that far more expensive
products from the likes of IBM, NetApp, and EMC are designed to catch is
not hype: it's simple fact.
I don't have anything against ZFS. What I don't like is blowhards like
you who pop in with a bunch of marketing hype but no real facts nor
knowledge of what you speak.
>>
>>...

The big difference being ZFS is done in software, which requires CPU
cycles and other resources.

Since when was this discussion about use of resources rather than
integrity (not that ZFS's use of resources for implementing its own
RAID-1/RAID-10 facilities is significant anyway)?

It's always about performance. 100% integrity is no good if you need
100% of the system resources to handle it.


Horseshit. It's only 'about performance' when the performance impact is
significant. In the case of ZFS's mirroring implementation, it isn't of
any significance at all (let alone any *real* drag on the system).
Keep believing that. It will help you to justify your statements in
your mind.
>>
>>> It's also open to corruption.

No more than the data that some other file system gives to a hardware
RAID implementation would be: it all comes from the same place (main
memory).

Ah, but this is much more likely than once it's been passed off to the
hardware, no?


No: once there's any noticeable likelihood of corruption in system RAM,
then it really doesn't matter how reliable the rest of the system is.
And ZFS can be corrupted more easily than more basic file systems. A
point you conveniently ignore.
>>
>>However, because ZFS subsequently checks what it wrote against a
*separate* checksum, if it *was* corrupted below the
request-submission level ZFS is very likely to find out, whereas a
conventional RAID implementation (and the higher layers built on top
of it) won't: they just write what (they think) they're told to,
with no additional check.

So? If the buffer is corrupted, the checksum will be, also.


No: in many (possibly all - I'd have to check the code to make sure)
cases ZFS establishes the checksum when the data is moved *into* the
buffer (and IIRC performs any compression and/or encryption at that
point as well: it's a hell of a lot less expensive to do all these at
once as the data is passing through the CPU cache on the way to the
buffer than to fetch it back again later).
Gee, someone who can actually read code? WOW!
And if the
>data is written to the wrong sector, the checksum will still be correct.


No: if the data is written to the wrong sector, any subsequent read
targeting the correct sector will find a checksum mismatch (as will any
read to the sector which was incorrectly written).
So pray tell - how is it going to do that? The data was written just as
it was checksummed.
>>
>> RAID-1 and RAID-10 are implemented in hardware/firmware which cannot be
corrupted (read-only memory) and require no CPU cycles.

If your operating system and file system have been corrupted, you've
got problems regardless of how faithfully your disk hardware
transfers this corruption to its platters: this alleged deficiency
compared with a hardware implementation is just not an issue.

The hardware is MUCH MORE RELIABLE than the software.


Once again, nitwit: if the OS-level software is not reliable, it
*doesn't matter* how reliable the hardware is.
Ah, more personal attacks. Brilliant!
>>
>>You've also suggested elsewhere that a hardware implementation is
less likely to contain bugs, which at least in this particular
instance is nonsense: ZFS's RAID-1/10 implementation benefits from
the rest of its design such that it's likely *far* simpler than any
high-performance hardware implementation (with its controller-level
cache management and deferred write-back behavior) is, and hence if
anything likely *less* buggy.

Yea, right. Keep believing it.


As will anyone else remotely well-acquainted with system hardware and
software. 'Firmware' is just software that someone has committed to
silicon, after all: it is just as prone to bugs as the system-level
software that you keep disparaging - more so, when it's more complex
than the system-software implementation.
That right there shows how little you understand disk technology today.
Firmware is less prone to bugs because it is analyzed and tested so
much more thoroughly than software, both by humans and machines.

After all - a recall on disks with a firmware bug would cost any disk
company at least tens of millions of dollars - if it didn't bankrupt the
company. It's very cheap in comparison to spend a few million
analyzing, testing, retesting, etc. all the firmware.

Additionally, being a hardware interface, it has limited actions
required of it. And those functions can easily be emulated by system
test sets, which can duplicate both ends of the controller. They have
an equivalent to the system bus for commands, and a replacement for the
disk electronics to test the other end. Many have even gone to
simulating the signals to/from the R/W heads themselves. With such test
sets they can automatically simulate virtually every possible failure
mode of the disk, validating all of the hardware and firmware.

But of course, if you were as smart as you claim, you would know this.
And you wouldn't be making the asinine claims about bugs in firmware
that you are.
...

You're more likely to have something overwritten in
>memory than a silent error.


Then (since you seem determined to keep ignoring this point) why on
Earth do you suppose that entirely reputable companies like IBM, NetApp,
and EMC go to such lengths to catch them? If it makes sense for them
(and for their customers), then it's really difficult to see why ZFS's
abilities in that area wouldn't be significant.
Another claim without any proof - and you accuse me of making claims.
Another typical troll behavior.

I don't know about NetApp or EMC, but I still have contacts in IBM. And
they do not "go to such lengths" to catch silent errors.
...
>>>>>But when properly implemented, RAID-1 and RAID-10 will detect and
>correct even more errors than ZFS will.
>
>
>

A complete disk crash, for instance. Even Toby admitted ZFS cannot
recover from a disk crash.

ZFS is good. But it's a cheap software implementation of an
expensive hardware recovery system. And there is no way software
can do it as well as hardware does.

You at least got that right: ZFS does it considerably better, not
merely 'as well'. And does so at significantly lower cost (so you
got that part right too).

The fact that you even try to claim that ZFS is better than a RAID-1
or RAID-10 system shows just how little you understand critical
systems, and how much you've bought into the ZFS hype.


The fact that you so consistently misrepresent ZFS as being something
*different* from RAID-1/10 shows that you don't even understand the
definition of RAID: the ZFS back-end *is* RAID-1/10 - it just leverages
its implementation in software to improve its reliability (because
writing directly from system RAM to disk without an intermediate step
through a common controller buffer significantly improves the odds that
*one* of the copies will be correct - and the checksum enables ZFS to
determine which one that is).
And you're saying it's the same? You really don't understand what
RAID-1 or RAID-10 is.

No, ZFS is just a cheap software replacement for an expensive hardware
system.
>>
>>The one advantage that a good hardware RAID-1/10 implementation has
over ZFS relates to performance, primarily small-synchronous-write
latency: while ZFS can group small writes to achieve competitive
throughput (in fact, superior throughput in some cases), it can't
safely report synchronous write completion until the data is on the
disk platters, whereas a good RAID controller will contain mirrored
NVRAM that can guarantee persistence in microseconds rather than
milliseconds (and then destage the writes to the platters lazily).

That's one advantage, yes.
>>Now, ZFS does have an 'intent log' for small writes, and does have
the capability of placing this log on (mirrored) NVRAM to achieve
equivalent small-synchronous-write latency - but that's a hardware
option, not part and parcel of ZFS itself.

Oh, so you're now saying that synchronous writes may not be truly
synchronous with ZFS? That's something I didn't know. I thought ZFS
was smarter than that.


ZFS is, of course, smarter than that - too bad that you aren't.

I said nothing whatsoever to suggest that ZFS did not honor requests to
write synchronously: reread what I wrote until you understand it (and
while you're at it, reread what, if anything, you have read about ZFS
until you understand that as well: your ignorant bluster is becoming
more tiresome than amusing by this point).
No, I'm just trying to understand your statement.
[quote]
>>
>>...

Please name even one.

Why am I not surprised that you dodged that challenge?

Because I'm not the one making the claim.


My - you're either an out-right liar or even more abysmally incompetent
than even I had thought.

Let me refresh your memory: the exchange went

>
But when properly implemented, RAID-1 and RAID-10 will detect and
correct even more errors than ZFS will.

Please name even one.

[end quote]

At the risk of being repetitive (since reading comprehension does not
appear to be your strong suit), the specific claim (yours, quoted above)
was that "when properly implemented, RAID-1 and RAID-10 will detect and
correct even more errors than ZFS will."
But trolling does seem to be your strong suit.
I challenged you to name even one such - and I'm still waiting.
A corruption in the ZFS buffer between writes, where different data is
written to one disk than the other.

Errors where ZFS itself is corrupted.
...
>ROFLMAO! Just like a troll. Jump into the middle of a discussion
uninvited.


I can understand why you'd like to limit the discussion to people who
have even less of a clue than you do (and thus why you keep cropping out
newsgroups where more knowledgeable people might be found), but toby
invited those in comp.arch.storage to participate by cross-posting there.

Since I tend to feel that discussions should continue where they
started, and since it seemed appropriate to respond directly to your
drivel rather than through toby's quoting in his c.a.storage post, I
came over here - hardly uninvited. Rich Teer (who I suspect also
qualifies as significantly more knowledgeable than you) chose to
continue in c.a.storage; people like Jeff Bonwick probably just aren't
interested (as I said, deflating incompetent blowhards is kind of a
hobby of mine, plus something of a minor civic duty - otherwise, I
wouldn't bother with you either).

- bill

No, actually, I'd much rather be discussing this with someone who has
some real knowledge, not blowhard trolls like you.

So I'm not even going to bother to respond to you any more. I prefer to
carry out intelligent conversations with intelligent people.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================
Nov 12 '06 #56
Gordon Burditt wrote:
>>>Do you even know what a silent error is? It's an error that the disk
does not notice, and hence cannot report.

Yep. And please tell me EXACTLY how this can occur.


Some drives will accept a sector write into an on-drive buffer and
indicate completion of the write before even attempting it. This
speeds things up. A subsequent discovery of a problem with the
sector-header would not be reported *on that write*. (I don't know
how stuff like this does get reported, possibly on a later write
by a completely different program, but in any case, it's likely too
late to report it to the caller at the user-program level).
Yes, it's very common for drives to buffer data like this. But also,
drives have a "write-through" command which forces synchronous writing.
The drive doesn't return from such a write until the data is
physically on the drive. And they usually even have a verify flag,
which rereads the data after it has been written and compares it to what
was written.
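As a rough application-level analogue of that verify flag (drives that offer it do the read-back internally, from the media; this toy just shows the shape of the idea, and the file name is only an example):

import os

def write_with_verify(path: str, offset: int, payload: bytes) -> None:
    """Write, force it out, then read the same range back and compare."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.pwrite(fd, payload, offset)
        os.fsync(fd)                              # push it through the caches
        readback = os.pread(fd, len(payload), offset)
        if readback != payload:
            raise IOError("verify-after-write mismatch at offset %d" % offset)
    finally:
        os.close(fd)

write_with_verify("table.MYD", 0, b"1KB row payload...")

In practice a host-level read-back like this may be satisfied from the page cache rather than the platters, which is why a verify done inside the drive, reading back from the media itself, is the stronger check.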
Such drives *might* still be able to write data in the buffer cache
(assuming no bad sectors) even if the power fails: something about
using the momentum of the spinning drive to generate power for a
few milliseconds needed. Or maybe just a big capacitor on the
drive.
I don't know of any which are able to do anything more than complete the
current operation in case of a power failure. Using the drive as a
generator would brake it too quickly, and it would take a huge capacitor
to handle the current requirements for a seek (seeks require huge
current spikes - several amps - for a very short time).
Drives like this shouldn't be used in a RAID setup, or the option
to indicate completion should be turned off. In the case of SCSI,
the RAID controller probably knows how to do this. In the case of
IDE, it might be manufacturer-specific.
Actually, they can under certain conditions.

Although the drive itself couldn't have a big enough capacitor, the
power supply could keep it up for a few hundred milliseconds.

First of all, the drive typically writes any buffered data pretty
quickly, anyway (usually < 100 ms) when it is idle. Of course, a
heavily loaded disk will slow this down.

But in the case of a power failure, the power supply needs to
immediately raise a "power fail" condition to the drive. The drive
should then not accept any new operations and immediately complete any
which are in progress. Properly designed power supplies will take this
into consideration and have enough storage to keep the drives going for
a minimum time.
There's a reason that some RAID setups require drives with modified
firmware.
Yep, among them being drives which are not set up as above.
>
>>>In some of your other recent drivel you've seemed to suggest that this
simply does not happen. Well, perhaps not in your own extremely limited
experience, but you really shouldn't generalize from that.

I didn't say it CAN'T happen. My statement was that it is SO UNLIKELY
to happen that it can be virtually ignored.


--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================
Nov 12 '06 #57
Frank Cusack wrote:
On 11 Nov 2006 19:30:25 -0800 "toby" <to**@telegraphics.com.au> wrote:
>>Jerry Stuckle wrote:
>>>REAL RAID-1 and RAID-10 are implemented in hardware/firmware, not a
software system such as ZFS.


This simple statement shows a fundamental misunderstanding of the basics,
let alone zfs.

-frank
Not at all. RAID-1 and RAID-10 devices are file system neutral. Just
like disk systems are file-system neutral.

And anyone who thinks otherwise doesn't understand real RAID
implementations - only cheap ones which use software for all or part of
their implementation.

Real RAID arrays are not cheap. $100-500/GB is not out of the question.
And you won't find them at COMP-USA or other retailers.

But you don't see those very often on PC's. Most of the time you see
cheap implementations where some of the work is done in software.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================
Nov 12 '06 #58
Jerry Stuckle wrote:

....

I suspect I have many years more experience and
>>knowledge than you.


You obviously suspect a great deal. Too bad that you don't have any
real clue. Even more too bad that you insist on parading that fact so
persistently.
>> And am more familiar with fault tolerant systems.


Exactly how many have you yourself actually architected and built,
rather than simply using the fault-tolerant hardware and software that
others have provided? I've been centrally involved in several.

Disk drive systems? I admit, none.
So no surprise there, but at least a little credit for honesty.

My design experience has been more
in the digital arena - although I have done some analog design -
balanced amplifiers, etc.
Gee, whiz - and somehow you think that qualifies you to make pompous and
inaccurate statements about other areas that you know so much less
about. There's a lesson there, but I doubt your ability to learn it.
>
How many disk drive systems have you actually had to troubleshoot?
Locate and replace a failing head, for example? Or a bad op amp in a
read amplifier? Again, none, I suspect. I've done quite a few in my
years.
As I suspected, a tech with inflated delusions of competence. You
really ought to learn the difference between being able to troubleshoot
a problem (that's what VCR repair guys do: it's not exactly rocket
science) and being able to design (or even really understand) the system
that exhibits it.
>
And from your comments you show absolutely no knowledge of the
underlying electronics, much less the firmware involved.
My job at EMC was designing exactly such firmware for a new high-end
disk array, Jerry. And I was working closely with people who had
already been through that exercise for their existing Symmetrix product.
You seem to be laboring under the illusion that 'firmware' is somehow
significantly different from software.

Yet you claim
you've been "centrally involved". Doing what - assembling the pieces?
All you've done is assemble the pieces.
The difference, Jerry, (since you seem to be ignorant of it, though
that's apparently only one small drop in the vast ocean of your
ignorance) is that the pieces you're talking about are *not*
fault-tolerant - they don't even reliably report the faults which they
encounter.

Building a fault-tolerant system involves understanding the limits of
such underlying pieces and designing ways to compensate for them -
exactly the kind of thing that ZFS does with its separate checksumming
(and IBM, NetApp, and EMC do, though not always as effectively, with
their additional in-sector information that contains higher-level
sanity-checks than the disk checksums have).

....
show me hard facts on the failure.
As you said recently to me, I'm not going to do your homework for you:
educating you (assuming that it possible at all) is not specifically
part of my agenda, just making sure that anyone who might otherwise take
your bombastic certainty as evidence of actual knowledge understands the
shallowness of your understanding.

The fact that the vendors whom I cited above take this kind of failure
seriously enough to guard against it should be evidence enough for
anyone who does not have both eyes firmly squeezed shut. If disk
manufacturers were more forthcoming (as they were a few years ago) about
providing information about undetected error rates the information
wouldn't be as elusive now - though even then I suspect that it only
related to checksum strength rather than to errors caused by firmware bugs.

....
You have made claims about how bad disk drives are without ZFS.
Were you an even half-competent reader, you would know that I have made
no such claims: I've only observed that ZFS catches certain classes of
errors that RAID per se cannot - and that these classes are sufficiently
important that other major vendors take steps to catch them as well.

It's
amazing that computers work at all with all those errors you claim exist!
Hey, they work without any redundancy at all - most of the time. The
usual question is, just how important is your data compared with the
cost of protecting it better? ZFS has just significantly changed the
balance in that area, which is why it's interesting.

....
The probabilities are much higher that you will be killed by a meteor in
the next 10 years.
That's rather difficult to evaluate. On the one hand, *no one* in
recorded history has been killed by a meteor, which would suggest that
the probability of such is rather low indeed. On the other, the
probability of a large impact that would kill a significant percentage
of the Earth's population (and thus with non-negligible probability
include me) could be high enough to worry about.

But of course that's not the real issue anyway. A single modest-sized
server (3+ raw TB, which only requires 5 disks these days to achieve)
contains more disk sectors than there are people on the Earth, and even
a single error in maintaining those 6 billion sectors leads to
corruption unless it's reliably caught. Large installations can be
1,000 times this size. So while the probability that *any given sector*
will be corrupted is very small, the probability that *some* sector will
be corrupted is sufficiently disturbing that reputable vendors protect
against it, and enjoy considerable economic success doing so.
>
Drive electronics detect errors all the time.
No, Jerry: they just detect errors *almost* all of the time - and
that's the problem, in a nutshell. Try to wrap what passes for your
mind around that, and you might learn something.

....
>You even offered up an example *yourself* of such a firmware failure
mode: "A failing controller can easily overwrite the data at some
later time." Easily, Jerry? That doesn't exactly sound like a
failure mode that 'can be virtually ignored' to me. (And, of course,
you accompanied that pearl of wisdom with another incompetent
assertion to the effect that ZFS would not catch such a failure, when
of course that's *precisely* the kind of failure that ZFS is
*designed* to catch.)

I didn't say it could be ignored. I did say it could be handled by a
properly configured RAID-1 or RAID-10 array.
And yet have been conspicuously silent when challenged to explain
exactly how - because, of course, no RAID can do what you assert it can
above: the most it could do (if it actually performed background
comparisons between data copies rather than just scrubbed them to ensure
that they could be read without error) would be to determine that they
did not match - it would have no way to determine which one was correct,
because that information is higher level in nature.

Once again, that's exactly the kind of thing that features such as ZFS's
separate checksums and supplementary in-sector sanity-checks from the
likes of EMC are for. But of course they aren't part of RAID per se at
all: you really should read the original Berkeley papers if you're
still confused about that.
>
>Usenet is unfortunately rife with incompetent blowhards like you - so
full of themselves that they can't conceive of someone else knowing
more than they do about anything that they mistakenly think they
understand, and so insistent on preserving that self-image that
they'll continue spewing erroneous statements forever (despite their
repeated promises to stop: "I'm not going to respond to you any
further", "I'm also not going to discuss this any more with you", "I'm
finished with this conversation" all in separate responses to toby
last night - yet here you are this morning responding to him yet again).

And unfortunately, it's full of trolls like you who jump unwanted into
conversations
As I already observed, I completely understand why you wouldn't want
people around who could easily demonstrate just how incompetent you
really are. Unfortunately for you, this does not appear to be a forum
where you can exercise moderator control to make that happen - so tough
tooties.

....
>>>IBM, Unisys, NetApp, and EMC (probably not an exhaustive list, but
they're the ones that spring immediately to mind) all use
non-standard disk sector sizes in some of their systems to hold
additional validation information (maintained by software or
firmware well above the disk level) aimed at catching some (but in
most cases not all) of these unreported errors.
That's one way things are done.


You're awfully long on generalizations and short on specifics, Jerry -
typical Usenet loud-mouthed ignorance. But at least you seem to
recognize that some rather significant industry players consider this
kind of error sufficiently important to take steps to catch them - as
ZFS does but RAID per se does not (that being the nub of this
discussion, in case you had forgotten).

Now, by all means tell us some of the *other* ways that such 'things
are done'.

Get me someone competent in the hardware/firmware end and I can talk all
the specifics you want.
Sure, Jerry: bluster away and duck the specifics again - wouldn't want
to spoil your perfect record there.

....
>>>Silent errors are certainly rare, but they happen. ZFS catches
them. RAID does not. End of story.
And what about data which is corrupted once it's placed in the ZFS
buffer? ZFS buffers are in RAM, and can be overwritten at any time.


All file data comes from some RAM buffer, Jerry - even that handed to
a firmware RAID. So if it can be corrupted in system RAM, firmware
RAID is no cure at all.

Ah, but let's see ZFS correct for that! Oh, sorry - I found a failure
mode your beloved file system doesn't handle, didn't I?
The point, imbecile, is not that ZFS (or anything else) catches
*everything*: it's that ZFS catches the same kinds of errors that
conventional RAID-1/10 catches (because at the back end it *is*
conventional RAID-1/10), plus other kinds that conventional RAID-1/10
misses.

[massive quantity of drivel snipped - just not worthy of comment at all]
> And if the
>>data is written to the wrong sector, the checksum will still be correct.


No: if the data is written to the wrong sector, any subsequent read
targeting the correct sector will find a checksum mismatch (as will
any read to the sector which was incorrectly written).

So pray tell - how is it going to do that?
Careful there, Jerry: when you ask a question, you risk getting a real
answer that will even further expose the depths of your ignorance.

The data was written just as
it was checksummed.
Try reading what I said again: it really shouldn't be *that* difficult
to understand (unless you really don't know *anything* about how ZFS works).

When ZFS writes data, it doesn't over-write an existing copy if there is
one: it writes the new copy into free space and garbage-collects the
existing copy (unless it has to retain it temporarily for a snapshot)
after the new data is on disk. This means that it updates the metadata
to point to that new copy rather than to the old one, and when doing so
it includes a checksum so that later on, when it reads the data back in,
it can determine with a very high degree of confidence that this data is
indeed what it previously wrote (the feature that conventional RAID
completely lacks).

So if the disk misdirects the write, when the correct sector is later
read in the updated metadata checksum won't match its contents - and if
the incorrectly-overwritten sector is later read through its own
metadata path, that checksum won't match either: in both cases, the
correct information is then read from the other copy and used to update
the corrupted one, something which RAID-1/10 per se simply cannot do
(because it has no way to know which copy is the correct one even if it
did detect the difference by comparing them, though that's also not part
of the standard definition of RAID).
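To make that concrete, here is a toy C sketch of the 'checksum lives one level up' idea. The names and the trivial checksum are invented purely for illustration (ZFS itself uses fletcher or SHA-256 checksums inside much richer block pointers), but it shows why a misdirected or dropped write gets caught on the next read:

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 64

struct blkptr {              /* parent-side pointer to a child block */
    uint64_t location;       /* where the new copy was written       */
    uint64_t checksum;       /* checksum of the child's contents     */
};

static uint64_t cksum(const uint8_t *b, size_t n)
{
    uint64_t s = 0;
    while (n--) s = s * 31 + *b++;
    return s;
}

/* On write: the data goes to a fresh location and the parent pointer
 * records both where it went and what its checksum should be. */
static void write_block(struct blkptr *bp, uint64_t new_location,
                        const uint8_t block[BLOCK_SIZE])
{
    bp->location = new_location;
    bp->checksum = cksum(block, BLOCK_SIZE);
}

/* On read: whatever comes back from bp->location is checked against
 * the checksum held one level up.  A misdirected or dropped write
 * leaves stale data there, the checksum mismatches, and the caller
 * falls back to the other mirror copy. */
static int read_block_ok(const struct blkptr *bp,
                         const uint8_t block[BLOCK_SIZE])
{
    return cksum(block, BLOCK_SIZE) == bp->checksum;
}

int main(void)
{
    uint8_t block[BLOCK_SIZE] = "new data";
    struct blkptr bp;

    write_block(&bp, 1234, block);
    block[0] ^= 0xFF;                 /* simulate a misdirected write */
    printf("checksum ok? %s\n", read_block_ok(&bp, block) ? "yes" : "no");
    return 0;
}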

....

'Firmware' is just software that someone has committed to
>silicon, after all: it is just as prone to bugs as the system-level
software that you keep disparaging - more so, when it's more complex
than the system-software implementation.

That right there shows how little you understand disk technology today.
You're confused and babbling yet again: the firmware in question here
is not disk firmware (though the fact that disk firmware can and does
have bugs is the reason why sanity checks beyond those inherent in RAID
are desirable): it's RAID firmware, since your contention was that it
somehow magically was less bug-prone than RAID software.

Firmware is less prone to bugs because it is analyzed and tested so
much more thoroughly than software, both by humans and machines.
Firmware *is* software. And RAID software can be (and often is) checked
just as thoroughly, because, of course, it's just as important (and for
that matter has the same interface, since you attempted to present that
later as a difference between them).

....

(since you seem determined to keep ignoring this point) why on
>Earth do you suppose that entirely reputable companies like IBM,
NetApp, and EMC go to such lengths to catch them? If it makes sense
for them (and for their customers), then it's really difficult to see
why ZFS's abilities in that area wouldn't be significant.

Another claim without any proof - and you accuse me of making claims.
Another typical troll behavior.

I don't know about NetApp or EMC,
You obviously don't know about much at all, but that doesn't seem to
inhibit you from pontificating incompetently.

but I still have contacts in IBM. And
they do not "go to such lengths" to catch silent errors.
Yes, they do - in particular, in their i-series boxes, where they use
non-standard sector sizes (520 or 528 bytes, I forget which) to include
exactly the kind of additional sanity-checks that I described. Either
your 'contacts in IBM' are as incompetent as you are, or (probably more
likely) you phrased the question incorrectly.

I strongly suspect that IBM uses similar mechanisms in their mainframe
storage, but haven't followed that as closely.

....
>The fact that you so consistently misrepresent ZFS as being something
*different* from RAID-1/10 shows that you don't even understand the
definition of RAID: the ZFS back-end *is* RAID-1/10 - it just
leverages its implementation in software to improve its reliability
(because writing directly from system RAM to disk without an
intermediate step through a common controller buffer significantly
improves the odds that *one* of the copies will be correct - and the
checksum enables ZFS to determine which one that is).

And you're saying it's the same? You really don't understand what
RAID-1 or RAID-10 is.
Nor, apparently, does anyone else who has bothered to respond to you
here: everyone's out of step but you.

Sure, Jerry. Do you actually make a living in this industry? If so, I
truly pity your customers.

....
[quote]
>>>>>Please name even one.

Why am I not surprised that you dodged that challenge?
Because I'm not the one making the claim.


My - you're either an out-right liar or even more abysmally
incompetent than even I had thought.

Let me refresh your memory: the exchange went

>>
> But when properly implemented, RAID-1 and RAID-10 will detect and
correct even more errors than ZFS will.

Please name even one.

[end quote]

At the risk of being repetitive (since reading comprehension does not
appear to be your strong suit), the specific claim (yours, quoted
above) was that "when properly implemented, RAID-1 and RAID-10 will
detect and correct even more errors than ZFS will."

But trolling does seem to be your strong suit.
Ah - still ducking and weaving frantically, I see.
>
>I challenged you to name even one such - and I'm still waiting.

A corruption in the ZFS buffer between writes, where different data is
written to one disk than the other.

Errors where ZFS itself is corrupted.
Tsk, tsk. These alleged issues have nothing to do with RAID's ability
to 'detect and correct even more errors than ZFS' - in fact, they have
nothing whatsoever to do with RAID detecting or correcting *anything*.
They're just hypothetical exposures (rather than established problems)
that you've propped up to try to suggest deficiencies in ZFS compared
with moving some of its facilities into firmware.

Come on, Jerry: surely you can come up with *one* kind of error that a
firmware-based RAID can 'detect and correct' that ZFS would miss - or
were you just blowing smoke out of your ass on that one, as in so many
others?

....
I'm not even going to bother to respond to you any more.
O frabjous day! But wait: can we believe this any more than your
similar statements to toby last night?

Inquiring minds want to know...

- bill
Nov 13 '06 #59
Bill Todd wrote:
Jerry Stuckle wrote:

...

I suspect I have many years more experience and
>>>knowledge than you.

You obviously suspect a great deal. Too bad that you don't have any
real clue. Even more too bad that you insist on parading that fact
so persistently.

And am more familiar with fault tolerant systems.

Exactly how many have you yourself actually architected and built,
rather than simply using the fault-tolerant hardware and software
that others have provided? I've been centrally involved in several.

Disk drive systems? I admit, none.


So no surprise there, but at least a little credit for honesty.

My design experience has been more
>in the digital arena - although I have done some analog design -
balanced amplifiers, etc.


Gee, whiz - and somehow you think that qualifies you to make pompous and
inaccurate statements about other areas that you know so much less
about. There's a lesson there, but I doubt your ability to learn it.
>>
How many disk drive systems have you actually had to troubleshoot?
Locate and replace a failing head, for example? Or a bad op amp in a
read amplifier? Again, none, I suspect. I've done quite a few in my
years.


As I suspected, a tech with inflated delusions of competence. You
really ought to learn the difference between being able to troubleshoot
a problem (that's what VCR repair guys do: it's not exactly rocket
science) and being able to design (or even really understand) the system
that exhibits it.
No, an EE graduate with years of design experience before I got into
programming. Sorry, sucker.

And I've snipped the rest of your post. It's obvious you're only an
average programmer (if that) with no real knowledge of the electronics.
All you do is take a set of specs and write code to meet them. Anyone
with six months of experience can do that.

Sorry, troll. The rest of your post isn't even worth reading. Go crawl
back in your hole.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================
Nov 13 '06 #60
Jerry Stuckle wrote:

....
>As I suspected, a tech with inflated delusions of competence. You
really ought to learn the difference between being able to
troubleshoot a problem (that's what VCR repair guys do: it's not
exactly rocket science) and being able to design (or even really
understand) the system that exhibits it.

No, a EE graduate
Well, I guess some schools will graduate just about anybody.

....

It's obvious you're only an
average programmer (if that)
Wow - now you're such an expert on programming that you can infer such
conclusions from a discussion which barely touches on the subject.
That's pretty indicative of your level of understanding in general,
though - so once more, no surprises here.
All you do is take a set of specs and write code to meet them.
Well, I guess you could say that I take imperfect, real-world hardware
and surrounding environments and (after doing the necessary research,
high-level architecting, and intermediate-level designing) write the
code that creates considerably-less-imperfect systems from them. And
since you probably aren't capable of even beginning to understand the
difference between that and your own statement, I guess we can leave it
there.

....
The rest of your post isn't even worth reading.
No doubt especially the part where I wondered whether you'd stick by
your promise not to respond again. You're so predictable that you'd be
boring just for that - if you weren't already boring for so many other
reasons.

But I think that my job here is done: I doubt that there's anyone left
wondering whether you might be someone worth listening to on this subject.

- bill
Nov 13 '06 #61
Jerry Stuckle <js*******@attglobal.net> wrote:
Frank Cusack wrote:
On 11 Nov 2006 19:30:25 -0800 "toby" <to**@telegraphics.com.au> wrote:
>Jerry Stuckle wrote:

REAL RAID-1 and RAID-10 are implemented in hardware/firmware, not a
software system such as ZFS.

This simple statement shows a fundamental misunderstanding of the basics,
let alone zfs.

-frank

Not at all. RAID-1 and RAID-10 devices are file system neutral. Just
like disk systems are file-system neutral.

And anyone who things otherwise doesn't understand real RAID
implementations - only cheap ones which use software for all or part of
their implementation.
Well, I would say that it's actually you who do not understand ZFS at all.
You claim you read Bonwick's blog entry - I believe you just do not want to understand
it.
Real RAID arrays are not cheap. $100-500/GB is not out of the question.
And you won't find them at COMP-USA or other retailers.

But you don't see those very often on PC's. Most of the time you see
cheap implementations where some of the work is done in software.
So? I use ZFS with cheap drives and also with storage like EMC Symmetrix and
several vendors' midrange arrays. In some workloads I get, for example, better
performance when RAID-10 is done completely by ZFS and not by the hardware itself.

Also recently one such hardware RAID actually did generate data corruption
without reporting it and ZFS did manage it properly. And we happen to have to
fsck UFS file systems from time to time on those arrays for no apparent reason.

ps. IBM's "hardware" RAID arrays can also loose data, you'll be even informed
by that "hardware" that it did so, how convinient

btw: when you talk about hardware RAID - there is actually software running
on an array's hardware, in case you didn't know

--
Robert Milkowski
rm************@wp-sa.pl
http://milek.blogspot.com
Nov 13 '06 #62
Jerry Stuckle <js*******@attglobal.net> wrote:
Bill Todd wrote:

That was a decade ago. What are the figures TODAY? Do you even know?
Do you even know why they happen?
I don't care that much for figures - what matters is that I can observe
it in my environment, with lots of data and storage arrays. Not daily of course,
but still. And ZFS has already detected many cases of data corruption generated
by arrays with HW RAID.
Silent errors are certainly rare, but they happen. ZFS catches them.
RAID does not. End of story.

And what about data which is corrupted once it's placed in the ZFS
buffer? ZFS buffers are in RAM, and can be overwritten at any time.
That kind of problem doesn't disappear with RAID done in HW.
If the buffer in the OS is corrupted before the data is sent to the array,
then a HW array won't help either.

Now if you have uncorrectable memory problems on your server, and your
server and OS can't cope with that, then you've got a much bigger problem
anyway and RAID won't help you.

And ZFS itself can be corrupted - telling the disk to write to the wrong
sector, for instance. It is subject to viruses. It runs only on UNIX.
The beauty of ZFS is that even if ZFS itself writes data to the wrong sector,
then in a redundant config ZFS can still detect it, recover, and provide the
application with correct data.

I really encourage you to read about ZFS internals as it's really great
technology with features you can't find anywhere else.

http://opensolaris.org/os/community/zfs/

ps. viruses.... :))))) ok, if you have a VIRUS in your OS which
is capable of corrupting data then HW RAID also won't help

big difference being ZFS if done in software, which requires CPU
cycles and other resources.
That's of course true. There are definitely environments where, due to CPU load,
doing RAID in ZFS will be slower than in HW, you're right.
However, in most environments disk performance is actually the limiting
factor, not CPU. Also, in many cases it's much easier and cheaper to add
CPU power to the system than to increase disk performance.

It's always about performance. 100% integrity is no good if you need
100% of the system resources to handle it.
You are wrong. What good is raw performance if your data is corrupted?
Actually you need a balance between the two, otherwise people would use
only striping and forget about the other RAID levels, right?

And while people are worrying that ZFS can consume much CPU due to checksum
calculations, in real life it seems that this is offset by other features
(like RAID and FS integration, etc.), so in the end in many cases you
actually get better performance than doing RAID in HW.

I did actual tests. Also I have "tested" it in production.
Have you?

ps. see my blog and ZFS list at opensolaris.org for more info.

However, because ZFS subsequently checks what it wrote against a
*separate* checksum, if it *was* corrupted below the request-submission
level ZFS is very likely to find out, whereas a conventional RAID
implementation (and the higher layers built on top of it) won't: they
just write what (they think) they're told to, with no additional check.

So? If the buffer is corrupted, the checksum will be, also. And if the
data is written to the wrong sector, the checksum will still be correct.
If the buffer is corrupted before the OS sends data to the array then you've got a problem
regardless of whether you use software or hardware RAID.

Now even if ZFS writes data to the wrong sector it can still detect it and correct it.
This is due to the fact that ZFS does NOT store the checksum with the data block itself.
The checksum is stored in the metadata block pointing to the data block. Also, that metadata
block is checksummed and its checksum is stored in its parent metadata block, and so
on. So if ZFS, due to a bug, were to write data to the wrong location, the overwritten blocks
have their checksums stored in a different location, and ZFS would detect it, correct it and
still return good data.

Really, read something about ZFS before you express your opinions on it.

The hardware is MUCH MORE RELIABLE than the software.
1. you still have to use Application/OS to make any use of that hardware.

2. your hardware runs software anyway

3. your hardware returns corrupted data (sometimes)

The fact that you even try to claim that ZFS is better than a RAID-1 or
RAID-10 system shows just how little you understand critical systems,
and how much you've bought into the ZFS hype.
I would rather say that you are completely ignorant and have never actually
read about ZFS with understanding. Also it appears you've never gotten data
corruption from HW arrays - how lucky you are, or maybe you didn't realize
it was an array which corrupted your data.

Also it seems you don't understand that ZFS also does RAID-1 and/or RAID-10.

The one advantage that a good hardware RAID-1/10 implementation has over
ZFS relates to performance, primarily small-synchronous-write latency:
while ZFS can group small writes to achieve competitive throughput (in
fact, superior throughput in some cases), it can't safely report
synchronous write completion until the data is on the disk platters,
whereas a good RAID controller will contain mirrored NVRAM that can
guarantee persistence in microseconds rather than milliseconds (and then
destage the writes to the platters lazily).

That's one advantage, yes.
That's why the combination of ZFS+RAID with large caches is so compelling
in many cases. And yes, I do have such configs.
Oh, so you're now saying that synchronous writes may not be truly
synchronous with ZFS? That's something I didn't know. I thought ZFS
was smarter than that.
Please, stop trolling. Of course they are synchronous.
--
Robert Milkowski
rm********@wp-sa.pl
http://milek.blogspot.com
Nov 13 '06 #63
Jerry Stuckle <js*******@attglobal.net> wrote:
Bill Todd wrote:

OK, and exactly how many of these bugs are there? Disk drive and
similar firmware is some of the most specialized and most heavily tested
firmware on the planet.
What? How many arrays do you manage?
How many times did you have to upgrade disk firmware or
RAID controller firmware on them? I did, many times.

However I must say that arrays are reliable. But from time to time it just happens.
We did fsck or other magic to get our data working, not that often but still.

Recently I used two SCSI JBODs (ok, it's not an array) connected
via two SCSI adapters to a host, with RAID-10 done in ZFS between the JBODs.
Well, during a data copy one of the controllers reported some warnings,
but kept operating. Well, it actually did corrupt data - fortunately
ZFS handled it properly, and we replaced the adapter. With traditional
file systems we would be in trouble.

All file data comes from some RAM buffer, Jerry - even that handed to a
firmware RAID. So if it can be corrupted in system RAM, firmware RAID
is no cure at all.

Ah, but let's see ZFS correct for that! Oh, sorry - I found a failure
mode your beloved file system doesn't handle, didn't I?
So? Nobody claims ZFS protects you from ALL possible data corruption.
Only that it protects you from many more data corruptions than when RAID
is done only on the array. It's also not theoretical - it's actually the
experience of many sysadmins.
Because your beloved ZFS isn't worth a damn on any other system, that's
why. Let's see it run on MVS/XE, for instance. It doesn't work.
RAID-1/RAID-10 does.
If you have to use MVS then you're right - you can't use ZFS and you
have to live with it.
And you conveniently ignore how ZFS can be corrupted. In fact, it is
much more easily corrupted than basic file systems using RAID-1/RAID-10
arrays - if for no other reason than it contains a lot more code and
needs to do more work.
Well, actually ZFS has less code than UFS, for example.
See http://blogs.sun.com/eschrock/entry/ufs_svm_vs_zfs_code

First check your assumptions before posting them.
But I don't blame you - when I first heard about ZFS my first
reaction was: it's too good to be true. Well, later I started using
it, and after over two years of using it (also in production) it still
amazes me how wonderful it is. It also has its weak points, and it had/has
some bugs, but after using it for more than two years I've never lost data.

Horseshit. It's only 'about performance' when the performance impact is
significant. In the case of ZFS's mirroring implementation, it isn't of
any significance at all (let alone any *real* drag on the system).

Keep believing that. It will help you to justify your statements in
your mind.
Have you checked it? I DID. And in MY environment ZFS delivered
better performance than HW RAID.

A corruption in the ZFS buffer between writes, where different data is
written to one disk than the other.
Actually, as soon as you read those data back, ZFS will detect it, correct it,
and return the correct data to the application.
Also you can run a SCRUB process in the background from time to time,
so even if you do not read those data back again, ZFS will check
all data and correct problems if it finds any.

So in the case you described above, ZFS will actually detect the corruption
and repair it.
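Along the same lines as the toy sketch earlier in the thread, here is the repair half of what a scrub pass does over a two-way mirror, in C: validate each copy against the checksum held by the parent, then rewrite any bad copy from a good one. Everything here is invented for illustration - a real scrub walks the on-disk block-pointer tree and issues actual disk I/O:

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 64
#define NCOPIES    2

struct mblkptr {
    uint8_t  copy[NCOPIES][BLOCK_SIZE]; /* stands in for the two mirror copies */
    uint64_t checksum;                  /* checksum held by the parent         */
};

static uint64_t cksum(const uint8_t *b, size_t n)
{
    uint64_t s = 0;
    while (n--) s = s * 31 + *b++;
    return s;
}

/* Returns the number of copies repaired, or -1 if no good copy exists. */
static int scrub(struct mblkptr *bp)
{
    int good = -1, repaired = 0;

    for (int i = 0; i < NCOPIES; i++)
        if (cksum(bp->copy[i], BLOCK_SIZE) == bp->checksum)
            good = i;                          /* a verified copy */
    if (good < 0)
        return -1;                             /* both copies bad */

    for (int i = 0; i < NCOPIES; i++)
        if (cksum(bp->copy[i], BLOCK_SIZE) != bp->checksum) {
            memcpy(bp->copy[i], bp->copy[good], BLOCK_SIZE);
            repaired++;                        /* silent error fixed */
        }
    return repaired;
}

int main(void)
{
    struct mblkptr bp;
    memset(&bp, 0, sizeof bp);
    memcpy(bp.copy[0], "good data", 9);
    memcpy(bp.copy[1], "good data", 9);
    bp.checksum = cksum(bp.copy[0], BLOCK_SIZE);

    bp.copy[1][3] ^= 0xFF;                     /* corrupt one mirror copy */
    printf("repaired %d copies\n", scrub(&bp));
    return 0;
}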
--
Robert Milkowski
rm***********@wp-sa.pl
http://milek.blogspot.com
Nov 13 '06 #64
Bill Todd wrote:
Jerry Stuckle wrote:

...
>>As I suspected, a tech with inflated delusions of competence. You
really ought to learn the difference between being able to
troubleshoot a problem (that's what VCR repair guys do: it's not
exactly rocket science) and being able to design (or even really
understand) the system that exhibits it.

No, a EE graduate


Well, I guess some schools will graduate just about anybody.
Right, troll.
...

It's obvious you're only an
>average programmer (if that)


Wow - now you're such an expert on programming that you can infer such
conclusions from a discussion which barely touches on the subject.
That's pretty indicative of your level of understanding in general,
though - so once more, no surprises here.
Well, with almost 40 years of programming, I can spot a large-mouthed
asshole when I see one.

Of course, you're such an expert on my EE experience.
> All you do is take a set of specs and write code to meet them.


Well, I guess you could say that I take imperfect, real-world hardware
and surrounding environments and (after doing the necessary research,
high-level architecting, and intermediate-level designing) write the
code that creates considerably-less-imperfect systems from them. And
since you probably aren't capable of even beginning to understand the
difference between that and your own statement, I guess we can leave it
there.
ROFLMAO! "Imperfect, real-world hardware" is a hell of a lot more
reliable than your programming! Then you "improve them" by making them
even less perfect! That is just too great.

You're a troll - and the worst one I've ever seen on Usenet. Hell, you
can't even succeed as a troll. You don't understand the hardware you're
supposedly writing to. It's pretty obvious your claims are out your
ass. You have no idea what you're talking about, and no idea how the
disk drive manufacturers write the firmware.
...
> The rest of your post isn't even worth reading.


No doubt especially the part where I wondered whether you'd stick by
your promise not to respond again. You're so predictable that you'd be
boring just for that - if you weren't already boring for so many other
reasons.

But I think that my job here is done: I doubt that there's anyone left
wondering whether you might be someone worth listening to on this subject.

- bill
I respond to assholes when they make even bigger assholes of themselves,
as you just did.

Go back to your Tinker Toys, little boy. You're a troll, nothing more,
nothing less.

You may claim you're a programmer. And maybe you've even written a few
lines of code in your lifetime. And there's even a slight chance you
got it to work, with some help.

But your claim that you write disk drive controller firmware is full of
shit. Your completely inane claims have proven that.

Troll.
--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================
Nov 13 '06 #65
Robert Milkowski wrote:
Jerry Stuckle <js*******@attglobal.net> wrote:
>>Bill Todd wrote:

OK, and exactly how many of these bugs are there? Disk drive and
similar firmware is some of the most specialized and most heavily tested
firmware on the planet.


What? How many arrays do you manage?
How many times did you have to upgrade disk firmware or
RAID controllers firmware on them? I did many times.
Robert,

I've lost count over the years of how many I've managed.

As for upgrading disk firmware? Never. RAID firmware? Once, but that
was on recommendation from the manufacturer, not because we had a problem.

But the RAID devices I'm talking about are not at COMP-USA. They are high
end arrays attached to minis and mainframes. Starting cost probably
$500K or more. And they are reliable.
However I must say that arrays are reliable. But from time to time it just happens.
We did fsck or other magic to get our data working, not that often but still.
Never had to do it.
Recently I did use two SCSI JBODs (ok, it's not array) connected
via two SCSI adapters to a host. RAID-10 done in ZFS between JBODS.
Well, during data copy one of the controllers reported some warnings,
but keep operational. Well, it actually did corrupt data - fortunately
ZFS did handle it properly, and we replaced the adapter. With traditional
file systems we would be in trouble.

Gee, with good hardware that wouldn't have happened. And with a real
RAID array it wouldn't have happened, either.
>
>>>All file data comes from some RAM buffer, Jerry - even that handed to a
firmware RAID. So if it can be corrupted in system RAM, firmware RAID
is no cure at all.

Ah, but let's see ZFS correct for that! Oh, sorry - I found a failure
mode your beloved file system doesn't handle, didn't I?


So? Nobody claims ZFS protects you from ALL possible data corruption.
Only that it protects you from much more data corruptions than when RAID
is done only on the array. It's also not theoretical but actually it's an
experience of many sys admins.
Ah, but that's what some of the people in this thread have claimed,
Robert. Check back.
>
>>Because your beloved ZFS isn't worth a damn on any other system, that's
why. Let's see it run on MVS/XE, for instance. It doesn't work.
RAID-1/RAID-10 does.


If you have to use MVS then you're right - you can't use ZFS and you
have to live with it.
Or even Windows. Or Mac. Or any of several other OS's.
>
>>And you conveniently ignore how ZFS can be corrupted. In fact, it is
much more easily corrupted than basic file systems using RAID-1/RAID-10
arrays - if for no other reason than it contains a lot more code and
needs to do more work.


Well, actually ZFS has less code than UFS, for example.
See http://blogs.sun.com/eschrock/entry/ufs_svm_vs_zfs_code

First check your assumptions before posting them.
But I don't blame you - when I first heard about ZFS my first
reaction was: it's too good to be true. Well, later I started using
it and after over two years of using it (also in production) it still
amazes me how wonderful it is. It has also its weak points, it had/has
some bugs but after using it for more than two years I've never loose data.
I have checked my assumptions. Note that I never said ZFS is bad. Just
that it isn't the magic cure-all that others in this thread are
claiming. And it's just a cheap replacement for proper RAID devices.
>
>>>Horseshit. It's only 'about performance' when the performance impact is
significant. In the case of ZFS's mirroring implementation, it isn't of
any significance at all (let alone any *real* drag on the system).

Keep believing that. It will help you to justify your statements in
your mind.


Have you checked it? I DID. And in MY environment ZFS delivered
better performance than HW RAID.

What RAID did you get? Did it have its own drivers, or did it use the
system drivers? If the former, a lot of the work was done in software,
which is common in less expensive systems. Did it have dual
controllers, or did it use one controller for both drives? I could go on.

With cheap controllers you get cheap performance.
>
>>A corruption in the ZFS buffer between writes, where different data is
written to one disk than the other.


Actually as soon as you will read those data ZFS will detect it and correct,
also will return correct data to an application.
Also you can run SCRUB process in a background from time to time,
so even if you do not read those data back again ZFS will check
all data and correct problems if it finds any.

So in above case you described ZFS will actually detect corruption
and repair.

I never said ZFS couldn't correct and repair some problems. But it does
NOT do everything, like some people here have indicated.
--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================
Nov 13 '06 #66
Robert Milkowski wrote:
Jerry Stuckle <js*******@attglobal.net> wrote:
>>Bill Todd wrote:

That was a decade ago. What are the figures TODAY? Do you even know?
Do you even know why they happen?


I don't care that much for figures - what matters is I can observe
it in my environment with lots of data and storage arrays. Not daily of course
but still. And ZFS has already detected many data corruption generated
by arrays with HW RAID.
Or, at least ZFS claims to have detected those corruptions. What proof
do you have that there really were errors? What kinds of errors were these?

If they were "silent errors", I would be very suspicious of the
reporting, unless you have a cheap array.
>
>>>Silent errors are certainly rare, but they happen. ZFS catches them.
RAID does not. End of story.

And what about data which is corrupted once it's placed in the ZFS
buffer? ZFS buffers are in RAM, and can be overwritten at any time.


That kind of problem doesn't disappear with RAID done in HW.
If buffer in OS is corrupted before data are sent to an array
then HW array also won't help.
Actually, it does. For instance, it will sit in the ZFS buffer for a
lot longer, leaving it open to corruption longer. It has to be in the
buffer at least as long as it takes for the first write to complete and
the command to be sent off to the second drive. With RAID, the data are
protected as soon as they are sent to the array with the first write.

With ZFS the data will be in the buffer for several ms or longer, even
with no other load on the system or disks. With RAID devices, data will
be there typically less than a ms. And as the system load gets heavier,
this difference increases.

There is much more opportunity for data to be corrupted in ZFS than in RAID.
Now if you have uncorrectable memory problems on your server and your
server and OS can't cope with that then you've got much bigger problem
anyway and RAID won't help you.
I didn't say anything about an uncorrectable memory problem.
>
>>And ZFS itself can be corrupted - telling the disk to write to the wrong
sector, for instance. It is subject to viruses. It runs only on UNIX.


The beauty of ZFS is that even if ZFS itself write data to wrong sector
then in redundand config ZFS can still detect it, recover and provide
application correct data.
Yes, it *can* detect it. But there is no guarantee it *will* detect it.
And how is it going to provide the application with correct data if that data
was overwritten?
I really encourage you to read about ZFS internals as it's realy great
technology with features you can't find anywhere else.

http://opensolaris.org/os/community/zfs/

ps. viruses.... :))))) ok, if you have an VIRUS in your OS which
is capable of corrupting data then HW RAID also won't help
I've read a lot about ZFS. I'm not saying it's a bad system. I'm just
saying there is a hell of a lot of marketing hype people have succumbed
to. And I have yet to find anyone with any technical background who can
support those claims.
>
>>>>big difference being ZFS if done in software, which requires CPU
cycles and other resources.


That's of course true. There're definitely environments when due to CPU
doing RAID in ZFS will be slower than in HW, you're right.
However in most environments disk performance is actually the limiting
factor not CPU. Also in many cases it's much easier and cheaper to add
CPU power to the system than to increase disk performance.
>>It's always about performance. 100% integrity is no good if you need
100% of the system resources to handle it.


You are wrong. What's good from rock performance if your data is corrupted?
Actually you need an balance between two, otherwise people would use
only stripe and forget about other RAIDs, right?

And while people are worrying that ZFS can consume much CPU due to checksum
calculations in real life it seems that this is offseted by other features
(like RAID and FS integration, etc.) so at the end in many cases you
actually get better performance that doing RAID in HW.

I did actual tests. Also I have "tested" it in production.
Have you?

ps. see my blog and ZFS list at opensolaris.org for more info.
>>>However, because ZFS subsequently checks what it wrote against a
*separate* checksum, if it *was* corrupted below the request-submission
level ZFS is very likely to find out, whereas a conventional RAID
implementation (and the higher layers built on top of it) won't: they
just write what (they think) they're told to, with no additional check.

So? If the buffer is corrupted, the checksum will be, also. And if the
data is written to the wrong sector, the checksum will still be correct.


If buffer is corrupted before OS sends data to the array then you've got problem
regardles of using software or hardware RAID.

Now even if ZFS writes data to wrong sector it can still detect it and correct.
This is due to fact that ZFS does NOT store checksum with data block itself.
Checksum is stored in metadata block pointing to data block. Also meta data
block is checksumed and its checksum is stored in its parent meta block, and so
on. So if ZFS due to bug would write data to wrong location, overwritten blocks
have checksums stored in different location and ZFS would detect it, correct and
still return good data.

Really, read something about ZFS before you express your opinions on it.
>>The hardware is MUCH MORE RELIABLE than the software.


1. you still have to use Application/OS to make any use of that hardware.

2. your hardware runs sotware anyway

3. your hardware returns corrupted data (sometimes)
>>The fact that you even try to claim that ZFS is better than a RAID-1 or
RAID-10 system shows just how little you understand critical systems,
and how much you've bought into the ZFS hype.


I would rather say that you are complete ignorant and never have actually
read with understanding about ZFS. Also it appears you've never got data
corruption from HW arrays - how lucky you are, or maybe you didn't realize
it was an array which corrupted your data.

Also it seems you don't understand that ZFS does also RAID-1 and/or RAID-10.
>>>The one advantage that a good hardware RAID-1/10 implementation has over
ZFS relates to performance, primarily small-synchronous-write latency:
while ZFS can group small writes to achieve competitive throughput (in
fact, superior throughput in some cases), it can't safely report
synchronous write completion until the data is on the disk platters,
whereas a good RAID controller will contain mirrored NVRAM that can
guarantee persistence in microseconds rather than milliseconds (and then
destage the writes to the platters lazily).

That's one advantage, yes.


That's why the combination of ZFS+RAID with large caches is so compeling
in many cases. And yes, I do have such configs.

>>Oh, so you're now saying that synchronous writes may not be truly
synchronous with ZFS? That's something I didn't know. I thought ZFS
was smarter than that.


Please, stop trolling. Of course they are synchronous.


--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================
Nov 13 '06 #67
Robert Milkowski wrote:
Jerry Stuckle <js*******@attglobal.net> wrote:
>>Frank Cusack wrote:
>>>On 11 Nov 2006 19:30:25 -0800 "toby" <to**@telegraphics.com.au> wrote:
Jerry Stuckle wrote:
>REAL RAID-1 and RAID-10 are implemented in hardware/firmware, not a
>software system such as ZFS.
This simple statement shows a fundamental misunderstanding of the basics,
let alone zfs.

-frank

Not at all. RAID-1 and RAID-10 devices are file system neutral. Just
like disk systems are file-system neutral.

And anyone who things otherwise doesn't understand real RAID
implementations - only cheap ones which use software for all or part of
their implementation.


Well, I would say that it's actually you who do not understand ZFS at all.
You claim you read Bonwick blog entry - I belive you just do not want to understand
it.
No, I read it. The difference is I have enough technical background to
separate the facts from the hype.
>
>>Real RAID arrays are not cheap. $100-500/GB is not out of the question.
And you won't find them at COMP-USA or other retailers.

But you don't see those very often on PC's. Most of the time you see
cheap implementations where some of the work is done in software.


So? I use ZFS with cheap drives and also with storage like EMC Symmetrix and
several vendors midrange arrays. In some workloads I get for example better
performance when RAID-10 is done completely by ZFS and not by hardware itself.
If you use cheap drives, you need something like ZFS. But if you depend
on cheap drives, your data isn't very critical.

Let's take a real-life example of a critical system - a major airline
where losing one minute of reservations will cost millions of dollars.
And if the system is down for 12 hours the entire company can go under.

Or losing a single hour's worth of flight information could bankrupt the
company. Even losing a single flight could cost millions of dollars,
not to mention the bad PR.

BTW, this airline not only has RAID devices, they have duplicate data
centers; the databases on those RAID devices are synchronized constantly.

Or a bank, where lost transactions can cause account balances to be
incorrect, and bad data sent to the Federal Reserve System can cost
millions of dollars. Even if they recover all the data, the time it
takes can cost huge losses - banks are on a schedule to send tapes to
the Federal Reserve every night, and missing a deadline can easily cost
$100K per hour in fines.
Also recently one such hardware RAID actually did generate data corruption
without reporting it and ZFS did manage it properly. And we happen to have to
fsck UFS file systems from time to time on those arrays for no apparent reason.

ps. IBM's "hardware" RAID arrays can also loose data, you'll be even informed
by that "hardware" that it did so, how convinient
I never said that hardware RAID systems can't lose data. My comments were:

1) With good drives, unreported ("silent") errors occur so seldom that
they can be ignored,
2) Virtually all other errors can be corrected by the hardware, and
3) ZFS cannot correct for all those errors.

ZFS makes some great claims. But Bonwick makes a great marketing piece
in the way he magnifies the possibilities of hardware problems and
minimizes potential problems in ZFS. He also magnifies the good things
about ZFS, but minimizes the positives of RAID devices.

The whole thing is a great exercise in marketing hype.
btw: when you talk about hardware RAID - there is actually software running
on a array's hardware, in case you didn't know
Of course I understand that. But I also understand it's isolated from
the system software, and not subject to viruses and other nasty things.
It's also not subject to corruption by other programs.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================
Nov 13 '06 #68
Jerry Stuckle wrote:

....
If they were "silent errors", I would be very suspicious of the
reporting, unless you have a cheap array.
Of course you would, Jerry: you've apparently never been one to let the
facts get in the way of a good preconception.

....

For instance, it will sit in the ZFS buffer for a
lot longer, leaving it open to corruption longer. It has to be in the
buffer at least as long as it takes for the first write to complete and
the command sent off to the second drive.
'Fraid not, moron: haven't you ever programmed an asynchronous system
before? The two writes are sent in parallel (if you don't understand
why, or how, well - that's pretty much on a par with the rest of your
ignorance).
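For the record, the general pattern looks something like this minimal C sketch - POSIX AIO against two ordinary files standing in for the mirror members. It is not how ZFS is structured internally, just an illustration of queueing both copies before waiting on either:

/* Minimal sketch of issuing both halves of a mirrored write in
 * parallel with POSIX AIO, instead of waiting for the first copy
 * before starting the second.  Link with -lrt on older glibc. */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int mirrored_write(int fd_a, int fd_b, char *buf,
                          size_t len, off_t offset)
{
    struct aiocb a, b;
    memset(&a, 0, sizeof a);
    a.aio_buf    = buf;
    a.aio_nbytes = len;
    a.aio_offset = offset;
    b = a;                     /* same request, second mirror member */
    a.aio_fildes = fd_a;
    b.aio_fildes = fd_b;

    /* Both writes are queued before either completes. */
    if (aio_write(&a) < 0 || aio_write(&b) < 0)
        return -1;

    const struct aiocb *list[2] = { &a, &b };
    while (aio_error(&a) == EINPROGRESS || aio_error(&b) == EINPROGRESS)
        aio_suspend(list, 2, NULL);        /* wait for both to finish */

    return (aio_return(&a) == (ssize_t)len &&
            aio_return(&b) == (ssize_t)len) ? 0 : -1;
}

int main(void)
{
    char data[] = "mirrored block";
    int fd_a = open("mirror_a", O_RDWR | O_CREAT, 0644);
    int fd_b = open("mirror_b", O_RDWR | O_CREAT, 0644);
    if (fd_a < 0 || fd_b < 0) { perror("open"); return 1; }

    int rc = mirrored_write(fd_a, fd_b, data, sizeof data, 0);
    printf("mirrored write %s\n", rc == 0 ? "ok" : "failed");
    close(fd_a);
    close(fd_b);
    return rc == 0 ? 0 : 1;
}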

....
Yes, it *can* detect it. But there is no guarantee it *will* detect it.
And how is it going to provide application correct data if that data
was overwritten?
From the other, uncorrupted copy, idiot.

....
I've read a lot about ZFS.
Then the problem clearly resides in your inability to understand what
you've (allegedly) read: again, no surprise here.

- bill
Nov 13 '06 #69
Jerry Stuckle wrote:

....

But Bonwick makes a great marketing piece
in the way he magnifies the possibilities of hardware problems
Sure, Jerry: it's all just hype - ZFS's separate checksums, IBM's,
NetApp's, and EMC's similar (though not always as effective) ancillary
in-line sanity-checks, Oracle's 'Hardware Assisted Resilient Data'
initiative (again, not as fully end-to-end as ZFS's mechanism, but at
least it verifies that what it wrote is what gets down to the individual
disk - and all the major hardware vendors have supported it)...

And only you understand this, of course, due to your extensive
technician-level experience in component repair: everyone else here,
despite their actual experiences with ZFS and hardware, doesn't have a clue.

Are you also one of those people who hears voices in your head when no
one else is around?

- bill
Nov 13 '06 #70
Bill Todd wrote:
Jerry Stuckle wrote:

...
>If they were "silent errors", I would be very suspicious of the
reporting, unless you have a cheap array.


Of course you would, Jerry: you've apparently never been one to let the
facts get in the way of a good preconception.
Yep, you sure do, Bill. You are so convinced that ZFS is the best thing
since sliced bread you can't see the obvious.
...

For instance, it will sit in the ZFS buffer for a
>lot longer, leaving it open to corruption longer. It has to be in the
buffer at least as long as it takes for the first write to complete
and the command sent off to the second drive.


'Fraid not, moron: haven't you ever programmed an asynchronous system
before? The two writes are sent in parallel (if you don't understand
why, or how, well - that's pretty much on a par with the rest of your
ignorance).

...
>Yes, it *can* detect it. But there is no guarantee it *will* detect
it. And how is it going to provide application correct data if that
data was overwritten?


From the other, uncorrupted copy, idiot.

...
>I've read a lot about ZFS.


Then the problem clearly resides in your inability to understand what
you've (allegedly) read: again, no surprise here.

- bill

Or your inability to understand basic facts.

Troll.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================
Nov 13 '06 #71
Bill Todd wrote:
Jerry Stuckle wrote:

...

But Bonwick makes a great marketing piece
>in the way he magnifies the possibilities of hardware problems


Sure, Jerry: it's all just hype - ZFS's separate checksums, IBM's,
NetApp's, and EMC's similar (though not always as effective) ancillary
in-line sanity-checks, Oracle's 'Hardware Assisted Resilient Data'
initiative (again, not as fully end-to-end as ZFS's mechanism, but at
least it verifies that what it wrote is what gets down to the individual
disk - and all the major hardware vendors have supported it)...

And only you understand this, of course, due to your extensive
technician-level experience in component repair: everyone else here,
despite their actual experiences with ZFS and hardware, doesn't have a
clue.

Are you also one of those people who hears voices in your head when no
one else is around?

- bill
And digital design - something you've never attempted, nor are you
capable of attempting.

You're just pissed off because you found someone with more knowledge
than you who is challenging your bullshit. And you can't stand it, so
you try the personal attacks.

Troll.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================
Nov 13 '06 #72
Jerry Stuckle <js*******@attglobal.net> wrote:
Robert Milkowski wrote:
Jerry Stuckle <js*******@attglobal.net> wrote:
>Bill Todd wrote:

OK, and exactly how many of these bugs are there? Disk drive and
similar firmware is some of the most specialized and most heavily tested
firmware on the planet.

What? How many arrays do you manage?
How many times did you have to upgrade disk firmware or
RAID controllers firmware on them? I did many times.

Robert,

I've lost count over the years of how many I've managed.

As for upgrading disk firmware? Never. RAID firmware? Once, but that
was on recommendation from the manufacturer, not because we had a problem.

But the RAID devices I'm talking about are at COMP-USA. They are high
end arrays attached to minis and mainframes. Starting cost probably
$500K or more. And they are reliable.
And if you have managed them for some time then you definitely upgraded their
firmware more than once, including disk firmware. Well, maybe not you,
but an EMC engineer did it for you :)

I have upgraded, for example (ok, an EMC engineer did), Symmetrix firmware
more than once. And this is the array you are talking about, I guess.

Recently I used two SCSI JBODs (ok, it's not an array) connected
via two SCSI adapters to a host, with RAID-10 done in ZFS between the JBODs.
Well, during a data copy one of the controllers reported some warnings
but kept operating. It actually did corrupt data - fortunately
ZFS handled it properly, and we replaced the adapter. With traditional
file systems we would be in trouble.

Gee, with good hardware that wouldn't have happened. And with a real
RAID array it wouldn't have happened, either.
Really? Well it did more than once (well known vendors).
>>Horseshit. It's only 'about performance' when the performance impact is
significant. In the case of ZFS's mirroring implementation, it isn't of
any significance at all (let alone any *real* drag on the system).
Keep believing that. It will help you to justify your statements in
your mind.

Have you checked it? I DID. And in MY environment ZFS delivered
better performance than HW RAID.

What RAID did you get? Did it have its own drivers, or did it use the
system drivers? If the former, a lot of the work was done in software,
which is common in less expensive systems. Did it have dual
controllers, or did it use one controller for both drives? I could go on.
Dual FC links, RAID-10, etc.....
I never said ZFS couldn't correct and repair some problems. But it does
NOT do everything, like some people here have indicated.
Of course. But it does protect you from more data corruption scenarios than ANY
HW RAID can.
--
Robert Milkowski
rm***********@wp-sa.pl
http://milek.blogspot.com
Nov 13 '06 #73
Jerry Stuckle wrote:

....
You're just pissed off because you found someone with more knowledge
than you
And, obviously, more knowledge than anyone else, whether here (those
with actual experience of the errors you claim don't exist in noticeable
quantities) or in the rest of the industry (such as those who actually
implemented the mechanisms that you claim are just hype without even
beginning to understand them).

You also appear to have a rather loose grasp on reality, at least when
it comes to presenting utter drivel as fact.

Are you familiar with the concept of 'delusional megalomania', Jerry?
If not, perhaps you ought to become acquainted with it.

- bill
Nov 13 '06 #74
alf <ask@me> wrote in news:q_******************************@comcast.com:
Hi,

is it possible that due to OS crash or mysql itself crash or some e.g.
SCSI failure to lose all the data stored in the table (let's say million
of 1KB rows). In other words what is the worst case scenario for MyISAM
backend?
Hi everyone

Thanks for contributing... for the most part, it was great to see very
knowledgeable people discuss the intricacies of data safety and management.
Nov 13 '06 #75
Robert Milkowski wrote:
Jerry Stuckle <js*******@attglobal.net> wrote:
>>Robert Milkowski wrote:
>>>Jerry Stuckle <js*******@attglobal.net> wrote:
Bill Todd wrote:

OK, and exactly how many of these bugs are there? Disk drive and
similar firmware is some of the most specialized and most heavily tested
firmware on the planet.
What? How many arrays do you manage?
How many times did you have to upgrade disk firmware or
RAID controllers firmware on them? I did many times.

Robert,

I've lost count over the years of how many I've managed.

As for upgrading disk firmware? Never. RAID firmware? Once, but that
was on recommendation from the manufacturer, not because we had a problem.

But the RAID devices I'm talking about aren't the ones at COMP-USA. They are high
end arrays attached to minis and mainframes. Starting cost probably
$500K or more. And they are reliable.


And if you have managed them for some time then you definitely upgraded their
firmware more than once, including disk firmware. Well, maybe not you
but an EMC engineer did it for you :)

I have upgraded, for example (ok, the EMC engineer did), Symmetrix firmware
more than once. And this is the array you are talking about, I guess.
No, I'm not talking Symmetrix. If we're talking the same company, they
are a software supplier (quite good software, I must add), not a RAID
array manufacturer. They may, however, have software to run RAID
arrays; if so I'm not familiar with that particular product.
>
>>>Recently I used two SCSI JBODs (ok, it's not an array) connected
via two SCSI adapters to a host, with RAID-10 done in ZFS between the JBODs.
Well, during a data copy one of the controllers reported some warnings
but kept operating. It actually did corrupt data - fortunately
ZFS handled it properly, and we replaced the adapter. With traditional
file systems we would be in trouble.


Gee, with good hardware that wouldn't have happened. And with a real
RAID array it wouldn't have happened, either.


Really? Well it did more than once (well known vendors).
McDonalds is also well known. But I wouldn't equate that to quality food.
>
>>>>>Horseshit. It's only 'about performance' when the performance impact is
>significant. In the case of ZFS's mirroring implementation, it isn't of
>any significance at all (let alone any *real* drag on the system).
>

Keep believing that. It will help you to justify your statements in
your mind.
Have you checked it? I DID. And in MY environment ZFS delivered
better performance than HW RAID.


What RAID did you get? Did it have its own drivers, or did it use the
system drivers? If the former, a lot of the work was done in software,
which is common in less expensive systems. Did it have dual
controllers, or did it use one controller for both drives? I could go on.


Dual FC links, RAID-10, etc.....
But you didn't answer my questions. Did it have its own drivers? Did
it have dual controllers?
>
>>I never said ZFS couldn't correct and repair some problems. But it does
NOT do everything, like some people here have indicated.


Of course. But it does protect you from more data corruption scenarios than ANY
HW RAID can.


But good HW RAID will detect and, if at all possible, correct data
corruption. And if it's not possible, it's because the data is lost -
i.e. completely scrambled and/or overwritten on both drives. Even ZFS
can't handle that.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================
Nov 13 '06 #76
Bill Todd wrote:
Jerry Stuckle wrote:

...
>You're just pissed off because you found someone with more knowledge
than you


And, obviously, more knowledge than anyone else, whether here (those
with actual experience of the errors you claim don't exist in noticeable
quantities) or in the rest of the industry (such as those who actually
implemented the mechanisms that you claim are just hype without even
beginning to understand them).
Again, just another troll response. You can't dispute the facts, so you
make personal attacks on the messenger.

For the record, I have more knowledge of the hardware and internals than
anyone here has shown. And I have yet to see anything in the ZFS
references provided to indicate that ANY of the people there have more
than a cursory knowledge of the hardware and firmware behind disk drives
themselves, much less an in depth knowledge. Yet they spew "facts" like
they are experts.

I have no argument with their programming skills. Merely their lack of
knowledge of disk hardware.
You also appear to have a rather loose grasp on reality, at least when
it comes to presenting utter drivel as fact.
ROFLMAO! Because I give you the true reality, and not some hype?
Are you familiar with the concept of 'delusional megalomania', Jerry?
If not, perhaps you ought to become acquainted with it.

- bill
You seem to be quite familiar with it, Bill. How many times have you
been diagnosed with it?

Just another troll.
--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================
Nov 13 '06 #77
Jerry Stuckle <js*******@attglobal.net> wrote:
>
But good HW RAID will detect and, if at all possible, correct data
corruption. And if it's not possible, it's because the data is lost -
i.e. completely scrambled and/or overwritten on both drives. Even ZFS
can't handle that.
Ok, it doesn't make sense to reason with you.
You live in a world of your own - fine, keep dreaming.
--
Robert Milkowski
rm************@wp-sa.pl
http://milek.blogspot.com
Nov 13 '06 #78
Jerry Stuckle wrote:

....
>Are you familiar with the concept of 'delusional megalomania', Jerry?
If not, perhaps you ought to become acquainted with it.

- bill

You seem to be quite familiar with it, Bill. How many times have you
been diagnosed with it?
None, but thirty-odd years ago I did work full-time for three years in a
mental hospital, treating disturbed adolescents.

Now, unlike you, I'm not prone to making sweeping assertions far outside
my area of professional expertise and with little or no solid
foundation, but it doesn't take a degree in psychology to know how
clearly your behavior here reminds me of them. And the more I've
noticed that, the more I've begun to feel that perhaps you were more
deserving of pity than of scorn.

But of course, since my experience in this area is purely practical
(though obtained working closely with people who *were* professionals in
this area), I could be wrong: as Freud might have said, sometimes an
asshole is simply an asshole, rather than mentally ill.

- bill
Nov 13 '06 #79
Jerry Stuckle <js*******@attglobal.net> wrote:
Robert Milkowski wrote:
Well, I would say that it's actually you who do not understand ZFS at all.
You claim you read Bonwick's blog entry - I believe you just do not want to understand
it.

No, I read it. The difference is I have enough technical background to
separate the facts from the hype.
So you keep saying...
But your posts indicate you are actually ignorant when it comes
to technical details. All you have presented so far is wishful thinking
and the belief that if something costs lots of money then it will automagically
solve all problems. Well, in reality that's not the case. No matter how much money
you put into a HW RAID, it won't detect some data corruption which would otherwise
be easily detected and corrected by ZFS.

I think you should stay in your wonderland and we should not waste our time anymore.
--
Robert Milkowski
rm***********@wp-sa.pl
http://milek.blogspot.com
Nov 13 '06 #80
Robert Milkowski wrote:
Jerry Stuckle <js*******@attglobal.net> wrote:
>>But good HW RAID will detect and, if at all possible, correct data
corruption. And if it's not possible, it's because the data is lost -
i.e. completely scrambled and/or overwritten on both drives. Even ZFS
can't handle that.


Ok, it doesn't make sense to reason with you.
You live in a world of your own - fine, keep dreaming.

And you ignore the facts. Good luck.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================
Nov 13 '06 #81
Bill Todd wrote:
Jerry Stuckle wrote:

...
>>Are you familiar with the concept of 'delusional megalomania',
Jerry? If not, perhaps you ought to become acquainted with it.

- bill


You seem to be quite familiar with it, Bill. How many times have you
been diagnosed with it?


None, but thirty-odd years ago I did work full-time for three years in a
mental hospital, treating disturbed adolescents.
I rather suspect you were a patient.
Now, unlike you, I'm not prone to making sweeping assertions far outside
my area of professional expertise and with little or no solid
foundation, but it doesn't take a degree in psychology to know how
clearly your behavior here reminds me of them. And the more I've
noticed that, the more I've begun to feel that perhaps you were more
deserving of pity than of scorn.
Hmmm, it seems you've made some sweeping statements in this thread. And
unlike me, you don't have the hardware background to support your
statements. And even your software background is questionable.
But of course, since my experience in this area is purely practical
(though obtained working closely with people who *were* professionals in
this area), I could be wrong: as Freud might have said, sometimes an
asshole is simply an asshole, rather than mentally ill.

- bill
Right. What did you do - empty their trash cans for them?
--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================
Nov 13 '06 #82
Robert Milkowski wrote:
Jerry Stuckle <js*******@attglobal.net> wrote:
>>Robert Milkowski wrote:
>>>Well, I would say that it's actually you who do not understand ZFS at all.
You claim you read Bonwick's blog entry - I believe you just do not want to understand
it.

No, I read it. The difference is I have enough technical background to
separate the facts from the hype.


So you keep saying...
But your posts indicate you are actually ignorant when it comes
to technical details. All you have presented so far is wishful thinking
and the belief that if something costs lots of money then it will automagically
solve all problems. Well, in reality that's not the case. No matter how much money
you put into a HW RAID, it won't detect some data corruption which would otherwise
be easily detected and corrected by ZFS.

I think you should stay in your wonderland and we should not waste our time anymore.

No, a truly fault-tolerant hardware RAID is VERY expensive to develop
and manufacture. You don't take $89 100GB disk drives off the shelf,
tack them onto an EIDE controller and add some software to the system.

You first have to start with high quality disk drives. The electronic
components are also higher quality, with complicated circuits to detect
marginal signal strength off of the platter, determine when a signal is
marginal, change the sensing parameters in an attempt to reread the data
correctly, and so on.

The firmware must be able to work with this hardware to handle read
errors and change those parameters, automatically mark marginal sectors
bad before they become totally wiped out, and if the data cannot be read,
automatically retry from the mirror. And if the retry occurs, the
firmware must mark the original track bad and rewrite it with the good data.
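
To make that flow concrete, here is a rough Python sketch of the retry-from-mirror-and-rewrite
sequence just described. It is a toy model only: FakeDrive and its
read_sector/write_sector/mark_sector_bad methods are hypothetical stand-ins for firmware
primitives, not any real controller or driver API.

# Toy model: FakeDrive stands in for one half of a mirror; each sector
# carries a CRC-32 so an unreadable or corrupt sector can be detected.
import zlib

class FakeDrive:
    def __init__(self):
        self.sectors = {}          # lba -> (data, crc)
        self.remapped = set()

    def write_sector(self, lba, data):
        self.sectors[lba] = (data, zlib.crc32(data))

    def read_sector(self, lba):
        if lba not in self.sectors:
            return b"", False
        data, crc = self.sectors[lba]
        return data, zlib.crc32(data) == crc   # False models a failed read

    def mark_sector_bad(self, lba):
        self.remapped.add(lba)     # real firmware would remap to a spare sector

def read_with_repair(primary, mirror, lba):
    data, ok = primary.read_sector(lba)
    if ok:
        return data
    data, ok = mirror.read_sector(lba)         # fall back to the good copy
    if not ok:
        raise IOError("sector unreadable on both copies")
    primary.mark_sector_bad(lba)               # retire the failing sector...
    primary.write_sector(lba, data)            # ...and rewrite the good data
    return data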

Also, with two or more controllers, the controllers talk to each other
directly, generally over a dedicated bus. They keep each other informed
of their status and constantly run diagnostics on themselves and each
other when the system is idle. These tests include reading and writing
test cylinders on the disks to verify proper operation.

Additionally, in the more expensive RAID devices, checksums are
typically at least 32 bits long (your off-the-shelf drive typically uses
a 16 bit checksum), and the checksum is built in hardware - much more
expensive, but much faster than doing it in firmware. Checksum
comparisons are done in hardware, also.

Plus, with verified writes, the firmware has to go back and reread the
data the next time the sector comes around and compare it with the
contents of the buffer. Again, this is often done in hardware on the
high end RAID systems.
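
A verified write reduces to something like the following sketch, reusing the hypothetical
drive interface from the sketch above. Real controllers verify against the platter on the
next revolution; this only illustrates the write/re-read/compare loop.

def verified_write(drive, lba, buffer, retries=2):
    # Write, then reread the sector and compare against the original buffer.
    for _ in range(retries + 1):
        drive.write_sector(lba, buffer)
        readback, ok = drive.read_sector(lba)
        if ok and readback == buffer:
            return True            # write verified
    return False                   # surface the sector as suspect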

And, most of these RAID devices use custom chip sets - not something off
the shelf. Designing the chipsets themselves is in itself quite
expensive, and due to the relatively limited run and high density of the
chipsets, they are quite expensive to produce.

There's a lot more to it. But the final result is these devices have a
lot more hardware and software, a lot more internal communications, and
a lot more firmware. And it costs a lot of money to design and
manufacture these devices. That's why you won't find them at your
local computer store.

Some of this can be emulated in software. But the software cannot
detect when a signal is getting marginal (it's either "good" or "bad"),
adjust the r/w head parameters, and similar things. Yes, it can
checksum the data coming back and read from the mirror drive if
necessary. It might even be able to tell the controller to run a
self-check (most controllers do have that capability) during idle times.
But it can't do a lot more than that. The controller interface isn't
smart enough to do a lot more.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================
Nov 13 '06 #83
Jerry Stuckle wrote:

....
>None, but thirty-odd years ago I did work full-time for three years in
a mental hospital, treating disturbed adolescents.
....
Right. What did you do - empty their trash cans for them?
I fully understand how limited your reading skills are, but exactly what
part of the end of the above sentence surpassed even your meager ability
to read?

One of the many things I learned there was that some people are simply
beyond help - professional or otherwise. You appear to be one of them:
the fact that not a single person here has defended you, but rather
uniformly told you how deluded you are, doesn't faze you in the slightest.

Fortunately, that's in no way my problem, nor of any concern to me. So
have a nice life in your own private little fantasy world - just don't
be surprised that no one else subscribes to it.

- bill
Nov 13 '06 #84
Jerry Stuckle <js*******@attglobal.net> wrote:
Robert Milkowski wrote:

No, a truly fault-tolerant hardware RAID is VERY expensive to develop
and manufacture. You don't take $89 100GB disk drives off the shelf,
tack them onto an EIDE controller and add some software to the system.

You first have to start with high quality disk drives. The electronic
components are also higher quality, with complicated circuits to detect
marginal signal strength off of the platter, determine when a signal is
marginal, change the sensing parameters in an attempt to reread the data
correctly, and so on.

The firmware must be able to work with this hardware to handle read
errors and change those parameters, automatically mark marginal sectors
bad before they become totally wiped out, and if the data cannot be read,
automatically retry from the mirror. And if the retry occurs, the
firmware must mark the original track bad and rewrite it with the good data.

Also, with two or more controllers, the controllers talk to each other
directly, generally over a dedicated bus. They keep each other informed
of their status and constantly run diagnostics on themselves and each
other when the system is idle. These tests include reading and writing
test cylinders on the disks to verify proper operation.

Additionally, in the more expensive RAID devices, checksums are
typically at least 32 bits long (your off-the-shelf drive typically uses
a 16 bit checksum), and the checksum is built in hardware - much more
expensive, but much faster than doing it in firmware. Checksum
comparisons are done in hardware, also.

Plus, with verified writes, the firmware has to go back and reread the
data the next time the sector comes around and compare it with the
contents of the buffer. Again, this is often done in hardware on the
high end RAID systems.

And, most of these RAID devices use custom chip sets - not something off
the shelf. Designing the chipsets themselves is in itself quite
expensive, and due to the relatively limited run and high density of the
chipsets, they are quite expensive to produce.

There's a lot more to it. But the final result is these devices have a
lot more hardware and software, a lot more internal communications, and
a lot more firmware. And it costs a lot of money to design and
manufacture these devices. That's why you won't find them at your
local computer store.

Some of this can be emulated in software. But the software cannot
detect when a signal is getting marginal (it's either "good" or "bad"),
adjust the r/w head parameters, and similar things. Yes, it can
checksum the data coming back and read from the mirror drive if
necessary. It might even be able to tell the controller to run a
self-check (most controllers do have that capability) during idle times.
But it can't do a lot more than that. The controller interface isn't
smart enough to do a lot more.
The point is that you can still use such an array and put ZFS on top of it
for many reasons - easier management is one of them; another is better
data protection than if you use a classic file system.

--
Robert Milkowski
rm************@wp-sa.pl
http://milek.blogspot.com
Nov 13 '06 #85
Dear me - I just bade you a fond farewell, and here you've at last come
up with something at least vaguely technical (still somewhat mistaken,
but at least technical). So I'll respond to it in kind:

Jerry Stuckle wrote:

....

a truly fault-tolerant hardware RAID is VERY expensive to develop
and manufacture.
That's true, and one of the reasons why it makes a lot more sense to do
the work in software instead (as long as small-update latency is not
critical). One of the reasons for the rise of high-end, high-cost
hardware RAID systems was the lag in development of system software in
that area. Another was the ability of the hardware approach to bundle
in NVRAM write acceleration that was considerably more difficult to add
(and then use) as a special system device (the venerable Prestoserve
product comes to mind), plus large amounts of additional cache that
systems may not even have been able to support at all due to
address-space limitations: the ability to look like a plain old disk
(no system or application software changes required at all) but offer
far higher reliability (through redundancy) and far better small-update
and/or read-caching performance helped make the sale.

But time marches on. Most serious operating systems now support (either
natively or via extremely reputable decade-old, thoroughly-tested
third-party system software products from people like Veritas) software
RAID, and as much cache memory as you can afford (no more address-space
limitations there) - plus (with products like ZFS) are at least starting
to address synchronous small-update throughput (though when synchronous
small-update *latency* is critical there's still no match for NVRAM).

You don't take $89 100GB disk drives off the shelf,
tack them onto an EIDE controller and add some software to the system.
Actually, you can do almost *precisely* that, as long as the software
handles the situation appropriately - and that's part of what ZFS is
offering (and what you so obviously completely fail to be able to grasp).

No disk or firmware is completely foolproof. Not one. No matter how
expensive and well-designed. So the question isn't whether the disks
and firmware are unreliable, but just the degree and manner in which
they are.

There is, to be sure, no way that you can make a pair of inexpensive
SATA drives just as reliable as a pair of Cheetahs, all other things
being equal. But it is *eminently* possible, using appropriate software
(or firmware), to make *three or four* inexpensive SATA drives *more*
reliable than a pair of Cheetahs that cost far more - and to obtain
better performance in many areas in the bargain.
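
A back-of-the-envelope illustration of that point, with made-up failure probabilities (they
are assumptions chosen only to show the arithmetic, not vendor figures): adding a third cheap
copy shrinks the chance of losing every copy faster than upgrading to a better pair does.

# Assumed, illustrative probabilities of a drive failing during the window
# of exposure (e.g. while a failed drive is being replaced and rebuilt).
p_sata = 0.05                      # cheap SATA drive
p_fc = 0.02                        # pricier FC/SCSI drive

loss_fc_2way = p_fc ** 2           # both copies of a 2-way FC mirror lost
loss_sata_3way = p_sata ** 3       # all three copies of a 3-way SATA mirror lost

print(f"2-way FC mirror loss probability:   {loss_fc_2way:.6f}")    # 0.000400
print(f"3-way SATA mirror loss probability: {loss_sata_3way:.6f}")  # 0.000125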

Do you buy Brand X drives off the back of a truck in an alley? Of
course not: you buy from Seagate (or someone you think has similar
credibility - and perhaps not their newly-acquired Maxtor drives for a
while yet), and for 24/7 use you buy their 'near line' drives (which
aren't much more expensive than their desktop versions). For *really*
hard 24/7 seek-intensive pounding, your only real SATA choice is Western
Digital's Raptor series - unless you just throw so many lesser drives at
the workload and distribute it across them sufficiently evenly that
there's no longer any real pounding on any given drive (which in fact is
not an unrealistic possibility, though one which must be approached with
due care).

Such reputable SATA drives aren't the equal of their high-end FC
cousins, but neither are they crap: in both cases, as long as your
expectations are realistic, you compensate for their limitations, and
you don't abuse them, they won't let you down.

And you don't attach them through Brand X SATA controllers, either:
ideally, you attach them directly (since you no longer need any
intermediate RAID hardware), using the same quality electronics you have
on the rest of your system board (so the SATA connection won't
constitute a weak link). And by virtue of being considerably simpler
hardware/firmware than a RAID implementation, that controller may well
be *more* reliable.

If you've got a lot of disks to attach, quality SATA port multipliers,
SAS connections, and fibre-channel-to-SATA links are available.
>
You first have to start with high quality disk drives. The electronic
components are also higher quality, with complicated circuits to detect
marginal signal strength off of the platter, determine when a signal is
marginal, change the sensing parameters in an attempt to reread the data
correctly, and so on.
That's all very nice, but that actually (while as explained above being
an eminently debatable question in its own right) hasn't been the main
subject under discussion here: it's been whether hardware *RAID* is any
more reliable than software RAID (not what kind of disks one should use
after having made that RAID choice).
>
The firmware must be able to work with this hardware to handle read
errors and change those parameters, automatically mark marginal sectors
bad before the become totally wiped out,
Whether you're aware of it or not, modern SATA drives (and even
not-too-old ATA drives) do *all* the things that you just described in
your last one-and-a-half paragraphs.

and if the data cannot be read,
automatically retry from the mirror.
It really doesn't matter whether that's done in hardware or in software.

And if the retry occurs, the
firmware must mark the original track bad and rewrite it with the good
data.
Modern disks (both FC/SCSI and ATA/SATA) do that themselves, without
waiting for instructions from a higher level. They report any failure
up so that the higher level (again, doesn't matter whether it's firmware
or software) can correct the data if a good copy can be found elsewhere.
If its internal retry succeeds, the disk doesn't report an error, but
does log it internally such that any interested higher-level firmware or
software can see whether such successful retries are starting to become
alarmingly frequent and act accordingly.
>
Also, with two or more controllers, the controllers talk to each other
directly, generally over a dedicated bus. They keep each other informed
of their status and constantly run diagnostics on themselves and each
other when the system is idle.
Which is only necessary because they're doing things like capturing
updates in NVRAM (updates that must survive controller failure and thus
need to be mirrored in NVRAM at the other controller): if you eliminate
that level of function, you lose any need for that level of complexity
(not to mention eliminating a complete layer of complex hardware with
its own potential to fail).

As I said at the outset, hardware RAID *does* have some *performance*
advantages (though new software approaches to handling data continue to
erode them). But there's no intrinsic *reliability* advantage: if you
don't need that NVRAM mirrored between controllers for performance
reasons, it adds nothing to (and may actually subtract from) your
system's reliability compared with a software approach.

Having multiple paths to each disk isn't all that critical in RAID-1/10
configurations, since you can split the copies across two controllers to
ensure that one copy remains available if a controller dies (not that
frequent an occurrence - arguably, no more likely than that your system
board will experience some single point of *complete* failure). SATA
port selectors allow system fail-over, as does use of SAS or FC
connectivity to the disks (and the latter two support multiple paths to
each disk as well, should you want them).

These tests include reading and writing
test cylinders on the disks to verify proper operation.
The background disk scrubbing which both hardware and software RAID
approaches should be doing covers that (and if there's really *no*
writing going on in the system for long periods of time, the software
can exercise that as well once in a while).
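
For what a scrub pass amounts to, here is a rough sketch using the same kind of hypothetical
checksum-verifying drive objects as in the earlier sketches; it is not ZFS or RAID-firmware
code, just the read-both-copies-and-heal loop.

def scrub_pass(primary, mirror, sector_count):
    # Read both copies of every sector and heal whichever side fails its check.
    repaired, lost = 0, 0
    for lba in range(sector_count):
        d1, ok1 = primary.read_sector(lba)
        d2, ok2 = mirror.read_sector(lba)
        if ok1 and not ok2:
            mirror.write_sector(lba, d1)    # heal the mirror from the primary
            repaired += 1
        elif ok2 and not ok1:
            primary.write_sector(lba, d2)   # heal the primary from the mirror
            repaired += 1
        elif not ok1 and not ok2:
            lost += 1                       # neither copy verifies: report it
    return repaired, lost
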
>
Additionally, in the more expensive RAID devices, checksums are
typically at least 32 bits long (your off-the-shelf drive typically uses
a 16 bit checksum), and the checksum is built in hardware - much more
expensive, but much faster than doing it in firmware. Checksum
comparisons are done in hardware, also.
Your hand-waving just got a bit fast to follow there.

1. Disks certainly use internal per-sector error-correction codes when
transferring data to and from their platters. They are hundreds
(perhaps by now *many* hundreds) of bits long.

2. Disks use cyclic redundancy checks on the data that they accept from
and distribute to the outside world (old IDE disks did not, but ATA
disks do and SATA disks do as well - IIRC the width is 32 bits).

3. I'd certainly expect any RAID hardware to use those CRCs to
communicate with both disks and host systems: that hardly qualifies as
anything unusual. If you were talking about some *other* kind of
checksum, it would have to have been internal to the RAID, since the
disks wouldn't know anything about it (a host using special driver
software potentially could, but it would add nothing of obvious value to
the CRC mechanisms that the host already uses to communicate directly
with disks, so I'd just expect the RAID box to emulate a disk for such
communication).

4. Thus data going from system memory to disk platter and back goes (in
each direction) through several interfaces and physical connectors and
multiple per-hop checks, and the probability of some undetected failure,
while very small for any given interface, connector, or hop, is not
quite as small for the sum of all of them (as well as there being some
errors, such as misdirected or lost writes, that none of those checks
can catch). What ZFS provides (that by definition hardware RAID cannot,
since it must emulate a standard block-level interface to the host) is
an end-to-end checksum that verifies data from the time it is created in
main memory to the time it has been fetched back into main memory from
disk. IBM, NetApp, and EMC use somewhat analogous supplementary
checksums to protect data: in the i-series case I believe that they are
created and checked in main memory at the driver level and are thus
comparably strong, while in NetApp's and EMC's cases they are created
and checked in the main memory of the file server or hardware box but
then must get to and from client main memory across additional
interfaces, connectors, and hops which have their own individual checks
and are thus not comparably end-to-end in nature - though if the NetApp
data is accessed through a file-level protocol that includes an
end-to-end checksum that is created and checked in client and server
main memory rather than, e.g., in some NIC hardware accelerator it could
be *almost* comparable in strength.
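
As a purely schematic illustration of that end-to-end idea (this is not ZFS code, and SHA-256
merely stands in for whatever checksum a real implementation uses): the checksum is computed
in main memory before the data leaves it, stored apart from the data, and re-verified in main
memory after the data returns, so any mishap on any intermediate hop shows up as a mismatch.

import hashlib, os, tempfile

def write_block(path, data):
    digest = hashlib.sha256(data).hexdigest()    # computed while data is still in memory
    with open(path, "wb") as f:
        f.write(data)                            # data then crosses every hop below
    return digest                                # kept with the block pointer, not the block

def read_block(path, expected):
    with open(path, "rb") as f:
        data = f.read()
    if hashlib.sha256(data).hexdigest() != expected:
        raise IOError("end-to-end checksum mismatch - fetch the other copy")
    return data                                  # verified back in main memory

block = os.path.join(tempfile.mkdtemp(), "block0")
digest = write_block(block, b"payload" * 512)
assert read_block(block, digest) == b"payload" * 512
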
>
Plus, with verified writes, the firmware has to go back and reread the
data the next time the sector comes around and compare it with the
contents of the buffer. Again, this is often done in hardware on the
high end RAID systems.
And can just as well be done in system software (indeed, this is often a
software option in high-end systems).
>
And, most of these RAID devices use custom chip sets - not something off
the shelf.
That in itself is a red flag: they are far more complex and also get
far less thoroughly exercised out in the field than more standard
components - regardless of how diligently they're tested.

As others have pointed out, high-end RAID firmware updates are *not*
infrequent. And they don't just do them for fun.

Designing the chipsets themselves is in itself quite
expensive, and due to the relatively limited run and high density of the
chipsets, they are quite expensive to produce.
As I observed at the outset, another reason to do the work in system
software.
>
There's a lot more to it. But the final result is these devices have a
lot more hardware and software, a lot more internal communications, and
a lot more firmware. And it costs a lot of money to design and
manufacture these devices.
And all those things are *disadvantages*, not recommendations.

They can also significantly limit their utility. For example, VMS
clusters support synchronous operation at separations up to 500 miles
(actually, more, but beyond that it starts to get into needs for special
tweaking) - but using host-based software mirroring rather than hardware
mirroring (because most hardware won't mirror synchronously at anything
like that distance - not to mention requiring a complete second
connection at *any* distance, whereas the normal cluster LAN or WAN can
handle software mirroring activity).

That's why you won't find them at your
local computer store.
I seriously doubt that anyone who's been talking with you (or at least
trying to) about hardware RAID solutions has been talking about any that
you'd find at CompUSA. EMC's Symmetrix, for example, was the gold
standard of enterprise-level hardware RAID for most of the '90s - only
relatively recently did IBM claw back substantial market share in that
area (along with HDS).
>
Some of this can be emulated in software.
*All* of the RAID part can be.

But the software cannot
detect when a signal is getting marginal (it's either "good" or "bad"),
adjust the r/w head parameters, and similar things.
And neither can hardware RAID: those things happen strictly internally
at the disk (for that matter, by definition *anything* that the disk
externalizes can be handled by software as well as by RAID hardware).

Yes, it can
checksum the data coming back and read from the mirror drive if
necessary.
Yup.

Now, that *used* to be at least something of a performance issue - being
able to offload that into firmware was measurably useful. But today's
processor and memory bandwidth makes it eminently feasible - even in
cases where it's not effectively free (if you have to move the data, or
have to compress/decompress or encrypt/decrypt it, you can generate the
checksum as it's passing through and pay virtually no additional cost at
all).

That's still only a wash when conventional checksum mechanisms are used.
But when you instead use an end-to-end checksum like ZFS's (which you
can do *only* when the data is in main memory, hence can't offload) you
get a significant benefit from it.
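
In miniature, that "generate it as the data passes through" point looks like this
(illustrative Python, not any product's code): the digest is updated chunk by chunk inside
the same loop that compresses, so no extra pass over the data is needed.

import hashlib, zlib

def compress_with_checksum(chunks):
    digest = hashlib.sha256()
    comp = zlib.compressobj()
    out = []
    for chunk in chunks:
        digest.update(chunk)                 # checksum computed in the same pass...
        out.append(comp.compress(chunk))     # ...as the compression work
    out.append(comp.flush())
    return b"".join(out), digest.hexdigest()

compressed, checksum = compress_with_checksum([b"a" * 8192, b"b" * 8192])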

It might even be able to tell the controller to run a
self-check (most controllers do have that capability) during idle times.
If there were any reason to - but without the complexity of RAID
firmware to worry about, any need for checks beyond what the simpler
controller should probably be doing on its own becomes questionable.
But it can't do a lot more than that. The controller interface isn't
smart enough to do a lot more.
And without having to handle RAID management, it doesn't have to be.

- bill

Nov 14 '06 #86
Bill Todd wrote:
Jerry Stuckle wrote:

...
>>None, but thirty-odd years ago I did work full-time for three years
in a mental hospital, treating disturbed adolescents.


...
>Right. What did you do - empty their trash cans for them?


I fully understand how limited your reading skills are, but exactly what
part of the end of the above sentence surpassed even your meager ability
to read?

One of the many things I learned there was that some people are simply
beyond help - professional or otherwise. You appear to be one of them:
the fact that not a single person here has defended you, but rather
uniformly told you how deluded you are, doesn't faze you in the slightest.

Fortunately, that's in no way my problem, nor of any concern to me. So
have a nice life in your own private little fantasy world - just don't
be surprised that no one else subscribes to it.

- bill
Yes, I agree. I would suggest you go back through this thread and see
who needs the help, but it's obviously beyond your comprehension.

A quick refresher. You came up with some "facts" but had nothing other
than a couple of blogs to back them up. You have no technical
background, and are incapable of understanding even the basic
electronics about which you espouse "facts". Yet you regard them as the
ultimate truths.

And when I came back and shot down your arguments one by one, you
started the personal attacks. You have yet to refute any of the facts I
gave you, other than to repeat your hype and drivel (as if that makes
them even more factual) and more personal attacks.

Go away, little troll. Your mommy is calling you.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================
Nov 14 '06 #87
Robert Milkowski wrote:
Jerry Stuckle <js*******@attglobal.net> wrote:
>>Robert Milkowski wrote:

No, a truly fault-tolerant hardware RAID is VERY expensive to develop
and manufacture. You don't take $89 100GB disk drives off the shelf,
tack them onto an EIDE controller and add some software to the system.

You first have to start with high quality disk drives. The electronic
components are also higher quality, with complicated circuits to detect
marginal signal strength off of the platter, determine when a signal is
marginal, change the sensing parameters in an attempt to reread the data
correctly, and so on.

The firmware must be able to work with this hardware to handle read
errors and change those parameters, automatically mark marginal sectors
bad before they become totally wiped out, and if the data cannot be read,
automatically retry from the mirror. And if the retry occurs, the
firmware must mark the original track bad and rewrite it with the good data.

Also, with two or more controllers, the controllers talk to each other
directly, generally over a dedicated bus. They keep each other informed
of their status and constantly run diagnostics on themselves and each
other when the system is idle. These tests include reading and writing
test cylinders on the disks to verify proper operation.

Additionally, in the more expensive RAID devices, checksums are
typically at least 32 bits long (your off-the-shelf drive typically uses
a 16 bit checksum), and the checksum is built in hardware - much more
expensive, but much faster than doing it in firmware. Checksum
comparisons are done in hardware, also.

Plus, with verified writes, the firmware has to go back and reread the
data the next time the sector comes around and compare it with the
contents of the buffer. Again, this is often done in hardware on the
high end RAID systems.

And, most of these RAID devices use custom chip sets - not something off
the shelf. Designing the chipsets themselves is in itself quite
expensive, and due to the relatively limited run and high density of the
chipsets, they are quite expensive to produce.

There's a lot more to it. But the final result is these devices have a
lot more hardware and software, a lot more internal communications, and
a lot more firmware. And it costs a lot of money to design and
manufacture these devices. That's why you won't find them at your
local computer store.

Some of this can be emulated in software. But the software cannot
detect when a signal is getting marginal (it's either "good" or "bad"),
adjust the r/w head parameters, and similar things. Yes, it can
checksum the data coming back and read from the mirror drive if
necessary. It might even be able to tell the controller to run a
self-check (most controllers do have that capability) during idle times.
But it can't do a lot more than that. The controller interface isn't
smart enough to do a lot more.


The point is that you can still use such an array and put ZFS on top of it
for many reasons - easier management is one of them; another is better
data protection than if you use a classic file system.
The point is that such an array makes ZFS unnecessary. Sure, you *can*
use it (if you're using Linux - most of these systems do not). There is
nothing for ZFS to "manage" - configuration is done through utilities
(and sometimes an Ethernet port or similar). There is no management
interface for the file system - it all looks like a single disk (or
several disks, depending on the configuration).

As for data protection - if the RAID array can't read the data, it's
lost far beyond what ZFS or any other file system can do - unless you
have another complete RAID being run by ZFS. And if that's the case,
it's cheaper to have multiple mirrors.

There are a few very high end systems that use 3 drives and compare everything (2
out of 3 wins). But these are very, very rare, and only used for the
absolutely most critical data (i.e. space missions, where they can't be
repaired/replaced easily).

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================
Nov 14 '06 #88
Bill Todd wrote:
Dear me - I just bade you a fond farewell, and here you've at last come
up with something at least vaguely technical (still somewhat mistaken,
but at least technical). So I'll respond to it in kind:

Jerry Stuckle wrote:

...

a truly fault-tolerant hardware RAID is VERY expensive to develop
>and manufacture.


That's true, and one of the reasons why it makes a lot more sense to do
the work in software instead (as long as small-update latency is not
critical). One of the reasons for the rise of high-end, high-cost
hardware RAID systems was the lag in development of system software in
that area. Another was the ability of the hardware approach to bundle
in NVRAM write acceleration that was considerably more difficult to add
(and then use) as a special system device (the venerable Prestoserve
product comes to mind), plus large amounts of additional cache that
systems may not even have been able to support at all due to
address-space limitations: the ability to look like a plain old disk
(no system or application software changes required at all) but offer
far higher reliability (through redundancy) and far better small-update
and/or read-caching performance helped make the sale.
Yes, and software implementations are poor replacements for a truly
fault-tolerant system. And the high end RAID devices do not require
special software - they look like any other disk device attached to the
system.

As for bundling write acceleration in NVRAM - again, meaningless because
good RAID devices aren't loaded as a "special system device".

Prestoserve was one of the first lower-end RAID products made. However,
there were a huge number of them before that. But you wouldn't find
them on a PC. They were primarily medium and large system devices.
Prestoserve took some of the ideas and moved much of the hardware
handling into software. Unfortunately, when they did it, they lost the
ability to handle problems at a low-level (i.e. read head biasing,
etc.). It did make the arrays a lot cheaper, but at a price.

And in the RAID devices, system address space was never a problem -
because the data was transferred to RAID cache immediately. This did
not come out of the system pool; the controllers have their own cache.

I remember 64MB caches in the controllers way back in the mid 80's.
It's in the GB, now. No address space limitations on the system because
it didn't use system memory.
But time marches on. Most serious operating systems now support (either
natively or via extremely reputable decade-old, thoroughly-tested
third-party system software products from people like Veritas) software
RAID, and as much cache memory as you can afford (no more address-space
limitations there) - plus (with products like ZFS) are at least starting
to address synchronous small-update throughput (though when synchronous
small-update *latency* is critical there's still no match for NVRAM).
Sure, you can get software RAID. But it's not as reliable as a good
hardware RAID.
You don't take $89 100GB disk drives off the shelf,
>tack them onto an EIDE controller and add some software to the system.


Actually, you can do almost *precisely* that, as long as the software
handles the situation appropriately - and that's part of what ZFS is
offering (and what you so obviously completely fail to be able to grasp).
In the cheap RAID devices, sure. But not in the good ones. You're
talking cheap. I'm talking quality.
No disk or firmware is completely foolproof. Not one. No matter how
expensive and well-designed. So the question isn't whether the disks
and firmware are unreliable, but just the degree and manner in which
they are.
I never said they were 100% foolproof. Rather, I said they are amongst
the most tested software made. Probably the only software tested more
thoroughly is the microcode on CPU's. And they are as reliable as
humanly possible.

Of course, the same thing goes for ZFS and any file system. They're not
completely foolproof, either, are they?
There is, to be sure, no way that you can make a pair of inexpensive
SATA drives just as reliable as a pair of Cheetahs, all other things
being equal. But it is *eminently* possible, using appropriate software
(or firmware), to make *three or four* inexpensive SATA drives *more*
reliable than a pair of Cheetahs that cost far more - and to obtain
better performance in many areas in the bargain.
And there is no way to make a pair of Cheetahs as reliable as drives
made strictly for high end RAID devices. Some of these drives still
sell for $30-60/GB (or more).
Do you buy Brand X drives off the back of a truck in an alley? Of
course not: you buy from Seagate (or someone you think has similar
credibility - and perhaps not their newly-acquired Maxtor drives for a
while yet), and for 24/7 use you buy their 'near line' drives (which
aren't much more expensive than their desktop versions). For *really*
hard 24/7 seek-intensive pounding, your only real SATA choice is Western
Digital's Raptor series - unless you just throw so many lesser drives at
the workload and distribute it across them sufficiently evenly that
there's no longer any real pounding on any given drive (which in fact is
not an unrealistic possibility, though one which must be approached with
due care).
Or RAID drives not available as single units - other than as replacement
parts for their specific RAID arrays.
Such reputable SATA drives aren't the equal of their high-end FC
cousins, but neither are they crap: in both cases, as long as your
expectations are realistic, you compensate for their limitations, and
you don't abuse them, they won't let you down.
No, I didn't say ANY drive was "crap". They're good drives, when used
for what they are designed for. But drives made for RAID arrays are in a
class by themselves. And they can do things that standard drives can't
(like dynamically adjust amplifiers and slew rates when reading and
writing data).
And you don't attach them through Brand X SATA controllers, either:
ideally, you attach them directly (since you no longer need any
intermediate RAID hardware), using the same quality electronics you have
on the rest of your system board (so the SATA connection won't
constitute a weak link). And by virtue of being considerably simpler
hardware/firmware than a RAID implementation, that controller may well
be *more* reliable.
There is no way this is more reliable than a good RAID system. If you
had ever used one, you wouldn't even try to make that claim.
If you've got a lot of disks to attach, quality SATA port multipliers,
SAS connections, and fibre-channel-to-SATA links are available.
Sure. But they still don't do the things RAID drives can do.
>>
You first have to start with high quality disk drives. The electronic
components are also higher quality, with complicated circuits to
detect marginal signal strength off of the platter, determine when a
signal is marginal, change the sensing parameters in an attempt to
reread the data correctly, and so on.


That's all very nice, but that actually (while as explained above being
an eminently debatable question in its own right) hasn't been the main
subject under discussion here: it's been whether hardware *RAID* is any
more reliable than software RAID (not what kind of disks one should use
after having made that RAID choice).
And the disk drive is a part of hardware RAID. Only a total idiot would
ignore the disk drive quality when discussing RAID reliability.
>>
The firmware must be able to work with this hardware to handle read
errors and change those parameters, automatically mark marginal
sectors bad before the become totally wiped out,


Whether you're aware of it or not, modern SATA drives (and even
not-too-old ATA drives) do *all* the things that you just described in
your last one-and-a-half paragraphs.
And let's see those drives do things like dynamically adjust the
electronics - such as amp gain, bias, slew rate... They can't do it.
and if the data cannot be read,
>automatically retry from the mirror.


It really doesn't matter whether that's done in hardware or in software.
Spoken by someone who truly has no idea what he's talking about. Anyone
who has worked with high performance, critical systems knows there is a
*huge* difference between doing it in hardware and software.
And if the retry occurs, the
>firmware must mark the original track bad and rewrite it with the good
data.


Modern disks (both FC/SCSI and ATA/SATA) do that themselves, without
waiting for instructions from a higher level. They report any failure
up so that the higher level (again, doesn't matter whether it's firmware
or software) can correct the data if a good copy can be found elsewhere.
If its internal retry succeeds, the disk doesn't report an error, but
does log it internally such that any interested higher-level firmware or
software can see whether such successful retries are starting to become
alarmingly frequent and act accordingly.
Yes, they report total failure on a read. But they can't go back and
try to reread the sector with different parms to the read amps, for
instance. And a good RAID controller will make decisions based in part
on what parameters it takes to read the data.
>>
Also, with two or more controllers, the controllers talk to each other
directly, generally over a dedicated bus. They keep each other
informed of their status and constantly run diagnostics on themselves
and each other when the system is idle.


Which is only necessary because they're doing things like capturing
updates in NVRAM (updates that must survive controller failure and thus
need to be mirrored in NVRAM at the other controller): if you eliminate
that level of function, you lose any need for that level of complexity
(not to mention eliminating a complete layer of complex hardware with
its own potential to fail).
This has nothing to do with updates in NVRAM. This has everything to do
with processing the data, constant self-checks, etc. This is critical
in high-reliability systems.
As I said at the outset, hardware RAID *does* have some *performance*
advantages (though new software approaches to handling data continue to
erode them). But there's no intrinsic *reliability* advantage: if you
don't need that NVRAM mirrored between controllers for performance
reasons, it adds nothing to (and may actually subtract from) your
system's reliability compared with a software approach.
Again, you make a generalization about which you know nothing. How many
$500K+ RAID arrays have you actually worked on? For that matter, how
many $50K arrays? $5K?
Having multiple paths to each disk isn't all that critical in RAID-1/10
configurations, since you can split the copies across two controllers to
ensure that one copy remains available if a controller dies (not that
frequent an occurrence - arguably, no more likely than that your system
board will experience some single point of *complete* failure). SATA
port selectors allow system fail-over, as does use of SAS or FC
connectivity to the disks (and the latter two support multiple paths to
each disk as well, should you want them).
I don't believe I ever said anything about multiple paths to each disk.
But you're correct, some RAID arrays have them.
These tests include reading and writing
>test cylinders on the disks to verify proper operation.


The background disk scrubbing which both hardware and software RAID
approaches should be doing covers that (and if there's really *no*
writing going on in the system for long periods of time, the software
can exercise that as well once in a while).
No, it doesn't. For instance, these tests include things like writing
with a lower-level signal than normal and trying to read it back. It
helps catch potential problems in the heads and electronics. The same
is true for writing with stronger than normal currents - and trying to
read them back. Also checking adjacent tracks for "bit bleed". And a
lot of other things.

These are things again no software implementation can do.
>>
Additionally, in the more expensive RAID devices, checksums are
typically at least 32 bits long (your off-the-shelf drive typically
uses a 16 bit checksum), and the checksum is built in hardware - much
more expensive, but much faster than doing it in firmware. Checksum
comparisons are done in hardware, also.


Your hand-waving just got a bit fast to follow there.

1. Disks certainly use internal per-sector error-correction codes when
transferring data to and from their platters. They are hundreds
(perhaps by now *many* hundreds) of bits long.
Actually, not. Sectors are still 512 bytes. And the checksums (or ECC,
if they use them) are still only 16 or 32 bits. And even if they use
ECC, 32 bits can only correct up to 3 bad bits out of the 512
bytes. None use "many hundreds of bits". It would waste too much disk
space.
>
2. Disks use cyclic redundancy checks on the data that they accept from
and distribute to the outside world (old IDE disks did not, but ATA
disks do and SATA disks do as well - IIRC the width is 32 bits).
See above. And even the original IDE drives used a 16 bit checksum.
3. I'd certainly expect any RAID hardware to use those CRCs to
communicate with both disks and host systems: that hardly qualifies as
anything unusual. If you were talking about some *other* kind of
checksum, it would have to have been internal to the RAID, since the
disks wouldn't know anything about it (a host using special driver
software potentially could, but it would add nothing of obvious value to
the CRC mechanisms that the host already uses to communicate directly
with disks, so I'd just expect the RAID box to emulate a disk for such
communication).
CRC's are not transferred to the host system, either in RAID or non-RAID
drives. Yes, some drives have that capability for diagnostic purposes.
But as a standard practice, transferring 512 bytes is 512 bytes of
data - no more, no less.
4. Thus data going from system memory to disk platter and back goes (in
each direction) through several interfaces and physical connectors and
multiple per-hop checks, and the probability of some undetected failure,
while very small for any given interface, connector, or hop, is not
quite as small for the sum of all of them (as well as there being some
errors, such as misdirected or lost writes, that none of those checks
can catch). What ZFS provides (that by definition hardware RAID cannot,
since it must emulate a standard block-level interface to the host) is
an end-to-end checksum that verifies data from the time it is created in
main memory to the time it has been fetched back into main memory from
disk. IBM, NetApp, and EMC use somewhat analogous supplementary
checksums to protect data: in the i-series case I believe that they are
created and checked in main memory at the driver level and are thus
comparably strong, while in NetApp's and EMC's cases they are created
and checked in the main memory of the file server or hardware box but
then must get to and from client main memory across additional
interfaces, connectors, and hops which have their own individual checks
and are thus not comparably end-to-end in nature - though if the NetApp
data is accessed through a file-level protocol that includes an
end-to-end checksum that is created and checked in client and server
main memory rather than, e.g., in some NIC hardware accelerator it could
be *almost* comparable in strength.
Yes, ZFS can correct for errors like bad connectors and cables. And I
guess you need it if you use cheap connectors or cables. But even if
they do fail - it's not going to be a one-time occurrence. Chances are
your system will crash within a few hundred ms.

I don't know about NetApp, but IBM doesn't work this way at all. The
channel itself is parity checked by hardware on both ends. Any parity
check brings the system to an immediate halt.
>>
Plus, with verified writes, the firmware has to go back and reread the
data the next time the sector comes around and compare it with the
contents of the buffer. Again, this is often done in hardware on the
high end RAID systems.


And can just as well be done in system software (indeed, this is often a
software option in high-end systems).
Sure, it *can* be done with software, at a price.
>>
And, most of these RAID devices use custom chip sets - not something
off the shelf.


That in itself is a red flag: they are far more complex and also get
far less thoroughly exercised out in the field than more standard
components - regardless of how diligently they're tested.
Gotten a cell phone lately? Chances are the chips in your phone are
custom-made. Each manufacturer creates its own. Or an X-BOX, Nintendo,
PlayStation, etc.? Most of those have custom chips. And the same is
true for microwaves, TV sets and more.

The big difference is that Nokia can make 10M custom chips for its
phones; for a high-end RAID device, 100K is a big run.

As others have pointed out, high-end RAID firmware updates are *not*
infrequent. And they don't just do them for fun.

Designing the chipsets themselves is in itself quite
>expensive, and due to the relatively limited run and high density of
the chipsets, they are quite expensive to produce.


As I observed at the outset, another reason to do the work in system
software.
And data in the system and system software can be corrupted. Once the
data is in the RAID device, it cannot.
>>
There's a lot more to it. But the final result is these devices have
a lot more hardware and software, a lot more internal communications,
and a lot more firmware. And it costs a lot of money to design and
manufacture these devices.


And all those things are *disadvantages*, not recommendations.
And all of these are advantages. They increase reliability and integrity.

You seem to think software is the way to go. Just tell me one thing.
When was the last time you had to have your computer fixed because of a
hardware problem? And how many times have you had to reboot due to a
software problem?

And you say software is as reliable?
They can also significantly limit their utility. For example, VMS
clusters support synchronous operation at separations up to 500 miles
(actually, more, but beyond that it starts to get into needs for special
tweaking) - but using host-based software mirroring rather than hardware
mirroring (because most hardware won't mirror synchronously at anything
like that distance - not to mention requiring a complete second
connection at *any* distance, whereas the normal cluster LAN or WAN can
handle software mirroring activity).
Who's talking about mirroring for 500 miles? Not me. And none of the
systems I know about do this for data integrity reasons.

Some do it for off-site backup, but that has nothing to do with RAID.
That's why you won't find them at your
>local computer store.


I seriously doubt that anyone who's been talking with you (or at least
trying to) about hardware RAID solutions has been talking about any that
you'd find at CompUSA. EMC's Symmetrix, for example, was the gold
standard of enterprise-level hardware RAID for most of the '90s - only
relatively recently did IBM claw back substantial market share in that
area (along with HDS).
Actually, Symmetrix grew big in the small and medium systems, but IBM
never lost the lead in the top end RAID solutions. But they also were
(and still are) quite a bit more expensive than EMC's.

>>
Some of this can be emulated in software.


*All* of the RAID part can be.
Let's see you do things like adjust drive electronics in software. And
for what you can do - let's see you do it without any impact on the system.
But the software cannot
>detect when a signal is getting marginal (it's either "good" or "bad"),
adjust the r/w head parameters, and similar things.


And neither can hardware RAID: those things happen strictly internally
at the disk (for that matter, by definition *anything* that the disk
externalizes can be handled by software as well as by RAID hardware).
And here you show you know nothing about what you're talking about. RAID
drives are specially built to work with their controllers. And RAID
controllers are made to be able to do these things. This is very low-level
stuff - not things which are available outside the drive/controller.

Effectively, the RAID controller and the disk controller become one
unit. Separate, but one.
Yes, it can
>checksum the data coming back and read from the mirror drive if
necessary.


Yup.

Now, that *used* to be at least something of a performance issue - being
able to offload that into firmware was measurably useful. But today's
processor and memory bandwidth makes it eminently feasible - even in
cases where it's not effectively free (if you have to move the data, or
have to compress/decompress or encrypt/decrypt it, you can generate the
checksum as it's passing through and pay virtually no additional cost at
all).
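
A minimal Python sketch of that point, assuming the data is being compressed
anyway; the chunked interface is illustrative - the idea is simply that the
running checksum is updated in the same pass that already has to touch every
byte.

import zlib

def compress_with_checksum(chunks):
    # Compress a stream of byte chunks and checksum it in the same pass.
    crc = 0
    comp = zlib.compressobj()
    out = []
    for chunk in chunks:
        # The data must be touched to compress it, so updating a running
        # CRC here adds only a trivial amount of extra work.
        crc = zlib.crc32(chunk, crc)
        out.append(comp.compress(chunk))
    out.append(comp.flush())
    return b"".join(out), crc & 0xFFFFFFFF

compressed, checksum = compress_with_checksum([b"some ", b"streamed ", b"data"])
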
Sorry, Bill, this statement is really off the wall.

Then why do all the high-end disk controllers use DMA to transfer data?
Because it's faster and takes fewer CPU cycles than doing it in software,
that's why. And computing checksums for 512 bytes takes significantly
longer than actually transferring the data to/from memory in software.

Also, instead of allocating 512-byte buffers, the OS would have to
allocate 514- or 516-byte buffers. This removes a lot of the optimization
possible when the system is using buffers during operations.

Additionally, different disk drives internally use different checksums.

Plus there is no way to tell the disk what to write for a checksum.
This is hard-coded into the disk controller.
That's still only a wash when conventional checksum mechanisms are used.
But when you instead use an end-to-end checksum like ZFS's (which you
can do *only* when the data is in main memory, hence can't offload) you
get a significant benefit from it.
Sure, if there's a hardware failure. But I repeat - how often do you
get hardware errors? How often do you get software errors? Which is
more reliable?
It might even be able to tell the controller to run a
>self-check (most controllers do have that capability) during idle times.


If there were any reason to - but without the complexity of RAID
firmware to worry about, any need for checks beyond what the simpler
controller should probably be doing on its own becomes questionable.
> But it can't do a lot more than that. The controller interface isn't
smart enough to do a lot more.


And without having to handle RAID management, it doesn't have to be.

- bill
Nope, and it's too bad, also.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================
Nov 14 '06 #89
Jerry Stuckle <js*******@attglobal.netwrote:
Robert Milkowski wrote:
The point is that you can still use such array and put on top of it ZFS
for many reasons - easier management is one of reasons, another is better
data protection than if you use classic file system.

The point is that such an array makes ZFS unnecessary. Sure, you *can*
use it (if you're using Linux - most of these systems do not). There is
nothing for ZFS to "manage" - configuration is done through utilities
(and sometimes an Ethernet port or similar). There is no management
interface for the file system - it all looks like a single disk (or
several disks, depending on the configuration).

As for data protection - if the RAID array can't read the data, it's
lost far beyond what ZFS or any other file system can do - unless you
have another complete RAID being run by ZFS. And if that's the case,
it's cheaper to have multiple mirrors.
This is where you don't understand. ZFS protects me from a bad driver, a bad
FC switch, or a bad FC/SCSI/ESCON/... adapter corrupting data - as one SCSI
adapter unfortunately did a few weeks ago. So it's not only the array.
And while I admit that I haven't (yet) seen ZFS detect data corruption on
Symmetrix boxes, I have on other arrays. That could be because I put ZFS on
Symmetrix boxes not that long ago, and compared to the other arrays there
isn't that much storage under ZFS on Symmetrix here - so statistically I may
just have been lucky. And of course I expect Symmetrix to be more reliable
than a JBOD or a mid-range array.

Now, when it comes to manageability - well, it's actually the manageability
of ZFS that drew my attention in the first place, because of features like
pooled storage, many file systems, shrinking/growing on the fly, etc., which
make ZFS just rock, especially in fast-changing environments.
When you've got to manage lots of fast-changing data for MANY clients, and
all of it keeps changing, with ZFS it's no problem at all - you create
another file system in a second, it has all the available storage, and it
doesn't really matter which file system is being consumed faster, etc.

Then you've got other features which make ZFS quite compelling. In our
environment, with lots of small random writes which ZFS turns into mainly
sequential writes, the write speed-up is considerable. It helps even with
Symmetrix boxes with 16 GB or more of cache, not to mention smaller arrays.
In some tests ZFS was actually quicker with write-through on the array than
traditional file systems were with the write-back cache enabled. But the
most important "test" is production - and ZFS is faster here.

Then you've got basically free snapshots with no impact on performance, no
need for extra-sliced storage, etc., so you get used to making them
automatically on a daily basis. And if you have file systems with tens of
millions of small files, then doing backups with the zfs tools instead of
the standard tools (Legato, Tivoli) gives you as much as a 10-15x shorter
backup time, not to mention far less I/O to complete the work. Sometimes
it's the difference between a backup taking several days and taking hours.
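
A sketch of that style of zfs-level backup, driving the standard commands
from Python (the snapshot and target names are hypothetical): zfs send
serializes a snapshot as a stream, walking only live blocks instead of
crawling tens of millions of individual files.

import subprocess

# Snapshot the file system, then serialize that snapshot to a backup file.
subprocess.run(["zfs", "snapshot", "tank/home@backup-20061114"], check=True)
with open("/backup/home-20061114.zfs", "wb") as out:
    subprocess.run(["zfs", "send", "tank/home@backup-20061114"],
                   stdout=out, check=True)
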
Want to create several virtual environments, each with its own file system,
but you don't know exactly how many of them you'll end up with or how much
disk space each of them will consume? With ZFS such problems just don't
exist.

Then you've got dynamic block size, which also helps, especially when over
the years your mean file size changes considerably and the file size
distribution ends up with lots of small and lots of large files.

Then ZFS keeps all the file system information within the pool - so I don't
have to put entries in any system config files; even NFS shares can be
managed by zfs. It means I can take a freshly installed Solaris box, connect
it to the SAN and just import the ZFS pool with all its config - no backup
needed, no manual config - and I get the same parameters for all the file
systems in the pool within seconds.

Then, if my old SPARC box is getting slow, I can just import the pool on an
x64 box, or vice versa, and everything just works (tested it myself) without
any conversion, data migration, etc. It just works.
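
As a concrete sketch (the pool name "tank" is hypothetical, and Python's
subprocess module simply stands in for typing the two standard commands at a
shell prompt):

import subprocess

def zpool(*args):
    # Thin wrapper around the standard zpool command.
    subprocess.run(("zpool",) + args, check=True)

# On the old host: cleanly detach the pool (file systems, properties and
# shares all travel with it).
zpool("export", "tank")

# On the new host (SPARC or x64): scan the attached devices and bring the
# pool - and every file system in it - back online.
zpool("import", "tank")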

Then, in our devel environments, I sometimes need to make a writable copy of
a file system, test some changes, etc. With zfs, regardless of the file
system size (several TBs, sometimes more), I get a WRITABLE copy in one
second, without copying data and without any need for more space. When I'm
done I just delete the clone. Need to clone an entire virtual machine in one
second, with no additional disk space needed, and run some tests? I did -
works great. Need a copy of an entire database on a devel machine in 1s? A
writable copy? Regardless of the database size? With no performance impact
on the original database? No problem.
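
A sketch of that snapshot-and-clone workflow, again driving the standard zfs
command from Python; the dataset and snapshot names are made up for
illustration:

import subprocess

def zfs(*args):
    # Thin wrapper around the standard zfs command.
    subprocess.run(("zfs",) + args, check=True)

# Constant-time, space-free snapshot of a (possibly multi-TB) file system.
zfs("snapshot", "tank/db@before-test")

# Writable clone of that snapshot for the devel box; unchanged blocks are
# shared with the original, so no extra space is consumed up front.
zfs("clone", "tank/db@before-test", "tank/db-devel")

# ... run the tests against tank/db-devel ...

# Throw the clone away (and then the snapshot) when done.
zfs("destroy", "tank/db-devel")
zfs("destroy", "tank/db@before-test")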

You've got new equipment and want to test different RAID configs with your
application to see which performs best. So you set up a 50TB RAID-10 config
and run tests. Then you set up a 50TB RAID-5 config and run tests. Then
RAID-6. Then some combination (dynamic striping of RAID-6?). How much time
does it take just to build the RAID-5 set on the array? Well, sometimes two
days just to get to one test, and then another day or two of waiting before
the next one. ZFS creates its RAID sets within seconds, with no background
synchronization, so the disks are immediately ready to use. Again, you've
saved a lot of time.
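
For instance, a double-parity (RAID-6-style) pool over six disks comes up
with a single command and no initialization pass; a rough sketch, using
hypothetical Solaris device names:

import subprocess

# The pool is usable the moment the command returns - no background build.
subprocess.run(
    ["zpool", "create", "testpool", "raidz2",
     "c1t0d0", "c1t1d0", "c1t2d0", "c1t3d0", "c1t4d0", "c1t5d0"],
    check=True,
)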

You need RAID-5 or RAID-6 and you're doing lots of small writes from lots of
concurrent streams? Your performance generally sucks regardless of the cache
size in your Symmetrix or other array. Then you create RAID-5 (or RAID-6)
using zfs and suddenly you get N times the performance of the array on the
same hardware. Well, sometimes you just can't walk on by.

You've got an application which is disk-I/O constrained while there's plenty
of CPU power left, and you're running out of disk space. Well, just turn on
compression in ZFS on the fly and all new writes are compressed. The end
result is that free disk space rises, and performance is not worse but even
better. I did exactly that some time ago.
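
A sketch of flipping compression on for a live file system (the dataset name
is hypothetical):

import subprocess

# Turn compression on for an existing, mounted file system; only blocks
# written from now on are compressed, existing data is left alone.
subprocess.run(["zfs", "set", "compression=on", "tank/data"], check=True)

# Later, see how much space it is actually saving.
subprocess.run(["zfs", "get", "compressratio", "tank/data"], check=True)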

I could go on - all of the above is from my own experience in production
environments with ZFS over the last two years.

Like it or not, ZFS makes an admin's life MUCH easier, solves many problems,
and in many cases saves your data when your array screws up.

--
Robert Milkowski
rm************@wp-sa.pl
http://milek.blogspot.com
Nov 14 '06 #90
Jerry Stuckle <js*******@attglobal.netwrote:
Bill Todd wrote:
But time marches on. Most serious operating systems now support (either
natively or via extremely reputable decade-old, thoroughly-tested
third-party system software products from people like Veritas) software
RAID, and as much cache memory as you can afford (no more address-space
limitations there) - plus (with products like ZFS) are at least starting
to address synchronous small-update throughput (though when synchronous
small-update *latency* is critical there's still no match for NVRAM).

Sure, you can get software RAID. But it's not as reliable as a good
hardware RAID.
Not true. Actually, in some environments what you do is software mirroring
between two enterprise arrays, with your Oracle on top of that. That way you
get a more reliable config.

Also, when you're using your high-end array without ZFS, you basically get
less reliability than when you use the same array with ZFS.

And let's see those drives do things like dynamically adjust the
electronics - such as amp gain, bias, slew rate... They can't do it.
Again, you're missing the point. You still get all of this, because with ZFS
you do not throw away your array - you use it. ZFS is great, you know, but it
doesn't make storage out of air molecules. So with ZFS, among other features,
you get additional protection which HW RAID by itself cannot offer.
who has worked with high performance, critical systems knows there is a
*huge* difference between doing it in hardware and software.
Really? Actually, depending on the workload specifics and the hardware
specifics, I can see HW being faster than software, and the opposite.

In some cases a clever combination of both gives the best results.

Yes, ZFS can correct for errors like bad connectors and cables. And I
guess you need it if you use cheap connectors or cables. But even if
they do fail - it's not going to be a one-time occurrance. Chances are
your system will crash within a few hundred ms.
Geez... I don't know how you configure your systems, but my systems won't
crash just because of a bad cable or connector. They will use another link.
These are the basics of HA storage management and I'm surprised you don't
know how to do it. And now, thanks to ZFS, if an FC switch, HBA or something
else corrupts data, ZFS will detect and correct it.
I dont' know about NetApp, but IBM doesn't work this way at all. The
channel itself is parity checked by hardware on both ends. Any parity
check brings the system to an immediate halt.
What???? Just because you get some errors on a link you halt the entire
system? Well, just switch to a good link.
I don't believe they actually do that.
And data in the system and system software can be corrupted. Once the
data is in the RAID device, it cannot.
Really? Unfortunately for your claims, it happens.
And you know, even your beloved IBM array lost some data here.
The array even warned us about it :) It wasn't a Shark, but it also wasn't
low-end among IBM's arrays. And it did it more than once.

>
You seem to think software is the way to go. Just tell me one thing.
When was the last time you had to have your computer fixed because of a
hardware problem? And how many times have you had to reboot due to a
software problem?
And how many times have you had to reboot an entire array for some upgrade
or correction? Even high-end arrays? Including IBM's arrays? I've had to do
it many times because I work with them. What about you? Maybe your
environment isn't as demanding?

Actually, Symmetrix grew big in the small and medium systems, but IBM
never lost the lead in the top end RAID solutions. But they also were
(and still are) quite a bit more expensive than EMC's.
What IBM array are you talking about? Shark? Or maybe they've had something
top secret for years that only you know about?
In one minute I found some links for you.
As you seem to be fond of IBM, let's start with them.
http://www-03.ibm.com/systems/storag...snapvalidator/
"
The challenge: the risk of data corruption is inherent in data transfers

Organizations of any size that rely heavily on the integrity of Oracle data need to safeguard against data corruption. Because database servers and storage devices reside at opposite ends of the I/O path, corruption can occur as each data block transfer passes through a series of logical layers involving hardware and software from multiple vendors. Other factors, such as application anomalies and human error, present additional risk. As a result, data corruption can occur at any stage of the process, even with the protection inherent in the most robust storage systems. The impact of these corruptions can cause considerable disruption to business continuity, which can be time consuming and costly to resolve.
The solution: end-to-end data validation

IBM System Storage N series with SnapValidator. software is designed to provide a high level of protection for Oracle data, helping you to detect potential data corruption before it occurs. By adding intelligence and database awareness to modular storage systems-across iSCSI SAN, FC SAN and NAS protocols-the software can help extend the advantages of checksum functionality to a greater variety of organizations."

Of course it's not truly end-to-end and it's only for writes, but at least
IBM recognizes that data integrity is a problem despite the use of enterprise
RAID arrays.
Then something similar from EMC
http://www.emc.com/products/software/checksum.jsp

or Oracle itself
http://www.oracle.com/technology/dep...ocs/hardf.html
Other major vendors also recognize data corruption as a problem and all know
RAID isn't the complete answer. So they develop half-baked solutions like the
ones above. Of course it's better than nothing.

Then comes ZFS and completely changes the game. They (Sun) did something
which is really ahead of the competition and is innovative. And whether you
like it or not, and whether in your mind enterprise arrays are reliable or
not, data corruption happens and ZFS greatly protects against it. Even more -
ZFS does its job excellently both on enterprise storage and on cheap
commodity disks, which is great, as for many environments you can actually
build a reliable solution at orders of magnitude lower cost.

Now I understand why IBM doesn't like it :)
--
Robert Milkowski
rm************@wp-sa.pl
http://milek.blogspot.com
Nov 14 '06 #91
Robert Milkowski wrote:
Jerry Stuckle <js*******@attglobal.netwrote:
>>Robert Milkowski wrote:
>>>The point is that you can still use such array and put on top of it ZFS
for many reasons - easier management is one of reasons, another is better
data protection than if you use classic file system.

The point is that such an array makes ZFS unnecessary. Sure, you *can*
use it (if you're using Linux - most of these systems do not). There is
nothing for ZFS to "manage" - configuration is done through utilities
(and sometimes an Ethernet port or similar). There is no management
interface for the file system - it all looks like a single disk (or
several disks, depending on the configuration).

As for data protection - if the RAID array can't read the data, it's
lost far beyond what ZFS or any other file system can do - unless you
have another complete RAID being run by ZFS. And if that's the case,
it's cheaper to have multiple mirrors.


It's here you don't understand. ZFS protect me from bad driver, bad FC switch,
bad FC/SCSI/ESCON/... adapter corrupting data. As one SCSI adapter unfortunately
did few weeks ago. So it's not only the array.
And while I admit that I haven't seen (yet) ZFS detecting data corruption
on Symmetrix boxes, but I did on another arrays, it could be due to fact
I put ZFS on Symmetrix boxes not that long ago and comparing to other
arrays it's not that much storage under ZFS on Symmetrix here. So statisticaly
it can be that I'm just more lucky. And of course I expect Symmetrix to be
more reliable than JBOD or medium array.
Immaterial.

A bad driver won't let the system run - at least not for long. Same
with a bad FC switch, etc. And how long did your system run with a bad
SCSI adapter?

And yes, as I've stated before - like anything else, you get what you
paid for. Get a good quality RAID and you won't get data corruption issues.
Now when it comes to manageability - well, it's actually manageability
of ZFS that drove my attention at first. It's because of features like
pooled storage, many file systems, shrinking/growin on the fly, etc.
which make ZFS especially in a fast changing environments just rock.
When you've got manage lots of fast changing data for MANY clients, and all of
these is changing, with ZFS it's not problem at all - you create another
filesystem in a second, you've got all available storage to it, it doesn't really
matter which file system is being consumed faster, etc.
This has nothing to do with the reliability issues being discussed.
Then you got other feature which make ZFS quite compeling. In our environment
with lot of small random writes which with ZFS are mainly sequential writes,
write speed-up is considerable. It helps even with Symmetrix boxes with 16B or more
cache, not to mention smaller arrays. Well in some tests ZFS was actually
quickier even with write-thru on the array than with traditional file systems
with write-back cache. But most important "test" is production - and ZFS
is faster here.
Again, nothing to do with the reliability issues.
Then you got basicaly free snapshots with no impact on performance, no need
for extra-sliced storage, etc. So you get used to make them automaticaly on
daily basis. Then if you have file systems with tens of milions of small files
then doing backup using zfs tools instead standard tools (Legato, Tivoli) gives
you even 10-15x shorter time not to mention much less IO needed to complete work.
Well it's like doing backup for several days and do it in hours difference here
sometimes.
Ditto.
>
Want to create several virtual environment each with its own file system?
But you don't know exactly how many of them you'll end up and you
don't know how much disk space each of them will consume. With ZFS such
problems just doesn't exists.
When are you going to get back to reliability - which is the issue here?
Then you've got dynamic block size which also helps, especialy when during
years your mean file size changes consoderably and file size distribution
is you've got lots of small and lots of large files.

Then ZFS keeps all file system information within pool - so I don't have
to put entries in any system config files, even nfs shares can be managed
by zfs. It means I can take freshly installed Solaris box, connect it
into SAN and just import ZFS pool with all config - no backup needed, no
manual config - I get all the same parameters for all file systems in a
pool within seconds.

Then if my old SPARC box becoming slow I can just import pool on x64 box,
or vice versa and everything just works (tested it myself) without any
conversion, data migration, etc. It just works.
Ho Hum... I'm falling asleep.
Then in our devel environments I need some times to make a writable
copy of a file system, test some changes, etc. With zfs regardles of file
system size (several TB's, some time more) I get WRITABLE copy in one
second, without copying data, without any need for more space. When I'm
done I just delete clone. Well, need to clone entire virtual machine
in one second, with no additional disk space needed and make some tests?
I did, works great. Need entire data base copy on a devel machine in 1s?
Writable copy? Regardles of database size? With no performance impact
on original database? No problem.

You've got new quipment and want to test different RAID config with
your application to see which config performs best. So you setup
50TB RAID-10 config and make tests. Then you setup 50TB RAID-5 config
and make tests. Then RAID-6. Then some combination (dynamic striping
of RAID-6?). How much time it would take to just make RAID-5 on the array?
Well, sometimes even 2 days to just make some test, then wait another
dayy or two for another test. ZFS creates RAIDs within seconds with no
background synchronization, etc. so disks are immediately ready to use.
Again, you saved lot of time.
One correction. ZFS does not "create RAIDs". It EMULATES RAIDs. A big
difference. But still no discussion about reliability.
You need RAID-5 or RAID-6 and you're doing lots of small writes with
lots of concurrent streams? Your performance generally sucks regardles
of cache size in your Symmetrix or other array. Then you create
RAID-5 (or RAID-6) using zfs and suddenly you get N times the performance
of the array on the same hardware. Well, sometimes you just can't walk-by.

You've got application which is disk IO constrained and there's plenty
of CPU power left. And you're running out of disk space. Well, just
turn on compression in ZFS on-the-fly and all new writes are compressed.
The end effect is disk free is rising, and performance is not worse but
even better. Well, I did exactly that some time ago.

I could go on and all above is from my own experience on production environments
with ZFS for over last two years.

Like it or not but ZFS makes admin's life MUCH easier, solves many problems,
and in many cases saves your data when your array screw up.
And what does any of this have to do with the discussion at hand - which
is data reliability? You seem to have a penchant for changing the
subject when you can't refute the facts.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================
Nov 14 '06 #92
Robert Milkowski wrote:
Jerry Stuckle <js*******@attglobal.netwrote:
>>Bill Todd wrote:

>>>But time marches on. Most serious operating systems now support (either
natively or via extremely reputable decade-old, thoroughly-tested
third-party system software products from people like Veritas) software
RAID, and as much cache memory as you can afford (no more address-space
limitations there) - plus (with products like ZFS) are at least starting
to address synchronous small-update throughput (though when synchronous
small-update *latency* is critical there's still no match for NVRAM).

Sure, you can get software RAID. But it's not as reliable as a good
hardware RAID.


Not true. Actually in some environments what you do is software
mirroring between two enterprise arrays and put your Oracle on top
of it. That way you get more reliable config.
That just means it's more reliable than putting something on one array.
No surprises there. And putting it on 100 arrays is even more reliable.

And prove to me how doing it in software is more reliable than doing it
in hardware.
Also when you're using your high-end array without ZFS basicaly
you get less reliability when you use the same array with ZFS.
Proof? Statistics?
>
>>And let's see those drives do things like dynamically adjust the
electronics - such as amp gain, bias, slew rate... They can't do it.


Again, you missing the point. You still get all of this as with ZFS you do
not throw array your array - you use it. ZFS is great you know, but it
doesn't make storage out of air molecules. So with ZFS among other
features you get additional protection which HW RAID itself cannot offer.
This is where you've gone off the deep end, Robert, and proven you have
no idea what you're talking about.

ZFS cannot adjust amp gains. It cannot change the bias. It cannot
tweak the slew rates. And a lot more. These are all very low level
operations available only to the disk controller. And they are much of
what makes a difference between a high-end drive and a throw-away drive
(platter coating being another major cost difference).

None of this is available at any level to the OS.
>
>>who has worked with high performance, critical systems knows there is a
*huge* difference between doing it in hardware and software.


Really? Actually depending on workload specifics and hardware specifics
I can see HW being faster than software, and the opposite.

In some cases clever combination of both gives best results.
Wrong! Data transfer in hardware is ALWAYS faster than in software.
Hardware can transfer 4 bytes every clock cycle. Software requires a
loop and about 7 clock cycles to transfer the same 4 bytes.

And you can't have a "combination of both". Either the hardware does
the transfer, or it waits for the software to do it.
>
>>Yes, ZFS can correct for errors like bad connectors and cables. And I
guess you need it if you use cheap connectors or cables. But even if
they do fail - it's not going to be a one-time occurrance. Chances are
your system will crash within a few hundred ms.


Geezz... I don't know how you configure your systems but just 'coz
of bad cable or connector my systems won't crash. They will use another
link. These are basics in HA storage management and I'm suprised you
don't know how to do it. And now thanks to ZFS if a FC switch, hba
or something else will corrupt data ZFS will detect and correct.
Yep, and you need it if you use cheap cables and connectors.
Personally, I haven't seen a bad disk cable or connector in quite a
number of years. In fact, the only one I can remember seeing in the
past 10+ years was a display cable on a laptop - but that was because of
the flexing from opening and closing the lid.

So how often do YOU get bad cables or connectors?

Also, you're telling me you can go into your system while it's running
and just unplug any cable you want and it will keep running? Gee,
you've accomplished something computer manufacturers have dreamed about
for decades!
>
>>I dont' know about NetApp, but IBM doesn't work this way at all. The
channel itself is parity checked by hardware on both ends. Any parity
check brings the system to an immediate halt.


What???? Just becaouse you get some errors on a link you halt entire system?
Well, just switch to good link.
I don't belive they are doing it actually.
Yep. It sure does. And it happens to a system about once every 10
years. And in all of my years in hardware, only ONE time was it a cable
or connector. And that was because someone tried to force
it into the socket. Any other failure was caused by electronics.
>
>>And data in the system and system software can be corrupted. Once the
data is in the RAID device, it cannot.


Really? Unfortunately for your claims it happens.
And you know, even your beloved IBM's array lost some data here.
The array even warned us about it :) It wasn't Shark, but also
not low-end in IBMs arrays. And it did more than once.
Not with good quality RAIDs. And obviously your claim of having IBM's
array is as full of manure as your earlier claims in this thread.
>
>>You seem to think software is the way to go. Just tell me one thing.
When was the last time you had to have your computer fixed because of a
hardware problem? And how many times have you had to reboot due to a
software problem?


And how many times you had to reboot entire array for some upgrade or
corrections? Even high-end arrays? Including IBM's arrays? I had to do
it many times because I work with them. What about you? Maybe your envoronment
isn't as demanding?
What does this have to do with the question?

But since you asked. How many times have you had to reboot because of
some upgrade or correction to your software? A hell of a lot more than
with any array.
>
>>Actually, Symmetrix grew big in the small and medium systems, but IBM
never lost the lead in the top end RAID solutions. But they also were
(and still are) quite a bit more expensive than EMC's.


What IBM array are you talking about? Shark? Or maybe they got
for years something top secret only you know about?
In 1 minute I found some links for you.
As it seems you're fond of IBM lets start with them.

http://www-03.ibm.com/systems/storag...snapvalidator/
"
The challenge: the risk of data corruption is inherent in data transfers

Organizations of any size that rely heavily on the integrity of Oracle data need to safeguard against data corruption. Because database servers and storage devices reside at opposite ends of the I/O path, corruption can occur as each data block transfer passes through a series of logical layers involving hardware and software from multiple vendors. Other factors, such as application anomalies and human error, present additional risk. As a result, data corruption can occur at any stage of the process, even with the protection inherent in the most robust storage systems. The impact of these corruptions can cause considerable disruption to business continuity, which can be time consuming and costly to resolve.
The solution: end-to-end data validation

IBM System Storage N series with SnapValidator. software is designed to provide a high level of protection for Oracle data, helping you to detect potential data corruption before it occurs. By adding intelligence and database awareness to modular storage systems-across iSCSI SAN, FC SAN and NAS protocols-the software can help extend the advantages of checksum functionality to a greater variety of organizations."

Of course it's not trully end-to-end and it's only for writes, but at least IBM
recognizes that data integrity is a problem despite using enterprise RAID arrays.
Then something similar from EMC
http://www.emc.com/products/software/checksum.jsp

or Oracle itself
http://www.oracle.com/technology/dep...ocs/hardf.html
Other main vendors also recognizes data corruption as a problem and all
know RAID isn't complete answer. So they develop half-baked solutions as above.
Of course it's better than nothing.

Then comes ZFS and completely changes the game. They (Sun) did something which
is really ahead of competition and is innovative. And whether you like it or
not, and whether in your mind enterprise arrays are reliable or not, data corrpution
happens and ZFS greatly protects from it. Even more - ZFS does excellent its job
both on enterprise storage and on cheap industry disks. Which is great as for many
environments you can actually build reliable solution wiht orders of magnitude lower
costs.

Now I understand why IBM doesn't like it :)

You've really gone off the wall here, Robert. You have proven you are
blowing your "facts" out your ass. You have absolutely no idea what you're
talking about - your outlandish claims about what ZFS can do (adjust amp
gain, etc.) are proof of that. And your claim that software data transfer
is more efficient than hardware is hilarious.

And your previous post talked about all the neat things ZFS can do which
have absolutely nothing to do with reliability.

Robert, your credibility here is now zero. You've made way too many
claims that anyone with even a modicum of hardware knowledge could
refute - hell, even a tech school student could do it.

So nothing else you say has any credibility, either. Go back into your
hole. You're adding absolutely nothing to this conversation.
--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================
Nov 14 '06 #93
Jerry Stuckle <js*******@attglobal.netwrote:
Robert Milkowski wrote:

It's here you don't understand. ZFS protect me from bad driver, bad FC switch,
bad FC/SCSI/ESCON/... adapter corrupting data. As one SCSI adapter unfortunately
did few weeks ago. So it's not only the array.
And while I admit that I haven't seen (yet) ZFS detecting data corruption
on Symmetrix boxes, but I did on another arrays, it could be due to fact
I put ZFS on Symmetrix boxes not that long ago and comparing to other
arrays it's not that much storage under ZFS on Symmetrix here. So statisticaly
it can be that I'm just more lucky. And of course I expect Symmetrix to be
more reliable than JBOD or medium array.

Immaterial.

A bad driver won't let the system run - at least not for long. Same
with a bad FC switch, etc. And how long did your system run with a bad
SCSI adapter?
Actually it was writing data with some errors for hours, then the system
panicked. Then it kept writing data for another 7-9 hours.

Now, of course the system won't reboot just because of a bad switch - it's
not the first time I've had a problem with an FC switch (a long time ago,
granted). With at least dual links from each host, each to a different
switch (a different fabric), it's not a big problem.
And yes, as I've stated before - like anything else, you get what you
paid for. Get a good quality RAID and you won't get data corruption issues.
Is an IBM high-end array a good quality one, for example?

And what does any of this have to do with the discussion at hand - which
is data reliability? You seen to have a penchance for changing the
subject when you can't refute the facts.
Reliability comes from end-to-end data integrity, which HW RAID by itself
can't provide, so your data is less well protected.

Your RAID doesn't protect you from anything between itself and your host.
Vendors have recognized this for years - that's why IBM, EMC, Oracle, Sun,
Hitachi, etc. all provide some hacks for specific applications like Oracle.
Of course none of those solutions is nearly as complete as ZFS.
I know, you know better than all those vendors. You know better than the
people who actually lost their data on both cheap and $500k+ (you seem to
like that number) arrays. It's just that for some reason you can't accept
simple facts.

Reliability comes from keeping metadata blocks on different LUNs, on top of
whatever protection is configured. While that's not strictly a RAID issue,
the point is that ZFS has the file system integrated with the volume manager,
so it can do this. So even if you do plain striping with ZFS across RAID LUNs
and an entire LUN gets overwritten, your file system will still be consistent
- only the data (not the metadata) on that LUN is lost.

Reliability comes from never overwriting live data in place on the medium,
so you don't have to deal with incomplete writes, etc.

Reliability comes from integrating the file system with the volume manager,
so regardless of the RAID type you use, you are always consistent on disk
(both data and metadata) - something which traditional file systems, even
log-structured or journaled ones, can't guarantee. That's why even with
journaling you sometimes end up running fsck after all.

Reliability comes from knowing exactly where on each disk your data is, so
if your pool is not full of data, ZFS will resilver a disk in an emergency
MUCH faster by resilvering only the actual data rather than every block on
the disk. Also, because it understands the data on the disk, it starts the
resilver from /, i.e. from the top of the tree, so even before the resilver
is complete you already get some protection back. A classic array can't do
that, because it doesn't understand the data on its disks.
Now, there are other things in ZFS which greatly increase reliability,
manageability, and more. But as you can't agree with basic and simple facts,
it doesn't really make sense to go any further. It probably doesn't even
make sense to talk to you anymore.
--
Robert Milkowski
rm**************@wp-sa.pl
http://milek.blogspot.com
Nov 14 '06 #94
Jerry Stuckle <js*******@attglobal.netwrote:
Robert Milkowski wrote:
Also when you're using your high-end array without ZFS basicaly
you get less reliability when you use the same array with ZFS.

Proof? Statistics?
Proof - I lost data on arrays, both small and large. ZFS has already helped
me a few times. That's all the proof I need. Now, if you want the basics of
why this is possible, please try again to read what all the people here have
been writing, this time with an open mind and some understanding; then go to
the OpenSolaris site for more details; and then, if you are still curious
and capable of reading code, read it. Or you can just try it.

However, I suspect you will just keep trolling and stay in this dream of
yours.

ZFS cannot adjust amp gains. It cannot change the bias. It cannot
tweak the slew rates. And a lot more. These are all very low level
operations available only to the disk controller. And they are much of
what makes a difference between a high-end drive and a throw-away drive
(platter coating being another major cost difference).
To be honest, I have no idea whether high-end arrays do these things.
But all I know is that if I use such an array and put ZFS on top of it, then
in addition to the protection you describe I get much more. In the end I get
better data protection. And that's exactly what people are already doing.

Now, please try to adjust your bias, as evidently you've got an undetected
(at least by yourself) malfunction :))) no offence.

>who has worked with high performance, critical systems knows there is a
*huge* difference between doing it in hardware and software.

Really? Actually depending on workload specifics and hardware specifics
I can see HW being faster than software, and the opposite.

In some cases clever combination of both gives best results.

Wrong! Data transfer in hardware is ALWAYS faster than in software.
Holy.....!#>!>!
Now you want to persuade me that even if my application works faster with
software RAID it's actually slower, just because you think so.
You really have a problem grasping the reality around you.

So how often do YOU get bad cables or connectors?

Also, you're telling me you can go into your system while it's running
and just unplug any cable you want and it will keep running? Gee,
you've accomplished something computer manufacturers have dreamed about
for decades!
Yep, in our HA solutions you can go and unplug any one external cable you
want and the system will keep going - it doesn't matter whether it's a
network cable, a power cable, an FC cable, ...

You know, maybe back when you were studying - and that was a long time ago,
I guess - people were dreaming about this, but it's been pretty much standard
in the enterprise for years, if not much longer. You really don't know
anything about HA (but that was obvious earlier).

>And data in the system and system software can be corrupted. Once the
data is in the RAID device, it cannot.

Really? Unfortunately for your claims it happens.
And you know, even your beloved IBM's array lost some data here.
The array even warned us about it :) It wasn't Shark, but also
not low-end in IBMs arrays. And it did more than once.

Not with good quality RAIDS. And obviously your claim of having IBM's
array is as full of manure as your earlier claims in this thread.
OK, enough.
I guess from time to time I need to play a little bit with trolls, but this
is enough.

As someone else pointed out - sometimes you really just can't help some
people.

EOT

--
Robert Milkowski
rm************@wp-sa.pl
http://milek.blogspot.com
Nov 14 '06 #95
Jerry Stuckle wrote:

....

software implementations are poor replacements for a truly
fault-tolerant system.
You really ought to stop saying things like that: it just makes you
look ignorant.

While special-purpose embedded systems may be implementable wholly in
firmware, general-purpose systems are not - and hence *cannot* be more
reliable than their system software (and hardware) is. Either that
software and the hardware it runs on are sufficiently reliable, or
they're not - and if they're not, then the most reliable storage system
in the world can't help the situation, because the data *processing*
cannot be relied upon.

So stop babbling about software reliability: if it's adequate to
process the data, it's adequate to perform the lower levels of data
management as well. In many cases that combination may be *more*
reliable than adding totally separate hardware and firmware designed by
a completely different organization: using the existing system hardware
that *already* must be reliable (rather than requiring that a *second*
piece of hardware be equally reliable too) by definition reduces total
exposure to hardware faults or flaws (and writing reliable system
software is in no way more bug-prone than writing equivalent but
separate RAID firmware).

And the high end RAID devices do not require
special software - they look like any other disk device attached to the
system.
Which is why they cannot include the kinds of end-to-end checks that a
software implementation inside the OS can: the standard interface
doesn't support it.
>
As for bundling write acceleration in NVRAM - again, meaningless because
good RAID devices aren't loaded as a "special system device".
Perhaps you misunderstood: I listed transparent write-acceleration by
using NVRAM in hardware RAID as a hardware RAID *advantage* (it just has
nothing to do with *reliability*, which has been the issue under
discussion here).
>
Prestoserve was one of the first lower-end RAID products made.
I don't believe that Prestoserve had much to do with RAID: it was
simply battery-backed (i.e., non-volatile) RAM that could be used as a
disk (or in front of disks) to make writes persistent at RAM speeds.

However,
there were a huge number of them before that. But you wouldn't find
them on a PC. They were primarily medium and large system devices.
I never heard of Prestoserve being available on a PC (though don't know
for certain that it wasn't): I encountered it on mid-range DEC and Sun
systems.
Prestoserve took some of the ideas and moved much of the hardware
handling into software. Unfortunately, when they did it, they lost the
ability to handle problems at a low-level (i.e. read head biasing,
etc.). It did make the arrays a lot cheaper, but at a price.
I think you're confused: Prestoserve had nothing to do with any kind of
'hardware handling', just with write acceleration via use of NVRAM.
>
And in the RAID devices, system address space was never a problem -
because the data was transferred to RAID cache immediately. This did
not come out of the system pool; the controllers have their own cache.
Which was what I said: back when amounts of system memory were limited
(by addressability if nothing else) this was a real asset that hardware
RAID could offer, whereas today it's much less significant (since a
64-bit system can now address and effectively use as much RAM as can be
connected to it).
>
I remember 64MB caches in the controllers way back in the mid 80's. It's
in the GB, now.
Indeed - a quick look at some current IBM arrays show support up to at
least 1 GB (and large EMC and HDS arrays offer even more). On the other
hand, system RAM in large IBM p-series systems can reach 2 TB these
days, so (as I noted) the amount that a RAID controller can add to that
is far less (relatively) significant than it once was.
>But time marches on. Most serious operating systems now support
(either natively or via extremely reputable decade-old,
thoroughly-tested third-party system software products from people
like Veritas) software RAID, and as much cache memory as you can
afford (no more address-space limitations there) - plus (with products
like ZFS) are at least starting to address synchronous small-update
throughput (though when synchronous small-update *latency* is critical
there's still no match for NVRAM).

Sure, you can get software RAID. But it's not as reliable as a good
hardware RAID.
That is simply incorrect: stop talking garbage.
>
> You don't take $89 100GB disk drives off the shelf,
>>tack them onto an EIDE controller and add some software to the system.


Actually, you can do almost *precisely* that, as long as the software
is handles the situation appropriately - and that's part of what ZFS
is offering (and what you so obviously completely fail to be able to
grasp).

In the cheap RAID devices, sure. But not in the good ones. You're
talking cheap. I'm talking quality.
No: I'm talking relatively inexpensive, but with *no* sacrifice in
quality. You just clearly don't understand how that can be done, but
that's your own limitation, not any actual limitation on the technology.
>
>No disk or firmware is completely foolproof. Not one. No matter how
expensive and well-designed. So the question isn't whether the disks
and firmware are unreliable, but just the degree and manner in which
they are.

I never said they were 100% foolproof. Rather, I said they are amongst
the most tested software made. Probably the only software tested more
thoroughly is the microcode on CPU's. And they are as reliable as
humanly possible.
So is the system software in several OSs - zOS and VMS, for example (I
suspect IBM's i-series as well, but I'm not as familiar with that).
Those systems can literally run for on the order of a decade without a
reboot, as long as the hardware doesn't die underneath them.

And people trust their data to the system software already, so (as I
already noted) there's no significant additional exposure if that
software handles some of the RAID duties as well (and *less* total
*hardware* exposure, since no additional hardware has been introduced
into the equation).
>
Of course, the same thing goes for ZFS and any file system. They're not
completely foolproof, either, are they?
No, but neither need they be any *less* foolproof: it just doesn't
matter *where* you execute the RAID operations, just *how well* they're
implemented.
>
>There is, to be sure, no way that you can make a pair of inexpensive
SATA drives just as reliable as a pair of Cheetahs, all other things
being equal. But it is *eminently* possible, using appropriate
software (or firmware), to make *three or four* inexpensive SATA
drives *more* reliable than a pair of Cheetahs that cost far more -
and to obtain better performance in many areas in the bargain.

And there is no way to make a pair of Cheetahs as reliable as drives
made strictly for high end RAID devices. Some of these drives still
sell for $30-60/GB (or more).
It is possible that I'm just not familiar with the 'high-end RAID
devices' that you're talking about - so let's focus on that.

EMC took a major chunk of the mainframe storage market away from IBM in
the '90s using commodity SCSI disks from (mostly) Seagate, not any kind
of 'special' drives (well, EMC got Seagate to make a few firmware
revisions, but nothing of major significance - I've always suspected
mostly to keep people from by-passing EMC and getting the drives at
standard retail prices). At that point, IBM was still making its own
proprietary drives at *far* higher prices per GB, and EMC cleaned up (by
emulating the traditional IBM drive technology using the commodity SCSI
Seagate drives and building in reliability through redundancy and
intelligent firmware to compensate for the lower per-drive reliability -
exactly the same kind of thing that I've been describing to let today's
SATA drives substitute effectively for the currently-popular higher-end
drives in enterprise use).

Since then, every major (and what *I* would consider 'high-end') array
manufacturer has followed that path: IBM and Hitachi use commodity
FC/SCSI drives in their high-end arrays too (and even offer SATA drives
as an option for less-demanding environments). These are the kinds of
arrays used in the highest-end systems that, e.g., IBM and HP run their
largest-system TPC-C benchmark submissions on: I won't assert that even
higher-end arrays using non-standard disks don't exist at all, but I
sure don't *see* them being used *anywhere*.

So exactly what drives are you talking about that cost far more than the
best that Seagate has to offer, and offer far more features in terms of
external control over internal drive operations (beyond the standard
'SCSI mode page' tweaks)? Where can we see descriptions of the
super-high-end arrays (costing "$100-500/GB" in your words) that use
such drives (and preferably descriptions of *how* they use them)?

Demonstrating that you have at least that much of a clue what you're
talking about would not only help convince people that you might
actually be worth listening to, but would also actually teach those of
us whose idea of 'high-end arrays' stops with HDS and Symmetrix
something we don't know. Otherwise, we'll just continue to assume that
you're at best talking about '80s proprietary technology that's
completely irrelevant today (if it ever existed as you describe it even
back then).

That would not, however, change the fact that equal reliability can be
achieved at significantly lower cost by using higher numbers of low-cost
(though reputable) drives with intelligent software to hook them
together into a reliable whole. In fact, the more expensive those
alleged non-standard super-high-end drives (and arrays) are, the easier
that is to do.

....
>And you don't attach them through Brand X SATA controllers, either:
ideally, you attach them directly (since you no longer need any
intermediate RAID hardware), using the same quality electronics you
have on the rest of your system board (so the SATA connection won't
constitute a weak link). And by virtue of being considerably simpler
hardware/firmware than a RAID implementation, that controller may well
be *more* reliable.

There is no way this is more reliable than a good RAID system. If you
had ever used one, you wouldn't even try to make that claim.
One could equally observe that if you had ever used a good operating
system, you wouldn't even try to make *your* claim. You clearly don't
understand software capabilities at all.

....
>Whether you're aware of it or not, modern SATA drives (and even
not-too-old ATA drives) do *all* the things that you just described in
your last one-and-a-half paragraphs.

And let's see those drives do things like dynamically adjust the
electronics - such as amp gain, bias, slew rate... They can't do it.
Your continuing blind spot is in not being able to understand that they
don't have to: whatever marginal improvement in per-drive reliability
such special-purpose advantages may achieve (again, assuming that they
achieve any at all: as I noted, commodity SATA drives *do* support most
of what you described, and I have no reason to believe that you have any
real clue what actual differences exist), those advantages can be
outweighed simply by using more lower-cost drives (such that two or even
three can fail for every high-cost drive failure without jeopardizing data).

....
>Modern disks (both FC/SCSI and ATA/SATA) do that themselves, without
waiting for instructions from a higher level. They report any failure
up so that the higher level (again, doesn't matter whether it's
firmware or software) can correct the data if a good copy can be found
elsewhere. If its internal retry succeeds, the disk doesn't report an
error, but does log it internally such that any interested
higher-level firmware or software can see whether such successful
retries are starting to become alarmingly frequent and act accordingly.

Yes, they report total failure on a read. But they can't go back and
try to reread the sector with different parms to the read amps, for
instance.
As I already said, I have no reason to believe that you know what you're
talking about there. Both FC/SCSI and ATA/SATA drives make *exhaustive*
attempts to read data before giving up: they make multiple passes,
jigger the heads first off to one side of the track and then off to the
other to try to improve access, and God knows what else - to the point
where they can keep working for on the order of a minute trying to read
a bad sector before finally giving up (and I suspect that part of what
keeps them working that long includes at least some of the kinds of
electrical tweaks that you describe).

And a good RAID controller will make decisions based in part
on what parameters it takes to read the data.
>>>
Also, with two or more controllers, the controllers talk to each
other directly, generally over a dedicated bus. They keep each other
informed of their status and constantly run diagnostics on themselves
and each other when the system is idle.


Which is only necessary because they're doing things like capturing
updates in NVRAM (updates that must survive controller failure and
thus need to be mirrored in NVRAM at the other controller): if you
eliminate that level of function, you lose any need for that level of
complexity (not to mention eliminating a complete layer of complex
hardware with its own potential to fail).

This has nothing to do with updates in NVRAM.
Yes, it does. In fact, that's about the *only* reason they really
*need* to talk with each other (and they don't even need to do that
unless they're configured as a fail-over pair, which itself is not an
actual 'need' when data is mirrored such that a single controller can
suffice).

This has everything to do
with processing the data, constant self-checks, etc. This is critical
in high-reliability systems.
No, it's not. You clearly just don't understand the various ways in
which high reliability can be achieved.

....
> These tests include reading and writing
>>test cylinders on the disks to verify proper operation.


The background disk scrubbing which both hardware and software RAID
approaches should be doing covers that (and if there's really *no*
writing going on in the system for long periods of time, the software
can exercise that as well once in a while).

No, it doesn't. For instance, these tests include things like writing
with a lower-level signal than normal and trying to read it back. It
helps catch potential problems in the heads and electronics. The same
is true for writing with stronger than normal currents - and trying to
read them back. Also checking adjacent tracks for "bit bleed". And a
lot of other things.

These are things again no software implementation can do.
Anything firmware can do, software can do - but anything beyond standard
'mode page' controls would require use of the same special-purpose disk
interface that you allege the RAID firmware uses.

Again, though, there are more ways to skin the reliability cat than
continually torturing the disk through such a special interface - the
opposite extreme being just to do none of those special checks, let the
disk die (in whole or in part) if it decides to, and use sufficient
redundancy that that doesn't matter.
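Since much of this exchange turns on what background scrubbing actually does, here is a minimal sketch of the idea, assuming a two-way mirror kept in two ordinary files and a table of per-block checksums. The block size, function name and repair policy are invented for illustration, and SHA-256 simply stands in for whatever checksum a real system would use; this is not any particular product's behaviour.

```python
import hashlib

BLOCK = 4096  # illustrative block size, not any vendor's geometry

def scrub_mirror(copy_a: str, copy_b: str, checksums: list[bytes]) -> int:
    """Walk both mirror copies, verify each block against its recorded
    checksum, and repair a bad copy from the good one. Returns the
    number of blocks repaired."""
    repaired = 0
    with open(copy_a, "r+b") as fa, open(copy_b, "r+b") as fb:
        for i, want in enumerate(checksums):
            fa.seek(i * BLOCK); a = fa.read(BLOCK)
            fb.seek(i * BLOCK); b = fb.read(BLOCK)
            ok_a = hashlib.sha256(a).digest() == want
            ok_b = hashlib.sha256(b).digest() == want
            if ok_a and not ok_b:            # copy B rotted silently
                fb.seek(i * BLOCK); fb.write(a); repaired += 1
            elif ok_b and not ok_a:          # copy A rotted silently
                fa.seek(i * BLOCK); fa.write(b); repaired += 1
            elif not (ok_a or ok_b):
                raise IOError(f"block {i}: both copies bad, data lost")
    return repaired
```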
>
>>>
Additionally, in the more expensive RAID devices, checksums are
typically at least 32 bits long (your off-the-shelf drive typically
uses a 16 bit checksum), and the checksum is built in hardware - much
more expensive, but much faster than doing it in firmware. Checksum
comparisons are done in hardware, also.


Your hand-waving just got a bit fast to follow there.

1. Disks certainly use internal per-sector error-correction codes
when transferring data to and from their platters. They are hundreds
(perhaps by now *many* hundreds) of bits long.

Actually, not.
Actually, yes: you really should lose your habit of making assertions
about things that you don't know a damn thing about.

Sectors are still 512 bytes. And the checksums (or ECC,
if they use them) are still only 16 or 32 bits.
No, they are not.

And even if they use
ECC, 32 bits can only correct up to 3 bad bits out of the 512
bytes.
Which is why they use hundreds of bits.
None use "many hundreds of bits".
Yes, they do.

It would waste too much disk
space.
No, it doesn't - though it and other overhead do use enough space that
the industry is slowly moving toward adopting 4 KB disk sectors to
reduce the relative impact.

Seagate's largest SATA drive generates a maximum internal bit rate of
1030 Mb/sec but a maximum net data transfer rate of only 78 MB/sec,
suggesting that less than 2/3 of each track is occupied by user data -
the rest being split between inter-record gaps and overhead (in part
ECC). IBM's largest SATA drive states that it uses a 52-byte (416-bit)
per-sector ECC internally (i.e., about 10% of the data payload size); it
claims to be able to recover 5 random burst errors and a single 330-bit
continuous burst error.

Possibly you were confused by the use of 4-byte ECC values in the 'read
long' and 'write long' commands: those values are emulated from the
information in the longer physical ECC.
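The figures cited just above can be sanity-checked with a few lines of arithmetic. The constants below simply restate the numbers quoted in the post (a 52-byte per-sector ECC, a 1030 Mb/s internal rate, a 78 MB/s net rate); they are not independently verified datasheet values.

```python
# Rough back-of-envelope checks on the figures cited above.

SECTOR_PAYLOAD_BITS = 512 * 8        # 4096 data bits per sector
ECC_BITS = 416                       # 52-byte per-sector ECC, as cited

ecc_overhead = ECC_BITS / SECTOR_PAYLOAD_BITS
print(f"ECC overhead per sector: {ecc_overhead:.1%}")    # ~10%

INTERNAL_RATE_MBIT = 1030            # max internal (media) bit rate, Mb/s
NET_RATE_MBYTE = 78                  # max sustained user-data rate, MB/s

internal_mbyte = INTERNAL_RATE_MBIT / 8
user_fraction = NET_RATE_MBYTE / internal_mbyte
print(f"Fraction of raw track rate delivered as user data: {user_fraction:.0%}")
# ~61% -- the rest is ECC, gaps, headers, and other per-sector overhead.
```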
>
>>
2. Disks use cyclic redundancy checks on the data that they accept
from and distribute to the outside world (old IDE disks did not, but
ATA disks do and SATA disks do as well - IIRC the width is 32 bits).

See above. And even the original IDE drives used a 16-bit checksum.
I'm not sure what you mean by 'see above', unless you're confused about the
difference between the (long) ECC used to correct data coming off the
platter and the (32-bit - I just checked the SATA 2.5 spec) CRC used to
guard data sent between the disk and the host.
>
>3. I'd certainly expect any RAID hardware to use those CRCs to
communicate with both disks and host systems: that hardly qualifies
as anything unusual. If you were talking about some *other* kind of
checksum, it would have to have been internal to the RAID, since the
disks wouldn't know anything about it (a host using special driver
software potentially could, but it would add nothing of obvious value
to the CRC mechanisms that the host already uses to communicate
directly with disks, so I'd just expect the RAID box to emulate a disk
for such communication).

CRC's are not transferred to the host system, either in RAID or non-RAID
drives.
Yes, they are: read the SATA spec (see the section about 'frames').

Yes, some drives have that capability for diagnostic purposes.
But as a standard practice, transferring 512 bytes is 512 bytes of data
- no more, no less.
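To make the distinction being argued here concrete: a link-level CRC detects a damaged transfer so it can be retried, but it corrects nothing - correction is the job of the drive's internal ECC. A rough sketch follows, using the common CRC-32 available as zlib.crc32; the send/receive framing is invented for the example and is not the actual SATA frame format.

```python
import zlib

def send(payload: bytes) -> tuple[bytes, int]:
    """Sender side: compute a CRC-32 over the payload and transmit both."""
    return payload, zlib.crc32(payload)

def receive(payload: bytes, crc: int) -> bytes:
    """Receiver side: recompute and compare. A CRC can detect a damaged
    transfer (and trigger a retry); it cannot say which bits were wrong,
    let alone correct them."""
    if zlib.crc32(payload) != crc:
        raise IOError("CRC mismatch: transfer corrupted, retry the frame")
    return payload

sector = bytes(512)                              # one 512-byte sector of zeros
data, crc = send(sector)
receive(data, crc)                               # clean transfer passes

corrupted = bytes([data[0] ^ 0x01]) + data[1:]   # flip one bit "in transit"
try:
    receive(corrupted, crc)
except IOError as e:
    print("detected:", e)
```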
>4. Thus data going from system memory to disk platter and back goes
(in each direction) through several interfaces and physical connectors
and multiple per-hop checks, and the probability of some undetected
failure, while very small for any given interface, connector, or hop,
is not quite as small for the sum of all of them (as well as there
being some errors, such as misdirected or lost writes, that none of
those checks can catch). What ZFS provides (that by definition
hardware RAID cannot, since it must emulate a standard block-level
interface to the host) is an end-to-end checksum that verifies data
from the time it is created in main memory to the time it has been
fetched back into main memory from disk. IBM, NetApp, and EMC use
somewhat analogous supplementary checksums to protect data: in the
i-series case I believe that they are created and checked in main
memory at the driver level and are thus comparably strong, while in
NetApp's and EMC's cases they are created and checked in the main
memory of the file server or hardware box but then must get to and
from client main memory across additional interfaces, connectors, and
hops which have their own individual checks and are thus not
comparably end-to-end in nature - though if the NetApp data is
accessed through a file-level protocol that includes an end-to-end
checksum that is created and checked in client and server main memory
rather than, e.g., in some NIC hardware accelerator it could be
*almost* comparable in strength.

Yes, ZFS can correct for errors like bad connectors and cables. And I
guess you need it if you use cheap connectors or cables.
You need it, period: the only question is how often.

But even if
they do fail - it's not going to be a one-time occurrence. Chances are
your system will crash within a few hundred ms.

I don't know about NetApp, but IBM doesn't work this way at all. The
channel itself is parity-checked by hardware on both ends. Any parity
error brings the system to an immediate halt.
Exactly what part of the fact that the end-to-end ZFS mechanism is meant
to catch errors that are *not* caught elsewhere is still managing to
escape you? And that IBM uses similar mechanisms itself in its i-series
systems (as do other major vendors like EMC and NetApp) for the same reason?
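For readers trying to follow the argument, here is a minimal sketch of the end-to-end scheme being described, assuming checksums held in separately stored metadata and a two-way mirror. The class, layout and names are made up for the example and are not ZFS code; SHA-256 stands in for whatever checksum the real system uses.

```python
import hashlib

class MirroredStore:
    """Toy mirrored block store with checksums held apart from the data,
    so corruption introduced anywhere below the application is caught
    when the block is read back."""

    def __init__(self, nblocks: int, block_size: int = 4096):
        self.block_size = block_size
        self.copies = [bytearray(nblocks * block_size) for _ in range(2)]
        self.checksums = {}                      # block number -> digest

    def write(self, blk: int, data: bytes) -> None:
        data = data.ljust(self.block_size, b"\0")
        self.checksums[blk] = hashlib.sha256(data).digest()  # computed in "host memory"
        off = blk * self.block_size
        for copy in self.copies:                 # mirror to both devices
            copy[off:off + self.block_size] = data

    def read(self, blk: int) -> bytes:
        off = blk * self.block_size
        want = self.checksums[blk]
        for copy in self.copies:
            data = bytes(copy[off:off + self.block_size])
            if hashlib.sha256(data).digest() == want:
                return data                      # verified end to end
        raise IOError(f"block {blk}: no copy matches its checksum")

store = MirroredStore(nblocks=8)
store.write(3, b"payload")
store.copies[0][3 * 4096] ^= 0xFF                # silent corruption on one mirror
assert store.read(3).rstrip(b"\0") == b"payload" # bad copy rejected, good one returned
```

The point of the layout is that a checksum computed and verified in host memory catches corruption introduced anywhere below it - controller, cable, connector or platter - which no single per-hop check can do.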
>
>>>
Plus, with verified writes, the firmware has to go back and reread
the data the next time the sector comes around and compare it with
the contents of the buffer. Again, this is often done in hardware on
the high end RAID systems.


And can just as well be done in system software (indeed, this is often
a software option in high-end systems).

Sure, it *can* be done with software, at a price.
A much lower price than is required to write the same code as firmware
and then enshrine it in additional physical hardware.
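A hedged sketch of the read-back verification being discussed, done entirely in host software. The file path is invented, and as the comment notes the read-back may be served from the OS cache unless something like O_DIRECT is used, so this shows the shape of the check rather than a platter-level guarantee.

```python
import os

def verified_write(path: str, offset: int, data: bytes) -> None:
    """Write a buffer, flush it, then read it back and compare -- the
    software analogue of a controller's write-verify pass. (Without
    O_DIRECT the read-back may come from the OS cache, so this is only
    an illustration of the check, not a platter-level guarantee.)"""
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(data)
        f.flush()
        os.fsync(f.fileno())            # push the write toward the device
        f.seek(offset)
        if f.read(len(data)) != data:
            raise IOError(f"verify failed at offset {offset}: "
                          "re-write or relocate the block")

with open("scratch.bin", "wb") as f:
    f.write(bytes(4096))                # pre-allocate a scratch block
verified_write("scratch.bin", 0, b"critical record\n")
```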
>
>>>
And, most of these RAID devices use custom chip sets - not something
off the shelf.


That in itself is a red flag: they are far more complex and also get
far less thoroughly exercised out in the field than more standard
components - regardless of how diligently they're tested.
Gotten a cell phone lately? Chances are the chips in your phone are
custom-made. Each manufacturer creates its own. Or an X-BOX, Nintendo,
PlayStation, etc.? Most of those have custom chips. And the same is
true for microwaves, TV sets and more.

The big difference is that Nokia can make 10M custom chips for its
phones; for a high-end RAID device, 100K is a big run.
Exactly the point I was making: they get far less exercise out in the
field to flush out the last remaining bugs. I'd be more confident in a
carefully-crafted new software RAID implementation than in an equally
carefully-crafted new hardware-plus-firmware implementation, because the
former has considerably less 'new' in it (not to mention being easier to
trouble-shoot and fix in place if something *does* go wrong).

....
You seem to think software is the way to go. Just tell me one thing.
When was the last time you had to have your computer fixed because of a
hardware problem? And how many times have you had to reboot due to a
software problem?
Are you seriously suggesting that Intel and Microsoft have comparable
implementation discipline? Not to mention the relative complexity of an
operating system plus full third-party driver and application spectrum
vs. the far more standardized relationships that typical PC hardware
pieces enjoy.

We're talking about reliability *performing the same function*, not
something more like comparing the reliability of an automobile engine
with that of the vehicle as a whole.
>
And you say software is as reliable?
For a given degree of complexity, and an equally-carefully-crafted
implementation, software that can leverage the reliability of existing
hardware that already has to be depended upon for other processing is
inherently more reliable - because code is code whether in software or
firmware, but the software-only approach has less hardware to go wrong.

....
>I seriously doubt that anyone who's been talking with you (or at least
trying to) about hardware RAID solutions has been talking about any
that you'd find at CompUSA. EMC's Symmetrix, for example, was the
gold standard of enterprise-level hardware RAID for most of the '90s -
only relatively recently did IBM claw back substantial market share in
that area (along with HDS).

Actually, Symmetrix grew big in the small and medium systems, but IBM
never lost the lead in the top end RAID solutions. But they also were
(and still are) quite a bit more expensive than EMC's.
You really need to point to specific examples of the kinds of 'higher
end' RAIDs that you keep talking about (something we can look at and
evaluate on line, rather than asking us to take your unsupported word
for it). *Then* we'll actually have concrete competing approaches to
discuss.

....

the software cannot
>>
>>detect when a signal is getting marginal (it's either "good" or
"bad"), adjust the r/w head parameters, and similar things.


And neither can hardware RAID: those things happen strictly
internally at the disk (for that matter, by definition *anything* that
the disk externalizes can be handled by software as well as by RAID
hardware).

And here you show you know nothing about what you're talking about. RAID
drives are specially built to work with their controllers. And RAID
controllers are made to be able to do these things. This is very low-level
stuff - not things which are available outside the drive/controller.

Effectively, the RAID controller and the disk controller become one
unit. Separate, but one.
Provide examples we can look at if you want anyone to believe you.
>
> Yes, it can
>>checksum the data coming back and read from the mirror drive if
necessary.


Yup.

Now, that *used* to be at least something of a performance issue -
being able to offload that into firmware was measurably useful. But
today's processor and memory bandwidth makes it eminently feasible -
even in cases where it's not effectively free (if you have to move the
data, or have to compress/decompress or encrypt/decrypt it, you can
generate the checksum as it's passing through and pay virtually no
additional cost at all).

Sorry, Bill, this statement is really off the wall.
Not at all: in fact, you just had someone tell you about very
specifically comparing ZFS performance with more conventional approaches
and finding it *better*.
>
Then why do all the high end disk controllers use DMA to transfer data?
Because a) it's only been in the past few years that CPU and memory
bandwidth has become 'too cheap to meter' (so controllers are still
using the same approaches they used to), and b) there's no point in
*wasting* CPU bandwidth for no reason (DMA isn't magic, though: it
doesn't save any *memory* bandwidth).

When you're already moving the data, computing the checksum is free. If
you're not, it's still cheap enough to be worth the cost for the benefit
it confers (and there's often some way to achieve at least a bit of
synergy - e.g., then deciding to move the data after all because it
makes things easier and you've already got it in the processor cache to
checksum it).
Because it's faster and takes fewer CPU cycles than doing it in software,
that's why. And computing checksums for 512 bytes takes significantly
longer than actually transferring the data to/from memory via
software.
It doesn't take *any* longer: even with pipelined and prefetched
caching, today's processors can compute checksums faster than the data
can be moved.
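A small illustration of the "checksum is nearly free when you're already touching the data" claim: time a plain in-memory copy against a CRC-32 pass over the same buffer. The buffer size is arbitrary and the result obviously depends on the machine; this illustrates the argument rather than proving it for any specific hardware or controller.

```python
import time
import zlib

buf = bytes(64 * 1024 * 1024)            # 64 MiB of data to "move"

t0 = time.perf_counter()
copy = bytearray(buf)                    # the data movement we had to do anyway
t1 = time.perf_counter()
crc = zlib.crc32(buf)                    # the extra integrity check
t2 = time.perf_counter()

print(f"memory copy : {t1 - t0:.4f} s")
print(f"CRC-32 pass : {t2 - t1:.4f} s")
print(f"crc32 = {crc:#010x}")
# Results vary by machine, but on typical current hardware both passes take
# milliseconds, so adding a checksum to data you are already moving costs
# very little relative to the movement itself.
```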
>
Also, instead of allocating 512-byte buffers, the OS would have to
allocate 514- or 516-byte buffers. This removes a lot of the optimization
possible when the system is using buffers during operations.
1. If you were talking about something like the IBM i-series approach,
that would be an example of the kind of synergy that I just mentioned:
while doing the checksum, you could also move the data to consolidate it
at minimal additional cost.

2. But the ZFS approach keeps the checksums separate from the data, and
its sectors are packed normally (just payload).
>
Additionally, different disk drives internally use different checksums.

Plus there is no way to tell the disk what to write for a checksum. This
is hard-coded into the disk controller.
You're very confused: ZFS's checksums have nothing whatsoever to do
with disk checksums.

- bill
Nov 14 '06 #96

On Mon, 13 Nov 2006 21:48:07 -0500, Jerry Stuckle
<js*******@attglobal.netwrote:
>Bill Todd wrote:
>Dear me - I just bade you a fond farewell, and here you've at last come
up with something at least vaguely technical (still somewhat mistaken,
but at least technical). So I'll respond to it in kind:

Jerry Stuckle wrote:
>>>
>And there is no way to make a pair of Cheetahs as reliable as drives
made strictly for high end RAID devices. Some of these drives still
sell for $30-60/GB (or more).
They often are Cheetahs or similar, usually with custom disk
controllers on them instead of commodity controllers. Remember
the original meaning of "RAID"?
The high price has to cover the cost of the custom disk
controllers, QA, packaging, fast transport, part tracking,
failure statistics and technical support.
That explains the difference between the cost of the bare
"inexpensive disk" and TCO.
>No, I didn't say ANY drive was "crap". They're good drives, when used
for what they are designed. But drives made for RAID arrays are in a
class by themselves. And they can do things that standard drives can't
(like dynamically adjust amplifiers and slew rates when reading and
writing data).
Depends on the drive controller, not the drive.
At current densities even commodity drives will need
adaptive amplification, etc.
>CRC's are not transferred to the host system, either in RAID or non-RAID
drives. Yes, some drives have that capability for diagnostic purposes.
But as a standard practice, transferring 512 bytes is 512 bytes of
data - no more, no less.
CKD is still used in mainframes. Not exactly a checksum, nor at
sector level (rather at allocation unit level), but redundant
info anyway, to enhance reliability.
>I don't know about NetApp, but IBM doesn't work this way at all. The
channel itself is parity-checked by hardware on both ends. Any parity
error brings the system to an immediate halt.
Of course not. The I/O is invalidated, discarded and retried
over another path. You can pull a channel plug anytime, without
interrupting anything at the application level. Just a warning
on the console.
>>And, most of these RAID devices use custom chip sets - not something
off the shelf.
Or generic DSPs, FPGAs and RISC processors programmed for this
specific application.
>Actually, Symmetrix grew big in the small and medium systems, but IBM
never lost the lead in the top end RAID solutions. But they also were
(and still are) quite a bit more expensive than EMC's.
I guess you missed a serious price-fighting round in the past
decade. IBM offered ESS at dumping prices for quite a while
to gain market share.
>>detect when a signal is getting marginal (it's either "good" or "bad"),
adjust the r/w head parameters, and similar things.
There is no reason to not use those capabilities in a ZFS
environment.
>Then why do all the high end disk controllers use DMA to transfer data?
Apples and pears. We're not talking high-end PCs here.
In high-end systems the disk controller doesn't have access to
system memory at all. It is just part of the storage system.
The storage system is connected to the computer system by some
sort of channel, which connects to a channel adapter of some
kind, which may have DMA access. In mainframes there's still an
IOP in between, and there the IOPs use DMA.

Just my EUR 0,02
--
( Kees
)
c[_] A problem shared is a problem halved, so
is your problem really yours or just half
of someone else's? (#348)
Nov 14 '06 #97
Robert Milkowski wrote:
Jerry Stuckle <js*******@attglobal.netwrote:
>>Robert Milkowski wrote:
>>>It's here that you don't understand. ZFS protects me from a bad driver, a bad FC switch,
or a bad FC/SCSI/ESCON/... adapter corrupting data. As one SCSI adapter unfortunately
did a few weeks ago. So it's not only the array.
And while I admit that I haven't (yet) seen ZFS detect data corruption
on Symmetrix boxes, I have on other arrays. It could be because I put
ZFS on Symmetrix boxes not that long ago, and compared to other arrays
there's not that much storage under ZFS on Symmetrix here. So statistically
it may just be that I'm luckier. And of course I expect Symmetrix to be
more reliable than a JBOD or a mid-range array.

Immaterial.

A bad driver won't let the system run - at least not for long. Same
with a bad FC switch, etc. And how long did your system run with a bad
SCSI adapter?


Actually it was writing data with some errors for hours, then the system panicked.
Then again it was writing data for another 7-9 hours.

Now of course the system won't reboot just because of a bad switch - it's not the first
time I've had a problem with an FC switch (a long time ago, granted). With at least dual links
from each host to different switches (different fabrics) it's not a big problem.
OK, that's possible, I guess. But a high end RAID device is running
diagnostics on the adapters when it's idle. And if it has two or more
controllers (which most of them do), they are also checking each other.
So your problem would have been detected in milliseconds.
>
>>And yes, as I've stated before - like anything else, you get what you
paid for. Get a good quality RAID and you won't get data corruption issues.


Is an IBM high-end array a good-quality one, for example?
That's one of them.
>
>>And what does any of this have to do with the discussion at hand - which
is data reliability? You seem to have a penchant for changing the
subject when you can't refute the facts.


Reliability comes from end-to-end data integrity, which HW RAID itself can't
provide so your data are less protected.
It can provide integrity right to the connector.
Your RAID doesn't protect you from anything between itself and your host.
Vendors have recognized this for years - that's why IBM, EMC, Oracle, Sun, Hitachi, etc.
all provide some hacks for specific applications like Oracle. Of course
none of those solutions is nearly as complete as ZFS.
I know, you know better than all those vendors. You know better than people
who actually lost their data both on cheap and on 500k+ (you seem to like this number)
arrays. It's just that for some reason you can't accept simple facts.
That's not its job. Its job is to deliver accurate data to the bus.
If you want further integrity checking, it's quite easy to do in
hardware as well - e.g. parity checks, ECC, etc. on the bus. That's why
IBM mainframes have parity checking on their channels.
Reliability comes from keeping metadata blocks on different LUNs plus the configured
protection. While it's not strictly a RAID issue, the point is that ZFS has the file system
integrated with the volume manager, so it can do it. So even if you do just
striping on ZFS over RAID LUNs and you overwrite an entire LUN, your file system will
still be consistent - only the data on it (not the metadata) is lost.
Gee, why not state the obvious? RAIDs do that quite well.
Reliability comes from never overwriting actual data on a medium, so you don't
have to deal with incomplete writes, etc.
That's where you're wrong. You're ALWAYS overwriting data on a medium.
Otherwise your disk would quickly fill. High end RAIDs ensure they
have sufficient backup power such that even in the event of a complete
power failure they can complete the current write, for instance. And
they detect the power failure.

Some are even designed to have sufficient backup power to flush anything
in the buffers to disk before they power down.
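To illustrate how "never overwriting actual data" coexists with reusing disk space: a copy-on-write update always goes to a free block and only then switches the pointer, so the previous version stays intact until the new one is referenced and a torn or incomplete write can never corrupt the current copy. A toy sketch of that update rule follows, with an invented free list and block map that are not ZFS's actual structures.

```python
class CowStore:
    """Toy copy-on-write block store: an update always lands in a free
    block; the old block is only released after the new one is referenced,
    so a torn or lost write can never corrupt the current version."""

    def __init__(self, nblocks: int):
        self.blocks = [b""] * nblocks
        self.free = list(range(nblocks))      # space *is* reused over time
        self.map = {}                         # logical block -> physical block

    def write(self, logical: int, data: bytes) -> None:
        new_phys = self.free.pop()            # 1. pick a free physical block
        self.blocks[new_phys] = data          # 2. write the new version there
        old_phys = self.map.get(logical)      # 3. only now switch the pointer
        self.map[logical] = new_phys
        if old_phys is not None:
            self.free.append(old_phys)        # 4. old copy becomes free space

    def read(self, logical: int) -> bytes:
        return self.blocks[self.map[logical]]

store = CowStore(8)
store.write(0, b"version 1")
store.write(0, b"version 2")              # a crash between steps 2 and 3 would
assert store.read(0) == b"version 2"      # still have left "version 1" intact
```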
Reliability comes from integrating the file system with the volume manager, so regardless
of the RAID type you use, you are always consistent on disk (both data and metadata).
That's something traditional file systems, even log-structured ones, can't guarantee. And that's
why even with journaling you sometimes end up with fsck after all.
Not at all. Volume manager has nothing to do with ensuring data
integrity on a RAID system.
Reliability comes from knowing exactly where on each disk your data is, so if your
RAID is not full of data, ZFS will resilver a disk in an emergency MUCH
faster by resilvering only the actual data and not every block on the disk. Also, since it
understands the data on disk, it starts the resilver from /, so even if the resilver isn't
completed yet you get some protection from the beginning. On a classic array you can't
do that, as it can't understand the data on its disks.
Reliability comes from not caring where your data is physically on the
disk. Even the lower end disk drives have spare cylinders they can
allocate transparently in the case a sector or track goes bad. The
sector address the file system provides to the disk may or may not be
the physical sector where the data is written. It may have been mapped
to an entirely different area of the disk.
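A sketch of the resilvering difference being described above: a whole-disk rebuild copies every block, while a filesystem-aware resilver copies only the blocks its allocation map says are live. The allocation map, block count and 10% fill level below are invented for the example; real implementations obviously differ.

```python
def rebuild_whole_disk(source: list[bytes], target: list[bytes]) -> int:
    """Classic array rebuild: copy every block, used or not."""
    for i, block in enumerate(source):
        target[i] = block
    return len(source)

def resilver_allocated(source: list[bytes], target: list[bytes],
                       allocated: set[int]) -> int:
    """Filesystem-aware resilver: copy only blocks the block map says are live."""
    for i in sorted(allocated):               # walk the allocation map in order
        target[i] = source[i]
    return len(allocated)

NBLOCKS = 1_000_000
source = [b"x"] * NBLOCKS
allocated = set(range(0, NBLOCKS, 10))        # pool only 10% full in this example

copied_full = rebuild_whole_disk(source, [b""] * NBLOCKS)
copied_live = resilver_allocated(source, [b""] * NBLOCKS, allocated)
print(copied_full, "blocks copied vs", copied_live, "blocks copied")
```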
>
Now there are other things in ZFS which greatly increase reliability, manageability, and
more. But as you can't agree with basic and simple facts, it doesn't really
make sense to go any further. It probably doesn't even make sense to talk to you anymore.

I never said there were not some good things about ZFS. And it makes
sense as a cheap replacement for an expensive RAID.

But the facts are - there are things a RAID can do that ZFS cannot do.
And there are things that ZFS can do which RAID does not do, because it
is beyond the job of a disk subsystem.

And I agree. You've shown such ignorance of the technology involved in
RAID systems that it doesn't make any sense to talk to you, either.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================
Nov 14 '06 #98
Robert Milkowski wrote:
Jerry Stuckle <js*******@attglobal.netwrote:
>>Robert Milkowski wrote:
>>>Also when you're using your high-end array without ZFS, you basically
get less reliability than when you use the same array with ZFS.

Proof? Statistics?


Proof - I lost data on arrays, both small and large. ZFS has already helped me
a few times. That's all the proof I need. Now if you need some basics on why this is
possible, please try again to read what all the people here were writing, this
time with an open mind and understanding, then go to the OpenSolaris site for
more details, then if you are still more curious and capable of reading code,
do that. Or you can just try it.

However I suspect you will just keep trolling and stay in a dream of yours.
Yes, you will when you use cheap arrays. The ones I worked with never
lost data, even though disks crashed, controllers went bad and other
things happened. A good RAID array will detect the problems and recover
from them. At the same time it will notify the system of the problem so
corrective action can be taken.
>
>>ZFS cannot adjust amp gains. It cannot change the bias. It cannot
tweak the slew rates. And a lot more. These are all very low level
operations available only to the disk controller. And they are much of
what makes a difference between a high-end drive and a throw-away drive
(platter coating being another major cost difference).


To be honest I have no idea if high-end arrays do these things.
But all I know is that if I use such an array and put ZFS on top of it,
then in addition to the protection you describe I get much more. In the end
I get better data protection. And that's exactly what people are already
doing.

Now, please try to adjust your bias, as evidently you've got an undetected
(at least by yourself) malfunction :))) no offence.
Of course they do. That's part of what makes them high end arrays.
This type of circuitry is much more reliable - and much more expensive
to implement. It's part of why they cost so much.
>
>>>>who has worked with high performance, critical systems knows there is a
*huge* difference between doing it in hardware and software.
Really? Actually depending on workload specifics and hardware specifics
I can see HW being faster than software, and the opposite.

In some cases clever combination of both gives best results.

Wrong! Data transfer in hardware is ALWAYS faster than in software.


Holy.....!#>!>!
Now you want to persuade me that even if my application works faster with
software RAID it's actually slower, just because you think so.
You really have a problem grasping the reality around you.
And with a good hardware RAID device, it would work even faster. But
you've never tried one, so you have no comparison, do you?
>
>>So how often do YOU get bad cables or connectors?

Also, you're telling me you can go into your system while it's running
and just unplug any cable you want and it will keep running? Gee,
you've accomplished something computer manufacturers have dreamed about
for decades!


Yep, in our HA solutions you can go and unplug any external cable you want and
the system will keep going - it doesn't matter if it's a network cable, power cable,
FC cable, ...

You know, maybe back when you were studying - and that was a long time ago, I guess -
people were dreaming about this, but it's been pretty much standard in the enterprise
for years, if not much longer. You really don't know anything about HA (but that was obvious
earlier).
Sure it's been a standard - since the 80's. And I dare say I know more
about them than you do. You've already proven that fact with your
statements.
>
>>>>And data in the system and system software can be corrupted. Once the
data is in the RAID device, it cannot.
Really? Unfortunately for your claims it happens.
And you know, even your beloved IBM array lost some data here.
The array even warned us about it :) It wasn't a Shark, but it also
wasn't low-end among IBM's arrays. And it did it more than once.

Not with good quality RAIDs. And obviously your claim of having IBM's
array is as full of manure as your earlier claims in this thread.


Ok, enough.
I guess from time to time I need to play a little bit with trolls, but this is
enough.

As someone else pointed - sometimes you really just can't help some people.

EOT
Yea, you really can't help some people. You are so caught up in how
great ZFS is that you can't see reality, even when the facts are
presented to you.

But that's OK. You (and your customers or employer) are the ones who
will suffer. And there will be people like me, without your
preconceived notions, who will come along to pick up the pieces.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================
Nov 14 '06 #99

Jerry Stuckle wrote:
...you get what you
paid for. Get a good quality RAID and you won't get data corruption issues.
ITYM "...and you're somewhat less likely to get data corruption..."

--T

Nov 14 '06 #100
