MyISAM engine: worst case scenario in case of crash (MySQL, OS, hardware, whatever)

alf
Hi,

is it possible, due to an OS crash, a MySQL crash, or e.g. a SCSI failure, to
lose all the data stored in a table (let's say a million 1KB rows)? In other
words, what is the worst case scenario for the MyISAM backend?
Also, is it possible not to lose the data but to get it corrupted?
Thx, Andy
Nov 9 '06
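The standard tools for checking and repairing a MyISAM table after a crash are myisamchk and CHECK TABLE / REPAIR TABLE. Below is a minimal sketch of driving myisamchk from a script, assuming the MySQL server has been stopped first; the datadir path and table name are purely illustrative.

import subprocess
from pathlib import Path

DATADIR = Path("/var/lib/mysql/mydb")  # illustrative path - point at your actual datadir

def check_and_repair(table: str) -> None:
    """Check a MyISAM table's index file and attempt a recovery if the check fails."""
    myi = DATADIR / f"{table}.MYI"
    # Fast check first; myisamchk returns non-zero when corruption is detected.
    check = subprocess.run(["myisamchk", "--check", str(myi)])
    if check.returncode != 0:
        # --recover rebuilds the index and salvages rows from the data file;
        # take a copy of the .MYD/.MYI files before running it.
        subprocess.run(["myisamchk", "--recover", str(myi)], check=True)

if __name__ == "__main__":
    check_and_repair("mytable")  # hypothetical table name

In practice, corruption of the index or data file (which these tools can often repair) is the more typical post-crash outcome; losing an entire table generally takes a storage-level failure of the kind debated at length below.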
Bill Todd wrote:
Jerry Stuckle wrote:

...

software implementations are poor replacements for a truly
>fault-tolerant system.


You really ought to stop saying things like that: it just makes you
look ignorant.
No, troll. You should learn to face facts. Hardware can do things that
software can't.
While special-purpose embedded systems may be implementable wholly in
firmware, general-purpose systems are not - and hence *cannot* be more
reliable than their system software (and hardware) is. Either that
software and the hardware it runs on are sufficiently reliable, or
they're not - and if they're not, then the most reliable storage system
in the world can't help the situation, because the data *processing*
cannot be relied upon.
Hardware systems can be more reliable than the sum of their parts.
Duplication, backup and diagnostics are all a part of the equation which
makes hardware RAID devices more reliable than a single disk drive.
This has been proven time and time again.
So stop babbling about software reliability: if it's adequate to
process the data, it's adequate to perform the lower levels of data
management as well. In many cases that combination may be *more*
reliable than adding totally separate hardware and firmware designed by
a completely different organization: using the existing system hardware
that *already* must be reliable (rather than requiring that a *second*
piece of hardware be equally reliable too) by definition reduces total
exposure to hardware faults or flaws (and writing reliable system
software is in no way more bug-prone than writing equivalent but
separate RAID firmware).
Incorrect. The software processing the data may be totally reliable.
But other software in the system can corrupt things.

Windows 3.x was a perfect example. By itself, it was quite reliable.
But applications running under it could corrupt the system. And the
programs processing the data may have been perfect - but another program
could get in there and corrupt the data.

With the advent of the 80286 and later chipsets with virtual memory,
rings, protected memory and the rest of the nice things, an application
has a much smaller possibility of corrupting the system. But under
certain conditions it can still happen. And even with all that, a
poorly written driver can really screw things up.

These things cannot happen in a RAID device, where all the firmware is
completely isolated from the rest of the system.

And adding a second piece of hardware INCREASES reliability, when
properly configured. That's why RAID devices exist, after all. Sure,
you have a slightly higher probability of ONE piece failing, but an
almost infinitesimal chance of BOTH pieces failing at the same time.
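To put rough numbers on that "both pieces failing at the same time" argument, here is a back-of-the-envelope sketch; the failure rate and repair window are invented purely for illustration, and the two units are assumed to fail independently.

# Chance that a duplicated pair loses service because the second unit fails
# while the first is still being replaced. Figures are illustrative only.
annual_failure_rate = 0.03          # assumed 3% chance a single unit fails in a year
repair_window_days = 2.0            # assumed time to replace/rebuild the failed unit

p_single = annual_failure_rate
p_second_during_repair = annual_failure_rate * (repair_window_days / 365.0)
p_double = 2 * p_single * p_second_during_repair   # either unit can be the one that fails first

print(f"one unit fails within a year:   {p_single:.4%}")
print(f"both fail before repair:        {p_double:.6%}")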
And the high end RAID devices do not require
>special software - they look like any other disk device attached to
the system.


Which is why they cannot include the kinds of end-to-end checks that a
software implementation inside the OS can: the standard interface
doesn't support it.
Not on the Intel systems, they don't, I agree. But mainframes, for
instance, are designed such that their I/O channels are parity checked
in the hardware. A parity problem on the bus brings the system to an
immediate screeching halt. This guarantees the data sent by the
controller arrives in the system correctly.
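For readers who haven't met bus parity before, the mechanism being described is just one extra bit per transferred unit, chosen so the count of 1-bits is even, which lets the receiving end detect any single flipped bit. A toy illustration:

def even_parity_bit(byte: int) -> int:
    """Return the parity bit that makes the total number of 1-bits even."""
    return bin(byte & 0xFF).count("1") % 2

def check(byte: int, parity: int) -> bool:
    """True if the received byte still matches its parity bit."""
    return even_parity_bit(byte) == parity

sent = 0b10110010
p = even_parity_bit(sent)
received = sent ^ 0b00000100        # simulate a single bit flipped in transit
assert check(sent, p) and not check(received, p)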
>>
As for bundling write acceleration in NVRAM - again, meaningless
because good RAID devices aren't loaded as a "special system device".


Perhaps you misunderstood: I listed transparent write-acceleration by
using NVRAM in hardware RAID as a hardware RAID *advantage* (it just has
nothing to do with *reliability*, which has been the issue under
discussion here).
OK. But asynchronous writes are buffered anyway (even in most non-RAID
controllers today), and if an immediate read finds the data still in the
buffer, most controllers will just return it from the buffer. And
synchronous writes should never return until they are physically on the
device (something that some file systems ignore).

It should never be cached in NVRAM - unless the NVRAM itself is the
storage device (which is getting more and more possible).
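On the software side, the rule that a synchronous write must not return until the data is physically on the device is what fsync (or opening with O_SYNC) is supposed to enforce, subject to the drive honouring cache-flush requests. A minimal sketch; the file path is illustrative only.

import os

def durable_write(path: str, payload: bytes) -> None:
    """Write payload and do not return until the OS reports it has reached stable storage."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        os.write(fd, payload)
        os.fsync(fd)          # flush data (and metadata) down to the device
    finally:
        os.close(fd)

durable_write("/tmp/journal.bin", b"committed record\n")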
>>
Prestoserve was one of the first lower-end RAID products made.


I don't believe that Prestoserve had much to do with RAID: it was
simply battery-backed (i.e., non-volatile) RAM that could be used as a
disk (or in front of disks) to make writes persistent at RAM speeds.
That's interesting. I'm thinking of a different system, then - it's
been a few years. I was thinking they actually had a RAID device, also.
However,
>there were a huge number of them before that. But you wouldn't find
them on a PC. They were primarily medium and large system devices.


I never heard of Prestoserve being available on a PC (though don't know
for certain that it wasn't): I encountered it on mid-range DEC and Sun
systems.
As I said - you wouldn't find them on a PC.
>Prestoserve took some of the ideas and moved much of the hardware
handling into software. Unfortunately, when they did it, they lost
the ability to handle problems at a low-level (i.e. read head biasing,
etc.). It did make the arrays a lot cheaper, but at a price.


I think you're confused: Prestoserve had nothing to do with any kind of
'hardware handling', just with write acceleration via use of NVRAM.
OK, again - a different device than what I was thinking of. But yes,
NVRAM devices do speed up disk access tremendously.
>>
And in the RAID devices, system address space was never a problem -
because the data was transferred to RAID cache immediately. This did
not come out of the system pool; the controllers have their own cache.


Which was what I said: back when amounts of system memory were limited
(by addressability if nothing else) this was a real asset that hardware
RAID could offer, whereas today it's much less significant (since a
64-bit system can now address and effectively use as much RAM as can be
connected to it).
That's true today. However, it still remains that the data can be
corrupted while in system memory. Once transferred to the controller,
it cannot be corrupted.
>>
I remember 64MB caches in the controllers way back in the mid 80's.
It's in the GB, now.


Indeed - a quick look at some current IBM arrays shows support up to at
least 1 GB (and large EMC and HDS arrays offer even more). On the other
hand, system RAM in large IBM p-series systems can reach 2 TB these
days, so (as I noted) the amount that a RAID controller can add to that
is far less (relatively) significant than it once was.
I'm surprised it's only 1 GB. I would have figured at least 100GB. The
64MB was from back in the 80's, when disks held a little over 600MB (and
the platters were the size of the tires on your car).
>>But time marches on. Most serious operating systems now support
(either natively or via extremely reputable decade-old,
thoroughly-tested third-party system software products from people
like Veritas) software RAID, and as much cache memory as you can
afford (no more address-space limitations there) - plus (with
products like ZFS) are at least starting to address synchronous
small-update throughput (though when synchronous small-update
*latency* is critical there's still no match for NVRAM).

Sure, you can get software RAID. But it's not as reliable as a good
hardware RAID.


That is simply incorrect: stop talking garbage.
And you stop talking about something you know nothing about.
>>
>> You don't take $89 100GB disk drives off the shelf,

tack them onto an EIDE controller and add some software to the system.

Actually, you can do almost *precisely* that, as long as the software
handles the situation appropriately - and that's part of what ZFS
is offering (and what you so obviously completely fail to be able to
grasp).

In the cheap RAID devices, sure. But not in the good ones. You're
talking cheap. I'm talking quality.


No: I'm talking relatively inexpensive, but with *no* sacrifice in
quality. You just clearly don't understand how that can be done, but
that's your own limitation, not any actual limitation on the technology.
No, YOU don't understand that to make a RAID device "relatively
inexpensive" you must cut something out. That includes both hardware
and firmware. And that makes them less reliable.

Sure, they are reliable enough for many situations. But they are not as
reliable as the high end devices.
>>
>>No disk or firmware is completely foolproof. Not one. No matter how
expensive and well-designed. So the question isn't whether the disks
and firmware are unreliable, but just the degree and manner in which
they are.

I never said they were 100% foolproof. Rather, I said they are
amongst the most tested software made. Probably the only software
tested more thoroughly is the microcode on CPU's. And they are as
reliable as humanly possible.


So is the system software in several OSs - zOS and VMS, for example (I
suspect IBM's i-series as well, but I'm not as familiar with that).
Those systems can literally run for on the order of a decade without a
reboot, as long as the hardware doesn't die underneath them.
If that were the case, then companies wouldn't have to spend so much
money on support structures, issuing fixes and the like. Sure, this
software is tested. But not nearly as thoroughly as firmware.

But that's only to be expected. Firmware has a very dedicated job, with
limited functionality. For instance, in the case of RAID devices, it
has to interpret a limited number of commands from the system and
translate those to an even smaller number of commands to the disk
electronics.

System software, OTOH, must accept a lot more commands from application
programs, handle more devices of different types, and so on. As a
result, it is much more complicated than any RAID firmware, and much
more prone to problems.

And people trust their data to the system software already, so (as I
already noted) there's no significant additional exposure if that
software handles some of the RAID duties as well (and *less* total
*hardware* exposure, since no additional hardware has been introduced
into the equation).
There is a huge additional exposure, but you refuse to believe that. If
there were no additional exposure, then why are people buying hardware
RAID devices left and right? After all - according to you, there is
plenty of memory, plenty of CPU cycles, and the software is perfect.

But they are. That's because your premises are wrong.
>>
Of course, the same thing goes for ZFS and any file system. They're
not completely foolproof, either, are they?


No, but neither need they be any *less* foolproof: it just doesn't
matter *where* you execute the RAID operations, just *how well* they're
implemented.
It makes a world of difference where you execute the RAID operations.
But you refuse to understand that.
>>
>>There is, to be sure, no way that you can make a pair of inexpensive
SATA drives just as reliable as a pair of Cheetahs, all other things
being equal. But it is *eminently* possible, using appropriate
software (or firmware), to make *three or four* inexpensive SATA
drives *more* reliable than a pair of Cheetahs that cost far more -
and to obtain better performance in many areas in the bargain.

And there is no way to make a pair of Cheetahs as reliable as drives
made strictly for high end RAID devices. Some of these drives still
sell for $30-60/GB (or more).


It is possible that I'm just not familiar with the 'high-end RAID
devices' that you're talking about - so let's focus on that.
From what you're saying, you're not even familiar with the low-end RAID
devices. All you know about is the marketing hype you've read about
ZFS. And you take that as gospel, and everyone else is wrong.

Let me clue you in, Bill. There are millions of people out there who
know better.
EMC took a major chunk of the mainframe storage market away from IBM in
the '90s using commodity SCSI disks from (mostly) Seagate, not any kind
of 'special' drives (well, EMC got Seagate to make a few firmware
revisions, but nothing of major significance - I've always suspected
mostly to keep people from by-passing EMC and getting the drives at
standard retail prices). At that point, IBM was still making its own
proprietary drives at *far* higher prices per GB, and EMC cleaned up (by
emulating the traditional IBM drive technology using the commodity SCSI
Seagate drives and building in reliability through redundancy and
intelligent firmware to compensate for the lower per-drive reliability -
exactly the same kind of thing that I've been describing to let today's
SATA drives substitute effectively for the currently-popular higher-end
drives in enterprise use).
Sure. But these weren't RAID devices, either. A completely different
market. Those who needed high reliability bought hardware RAID devices
from IBM or others. And they still do.
Since then, every major (and what *I* would consider 'high-end') array
manufacturer has followed that path: IBM and Hitachi use commodity
FC/SCSI drives in their high-end arrays too (and even offer SATA drives
as an option for less-demanding environments). These are the kinds of
arrays used in the highest-end systems that, e.g., IBM and HP run their
largest-system TPC-C benchmark submissions on: I won't assert that even
higher-end arrays using non-standard disks don't exist at all, but I
sure don't *see* them being used *anywhere*.
Sure, they use the drives themselves. But the high end RAID devices
have different electronics, different controllers, and in some cases,
even different coatings on the disk surfaces.
So exactly what drives are you talking about that cost far more than the
best that Seagate has to offer, and offer far more features in terms of
external control over internal drive operations (beyond the standard
'SCSI mode page' tweaks)? Where can we see descriptions of the
super-high-end arrays (costing "$100-500/GB" in your words) that use
such drives (and preferably descriptions of *how* they use them)?
The high end RAID devices are only available to OEMs at a much higher
price. Check with your IBM sales rep for one.
Demonstrating that you have at least that much of a clue what you're
talking about would not only help convince people that you might
actually be worth listening to, but would also actually teach those of
us whose idea of 'high-end arrays' stops with HDS and Symmetrix
something we don't know. Otherwise, we'll just continue to assume that
you're at best talking about '80s proprietary technology that's
completely irrelevant today (if it ever existed as you describe it even
back then).
I have demonstrated that. However, you have demonstrated you are either
too stupid or too close-minded to understand basic facts. Which is it,
troll?
That would not, however, change the fact that equal reliability can be
achieved at significantly lower cost by using higher numbers of low-cost
(though reputable) drives with intelligent software to hook them
together into a reliable whole. In fact, the more expensive those
alleged non-standard super-high-end drives (and arrays) are, the easier
that is to do.
And that's where you're wrong. And those who really understand the high
end RAID devices disagree with you.
...
>>And you don't attach them through Brand X SATA controllers, either:
ideally, you attach them directly (since you no longer need any
intermediate RAID hardware), using the same quality electronics you
have on the rest of your system board (so the SATA connection won't
constitute a weak link). And by virtue of being considerably simpler
hardware/firmware than a RAID implementation, that controller may
well be *more* reliable.

There is no way this is more reliable than a good RAID system. If you
had ever used one, you wouldn't even try to make that claim.


One could equally observe that if you had ever used a good operating
system, you wouldn't even try to make *your* claim. You clearly don't
understand software capabilities at all.
I have used good operating systems. And I used to work with the
internals of operating systems when I worked for IBM as a Software
Engineer. I dare say I know a lot more about system software than you
do, especially from the internals end. Your statements above about how
reliable they are prove that.
...
>>Whether you're aware of it or not, modern SATA drives (and even
not-too-old ATA drives) do *all* the things that you just described
in your last one-and-a-half paragraphs.

And let's see those drives do things like dynamically adjust the
electronics - such as amp gain, bias, slew rate... They can't do it.


Your continuing blind spot is in not being able to understand that they
don't have to: whatever marginal improvement in per-drive reliability
such special-purpose advantages may achieve (again, assuming that they
achieve any at all: as I noted, commodity SATA drives *do* support most
of what you described, and I have no reason to believe that you have any
real clue what actual differences exist), those advantages can be
outweighed simply by using more lower-cost drives (such that two or even
three can fail for every high-cost drive failure without jeopardizing
data).
Oh, I know the differences. However, it's obvious you don't.
Fortunately, those who need to understand the differences do - and they
buy hardware RAID.
...
>>Modern disks (both FC/SCSI and ATA/SATA) do that themselves, without
waiting for instructions from a higher level. They report any
failure up so that the higher level (again, doesn't matter whether
it's firmware or software) can correct the data if a good copy can be
found elsewhere. If its internal retry succeeds, the disk doesn't
report an error, but does log it internally such that any interested
higher-level firmware or software can see whether such successful
retries are starting to become alarmingly frequent and act accordingly.

Yes, they report total failure on a read. But they can't go back and
try to reread the sector with different parms to the read amps, for
instance.


As I already said, I have no reason to believe that you know what you're
talking about there. Both FC/SCSI and ATA/SATA drives make *exhaustive*
attempts to read data before giving up: they make multiple passes,
jigger the heads first off to one side of the track and then off to the
other to try to improve access, and God knows what else - to the point
where they can keep working for on the order of a minute trying to read
a bad sector before finally giving up (and I suspect that part of what
keeps them working that long includes at least some of the kinds of
electrical tweaks that you describe).
Of course you have no reason to believe it. You have no electronics
background at all. You have no idea how it works.

Multiple passes are a completely different thing. Any drive can try to
reread the sector. But advanced drives can vary the read parameters to
compensate for marginal signals.

To simplify things for you - it's like a camera being able to adjust the
F-STOP and shutter speed to account for different lighting conditions.
A box camera has a single shutter speed and a fixed lens opening. It
can take excellent pictures of nearly stationary objects under specific
lighting conditions. But get very high or very low light, and the
result is either overexposure or underexposure. A flash helps in low
light conditions, but that's about all you can do.

However a good SLR camera can adjust both the shutter speed and lens
opening. It can take excellent pictures under a wide variety of
lighting conditions. It can even virtually freeze rapidly moving
objects without underexposure.

But you aren't going to find a good SLR for the same price as a box camera.

In the same way, you won't find a top of the line hardware RAID for the
same price as a cheap one. The top of the line one can do more things
to ensure data integrity.
And a good RAID controller will make decisions based in part
>on what parameters it takes to read the data.
>>>>
Also, with two or more controllers, the controllers talk to each
other directly, generally over a dedicated bus. They keep each
other informed of their status and constantly run diagnostics on
themselves and each other when the system is idle.

Which is only necessary because they're doing things like capturing
updates in NVRAM (updates that must survive controller failure and
thus need to be mirrored in NVRAM at the other controller): if you
eliminate that level of function, you lose any need for that level of
complexity (not to mention eliminating a complete layer of complex
hardware with its own potential to fail).

This has nothing to do with updates in NVRAM.


Yes, it does. In fact, that's about the *only* reason they really
*need* to talk with each other (and they don't even need to do that
unless they're configured as a fail-over pair, which itself is not an
actual 'need' when data is mirrored such that a single controller can
suffice).
And in a top of the line system a single controller *never* suffices.
What happens if that controller dies? Good RAID devices always have at
least two controllers to cover that possibility, just as they have two
disks for mirroring.

And they still write data to disk. NVRAM still cannot pack the density
of a hard disk.

Plus, controllers still talk to each other all the time. They run
diagnostics on each other, for instance. They also constantly track
each other's operations, to ensure a failure in one is accurately
reflected back to the system. After all, a failing component cannot be
trusted to detect and report its failure to the system. That can fail,
also.
This has everything to do
>with processing the data, constant self-checks, etc. This is critical
in high-reliability systems.


No, it's not. You clearly just don't understand the various ways in
which high reliability can be achieved.

...
>> These tests include reading and writing

test cylinders on the disks to verify proper operation.

The background disk scrubbing which both hardware and software RAID
approaches should be doing covers that (and if there's really *no*
writing going on in the system for long periods of time, the software
can exercise that as well once in a while).

No, it doesn't. For instance, these tests include things like writing
with a lower-level signal than normal and trying to read it back. It
helps catch potential problems in the heads and electronics. The same
is true for writing with stronger than normal currents - and trying to
read them back. Also checking adjacent tracks for "bit bleed". And a
lot of other things.

These are things again no software implementation can do.


Anything firmware can do, software can do - but anything beyond standard
'mode page' controls would require use of the same special-purpose disk
interface that you allege the RAID firmware uses.
Again, you don't understand what you're talking about. Not if the
commands are not present at the interface - which they aren't.

Such processing would add a tremendous amount of overhead to the system.
The system would have to handle potentially hundreds of variations of
parameters - every drive manufacturer has slightly different parameters,
and many drives vary even within a manufacturer. It depends on the spin
rate, density and magnetic coating used on the device. Any filesystem
which tried to manage these parameters would have to know a lot of
details about every possible disk on the market. And when a new one
came out, those parameters would be required, also.

The inexpensive disk drives are made to present a simple interface to
the system. They understand a few commands, such as initialize,
self-test, read, write and seek. Not a lot more. It's neither
practical nor necessary in most systems to have any more. And to add
these capabilities would drastically increase the price of the drives
themselves - not to mention the cost of developing the software to
handle the commands.

OTOH, high end RAID controllers are made to work closely with one
specific device (or a limited number of devices). They don't need to
worry about hundreds of different parameters. Only the set of
parameters they are made to work with.

Again, though, there are more ways to skin the reliability cat than
continually torturing the disk through such a special interface - the
opposite extreme being just to do none of those special checks, let the
disk die (in whole or in part) if it decides to, and use sufficient
redundancy that that doesn't matter.
Sure, you can do that. And the low end RAID devices do just that. But
high end devices are more reliable just *because* they do that. And
that's why they are in such demand for truly critical data.
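For concreteness, the background scrubbing both sides keep referring to reduces to a simple loop: read every block of every copy, compare, and flag (or repair) disagreements. The sketch below is only an illustration of that idea for a two-way mirror exposed as two ordinary device or file paths; the chunk size is arbitrary, and a real scrubber would use per-block checksums to decide which copy to trust.

import hashlib

CHUNK = 64 * 1024  # illustrative scrub unit

def scrub_mirror(path_a: str, path_b: str) -> int:
    """Walk both halves of a mirror and count chunks whose contents disagree."""
    mismatches = 0
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            block_a = a.read(CHUNK)
            block_b = b.read(CHUNK)
            if not block_a and not block_b:
                break
            if hashlib.sha256(block_a).digest() != hashlib.sha256(block_b).digest():
                mismatches += 1   # a real scrubber would now decide which copy to trust and rewrite the other
    return mismatches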
>>
>>>>
Additionally, in the more expensive RAID devices, checksums are
typically at least 32 bits long (your off-the-shelf drive typically
uses a 16 bit checksum), and the checksum is built in hardware -
much more expensive, but much faster than doing it in firmware.
Checksum comparisons are done in hardware, also.

Your hand-waving just got a bit fast to follow there.

1. Disks certainly use internal per-sector error-correction codes
when transferring data to and from their platters. They are hundreds
(perhaps by now *many* hundreds) of bits long.


Actually, not.


Actually, yes: you really should lose your habit of making assertions
about things that you don't know a damn thing about.

Sectors are still 512 bytes. And the checksums (or ECC,
>if they use them) are still only 16 or 32 bits.


No, they are not.
You had better go back and check your facts again. The system can block
data in any size it wants. But the hardware still uses 512 byte blocks.
And even if they use
>ECC, 32 bits can only correct up to 3 bad bits out of the 512
bytes.


Which is why they use hundreds of bits.
> None use "many hundreds of bits".


Yes, they do.

It would waste too much disk
>space.


No, it doesn't - though it and other overhead do use enough space that
the industry is slowly moving toward adopting 4 KB disk sectors to
reduce the relative impact.

Seagate's largest SATA drive generates a maximum internal bit rate of
1030 Mb/sec but a maximum net data transfer rate of only 78 MB/sec,
suggesting that less than 2/3 of each track is occupied by user data -
the rest being split between inter-record gaps and overhead (in part
ECC). IBM's largest SATA drive states that it uses a 52-byte (416-bit)
per-sector ECC internally (i.e., about 10% of the data payload size); it
claims to be able to recover 5 random burst errors and a single 330-bit
continuous burst error.
Sure, but there are a lot of things between the internal bit rate and
the data transfer rate. Internal bit rate is the speed at which data is
read off the disk. But this is not continuous. There are seek times
(both head and sector) and inter-record gaps which slow things down, for
instance. Plus you're talking megaBITS/s for the internal rate, and
megaBYTES per second for the external transfer rate. 78MB/s translates
to about 624 Mb/s. Not bad, I do admit.
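The unit conversion in that paragraph is easy to check mechanically; the figures below are the ones quoted in the posts, not numbers taken from any datasheet.

internal_bit_rate_mbit = 1030      # quoted maximum internal (media) rate, Mb/s
external_rate_mbyte = 78           # quoted maximum sustained transfer rate, MB/s

external_rate_mbit = external_rate_mbyte * 8
print(f"external rate: {external_rate_mbit} Mb/s")   # 624 Mb/s
print(f"user data as a fraction of the raw media rate: {external_rate_mbit / internal_bit_rate_mbit:.2f}")
# ~0.61, i.e. a bit under 2/3 - consistent with gaps, ECC and other overhead taking the rest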

And their ECC is beyond what the normal disk drive does. Most still use
a 512 byte sector with a 16 or 32 bit checksum (a few use ECC). Those
which do claim a 4K sector generally emulate it in firmware.
Possibly you were confused by the use of 4-byte ECC values in the 'read
long' and 'write long' commands: those values are emulated from the
information in the longer physical ECC.
Nope, not at all.
>>
>>>
2. Disks use cyclic redundancy checks on the data that they accept
from and distribute to the outside world (old IDE disks did not, but
ATA disks do and SATA disks do as well - IIRC the width is 32 bits).

See above. And even the original IDE drives used a 16 bit checksum.


I'm not sure what you mean by 'see above', unless you're confused about the
difference between the (long) ECC used to correct data coming off the
platter and the (32-bit - I just checked the SATA 2.5 spec) CRC used to
guard data sent between the disk and the host.
No, I'm talking about the 16- or 32-bit checksum still used by the
majority of the low-end drives.
>>
>>3. I'd certainly expect any RAID hardware to use those CRCs to
communicate with both disks and host systems: that hardly qualifies
as anything unusual. If you were talking about some *other* kind of
checksum, it would have to have been internal to the RAID, since the
disks wouldn't know anything about it (a host using special driver
software potentially could, but it would add nothing of obvious value
to the CRC mechanisms that the host already uses to communicate
directly with disks, so I'd just expect the RAID box to emulate a
disk for such communication).

CRC's are not transferred to the host system, either in RAID or
non-RAID drives.


Yes, they are: read the SATA spec (see the section about 'frames').

Yes, some drives have that capability for diagnostic purposes.
OK, now you're talking specific drives. Yes SATA can transfer a
checksum to the system. But it's not normally done, and is specifically
for diagnostic purposes. They are a part of the test to ensure the
checksum is being correctly computed - but not transferred as part of
normal operation.
> But as a standard practice, transferring 512 bytes is 512 bytes of
data - no more, no less.
>>4. Thus data going from system memory to disk platter and back goes
(in each direction) through several interfaces and physical
connectors and multiple per-hop checks, and the probability of some
undetected failure, while very small for any given interface,
connector, or hop, is not quite as small for the sum of all of them
(as well as there being some errors, such as misdirected or lost
writes, that none of those checks can catch). What ZFS provides
(that by definition hardware RAID cannot, since it must emulate a
standard block-level interface to the host) is an end-to-end checksum
that verifies data from the time it is created in main memory to the
time it has been fetched back into main memory from disk. IBM,
NetApp, and EMC use somewhat analogous supplementary checksums to
protect data: in the i-series case I believe that they are created
and checked in main memory at the driver level and are thus
comparably strong, while in NetApp's and EMC's cases they are created
and checked in the main memory of the file server or hardware box but
then must get to and from client main memory across additional
interfaces, connectors, and hops which have their own individual
checks and are thus not comparably end-to-end in nature - though if
the NetApp data is accessed through a file-level protocol that
includes an end-to-end checksum that is created and checked in client
and server main memory rather than, e.g., in some NIC hardware
accelerator it could be *almost* comparable in strength.

Yes, ZFS can correct for errors like bad connectors and cables. And I
guess you need it if you use cheap connectors or cables.


You need it, period: the only question is how often.
Not if you have good hardware, you don't. That will detect bad data *at
the system*, like IBM's mainframe I/O channels do. But with cheap
hardware, yes you need it.
But even if
>they do fail - it's not going to be a one-time occurrence. Chances
are your system will crash within a few hundred ms.

I don't know about NetApp, but IBM doesn't work this way at all. The
channel itself is parity checked by hardware on both ends. Any parity
check brings the system to an immediate halt.


Exactly what part of the fact that the end-to-end ZFS mechanism is meant
to catch errors that are *not* caught elsewhere is still managing to
escape you? And that IBM uses similar mechanisms itself in its i-series
systems (as to other major vendors like EMC and NetApp) for the same
reason?
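Stated as code rather than argument, the end-to-end scheme computes the checksum where the data is created and verifies it where the data is consumed, so corruption introduced anywhere along the path (driver, cable, controller, firmware, platter) is caught on the way back. The sketch below is a simplified illustration, not how ZFS is implemented (ZFS keeps each block's checksum in the parent block pointer); the paths are invented.

import hashlib
import os

def write_with_checksum(path: str, payload: bytes) -> bytes:
    """Compute the checksum in application memory, then persist the data."""
    digest = hashlib.sha256(payload).digest()
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())
    return digest            # stored separately from the data it covers

def read_verified(path: str, expected: bytes) -> bytes:
    """Re-verify in application memory after the data has crossed every hop on the way back."""
    with open(path, "rb") as f:
        payload = f.read()
    if hashlib.sha256(payload).digest() != expected:
        raise IOError("end-to-end checksum mismatch: fetch a redundant copy instead")
    return payload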
>>
>>>>
Plus, with verified writes, the firmware has to go back and reread
the data the next time the sector comes around and compare it with
the contents of the buffer. Again, this is often done in hardware
on the high end RAID systems.

And can just as well be done in system software (indeed, this is
often a software option in high-end systems).

Sure, it *can* be done with software, at a price.


A much lower price than is required to write the same code as firmware
and then enshrine it in additional physical hardware.
I never argued that you can't do *some* of it in software. But you
can't do *all* of it in software.

And of course it's cheaper to do it in software. But that doesn't make
it *more* reliable. It doesn't even make it *as* reliable.
>>
>>>>
And, most of these RAID devices use custom chip sets - not something
off the shelf.

That in itself is a red flag: they are far more complex and also get
far less thoroughly exercised out in the field than more standard
components - regardless of how diligently they're tested.
Gotten a cell phone lately? Chances are the chips in your phone are
custom-made. Each manufacturer creates its own. Or an X-BOX,
Nintendo, PlayStation, etc.? Most of those have custom chips. And
the same is true for microwaves, TV sets and more.

The big difference is that Nokia can make 10M custom chips for its
phones; for a high-end RAID device, 100K is a big run.


Exactly the point I was making: they get far less exercise out in the
field to flush out the last remaining bugs. I'd be more confident in a
carefully-crafted new software RAID implementation than in an equally
carefully-crafted new hardware-plus-firmware implementation, because the
former has considerably less 'new' in it (not to mention being easier to
trouble-shoot and fix in place if something *does* go wrong).
One correction. They get *more* exercise in the plant and therefore
*need* far less exercise in the field to flush out remaining bugs.

And I'm glad you're more confident in a new software RAID implementation
than a new hardware-plus-firmware implementation. Fortunately, people
who need total reliability for critical data disagree with you.
...
>You seem to think software is the way to go. Just tell me one thing.
When was the last time you had to have your computer fixed because of
a hardware problem? And how many times have you had to reboot due to
a software problem?


Are you seriously suggesting that Intel and Microsoft have comparable
implementation discipline? Not to mention the relative complexity of an
operating system plus full third-party driver and application spectrum
vs. the far more standardized relationships that typical PC hardware
pieces enjoy.
First of all, I didn't say anything about Microsoft or Intel. You could
be talking *any* software or hardware manufacturer. The same thing goes
for Linux, OS/X or any other software. And the same thing goes for
Western Digital, HP, or any hardware manufacturer.

But you sidestep the question because I'm right.
We're talking about reliability *performing the same function*, not
something more like comparing the reliability of an automobile engine
with that of the vehicle as a whole.
We're talking about software vs. hardware reliability. It's a fair
comparison.
>>
And you say software is as reliable?


For a given degree of complexity, and an equally-carefully-crafted
implementation, software that can leverage the reliability of existing
hardware that already has to be depended upon for other processing is
inherently more reliable - because code is code whether in software or
firmware, but the software-only approach has less hardware to go wrong.
The fact still remains that hardware is much more reliable than
software. Even your processor runs on firmware (microcode). And your
disk drives have firmware. As do your printers, video adapter, Ethernet
port and even your serial port (unless you have a winmodem). And how
often do these fail? Or how about your cell phone, microwave or even
your TV set? These all have firmware, also. Even the U.S. power grid
is run by firmware.

I know of one bug in the Intel chips, for instance, in the early
Pentiums. A few had a very obscure floating point bug which under
certain conditions gave an incorrect result. This was a firmware bug
which was promptly corrected. There may have been others, but I haven't
heard of them.

The fact remains - firmware, because of the limited job it has to do and
limited interfaces can be (and is) tested much more thoroughly than any
general software implementation.
...
>>I seriously doubt that anyone who's been talking with you (or at
least trying to) about hardware RAID solutions has been talking about
any that you'd find at CompUSA. EMC's Symmetrix, for example, was
the gold standard of enterprise-level hardware RAID for most of the
'90s - only relatively recently did IBM claw back substantial market
share in that area (along with HDS).

Actually, Symmetrix grew big in the small and medium systems, but IBM
never lost the lead in the top end RAID solutions. But they also were
(and still are) quite a bit more expensive than EMC's.


You really need to point to specific examples of the kinds of 'higher
end' RAIDs that you keep talking about (something we can look at and
evaluate on line, rather than asking us to take your unsupported word
for it). *Then* we'll actually have concrete competing approaches to
discuss.
Talk to your IBM sales rep, for one. These devices are not available
"on line". How many mainframes do you see listed online? Or other
high-end systems?

Believe it or not, there are a lot of things which aren't available
online - because they are not general consumer products. And companies
who are looking for those products are not looking online.
...

the software cannot
>>>
detect when a signal is getting marginal (it's either "good" or
"bad"), adjust the r/w head parameters, and similar things.

And neither can hardware RAID: those things happen strictly
internally at the disk (for that matter, by definition *anything*
that the disk externalizes can be handled by software as well as by
RAID hardware).

And here you show you know nothing about what you talk. RAID drives
are specially built to work with their controllers. And RAID
controllers are made to be able to do these things. This is very low
level stuff - not things which are available outside the
drive/controller.

Effectively, the RAID controller and the disk controller become one
unit. Separate, but one.


Provide examples we can look at if you want anyone to believe you.
Again, talk to your IBM sales rep.
>>
>> Yes, it can

checksum the data coming back and read from the mirror drive if
necessary.

Yup.

Now, that *used* to be at least something of a performance issue -
being able to offload that into firmware was measurably useful. But
today's processor and memory bandwidth makes it eminently feasible -
even in cases where it's not effectively free (if you have to move
the data, or have to compress/decompress or encrypt/decrypt it, you
can generate the checksum as it's passing through and pay virtually
no additional cost at all).

Sorry, Bill, this statement is really off the wall.


Not at all: in fact, you just had someone tell you about very
specifically comparing ZFS performance with more conventional approaches
and finding it *better*.
And as I noted, there he was comparing apples and oranges, because he
wasn't comparing to a high end RAID array.
>>
Then why do all the high end disk controllers use DMA to transfer data?


Because a) it's only been in the past few years that CPU and memory
bandwidth has become 'too cheap to meter' (so controllers are still
using the same approaches they used to) and b) there's no point in
*wasting* CPU bandwidth for no reason (DMA isn't magic, though: it
doesn't save any *memory* bandwidth).
Nope, it's because it's faster and creates a lower load on the
processor. People still pay extra to get that capability. There must
be a reason why.
When you're already moving the data, computing the checksum is free. If
you're not, it's still cheap enough to be worth the cost for the benefit
it confers (and there's often some way to achieve at least a bit of
synergy - e.g., then deciding to move the data after all because it
makes things easier and you've already got it in the processor cache to
checksum it).
Actually, not. It takes cycles to compute the checksum, even if you're
doing it in hardware. It just takes fewer cycles (but more electronics)
to do it in hardware.

And as I noted before, the checksum is not transferred under normal
conditions.
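The "checksum while the data is already passing through" point looks like this in practice: the CRC update is folded into the copy loop instead of being a separate pass over memory. A sketch only; the chunk size and the idea of copying a file are stand-ins for whatever data movement is already happening.

import zlib

def copy_with_crc(src_path: str, dst_path: str, chunk: int = 64 * 1024) -> int:
    """Copy a file and return the CRC32 of its contents, computed on the data in flight."""
    crc = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            block = src.read(chunk)
            if not block:
                break
            crc = zlib.crc32(block, crc)   # updated while the block is already in hand
            dst.write(block)
    return crc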
> Because it's faster and takes fewer CPU cycles than doing it in
software, that's why. And computing checksums for 512 bytes takes a
significantly longer time than actually transferring the data to/from
memory via software.


It doesn't take *any* longer: even with pipelined and prefetched
caching, today's processors can compute checksums faster than the data
can be moved.
Sure it takes longer. 4K of data can be transferred via hardware in as
little as 1K clock cycles (assuming a 32 bit bus). The same data
transfer in software takes around 10 times that long (I'd have to check
the current number of cycles each instruction in the loop takes to make
sure).

And there is no way the software can compute a 4K checksum in 1K cycles.
It can't even do it in 10K cycles.

But you've probably never written any assembler, so you don't know the
machine instructions involved or even the fact that a single instruction
can (and almost all do) take multiple clock cycles.
>>
Also, instead of allocating 512 byte buffers, the OS would have to
allocate 514 or 516 byte buffers. This removes a lot of the
optimization possible when the system is using buffers during operations.


1. If you were talking about something like the IBM i-series approach,
that would be an example of the kind of synergy that I just mentioned:
while doing the checksum, you could also move the data to consolidate it
at minimal additional cost.

2. But the ZFS approach keeps the checksums separate from the data, and
its sectors are packed normally (just payload).
OK, so ZFS has its own checksum for its data. But this is not the same
as the disk checksum. And having to move the data yet again just slows
the system down even more.
>>
Additionally, different disk drives internally use different
checksums.

Plus there is no way to tell the disk what to write for a checksum.
This is hard-coded into the disk controller.


You're very confused: ZFS's checksums have nothing whatsoever to do
with disk checksums.

- bill
But you've repeated several times how the disk drive returns its
checksum and ZFS checks it. So now you admit that isn't true.

But with a good RAID controller and hardware at the system, you can be
assured the data is received on the bus correctly. Unfortunately, PC's
don't have a way of even checking parity on the bus.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================
Nov 14 '06 #101
Jerry Stuckle <js*******@attglobal.net> wrote:
Bill Todd wrote:
You really need to point to specific examples of the kinds of 'higher
end' RAIDs that you keep talking about (something we can look at and
evaluate on line, rather than asking us to take your unsupported word
for it). *Then* we'll actually have concrete competing approaches to
discuss.

Talk to your IBM sales rep, for one. These devices are not available
"on line". How many mainframes do you see listed online? Or other
high-end systems?

Believe it or not, there are a lot of things which aren't available
online - because they are not general consumer products. And companies
who are looking for those products are not looking online.

Hehehehehehhe, this is really funny.

So only those super secret IBM arrays that no one can read about can
do some magical things Jerry was talking about - they never fail, never corrupt
data, etc. Of course you can't read about them.

You know what - I guess you came from some other plane of reality.

Here, in this universe, IBM hasn't developed such wonderful devices, at least not
yet. And unfortunately no one else did. Here, even mighty IBM has implemented for
specific applications like Oracle some form of end-to-end integrity, 'coz even
their arrays can corrupt data (or something else between can).

Now go back to your top-secret universe and praise those top secret technologies.
Yours is truly a beautiful mind.
ps. I guess we should leave him now and let him go

--
Robert Milkowski
rm************@wp-sa.pl
http://milek.blogspot.com
Nov 14 '06 #102
Jerry Stuckle wrote:
Robert Milkowski wrote:
....
>Reliability comes from end-to-end data integrity, which HW RAID itself can't
provide, so your data are less protected.

It can provide integrity right to the connector.
Which is not the same thing at all, and not good enough to keep
occasional data corruption out even when using the highest-grade hardware.

The point of doing end-to-end checks in main memory is not *only* to get
that last smidgeon of reliability, of course: it's also about getting
reliability comparable to the very *best* hardware solutions while using
relatively inexpensive hardware.
>
>Your RAID doesn't protect you from anything between itself and your host.
Vendors have recognized this for years; that's why IBM, EMC, Oracle, Sun,
Hitachi, etc. all provide some hacks for specific applications like Oracle.
Of course none of those solutions are nearly as complete as ZFS.
I know, you know better than all those vendors. You know better than people
who actually lost their data both on cheap and 500k+ (you seem to like this
number) arrays. It's just that you for some reason can't accept simple facts.

That's not its job.
So what? This discussion is not about individual component reliability,
but about overall subsystem reliability (otherwise, it would not include
the file system layer at all).

Its job is to deliver accurate data to the bus. If
you want further integrity checking, it's quite easy to do in hardware,
also - i.e. parity checks, ECC, etc. on the bus. That's why IBM
mainframes have parity checking on their channels.
So, of course, do commodity systems and the disks on them: it's been a
*long* time (going at least back to the days of old-style IDE) since
communication between host and disk was unprotected.

And the fact that the sum of the individual checks that you describe is
insufficient to guarantee the level of data integrity that IBM would
like to have is why IBM supplements those checks with the kind of
end-to-end checks that they build into their i-series boxes.

....
>Reliability comes from never overwriting actual data on a medium, so you
don't have to deal with incomplete writes, etc.

That's where you're wrong. You're ALWAYS overwriting data on a medium.
Otherwise your disk would quickly fill.
No, Jerry: that's where *you're* wrong, and where it becomes
crystal-clear that (despite your assertions to the contrary) you don't
know shit about ZFS - just as it appears that you don't know shit about
so many other aspects of this subject that you've been expostulating
about so incompetently for so long.

ZFS does not overwrite data on disk: it writes updates to space on the
disk which is currently unused, and then frees up the space that the old
copy of the data occupied (if there was an old copy) to make it
available for more updates. There's only a momentary increase in space
use equal to the size of the update: as soon as it completes, the old
space gets freed up.
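A stripped-down illustration of that no-overwrite (copy-on-write) update may help; every name here is invented, and the real ZFS on-disk logic is of course far more involved.

class CowStore:
    """Toy copy-on-write store: updates go to fresh slots, old slots are freed afterwards."""

    def __init__(self, nslots: int) -> None:
        self.slots = [None] * nslots          # simulated disk blocks
        self.free = set(range(nslots))        # unused block numbers
        self.index = {}                       # logical block -> physical slot

    def write(self, logical: int, data: bytes) -> None:
        new_slot = self.free.pop()            # never the slot the old copy lives in
        self.slots[new_slot] = data
        old_slot = self.index.get(logical)
        self.index[logical] = new_slot        # the 'commit': switch the pointer
        if old_slot is not None:
            self.slots[old_slot] = None
            self.free.add(old_slot)           # the old space becomes available again

    def read(self, logical: int) -> bytes:
        return self.slots[self.index[logical]]

At no point is the live copy overwritten in place, so an interrupted update can only leave behind an orphaned new block, never a half-updated old one.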

....

Volume manager has nothing to do with ensuring data
integrity on a RAID system.
In a software implementation, the volume manager *is* the RAID system.
>
>Reliability comes from knowing exactly where on each disk your data is, so
if you do not have a RAID full of data ZFS will resilver a disk in case of
emergency MUCH faster, by resilvering only actual data and not all blocks on
disk. Also, as it understands the data on disk, it starts the resilver from /,
so even if the resilver isn't completed yet you get some protection from the
beginning. On a classic array you can't do that, as it can't understand the
data on its disks.

Reliability comes from not caring where your data is physically on the
disk.
Not in this instance: once again, you don't know enough about how ZFS
works to be able to discuss it intelligently.

The aspect of reliability that Bob was referring to above is the ability
of ZFS to use existing free space in the system to restore the desired
level of redundancy after a disk fails, without requiring dedicated idle
hot-spare disks - and to restore that redundancy more quickly by copying
only the actual data that existed rather than every sector that had
been on the failed disk (including those that were not occupied by live
data).
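The resilvering point in the same paragraph, reduced to a sketch: only blocks the filesystem knows are live get copied to the replacement device, rather than every sector of the failed disk. Names and structures are invented for illustration.

def resilver(live_blocks: dict[int, bytes], replacement: dict[int, bytes]) -> int:
    """Copy only allocated (live) blocks onto the replacement device; return blocks copied."""
    for block_no, data in live_blocks.items():
        replacement[block_no] = data
    return len(live_blocks)

# A mostly-empty pool resilvers quickly because unallocated sectors are simply skipped:
live = {7: b"metadata", 42: b"file data"}
new_disk: dict[int, bytes] = {}
print(resilver(live, new_disk), "blocks copied")   # 2, not the disk's full capacity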

....
there are things a RAID can do that ZFS cannot do.
ZFS *includes* RAID (unless you think that you're better-qualified to
define what RAID is than the people who invented it before the term even
existed and the later people who formally defined the term).

And for that matter the only things that you've been able to come up
with that even your own private little definition of RAID can do that
ZFS can't are these alleged special disk-head-level checks that your
alleged super-high-end arrays allegedly cause them to perform.

I now see that you've now replied to my morning post with a great deal
of additional drivel which simply isn't worth responding to - because,
despite my challenging you several times to point to *a single real
example* of this mythical super-high-end hardware that you keep babbling
about you just couldn't seem to come up with one that we could look at
to evaluate whether you were completely full of shit or might have at
least some small basis for your hallucinations (though you do seem to
have admitted that these mythical arrays use conventional disks after
all, despite your previous clear statements to the contrary).

When I realized that, boring and incompetent though you might be in
technical areas, you presented the opportunity to perform a moderately
interesting experiment in abnormal psychology, I decided to continue
talking to you to see whether referring to specific industry standards
(that define the communication path between host and commodity disk to
be handled very differently than you have claimed) and specific
manufacturer specifications (that define things like ECC lengths in
direct contradiction to the 'facts' that you've kept pulling out of your
ass) would make a dent in the fabric of your private little fantasy
world. Since it's now clear to me that you're both nutty as a fruitcake
and completely and utterly ineducable, that experiment is now at an end,
and so is our conversation.

- bill
Nov 14 '06 #103
Robert Milkowski wrote:
Jerry Stuckle <js*******@attglobal.net> wrote:
>>Bill Todd wrote:
>>>You really need to point to specific examples of the kinds of 'higher
end' RAIDs that you keep talking about (something we can look at and
evaluate on line, rather than asking us to take your unsupported word
for it). *Then* we'll actually have concrete competing approaches to
discuss.

Talk to your IBM sales rep, for one. These devices are not available
"on line". How many mainframes do you see listed online? Or other
high-end systems?

Believe it or not, there are a lot of things which aren't available
online - because they are not general consumer products. And companies
who are looking for those products are not looking online.

Hehehehehehhe, this is really funny.

So only those super secret IBM arrays that no one can read about can
do some magical things Jerry was talking about - they never fail, never corrupt
data, etc. Of course you can't read about them.
Not at all super secret. Just not available online.
You know what - I guess you came from some other plane of reality.

Here, in this universe, IBM hasn't developed such wonderful devices, at least not
yet. And unfortunately no one else did. Here, even mighty IBM has implemented for
specific applications like Oracle some form of end-to-end integrity, 'coz even
their arrays can corrupt data (or something else between can).

Now go back to your top-secret universe and praise those top secret technologies.
Yours is truly a beautiful mind.
ps. I guess we should leave him now and let him go
Try asking your IBM salesman, troll.

How many mainframe printers do you see on the internet? Tape drives?
Disk arrays? Or are you telling me IBM doesn't sell those, either?
They're sold through their Marketing division.

Crawl back into your hole with the other trolls.
--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================
Nov 14 '06 #104
Jerry Stuckle <js*******@attglobal.net> wrote:
>
How many mainframe printers do you see on the internet? Tape drives?
Disk arrays? Or are you telling me IBM doesn't sell those, either?
They're sold through their Marketing division.
Here you can find online info about IBM's mainframes, enterprise storage, enterprise
tape drives and libraries.

http://www-03.ibm.com/systems/z/
http://www-03.ibm.com/servers/storage/disk/
http://www-03.ibm.com/servers/storage/tape/
However, not a single word about your mythical array.
I told you, go back to your universe.

--
Robert Milkowski
rm**********@wp-sa.pl
http://milek.blogspot.com
Nov 14 '06 #105
Bill Todd wrote:
Jerry Stuckle wrote:
>Robert Milkowski wrote:


...
>>Reliability comes from end-to-end data integrity, which HW RAID itself can't
provide, so your data are less protected.

It can provide integrity right to the connector.


Which is not the same thing at all, and not good enough to keep
occasional data corruption out even when using the highest-grade hardware.

The point of doing end-to-end checks in main memory is not *only* to get
that last smidgeon of reliability, of course: it's also about getting
reliability comparable to the very *best* hardware solutions while using
relatively inexpensive hardware.
Really, Bill, you can't remember details from one post to the next.
This has already been covered multiple times.

One last thing before I do to you what I do to all trolls.

If everything you were to claim were true, there would be no hardware
raid devices. There would be no market for them because your precious
ZFS would negate any need for them.

But it's a good thing those in charge of critical systems know better.
And they disagree with you - 100%. That's why there is such a market,
why manufacturers build them, and why customers purchase them.

But obviously great troll Bill Todd knows more than all of these
manufacturers. He knows better than all of these customers. In fact,
he's such an expert on them that he doesn't need any facts. He can make
up his own.

And BTW - I told you how to get information on them. See your IBM Rep.
He can fill you in on all the details. Because not everything is on
the internet. And only someone with their head completely up their ass
would think there is.

So long, troll.

<plonk>

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================
Nov 15 '06 #106
Robert Milkowski wrote:
Jerry Stuckle <js*******@attglobal.net> wrote:
>>How many mainframe printers do you see on the internet? Tape drives?
Disk arrays? Or are you telling me IBM doesn't sell those, either?
They're sold through their Marketing division.


Here you can find online info about IBM's mainframes, enterprise storage, enterprise
tape drives and libraries.

http://www-03.ibm.com/systems/z/
http://www-03.ibm.com/servers/storage/disk/
http://www-03.ibm.com/servers/storage/tape/
However, not a single word about your mythical array.
I told you, go back to your universe.
Yep, you found some of their products. But do you really think these
are all of their products? Not a chance.

As I said. Contact your IBM Rep.

But I'm going to do you like I did the other troll, Bill.

So long, troll.

<Plonk>

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================
Nov 15 '06 #107
alf
Jerry Stuckle wrote:
>>There's a lot more to it. But the final result is these devices have
a lot more hardware and software, a lot more internal communications,
and a lot more firmware. And it costs a lot of money to design and
manufacture these devices. That's why you won't find them at your
local computer store.
... it's cheaper to have multiple mirrors.
>
There are a few very high end who use 3 drives and compare everything (2
out of 3 win). But these are very, very rare, and only used for the
absolutely most critical data (i.e. space missions, where they can't be
repaired/replaced easily).
can you actually name them and provide links to specific hardware
manufacturer web sites?

--
alf
Dec 9 '06 #108
alf
Jerry Stuckle wrote:
>
Real RAID arrays are not cheap. $100-500/GB is not out of the question.
And you won't find them at COMP-USA or other retailers.

Does not RAID stand for 'redundant array of inexpensive disks' :-)?

--
alfz1
Dec 9 '06 #109

alf wrote:
Jerry Stuckle wrote:

Real RAID arrays are not cheap. $100-500/GB is not out of the question.
And you won't find them at COMP-USA or other retailers.


Does not RAID stand for 'redundant array of inexpensive disks' :-)?
It stands for "false sense of security".
>
--
alfz1
Dec 9 '06 #110
alf
toby wrote:
alf wrote:
>>Jerry Stuckle wrote:

>>>Real RAID arrays are not cheap. $100-500/GB is not out of the question.
And you won't find them at COMP-USA or other retailers.


Does not RAID stand for 'redundant array of inexpensive disks' :-)?


It stands for "false sense of security".
agreed, plus the politically correct RAID stands for "redundant array of
independent disks" :-)
Dec 9 '06 #111
