MyISAM engine: worst case scenario in case of crash (mysql, O/S, hardware, whatever)

alf
Hi,

is it possible, due to an OS crash, a crash of mysql itself, or e.g. a
SCSI failure, to lose all the data stored in a table (let's say a million
1KB rows)? In other words, what is the worst-case scenario for the MyISAM
backend?
Also, is it possible not to lose data but to get it corrupted?
Thx, Andy
Nov 9 '06 #1
>is it possible, due to an OS crash, a crash of mysql itself, or e.g. a
>SCSI failure, to lose all the data stored in a table (let's say a million
>1KB rows)?
It is always possible that your computer system will catch fire and
lose all data EVEN IF IT'S POWERED OFF. And the same nuclear attack
might take out all your backups, too. And you and all your employees.
Or the whole thing could just be stolen.

Managing to smash just one sector, the sector containing the data
file inode, or worse, the sector containing the data file, index
file, AND table definition inodes, could pretty well kill a table.
I have had the experience of a hard disk controller that sometimes
flipped some bits in the sectors before writing them. It took weeks
to discover this.
>In other words, what is the worst-case scenario for the MyISAM
>backend?
Probably, total loss of data and hardware.
>Also, is it possible not to lose data but to get it corrupted?
I call that 'lost'. But yes, it is possible to end up with a bunch
of data that's bad and you don't realize it until things have gotten
much worse.

Nov 9 '06 #2
alf
Gordon Burditt wrote:
>>is it possible that due to OS crash or mysql itself crash or some e.g.
SCSI failure to lose all the data stored in the table (let's say million
of 1KB rows).

Managing to smash just one sector, the sector containing the data
file inode, or worse, the sector containing the data file, index
file, AND table definition inodes, could pretty well kill a table.
I have had the experience of a hard disk controller that sometimes
flipped some bits in the sectors before writing them. It took weeks
to discover this.

>>In other words what is the worst case scenario for MyISAM
backend?


Probably, total loss of data and hardware.
well, let's narrow it down to a mysql bug causing it to crash. Or
better, to all the situations where the transactional capabilities of
InnoDB can easily take care of recovery (to the last committed transaction).

I wonder if there is a possibility, due to the internal structure of the
MyISAM backend, of losing an entire table, where even recovery tools give up.

Would using ext3 help?
Thx in advance, Andy
Nov 9 '06 #3
alf wrote:
Gordon Burditt wrote:
>>is it possible that due to OS crash or mysql itself crash or some e.g.
SCSI failure to lose all the data stored in the table (let's say million
of 1KB rows).


Managing to smash just one sector, the sector containing the data
file inode, or worse, the sector containing the data file, index
file, AND table definition inodes, could pretty well kill a table.
I have had the experience of a hard disk controller that sometimes
flipped some bits in the sectors before writing them. It took weeks
to discover this.

>>In other words what is the worst case scenario for MyISAM
backend?

Probably, total loss of data and hardware.

well, let's narrow it down to the mysql bug causing it to crash. Or
better to the all situations where trx's capabilities of InnoDB can
easily take care of a recovery (to the last committed trx).

I wonder if there is a possibility due to internal structure of MyISAM
backend to lose entire table where even recovery tools give up.

Would using ext3 help?
Thx in advance, Andy
As Gordon said - anything's possible.

I don't see why ext3 would help. It knows nothing about the internal
format of the tables, and that's what is most likely to get screwed up
in a database crash. I would think it would be almost impossible to
recover to a consistent point in the database unless you have a very
detailed knowledge of the internal format of the files. And even then
it might be impossible if your system is very busy.

The best strategy is to keep regular backups of the database.
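For a MyISAM setup, a minimal backup sketch in SQL might look like this (the table name events is invented, and the actual copy of the .frm/.MYD/.MYI files happens outside SQL, e.g. with cp or mysqlhotcopy):

    -- Quiesce writes and flush so the on-disk files are consistent.
    FLUSH TABLES WITH READ LOCK;
    -- (copy the table files out of the data directory at the OS level here)
    UNLOCK TABLES;

    -- Optionally verify the table before or after the copy.
    CHECK TABLE events MEDIUM;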

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================
Nov 9 '06 #4
Gordon Burditt wrote:
is it possible that due to OS crash or mysql itself crash or some e.g.
SCSI failure to lose all the data stored in the table (let's say million
of 1KB rows).

It is always possible that your computer system will catch fire and
lose all data EVEN IF IT'S POWERED OFF. And the same nuclear attack
might take up all your backups, too. And you and all your employees.
Or the whole thing could just be stolen.

Managing to smash just one sector, the sector containing the data
file inode, or worse, the sector containing the data file, index
file, AND table definition inodes, could pretty well kill a table.
I have had the experience of a hard disk controller that sometimes
flipped some bits in the sectors before writing them. It took weeks
to discover this.
I spent weeks on a similar problem too - turned out to be bad RAM. The
only filesystem that I know of which can handle such hardware failures
is Sun's ZFS:
http://blogs.sun.com/bonwick/entry/zfs_end_to_end_data
>
In other words what is the worst case scenario for MyISAM
backend?

Probably, total loss of data and hardware.
Also is it possible to not to lose data but get them corrupted?

I call that 'lost'. But yes, it is possible to end up with a bunch
of data that's bad and you don't realize it until things have gotten
much worse.
Nov 9 '06 #5
alf
Jerry Stuckle wrote:
I don't see why ext3 would help.

only so as not to get the file system corrupted.
It knows nothing about the internal
format of the tables, and that's what is most likely to get screwed up
in a database crash. I would think it would be almost impossible to
recover to a consistent point in the database unless you have a very
detailed knowledge of the internal format of the files.

Well, mysql recovery procedures do have that knowledge. There are
different levels of disaster. My assumption is that the file system
survives.

>
The best strategy is to keep regular backups of the database.
in my case it is a bit different. There are millions of rows which get
inserted, live for a few minutes or hours, and then get deleted. A
backup is not even feasible. While I can afford some (1-5%) data loss
due to a crash, I still must not lose the entire table. I wonder if mysql
recovery procedures can ensure that.

--
alf
Nov 9 '06 #6
alf wrote:
Jerry Stuckle wrote:
>I don't see why ext3 would help.

only to not to get the file system corrupted.

That doesn't mean the tables themselves can't be corrupted - for instance,
if MySQL crashes in the middle of a large write operation. There is nothing
the file system can do to prevent that from happening. And you would have
to know exactly where to stop the file system restore to recover the
data - which would require a good knowledge of MySQL table structure.
>
>It knows nothing about the internal format of the tables, and that's
what is most likely to get screwed up in a database crash. I would
think it would be almost impossible to recover to a consistent point
in the database unless you have a very detailed knowledge of the
internal format of the files.

Well, mysql recovery procedures does have that knowledge. There are
different levels of disaster. My assumption is that the file system
survives.
Yes, it does. That's its job, after all. But if the tables themselves
are corrupted, nothing the file system will do will help that. And if
MySQL can't recover the data because of this, which file system you use
doesn't make any difference.
>
>>
The best strategy is to keep regular backups of the database.

in my case it is a bit different. There are millions of rows which get
inserted, live for a few minutes or hours and then they get deleted. the
backup is not even feasible. While I can afford some (1-5%) data loss
due to crash, I still must not lose entire table. Wonder if mysql
recovery procedures can ensure that.
Backups are ALWAYS feasible. And critical if you want to keep your data
safe. There is no replacement.

You can get some help by using INNODB tables and enabling the binary
log. That will allow MySQL to recover from the last good backup by
rolling the logs forward. There should be little or no loss of data.

But you still need the backups. There's no way to feasibly roll forward
a year's worth of data, for instance.
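A rough sketch of what that setup looks like from the SQL prompt, assuming binary logging has already been switched on with the log-bin option in the server configuration (the statements below only inspect and rotate the logs):

    -- Is binary logging on, and which logs exist?
    SHOW VARIABLES LIKE 'log_bin';
    SHOW BINARY LOGS;
    SHOW MASTER STATUS;

    -- Rotate to a fresh binary log right after taking a backup, so a
    -- later roll-forward can start from a clean boundary.
    FLUSH LOGS;

Recovery is then the last backup plus replaying the binary logs (for example with the mysqlbinlog utility) up to just before the failure.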

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================
Nov 9 '06 #7
alf
Jerry Stuckle wrote:
That doesn't mean the tables themselves can't be corrupted. For instance
if MySQL crashes in the middle of large write operation. Nothing the
file system can do to prevent that from happening. And you would have
to know exactly where to stop the file system restore to recover the
data - which would require a good knowledge of MySQL table structure.
I understand that.

Yes, it does. That's its job, after all. But if the tables themselves
are corrupted, nothing the file system will do will help that. And if
MySQL can't recover the data because of this, which file system you use
doesn't make any difference.
Not sure I agree. ext3 enables a quick recovery because the file system
itself keeps a transaction log (journal). In ext2 you can lose files. So
there is a small step forward.

>
Backups are ALWAYS feasible. And critical if you want to keep your data
safe. There is no replacement.
In my case backups get outdated every minute or so. There is a lot of
data coming into the DB and leaving it. Also, losing the data from the
last minute or so is not as critical (as opposed to banking systems).
What is critical is losing something like 5%. I know the system is just different.

You can get some help by using INNODB tables and enabling the binary
log. That will allow MySQL to recover from the last good backup by
rolling the logs forward. There should be little or no loss of data.

For some other reasons INNODB is not an option. My job is to find out
whether crashing mysql, or the actual hardware mysql is running on, can
lead to a significant amount of data (more than 5%) being lost. From what
I understand from this thread, it can.

Thx a lot, A.


Nov 9 '06 #8
alf wrote:
Jerry Stuckle wrote:
>That doesn't mean the tables themselves can't be corrupted. For
instance if MySQL crashes in the middle of large write operation.
Nothing the file system can do to prevent that from happening. And
you would have to know exactly where to stop the file system restore
to recover the data - which would require a good knowledge of MySQL
table structure.


I understand that.

>Yes, it does. That's its job, after all. But if the tables
themselves are corrupted, nothing the file system will do will help
that. And if MySQL can't recover the data because of this, which file
system you use doesn't make any difference.


Not sure I agree. ext3 enables a quick recovery because there is a
trxlog of the file system itself. In ext2 you can lose files. So there
is a small step froward.
So? If the file itself is corrupted, all it will do is recover a
corrupted file. What's the gain there?
>
>>
Backups are ALWAYS feasible. And critical if you want to keep your
data safe. There is no replacement.


In my case backups get outdated every minute or so. There is a lot of
data coming into DB and leaving it. Also losing the data from last
minute or so is not as critical (as opposed to banking systems).
Critical is losing like 5%. I know the system is just different.
Without backups or logs/journals, I don't think ANY RDB can provide the
recovery you want.
>
>You can get some help by using INNODB tables and enabling the binary
log. That will allow MySQL to recover from the last good backup by
rolling the logs forward. There should be little or no loss of data.

For some other reasons INNODB is not an option. My job is to find out if
crashing the mysql or the actual hardware the mysql is running on can
lead that significant amount of data (more then 5%) is lost. From what
I understand from here it is.

Thx a lot, A.

You have a problem. The file system will be able to recover a file, but
it won't be able to fix a corrupted file. And without backups and
logs/journals, neither MySQL nor any other RDB will be able to guarantee
even 1% recovery - much less 95%.

Let's say MySQL starts to completely rewrite a 100MB table. 10 bytes
into it, MySQL crashes. Your file system will see a 10-byte file and
recover that much. The other 99.99999MB will be lost. And without a
backup and binary logs, MySQL will not be able to recover.

Sure, you might be able to roll forward the file system journal. But
you'll have to know *exactly* where to stop or your database will be
inconsistent. And even if you do figure out *exactly* where to stop,
the database may still not be consistent.

You have the wrong answer to your problem. The RDB must do the
logging/journaling. For MySQL that means INNODB. MSSQL, Oracle, DB2,
etc. all have their versions of logging/journaling, also. And they
still require a backup to start.
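As a hedged sketch of that move (the table name events is invented; check first that the server build actually includes InnoDB):

    -- Is InnoDB available in this server build?
    SHOW VARIABLES LIKE 'have_innodb';

    -- Move an existing MyISAM table onto the logging engine.
    ALTER TABLE events ENGINE = InnoDB;

    -- Confirm the engine afterwards.
    SHOW TABLE STATUS LIKE 'events';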

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================
Nov 10 '06 #9
Jerry Stuckle <js*******@attglobal.net> wrote:
alf wrote:
>>
Not sure I agree. ext3 enables a quick recovery because there is a
trxlog of the file system itself. In ext2 you can lose files. So there
is a small step froward.

So? If the file itself is corrupted, all it will do is recover a
corrupted file. What's the gain there?
The gain is, that you have a chance to recover at all. With no files,
there is *no* way to recover.

However, that's not a real problem. MySQL never touches the datafile
itself once it is created. Only exception: REPAIR TABLE. This will
recreate the datafile (as a new file with extension .TMD) and then
rename the files.

DELETE just marks a record as deleted (1 bit). INSERT writes a new
record at the end of the datafile (or into a hole, if one exists).
UPDATE is done either in place or as INSERT + DELETE.

Most file operations on MyISAM tables are easier, faster and less
risky, if the table uses fixed length records. Then there is no need to
collapse adjacent unused records into one, UPDATE can be done in place,
there will be no fragmentation and such.
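For illustration only (table and column names are invented): a MyISAM table whose columns are all fixed-width gets the static row format, and ROW_FORMAT=FIXED makes the intent explicit:

    -- No VARCHAR/TEXT/BLOB columns, so MyISAM stores fixed-length records.
    CREATE TABLE msg_queue (
        id      INT UNSIGNED NOT NULL,
        created DATETIME     NOT NULL,
        payload CHAR(200)    NOT NULL
    ) ENGINE = MyISAM ROW_FORMAT = FIXED;

    -- The Row_format column should report 'Fixed' here.
    SHOW TABLE STATUS LIKE 'msg_queue';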

The MyISAM engine is quite simple. Data and index are held in separate
files. Data is structured in records. Whenever a record is modified,
it's written to disk immediately (however the operating system might
cache this). MyISAM never touches records without need. So if mysqld
goes down while in normal operation, only those records can be damaged
that were in use by active UPDATE, DELETE or INSERT operations.

There are two exceptions: REPAIR TABLE and OPTIMIZE TABLE. Both
recreate the datafile with a new name and then switch by renaming.
There is still no chance to lose *both* files.

Indexes are different, though. Indexes are organized in pages and
heavily cached. You can even instruct mysqld to never flush modified
index pages to disk (except at shutdown or cache restructuring).
However indexes can be rebuilt from scratch, without losing data.
The only thing lost is the time needed for recovery.
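From the SQL side that recovery looks roughly like this (reusing the invented msg_queue table from above; QUICK asks REPAIR to rebuild only the index file and leave the data file alone):

    -- Verify the table after an unclean shutdown.
    CHECK TABLE msg_queue;

    -- Rebuild only the index (.MYI) file; the data (.MYD) file is not rewritten.
    REPAIR TABLE msg_queue QUICK;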
HTH, XL
--
Axel Schwenke, Senior Software Developer, MySQL AB

Online User Manual: http://dev.mysql.com/doc/refman/5.0/en/
MySQL User Forums: http://forums.mysql.com/
Nov 10 '06 #10
Axel Schwenke wrote:
Jerry Stuckle <js*******@attglobal.net> wrote:
>>alf wrote:
>>>Not sure I agree. ext3 enables a quick recovery because there is a
trxlog of the file system itself. In ext2 you can lose files. So there
is a small step froward.

So? If the file itself is corrupted, all it will do is recover a
corrupted file. What's the gain there?


The gain is, that you have a chance to recover at all. With no files,
there is *no* way to recover.
What you don't get is that it's not the presence or absence of the files
- it's the CONTENTS of the files that matters. There is very little
chance you will lose the files completely in the case of a crash. There
is a much bigger chance (although admittedly still small) that the files
will be corrupted. And a huge chance that, if you have more than one table,
your database will be inconsistent.
However, thats not a real problem. MySQL never touches the datafile
itself once it is created. Only exception: REPAIR TABLE. This will
recreate the datafile (as new file with extension .TMD) and then
rename files.
Excuse me? MySQL ALWAYS touches the data file. That's where the
information is stored! And it is constantly rewriting the files to disk.
DELETE just marks a record as deleted (1 bit). INSERT writes a new
record at the end of the datafile (or into a hole, if one exists).
UPDATE is done either in place or as INSERT + DELETE.
Yes, I know exactly how MySQL works. Yep, and it has to rewrite a
portion of the file to do all of this.
Most file operations on MyISAM tables are easier, faster and less
risky, if the table uses fixed length records. Then there is no need to
collapse adjacent unused records into one, UPDATE can be done in place,
there will be no fragmentation and such.
No, no fragmentation. But what happens if the row spans a disk block
boundary and the system crashes between writes, for instance? Depending on
exactly where the block was split, you could completely screw up that row,
but it could be very difficult to detect. Sure, it's only one row. But data
corruption like this can be much worse than just losing a row. The latter
is easier to determine.
The MyISAM engine is quite simple. Data and index are held in separate
files. Data is structured in records. Whenever a record is modified,
it's written to disk immediately (however the operation system might
cache this). MyISAM never touches records without need. So if mysqld
goes down while in normal operation, only those records can be damaged
that were in use by active UPDATE, DELETE or INSERT operations.
But the caching is all too important. It's not unusual to have hundreds
of MB of disk cache in a busy system. That's a lot of data which can be
lost.
There are two exceptions: REPAIR TABLE and OPTIMIZE TABLE. Both
recreate the datafile with new name and then switch by renaming.
There is still no chance to lose *both* files.
True - but these are so seldom used it's almost not worth talking about.
And even then it's a good idea to backup the database before repairing
or optimizing it.
Indexes are different, though. Indexes are organized in pages and
heavily cached. You can even instruct mysqld to never flush modified
index pages to disk (except at shutdown or cache restructuring).
However indexes can be rebuilt from scratch, without losing data.
The only thing lost is the time needed for recovery.
True. But that's not a big concern, is it?
>
HTH, XL
--
Axel Schwenke, Senior Software Developer, MySQL AB

Online User Manual: http://dev.mysql.com/doc/refman/5.0/en/
MySQL User Forums: http://forums.mysql.com/

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================
Nov 10 '06 #11
Jerry Stuckle <js*******@attglobal.net> wrote:
Axel Schwenke wrote:
>Jerry Stuckle <js*******@attglobal.net> wrote:
>>>So? If the file itself is corrupted, all it will do is recover a
corrupted file. What's the gain there?

The gain is, that you have a chance to recover at all. With no files,
there is *no* way to recover.

What you don't get it that it's not the presence or absence of the files
- it's the CONTENTS of the files that matters.
Agreed. But Alf worried he could lose whole tables aka files.
There is very little
chance you will lose the files completely in the case of a crash. There
is a much bigger (although admittedly still small) that the files will
be corrupted. And a huge chance if you have more than one table your
database will be inconsistent.
>However, thats not a real problem. MySQL never touches the datafile
itself once it is created. Only exception: REPAIR TABLE. This will
recreate the datafile (as new file with extension .TMD) and then
rename files.

Excuse me? MySQL ALWAYS touches the data file.
Sorry, I didn't express myself clearly here: MyISAM never touches the
metadata for a data file. The file itself is created with CREATE TABLE.
Later on, data is appended to the file or some block inside the
file is modified. But the file itself stays there and there is
virtually no chance to lose it. So indeed there is no gain from using
a filesystem with metadata journaling (in fact most "journaling"
filesystems use the journal only for metadata).
And it is constantly rewriting the files to disk.
....
Yes, I know exactly how MySQL works. Yep, and it has to rewrite a
portion of the file to do all of this.
What do you call "rewrite"?

Of course MySQL writes modified data. MySQL never reads an otherwise
unmodified record and rewrites it somewhere else.
>Most file operations on MyISAM tables are easier, faster and less
risky, if the table uses fixed length records. Then there is no need to
collapse adjacent unused records into one, UPDATE can be done in place,
there will be no fragmentation and such.

... what happens if the row spans a disk and the
system crashes between writes, for instance? Depending on exactly where
the block was split, you could completely screw up that row, bug be very
difficult to detect. Sure, it's only one row. But data corruption like
this can be much worse than just losing a row. The latter is easier to
determine.
Agreed. But then again I don't know how *exactly* MyISAM does those
nonatomic writes. One could imagine that the record is first written
with a "this record is invalid" flag set. As soon as the complete
record was written successfully, this flag is cleared in an atomic
write. I know Monty is very fond of atomic operations.

But still there is no difference from what I said: if mysqld crashes,
there is a good chance that all records that mysqld was writing to
are damaged. Either incomplete or lost or such.

However, there is only very little chance to lose data that was not
written to at the time of the crash.

Dynamic vs. fixed format: Dynamic row format is susceptible to the
following problem: imagine there is a hole between two records that
will be filled by INSERT. The new record contains information about
its used and unused length. While writing the record, mysqld crashes
and garbles the length information. Now this record could look longer
than the original hole and shadow one or more of the following
(otherwise untouched) records. This would be hard to spot. Similar
problems exist with merging holes.

Fixed length records don't have this problem and are therefore more
robust.
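If one wanted to act on that, a sketch of forcing the fixed format onto an existing table (table and column names are invented; this fails if any BLOB/TEXT columns remain):

    -- See whether the table currently uses Dynamic or Fixed rows.
    SHOW TABLE STATUS LIKE 'jobs';

    -- Replace the variable-width column and force the static row format.
    ALTER TABLE jobs
        MODIFY payload CHAR(200) NOT NULL,
        ROW_FORMAT = FIXED;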
>The MyISAM engine is quite simple. Data and index are held in separate
files. Data is structured in records. Whenever a record is modified,
it's written to disk immediately (however the operation system might
cache this). MyISAM never touches records without need. So if mysqld
goes down while in normal operation, only those records can be damaged
that were in use by active UPDATE, DELETE or INSERT operations.

But the caching is all too important. It's not unusual to have hundreds
of MB of disk cache in a busy system. That's a lot of data which can be
lost.
Sure. But this problem was out of scope. We didn't talk about what
happens if the whole machine goes down, only what happens if mysqld
crashes.

Having the whole system crashing is also hard for "real" database
engines. I remember several passages in the InnoDB manual about
certain operating systems ignoring O_DIRECT for the tx log. Also
there may be "hidden" caches in disk controllers and in the disks.
XL
--
Axel Schwenke, Senior Software Developer, MySQL AB

Online User Manual: http://dev.mysql.com/doc/refman/5.0/en/
MySQL User Forums: http://forums.mysql.com/
Nov 10 '06 #12
Axel Schwenke wrote:
Jerry Stuckle <js*******@attglobal.net> wrote:
Axel Schwenke wrote:
Jerry Stuckle <js*******@attglobal.net> wrote:
...
The MyISAM engine is quite simple. Data and index are held in separate
files. Data is structured in records. Whenever a record is modified,
it's written to disk immediately (however the operation system might
cache this). MyISAM never touches records without need. So if mysqld
goes down while in normal operation, only those records can be damaged
that were in use by active UPDATE, DELETE or INSERT operations.
But the caching is all too important. It's not unusual to have hundreds
of MB of disk cache in a busy system. That's a lot of data which can be
lost.

Sure. But this problem was out of scope. We didn't talk about what
happens if the whole machine goes down, only what happens if mysqld
crashes.

Having the whole system crashing is also hard for "real" database
engines. I remember several passages in the InnoDB manual about
certain operating systems ignoring O_DIRECT for the tx log. Also
there may be "hidden" caches in disk controllers and in the disks.
Indeed. Some references here:
http://groups.google.com/group/comp....17a85b71816f98
>

XL
--
Axel Schwenke, Senior Software Developer, MySQL AB

Online User Manual: http://dev.mysql.com/doc/refman/5.0/en/
MySQL User Forums: http://forums.mysql.com/
Nov 10 '06 #13
Hi, Axel,

Comments below.

Axel Schwenke wrote:
Jerry Stuckle <js*******@attglobal.net> wrote:
>>Axel Schwenke wrote:
>>>Jerry Stuckle <js*******@attglobal.net> wrote:
So? If the file itself is corrupted, all it will do is recover a
corrupted file. What's the gain there?

The gain is, that you have a chance to recover at all. With no files,
there is *no* way to recover.

What you don't get it that it's not the presence or absence of the files
- it's the CONTENTS of the files that matters.


Agreed. But Alf worried he could lose whole tables aka files.

>>There is very little
chance you will lose the files completely in the case of a crash. There
is a much bigger (although admittedly still small) that the files will
be corrupted. And a huge chance if you have more than one table your
database will be inconsistent.

>>>However, thats not a real problem. MySQL never touches the datafile
itself once it is created. Only exception: REPAIR TABLE. This will
recreate the datafile (as new file with extension .TMD) and then
rename files.

Excuse me? MySQL ALWAYS touches the data file.


Sorry, I didn't express myself clear here: MyISAM never touches the
metadata for a data file. The file itself is created with CREATE TABLE.
Later on there is data appended to the file or some block inside the
file is modified. But the file itself stays there and there is
virtually no chance to lose it. So indeed there is no gain from using
a filesystem with metadata journaling (in fact most "journaling"
filesystems use the journal only for metadata).

>>And it is constantly rewriting the files to disk.

...
>>Yes, I know exactly how MySQL works. Yep, and it has to rewrite a
portion of the file to do all of this.


What do you call "rewrite"?

Of cource MySQL writes modified data. MySQL never reads an otherwise
unmodified record and rewrites it somewhere else.
Just what you are calling it. It reads in a block of data and writes it
back out to disk.

Even in variable length rows where the new row is longer than the old
one and MySQL appends it to the end of the file, MySQL has to go back
and rewrite the original row to mark it as invalid.
>
>>>Most file operations on MyISAM tables are easier, faster and less
risky, if the table uses fixed length records. Then there is no need to
collapse adjacent unused records into one, UPDATE can be done in place,
there will be no fragmentation and such.

... what happens if the row spans a disk and the
system crashes between writes, for instance? Depending on exactly where
the block was split, you could completely screw up that row, bug be very
difficult to detect. Sure, it's only one row. But data corruption like
this can be much worse than just losing a row. The latter is easier to
determine.


Agreed. But then again I don't know how *exactly* MyISAM does those
nonatomic writes. One could imagine that the record is first written
with a "this record is invalid" flag set. As soon as the complete
record was written successfully, this flag is cleared in an atomic
write. I know Monty is very fond of atomic operations.
Part of it is MyISAM. But part of it is the OS, also. For instance,
what happens if the row spans two physical blocks of data which are not
contiguous? In that case the OS has to write the first block, seek to
the next one and write that one.

There isn't anything Monty can do about that, unfortunately.
But still there is no difference to what I said: If mysqld crashes,
there is a good chance that all records that mysqld was writing to
are damaged. Either incomplete or lost or such.
That is true.
However, there is only very little chance to lose data that was not
written to at the time of the crash.
Actually, you would lose all data which wasn't written to the disk.
Dynamic vs. fixed format: Dynamic row format is susceptible to the
following problem: imagine there is a hole between two records that
will be filled by INSERT. The new record contains information about
its used and unused length. While writing the record, mysqld crashes
and garbles the length information. Now this record could look longer
than the original hole and shadow one or more of the following
(otherwise untouched) records. This would be hard to spot. Similar
problems exist with merging holes.
Yep, a serious problem.
Fixed length records don't have this problem and are therefore more
robust.
I agree there. But there can be other problems as I noted before. And
a single corrupted row may be worse than a completely crashed dataset
because it's so difficult to find that row. For instance - let's say we
have a bank account number which is a string and spans two blocks.
Someone makes a $10M deposit to your account. In the middle MySQL
crashes. The account number is now incorrect - the first 1/2 has been
written to one block but the 2nd 1/2 never made it out. So it credited
the deposit to my account.

Wait a sec - I LIKE that idea! :-)
>
>>>The MyISAM engine is quite simple. Data and index are held in separate
files. Data is structured in records. Whenever a record is modified,
it's written to disk immediately (however the operation system might
cache this). MyISAM never touches records without need. So if mysqld
goes down while in normal operation, only those records can be damaged
that were in use by active UPDATE, DELETE or INSERT operations.

But the caching is all too important. It's not unusual to have hundreds
of MB of disk cache in a busy system. That's a lot of data which can be
lost.


Sure. But this problem was out of scope. We didn't talk about what
happens if the whole machine goes down, only what happens if mysqld
crashes.

Having the whole system crashing is also hard for "real" database
engines. I remember several passages in the InnoDB manual about
certain operating systems ignoring O_DIRECT for the tx log. Also
there may be "hidden" caches in disk controllers and in the disks.
Agreed it's a problem. Most databases handle this with a log/journal
which writes directly to the file system and doesn't return until the
record is written. Once that is done, the real data is written
asynchronously to the tables.

In that way a crash loses at most the last record written (in the case
of an incomplete journal entry). But it still needs a consistent point
(i.e. a backup) to roll forward the log from.

But, as you pointed out, not all OS's support this. They should,
however, for critical data.
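On the MySQL side the relevant knobs look roughly like this; this is only a sketch for an InnoDB setup, and whether these values are appropriate depends entirely on the workload:

    -- How InnoDB opens/flushes its files, and how often the log is forced out.
    SHOW VARIABLES LIKE 'innodb_flush_method';
    SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit';

    -- 1 = flush (fsync) the transaction log at every commit: safest, slowest.
    SET GLOBAL innodb_flush_log_at_trx_commit = 1;

    -- Also force the binary log to disk on every write, if it is enabled.
    SET GLOBAL sync_binlog = 1;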

And BTW - some even have an option to have their own file system which
is not dependent on the OS at all. They are just provided with a space
on the disk (i.e. a partition) and handle their own I/O completely.
This, obviously, is the most secure because the RDB can handle corrupted
files - they know both the external and internal format for the data.
It's also the most efficient. But it's the hardest to implement.

>
XL
--
Axel Schwenke, Senior Software Developer, MySQL AB

Online User Manual: http://dev.mysql.com/doc/refman/5.0/en/
MySQL User Forums: http://forums.mysql.com/

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================
Nov 10 '06 #14

Jerry Stuckle wrote:
Hi, Alex,

Comments below.

Axel Schwenke wrote:
Jerry Stuckle <js*******@attglobal.net> wrote:
>Axel Schwenke wrote:

Jerry Stuckle <js*******@attglobal.net> wrote:
So? If the file itself is corrupted, all it will do is recover a
corrupted file. What's the gain there?

The gain is, that you have a chance to recover at all. With no files,
there is *no* way to recover.

What you don't get it that it's not the presence or absence of the files
- it's the CONTENTS of the files that matters.

Agreed. But Alf worried he could lose whole tables aka files.

>There is very little
chance you will lose the files completely in the case of a crash. There
is a much bigger (although admittedly still small) that the files will
be corrupted. And a huge chance if you have more than one table your
database will be inconsistent.
However, thats not a real problem. MySQL never touches the datafile
itself once it is created. Only exception: REPAIR TABLE. This will
recreate the datafile (as new file with extension .TMD) and then
rename files.

Excuse me? MySQL ALWAYS touches the data file.

Sorry, I didn't express myself clear here: MyISAM never touches the
metadata for a data file. The file itself is created with CREATE TABLE.
Later on there is data appended to the file or some block inside the
file is modified. But the file itself stays there and there is
virtually no chance to lose it. So indeed there is no gain from using
a filesystem with metadata journaling (in fact most "journaling"
filesystems use the journal only for metadata).

>And it is constantly rewriting the files to disk.
...
>Yes, I know exactly how MySQL works. Yep, and it has to rewrite a
portion of the file to do all of this.

What do you call "rewrite"?

Of cource MySQL writes modified data. MySQL never reads an otherwise
unmodified record and rewrites it somewhere else.

Just what you are calling it. It reads in a block of data and writes it
back out to disk.
Note the words "otherwise unmodified" - i.e. not affected by current
operation.
>
Even in variable length rows where the new row is longer than the old
one and MySQL appends it to the end of the file, MySQL has to go back
and rewrite the original row to mark it as invalid.
>>Most file operations on MyISAM tables are easier, faster and less
risky, if the table uses fixed length records. Then there is no need to
collapse adjacent unused records into one, UPDATE can be done in place,
there will be no fragmentation and such.

... what happens if the row spans a disk and the
system crashes between writes, for instance? ...

Agreed. But then again I don't know how *exactly* MyISAM does those
nonatomic writes. ...

Part of it is MyISAM. But part of it is the OS, also. For instance,
what happens if the row spans two physical blocks of data which are not
contiguous? In that case the OS has to write the first block, seek to
the next one and write that one.

There isn't anything Monty can do about that, unfortunately.
MyISAM doesn't claim to be transactional.
However, there is only very little chance to lose data that was not
written to at the time of the crash.

Actually, you would lose all data which wasn't written to the disk.
Axel means, data *already* written which is not being changed, i.e.
other records.
>
Dynamic vs. fixed format: Dynamic row format is susceptible to the
following problem: ...
Having the whole system crashing is also hard for "real" database
engines. I remember several passages in the InnoDB manual about
certain operating systems ignoring O_DIRECT for the tx log. Also
there may be "hidden" caches in disk controllers and in the disks.
Agreed it's a problem. Most databases handle this with a log/journal
which writes directly to the file system and doesn't return until the
record is written. Once that is done, the real data is written
asynchronously to the tables.
Yes, but how is this relevant to MyISAM?
...


XL
--
Axel Schwenke, Senior Software Developer, MySQL AB

Online User Manual: http://dev.mysql.com/doc/refman/5.0/en/
MySQL User Forums: http://forums.mysql.com/


--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================
Nov 10 '06 #15
toby wrote:
Jerry Stuckle wrote:
>>Hi, Alex,

Comments below.

Axel Schwenke wrote:
>>>Jerry Stuckle <js*******@attglobal.net> wrote:
Axel Schwenke wrote:
>Jerry Stuckle <js*******@attglobal.net> wrote:
>
>
>
>>So? If the file itself is corrupted, all it will do is recover a
>>corrupted file. What's the gain there?
>
>The gain is, that you have a chance to recover at all. With no files,
>there is *no* way to recover.

What you don't get it that it's not the presence or absence of the files
- it's the CONTENTS of the files that matters.
Agreed. But Alf worried he could lose whole tables aka files.

There is very little
chance you will lose the files completely in the case of a crash. There
is a much bigger (although admittedly still small) that the files will
be corrupted. And a huge chance if you have more than one table your
database will be inconsistent.

>However, thats not a real problem. MySQL never touches the datafile
>itself once it is created. Only exception: REPAIR TABLE. This will
>recreate the datafile (as new file with extension .TMD) and then
>rename files.

Excuse me? MySQL ALWAYS touches the data file.
Sorry, I didn't express myself clear here: MyISAM never touches the
metadata for a data file. The file itself is created with CREATE TABLE.
Later on there is data appended to the file or some block inside the
file is modified. But the file itself stays there and there is
virtually no chance to lose it. So indeed there is no gain from using
a filesystem with metadata journaling (in fact most "journaling"
filesystems use the journal only for metadata).

And it is constantly rewriting the files to disk.

...
Yes, I know exactly how MySQL works. Yep, and it has to rewrite a
portion of the file to do all of this.
What do you call "rewrite"?

Of cource MySQL writes modified data. MySQL never reads an otherwise
unmodified record and rewrites it somewhere else.

Just what you are calling it. It reads in a block of data and writes it
back out to disk.


Note the words "otherwise unmodified" - i.e. not affected by current
operation.
Depends on your definition of "otherwise unmodified". That sounds like
something different than "unmodified", doesn't it? "Otherwise
unmodified" indicates *something* has changed.

Now - if you just say "MySQL never reads an unmodified record and
rewrites it somewhere else", I will agree.
>
>>Even in variable length rows where the new row is longer than the old
one and MySQL appends it to the end of the file, MySQL has to go back
and rewrite the original row to mark it as invalid.

>>>>>Most file operations on MyISAM tables are easier, faster and less
>risky, if the table uses fixed length records. Then there is no need to
>collapse adjacent unused records into one, UPDATE can be done in place,
>there will be no fragmentation and such.

... what happens if the row spans a disk and the
system crashes between writes, for instance? ...
Agreed. But then again I don't know how *exactly* MyISAM does those
nonatomic writes. ...

Part of it is MyISAM. But part of it is the OS, also. For instance,
what happens if the row spans two physical blocks of data which are not
contiguous? In that case the OS has to write the first block, seek to
the next one and write that one.

There isn't anything Monty can do about that, unfortunately.


MyISAM doesn't claim to be transactional.
Nope, and I never said it did. But this has nothing to do with
transactions. It has to do with a single row - or even a single column
in one row - being corrupted.

Transactional has to do with multiple operations (generally including
modification of the data) in which all or none must complete. That's
not the case here.
>
>>>However, there is only very little chance to lose data that was not
written to at the time of the crash.

Actually, you would lose all data which wasn't written to the disk.


Axel means, data *already* written which is not being changed, i.e.
other records.
Could be. But that's not what he said. He said "not written to...".

Now - if he means data which was not overwritten (or in the process of
being overwritten), then I will agree.
>
>>>Dynamic vs. fixed format: Dynamic row format is susceptible to the
following problem: ...
Having the whole system crashing is also hard for "real" database
engines. I remember several passages in the InnoDB manual about
certain operating systems ignoring O_DIRECT for the tx log. Also
there may be "hidden" caches in disk controllers and in the disks.

Agreed it's a problem. Most databases handle this with a log/journal
which writes directly to the file system and doesn't return until the
record is written. Once that is done, the real data is written
asynchronously to the tables.


Yes, but how is this relevant to MyISAM?
It goes back to the crux of the original poster's problem. He wants to
use an access method which is not crash-safe and is trying to ensure the
integrity of his data - or at least a major portion of it.
>
>>...
>>>XL
--
Axel Schwenke, Senior Software Developer, MySQL AB

Online User Manual: http://dev.mysql.com/doc/refman/5.0/en/
MySQL User Forums: http://forums.mysql.com/



--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================
Nov 10 '06 #16
Agreed. But Alf worried he could lose whole tables aka files.
>

There is very little
chance you will lose the files completely in the case of a crash. There
Ok, I assume here we are talking about a mysqld crash, NOT an OS crash,
a power failure, or a hardware crash, or a hardware malfunction such
as a disk controller that writes on the wrong sectors or writes random
crap to the correct sectors.

WHY did mysqld crash? One plausible scenario is that it has gone
completely bonkers, e.g. because of a buffer-overflow virus attack
or coding error. Scribbled-on code can do anything. It's even more
likely to do something bad if the buffer-overflow was intentional.

So, you have to assume that mysqld can do anything a rogue user-level
process running with the same privileges will do: such as deleting
all the tables, or interpreting SELECT * FROM ... as DELETE FROM
.... Bye, bye, data. Any time you write data, there is a chance
of writing crap instead (buggy daemon code, buggy OS, buggy hardware,
etc.). Any time you write data, there is a chance of its being
written in the wrong place.

The worst case is considerably less ugly if you assume that mysqld
crashes because someone did a kill -9 on the daemon (it suddenly
stops with correct behavior up to the stopping point) and it is
otherwise bug-free.

The worst case is still very bad but the average case is a lot less
ugly if you assume a "clean" interruption of power: writes to the
hard disk just stop at an arbitrary point. (I have one system where
a particular disk partition usually acquires an unreadable sector
if the system crashes due to power interruption, even though 99% of
the time it's sitting there not accessing the disk, read or write).

>>is a much bigger (although admittedly still small) that the files will
be corrupted. And a huge chance if you have more than one table your
database will be inconsistent.
However, thats not a real problem. MySQL never touches the datafile
itself once it is created. Only exception: REPAIR TABLE. This will
recreate the datafile (as new file with extension .TMD) and then
rename files.
I believe this is incorrect. OPTIMIZE TABLE and ALTER TABLE (under
some circumstances, such as actually changing the schema) will also
do this. But these aren't used very often.

Now consider what happens when you attempt doing this WITH INSUFFICIENT
DISK SPACE for temporarily having two copies. I believe I have
managed to lose a table this way, although it was a scratch table
and not particularly important anyway. And this scenario has usually
"failed cleanly", although it usually leaves the partition out of
disk space so nothing much else works.

As far as I know there are very few places where MySQL chops a file and
then attempts to re-write it, and these are places where it's re-creating
the file from scratch, with the data already stored in another file
(REPAIR TABLE, OPTIMIZE TABLE, ALTER TABLE, DROP TABLE/CREATE TABLE).
It won't do that for things like mass UPDATE. It may leave some more
unused space in the data file which may be usable later when data is
INSERTed.
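One cheap sanity check before kicking off such a rebuild (table and schema names invented; the second query needs a server new enough to have information_schema):

    -- Data_length + Index_length is roughly the extra free space a
    -- REPAIR/OPTIMIZE/ALTER rebuild needs while both copies exist.
    SHOW TABLE STATUS LIKE 'events';

    SELECT (data_length + index_length) / 1024 / 1024 AS approx_mb
      FROM information_schema.TABLES
     WHERE table_schema = 'mydb' AND table_name = 'events';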
>>Excuse me? MySQL ALWAYS touches the data file.
Sorry, I didn't express myself clear here: MyISAM never touches the
metadata for a data file. The file itself is created with CREATE TABLE.
Writing on a file changes the change-time metadata for the file.
Writing on a file to extend it likely changes the list of blocks
used by a file (if it is extended by enough to add more blocks).
Later on there is data appended to the file or some block inside the
file is modified. But the file itself stays there and there is
virtually no chance to lose it. So indeed there is no gain from using
a filesystem with metadata journaling (in fact most "journaling"
filesystems use the journal only for metadata).
And it is constantly rewriting the files to disk.

...

Yes, I know exactly how MySQL works. Yep, and it has to rewrite a
portion of the file to do all of this.
What do you call "rewrite"?

Of cource MySQL writes modified data. MySQL never reads an otherwise
unmodified record and rewrites it somewhere else.
I don't think this is true for operations that copy rows of tables.
But that won't corrupt the source table.
>

Just what you are calling it. It reads in a block of data and writes it
back out to disk.

Note the words "otherwise unmodified" - i.e. not affected by current
operation.
>>
Even in variable length rows where the new row is longer than the old
one and MySQL appends it to the end of the file, MySQL has to go back
and rewrite the original row to mark it as invalid.
>
Most file operations on MyISAM tables are easier, faster and less
risky, if the table uses fixed length records. Then there is no need to
collapse adjacent unused records into one, UPDATE can be done in place,
there will be no fragmentation and such.

... what happens if the row spans a disk and the
system crashes between writes, for instance? ...
Agreed. But then again I don't know how *exactly* MyISAM does those
nonatomic writes. ...

Part of it is MyISAM. But part of it is the OS, also. For instance,
what happens if the row spans two physical blocks of data which are not
contiguous? In that case the OS has to write the first block, seek to
the next one and write that one.

There isn't anything Monty can do about that, unfortunately.

MyISAM doesn't claim to be transactional.
However, there is only very little chance to lose data that was not
written to at the time of the crash.

Actually, you would lose all data which wasn't written to the disk.

Axel means, data *already* written which is not being changed, i.e.
other records.
>>
Dynamic vs. fixed format: Dynamic row format is susceptible to the
following problem: ...
Having the whole system crashing is also hard for "real" database
engines. I remember several passages in the InnoDB manual about
certain operating systems ignoring O_DIRECT for the tx log. Also
there may be "hidden" caches in disk controllers and in the disks.
Agreed it's a problem. Most databases handle this with a log/journal
which writes directly to the file system and doesn't return until the
record is written. Once that is done, the real data is written
asynchronously to the tables.

Yes, but how is this relevant to MyISAM?
Nov 10 '06 #17

Jerry Stuckle wrote:
toby wrote:
Jerry Stuckle wrote:
>Hi, Alex,

Comments below.

Axel Schwenke wrote:

Jerry Stuckle <js*******@attglobal.net> wrote:
Axel Schwenke wrote:
... MyISAM never touches the
metadata for a data file. The file itself is created with CREATE TABLE.
Later on there is data appended to the file or some block inside the
file is modified. But the file itself stays there and there is
virtually no chance to lose it. So indeed there is no gain from using
a filesystem with metadata journaling (in fact most "journaling"
filesystems use the journal only for metadata).

And it is constantly rewriting the files to disk.

...
Yes, I know exactly how MySQL works. Yep, and it has to rewrite a
portion of the file to do all of this.
What do you call "rewrite"?

Of cource MySQL writes modified data. MySQL never reads an otherwise
unmodified record and rewrites it somewhere else.
Just what you are calling it. It reads in a block of data and writes it
back out to disk.

Note the words "otherwise unmodified" - i.e. not affected by current
operation.

Depends on your definition of "otherwise unmodified". That sounds like
something different than "unmodified", doesn't it? "Otherwise
unmodified" indicates *something* has changed.

Now - if you just say "MySQL never reads an unmodified record and
rewrites it somewhere else", I will agree.
I think that's exactly what Axel meant, yes.
>
>Even in variable length rows where the new row is longer than the old
one and MySQL appends it to the end of the file, MySQL has to go back
and rewrite the original row to mark it as invalid.
Most file operations on MyISAM tables are easier, faster and less
risky, if the table uses fixed length records. Then there is no need to
collapse adjacent unused records into one, UPDATE can be done in place,
there will be no fragmentation and such.

... what happens if the row spans a disk and the
system crashes between writes, for instance? ...
Agreed. But then again I don't know how *exactly* MyISAM does those
nonatomic writes. ...
Part of it is MyISAM. But part of it is the OS, also. For instance,
what happens if the row spans two physical blocks of data which are not
contiguous? In that case the OS has to write the first block, seek to
the next one and write that one.

There isn't anything Monty can do about that, unfortunately.

MyISAM doesn't claim to be transactional.

Nope, and I never said it did. But this has nothing to do with
transactions. It has to do with a single row - or even a single column
in one row - being corrupted.

Transactional has to do with multiple operations (generally including
modification of the data) in which all or none must complete. That's
not the case here.
The problem you describe is solved by transactional engines.
>
>>However, there is only very little chance to lose data that was not
written to at the time of the crash.
Actually, you would lose all data which wasn't written to the disk.

Axel means, data *already* written which is not being changed, i.e.
other records.

Could be. But that's not what he said. He said "not written to...".

Now - if he means data which was not overwritten (or in the progress of
being overwritten), then I will agree.
Again, I think that's what he meant.
>
>>Dynamic vs. fixed format: Dynamic row format is susceptible to the
following problem: ...
Having the whole system crashing is also hard for "real" database
engines. I remember several passages in the InnoDB manual about
certain operating systems ignoring O_DIRECT for the tx log. Also
there may be "hidden" caches in disk controllers and in the disks.
Agreed it's a problem. Most databases handle this with a log/journal
which writes directly to the file system and doesn't return until the
record is written. Once that is done, the real data is written
asynchronously to the tables.

Yes, but how is this relevant to MyISAM?

It goes back to the crux of the original poster's problem. He wants to
use an access method which is not crash-safe and is trying to ensure the
integrity of his data - or at least a major portion of it.
I guess you/Axel have covered some of the points where this just isn't
possible. OP really ought to consider a different engine, no?
>
>...

XL
--
Axel Schwenke, Senior Software Developer, MySQL AB

Online User Manual: http://dev.mysql.com/doc/refman/5.0/en/
MySQL User Forums: http://forums.mysql.com/


--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================
Nov 10 '06 #18
toby wrote:
Jerry Stuckle wrote:
>>toby wrote:
>>>Jerry Stuckle wrote:
Hi, Alex,

Comments below.

Axel Schwenke wrote:
>Jerry Stuckle <js*******@attglobal.net> wrote:
>
>
>
>>Axel Schwenke wrote:
>>... MyISAM never touches the
>
>metadata for a data file. The file itself is created with CREATE TABLE.
>Later on there is data appended to the file or some block inside the
>file is modified. But the file itself stays there and there is
>virtually no chance to lose it. So indeed there is no gain from using
>a filesystem with metadata journaling (in fact most "journaling"
>filesystems use the journal only for metadata).
>
>
>
>
>>And it is constantly rewriting the files to disk.
>
>...
>
>
>
>>Yes, I know exactly how MySQL works. Yep, and it has to rewrite a
>>portion of the file to do all of this.
>
>
>What do you call "rewrite"?
>
>Of cource MySQL writes modified data. MySQL never reads an otherwise
>unmodified record and rewrites it somewhere else.
>

Just what you are calling it. It reads in a block of data and writes it
back out to disk.
Note the words "otherwise unmodified" - i.e. not affected by current
operation.

Depends on your definition of "otherwise unmodified". That sounds like
something different than "unmodified", doesn't it? "Otherwise
unmodified" indicates *something* has changed.

Now - if you just say "MySQL never reads an unmodified record and
rewrites it somewhere else", I will agree.


I think that's exactly what Axel meant, yes.

>>>>Even in variable length rows where the new row is longer than the old
one and MySQL appends it to the end of the file, MySQL has to go back
and rewrite the original row to mark it as invalid.

>>>Most file operations on MyISAM tables are easier, faster and less
>>>risky, if the table uses fixed length records. Then there is no need to
>>>collapse adjacent unused records into one, UPDATE can be done in place,
>>>there will be no fragmentation and such.
>>
>>... what happens if the row spans a disk and the
>>system crashes between writes, for instance? ...
>
>
>Agreed. But then again I don't know how *exactly* MyISAM does those
>nonatomic writes. ...
>

Part of it is MyISAM. But part of it is the OS, also. For instance,
what happens if the row spans two physical blocks of data which are not
contiguous? In that case the OS has to write the first block, seek to
the next one and write that one.

There isn't anything Monty can do about that, unfortunately.

MyISAM doesn't claim to be transactional.

Nope, and I never said it did. But this has nothing to do with
transactions. It has to do with a single row - or even a single column
in one row - being corrupted.

Transactional has to do with multiple operations (generally including
modification of the data) in which all or none must complete. That's
not the case here.


The problem you describe is solved by transactional engines.
Yes, it is solved by "transactional engines". But you don't
necessarily need to explicitly use transactions for it. For instance,
INNODB can protect against that, even if you are using autocommit
(effectively otherwise negating transactional operations).
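A tiny illustration of the distinction (table name invented): under InnoDB a single statement with autocommit on is already atomic by itself; grouping several statements requires an explicit transaction:

    -- With autocommit on (the default), this statement either fully
    -- commits or is rolled back during crash recovery; never half-applied.
    UPDATE accounts SET balance = balance - 100 WHERE id = 1;

    -- To make several statements atomic as a group, open a transaction.
    START TRANSACTION;
    UPDATE accounts SET balance = balance - 100 WHERE id = 1;
    UPDATE accounts SET balance = balance + 100 WHERE id = 2;
    COMMIT;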
>
>>>>>However, there is only very little chance to lose data that was not
>written to at the time of the crash.
>

Actually, you would lose all data which wasn't written to the disk.
Axel means, data *already* written which is not being changed, i.e.
other records.

Could be. But that's not what he said. He said "not written to...".

Now - if he means data which was not overwritten (or in the progress of
being overwritten), then I will agree.


Again, I think that's what he meant.
It could be. I can only go by what he said. And sometimes English is
not the best language, especially when discussing technical topics.
>
>>>>>Dynamic vs. fixed format: Dynamic row format is susceptible to the
>following problem: ...
>Having the whole system crashing is also hard for "real" database
>engines. I remember several passages in the InnoDB manual about
>certain operating systems ignoring O_DIRECT for the tx log. Also
>there may be "hidden" caches in disk controllers and in the disks.
>

Agreed it's a problem. Most databases handle this with a log/journal
which writes directly to the file system and doesn't return until the
record is written. Once that is done, the real data is written
asynchronously to the tables.
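To give that a concrete MySQL flavour (a sketch only, assuming InnoDB and binary logging are in use; variable names as in the 5.0 manual), the strictness of the synchronous log write is controlled by settings like these:

SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit';
SET GLOBAL innodb_flush_log_at_trx_commit = 1;   -- write and sync the log at every commit
SET GLOBAL sync_binlog = 1;                      -- sync the binary log after each write as well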
Yes, but how is this relevant to MyISAM?

It goes back to the crux of the original poster's problem. He wants to
use an access method which is not crash-safe and is trying to ensure the
integrity of his data - or at least a major portion of it.


I guess you/Axel have covered some of the points where this just isn't
possible. OP really ought to consider a different engine, no?
I agree completely.

Of course, with the additional integrity comes additional overhead.
TANSTAAFL.
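For what it's worth, moving an existing table over is a one-liner (sketch; the table name is just a placeholder):

ALTER TABLE events ENGINE=InnoDB;   -- rebuilds the table under the transactional engine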
>
>>>>...


>XL
>--
>Axel Schwenke, Senior Software Developer, MySQL AB
>
>Online User Manual: http://dev.mysql.com/doc/refman/5.0/en/
>MySQL User Forums: http://forums.mysql.com/



--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================


Nov 10 '06 #19

Jerry Stuckle wrote:
toby wrote:
Jerry Stuckle wrote:
>toby wrote:

Jerry Stuckle wrote:
Hi, Alex,

Comments below.

Axel Schwenke wrote:
Jerry Stuckle <js*******@attglobal.netwrote:

>Axel Schwenke wrote:
>... MyISAM never touches the

metadata for a data file. The file itself is created with CREATE TABLE.
Later on there is data appended to the file or some block inside the
file is modified. But the file itself stays there and there is
virtually no chance to lose it. So indeed there is no gain from using
a filesystem with metadata journaling (in fact most "journaling"
filesystems use the journal only for metadata).


>And it is constantly rewriting the files to disk.

...

>Yes, I know exactly how MySQL works. Yep, and it has to rewrite a
>portion of the file to do all of this.
What do you call "rewrite"?

Of course MySQL writes modified data. MySQL never reads an otherwise
unmodified record and rewrites it somewhere else.
Just what you are calling it. It reads in a block of data and writes it
back out to disk.
Note the words "otherwise unmodified" - i.e. not affected by current
operation.
Depends on your definition of "otherwise unmodified". That sounds like
something different than "unmodified", doesn't it? "Otherwise
unmodified" indicates *something* has changed.

Now - if you just say "MySQL never reads an unmodified record and
rewrites it somewhere else", I will agree.

I think that's exactly what Axel meant, yes.

>>>Even in variable length rows where the new row is longer than the old
one and MySQL appends it to the end of the file, MySQL has to go back
and rewrite the original row to mark it as invalid.

>>Most file operations on MyISAM tables are easier, faster and less
>>risky, if the table uses fixed length records. Then there is no need to
>>collapse adjacent unused records into one, UPDATE can be done in place,
>>there will be no fragmentation and such.
>
>... what happens if the row spans a disk and the
>system crashes between writes, for instance? ...
Agreed. But then again I don't know how *exactly* MyISAM does those
nonatomic writes. ...
Part of it is MyISAM. But part of it is the OS, also. For instance,
what happens if the row spans two physical blocks of data which are not
contiguous? In that case the OS has to write the first block, seek to
the next one and write that one.

There isn't anything Monty can do about that, unfortunately.

MyISAM doesn't claim to be transactional.
Nope, and I never said it did. But this has nothing to do with
transactions. It has to do with a single row - or even a single column
in one row - being corrupted.

Transactional has to do with multiple operations (generally including
modification of the data) in which all or none must complete. That's
not the case here.

The problem you describe is solved by transactional engines.

Yes, it is solved by "transactional engines". But you don't
necessarily need to explicitly use transactions for it. For instance,
INNODB can protect against that, even if you are using autocommit
(effectively otherwise negating transactional operations).
An autocommitted statement is no different from any other transaction,
so it benefits from the same machinery, yes.
>
>>>>However, there is only very little chance to lose data that was not
written to at the time of the crash.
Actually, you would lose all data which wasn't written to the disk.
Axel means, data *already* written which is not being changed, i.e.
other records.
Could be. But that's not what he said. He said "not written to...".

Now - if he means data which was not overwritten (or in the progress of
being overwritten), then I will agree.

Again, I think that's what he meant.

It could be. I can only go by what he said. And sometimes English is
not the best language, especially when discussing technical topics.
You apparently had more trouble deciphering his intended meaning than I
did.
>
>>>>Dynamic vs. fixed format: Dynamic row format is susceptible to the
following problem: ...
Having the whole system crashing is also hard for "real" database
engines. I remember several passages in the InnoDB manual about
certain operating systems ignoring O_DIRECT for the tx log. Also
there may be "hidden" caches in disk controllers and in the disks.
Agreed it's a problem. Most databases handle this with a log/journal
which writes directly to the file system and doesn't return until the
record is written. Once that is done, the real data is written
asynchronously to the tables.
Yes, but how is this relevant to MyISAM?
It goes back to the crux of the original poster's problem. He wants to
use an access method which is not crash-safe and is trying to ensure the
integrity of his data - or at least a major portion of it.

I guess you/Axel have covered some of the points where this just isn't
possible. OP really ought to consider a different engine, no?

I agree completely.

Of course, with the additional integrity comes additional overhead.
TANSTAAFL.
Well, each of the engines has a different sweet spot (BDB, Solid, PBXT,
Falcon) and we don't even know if the OP has a performance problem. I
think he only mentioned an integrity problem?
>
>>>...


XL
--
Axel Schwenke, Senior Software Developer, MySQL AB

Online User Manual: http://dev.mysql.com/doc/refman/5.0/en/
MySQL User Forums: http://forums.mysql.com/


--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================


Nov 10 '06 #20
Guys, could you please try to cut your quotes to a minimum?
Thanks!

Jerry Stuckle <js*******@attglobal.netwrote:
toby wrote:
>Jerry Stuckle wrote:
>>>>>>
>>Of course MySQL writes modified data. MySQL never reads an otherwise
>>unmodified record and rewrites it somewhere else.
>
>Just what you are calling it. It reads in a block of data and writes it
>back out to disk.

Note the words "otherwise unmodified" - i.e. not affected by current
operation.

Depends on your definition of "otherwise unmodified". That sounds like
something different than "unmodified", doesn't it? "Otherwise
unmodified" indicates *something* has changed.

Now - if you just say "MySQL never reads an unmodified record and
rewrites it somewhere else", I will agree.

I think that's exactly what Axel meant, yes.
I can confirm that's just what I meant.
>>>>MyISAM doesn't claim to be transactional.

Nope, and I never said it did. But this has nothing to do with
transactions. It has to do with a single row - or even a single column
in one row - being corrupted.

Transactional has to do with multiple operations (generally including
modification of the data) in which all or none must complete. That's
not the case here.

The problem you describe is solved by transactional engines.

Yes, it is solved by "transactional engines". But you don't
necessarily need to explicitly use transactions for it. For instance,
INNODB can protect against that, even if you are using autocommit
(effectively otherwise negating transactional operations).
There is no "non-transactional" operation mode if you use InnoDB. If
AUTO_COMMIT=yes, each DML statement is one implicit transaction.
And of course each modification of the InnoDB table space is tracked
by the InnoDB TX log.
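In other words (sketch only, with an illustrative table name), these two forms get exactly the same crash protection:

SET autocommit = 1;
UPDATE accounts SET balance = balance - 10 WHERE id = 42;   -- one implicit transaction

START TRANSACTION;
UPDATE accounts SET balance = balance - 10 WHERE id = 42;
COMMIT;                                                     -- the explicit equivalent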
>>>>>>However, there is only very little chance to lose data that was not
>>written to at the time of the crash.
>
>Actually, you would lose all data which wasn't written to the disk.

Axel means, data *already* written which is not being changed, i.e.
other records.

Could be. But that's not what he said. He said "not written to...".

Now - if he means data which was not overwritten (or in the progress of
being overwritten), then I will agree.

Again, I think that's what he meant.
Confirmed again.

/me starts considering that Jerry does not understand what /me means
It could be. I can only go by what he said. And sometimes English is
not the best language, especially when discussing technical topics.
Let's switch to German then :-)
>>>>Yes, but how is this relevant to MyISAM?

It goes back to the crux of the original poster's problem. He wants to
use an access method which is not crash-safe and is trying to ensure the
integrity of his data - or at least a major portion of it.

I guess you/Axel have covered some of the points where this just isn't
possible. OP really ought to consider a different engine, no?
No.

Alf said he could afford to lose some data. Not 100% of course, but up
to 5% (he said so in <2I******************************@comcast.com>).

So - maybe - MyISAM could be "good enough" for his needs.
XL
--
Axel Schwenke, Senior Software Developer, MySQL AB

Online User Manual: http://dev.mysql.com/doc/refman/5.0/en/
MySQL User Forums: http://forums.mysql.com/
Nov 10 '06 #21
Axel Schwenke wrote:
Guys, could you please try to cut your quotes to a minimum?
Thanks!

Jerry Stuckle <js*******@attglobal.netwrote:

No.

Alf said he could afford to lose some data. Not 100% of course, but up
to 5% (he said so in <2I******************************@comcast.com>).

So - maybe - MyISAM could be "good enough" for his needs.

And he also said he could NOT afford to lose all the data. Not in the
case of a MySQL crash, an OS crash, a hardware problem, whatever.

You can't guarantee that with MYISAM.

So *maybe* he can get by. But I wouldn't bet my job on it. And I
wouldn't recommend it to one of my customers in the same situation. If
I did, I would be negligent in my duties as a consultant.
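The best he can do with MyISAM is plan the damage control up front, something like this (a sketch; the table name is a placeholder, and REPAIR may silently drop rows it cannot salvage):

CHECK TABLE events;             -- look for corruption after an unclean shutdown
REPAIR TABLE events;            -- default repair: rebuilds the index, salvages what it can
REPAIR TABLE events EXTENDED;   -- slower row-by-row index rebuild if the default repair fails
-- offline equivalent: myisamchk --recover on the table's .MYI file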
XL
--
Axel Schwenke, Senior Software Developer, MySQL AB

Online User Manual: http://dev.mysql.com/doc/refman/5.0/en/
MySQL User Forums: http://forums.mysql.com/

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================
Nov 11 '06 #22

Jerry Stuckle wrote:
Axel Schwenke wrote:
Guys, could you please try to cut your quotes to a minimum?
Thanks!

Jerry Stuckle <js*******@attglobal.netwrote:

No.

Alf said he could afford to lose some data. Not 100% of course, but up
to 5% (he said so in <2I******************************@comcast.com>).

So - maybe - MyISAM could be "good enough" for his needs.

And he also said he could NOT afford to lose all the data. Not in the
case of a MySQL crash, an OS crash, a hardware problem, whatever.

You can't guarantee that with MYISAM.
Faced with OS or hardware problem, you can't guarantee it with any
engine.
>
So *maybe* he can get by. But I wouldn't bet my job on it. And I
wouldn't recommend it to one of my customers in the same situation. If
I did, I would be negligent in my duties as a consultant.
XL
--
Axel Schwenke, Senior Software Developer, MySQL AB

Online User Manual: http://dev.mysql.com/doc/refman/5.0/en/
MySQL User Forums: http://forums.mysql.com/


--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================
Nov 11 '06 #23
toby wrote:
Jerry Stuckle wrote:
>>Axel Schwenke wrote:
>>>Guys, could you please try to cut your quotes to a minimum?
Thanks!

Jerry Stuckle <js*******@attglobal.netwrote:

No.

Alf said he could afford to lose some data. Not 100% of course, but up
to 5% (he said so in <2I******************************@comcast.com>).

So - maybe - MyISAM could be "good enough" for his needs.


And he also said he could NOT afford to lose all the data. Not in the
case of a MySQL crash, an OS crash, a hardware problem, whatever.

You can't guarantee that with MYISAM.


Faced with OS or hardware problem, you can't guarantee it with any
engine.
Actually, for an OS problem, you can. Use an RDB which journals and
take regular backups. Rolling forward from the last valid backup will
restore all committed transactions.

Hardware failure is a little more difficult. At the least you need to
have your database and journal on two different disks with two different
adapters. Better is to also mirror the database and journal with
something like RAID-1 or RAID-10.

It can be guaranteed. Critical databases all use these techniques.
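In MySQL terms the roll-forward looks roughly like this (a sketch, assuming the server was started with log-bin and a recent full backup exists): restore the backup, then replay the binary log up to the failure point, e.g. by piping mysqlbinlog output into the client. The server reports what is available for replay:

SHOW MASTER STATUS;   -- current binary log file and position
SHOW BINARY LOGS;     -- the log files available for roll-forward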
>
>>So *maybe* he can get by. But I wouldn't bet my job on it. And I
wouldn't recommend it to one of my customers in the same situation. If
I did, I would be negligent in my duties as a consultant.

>>>XL
--
Axel Schwenke, Senior Software Developer, MySQL AB

Online User Manual: http://dev.mysql.com/doc/refman/5.0/en/
MySQL User Forums: http://forums.mysql.com/


--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================


Nov 11 '06 #24

Jerry Stuckle wrote:
toby wrote:
Jerry Stuckle wrote:
>And he also said he could NOT afford to lose all the data. Not in the
case of a MySQL crash, an OS crash, a hardware problem, whatever.

You can't guarantee that with MYISAM.

Faced with OS or hardware problem, you can't guarantee it with any
engine.

Actually, for an OS problem, you can. Use an RDB which journals and
take regular backups. Rolling forward from the last valid backup will
restore all committed transactions.
Nope, OS/hardware issue could mean "no journal".
>
Hardware failure is a little more difficult. At the least you need to
have your database and journal on two different disks with two different
adapters. Better is to also mirror the database and journal with
something like RAID-1 or RAID-10.
Neither of which can protect against certain hardware failures -
everyone has a story about the bulletproof RAID setup which was
scribbled over by a bad controller, or bad cable, or bad power. ZFS
buys a lot more safety (end to end verification).
>
It can be guaranteed. Critical databases all use these techniques.
I don't trust that word "guaranteed". You need backups in any case. :)
>
>So *maybe* he can get by. But I wouldn't bet my job on it. And I
wouldn't recommend it to one of my customers in the same situation. If
I did, I would be negligent in my duties as a consultant.
XL
--
Axel Schwenke, Senior Software Developer, MySQL AB

Online User Manual: http://dev.mysql.com/doc/refman/5.0/en/
MySQL User Forums: http://forums.mysql.com/
--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================


Nov 11 '06 #25
toby wrote:
Jerry Stuckle wrote:
>>toby wrote:
>>>Jerry Stuckle wrote:

And he also said he could NOT afford to lose all the data. Not in the
case of a MySQL crash, an OS crash, a hardware problem, whatever.

You can't guarantee that with MYISAM.
Faced with OS or hardware problem, you can't guarantee it with any
engine.

Actually, for an OS problem, you can. Use an RDB which journals and
take regular backups. Rolling forward from the last valid backup will
restore all committed transactions.


Nope, OS/hardware issue could mean "no journal".
Journals are written synchronously, before data is written to the
database. Also, they are preallocated - so there is no change in the
allocation units on the disk. Even an OS problem can't break that. The
worst which can happen is the last transaction isn't completely written
to disk (i.e. the database crashed in the middle of the write).

And RAID-1 and RAID-10 are fault tolerant mirrored systems. You would
have to lose multiple disks and adapters at exactly the same time to
lose the journal.
>
>>Hardware failure is a little more difficult. At the least you need to
have your database and journal on two different disks with two different
adapters. Better is to also mirror the database and journal with
something like RAID-1 or RAID-10.


Neither of which can protect against certain hardware failures -
everyone has a story about the bulletproof RAID setup which was
scribbled over by a bad controller, or bad cable, or bad power. ZFS
buys a lot more safety (end to end verification).
I don't know of anyone who has "a story" about these systems where data
was lost on RAID-1 or RAID-10.

These systems duplicate everything. They have multiple controllers.
Separate cables. Even separate power supplies in the most critical
cases. Even a power failure just powers down the device (and takes the
system down).

Also, ZFS doesn't protect against a bad disk, for instance. All it does
is guarantee the data was written properly. A failing controller can
easily overwrite the data at some later time. RAID-1 and RAID-10 could
still have that happen, but what are the chances of two separate
controllers having exactly the same failure at the same time?

I have in the past been involved in some very critical databases. They
all use various RAID devices. And the most critical use RAID-1 or RAID-10.
>
>>It can be guaranteed. Critical databases all use these techniques.


I don't trust that word "guaranteed". You need backups in any case. :)

>>>>So *maybe* he can get by. But I wouldn't bet my job on it. And I
wouldn't recommend it to one of my customers in the same situation. If
I did, I would be negligent in my duties as a consultant.

>XL
>--
>Axel Schwenke, Senior Software Developer, MySQL AB
>
>Online User Manual: http://dev.mysql.com/doc/refman/5.0/en/
>MySQL User Forums: http://forums.mysql.com/

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================
Nov 11 '06 #26
Jerry Stuckle wrote:
toby wrote:
Jerry Stuckle wrote:
>toby wrote:

Jerry Stuckle wrote:

And he also said he could NOT afford to lose all the data. Not in the
case of a MySQL crash, an OS crash, a hardware problem, whatever.

You can't guarantee that with MYISAM.
Faced with OS or hardware problem, you can't guarantee it with any
engine.
Actually, for an OS problem, you can. Use an RDB which journals and
take regular backups. Rolling forward from the last valid backup will
restore all committed transactions.

Nope, OS/hardware issue could mean "no journal".

Journals are written synchronously, before data is written to the
database. Also, they are preallocated - so there is no change in the
allocation units on the disk. Even an OS problem can't break that. The
worst which can happen is the last transaction isn't completely written
to disk (i.e. the database crashed in the middle of the write).

And RAID-1 and RAID-10 are fault tolerant mirrored systems. You would
have to lose multiple disks and adapters at exactly the same time to
lose the journal.
As long as there is a single point of failure (software or firmware bug
for instance)...
>
>Hardware failure is a little more difficult. At the least you need to
have your database and journal on two different disks with two different
adapters. Better is to also mirror the database and journal with
something like RAID-1 or RAID-10.

Neither of which can protect against certain hardware failures -
everyone has a story about the bulletproof RAID setup which was
scribbled over by a bad controller, or bad cable, or bad power. ZFS
buys a lot more safety (end to end verification).

I don't know of anyone who has "a story" about these systems where data
was lost on RAID-1 or RAID-10.
It hasn't happened to me either, but it has happened to many others.
>
These systems duplicate everything. They have multiple controllers.
Separate cables. Even separate power supplies in the most critical
cases. Even a power failure just powers down the device (and takes the
system down).

Also, ZFS doesn't protect against a bad disk, for instance. All it does
is guarantee the data was written properly.
It does considerably better than RAID-1 here, in several ways - by
verifying writes; verifying reads; by healing immediately a data error
is found; and by (optionally) making scrubbing passes to reduce the
possibility of undetected loss (this also works for conventional RAID
of course, subject to error detection limitations).
A failing controller can
easily overwrite the data at some later time. RAID-1 and RAID-10 could
still have that happen, but what are the chances of two separate
controllers having exactly the same failure at the same time?
The difference is that ZFS will see the problem (checksum) and
automatically salvage the data from the good side, while RAID-1 will
not discover the damage (only reads from one side of the mirror).
Obviously checksumming is the critical difference; RAID-1 is entirely
dependent on the drive correctly signalling errors (correctable or
not); it cannot independently verify data integrity and remains
vulnerable to latent data loss.
>
I have in the past been involved in some very critical databases. They
all use various RAID devices. And the most critical use RAID-1 or RAID-10.
We can do even better these days.

Related links of interest:
http://blogs.sun.com/bonwick/
http://blogs.sun.com/relling/entry/zfs_from_a_ras_point
https://www.gelato.unsw.edu.au/archi...er/003008.html
http://www.lockss.org/locksswiki/fil...urosys2006.pdf [A Fresh
Look at the Reliability of Long-term Digital Storage, 2006]
http://www.ecsl.cs.sunysb.edu/tr/rpe19.pdf [Challenges of Long-Term
Digital Archiving: A Survey, 2006]
http://www.cs.wisc.edu/~vijayan/vijayan-thesis.pdf [IRON File Systems,
2006]
http://www.tcs.hut.fi/~hhk/phd/phd_Hannu_H_Kari.pdf [Latent Sector
Faults and Reliability of Disk Arrays, 1997]
>It can be guaranteed. Critical databases all use these techniques.

I don't trust that word "guaranteed". You need backups in any case. :)

>>>So *maybe* he can get by. But I wouldn't bet my job on it. And I
wouldn't recommend it to one of my customers in the same situation. If
I did, I would be negligent in my duties as a consultant.

XL
--
Axel Schwenke, Senior Software Developer, MySQL AB

Online User Manual: http://dev.mysql.com/doc/refman/5.0/en/
MySQL User Forums: http://forums.mysql.com/


--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================
Nov 11 '06 #27

toby wrote:
Jerry Stuckle wrote:
...
A failing controller can
easily overwrite the data at some later time. RAID-1 and RAID-10 could
still have that happen, but what are the chances of two separate
controllers having exactly the same failure at the same time?

The difference is that ZFS will see the problem (checksum) and
automatically salvage the data from the good side, while RAID-1 will
not discover the damage
I should have added - you don't need *two* failures. You only need *one
silent error* to cause data loss with RAID-1. ZFS is proof against
silent errors, although of course it's still susceptible to multiple
failures (such as both mirrors suffering a whole disk failure without
repair).

Nov 11 '06 #28
toby wrote:
Jerry Stuckle wrote:
>>
Journals are written synchronously, before data is written to the
database. Also, they are preallocated - so there is no change in the
allocation units on the disk. Even an OS problem can't break that. The
worst which can happen is the last transaction isn't completely written
to disk (i.e. the database crashed in the middle of the write).

And RAID-1 and RAID-10 are fault tolerant mirrored systems. You would
have to lose multiple disks and adapters at exactly the same time to
lose the journal.


As long as there is a single point of failure (software or firmware bug
for instance)...
They will also handle hardware failures. I have never heard of any loss
of data due to hardware failures on RAID-1 or RAID-10. Can you point to
even one instance?
>
>>>>Hardware failure is a little more difficult. At the least you need to
have your database and journal on two different disks with two different
adapters. Better is to also mirror the database and journal with
something like RAID-1 or RAID-10.
Neither of which can protect against certain hardware failures -
everyone has a story about the bulletproof RAID setup which was
scribbled over by a bad controller, or bad cable, or bad power. ZFS
buys a lot more safety (end to end verification).

I don't know of anyone who has "a story" about these systems where data
was lost on RAID-1 or RAID-10.


It hasn't happened to me either, but it has happened to many others.
Specifics? Using RAID-1 or RAID-10?
>
>>These systems duplicate everything. They have multiple controllers.
Separate cables. Even separate power supplies in the most critical
cases. Even a power failure just powers down the device (and takes the
system down).

Also, ZFS doesn't protect against a bad disk, for instance. All it does
is guarantee the data was written properly.


It does considerably better than RAID-1 here, in several ways - by
verifying writes; verifying reads; by healing immediately a data error
is found; and by (optionally) making scrubbing passes to reduce the
possibility of undetected loss (this also works for conventional RAID
of course, subject to error detection limitations).
And how does it recover from a disk crash? Or what happens if the data
goes bad after being written and read back?

Additionally, it depends on the software correctly detecting and
signaling a data error.
>
>>A failing controller can
easily overwrite the data at some later time. RAID-1 and RAID-10 could
still have that happen, but what are the chances of two separate
controllers having exactly the same failure at the same time?


The difference is that ZFS will see the problem (checksum) and
automatically salvage the data from the good side, while RAID-1 will
not discover the damage (only reads from one side of the mirror).
Obviously checksumming is the critical difference; RAID-1 is entirely
dependent on the drive correctly signalling errors (correctable or
not); it cannot independently verify data integrity and remains
vulnerable to latent data loss.
If it's a single sector. But if the entire disk crashes - i.e. an
electronics failure?

But all data is mirrored. And part of the drive's job is to signal
errors. One which doesn't do that correctly isn't much good, is it?
>
>>I have in the past been involved in some very critical databases. They
all use various RAID devices. And the most critical use RAID-1 or RAID-10.


We can do even better these days.

Related links of interest:
http://blogs.sun.com/bonwick/
http://blogs.sun.com/relling/entry/zfs_from_a_ras_point
https://www.gelato.unsw.edu.au/archi...er/003008.html
http://www.lockss.org/locksswiki/fil...urosys2006.pdf [A Fresh
Look at the Reliability of Long-term Digital Storage, 2006]
http://www.ecsl.cs.sunysb.edu/tr/rpe19.pdf [Challenges of Long-Term
Digital Archiving: A Survey, 2006]
http://www.cs.wisc.edu/~vijayan/vijayan-thesis.pdf [IRON File Systems,
2006]
http://www.tcs.hut.fi/~hhk/phd/phd_Hannu_H_Kari.pdf [Latent Sector
Faults and Reliability of Disk Arrays, 1997]
So? I don't see anything in any of these articles which affects this
discussion. We're not talking about long term digital storage, for
instance.

I'm just curious. How many critical database systems have you actually
been involved with? I've lost count. When I worked for IBM, we had
banks, insurance companies, etc., all with critical databases as
customers. Probably the largest I ever worked with was a major U.S.
airline reservation system.

These systems are critical to their business. The airline database
averaged tens of thousands of operations per second. This is a critical
system. Can you imagine what would happen if they lost even 2 minutes
of reservations? Especially with today's electronic ticketing systems?
And *never* have I seen (or heard of) a loss of data other than what
was being currently processed.

BTW - NONE of them use zfs - because these are mainframe systems, not
Linux. But they all use the mainframe versions of RAID-1 or RAID-10.

In any case - this is way off topic for this newsgroup. The original
question was "Can I prevent the loss of a significant portion of my data
in the case of a MySQL, OS or hardware failure, when using MyISAM?".

The answer is no.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================
Nov 11 '06 #29
toby wrote:
toby wrote:
>>Jerry Stuckle wrote:
>>>...
A failing controller can
easily overwrite the data at some later time. RAID-1 and RAID-10 could
still have that happen, but what are the chances of two separate
controllers having exactly the same failure at the same time?

The difference is that ZFS will see the problem (checksum) and
automatically salvage the data from the good side, while RAID-1 will
not discover the damage


I should have added - you don't need *two* failures. You only need *one
silent error* to cause data loss with RAID-1. ZFS is proof against
silent errors, although of course it's still susceptible to multiple
failures (such as both mirrors suffering a whole disk failure without
repair).
ZFS is not proof against silent errors - they can still occur. It is
possible for it to miss an error, also. Plus it is not proof against
data decaying after it is written to disk. And, as you note, it doesn't
handle a disk crash.

But when properly implemented, RAID-1 and RAID-10 will detect and
correct even more errors than ZFS will.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================
Nov 11 '06 #30
Jerry Stuckle wrote:
toby wrote:
Jerry Stuckle wrote:
>
Journals are written synchronously, ...

And RAID-1 and RAID-10 are fault tolerant mirrored systems. You would
have to lose multiple disks and adapters at exactly the same time to
lose the journal.

As long as there is a single point of failure (software or firmware bug
for instance)...

They will also handle hardware failures. I have never heard of any loss
of data due to hardware failures on RAID-1 or RAID-10. Can you point to
even one instance?
There are several examples of such hardware failures in the links
cited, but I'll crosspost this to comp.arch.storage - I'll eat my hat
if no-one there has seen a RAID data loss.
>
>...
I don't know of anyone who has "a story" about these systems where data
was lost on RAID-1 or RAID-10.

It hasn't happened to me either, but it has happened to many others.

Specifics? Using RAID-1 or RAID-10?
>These systems duplicate everything. They have multiple controllers.
Separate cables. Even separate power supplies in the most critical
cases. Even a power failure just powers down the device (and takes the
system down).

Also, ZFS doesn't protect against a bad disk, for instance. All it does
is guarantee the data was written properly.

It does considerably better than RAID-1 here, in several ways - by
verifying writes; verifying reads; by healing immediately a data error
is found; and by (optionally) making scrubbing passes to reduce the
possibility of undetected loss (this also works for conventional RAID
of course, subject to error detection limitations).

And how does it recover from a disk crash? Or what happens if the data
goes bad after being written and read back?
You use the redundancy to repair it. RAID-1 does not do this.
>
Additionally, it depends on the software correctly detecting and
signaling a data error.
Which RAID-1 cannot do at all.
>
>A failing controller can
easily overwrite the data at some later time. RAID-1 and RAID-10 could
still have that happen, but what are the chances of two separate
controllers having exactly the same failure at the same time?

The difference is that ZFS will see the problem (checksum) and
automatically salvage the data from the good side, while RAID-1 will
not discover the damage (only reads from one side of the mirror).
Obviously checksumming is the critical difference; RAID-1 is entirely
dependent on the drive correctly signalling errors (correctable or
not); it cannot independently verify data integrity and remains
vulnerable to latent data loss.

If it's a single sector. But if the entire disk crashes - i.e. an
electronics failure?
That's right, it cannot bring a dead disk back to life...
>
But all data is mirrored. And part of the drive's job is to signal
errors. One which doesn't do that correctly isn't much good, is it/
You're right that RAID-1 is built on the assumption that drives
perfectly report errors. ZFS isn't.

As Richard Elling writes, "We don't have to rely on a parity protected
SCSI bus, or a bug-free disk firmware (I've got the scars) to ensure
that what is on persistent storage is what we get in memory. ... by
distrusting everything in the storage data path we will build in the
reliability and redundancy into the file system."
>
>I have in the past been involved in some very critical databases. They
all use various RAID devices. And the most critical use RAID-1 or RAID-10.

We can do even better these days.

Related links of interest:
http://blogs.sun.com/bonwick/
http://blogs.sun.com/relling/entry/zfs_from_a_ras_point
https://www.gelato.unsw.edu.au/archi...er/003008.html
http://www.lockss.org/locksswiki/fil...urosys2006.pdf [A Fresh
Look at the Reliability of Long-term Digital Storage, 2006]
http://www.ecsl.cs.sunysb.edu/tr/rpe19.pdf [Challenges of Long-Term
Digital Archiving: A Survey, 2006]
http://www.cs.wisc.edu/~vijayan/vijayan-thesis.pdf [IRON File Systems,
2006]
http://www.tcs.hut.fi/~hhk/phd/phd_Hannu_H_Kari.pdf [Latent Sector
Faults and Reliability of Disk Arrays, 1997]

So? I don't see anything in any of these articles which affects this
discussion. We're not talking about long term digital storage, for
instance.
I think that's quite relevant to many "business critical" database
systems. Databases are even evolving in response to changing
*regulatory* requirements: MySQL's ARCHIVE engine, for instance...
I'm just curious. How many critical database systems have you actually
been involved with? I've lost count. ...
These systems are critical to their business. ...
None of this is relevant to what I'm trying to convey, which is simply:
What ZFS does beyond RAID.

Why are you taking the position that they are equivalent? There are
innumerable failure modes that RAID(-1) cannot handle, which ZFS does.
>
BTW - NONE of them use zfs - because these are mainframe systems, not
Linux. But they all use the mainframe versions of RAID-1 or RAID-10.
I still claim - along with Sun - that you can, using more modern
software, improve on the integrity and availability guarantees of
RAID-1. This applies equally to the small systems I specify (say, a
small mirrored disk server storing POS account data) as to their
humongous storage arrays.
>
In any case - this is way off topic for this newsgroup. The original
question was "Can I prevent the loss of a significant portion of my data
in the case of a MySQL, OS or hardware failure, when using MyISAM?".

The answer is no.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================
Nov 11 '06 #31
Jerry Stuckle wrote:

....
ZFS is not proof against silent errors - they can still occur.
Of course they can, but they will be caught by the background
verification scrubbing before much time passes (i.e., within a time
window that radically reduces the likelihood that another disk will fail
before the error is caught and corrected), unlike the case with
conventional RAID (where they aren't caught at all, and rise up to bite
you - with non-negligible probability these days - if the good copy then
dies).

And ZFS *is* proof against silent errors in the sense that data thus
mangled will not be returned to an application (i.e., it will be caught
when read if the background integrity validation has not yet reached it)
- again, unlike the case with conventional mirroring, where there's a
good chance that it will be returned to the application as good.

It is
possible for it to miss an error, also.
It is also possible for all the air molecules in your room to decide -
randomly - to congregate in the corner, and for you to be asphyxiated.
Most people needn't worry about probabilities of these magnitudes.

Plus it is not proof against
data decaying after it is written to disk.
No - but, again, it will catch it before long, even in cases where
conventional disk scrubbing would not.

And, as you note, it doesn't
handle a disk crash.
It handles it with resilience comparable to RAID-1, but is more flexible
in that it can then use distributed free space to restore the previous
level of redundancy (whereas RAID-1/RAID-10 cannot unless the number of
configured hot spare disks equals the number of failed disks).
>
But when properly implemented, RAID-1 and RAID-10 will detect and
correct even more errors than ZFS will.
Please name even one.

- bill
Nov 11 '06 #32
>>>Actually, for an OS problem, you can. Use an RDB which journals and
>>>take regular backups. Rolling forward from the last valid backup will
restore all committed transactions.


Nope, OS/hardware issue could mean "no journal".
OS issue could mean "no disk writes", PERIOD.
>Journals are written synchronously, before data is written to the
database. Also, they are preallocated - so there is no change in the
allocation units on the disk. Even an OS problem can't break that.
Yes, it can. An OS can decide not to write data at all. (consider
an anti-virus program that monitors disk writes hooked into the
OS). Or, at any time, it can erase all the data. (Consider
accidentally zeroing out a sector containing inodes, including a
file someone else was using and the journal and some table .MYD
files. Oh, yes, remember READING a file modifies the inode (accessed
time)). Or it could write any data over the last sector read (which
might be part of the mysqld executable).
>The
worst which can happen is the last transaction isn't completely written
to disk (i.e. the database crashed in the middle of the write).
When you think worst-case OS failure, think VIRUS. When you think
worst-case hardware failure, think EXPLOSION. Or FIRE. Or disk
head crash. When you think worst-case power-failure situation,
think burned-out circuits. Or erased track.

>And RAID-1 and RAID-10 are fault tolerant mirrored systems. You would
have to lose multiple disks and adapters at exactly the same time to
loose the journal.
Only if the OS writes it in the first place, and the RAID controller
isn't broken, and the OS doesn't scribble over it later.
>>>Hardware failure is a little more difficult. At the least you need to
have your database and journal on two different disks with two different
adapters.
On different planets.
>>>Better is to also mirror the database and journal with
something like RAID-1 or RAID-10.


Neither of which can protect against certain hardware failures -
everyone has a story about the bulletproof RAID setup which was
scribbled over by a bad controller, or bad cable, or bad power. ZFS
buys a lot more safety (end to end verification).

I don't know of anyone who has "a story" about these systems where data
was lost on RAID-1 or RAID-10.
It's certainly possible to rapidly lose data when you type in an
UPDATE or DELETE query and accidentally type a semicolon instead
of ENTER just before you were going to type WHERE. RAID (or MySQL's
replication setup) does just what it's supposed to do and updates
all the copies with the bad data.

I'm not trying to discourage use of RAID. It can save your butt in
lots of situations. But it doesn't work miracles.
>These systems duplicate everything. They have multiple controllers.
Separate cables. Even separate power supplies in the most critical
cases. Even a power failure just powers down the device (and take the
system down).
>Also, ZFS doesn't protect against a bad disk, for instance. All it does
is guarantee the data was written properly. A failing controller can
easily overwrite the data at some later time. RAID-1 and RAID-10 could
still have that happen, but what are the chances of two separate
controllers having exactly the same failure at the same time?

I have in the past been involved in some very critical databases. They
all use various RAID devices. And the most critical use RAID-1 or RAID-10.
>>
>>>It can be guaranteed. Critical databases all use these techniques.


I don't trust that word "guaranteed". You need backups in any case. :)
Nov 11 '06 #33

Jerry Stuckle wrote:
toby wrote:
toby wrote:
>Jerry Stuckle wrote:

...
A failing controller can
easily overwrite the data at some later time. RAID-1 and RAID-10 could
still have that happen, but what are the chances of two separate
controllers having exactly the same failure at the same time?

The difference is that ZFS will see the problem (checksum) and
automatically salvage the data from the good side, while RAID-1 will
not discover the damage

I should have added - you don't need *two* failures. You only need *one
silent error* to cause data loss with RAID-1. ZFS is proof against
silent errors, although of course it's still susceptible to multiple
failures (such as both mirrors suffering a whole disk failure without
repair).

ZFS is not proof against silent errors - they can still occur. It is
possible for it to miss an error, also. Plus it is not proof against
data decaying after it is written to disk.
Actually both capabilities are among its strongest features.

Clearly you haven't read or understood any of the publicly available
information about it, so I'm not going to pursue this any further
beyond relating an analogy:
You will likely live longer if you look both ways before crossing
the road, rather than walking straight across without looking because
"cars will stop".
...
But when properly implemented, RAID-1 and RAID-10 will detect and
correct even more errors than ZFS will.
I'll let those with more patience refute this.
>
--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================
Nov 11 '06 #34
On Sat, 11 Nov 2006, toby wrote:
Jerry Stuckle wrote:
...
But when properly implemented, RAID-1 and RAID-10 will detect and
correct even more errors than ZFS will.

I'll let those with more patience refute this.
Jerry, what are you smoking? Do you actually know what ZFS is? And
if so, what about the case where, in the context of your assertion I
quoted above, ZFS is used to implement RAID 1 and RAID 10 (which,
incidentally, it is VERY frequently used to do)?

I agree with Toby: you need to read a bit more about ZFS. If you're
a storage nut (meant in a non-disparaging way!), I think you'll like
what you read.

--
Rich Teer, SCSA, SCNA, SCSECA, OpenSolaris CAB member

President,
Rite Online Inc.

Voice: +1 (250) 979-1638
URL: http://www.rite-group.com/rich
Nov 11 '06 #35
Bill Todd wrote:
Jerry Stuckle wrote:

...
>ZFS is not proof against silent errors - they can still occur.


Of course they can, but they will be caught by the background
verification scrubbing before much time passes (i.e., within a time
window that radically reduces the likelihood that another disk will fail
before the error is caught and corrected), unlike the case with
conventional RAID (where they aren't caught at all, and rise up to bite
you - with non-negligible probability these days - if the good copy then
dies).

And ZFS *is* proof against silent errors in the sense that data thus
mangled will not be returned to an application (i.e., it will be caught
when read if the background integrity validation has not yet reached it)
- again, unlike the case with conventional mirroring, where there's a
good chance that it will be returned to the application as good.

The same is true with RAID-1 and RAID-10. An error on the disk will be
detected by the hardware and reported to the OS.

It is
>possible for it to miss an error, also.


It is also possible for all the air molecules in your room to decide -
randomly - to congregate in the corner, and for you to be asphyxiated.
Most people needn't worry about probabilities of these magnitudes.
About the same chances of ZFS missing an error as RAID-1 or RAID-10.
The big difference is that ZFS is done in software, which requires CPU
cycles and other resources. It's also open to corruption. RAID-1 and
RAID-10 are implemented in hardware/firmware which cannot be corrupted
(Read only memory) and require no CPU cycles.
Plus it is not proof against
>data decaying after it is written to disk.


No - but, again, it will catch it before long, even in cases where
conventional disk scrubbing would not.
So do RAID-1 and RAID-10.
And, as you note, it doesn't
>handle a disk crash.


It handles it with resilience comparable to RAID-1, but is more flexible
in that it can then use distributed free space to restore the previous
level of redundancy (whereas RAID-1/RAID-10 cannot unless the number of
configured hot spare disks equals the number of failed disks).
And for a critical system you have that redundancy and more.
>>
But when properly implemented, RAID-1 and RAID-10 will detect and
correct even more errors than ZFS will.
A complete disk crash, for instance. Even Toby admitted ZFS cannot
recover from a disk crash.

ZFS is good. But it's a cheap software implementation of an expensive
hardware recovery system. And there is no way software can do it as
well as hardware does.

That's why all critical systems use hardware systems such as RAID-1 and
RAID-10.

>
Please name even one.

- bill

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================
Nov 12 '06 #36
toby wrote:
Jerry Stuckle wrote:
>>toby wrote:
>>>toby wrote:
Jerry Stuckle wrote:
>...
>A failing controller can
>easily overwrite the data at some later time. RAID-1 and RAID-10 could
>still have that happen, but what are the chances of two separate
>controllers having exactly the same failure at the same time?

The difference is that ZFS will see the problem (checksum) and
automatically salvage the data from the good side, while RAID-1 will
not discover the damage
I should have added - you don't need *two* failures. You only need *one
silent error* to cause data loss with RAID-1. ZFS is proof against
silent errors, although of course it's still susceptible to multiple
failures (such as both mirrors suffering a whole disk failure without
repair).

ZFS is not proof against silent errors - they can still occur. It is
possible for it to miss an error, also. Plus it is not proof against
data decaying after it is written to disk.


Actually both capabilities are among its strongest features.

Clearly you haven't read or understood any of the publicly available
information about it, so I'm not going to pursue this any further
beyond relating an analogy:
You will likely live longer if you look both ways before crossing
the road, rather than walking straight across without looking because
"cars will stop".
Actually, I understand quite a bit about ZFS. However, unlike you, I
also understand its shortcomings. That's because I started working on
fault-tolerant drive systems in 1977 as a hardware CE for IBM,
working on large mainframes. I've watched it grow over the years. And
as an EE major, I also understand the hardware and its strengths and
weaknesses - in detail.

And as a CS major (dual majors) and programmer since 1867, including
working on system software for IBM in the 1980's I have a thorough
understanding of the software end.

And it's obvious from your statements you have no real understanding of
either, other than sales literature.
>
>>...
But when properly implemented, RAID-1 and RAID-10 will detect and
correct even more errors than ZFS will.


I'll let those with more patience refute this.
And more knowledge of the real facts?

BTW - I took out all those extra newsgroups you added. If I wanted to
discuss things there I would have added them myself.
But I'm also not going to discuss this any more with you. I'd really
rather have discussions with someone who really knows the internals - of
both systems.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================
Nov 12 '06 #37
Rich Teer wrote:
On Sat, 11 Nov 2006, toby wrote:

>>Jerry Stuckle wrote:
>>>...
But when properly implemented, RAID-1 and RAID-10 will detect and
correct even more errors than ZFS will.

I'll let those with more patience refute this.


Jerry, what are you smoking? Do you actually know what ZFS is, and
if so what if, in the context of your assertion I quoted above, ZFS
is used to implement RAID 1 and RAID 10 (which, incidentally, it is
VERY frequently used to do)?

I agree with Toby: you need to read a bit more about ZFS. If you're
a storage nut (meant in a non-disparaging way!), I think you'll like
what you read.
I'm not smoking anything.

REAL RAID-1 and RAID-10 are implemented in hardware/firmware, not a
software system such as ZFS.

Of course, there are some systems out there which CLAIM to be RAID-1 or
RAID-10, but implement them in software such as ZFS. What they really
are is RAID-1/RAID-10 compliant.

And BTW - I've taken out the extra newsgroups. They have nothing to do
with this discussion.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================
Nov 12 '06 #38

Jerry Stuckle wrote:
toby wrote:
Jerry Stuckle wrote:
>toby wrote:

toby wrote:
Jerry Stuckle wrote:
...
A failing controller can
easily overwrite the data at some later time. RAID-1 and RAID-10 could
still have that happen, but what are the chances of two separate
controllers having exactly the same failure at the same time?

The difference is that ZFS will see the problem (checksum) and
automatically salvage the data from the good side, while RAID-1 will
not discover the damage
I should have added - you don't need *two* failures. You only need *one
silent error* to cause data loss with RAID-1. ZFS is proof against
silent errors, although of course it's still susceptible to multiple
failures (such as both mirrors suffering a whole disk failure without
repair).
ZFS is not proof against silent errors - they can still occur. It is
possible for it to miss an error, also. Plus it is not proof against
data decaying after it is written to disk.

Actually both capabilities are among its strongest features.

Clearly you haven't read or understood any of the publicly available
information about it, so I'm not going to pursue this any further
beyond relating an analogy:
You will likely live longer if you look both ways before crossing
the road, rather than walking straight across without looking because
"cars will stop".

Actually, I understand quite a bit about ZFS. However, unlike you, I
also understand its shortcomings. That's because I started working on
fault-tolerant drive systems in 1977 as a hardware CE for IBM,
working on large mainframes. I've watched it grow over the years. And
as an EE major, I also understand the hardware and its strengths and
weaknesses - in detail.

And as a CS major (dual majors) and programmer since 1867, including
working on system software for IBM in the 1980's I have a thorough
understanding of the software end.

And it's obvious from your statements you have no real understanding of
either, other than sales literature.
This isn't about a battle of the egos. I was challenging what seemed to
be factual misunderstandings of ZFS relative to RAID. Perhaps we're
talking at cross purposes; you had trouble getting Axel's point also...
>
>...
But when properly implemented, RAID-1 and RAID-10 will detect and
correct even more errors than ZFS will.

I'll let those with more patience refute this.

And more knowledge of the real facts?

BTW - I took out all those extra newsgroups you added. If I wanted to
discuss things there I would have added them myself.
But I'm also not going to discuss this any more with you. I'd really
rather have discussions with someone who really knows the internals - of
both systems.
You'll find them in the newsgroups you snipped, not here. I'm sorry
things degenerated to this point, but I stand by my corrections of your
strange views on ZFS' capabilities.
>
--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================
Nov 12 '06 #39
toby wrote:
Jerry Stuckle wrote:
>>toby wrote:
>>>Jerry Stuckle wrote:
Journals are written synchronously, ...

And RAID-1 and RAID-10 are fault tolerant mirrored systems. You would
have to lose multiple disks and adapters at exactly the same time to
loose the journal.
As long as there is a single point of failure (software or firmware bug
for instance)...

They will also handle hardware failures. I have never heard of any loss
of data due to hardware failures on RAID-1 or RAID-10. Can you point to
even one instance?


There are several examples of such hardware failures in the links
cited, but I'll crosspost this to comp.arch.storage - I'll eat my hat
if no-one there has seen a RAID data loss.
I've seen those links. I have yet to see where there was any loss of
data proven. Some conjectures in blogs, for instance. But I want to
see documented facts.

And I've removed the cross-post. If I want a discussion in
comp.arch.storage, I will post in it.
>
>>>>...
I don't know of anyone who has "a story" about these systems where data
was lost on RAID-1 or RAID-10.
It hasn't happened to me either, but it has happened to many others.

Specifics? Using RAID-1 or RAID-10?

>>>>These systems duplicate everything. They have multiple controllers.
Separate cables. Even separate power supplies in the most critical
cases. Even a power failure just powers down the device (and take the
system down).

Also, ZFS doesn't protect against a bad disk, for instance. All it does
is guarantee the data was written properly.
It does considerably better than RAID-1 here, in several ways - by
verifying writes; verifying reads; by healing immediately a data error
is found; and by (optionally) making scrubbing passes to reduce the
possibility of undetected loss (this also works for conventional RAID
of course, subject to error detection limitations).

And how does it recover from a disk crash? Or what happens if the data
goes bad after being written and read back?


You use the redundancy to repair it. RAID-1 does not do this.
No, RAID-1 has complete mirrors of the data. And if it detects an error
on the primary disk it can correct the error from the mirror, automatically.
>
>>Additionally, it depends on the software correctly detecting and
signaling a data error.


Which RAID-1 cannot do at all.
Actually, RAID-1 does do it. In case you aren't aware, all sectors on
the disks are checksummed. If there is a failure, the hardware will
detect it, long before it even gets to the software. The hardware can
even retry the operation, or it can go straight to the mirror.
>
>>>>A failing controller can
easily overwrite the data at some later time. RAID-1 and RAID-10 could
still have that happen, but what are the chances of two separate
controllers having exactly the same failure at the same time?
The difference is that ZFS will see the problem (checksum) and
automatically salvage the data from the good side, while RAID-1 will
not discover the damage (only reads from one side of the mirror).
Obviously checksumming is the critical difference; RAID-1 is entirely
dependent on the drive correctly signalling errors (correctable or
not); it cannot independently verify data integrity and remains
vulnerable to latent data loss.

If it's a single sector. But if the entire disk crashes - i.e. an
electronics failure?


That's right, it cannot bring a dead disk back to life...
Nope, but the mirror still contains the data.
>
>>But all data is mirrored. And part of the drive's job is to signal
errors. One which doesn't do that correctly isn't much good, is it?


You're right that RAID-1 is built on the assumption that drives
perfectly report errors. ZFS isn't.
Do you really understand how drives work? I mean the actual electronics
of it? Could you read a schematic, scope a failing drive down to the
bad component? Do you have that level of knowledge?

If not, please don't make statements you have no real understanding of.
I can do that, and more. And I have done it.
As Richard Elling writes, "We don't have to rely on a parity protected
SCSI bus, or a bug-free disk firmware (I've got the scars) to ensure
that what is on persistent storage is what we get in memory. ... by
distrusting everything in the storage data path we will build in the
reliability and redundancy into the file system."
So, you read a few statements and argue your point without any real
technical knowledge of what goes on behind the scenes?

Can you tell me the chances of having an undetected problem on a parity
protected SCSI bus? Or even a non-parity protected one? And can you
give me the details of the most common causes of those? I thought not.

And bug-free disk firmware? Disk firmware is a LOT more bug free than
any OS software I've ever seen, including Linux. That's because it has
to do a limited amount of operations with a limited interface.

Unlike a file system which has to handle many additional operations on
different disk types and configurations.

And BTW - how many disk firmware bugs have you heard about recently? I
don't say they can't occur. But the reliable disk manufacturers check,
double-check and triple-check their code before it goes out. Then they
test it again.
>
>>>>I have in the past been involved in some very critical databases. They
all use various RAID devices. And the most critical use RAID-1 or RAID-10.
We can do even better these days.

Related links of interest:
http://blogs.sun.com/bonwick/
http://blogs.sun.com/relling/entry/zfs_from_a_ras_point
https://www.gelato.unsw.edu.au/archi...er/003008.html
http://www.lockss.org/locksswiki/fil...urosys2006.pdf [A Fresh
Look at the Reliability of Long-term Digital Storage, 2006]
http://www.ecsl.cs.sunysb.edu/tr/rpe19.pdf [Challenges of Long-Term
Digital Archiving: A Survey, 2006]
http://www.cs.wisc.edu/~vijayan/vijayan-thesis.pdf [IRON File Systems,
2006]
http://www.tcs.hut.fi/~hhk/phd/phd_Hannu_H_Kari.pdf [Latent Sector
Faults and Reliability of Disk Arrays, 1997]

So? I don't see anything in any of these articles which affects this
discussion. We're not talking about long term digital storage, for
instance.


I think that's quite relevant to many "business critical" database
systems. Databases are even evolving in response to changing
*regulatory* requirements: MySQL's ARCHIVE engine, for instance...
What does MySQL's ARCHIVE engine have to do with "regulatory
requirements"? In case you haven't noticed, MySQL is NOT a US company
(although they do have a U.S. subsidiary).
>
>>I'm just curious. How many critical database systems have you actually
been involved with? I've lost count. ...
These systems are critical to their business. ...


None of this is relevant to what I'm trying to convey, which is simply:
What ZFS does beyond RAID.

Why are you taking the position that they are equivalent? There are
innumerable failure modes that RAID(-1) cannot handle, which ZFS does.
I'm not taking the position they are equivalent. I'm taking the
position that ZFS is an inferior substitute for a true RAID-1 or RAID-10
implementation.
>
>>BTW - NONE of them use zfs - because these are mainframe systems, not
Linux. But they all use the mainframe versions of RAID-1 or RAID-10.


I still claim - along with Sun - that you can, using more modern
software, improve on the integrity and availability guarantees of
RAID-1. This applies equally to the small systems I specify (say, a
small mirrored disk server storing POS account data) as to their
humongous storage arrays.
OK, you can maintain it. But a properly configured and operating RAID-1
or RAID-10 array needs no such assistance.
>
>>In any case - this is way off topic for this newsgroup. The original
question was "Can I prevent the loss of a significant portion of my data
in the case of a MySQL, OS or hardware failure, when using MyISAM?".

The answer is no.
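
The most you can do after the fact is detect the damage and try to repair
it. A rough sketch, assuming MySQL Connector/Python; the connection details
and the table name are placeholders:

import mysql.connector

conn = mysql.connector.connect(host="localhost", user="app",
                               password="secret", database="shop")
cur = conn.cursor()

# CHECK TABLE returns rows of (Table, Op, Msg_type, Msg_text) per table.
cur.execute("CHECK TABLE orders EXTENDED")
for table, op, msg_type, msg_text in cur.fetchall():
    print(table, op, msg_type, msg_text)
    if msg_type == "error":
        # REPAIR TABLE works on MyISAM, but it salvages what is left -
        # rows lost in the crash are not coming back.
        cur.execute("REPAIR TABLE orders")
        print(cur.fetchall())

conn.close()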

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================
Nov 12 '06 #40
Gordon Burditt wrote:
>>>>Actually, for an OS problem, you can. Use an RDB which journals and
take regular backups. Rolling forward from the last valid backup will
restore all committed transactions.
Nope, OS/hardware issue could mean "no journal".


OS issue could mean "no disk writes", PERIOD.
True. But synchronous writes to the journal will return errors if they
fail, provided the OS is doing its job.
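
The ordering is the whole point of a journal: force the intent record to
disk (where any failure surfaces as an error) before touching the data
file. A simplified sketch of the idea - it appends rather than writing into
a preallocated journal, and it isn't how any particular engine does it:

import os

def journal_then_apply(journal_path, data_path, offset, record):
    # 1. Append the intent to the journal and force it to stable storage.
    #    os.fsync() is where an OS or disk problem shows up as an exception
    #    instead of being silently buffered.
    jfd = os.open(journal_path, os.O_WRONLY | os.O_APPEND)
    try:
        os.write(jfd, record)
        os.fsync(jfd)                 # synchronous: fails loudly or not at all
    finally:
        os.close(jfd)

    # 2. Only after the journal entry is durable, update the data file in place.
    dfd = os.open(data_path, os.O_WRONLY)
    try:
        os.lseek(dfd, offset, os.SEEK_SET)
        os.write(dfd, record)
        os.fsync(dfd)
    finally:
        os.close(dfd)
    # A crash between steps 1 and 2 is recovered by replaying the journal.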
>
>>Journals are written synchronously, before data is written to the
database. Also, they are preallocated - so there is no change in the
allocation units on the disk. Even an OS problem can't break that.


Yes, it can. An OS can decide not to write data at all. (consider
an anti-virus program that monitors disk writes hooked into the
OS). Or, at any time, it can erase all the data. (Consider
accidentally zeroing out a sector containing inodes, including a
file someone else was using and the journal and some table .MYD
files. Oh, yes, remember READING a file modifies the inode (accessed
time)). Or it could write any data over the last sector read (which
might be part of the mysqld executable).
True. The OS has to perform the operations demanded of it. But in that
case nothing will help - not ZFS, not RAID, nothing.

However, at the same time, an OS which does that won't be running for
long, so it's really a moot point.

And BTW - when you're talking inodes, etc., you're discussing
Unix-specific implementations of one file system (actually a couple more
than that - but they are all basically related). There are other file
systems out there.
>
>>The
worst which can happen is the last transaction isn't completely written
to disk (i.e. the database crashed in the middle of the write).


When you think worst-case OS failure, think VIRUS. When you think
worst-case hardware failure, think EXPLOSION. Or FIRE. Or disk
head crash. When you think worst-case power-failure situation,
think burned-out circuits. Or erased track.
A critical system will have virus protection. If you aren't taking even
minimal steps to protect your system, you deserve everything you get.
>
>>And RAID-1 and RAID-10 are fault tolerant mirrored systems. You would
have to lose multiple disks and adapters at exactly the same time to
lose the journal.


Only if the OS writes it in the first place, and the RAID controller
isn't broken, and the OS doesn't scribble over it later.
Yep. Take precautions to protect your system.
>
>>>>Hardware failure is a little more difficult. At the least you need to
have your database and journal on two different disks with two different
adapters.


On different planets.
Right. Get real here.
>
>>>>Better is to also mirror the database and journal with
something like RAID-1 or RAID-10.
Neither of which can protect against certain hardware failures -
everyone has a story about the bulletproof RAID setup which was
scribbled over by a bad controller, or bad cable, or bad power. ZFS
buys a lot more safety (end to end verification).

I don't know of anyone who has "a story" about these systems where data
was lost on RAID-1 or RAID-10.


It's certainly possible to rapidly lose data when you type in a
UPDATE or DELETE query and accidentally type a semicolon instead
of ENTER just before you were going to type WHERE. RAID (or MySQL's
replication setup) does just what it's supposed to do and updates
all the copies with the bad data.

I'm not trying to discourage use of RAID. It can save your butt in
lots of situations. But it doesn't work miracles.
This has nothing to do with the integrity of the database. Of course
it's possible to do something stupid on any system.

Why not just:

rm -r /
>
>>These systems duplicate everything. They have multiple controllers.
Separate cables. Even separate power supplies in the most critical
cases. Even a power failure just powers down the device (and takes the
system down).

>>Also, ZFS doesn't protect against a bad disk, for instance. All it does
is guarantee the data was written properly. A failing controller can
easily overwrite the data at some later time. RAID-1 and RAID-10 could
still have that happen, but what are the chances of two separate
controllers having exactly the same failure at the same time?

I have in the past been involved in some very critical databases. They
all use various RAID devices. And the most critical use RAID-1 or RAID-10.
>>>>It can be guaranteed. Critical databases all use these techniques.
I don't trust that word "guaranteed". You need backups in any case. :)

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================
Nov 12 '06 #41

Jerry Stuckle wrote:
toby wrote:
Jerry Stuckle wrote:
>toby wrote:

Jerry Stuckle wrote:
Journals are written synchronously, ...

And RAID-1 and RAID-10 are fault tolerant mirrored systems. You would
have to lose multiple disks and adapters at exactly the same time to
lose the journal.
As long as there is a single point of failure (software or firmware bug
for instance)...
They will also handle hardware failures. I have never heard of any loss
of data due to hardware failures on RAID-1 or RAID-10. Can you point to
even one instance?

There are several examples of such hardware failures in the links
cited, but I'll crosspost this to comp.arch.storage - I'll eat my hat
if no-one there has seen a RAID data loss.

I've seen those links. I have yet to see where there was any loss of
data proven. Some conjectures in blogs, for instance. But I want to
see documented facts.
Jerry, I'm having trouble believing that you can't come up with a data
loss scenario for conventional RAID-1.
>
And I've removed the cross-post. If I want a discussion in
comp.arch.storage, I will post in it.
>>>...
I don't know of anyone who has "a story" about these systems where data
was lost on RAID-1 or RAID-10.
It hasn't happened to me either, but it has happened to many others.
Specifics? Using RAID-1 or RAID-10?
These systems duplicate everything. They have multiple controllers.
Separate cables. Even separate power supplies in the most critical
cases. Even a power failure just powers down the device (and takes the
system down).

Also, ZFS doesn't protect against a bad disk, for instance. All it does
is guarantee the data was written properly.
It does considerably better than RAID-1 here, in several ways - by
verifying writes; verifying reads; by healing immediately a data error
is found; and by (optionally) making scrubbing passes to reduce the
possibility of undetected loss (this also works for conventional RAID
of course, subject to error detection limitations).
And how does it recover from a disk crash? Or what happens if the data
goes bad after being written and read back?

You use the redundancy to repair it. RAID-1 does not do this.

No, RAID-1 has complete mirrors of the data. And if it detects an error
on the primary disk it can correct the error from the mirror, automatically.
In fact, it does not. It reads from only one side of the mirror. Yes,
*if the drive reports an error* it can fix from the other side. ZFS
does not depend on the drive (or any subsystem) reliably reporting
errors. (I'm not inventing this, I'm only describing.)
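
To make that concrete, the kind of read path I mean looks roughly like
this - not ZFS's actual code, just the principle of checking each block
against a separately stored checksum and repairing from the other copy on a
mismatch (read_copy/write_copy are stand-ins for the mirror I/O):

import hashlib

def checksum(block):
    return hashlib.sha256(block).digest()

def verified_read(read_copy, write_copy, block_id, expected_sum):
    data = read_copy("A", block_id)
    if checksum(data) == expected_sum:
        return data                        # normal case: one side, verified
    # Silent corruption on side A: the drive said nothing, the checksum did.
    data = read_copy("B", block_id)
    if checksum(data) != expected_sum:
        raise IOError("both copies of block %r fail verification" % block_id)
    write_copy("A", block_id, data)        # self-heal the damaged copy
    return data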
>
>Additionally, it depends on the software correctly detecting and
signaling a data error.

Which RAID-1 cannot do at all.

Actually, RAID-1 does do it. In case you aren't aware, all sectors on
the disks are checksummed.
Are you referring to disk internals? If so, it's not relevant to a
comparison between RAID-1 and ZFS, since the mechanism applies in both
cases. ZFS applies a further level of checksumming as you know.
If there is a failure, the hardware will
detect it, long before it even gets to the software. The hardware can
even retry the operation, or it can go straight to the mirror.
>>>A failing controller can
easily overwrite the data at some later time. RAID-1 and RAID-10 could
still have that happen, but what are the chances of two separate
controllers having exactly the same failure at the same time?
The difference is that ZFS will see the problem (checksum) and
automatically salvage the data from the good side, while RAID-1 will
not discover the damage (only reads from one side of the mirror).
Obviously checksumming is the critical difference; RAID-1 is entirely
dependent on the drive correctly signalling errors (correctable or
not); it cannot independently verify data integrity and remains
vulnerable to latent data loss.
If it's a single sector. But if the entire disk crashes - i.e. an
electronics failure?

That's right, it cannot bring a dead disk back to life...

Nope, but the mirror still contains the data.
>But all data is mirrored. And part of the drive's job is to signal
errors. One which doesn't do that correctly isn't much good, is it?

You're right that RAID-1 is built on the assumption that drives
perfectly report errors. ZFS isn't.

Do you really understand how drives work? I mean the actual electronics
of it? Could you read a schematic, scope a failing drive down to the
bad component? Do you have that level of knowledge?

If not, please don't make statements you have no real understanding of.
I can do that, and more. And I have done it.
Is that actually relevant here?

My statement was, ZFS does not assume drives, controllers, drivers or
any level of the stack faithfully reports errors. I'm not inventing
that. Its design principle is, as Richard writes, distrust of the
entire I/O stack (a.k.a. Bonwick's "end-to-end"). You may not like to
hear the words from me (since you've decided I'm not worth listening
to), but there it is.
>
As Richard Elling writes, "We don't have to rely on a parity protected
SCSI bus, or a bug-free disk firmware (I've got the scars) to ensure
that what is on persistent storage is what we get in memory. ... by
distrusting everything in the storage data path we will build in the
reliability and redundancy into the file system."

So, you read a few statements and argue your point without any real
technical knowledge of what goes on behind the scenes?

Can you tell me the chances of having an undetected problem on a parity
protected SCSI bus? Or even a non-parity protected one? And can you
give me the details of the most common causes of those? I thought not.
OK. Seems you're pretty angry about something...
>
And bug-free disk firmware? Disk firmware is a LOT more bug free than
any OS software I've ever seen, including Linux. That's because it has
to do a limited amount of operations with a limited interface.

Unlike a file system which has to handle many additional operations on
different disk types and configurations.

And BTW - how many disk firmware bugs have you heard about recently? I
don't say they can't occur. But the reliable disk manufacturers check,
double-check and triple-check their code before it goes out. Then they
test it again.
>>>I have in the past been involved in some very critical databases. They
all use various RAID devices. And the most critical use RAID-1 or RAID-10.
We can do even better these days.

Related links of interest:
http://blogs.sun.com/bonwick/
http://blogs.sun.com/relling/entry/zfs_from_a_ras_point
https://www.gelato.unsw.edu.au/archi...er/003008.html
http://www.lockss.org/locksswiki/fil...urosys2006.pdf [A Fresh
Look at the Reliability of Long-term Digital Storage, 2006]
http://www.ecsl.cs.sunysb.edu/tr/rpe19.pdf [Challenges of Long-Term
Digital Archiving: A Survey, 2006]
http://www.cs.wisc.edu/~vijayan/vijayan-thesis.pdf [IRON File Systems,
2006]
http://www.tcs.hut.fi/~hhk/phd/phd_Hannu_H_Kari.pdf [Latent Sector
Faults and Reliability of Disk Arrays, 1997]
So? I don't see anything in any of these articles which affects this
discussion. We're not talking about long term digital storage, for
instance.

I think that's quite relevant to many "business critical" database
systems. Databases are even evolving in response to changing
*regulatory* requirements: MySQL's ARCHIVE engine, for instance...

What does MySQL's ARCHIVE engine have to do with "regulatory
requirements"? In case you haven't noticed, MySQL is NOT a US company
(although they do have a U.S. subsidiary).
It was a subtle point. Don't sweat it.
>
>I'm just curious. How many critical database systems have you actually
been involved with? I've lost count. ...
These systems are critical to their business. ...

None of this is relevant to what I'm trying to convey, which is simply:
What ZFS does beyond RAID.

Why are you taking the position that they are equivalent? There are
innumerable failure modes that RAID(-1) cannot handle, which ZFS does.

I'm not taking the position they are equivalent. I'm taking the
position that ZFS is an inferior substitute for a true RAID-1 or RAID-10
implementation.
I don't believe that is the case. We'll have to agree to disagree.
>
>BTW - NONE of them use zfs - because these are mainframe systems, not
Linux. But they all use the mainframe versions of RAID-1 or RAID-10.

I still claim - along with Sun - that you can, using more modern
software, improve on the integrity and availability guarantees of
RAID-1. This applies equally to the small systems I specify (say, a
small mirrored disk server storing POS account data) as to their
humongous storage arrays.

OK, you can maintain it. But a properly configured and operating RAID-1
or RAID-10 array needs no such assistance.
But there are numerous failure modes they can't handle. Any unreported
data error on disk, for instance.

Btw, if you want information from "more qualified sources" than myself
on ZFS, you should continue to post in comp.unix.solaris. My resume
isn't as long as yours, as we have established several times, and you
clearly have decided I have nothing useful to contribute. Oh well.
>
>In any case - this is way off topic for this newsgroup. The original
question was "Can I prevent the loss of a significant portion of my data
in the case of a MySQL, OS or hardware failure, when using MyISAM?".

The answer is no.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================
Nov 12 '06 #42

Jerry Stuckle wrote:
Rich Teer wrote:
On Sat, 11 Nov 2006, toby wrote:

>Jerry Stuckle wrote:

...
But when properly implemented, RAID-1 and RAID-10 will detect and
correct even more errors than ZFS will.

I'll let those with more patience refute this.

Jerry, what are you smoking? Do you actually know what ZFS is, and
if so what if, in the context of your assertion I quoted above, ZFS
is used to implement RAID 1 and RAID 10 (which, incidentally, it is
VERY frequently used to do)?

I agree with Toby: you need to read a bit more about ZFS. If you're
a storage nut (meant in a non-disparaging way!), I think you'll like
what you read.

I'm not smoking anything.

REAL RAID-1 and RAID-10 are implemented in hardware/firmware, not a
software system such as ZFS.
Which is actually a weak point, because you then have to trust the
controller, cables, and so on that interface the "reliable" storage.
Sure, you can have two controllers, and so on, but your application
still has no assurance that the data is good. ZFS is designed to
provide that assurance. The fact that it is part of the operating
system and not a hardware-isolated module makes this possible. Don't
take my word for it, read Bonwick, he's much smarter than I am (which
is why I use his system):
http://blogs.sun.com/bonwick/entry/zfs_end_to_end_data

Btw, I have restored the crosspost to comp.unix.solaris, because ZFS is
a Solaris 10 filesystem.
>
Of course, there are some systems out there which CLAIM to be RAID-1 or
RAID-10, but implement them in software such as ZFS. What they really are
is RAID-1/RAID-10 compliant.

And BTW - I've taken out the extra newsgroups. They have nothing to do
with this discussion.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================
Nov 12 '06 #43
toby wrote:
Jerry Stuckle wrote:
>>Rich Teer wrote:
>>>On Sat, 11 Nov 2006, toby wrote:

Jerry Stuckle wrote:
>...
>But when properly implemented, RAID-1 and RAID-10 will detect and
>correct even more errors than ZFS will.

I'll let those with more patience refute this.
Jerry, what are you smoking? Do you actually know what ZFS is, and
if so what if, in the context of your assertion I quoted above, ZFS
is used to implement RAID 1 and RAID 10 (which, incidentally, it is
VERY frequently used to do)?

I agree with Toby: you need to read a bit more about ZFS. If you're
a storage nut (meant in a non-disparaging way!), I think you'll like
what you read.

I'm not smoking anything.

REAL RAID-1 and RAID-10 are implemented in hardware/firmware, not a
software system such as ZFS.


Which is actually a weak point, because you then have to trust the
controller, cables, and so on that interface the "reliable" storage.
Sure, you can have two controllers, and so on, but your application
still has no assurance that the data is good. ZFS is designed to
provide that assurance. The fact that it is part of the operating
system and not a hardware-isolated module makes this possible. Don't
take my word for it, read Bonwick, he's much smarter than I am (which
is why I use his system):
http://blogs.sun.com/bonwick/entry/zfs_end_to_end_data
Believe me - I trust the hardware a LOT farther than the software!

And yes, I've read bonwick. He's a great proponent of zfs. However, I
don't think he has any idea how the hardware works. At least I haven't
seen any indication of it.
Btw, I have restored the crosspost to comp.unix.solaris, because ZFS is
a Solaris 10 filesystem.
And I have removed it again.

But that's OK. I'm not going to respond to you any further. It's
obvious you've bought a bill of goods hook, line and sinker. And you
aren't willing to listen to anything else.

Bye.
>
>>Of course, there are some systems out there which CLAIM to be RAID-1 or
RAID-10, but implement them in software such as ZFS. What they really are
is RAID-1/RAID-10 compliant.

And BTW - I've taken out the extra newsgroups. They have nothing to do
with this discussion.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================
Nov 12 '06 #44
toby wrote:
Jerry Stuckle wrote:
>>toby wrote:
>>>Jerry Stuckle wrote:
toby wrote:
>toby wrote:
>
>
>
>>Jerry Stuckle wrote:
>>
>>
>>
>>>...
>>>A failing controller can
>>>easily overwrite the data at some later time. RAID-1 and RAID-10 could
>>>still have that happen, but what are the chances of two separate
>>>controllers having exactly the same failure at the same time?
>>
>>The difference is that ZFS will see the problem (checksum) and
>>automatically salvage the data from the good side, while RAID-1 will
>>not discover the damage
>
>
>I should have added - you don't need *two* failures. You only need *one
>silent error* to cause data loss with RAID-1. ZFS is proof against
>silent errors, although of course it's still susceptible to multiple
>failures (such as both mirrors suffering a whole disk failure without
>repair).
>

ZFS is not proof against silent errors - they can still occur. It is
possible for it to miss an error, also. Plus it is not proof against
data decaying after it is written to disk.
Actually both capabilities are among its strongest features.

Clearly you haven't read or understood any of the publicly available
information about it, so I'm not going to pursue this any further
beyond relating an analogy:
You will likely live longer if you look both ways before crossing
the road, rather than walking straight across without looking because
"cars will stop".

Actually, I understand quite a bit about ZFS. However, unlike you, I
also understand its shortcomings. That's because I started working on
fault-tolerant drive systems in 1977 as a hardware CE for IBM,
working on large mainframes. I've watched it grow over the years. And
as an EE major, I also understand the hardware and its strengths and
weaknesses - in detail.

And as a CS major (dual majors) and programmer since 1967, including
working on system software for IBM in the 1980's, I have a thorough
understanding of the software end.

And it's obvious from your statements you have no real understanding of
either, other than sales literature.


This isn't about a battle of the egos. I was challenging what seemed to
be factual misunderstandings of ZFS relative to RAID. Perhaps we're
talking at cross purposes; you had trouble getting Axel's point also...
It's not about a battle of egos with me, either. It's about correcting
some misconceptions of a close-minded individual who has no real idea of
the technical issues involved.

I suspect I understand both ZFS and RAID-1 and RAID-10 a whole lot more
than you do - because I have a thorough understanding of the underlying
hardware and its operation, as well as the programming involved.
>
>>>>...
But when properly implemented, RAID-1 and RAID-10 will detect and
correct even more errors than ZFS will.
I'll let those with more patience refute this.

And more knowledge of the real facts?

BTW - I took out all those extra newsgroups you added. If I wanted to
discuss things there I would have added them myself.
But I'm also not going to discuss this any more with you. I'd really
rather have discussions with someone who really knows the internals - of
both systems.


You'll find them in the newsgroups you snipped, not here. I'm sorry
things degenerated to this point, but I stand by my corrections of your
strange views on ZFS' capabilities.
And it's snipped again because I really don't give a damn what bill of
goods you've bought. And I'm finished with this conversation.

Bye.
>
--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================
Nov 12 '06 #45

toby wrote:
Jerry Stuckle wrote:
...
Actually, I understand quite a bit about ZFS. However, unlike you, I
also understand its shortcomings.
This group and I would very much like to hear about those shortcomings,
if you would elucidate.
That's because I started working on
fault-tolerant drive systems in 1977 as a hardware CE for IBM,
working on large mainframes. I've watched it grow over the years. And
as an EE major, I also understand the hardware and its strengths and
weaknesses - in detail.

And as a CS major (dual majors) and programmer since 1967, including
working on system software for IBM in the 1980's, I have a thorough
understanding of the software end.

And it's obvious from your statements you have no real understanding of
either, other than sales literature.
Nov 12 '06 #46

Jerry Stuckle wrote:
toby wrote:
Jerry Stuckle wrote:
>Rich Teer wrote:

On Sat, 11 Nov 2006, toby wrote:

Jerry Stuckle wrote:
...
But when properly implemented, RAID-1 and RAID-10 will detect and
correct even more errors than ZFS will.

I'll let those with more patience refute this.
Jerry, what are you smoking? Do you actually know what ZFS is, and
if so what if, in the context of your assertion I quoted above, ZFS
is used to implement RAID 1 and RAID 10 (which, incidentally, it is
VERY frequently used to do)?

I agree with Toby: you need to read a bit more about ZFS. If you're
a storage nut (meant in a non-disparaging way!), I think you'll like
what you read.
I'm not smoking anything.

REAL RAID-1 and RAID-10 are implemented in hardware/firmware, not a
software system such as ZFS.

Which is actually a weak point, because you then have to trust the
controller, cables, and so on that interface the "reliable" storage.
Sure, you can have two controllers, and so on, but your application
still has no assurance that the data is good. ZFS is designed to
provide that assurance. The fact that it is part of the operating
system and not a hardware-isolated module makes this possible. Don't
take my word for it, read Bonwick, he's much smarter than I am (which
is why I use his system):
http://blogs.sun.com/bonwick/entry/zfs_end_to_end_data

Believe me - I trust the hardware a LOT farther than the software!

And yes, I've read bonwick. He's a great proponent of zfs. However, I
don't think he has any idea how the hardware works. At least I haven't
seen any indication of it.
Btw, I have restored the crosspost to comp.unix.solaris, because ZFS is
a Solaris 10 filesystem.

And I have removed it again.

But that's OK. I'm not going to respond to you any further. It's
obvious you've bought a bill of goods hook, line and sinker. And you
aren't willing to listen to anything else.
Au contraire. I have asked a question in the relevant group, about what
you have identified as ZFS' shortcomings, and I would genuinely like to
hear the answer.

--Toby
>
Bye.
>Of course, there are some systems out there which CLAIM to be RAID-1 or
RAID-10, but implement them in software such as ZFS. What they really are
is RAID-1/RAID-10 compliant.

And BTW - I've taken out the extra newsgroups. They have nothing to do
with this discussion.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================
Nov 12 '06 #47
toby wrote:
Jerry Stuckle wrote:
>>toby wrote:
>>>Jerry Stuckle wrote:
toby wrote:
>Jerry Stuckle wrote:
>
>
>
>>Journals are written synchronously, ...
>>
>>And RAID-1 and RAID-10 are fault tolerant mirrored systems. You would
>>have to lose multiple disks and adapters at exactly the same time to
>>lose the journal.
>
>
>As long as there is a single point of failure (software or firmware bug
>for instance)...
>

They will also handle hardware failures. I have never heard of any loss
of data due to hardware failures on RAID-1 or RAID-10. Can you point to
even one instance?
There are several examples of such hardware failures in the links
cited, but I'll crosspost this to comp.arch.storage - I'll eat my hat
if no-one there has seen a RAID data loss.

I've seen those links. I have yet to see where there was any loss of
data proven. Some conjectures in blogs, for instance. But I want to
see documented facts.


Jerry, I'm having trouble believing that you can't come up with a data
loss scenario for conventional RAID-1.
What do you mean ME coming up with a data loss scenario? YOU come up
with a single point of failure which loses data on a real RAID-1. A
point of failure which isn't reflected back to the OS, of course.
>
>>And I've removed the cross-post. If I want a discussion in
comp.arch.storage, I will post in it.

>>>>>>...
>>I don't know of anyone who has "a story" about these systems where data
>>was lost on RAID-1 or RAID-10.
>
>
>It hasn't happened to me either, but it has happened to many others.
>

Specifics? Using RAID-1 or RAID-10?

>>These systems duplicate everything. They have multiple controllers.
>>Separate cables. Even separate power supplies in the most critical
>>cases. Even a power failure just powers down the device (and takes the
>>system down).
>>
>>Also, ZFS doesn't protect against a bad disk, for instance. All it does
>>is guarantee the data was written properly.
>
>
>It does considerably better than RAID-1 here, in several ways - by
>verifying writes; verifying reads; by healing immediately a data error
>is found; and by (optionally) making scrubbing passes to reduce the
>possibility of undetected loss (this also works for conventional RAID
>of course, subject to error detection limitations).
>

And how does it recover from a disk crash? Or what happens if the data
goes bad after being written and read back?
You use the redundancy to repair it. RAID-1 does not do this.

No, RAID-1 has complete mirrors of the data. And if it detects an error
on the primary disk it can correct the error from the mirror, automatically.


In fact, it does not. It reads from only one side of the mirror. Yes,
*if the drive reports an error* it can fix from the other side. ZFS
does not depend on the drive (or any subsystem) reliably reporting
errors. (I'm not inventing this, I'm only describing.)
Maybe not the implementations you're familiar with. True fault tolerant
ones will detect a failure on one side and automatically correct it by
fetching the data from the other side.

And tell me exactly under what conditions the drive will report an error
but ZFS would not have. Specific details, please.
>
>>>>Additionally, it depends on the software correctly detecting and
signaling a data error.
Which RAID-1 cannot do at all.

Actually, RAID-1 does do it. In case you aren't aware, all sectors on
the disks are checksummed.


Are you referring to disk internals? If so, it's not relevant to a
comparison between RAID-1 and ZFS, since the mechanism applies in both
cases. ZFS applies a further level of checksumming as you know.
Which makes ZFS's checksum unnecessary and irrelevant - unless you're
using cheap drives, that is.
>
>>If there is a failure, the hardware will
detect it, long before it even gets to the software. The hardware can
even retry the operation, or it can go straight to the mirror.

>>>>>>A failing controller can
>>easily overwrite the data at some later time. RAID-1 and RAID-10 could
>>still have that happen, but what are the chances of two separate
>>controllers having exactly the same failure at the same time?
>
>
>The difference is that ZFS will see the problem (checksum) and
>automatically salvage the data from the good side, while RAID-1 will
>not discover the damage (only reads from one side of the mirror).
>Obviously checksumming is the critical difference; RAID-1 is entirely
>dependent on the drive correctly signalling errors (correctable or
>not); it cannot independently verify data integrity and remains
>vulnerable to latent data loss.
>

If it's a single sector. But if the entire disk crashes - i.e. an
electronics failure?
That's right, it cannot bring a dead disk back to life...

Nope, but the mirror still contains the data.

>>>>But all data is mirrored. And part of the drive's job is to signal
errors. One which doesn't do that correctly isn't much good, is it?
You're right that RAID-1 is built on the assumption that drives
perfectly report errors. ZFS isn't.

Do you really understand how drives work? I mean the actual electronics
of it? Could you read a schematic, scope a failing drive down to the
bad component? Do you have that level of knowledge?

If not, please don't make statements you have no real understanding of.
I can do that, and more. And I have done it.


Is that actually relevant here?
You're making technical claims. Provide the technical support to back
up your claims.
My statement was, ZFS does not assume drives, controllers, drivers or
any level of the stack faithfully reports errors. I'm not inventing
that. Its design principle is, as Richard writes, distrust of the
entire I/O stack (a.k.a. Bonwick's "end-to-end"). You may not like to
hear the words from me (since you've decided I'm not worth listening
to), but there it is.
Sure you should distrust the I/O stack. It can be overwritten so many
ways by the software.

Unlike the controller - where it can't be overwritten.

I'll talk to you. But please don't insult me by repeating technical
claims when you don't understand the background behind them. As I said
before - I've read the links you provided, and quite frankly don't agree
with a number of their claims.
>
>>>As Richard Elling writes, "We don't have to rely on a parity protected
SCSI bus, or a bug-free disk firmware (I've got the scars) to ensure
that what is on persistent storage is what we get in memory. ... by
distrusting everything in the storage data path we will build in the
reliability and redundancy into the file system."

So, you read a few statements and argue your point without any real
technical knowledge of what goes on behind the scenes?

Can you tell me the chances of having an undetected problem on a parity
protected SCSI bus? Or even a non-parity protected one? And can you
give me the details of the most common causes of those? I thought not.


OK. Seems you're pretty angry about something...
Not angry at all. Just trying to find out if you understand what you're
talking about.
>
>>And bug-free disk firmware? Disk firmware is a LOT more bug free than
any OS software I've ever seen, including Linux. That's because it has
to do a limited amount of operations with a limited interface.

Unlike a file system which has to handle many additional operations on
different disk types and configurations.

And BTW - how many disk firmware bugs have you heard about recently? I
don't say they can't occur. But the reliable disk manufacturers check,
double-check and triple-check their code before it goes out. Then they
test it again.

>>>>>>I have in the past been involved in some very critical databases. They
>>all use various RAID devices. And the most critical use RAID-1 or RAID-10.
>
>
>We can do even better these days.
>
>Related links of interest:
>http://blogs.sun.com/bonwick/
>http://blogs.sun.com/relling/entry/zfs_from_a_ras_point
>https://www.gelato.unsw.edu.au/archi...er/003008.html
>http://www.lockss.org/locksswiki/fil...urosys2006.pdf [A Fresh
>Look at the Reliability of Long-term Digital Storage, 2006]
>http://www.ecsl.cs.sunysb.edu/tr/rpe19.pdf [Challenges of Long-Term
>Digital Archiving: A Survey, 2006]
>http://www.cs.wisc.edu/~vijayan/vijayan-thesis.pdf [IRON File Systems,
>2006]
>http://www.tcs.hut.fi/~hhk/phd/phd_Hannu_H_Kari.pdf [Latent Sector
>Faults and Reliability of Disk Arrays, 1997]
>

So? I don't see anything in any of these articles which affects this
discussion. We're not talking about long term digital storage, for
instance.
I think that's quite relevant to many "business critical" database
systems. Databases are even evolving in response to changing
*regulatory* requirements: MySQL's ARCHIVE engine, for instance...

What does MySQL's ARCHIVE engine have to do with "regulatory
requirements"? In case you haven't noticed, MySQL is NOT a US company
(although they do have a U.S. subsidiary).


It was a subtle point. Don't sweat it.
Then why even bring it up? Because it's irrelevant?
>
>>>>I'm just curious. How many critical database systems have you actually
been involved with? I've lost count. ...
These systems are critical to their business. ...
None of this is relevant to what I'm trying to convey, which is simply:
What ZFS does beyond RAID.

Why are you taking the position that they are equivalent? There are
innumerable failure modes that RAID(-1) cannot handle, which ZFS does.

I'm not taking the position they are equivalent. I'm taking the
position that ZFS is an inferior substitute for a true RAID-1 or RAID-10
implementation.


I don't believe that is the case. We'll have to agree to disagree.
The difference is I don't just accept what someone claims. Rather, I
analyze and determine just how accurate the statements are.
>
>>>>BTW - NONE of them use zfs - because these are mainframe systems, not
Linux. But they all use the mainframe versions of RAID-1 or RAID-10.
I still claim - along with Sun - that you can, using more modern
software, improve on the integrity and availability guarantees of
RAID-1. This applies equally to the small systems I specify (say, a
small mirrored disk server storing POS account data) as to their
humongous storage arrays.

OK, you can maintain it. But a properly configured and operating RAID-1
or RAID-10 array needs no such assistance.


But there are numerous failure modes they can't handle. Any unreported
data error on disk, for instance.
And exactly how can you get an unreported data error from a disk?
Btw, if you want information from "more qualified sources" than myself
on ZFS, you should continue to post in comp.unix.solaris. My resume
isn't as long as yours, as we have established several times, and you
clearly have decided I have nothing useful to contribute. Oh well.
Not really. You butted into this conversation and discussed zfs -
which, BTW, is a UNIX-only file system. And in case you haven't figured
out, UNIX is NOT the only OS out there. Even MySQL recognizes that.

I'm just refuting your wild claims. But you're not interested in
discussing hard facts - you make claims about "unreported data errors",
for instance, but have no idea how they can happen, how often they
happen or the odds of them happening.

All you have is a sales pitch you've bought.

Thanks, I have better things to do with my time. Bye.
>
>>>>In any case - this is way off topic for this newsgroup. The original
question was "Can I prevent the loss of a significant portion of my data
in the case of a MySQL, OS or hardware failure, when using MyISAM?".

The answer is no.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================
Nov 12 '06 #48

Jerry Stuckle wrote:
toby wrote:
Jerry Stuckle wrote:
>toby wrote:

Jerry Stuckle wrote:
toby wrote:
Jerry Stuckle wrote:

>Journals are written synchronously, ...
>
>And RAID-1 and RAID-10 are fault tolerant mirrored systems. You would
>have to lose multiple disks and adapters at exactly the same time to
>lose the journal.
As long as there is a single point of failure (software or firmware bug
for instance)...
They will also handle hardware failures. I have never heard of any loss
of data due to hardware failures on RAID-1 or RAID-10. Can you point to
even one instance?
There are several examples of such hardware failures in the links
cited, but I'll crosspost this to comp.arch.storage - I'll eat my hat
if no-one there has seen a RAID data loss.
I've seen those links. I have yet to see where there was any loss of
data proven. Some conjectures in blogs, for instance. But I want to
see documented facts.

Jerry, I'm having trouble believing that you can't come up with a data
loss scenario for conventional RAID-1.

What do you mean ME coming up with a data loss scenario? YOU come up
with a single point of failure which loses data on a real RAID-1. A
point of failure which isn't reflected back to the OS, of course.
>And I've removed the cross-post. If I want a discussion in
comp.arch.storage, I will post in it.
>...
>I don't know of anyone who has "a story" about these systems where data
>was lost on RAID-1 or RAID-10.
It hasn't happened to me either, but it has happened to many others.
Specifics? Using RAID-1 or RAID-10?

>These systems duplicate everything. They have multiple controllers.
>Separate cables. Even separate power supplies in the most critical
>cases. Even a power failure just powers down the device (and takes the
>system down).
>
>Also, ZFS doesn't protect against a bad disk, for instance. All it does
>is guarantee the data was written properly.
It does considerably better than RAID-1 here, in several ways - by
verifying writes; verifying reads; by healing immediately a data error
is found; and by (optionally) making scrubbing passes to reduce the
possibility of undetected loss (this also works for conventional RAID
of course, subject to error detection limitations).
And how does it recover from a disk crash? Or what happens if the data
goes bad after being written and read back?
You use the redundancy to repair it. RAID-1 does not do this.
No, RAID-1 has complete mirrors of the data. And if it detects an error
on the primary disk it can correct the error from the mirror, automatically.

In fact, it does not. It reads from only one side of the mirror. Yes,
*if the drive reports an error* it can fix from the other side. ZFS
does not depend on the drive (or any subsystem) reliably reporting
errors. (I'm not inventing this, I'm only describing.)

Maybe not the implementations you're familiar with. True fault tolerant
ones will detect a failure on one side and automatically correct it by
fetching the data from the other side.

And tell me exactly under what conditions the drive will report an error
but ZFS would not have. Specific details, please.
>>>Additionally, it depends on the software correctly detecting and
signaling a data error.
Which RAID-1 cannot do at all.
Actually, RAID-1 does do it. In case you aren't aware, all sectors on
the disks are checksummed.

Are you referring to disk internals? If so, it's not relevant to a
comparison between RAID-1 and ZFS, since the mechanism applies in both
cases. ZFS applies a further level of checksumming as you know.

Which makes ZFS's checksum unnecessary and irrelevant - unless you're
using cheap drives, that is.
>If there is a failure, the hardware will
detect it, long before it even gets to the software. The hardware can
even retry the operation, or it can go straight to the mirror.
>A failing controller can
>easily overwrite the data at some later time. RAID-1 and RAID-10 could
>still have that happen, but what are the chances of two separate
>controllers having exactly the same failure at the same time?
The difference is that ZFS will see the problem (checksum) and
automatically salvage the data from the good side, while RAID-1 will
not discover the damage (only reads from one side of the mirror).
Obviously checksumming is the critical difference; RAID-1 is entirely
dependent on the drive correctly signalling errors (correctable or
not); it cannot independently verify data integrity and remains
vulnerable to latent data loss.
If it's a single sector. But if the entire disk crashes - i.e. an
electronics failure?
That's right, it cannot bring a dead disk back to life...
Nope, but the mirror still contains the data.
But all data is mirrored. And part of the drive's job is to signal
errors. One which doesn't do that correctly isn't much good, is it?
You're right that RAID-1 is built on the assumption that drives
perfectly report errors. ZFS isn't.
Do you really understand how drives work? I mean the actual electronics
of it? Could you read a schematic, scope a failing drive down to the
bad component? Do you have that level of knowledge?

If not, please don't make statements you have no real understanding of.
I can do that, and more. And I have done it.

Is that actually relevant here?

You're making technical claims. Provide the technical support to back
up your claims.
My statement was, ZFS does not assume drives, controllers, drivers or
any level of the stack faithfully reports errors. I'm not inventing
that. Its design principle is, as Richard writes, distrust of the
entire I/O stack (a.k.a. Bonwick's "end-to-end"). You may not like to
hear the words from me (since you've decided I'm not worth listening
to), but there it is.

Sure you should distrust the I/O stack. It can be overwritten so many
ways by the software.

Unlike the controller - where it can't be overwritten.

I'll talk to you. But please don't insult me by repeating technical
claims when you don't understand the background behind them. As I said
before - I've read the links you provided, and quite frankly don't agree
with a number of their claims.
Sure, let's talk. Would you humour me with a reply in comp.unix.solaris
with details on the ZFS shortcomings you were talking about - a genuine
request, because some of us *have* invested in that technology.
Tomorrow I'll think over the points you question above.
>
>>As Richard Elling writes, "We don't have to rely on a parity protected
SCSI bus, or a bug-free disk firmware (I've got the scars) to ensure
that what is on persistent storage is what we get in memory. ... by
distrusting everything in the storage data path we will build in the
reliability and redundancy into the file system."
So, you read a few statements and argue your point without any real
technical knowledge of what goes on behind the scenes?

Can you tell me the chances of having an undetected problem on a parity
protected SCSI bus? Or even a non-parity protected one? And can you
give me the details of the most common causes of those? I thought not.

OK. Seems you're pretty angry about something...

Not angry at all. Just trying to find out if you understand what you're
talking about.
You've decided I don't. But let's press on while it remains civil.
>
>And bug-free disk firmware? Disk firmware is a LOT more bug free than
any OS software I've ever seen, including Linux. That's because it has
to do a limited amount of operations with a limited interface.

Unlike a file system which has to handle many additional operations on
different disk types and configurations.

And BTW - how many disk firmware bugs have you heard about recently? I
don't say they can't occur. But the reliable disk manufacturers check,
double-check and triple-check their code before it goes out. Then they
test it again.
>I have in the past been involved in some very critical databases. They
>all use various RAID devices. And the most critical use RAID-1 or RAID-10.
We can do even better these days.

Related links of interest:
http://blogs.sun.com/bonwick/
http://blogs.sun.com/relling/entry/zfs_from_a_ras_point
https://www.gelato.unsw.edu.au/archi...er/003008.html
http://www.lockss.org/locksswiki/fil...urosys2006.pdf [A Fresh
Look at the Reliability of Long-term Digital Storage, 2006]
http://www.ecsl.cs.sunysb.edu/tr/rpe19.pdf [Challenges of Long-Term
Digital Archiving: A Survey, 2006]
http://www.cs.wisc.edu/~vijayan/vijayan-thesis.pdf [IRON File Systems,
2006]
http://www.tcs.hut.fi/~hhk/phd/phd_Hannu_H_Kari.pdf [Latent Sector
Faults and Reliability of Disk Arrays, 1997]
So? I don't see anything in any of these articles which affects this
discussion. We're not talking about long term digital storage, for
instance.
I think that's quite relevant to many "business critical" database
systems. Databases are even evolving in response to changing
*regulatory* requirements: MySQL's ARCHIVE engine, for instance...
What does MySQL's ARCHIVE engine have to do with "regulatory
requirements"? In case you haven't noticed, MySQL is NOT a US company
(although they do have a U.S. subsidiary).

It was a subtle point. Don't sweat it.

Then why even bring it up? Because it's irrelevant?
I don't think "long term data storage" is irrelevant to databases and
data integrity. It *is* irrelevant to the OP's question, of course :)
>
>>>I'm just curious. How many critical database systems have you actually
been involved with? I've lost count. ...
These systems are critical to their business. ...
None of this is relevant to what I'm trying to convey, which is simply:
What ZFS does beyond RAID.

Why are you taking the position that they are equivalent? There are
innumerable failure modes that RAID(-1) cannot handle, which ZFS does.
I'm not taking the position they are equivalent. I'm taking the
position that ZFS is an inferior substitute for a true RAID-1 or RAID-10
implementation.

I don't believe that is the case. We'll have to agree to disagree.

The difference is I don't just accept what someone claims. Rather, I
analyze and determine just how accurate the statements are.
Please don't assume I have done none of my own thinking.
>
>>>BTW - NONE of them use zfs - because these are mainframe systems, not
Linux. But they all use the mainframe versions of RAID-1 or RAID-10.
I still claim - along with Sun - that you can, using more modern
software, improve on the integrity and availability guarantees of
RAID-1. This applies equally to the small systems I specify (say, a
small mirrored disk server storing POS account data) as to their
humongous storage arrays.
OK, you can maintain it. But a properly configured and operating RAID-1
or RAID-10 array needs no such assistance.

But there are numerous failure modes they can't handle. Any unreported
data error on disk, for instance.

And exactly how can you get an unreported data error from a disk?
If the error occurs in cable, controller, RAM, and so on. I have seen
this myself.
>
Btw, if you want information from "more qualified sources" than myself
on ZFS, you should continue to post in comp.unix.solaris. My resume
isn't as long as yours, as we have established several times, and you
clearly have decided I have nothing useful to contribute. Oh well.

Not really. You butted into this conversation and discussed zfs -
which, BTW, is a UNIX-only file system. And in case you haven't figured
out, UNIX is NOT the only OS out there. Even MySQL recognizes that.
Yes, I was talking about specific capabilities of ZFS. The fact it's
UNIX-specific isn't really important to those principles.
>
I'm just refuting your wild claims. But you're not interested in
discussing hard facts - you make claims about "unreported data errors",
for instance, but have no idea how they can happen, how often they
happen or the odds of them happening.
I'm not sure I've made any wild claims other than "I think ZFS can
guarantee more than conventional RAID", due to concepts which underpin
ZFS' design. That's not very "wild". There is some inductive thinking
involved, not sheer speculation. If you calmed down, we could talk it
over. You've said you distrust Bonwick -- so I'd like to hear why he's
wrong (in the appropriate forum).
>
All you have is a sales pitch you've bought.

Thanks, I have better things to do with my time. Bye.
>>>In any case - this is way off topic for this newsgroup. The original
question was "Can I prevent the loss of a significant portion of my data
in the case of a MySQL, OS or hardware failure, when using MyISAM?".

The answer is no.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================
Nov 12 '06 #49
Jerry Stuckle wrote:
Bill Todd wrote:
>Jerry Stuckle wrote:

...
>>ZFS is not proof against silent errors - they can still occur.


Of course they can, but they will be caught by the background
verification scrubbing before much time passes (i.e., within a time
window that radically reduces the likelihood that another disk will
fail before the error is caught and corrected), unlike the case with
conventional RAID (where they aren't caught at all, and rise up to
bite you - with non-negligible probability these days - if the good
copy then dies).

And ZFS *is* proof against silent errors in the sense that data thus
mangled will not be returned to an application (i.e., it will be
caught when read if the background integrity validation has not yet
reached it) - again, unlike the case with conventional mirroring,
where there's a good chance that it will be returned to the
application as good.


The same is true with RAID-1 and RAID-10. An error on the disk will be
detected and returned by the hardware to the OS.
I'd think that someone as uninformed as you are would have thought twice
about appending an ad for his services to his Usenet babble. But formal
studies have shown that the least competent individuals seem to be the
most confident of their opinions (because they just don't know enough to
understand how clueless they really are).

Do you even know what a silent error is? It's an error that the disk
does not notice, and hence cannot report.

Duh.

In some of your other recent drivel you've seemed to suggest that this
simply does not happen. Well, perhaps not in your own extremely limited
experience, but you really shouldn't generalize from that.

A friend of mine at DEC investigated this about a decade ago and found
that the (high-end) disk subsystems of some (high-end) Alpha platforms
were encountering undetected errors on average every few TB (i.e., what
they read back was, very rarely, not quite what they had written in,
with no indication of error). That may be better today (that's more
like the uncorrectable error rate now), but it still happens. The
causes are well known to people reasonably familiar with the technology:
the biggies are writes that report successful completion but in fact
do nothing, writes that go to the wrong target sector(s) (whether or not
they report success), and errors that the sector checksums just don't
catch (those used to be about three orders of magnitude rarer than
uncorrectable errors, but that was before the rush toward higher density
and longer checksums to catch the significantly-increased raw error
rates - disk manufacturers no longer report the undetected error rate,
but I suspect that it's considerably closer to the uncorrectable error
rate now). There are also a few special cases - e.g., the disk that
completes a sector update while power is failing, not knowing that the
transfer from memory got clamped part-way through and returned zeros
rather than whatever it was supposed to contain (so as far as the disk
knows, the sector is valid).

IBM, Unisys, NetApp, and EMC (probably not an exhaustive list, but
they're the ones that spring immediately to mind) all use non-standard
disk sector sizes in some of their systems to hold additional validation
information (maintained by software or firmware well above the disk
level) aimed at catching some (but in most cases not all) of these
unreported errors.
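
As an aside, the idea behind those oversized sectors is easy to sketch.
The toy layout below is purely illustrative - the trailer fields, names
and checksum choice are my own assumptions, not any vendor's actual
on-disk format - but it shows why a self-describing block catches some of
these errors and not others:

import zlib

SECTOR_DATA = 512

def write_block(disk, lba, data):
    """Store a 'fat sector': payload plus an integrity trailer."""
    assert len(data) == SECTOR_DATA
    # The trailer records which LBA this payload was *meant* for, plus a CRC.
    disk[lba] = (data, (lba, zlib.crc32(data)))

def read_block(disk, lba):
    data, (tagged_lba, crc) = disk[lba]
    if tagged_lba != lba:
        raise IOError("misdirected write: payload tagged for LBA %d found at %d"
                      % (tagged_lba, lba))
    if zlib.crc32(data) != crc:
        raise IOError("payload corrupted on the way to or from the platter")
    return data

# This exposes misdirected writes and mangled payloads on the next read, but
# a write the drive silently dropped leaves stale yet self-consistent data
# behind, which a per-block trailer alone cannot detect - hence "some, but
# not all".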

Silent errors are certainly rare, but they happen. ZFS catches them.
RAID does not. End of story.

....

The
big difference being that ZFS is done in software, which requires CPU cycles
and other resources.
Since when was this discussion about use of resources rather than
integrity (not that ZFS's use of resources for implementing its own
RAID-1/RAID-10 facilities is significant anyway)?
It's also open to corruption.
No more than the data that some other file system gives to a hardware
RAID implementation would be: it all comes from the same place (main
memory).

However, because ZFS subsequently checks what it wrote against a
*separate* checksum, if it *was* corrupted below the request-submission
level ZFS is very likely to find out, whereas a conventional RAID
implementation (and the higher layers built on top of it) won't: they
just write what (they think) they're told to, with no additional check.
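
To make that difference concrete, here is a minimal sketch of verifying
reads against a checksum kept apart from the data - roughly the principle
ZFS applies by storing each block's checksum in its parent. The class and
names below are my own illustration, not ZFS internals:

import hashlib

class ChecksummedStore:
    """Toy store that keeps block checksums in metadata, away from the data."""
    def __init__(self, device):
        self.device = device     # dict: block number -> bytes (the "platters")
        self.checksums = {}      # held in parent/metadata blocks, not with the data

    def write(self, blkno, data):
        # Anything below this call (cable, controller, firmware) may corrupt data.
        self.device[blkno] = data
        self.checksums[blkno] = hashlib.sha256(data).digest()

    def read(self, blkno):
        data = self.device[blkno]
        if hashlib.sha256(data).digest() != self.checksums[blkno]:
            # ZFS would now fall back to another mirror copy or reconstruct from
            # parity; the point here is just that bad data is never returned
            # silently to the application.
            raise IOError("checksum mismatch on block %d" % blkno)
        return data

# A conventional RAID layer has no equivalent of self.checksums: whatever the
# write path delivered is what gets mirrored - and what gets handed back later.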

RAID-1 and RAID-10
are implemented in hardware/firmware which cannot be corrupted (read-only
memory) and require no CPU cycles.
If your operating system and file system have been corrupted, you've got
problems regardless of how faithfully your disk hardware transfers this
corruption to its platters: this alleged deficiency compared with a
hardware implementation is just not an issue.

You've also suggested elsewhere that a hardware implementation is less
likely to contain bugs, which at least in this particular instance is
nonsense: ZFS's RAID-1/10 implementation benefits from the rest of its
design such that it's likely *far* simpler than any high-performance
hardware implementation (with its controller-level cache management and
deferred write-back behavior) is, and hence if anything likely *less* buggy.
>
>>Plus it is not proof against data decaying after it is written to disk.


No - but, again, it will catch it before long, even in cases where
conventional disk scrubbing would not.

So do RAID-1 and RAID-10.
No, they typically do not: they may scrub to ensure that sectors can be
read successfully (and without checksum errors), but they do not compare
one copy with the other (and even if they did, if they found that the
copies differed they'd have no idea which one was the right one - but
ZFS knows).
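
The scrub-time difference is easy to see in the same terms. A rough
sketch (my own illustration, assuming a per-block checksum is available
from metadata as above - not actual ZFS code):

import hashlib

def scrub_block(copy_a, copy_b, expected_checksum):
    """Verify one block of a two-way mirror and say how to repair it."""
    a_ok = hashlib.sha256(copy_a).digest() == expected_checksum
    b_ok = hashlib.sha256(copy_b).digest() == expected_checksum
    if a_ok and b_ok:
        return copy_a, copy_b        # both copies healthy, nothing to do
    if a_ok:
        return copy_a, copy_a        # rewrite B from the known-good A
    if b_ok:
        return copy_b, copy_b        # rewrite A from the known-good B
    raise IOError("both copies bad - reconstruct from parity or report data loss")

# A RAID-1 scrub that merely reads both sides (or compares them) has no
# expected_checksum: if the copies differ it cannot tell which one to keep,
# and if both carry the same silent corruption it notices nothing at all.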
>
>>And, as you note, it doesn't handle a disk crash.


It handles it with resilience comparable to RAID-1, but is more
flexible in that it can then use distributed free space to restore the
previous level of redundancy (whereas RAID-1/RAID-10 cannot unless the
number of configured hot spare disks equals the number of failed disks).

And for a critical system you have that redundancy and more.
So, at best, RAID-1/10 matches ZFS in this specific regard (though of
course it can't leverage the additional bandwidth and IOPS of its spare
space, unlike ZFS). Whoopee.
>
>>>
But when properly implemented, RAID-1 and RAID-10 will detect and
correct even more errors than ZFS will.

A complete disk crash, for instance. Even Toby admitted ZFS cannot
recover from a disk crash.

ZFS is good. But it's a cheap software implementation of an expensive
hardware recovery system. And there is no way software can do it as
well as hardware does.
You at least got that right: ZFS does it considerably better, not
merely 'as well'. And does so at significantly lower cost (so you got
that part right too).

The one advantage that a good hardware RAID-1/10 implementation has over
ZFS relates to performance, primarily small-synchronous-write latency:
while ZFS can group small writes to achieve competitive throughput (in
fact, superior throughput in some cases), it can't safely report
synchronous write completion until the data is on the disk platters,
whereas a good RAID controller will contain mirrored NVRAM that can
guarantee persistence in microseconds rather than milliseconds (and then
destage the writes to the platters lazily).

Now, ZFS does have an 'intent log' for small writes, and does have the
capability of placing this log on (mirrored) NVRAM to achieve equivalent
small-synchronous-write latency - but that's a hardware option, not part
and parcel of ZFS itself.
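
For a sense of the scale involved, here's a back-of-the-envelope model;
the latency figures are round illustrative numbers of my own choosing,
not measurements of any particular product:

# The application is not told "done" until the intent-log record is
# persistent on the log device; destaging to the main pool happens later
# and does not gate the acknowledgement, so ack latency tracks the log medium.

LOG_MEDIA_LATENCY_S = {
    "spinning disk (seek + rotation)": 8e-3,    # roughly milliseconds
    "battery-backed / mirrored NVRAM": 20e-6,   # roughly tens of microseconds
}

def sync_write_ack_time(log_media, software_overhead_s=50e-6):
    return software_overhead_s + LOG_MEDIA_LATENCY_S[log_media]

for media in LOG_MEDIA_LATENCY_S:
    print("%-35s ack in ~%.2f ms" % (media, sync_write_ack_time(media) * 1e3))

Which is the point of the paragraph above: the mechanism (a persistent
intent log) ships with ZFS, while the microsecond-class device to put it
on is a hardware purchase.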

....
>Please name even one.
Why am I not surprised that you dodged that challenge?

Now, as far as credentials go, some people (who aren't sufficiently
familiar with this subject to know just how incompetent you really are
to discuss it) might find yours impressive (at least you appear to think
they might, since you made some effort to trot them out). I must admit
that I can't match your claim to have been programming since "1867", but
I have been designing and writing system software since 1976 (starting
with 11 years at DEC), and had a stint at EMC designing high-end storage
firmware in the early '90s. I specialize in designing and implementing
high-performance, high-availability distributed file, object, and
database systems, and have personally created significant portions of
several such; in this pursuit, I've kept current on the state of the art
both in academia and in the commercial arena.

And I say you're full of shit. Christ, you've never even heard of
people losing mirrored data at all - not from latent errors only
discovered at rebuild time, not from correlated failures of mirror pairs
from the same batch (or even not from the same batch - with a large
enough RAID-10 array there's a modest probability that some pair won't
recover from a simple power outage, and - though this may be news to you
- even high-end UPSs are *not* infallible)...

Sheesh.

- bill
Nov 12 '06 #50
