[sorry for being OT, but I do have a weak spot for faults]
CBFalconer wrote:
so**********@gmail.com wrote:
>I was going through Peter van der Linden's book Expert C Programming;
there I found the following section on bus error
On bus error, just take the train. :)
Parity systems only detect _some_ memory errors, and do not
The point is that a parity bit detects *all* single-bit errors; in fact,
parity detects every error that flips an odd number of bits... right?
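
To make the odd/even point concrete, here is a minimal sketch of one
even-parity bit guarding one byte. The names parity8 and parity_error are
made up for illustration; real parity memory does this in hardware, per
byte or per word.

```c
#include <stdint.h>

/* Parity of one byte: 1 if an odd number of bits are set, else 0.
   Folding the byte onto itself XORs all eight bits together. */
unsigned parity8(uint8_t v)
{
    v ^= v >> 4;
    v ^= v >> 2;
    v ^= v >> 1;
    return v & 1u;
}

/* Returns 1 if the data no longer matches the parity bit that was
   stored alongside it, 0 if the check passes. */
int parity_error(uint8_t data, unsigned stored_parity)
{
    return parity8(data) != stored_parity;
}
```

With p = parity8(d), flipping one or three bits of d makes
parity_error(d, p) return 1, while flipping two bits returns 0: an even
number of flips cancels out and goes unnoticed.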
correct. Good modern systems use ECC, which corrects single bit
errors immediately, and detects at least double bit errors, and
probably most bigger errors.
Good systems, or let's say expensive systems, do far more than that.
Recovering *only* from single-bit errors is low-end server stuff. :)
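
For reference, the single-error-correct, double-error-detect (SECDED)
baseline can be sketched with an extended Hamming(8,4) code: 4 data bits,
3 Hamming check bits and 1 overall parity bit. This is a toy layout I
picked for illustration (the function names are hypothetical); real ECC
DIMMs typically use wider codes such as 72 bits for 64 data bits, but
the principle is the same.

```c
#include <stdint.h>

enum ecc_status { ECC_OK, ECC_CORRECTED, ECC_UNCORRECTABLE };

static unsigned bit(uint8_t w, unsigned i) { return (w >> i) & 1u; }

/* Encode 4 data bits into an 8-bit code word.
   Positions 3,5,6,7 hold data, 1,2,4 hold Hamming check bits,
   position 0 holds parity over the other seven bits. */
uint8_t secded_encode(uint8_t nibble)
{
    uint8_t w = 0;
    unsigned all = 0;

    w |= (uint8_t)(((nibble >> 0) & 1u) << 3);
    w |= (uint8_t)(((nibble >> 1) & 1u) << 5);
    w |= (uint8_t)(((nibble >> 2) & 1u) << 6);
    w |= (uint8_t)(((nibble >> 3) & 1u) << 7);

    w |= (uint8_t)((bit(w, 3) ^ bit(w, 5) ^ bit(w, 7)) << 1);
    w |= (uint8_t)((bit(w, 3) ^ bit(w, 6) ^ bit(w, 7)) << 2);
    w |= (uint8_t)((bit(w, 5) ^ bit(w, 6) ^ bit(w, 7)) << 4);

    for (unsigned i = 1; i < 8; i++)
        all ^= bit(w, i);
    w |= (uint8_t)all;
    return w;
}

/* Decode a possibly damaged code word. On ECC_OK or ECC_CORRECTED the
   data bits are stored in *nibble; on ECC_UNCORRECTABLE it is untouched. */
enum ecc_status secded_decode(uint8_t w, uint8_t *nibble)
{
    unsigned syndrome = 0, overall = 0;
    enum ecc_status st = ECC_OK;

    /* XOR of the positions of all set bits is 0 for a valid word. */
    for (unsigned i = 0; i < 8; i++) {
        if (bit(w, i)) {
            syndrome ^= i;
            overall ^= 1u;
        }
    }

    if (overall) {
        /* Odd number of flips: assume exactly one, at position
           'syndrome' (0 means the overall parity bit itself). */
        w ^= (uint8_t)(1u << syndrome);
        st = ECC_CORRECTED;
    } else if (syndrome) {
        /* Even number of flips but nonzero syndrome: double error. */
        return ECC_UNCORRECTABLE;
    }

    *nibble = (uint8_t)(bit(w, 3) | (bit(w, 5) << 1) |
                        (bit(w, 6) << 2) | (bit(w, 7) << 3));
    return st;
}
```

Note the limit of this small code: three flipped bits look like a single
error and get "corrected" into the wrong data, so anything beyond double
errors is not reliably caught here.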
IBM S/390 mainframes protect against memory failures by design: no chip
affects more than a single bit of an ECC code word. Hence, IBM mainframes
can survive memory chip failures, even when every bit in a chip fails.
That is, the ECC logic can correct for the hardware fault of a whole
memory chip without loss of data.
HP NonStop (formerly Tandem Computers) mainframes provided protection via
lockstep; in case of HW faults, redundant HW took over the processing.
The principle here, as I understand it, was that computations were
performed in parallel and checked; whenever there was a disagreement,
the module was marked faulty, and processing moved elsewhere from the
last checkpoint state.
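
A rough software model of that principle, strictly as I understand it
and not how the actual hardware is built: each unit holds two replicas
of the same step, the results are compared, and on disagreement the unit
is marked faulty and the step is retried on the next healthy unit from
the unchanged checkpoint. All names here (struct unit, lockstep_step,
add_one, add_one_broken) are invented for the sketch.

```c
#include <stddef.h>

typedef long (*step_fn)(long state);

struct unit {
    step_fn a, b;   /* two replicas that should compute the same step */
    int faulty;     /* set once the replicas have disagreed */
};

/* Example replicas: a healthy one, and one with a flipped low bit. */
long add_one(long s)        { return s + 1; }
long add_one_broken(long s) { return (s + 1) ^ 1L; }

/* Advance *checkpoint by one step on the first healthy unit whose two
   replicas agree. Returns 0 on success. Returns -1, leaving *checkpoint
   unchanged, when no healthy unit is left. */
int lockstep_step(struct unit *units, size_t n, long *checkpoint)
{
    for (size_t i = 0; i < n; i++) {
        long ra, rb;

        if (units[i].faulty)
            continue;
        ra = units[i].a(*checkpoint);
        rb = units[i].b(*checkpoint);
        if (ra == rb) {
            *checkpoint = ra;       /* agreed result becomes new checkpoint */
            return 0;
        }
        units[i].faulty = 1;        /* disagreement: fail fast, move on */
    }
    return -1;
}
```

A pair can only tell that something went wrong, not which replica is
right, which is why the whole unit is retired rather than one half of it.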
Likewise, various high-end servers provide extended protection, e.g.
chipkill (IBM), chipspare (HP), Extended ECC (Sun/Fujitsu), etc.
These days, Intel Itanium 2 CPUs come with a number of RAS features,
including lockstep technology for both sockets and cores. I have tried
reading Intel's CPU manuals, without finding much information about these
interesting features.
Both AMD64 and IA-64 have 4 protection rings, while the most popular OS
kernels use only two (user and kernel space). Using all 4 rings will
make future kernels more robust and secure.
ECC protects against such things as cosmic rays. Systematic errors
are rare, and detectable through such things as memtest86. But
random events, fixed by ECC, can cause total destruction, and the
evils may happen much later when all backups are fouled.
HW faults will bring most low-end systems down. Correcting single-bit
failures was perhaps sufficient in the past, but we do move forward. :)
--
Tor <bw****@wvtqvm.vw | tr i-za-h a-z>