UNIT 3
Fault Tolerance Techniques
Introduction, failure causes, fault types, fault detection,
fault and error containment, redundancy, data diversity,
reversal checks, malicious or Byzantine failures
Coding technique
Software fault tolerance
Network fault tolerance
Roll No: 15
Fault Tolerance
Definition
Fault tolerance refers to a system's ability to deal
with malfunctions.
Fault-tolerant systems are, ideally, systems capable of
executing their tasks correctly regardless of either
hardware failures or software errors.
Real Time and Fault Tolerance
Failure Causes
There are three causes of failure:
Errors in the specification or design,
Defects in the components,
Environmental effects.
Fault Types
Faults are classified according to their behavior:
1. Temporal behavior
2. Output behavior.
A fault is said to be active when it is physically
capable of generating errors and to be benign when
it is not.
Temporal Behavior
Transient faults:
These occur once and then disappear
Intermittent faults:
Intermittent faults are characterized by a fault
occurring, then vanishing again, then reoccurring, then
vanishing.
Permanent faults:
This type of failure is persistent: it continues to exist
until the faulty component is repaired or replaced.
Fault Detection
Definition
There are two ways to determine that a processor
is malfunctioning:
1. Online detection
2. Offline detection
Online detection goes on in parallel with normal
system operation.
One way of doing this is to check for any behavior
that is inconsistent with correct operation.
A monitor (called a watchdog processor) is associated
with each processor, looking for signs that the
processor is faulty.
The watchdog processor watches the data and
address lines, as shown in Figure.
A second approach is to have multiple processors,
which are supposed to put out the same result, and
compare the results.
A discrepancy indicates the existence of a fault.
[Figure: Online detection using a watchdog processor]
The following actions are indicative of a faulty
processor.
Branching to an invalid destination.
Fetching an opcode from a location containing data.
Writing into a portion of memory to which the
process has no write access.
Fetching an illegal opcode.
Inactive for more than a prescribed period.
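
The last symptom in the list above (inactivity beyond a prescribed period) can be sketched as a simple watchdog timer. This is a minimal illustration only, not the watchdog processor of the figure: the class name `Watchdog`, the `kick` method and the injectable `clock` parameter are all assumptions made for the example.

```python
import time


class Watchdog:
    """Flags a monitored unit as suspect if it stays silent too long."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock          # injectable so the sketch can be tested
        self.last_kick = clock()

    def kick(self):
        """Called by the monitored unit to signal that it is still active."""
        self.last_kick = self.clock()

    def expired(self):
        """True if no kick arrived within the prescribed period."""
        return self.clock() - self.last_kick > self.timeout_s
```

The monitored unit would call `kick()` periodically, and the monitor would poll `expired()` and treat a `True` result as a sign of a possible fault.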
Offline detection consists of running diagnostic tests.
When a processor is running such a test, it obviously
cannot be executing the applications software.
Diagnostic tests can be scheduled just like ordinary
tasks.
The greater the failure rate, the greater must be the
frequency with which these tests are run.
Fault and Error containment
When a fault or error occurs in one part of the system,
it can, if unchecked, spread through the system like an
infectious disease.
A fault in one part of the system might,
for example, cause large voltage swings in another; a
fault-free processor can put out erroneous results as a
result of using erroneous input from a faulty unit.
Faults and errors must therefore be prevented from
spreading through the system. This is called
fault and error containment.
The system is divided into
Fault Containment Zones (FCZ):
An FCZ is a subset of the system that operates
correctly despite arbitrary logical or electrical faults
outside the subset.
That is, the failure of some part of the computer
outside an FCZ cannot cause any element inside
that FCZ to fail.
Error Containment Zones (ECZ)
The function of an ECZ is to prevent errors from
propagating across zone boundaries. This is
typically achieved by voting redundant outputs.
Redundancy
Four Types:
Hardware redundancy: The system is provided with far
more hardware than it would need if all the components
were perfectly reliable.
Software redundancy: The system is provided with
different software versions of tasks, so that when one
version of a task fails under certain inputs, another
version can be used.
Time redundancy: The task schedule has some slack in
it, so that some tasks can be rerun if necessary and
still meet critical deadlines.
Information redundancy: The data are coded in such a
way that a certain number of bit errors can be detected
or corrected.
Hardware Redundancy
Two types: static (or masking) and dynamic
redundancy
Static: redundant components are used inside a
system to hide the effects of faults, e.g. Triple Modular
Redundancy (TMR).
TMR: three identical subcomponents and majority voting
circuits; the outputs are compared, and if one differs
from the other two, that output is masked out.
Dynamic: redundancy supplied inside a component
which indicates that the output is in error; provides an
error detection facility; recovery must be provided by
another component.
E.g. communications checksums and memory parity
bits.
N-Modular Redundancy
N-modular redundancy (NMR) is a scheme for
forward error recovery.
It works by using N processors instead of
one, and voting on their output. N is usually
odd.
The figure illustrates this scheme for N = 3.
One of two approaches is possible.
In design (a), there are N voters and the
entire cluster produces N outputs. In design
(b), there is just one voter.
[Figure: N-Modular Redundancy for N = 3, designs (a) and (b)]
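
The voting step of NMR can be sketched in software as a strict-majority voter. This is an illustrative sketch only; the function name `majority_vote` is an assumption, and it presumes that module outputs are hashable values that can be compared for exact equality.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of the N modules.

    Raises ValueError when no value has a strict majority, i.e. when
    the faults could not be masked.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority among module outputs")
    return value
```

With N = 3, a call such as `majority_vote([7, 7, 9])` masks the single disagreeing module, which is why N is usually chosen to be odd.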
Software Redundancy
The system is provided with different software versions
of a task, written independently by different teams of
programmers.
If one version of a task fails under certain inputs,
another version can be used.
Software Redundancy
N-Version Programming
Recovery Block Approach
N-Version Programming
The N-version software concept attempts to parallel the
traditional hardware fault tolerance concept of N-way
redundant hardware.
In an N-version software system, each module is made
with up to N different implementations. Each variant
accomplishes the same task, but hopefully in a different
way.
Each version then submits its answer to a voter or
decider, which determines the correct answer.
This system can hopefully overcome the design faults
present in most software by relying upon the design
diversity concept.
An important distinction in N-version software is the
fact that the system could include multiple types of
hardware using multiple versions of software.
The goal is to increase the diversity in order to avoid
common mode failures.
Using N-version software, it is encouraged that each
different version be implemented in as diverse a
manner as possible, including different tool sets,
different programming languages, and possibly
different environments
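
A toy sketch of the N-version idea follows. The three versions and the decider are hypothetical and written only for this illustration: each version computes 1 + 2 + ... + n in a different way, `version_c` contains a deliberate design fault, and `decide` accepts the answer that a strict majority of versions agree on.

```python
from collections import Counter


def version_a(n):
    """Iterative implementation."""
    total = 0
    for i in range(1, n + 1):
        total += i
    return total


def version_b(n):
    """Closed-form implementation."""
    return n * (n + 1) // 2


def version_c(n):
    """Deliberately faulty variant: the range stops one short."""
    return sum(range(1, n))


def decide(versions, n):
    """Run every version and return the answer most of them agree on."""
    answers = []
    for version in versions:
        try:
            answers.append(version(n))
        except Exception:
            pass  # a crashed version simply contributes no answer
    if not answers:
        raise RuntimeError("every version failed")
    value, count = Counter(answers).most_common(1)[0]
    if count * 2 <= len(versions):
        raise RuntimeError("no majority among versions")
    return value
```

Here the decider masks the fault in `version_c` because the other two versions fail independently of it; a fault shared by a majority of versions (a common mode failure) would not be masked.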
Recovery Block Approach
The recovery block operates with an adjudicator which
confirms the results of various implementations of the
same algorithm.
In a system with recovery blocks, the system view is
broken down into fault recoverable blocks.
The entire system is constructed of these fault tolerant
blocks.
Each block contains at least primary, secondary, and
exceptional case code along with an adjudicator.
The adjudicator is the component which determines
the correctness of the various blocks to try.
Upon first entering a unit, the adjudicator first executes
the primary alternate.
If the adjudicator determines that the primary block
failed, it then tries to roll back the state of the system
and tries the secondary alternate.
If the adjudicator does not accept the results of any of
the alternates, it then invokes the exception handler,
which then indicates the fact that the software could
not perform the requested operation.
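
The control flow described above can be sketched as follows. This is a minimal illustration under stated assumptions: the names `recovery_block`, `primary`, `secondary` and `is_sorted` are hypothetical, state rollback is approximated by deep-copying a saved state, and the acceptance test plays the role of the adjudicator's check.

```python
import copy


def recovery_block(state, alternates, acceptance_test):
    """Try each alternate in turn, rolling back the state between tries."""
    checkpoint = copy.deepcopy(state)            # recovery point
    for alternate in alternates:
        candidate = copy.deepcopy(checkpoint)    # roll back before each try
        try:
            result = alternate(candidate)
        except Exception:
            continue                             # a crash counts as a failure
        if acceptance_test(candidate, result):
            return candidate, result
    # No alternate was accepted: the exception handler case.
    raise RuntimeError("could not perform the requested operation")


def primary(state):
    """Hypothetical faulty primary: reverses instead of sorting."""
    state["data"].reverse()
    return state["data"]


def secondary(state):
    """Hypothetical secondary alternate: a correct sort."""
    state["data"].sort()
    return state["data"]


def is_sorted(state, result):
    """Acceptance test used by the adjudicator in this sketch."""
    return all(a <= b for a, b in zip(result, result[1:]))
```

Calling `recovery_block({"data": [3, 1, 2]}, [primary, secondary], is_sorted)` rejects the primary's result, restores the saved state, and accepts the secondary's result.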
Software Redundancy Structures
Time Redundancy
Achieves fault tolerance by performing an operation
several times.
Timeouts and retransmissions in reliable point-to-point
and group communication are examples of
time redundancy.
This form of redundancy is useful in the presence of
transient or intermittent faults. It is of no use with
permanent faults.
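
A minimal sketch of time redundancy is a bounded retry loop. The helper name `run_with_retries` and the `max_attempts` parameter are assumptions made for this example; a real-time system would also have to check that the retries fit within the slack before the deadline.

```python
def run_with_retries(operation, max_attempts=3):
    """Repeat an operation until it succeeds or the attempts run out.

    Returns (result, attempts_used). Retrying helps with transient or
    intermittent faults; a permanent fault exhausts every attempt.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return operation(), attempt
        except Exception as exc:
            last_error = exc
    raise RuntimeError(f"failed after {max_attempts} attempts") from last_error
```

An operation that fails twice and then succeeds is recovered on the third attempt, whereas one that always fails ends in the final `RuntimeError`.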
Time Redundancy
1. Recovery Points
2. Backward Error Recovery
Information Redundancy
The basic idea of information redundancy is to provide
more information than is strictly necessary and to use
that extra information to check for errors.
We use coding all the time ourselves, while correcting
for typographical errors.
For example, if we encounter the word startegic, we
will most likely unconsciously correct it to strategic.
This was possible because (a) there is no such word as
startegic, and (b) strategic is the closest word that
we can think of to startegic.
The conditions (a) and (b) are at the basis of all coding
theory.
All computer words are strings of 0s and 1s. Coding
ensures that not all strings of 0s and 1s are legal (i.e., are
valid).
When assessing a coding scheme, we want to know how
many extra bits it adds to the words, and how many bit
errors it can detect or correct.
We are also interested in how much work it takes to encode
and decode the words.
Information Redundancy structures
Repetition Codes
Parity coding
Checksum codes
Cyclic Redundancy check
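
The first two structures in the list above can be sketched in a few lines. The function names are assumptions made for this illustration, and bits are represented as Python lists of 0s and 1s. Even parity adds one extra bit and detects any odd number of bit errors; a threefold repetition code triples the word length and corrects one bit error per group of three.

```python
def add_even_parity(bits):
    """Append one bit so that the total number of 1s is even."""
    return bits + [sum(bits) % 2]


def parity_ok(word):
    """Detects any odd number of bit errors; it cannot correct them."""
    return sum(word) % 2 == 0


def repeat3_encode(bits):
    """Repetition code: send every bit three times."""
    return [b for b in bits for _ in range(3)]


def repeat3_decode(word):
    """Majority vote per group of three; corrects one error per group."""
    return [1 if sum(word[i:i + 3]) >= 2 else 0
            for i in range(0, len(word), 3)]
```

The comparison illustrates the trade-off between the number of extra bits a code adds and the number of bit errors it can detect or correct.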
Data diversity
Data diversity is an approach that can be used in
association with any of the redundancy techniques
considered above.
Sometimes, hardware or software may fail for certain
inputs, but not for other inputs that are very close to
them.
So, instead of applying the same input data to the
redundant processors, we apply slightly different input
data to them.
Thus we have in some cases another line of defense
against failure.
This approach will only work if the sensitivity of the ...
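
A small sketch of data diversity follows. Everything in it is hypothetical: `naive_sinc` stands in for a routine that fails on one specific input (x = 0), and `with_data_diversity` applies the original input plus two slightly perturbed copies, then takes the median of the results that were produced. It assumes that a perturbation of size `eps` changes the correct answer only negligibly.

```python
import math
import statistics


def naive_sinc(x):
    """Hypothetical routine with an input-specific failure at x == 0."""
    return math.sin(x) / x


def with_data_diversity(f, x, eps=1e-6):
    """Apply f to x and to two nearby inputs; return the median result."""
    results = []
    for xi in (x, x + eps, x - eps):
        try:
            results.append(f(xi))
        except ArithmeticError:
            pass  # this particular input triggered the failure
    if not results:
        raise RuntimeError("all diverse inputs failed")
    return statistics.median(results)
```

At x = 0 the direct call raises `ZeroDivisionError`, but the two perturbed inputs still yield a value very close to the true limit of 1.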
Reversal Checks
Introduction
If there is a simple relationship between the inputs and
outputs of a system, it may then be possible to
calculate the inputs given the outputs.
This can then be compared with the actual inputs as a
check.
For example, consider a task that finds the square root
of a number.
To see if the process is correct, we can square the
output and check it against the original input. Or let the
task consist of writing a block onto disk.
The reverse operation consists of reading this block
from the disk after writing and comparing it to the input
to make sure that the two are the same.
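
The square root example can be sketched as below. The function names and the tolerance values are assumptions made for this illustration; because floating-point squaring is not exact, the check compares within a tolerance instead of testing for equality.

```python
import math


def passes_reversal_check(x, y, rel_tol=1e-9):
    """True if squaring the claimed root y reproduces the input x."""
    return math.isclose(y * y, x, rel_tol=rel_tol, abs_tol=1e-12)


def checked_sqrt(x):
    """Compute a square root and verify it by reversing the operation."""
    root = math.sqrt(x)
    if not passes_reversal_check(x, root):
        raise ArithmeticError("reversal check failed")
    return root
```

A wrong output such as 1.5 for an input of 2.0 is rejected, since 1.5 squared is 2.25.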
MALICIOUS OR BYZANTINE FAILURES
Introduction
Whenever a failure can cause a unit to behave
arbitrarily, malicious or Byzantine failure is said to
happen.
For correct operation, it is often the case that copies of
the same data as seen by various processors must be
consistent (i.e., the same).
When communication is limited to two-party messages,
the faulty units must be fewer than a third of the total
number of units if consistency is to be guaranteed.
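
The "fewer than a third" condition is equivalent to n >= 3m + 1, where n is the total number of units and m the number of faulty units. The two helper functions below (their names are assumptions for this illustration) simply evaluate that bound.

```python
def max_tolerable_faults(n):
    """Largest m such that m < n / 3, i.e. n >= 3m + 1."""
    if n < 1:
        raise ValueError("need at least one unit")
    return (n - 1) // 3


def min_units_for(m):
    """Smallest total number of units that tolerates m faulty ones."""
    return 3 * m + 1
```

Three units therefore cannot tolerate even one Byzantine fault, while four units can tolerate one.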
Integrated failure handling
Introduction
When an error is detected, the system must
respond swiftly to deal with it.
In the short term, the error might be masked by
voting
In the long term, the system will have to locate the
failure that gave rise to the error and decide what
to do with the failed unit.
Three options are usually available:
1. Retry
2. Disconnect
3. Replace
Network fault tolerance:
Includes
Reliable communication protocols
Agreement protocols
Database commit protocols
Agreement in faulty systems
Introduction
Two Army Problem:
We'll first examine the case of good processors but
faulty communication lines.
This is known as the two army problem.
Byzantine agreement:
The source processor broadcasts its initial value to
all other processors.
Agreement: All nonfaulty processors agree on the
same value.
Validity: If the source processor is nonfaulty, the
common agreed upon value by all nonfaulty
processors should be the initial value of the source.
Checkpointing & Recovery
Checkpoint-Recovery is a common technique for
imbuing a program or system with fault tolerant
qualities, and grew from the ideas used in systems
which employ transaction processing.
It allows systems to recover after some fault
interrupts the system, and causes the task to fail,
or be aborted in some way.
While many systems employ the technique to
minimize lost processing time, it can be used more
broadly to tolerate and recover from faults in a
critical application or task.
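
A minimal sketch of checkpoint-recovery for a single task follows. It is not tied to any particular system: the function names, the JSON file format and the `fail_at` parameter (used only to simulate a fault) are all assumptions made for this example. The task saves its progress after every item, so a restart resumes from the last checkpoint and does not repeat completed work.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state atomically so a crash never leaves a partial file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def load_checkpoint(path, default):
    """Return the last saved state, or default if none exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default


def sum_with_checkpoints(items, path, fail_at=None):
    """Sum the items, checkpointing after each one; resume if restarted."""
    state = load_checkpoint(path, {"next": 0, "total": 0})
    for i in range(state["next"], len(items)):
        if i == fail_at:
            raise RuntimeError("simulated fault")
        state["total"] += items[i]
        state["next"] = i + 1
        save_checkpoint(path, state)
    return state["total"]


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "sum.ckpt")
    try:
        sum_with_checkpoints([1, 2, 3, 4], path, fail_at=2)
    except RuntimeError:
        pass  # the task was interrupted by the simulated fault
    print(sum_with_checkpoints([1, 2, 3, 4], path))  # resumes at item 2
```

Checkpointing after every item minimizes lost work at the cost of more writes to stable storage; a coarser checkpoint interval trades the other way.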
Micro checkpointing
A single checkpoint buffer is maintained per
multithreaded ARMOR process.
The element state is checkpointed after each operation.
Checkpoints are committed to stable storage after
processing a message.
There is no need to do process-wide checkpoints of
the stacks or the heap.
The existing locking policy of element data prevents
the need to suspend all threads.
IRIX checkpointing
Facility for saving running processes and, at some
other time, restarting the saved processes from the
point already reached, without starting all over again.
A checkpoint image is saved in a set of disk files and
can comprise:
A set of processes
All processes in the process group (a set of
processes that constitute a logical job)
All processes in a process session (a set of
processes started from the same physical or logical
terminal)
THANK YOU