FAULT TOLERANCE
The ability of a system to continue operating without interruption despite the failure of one or
more of its components.
Describes how an OS responds to and tolerates malfunctions and failures.
Ideally, it guarantees no break in service.
The system recovers from failure completely and transparently.
FAULT TOLERANCE
Every gain in fault tolerance comes with a drawback somewhere else.
The system will be slower, take more disk space, use more machines, and incur other
additional costs.
Therefore, fault tolerance is always a trade-off between cost and the degree of tolerance
achieved.
FAILURE VS ERROR
Failure: the system's behavior differs from its expected behavior.
A failure might involve the system being unreachable or producing incorrect output.
Error: an incorrect part of the system's state that may lead to a failure.
Errors do not necessarily cause failures, and they can be detected in the system before they
produce a failure.
FAULT TOLERANCE
Fault tolerance usually proceeds through several phases.
Error detection: the error has to be detected in order to avoid a failure.
Damage confinement: the error must be prevented from spreading to other components.
Error recovery: the error must be removed, otherwise the system would run into a failure.
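
The three phases can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout, the `quarantined` list, and the function names are hypothetical, and a CRC32 checksum stands in for whatever detection mechanism a real system uses.

```python
import zlib

def make_record(value: bytes) -> dict:
    """Save a value together with a CRC32 checksum used for error detection."""
    return {"data": value, "crc": zlib.crc32(value)}

def is_corrupt(record: dict) -> bool:
    # Error detection: the stored checksum no longer matches the data.
    return zlib.crc32(record["data"]) != record["crc"]

def read_with_recovery(primary: dict, backup: dict, quarantined: list) -> bytes:
    if is_corrupt(primary):
        # Damage confinement: set the bad record aside so nothing else uses it.
        quarantined.append(dict(primary))
        # Error recovery: overwrite the corrupt state with the good backup copy.
        primary.update(backup)
    return primary["data"]

primary = make_record(b"balance=100")
backup = dict(primary)
primary["data"] = b"balance=900"      # simulate an error in the stored state
quarantined = []
result = read_with_recovery(primary, backup, quarantined)
```

Here the error is caught before any caller sees the wrong balance, so it never becomes a failure.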
PROCESSOR FAULT
Occurs when the processor behaves in an unexpected manner. Processor faults may be classified
into three kinds.
1. Fail stop: the processor has failed completely and will never respond; neighboring
processors can detect the failed processor.
2. Slowdown: the processor may run in a degraded fashion or may fail completely.
3. Byzantine: the processor may fail, run in a degraded fashion for some time, or execute at
normal speed while trying to make the computation fail.
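
One common way for neighbors to detect a fail-stop processor is a heartbeat timeout. The sketch below is a minimal, hypothetical detector (the class name, node names, and timeout value are assumptions); time is passed in explicitly so the example is deterministic.

```python
class HeartbeatDetector:
    """Suspects a processor once no heartbeat arrived within `timeout` seconds."""

    def __init__(self, timeout: float):
        self.timeout = timeout
        self.last_seen = {}           # node name -> time of last heartbeat

    def heartbeat(self, node: str, now: float) -> None:
        self.last_seen[node] = now

    def suspected(self, now: float) -> list:
        # A node is suspected when its last heartbeat is older than the timeout.
        return sorted(n for n, t in self.last_seen.items()
                      if now - t > self.timeout)

detector = HeartbeatDetector(timeout=3.0)
detector.heartbeat("p1", now=0.0)
detector.heartbeat("p2", now=0.0)
detector.heartbeat("p1", now=2.0)     # p1 keeps reporting, p2 has stopped
suspects = detector.suspected(now=4.0)
```

A timeout alone cannot tell a slow processor from a stopped one, and it says nothing about Byzantine behavior, since a Byzantine processor can keep sending heartbeats.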
NETWORK FAULTS
Occur when processors are prevented from communicating with each other. Link faults can cause
new kinds of problems, such as:
One-way links: one processor can send messages to another but cannot receive messages from it.
Network partition: a portion of the network is completely isolated from the rest.
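
Both link problems can be illustrated with a reachability check over directed links. The topology and node names below are made up for the example; `links[a]` lists the nodes that `a` can send to.

```python
from collections import deque

def reachable(links: dict, start: str) -> set:
    """Return every node that `start` can reach over directed links (BFS)."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in links.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

links = {
    "A": ["B"],   # one-way link: A can send to B, but B cannot send to A
    "B": [],
    "C": ["D"],   # C and D reach only each other: an isolated partition
    "D": ["C"],
}
one_way = "B" in reachable(links, "A") and "A" not in reachable(links, "B")
partitioned = reachable(links, "C") == {"C", "D"}
```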
ATTRIBUTES OF FAULT TOLERANT SYSTEM
A fault-tolerant system is a dependable system, which requires the following attributes:
1. Availability: the system is in a ready state and able to deliver its functions. A highly
available system is very likely to be working at any given instant in time.
2. Reliability: the ability of a computer to run continuously without failure. It is defined
over a time interval rather than at an instant. A reliable system works continuously without
interruption.
3. Safety: when the system fails to carry out its operations correctly, nothing catastrophic
happens and other systems are not made faulty.
4. Maintainability: failures can be noticed and fixed easily.
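
The difference between availability (an instant) and reliability (an interval) shows up when the two are computed. The sketch below uses the standard steady-state formula availability = MTTF / (MTTF + MTTR) and, as an assumption not made in these notes, a constant failure rate, so that reliability over an interval t is exp(-t / MTTF). The numbers are invented.

```python
import math

def availability(mttf: float, mttr: float) -> float:
    """Fraction of time the system is up (steady state)."""
    return mttf / (mttf + mttr)

def reliability(t: float, mttf: float) -> float:
    """Probability of running for time t with no failure (assumes constant failure rate)."""
    return math.exp(-t / mttf)

# Hypothetical system: fails on average every 999 hours, repaired in 1 hour.
a = availability(mttf=999.0, mttr=1.0)   # up 99.9% of the time
r = reliability(t=999.0, mttf=999.0)     # chance of 999 failure-free hours
```

Under these assumptions the system is highly available, yet it has only about a 37% chance of running a full 999 hours without interruption.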
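CLASSIFICATION OF FAILURE
Types of failure:
Transient: occurs once and then disappears without any action being taken.
Intermittent: appears, vanishes, and then reappears, often unpredictably.
Permanent: persists until the faulty component is repaired or replaced.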
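
Because a transient fault goes away on its own, simply retrying the operation is often enough, whereas a permanent fault fails every retry. The helper below is a hypothetical sketch (the function names and the use of `OSError` as the fault signal are assumptions).

```python
def retry(operation, attempts: int = 3):
    """Call `operation` until it succeeds or `attempts` calls have failed (attempts >= 1)."""
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except OSError as exc:
            last_error = exc          # remember the fault and try again
    raise last_error

calls = {"n": 0}

def flaky_read():
    # Simulated transient fault: fails twice, then works.
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("temporary glitch")
    return "ok"

def dead_disk():
    # Simulated permanent fault: fails every time.
    raise OSError("disk is gone")

transient_result = retry(flaky_read)
try:
    retry(dead_disk)
    permanent_result = "ok"
except OSError:
    permanent_result = "failed"
```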
FAULT TOLERANCE MECHANISM IN DISTRIBUTED SYSTEM
Replication based fault tolerance technique.
Process level redundancy technique.
Fusion based redundancy technique.
REPLICATION BASED FAULT TOLERANCE TECHNIQUE
Data is replicated on other machines, so the failure of one machine does not cause the whole
system to stop.
Replicas are placed on different servers.
Problems of replication:
Consistency: the major problem of replication is keeping the replicas consistent, because any
client may update the data. Consistency is ensured by a consistency model, such as the
sequential or causal memory consistency model.
Degree of replication: a large number of replicas is needed in order to achieve high fault
tolerance.
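
A small single-process simulation shows both the benefit and the consistency problem. Everything here is hypothetical: the class names, the version counter, and the "newest version wins" read rule are one simple choice for illustration, not one of the consistency models named above.

```python
class Replica:
    def __init__(self):
        self.alive = True
        self.data = {}                # key -> (version, value)

class ReplicatedStore:
    def __init__(self, n: int):
        self.replicas = [Replica() for _ in range(n)]
        self.version = 0

    def write(self, key, value):
        self.version += 1
        for r in self.replicas:
            if r.alive:               # crashed replicas miss the update
                r.data[key] = (self.version, value)

    def read(self, key):
        live = [r.data[key] for r in self.replicas if r.alive and key in r.data]
        if not live:
            raise KeyError(key)
        return max(live, key=lambda vv: vv[0])[1]   # newest version wins

kv = ReplicatedStore(n=3)
kv.write("x", "old")
kv.replicas[0].alive = False          # one machine fails
kv.write("x", "new")                  # the system keeps working
value_during_failure = kv.read("x")
kv.replicas[0].alive = True           # the replica returns with stale data
stale_copy = kv.replicas[0].data["x"][1]
value_after_recovery = kv.read("x")
```

The recovered replica still holds `"old"`, which is exactly the inconsistency a consistency model has to rule out or repair.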
PROCESS LEVEL REDUNDANCY TECHNIQUES
Faults that disappear without any action being taken are called transient faults. This type of
fault is hard to identify.
To handle transient faults, software-based fault tolerance techniques are used.
PLR compares the outputs of redundant processes to ensure correct execution.
Checkpointing and rollback are popular techniques in which the current state of the system is
saved so that it can be restored after a fault.
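
Both ideas can be sketched briefly. This is a simplification: real process-level redundancy runs separate OS processes and compares them as they execute, while the hypothetical `run_redundant` below just calls a function several times and takes a majority vote. The `fault_in` parameter and the `Checkpointed` class are likewise invented for the example.

```python
import copy
from collections import Counter

def run_redundant(func, arg, copies=3, fault_in=None):
    """Run `func` several times and return the majority output."""
    outputs = []
    for i in range(copies):
        out = func(arg)
        if i == fault_in:             # simulate a transient fault in one copy
            out = out + 1
        outputs.append(out)
    value, votes = Counter(outputs).most_common(1)[0]
    if votes <= copies // 2:
        raise RuntimeError("no majority among redundant copies")
    return value

class Checkpointed:
    """Holds a state plus the last saved copy of it."""

    def __init__(self, state):
        self.state = state
        self._saved = copy.deepcopy(state)

    def checkpoint(self):
        self._saved = copy.deepcopy(self.state)

    def rollback(self):
        self.state = copy.deepcopy(self._saved)

voted = run_redundant(lambda x: x * x, 7, fault_in=1)

job = Checkpointed({"processed": 0})
job.state["processed"] = 10
job.checkpoint()                      # save the current state
job.state["processed"] = -999         # a fault corrupts the state
job.rollback()                        # restore the saved state
```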
FUSION BASED TECHNIQUE
The downside of replication is that it needs multiple backups, which increases cost.
The fusion-based technique addresses this problem because it requires fewer backups.
Backup machines are fused with respect to a given set of machines (NP problem).
The fusion-based technique has very high overhead during the recovery process, so it is
acceptable when the probability of a fault in the system is low.
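
The trade-off can be illustrated with a deliberately simplified stand-in for fusion: a single XOR backup that covers several primaries. This is not the actual fusion algorithm, only a sketch of the idea of fewer backups paid for with costlier recovery. The integer states and function names are assumptions, and this version tolerates the loss of one primary only.

```python
from functools import reduce
from operator import xor

def fuse(states: list) -> int:
    """Combine all primary states into one backup value using XOR."""
    return reduce(xor, states, 0)

def recover(states: list, lost: int, fused: int) -> int:
    """Rebuild the state at index `lost` from the survivors and the fused backup."""
    survivors = [s for i, s in enumerate(states) if i != lost]
    return reduce(xor, survivors, fused)

primaries = [12, 7, 42, 5]            # states of four primary machines
fused_backup = fuse(primaries)        # one backup instead of four replicas
recovered = recover(primaries, lost=2, fused=fused_backup)
```

Recovery has to read every surviving primary as well as the backup, whereas plain replication would copy from a single replica. That extra work is the recovery overhead mentioned above.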