
Fault Tolerance in Distributed Systems

This document discusses failures, faults, and fault tolerance in systems. It defines key terms like failure, error, fault, and explains that while perfect software is impossible, fault tolerance aims to increase dependability by allowing systems to function correctly despite internal faults. Faults are classified by duration (transient or permanent) or cause (design faults or operational faults). The general process of fault tolerance includes error detection, error recovery, and fault treatment. Error detection identifies invalid states, while recovery restores the system to a valid state either by rolling back or moving forward. Fault treatment repairs or replaces the failed component.

Failures and Fault Tolerance

Classification of failures
Security
Fundamentals of Fault tolerance
It is simply not possible to devise absolutely
foolproof, 100% reliable software.
The best we can do is to reduce the
probability of failure to an "acceptable" level.
Fault tolerance is the ability of a system to
perform its function correctly even in the
presence of internal faults. The purpose of
fault tolerance is to increase the dependability
of a system.

A failure occurs when an actual running system
deviates from its specified behavior. The cause
of a failure is called an error.
An error represents an invalid system state, one
that is not allowed by the system behavior
specification. The error itself is the result of a
defect in the system, called a fault; the fault is
the root cause of a failure.
A fault may not necessarily result in an error, but
the same fault may result in multiple errors.
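
The fault, error, failure chain can be illustrated with a small sketch. Everything here is hypothetical: `average` is assumed to contain a fault (the empty-input case was forgotten), which stays dormant for most inputs. Only an empty list activates it and produces an error, and `report` shows one way to keep that error from surfacing as a failure.

```python
def average(values):
    # Fault (hypothetical): the empty-list case was never considered.
    return sum(values) / len(values)


def report(values):
    """Return a printable average, tolerating the fault in `average`."""
    try:
        return f"{average(values):.1f}"
    except ZeroDivisionError:
        # Error detected: the dormant fault was activated by an empty input.
        # Handling it here keeps the error from becoming a visible failure.
        return "n/a"
```

With non-empty input the fault never causes an error, which matches the point that a fault does not necessarily result in one.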

Fault Classification
Based on duration, faults can be classified as transient or
permanent.
A different way to classify faults is by their underlying
cause.
Design faults are the result of design failures, that is, flaws
introduced when the system was designed or built.
Operational faults, on the other hand, are faults that occur during
the lifetime of the system and are invariably due to physical
causes.
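
The duration-based classification matters in practice because a transient fault can often be tolerated simply by trying again, while a permanent one cannot. The sketch below assumes a hypothetical `TransientFault` exception and a test helper `make_flaky`; neither comes from the original slides.

```python
import time


class TransientFault(Exception):
    """Hypothetical marker for a fault that is expected to clear by itself."""


def call_with_retry(operation, attempts=3, delay=0.0):
    """Retry `operation` on TransientFault; re-raise after `attempts` tries."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientFault:
            if attempt == attempts:
                raise  # still failing: treat the fault as permanent
            time.sleep(delay)


def make_flaky(failures):
    """Build an operation that fails `failures` times, then succeeds."""
    state = {"left": failures}

    def operation():
        if state["left"] > 0:
            state["left"] -= 1
            raise TransientFault("temporary glitch")
        return "ok"

    return operation
```

An operation that fails twice succeeds on the third attempt, whereas one that keeps failing exhausts the retries and has to be handled as a permanent fault.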

General Fault Tolerant Procedure
Fault tolerance is a series of distinct activities that are
typically (although not necessarily) performed in
sequence.
Error detection is the process of identifying that
the system is in an invalid state. It is followed by
damage confinement. In other words, we first treat the
symptoms and then go after the underlying cause.
The most common techniques for error detection
are replication checks, timing checks, run-time
constraint checks, and diagnostic checks.
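
Two of the detection techniques listed above can be sketched in a few lines. This is a minimal illustration under assumed names (`timing_check`, `constraint_check`), not a prescribed implementation: the timing check measures how long an operation took, and the run-time constraint check verifies that a value stays inside an allowed range.

```python
import time


def timing_check(operation, deadline_s):
    """Run `operation`; report an error if it took longer than `deadline_s`."""
    start = time.monotonic()
    result = operation()
    elapsed = time.monotonic() - start
    if elapsed > deadline_s:
        raise TimeoutError(f"took {elapsed:.3f}s, limit {deadline_s}s")
    return result


def constraint_check(value, low, high):
    """Run-time constraint check: `value` must lie in [low, high]."""
    if not (low <= value <= high):
        raise ValueError(f"{value} outside [{low}, {high}]")
    return value
```

Note that this timing check only notices an overrun after the operation returns. Detecting an operation that never returns would need a separate watchdog, which is outside this sketch.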



Error Recovery
The system needs to be restored to a valid
state. Two general approaches exist.
In backward error recovery, the system is
restored to a previous known valid state. This
often requires checkpointing the system state
and, once an error is detected, rolling back the
system state to the last checkpointed state.
Where rolling back is not practical, forward error
recovery is more appropriate. This
involves driving the system from the erroneous
state to a new valid state.
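
Backward error recovery can be sketched as checkpoint, update, validate, and roll back on error. The class and function names below (`CheckpointedState`, `apply_update`) are assumptions made for illustration, and the sketch keeps only a single in-memory checkpoint.

```python
import copy


class CheckpointedState:
    """Hypothetical state holder that keeps one checkpoint."""

    def __init__(self, data):
        self.data = data
        self._checkpoint = copy.deepcopy(data)

    def checkpoint(self):
        # Save a full copy of the current (assumed valid) state.
        self._checkpoint = copy.deepcopy(self.data)

    def rollback(self):
        # Discard the current state and restore the saved copy.
        self.data = copy.deepcopy(self._checkpoint)


def apply_update(state, update, is_valid):
    """Checkpoint, apply `update`, and roll back if the result is invalid."""
    state.checkpoint()
    update(state.data)
    if not is_valid(state.data):
        state.rollback()  # error detected: return to the last valid state
        return False
    return True
```

For example, with a state of `{"balance": 100}` and a validity rule that the balance must not be negative, a withdrawal of 500 is rolled back and the balance stays at its checkpointed value.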

Fault Treatment
Repair or replace the faulty component.
COLD, WARM
and HOT standby components

Common questions


Error detection contributes to fault tolerance by identifying invalid states of the system as soon as they occur, allowing for timely interventions before the errors escalate into failures. It acts as the first step in addressing the symptoms of faults, which is crucial for damage confinement. Common techniques for error detection include replication checks, which compare redundant processes or data for discrepancies; timing checks, which monitor for timing anomalies in processes; run-time constraint checks, which ensure operations do not exceed predetermined limits; and diagnostic checks, which actively scan for signs of failure. These methods enable early detection and management of errors, thereby maintaining system dependability.
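
A replication check can be sketched as a majority vote over the outputs of redundant replicas. The function name `replication_check` and the choice of majority voting are assumptions for illustration; other comparison schemes are possible.

```python
from collections import Counter


def replication_check(results):
    """Return (majority_value, disagreeing_indices) for replica outputs."""
    if not results:
        raise ValueError("no replica results to compare")
    value, count = Counter(results).most_common(1)[0]
    if count <= len(results) // 2:
        # Replicas disagree and no value has a strict majority.
        raise RuntimeError("no majority: error detected but not masked")
    suspects = [i for i, r in enumerate(results) if r != value]
    return value, suspects
```

With three replicas returning `42, 42, 41`, the check accepts `42` and flags the third replica as suspect. The results must be hashable for `Counter` to compare them.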

Fault treatment involves identifying, isolating, and correcting the underlying fault to prevent recurrence, often through repairs or component replacements, whereas error recovery is focused on restoring system correctness after an error occurs. Fault treatment is crucial for preventing the same fault from causing future errors, thereby addressing the root of the problem. Error recovery is necessary for maintaining system operation and preventing errors from escalating into failures. Together, they ensure long-term system reliability and immediate operational continuity by handling both the symptoms and causes of disturbances within the system.

Backward error recovery involves restoring the system to a previously known valid state by checkpointing the system state and rolling back when an error is detected. This approach is suitable for scenarios where reverting to a prior state is feasible and data loss can be minimized, such as database systems where transactions can be undone to maintain consistency. Forward error recovery, on the other hand, involves driving the system from an erroneous state to a new valid state without reverting to past states. It is appropriate in scenarios where it is either impossible or impractical to reverse states, such as in real-time systems where returning to a previous state might not be feasible due to time constraints or data streams.
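
Forward error recovery on a data stream can be sketched as follows. The scenario is hypothetical: sensor readings arrive one at a time and cannot be replayed, so an out-of-range reading is replaced by the last known good value and processing simply continues. The name `forward_recover` and its parameters are assumptions.

```python
def forward_recover(readings, low, high, default):
    """Replace out-of-range readings so that processing can continue."""
    cleaned = []
    last_good = default
    for value in readings:
        if low <= value <= high:
            last_good = value
        # Forward recovery: for an invalid reading, move on to a new valid
        # state (the last good value) instead of rolling anything back.
        cleaned.append(last_good)
    return cleaned
```

Substituting the last good value is only one possible policy. Which new valid state is acceptable depends on the application.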

A fault is the root cause that, if not addressed, can lead to an error, which is an invalid or incorrect system state. A failure occurs when this error leads to the system deviating from its specified behavior and thus being unable to perform its intended functions. This distinction is important because it helps in pinpointing the root cause of system issues; by understanding the progression from fault to error to failure, system architects can design robust fault-tolerant measures that target each aspect appropriately. Addressing faults can prevent errors and potential failures, hence maintaining system dependability and performance.

COLD standby components are only activated when a failure occurs and typically require longer recovery times, as these components need initial setup and data synchronization. WARM standby components are partially active, meaning they are periodically updated but not fully operational until needed; they offer moderate recovery times, as less initialization is needed compared to cold standby. HOT standby components are fully operational and synchronized in real time with the primary system, providing the shortest recovery times, as they can take over with minimal delay in case of a primary system failure. Each type of standby component impacts the speed and effectiveness of system recovery in different ways, allowing systems to tailor their fault tolerance strategy based on criticality and resource availability.
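
Replacing a failed component with a standby can be sketched as a simple failover pair. The classes `Replica` and `FailoverPair` are hypothetical, and the standby here is assumed to be ready immediately, which is closest to the HOT case; a COLD or WARM standby would need a start-up or synchronization step before taking over.

```python
class Replica:
    """Hypothetical component that may be healthy or failed."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def handle(self, request):
        if not self.healthy:
            raise RuntimeError(f"{self.name} is down")
        return f"{self.name} handled {request}"


class FailoverPair:
    """Route requests to the active replica; promote the standby on failure."""

    def __init__(self, primary, standby):
        self.active = primary
        self.standby = standby

    def handle(self, request):
        try:
            return self.active.handle(request)
        except RuntimeError:
            # Fault treatment: swap out the failed component and retry once.
            self.active, self.standby = self.standby, self.active
            return self.active.handle(request)
```

If the standby has failed as well, the second call raises and the failure becomes visible to the caller.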

The primary objective of implementing fault tolerance in software systems is to ensure the system's ability to perform its functions correctly even in the presence of faults, thereby increasing its dependability. Fault tolerance is achieved by reducing the probability of system failure to an "acceptable" level, even though it is impossible to create a completely foolproof system. By maintaining correct functionality despite internal faults, dependability is enhanced, as the system can continue to operate appropriately without major disruptions. This involves error detection, error recovery, and fault treatment, which help prevent small issues from escalating into larger failures.

System checkpointing aids backward error recovery by periodically saving the system state, allowing it to roll back to a known good state when an error is detected. This maintains system integrity and reduces recovery time by only needing to restore from the last checkpoint. However, risks include potential data loss if checkpoints are infrequent, and performance overhead due to resource usage and time spent in saving states. If not managed properly, checkpointing itself can introduce new errors or inconsistencies, especially in systems with high transaction volumes or in real-time environments where state integrity is critical.

Timing checks play a role in enhancing security by ensuring operations occur within expected timeframes, which helps prevent unauthorized delays or accelerations that might indicate tampering or faults. By monitoring whether processes occur within their predetermined timelines, the system can detect anomalies early and prevent potential breaches or failures. Similarly, run-time constraint checks ensure operations remain within specific limits, preventing overruns or under-runs that can compromise system integrity or security. These checks help in identifying errors that might occur due to environmental changes or malicious actions, thus protecting the system's reliability and security fundamentals.

Design faults, resulting from errors in the system's design phase, pose significant challenges because they are deeply embedded within the system architecture and often require substantial redesign efforts to address. Overcoming these challenges involves thorough validation and verification processes during the design phase, such as extensive testing, code reviews, and formal methods to ensure design correctness. Redundancy can also be used to mitigate the impact of design faults; by using diverse design techniques and implementing multiple independent designs, the system can tolerate certain design errors. Additionally, adaptive systems and self-healing algorithms can dynamically adjust operations in the presence of identified design faults, providing a level of resilience against failures.

Faults can be classified based on their duration as transient or permanent. Transient faults are temporary and often resolve on their own or with minor intervention, whereas permanent faults persist and require corrective action to fix. Based on their underlying cause, faults are categorized into design faults, arising from design flaws, and operational faults, which occur due to physical causes during the system's lifetime. The classification has significant implications for error recovery strategies: transient faults might only need forward error recovery, where the system is driven to a new valid state, while permanent or design faults could require backward error recovery, where the system is rolled back to a previous valid state, often through checkpointing mechanisms.
