Fault-Tolerant Computing Overview

Fault-tolerant computing aims to build systems that can continue operating despite faults. It incorporates redundancy through additional hardware, software, data, or design diversity. Common approaches to hardware fault tolerance include fault masking using triple modular redundancy and dynamic recovery using spare components. Software faults can be tolerated using acceptance tests and redundant code blocks. Information redundancy adds check bits for error detection and correction. Fault-tolerant systems are validated through modeling and fault simulations. They find applications in safety-critical systems like medical devices and airplanes to avoid failures.

Uploaded by

Mayowa Sunusi

Fault-Tolerant Computing

INTRODUCTION
Fault-tolerant computing is the art and science of building computing systems that continue to
operate satisfactorily in the presence of faults. Fault tolerance is the ability of a system to continue
performing its intended function in spite of faults.

A fault-tolerant system may be able to tolerate one or more fault-types including:

i) transient, intermittent or permanent hardware faults,
ii) software and hardware design errors,
iii) operator errors, or
iv) externally induced upsets or physical damage.

An extensive methodology has been developed in this field over the past thirty years, and a number
of fault-tolerant machines have been developed, mostly dealing with random hardware faults, while
a smaller number deal with software, design and operator faults to varying degrees. A large amount
of supporting research has been reported.

Fault tolerance is associated with reliability, with successful operation, and with the absence of
breakdowns. A fault-tolerant system should be able to handle faults in individual hardware or
software components, power failures or other kinds of unexpected disasters and still meet its
specification.

Fault tolerance is needed because it is practically impossible to build a perfect system. The
fundamental problem is that, as the complexity of a system increases, its reliability drastically
deteriorates, unless compensatory measures are taken.

Although designers do their best to have all the hardware defects and software bugs cleaned out of
a system before it goes on the market, history shows that such a goal is not attainable. It is
inevitable that some unexpected environmental factor is not taken into account, or some potential
user mistakes are not foreseen. Thus, even in the unlikely case that a system is designed and
implemented perfectly, faults are likely to be caused by situations out of the control of the
designers.

FAULT TOLERANCE AND REDUNDANCY


There are various approaches to achieve fault-tolerance. Common to all these approaches is a
certain amount of redundancy. Redundancy is the provision of functional capabilities that would be
unnecessary in a fault free environment. This can be a replicated hardware component, an
additional check bit attached to a string of digital data, or a few lines of program code verifying the
correctness of the program’s results.

The idea of incorporating redundancy in order to improve the reliability of a system was pioneered
by John von Neumann in the early 1950s in his work “Probabilistic Logics and the Synthesis of
Reliable Organisms from Unreliable Components”.

Two kinds of redundancies are possible: space redundancy and time redundancy.

Space redundancy provides additional components, functions, or data items that are unnecessary
for fault-free operation. It is further classified into hardware, software and information
redundancy, depending on the type of redundant resources added to the system.

Time redundancy repeats the computation or data transmission and compares the result with a
stored copy of the previous result.
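
As a minimal sketch of time redundancy, the following Python function repeats a computation and accepts a result only when two consecutive runs agree. The function name and the `max_attempts` parameter are illustrative assumptions, not taken from any particular system, and the approach only helps against transient faults, since a permanent fault would repeat the same wrong answer.

```python
from typing import Callable, TypeVar

T = TypeVar("T")

def run_with_time_redundancy(compute: Callable[[], T], max_attempts: int = 3) -> T:
    """Repeat `compute` until two consecutive runs agree (hypothetical helper)."""
    previous = compute()                # stored copy of the first result
    for _ in range(max_attempts - 1):
        current = compute()             # repeat the same computation
        if current == previous:         # agreement: accept the result
            return current
        previous = current              # disagreement: keep the newer result, retry
    raise RuntimeError("results never agreed; the fault may not be transient")
```

The price is extra execution time rather than extra hardware, which is the trade-off that distinguishes time redundancy from space redundancy.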

Hardware Fault-Tolerance
The majority of fault-tolerant designs have been directed toward building computers that
automatically recover from random faults occurring in hardware components. Hardware
redundancy is provided by incorporating extra hardware into the design to either detect or override
the effects of a failed component. The techniques employed to do this generally involve partitioning
a computing system into modules that act as fault containment regions. For example, instead of
having a single processor, we can use two or three processors, each performing the same function.
Each module is backed up with protective redundancy so that, if the module fails, others can assume
its function. Special mechanisms are added to detect errors and implement recovery.

Two general approaches to hardware fault recovery have been used: fault masking and dynamic
recovery.

Fault masking
Fault masking is a structural redundancy technique that completely masks faults within a set of
redundant modules. A number of identical modules execute the same functions, and their outputs
are voted to remove errors created by a faulty module. For example, instead of having a single
processor, we can use two or three processors, each performing the same function. By having two
processors, we can detect the failure of a single processor; by having three, we can use the majority
output to override the wrong output of a single faulty processor.

Triple modular redundancy (TMR) is a commonly used form of fault masking in which the circuitry is
triplicated and voted. The voting circuitry can also be triplicated so that individual voter failures
can also be corrected by the voting process. A TMR system fails whenever two modules in a redundant
triplet produce errors, so that the vote is no longer valid.
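
The voting step can be illustrated with a short Python sketch. It models the three modules as ordinary functions and the voter as a strict-majority check; the names `majority_vote` and `tmr` are hypothetical, and real TMR voters are normally implemented in hardware.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence

def majority_vote(outputs: Sequence[Hashable]) -> Hashable:
    """Return the value produced by a strict majority of the modules."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):       # no value has more than half the votes
        raise RuntimeError("no majority: too many modules disagree")
    return value

def tmr(modules: Sequence[Callable[[int], int]], x: int) -> int:
    """Run every module on the same input and vote on the outputs."""
    return majority_vote([module(x) for module in modules])

# Illustrative modules: two healthy copies and one faulty copy.
def healthy(x: int) -> int:
    return x * x

def faulty(x: int) -> int:
    return -1

result = tmr([healthy, faulty, healthy], 4)   # the faulty output is outvoted
```

Note that the sketch raises an error only when the faulty modules disagree with each other; two modules that fail with the same wrong value would win the vote, which is the failure case described above.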

Hybrid redundancy is an extension of TMR in which the triplicate modules are backed up with
additional spares, which are used to replace faulty modules - allowing more faults to be tolerated.

Voted systems require more than three times as much hardware as non-redundant systems, but
they have the advantage that computations can continue without interruption when a fault occurs,
allowing existing operating systems to be used.
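
The benefit of voting can be estimated with a standard textbook formula. Assuming (these are modeling assumptions, not statements from the text) that the three modules fail independently, that each has reliability R, and that the voter itself never fails, a TMR system works when all three modules work or when exactly two of them work:

```latex
R_{\mathrm{TMR}} = R^3 + 3R^2(1 - R) = 3R^2 - 2R^3
```

For example, R = 0.9 gives 3(0.81) - 2(0.729) = 0.972. Under the same assumptions, TMR is more reliable than a single module only when R > 0.5.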

Dynamic recovery
In the case of dynamic recovery, spare components are activated upon the failure of a currently
active component. Special mechanisms are required to detect faults in the modules, switch out a
faulty module, switch in a spare, and instigate those software actions (rollback, initialization, retry,
restart) necessary to restore and continue the computation.

In single computers, special hardware is required along with software to do this, while in
multicomputers, the function is often managed by the other processors.

Dynamic recovery is generally more hardware-efficient than voted systems, and it is therefore the
approach of choice in resource-constrained (e.g. low-power) systems. Its disadvantage is that
computational delays occur during fault recovery.
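
The detect, switch out, switch in, and retry sequence can be sketched as follows. This is a simplified software model under stated assumptions: modules are plain functions, fault detection is a caller-supplied validity check, and the class and attribute names are hypothetical.

```python
from typing import Callable, List

class StandbySystem:
    """One active module plus standby spares (illustrative model only)."""

    def __init__(self, modules: List[Callable[[int], int]],
                 is_valid: Callable[[int, int], bool]) -> None:
        self.modules = list(modules)    # modules[0] is active, the rest are spares
        self.is_valid = is_valid        # fault-detection check on (input, output)
        self.switchovers = 0

    def run(self, x: int) -> int:
        while self.modules:
            active = self.modules[0]
            try:
                y = active(x)
                if self.is_valid(x, y):
                    return y
            except Exception:
                pass                    # a crash counts as a detected fault
            self.modules.pop(0)         # switch out the faulty module
            self.switchovers += 1       # the next spare retries the computation
        raise RuntimeError("all spares exhausted")
```

Each switchover repeats the computation on a spare, which is the recovery delay mentioned above; in exchange, only one module is busy at any time.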

Software Fault-Tolerance
Efforts to attain software that can tolerate software design faults (programming errors) have made
use of static and dynamic redundancy approaches similar to those used for hardware faults.
Programs are partitioned into blocks and acceptance tests are executed after each block. If an
acceptance test fails, a redundant code block is executed.
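
The block-plus-acceptance-test scheme described above is often called a recovery block. A minimal Python sketch, assuming pure functions so that no state has to be rolled back before an alternate runs, might look like this; the square-root routines and the tolerance are illustrative choices.

```python
import math
from typing import Callable, Sequence

def recovery_block(alternates: Sequence[Callable[[float], float]],
                   acceptance_test: Callable[[float, float], bool],
                   x: float) -> float:
    """Try each code block in order until one passes the acceptance test."""
    for alternate in alternates:
        try:
            result = alternate(x)
        except Exception:
            continue                    # a crash counts as a failed test
        if acceptance_test(x, result):
            return result
    raise RuntimeError("all alternates failed the acceptance test")

def buggy_sqrt(x: float) -> float:
    return x / 2                        # hypothetical primary with a design fault

def sqrt_is_acceptable(x: float, result: float) -> bool:
    # Acceptance test: squaring the result should give back the input.
    return abs(result * result - x) <= 1e-9 * max(1.0, x)

root = recovery_block([buggy_sqrt, math.sqrt], sqrt_is_acceptable, 9.0)
```

The scheme is only as good as the acceptance test: for x = 4 the buggy primary happens to return the right answer and passes, while a weak test would also let wrong answers through.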

Hardware and Software Design Fault Tolerance


To tolerate design faults of both hardware and software, an approach called design diversity
combines hardware and software fault-tolerance by implementing a fault-tolerant computer system
using different hardware and software in redundant channels.

Each channel is designed to provide the same function, and a method is provided to identify if one
channel deviates unacceptably from the others. This is a very expensive technique, but it is used in
very critical aircraft control applications.
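
One possible way to identify a channel that deviates unacceptably is to compare each channel's output with the median of all channels. The sketch below assumes numeric outputs and a fixed tolerance; the function name, channel names, and threshold are hypothetical, and real systems may use quite different comparison rules.

```python
from statistics import median
from typing import Dict, List

def find_deviating_channels(outputs: Dict[str, float], tolerance: float) -> List[str]:
    """Return the names of channels whose output is too far from the median."""
    reference = median(outputs.values())
    return [name for name, value in outputs.items()
            if abs(value - reference) > tolerance]

# Three diverse channels computing the same quantity (illustrative values).
readings = {"channel_a": 10.01, "channel_b": 9.99, "channel_c": 12.5}
suspects = find_deviating_channels(readings, tolerance=0.1)
```

A tolerance is used instead of exact equality because independently designed channels may legitimately produce slightly different results.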

Information Redundancy
The best-known form of information redundancy is error detection and correction coding. Here,
extra bits (called check bits) are added to the original data bits so that an error in the data bits can
be detected or even corrected. The resulting error-detecting and error-correcting codes are widely
used today in memory units and various storage devices to protect against benign failures. Note that
these error codes (like any other form of information redundancy) require extra hardware to process
the redundant data (the check bits).
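
As one concrete example of check bits, the Hamming(7,4) code adds three check bits to four data bits and can correct any single-bit error. The text does not name a specific code, so this choice is an assumption made for illustration; the function names are hypothetical.

```python
from typing import List, Tuple

def hamming74_encode(data: List[int]) -> List[int]:
    """Encode 4 data bits as the codeword [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4                   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4                   # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4                   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(codeword: List[int]) -> Tuple[List[int], int]:
    """Correct up to one flipped bit; return (data bits, error position or 0)."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3     # 1-based position of the flipped bit
    if syndrome:
        c[syndrome - 1] ^= 1            # flip it back
    return [c[2], c[4], c[5], c[6]], syndrome

word = hamming74_encode([1, 0, 1, 1])
word[4] ^= 1                            # inject a single-bit error at position 5
recovered, position = hamming74_decode(word)
```

Two or more flipped bits in the same codeword are beyond what this code can correct and would be decoded incorrectly.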

Error-detecting and error-correcting codes are also used to protect data communicated over noisy
channels, which are channels that are subject to many transient failures. These channels can be
either the communication links among widely separated processors (e.g., the Internet) or among
locally connected processors that form a local network. If the code used for data communication is
capable of only detecting the faults that have occurred (but not correcting them), we can retransmit
as necessary, thus employing time redundancy.
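
The detect-then-retransmit idea can be sketched with a CRC-32 checksum from Python's standard `zlib` module. The frame layout, the `channel` callback, and the retry limit are assumptions made for this example and do not correspond to any specific protocol.

```python
import zlib
from typing import Callable, Optional, Tuple

def make_frame(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 of the payload as check bits."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def check_frame(frame: bytes) -> Optional[bytes]:
    """Return the payload if the CRC matches, otherwise None (error detected)."""
    payload, received_crc = frame[:-4], int.from_bytes(frame[-4:], "big")
    return payload if zlib.crc32(payload) == received_crc else None

def send_with_retransmit(payload: bytes,
                         channel: Callable[[bytes], bytes],
                         max_tries: int = 5) -> Tuple[bytes, int]:
    """Send over `channel`, retransmitting until the receiver's check passes."""
    frame = make_frame(payload)
    for attempt in range(1, max_tries + 1):
        received = check_frame(channel(frame))
        if received is not None:
            return received, attempt
    raise RuntimeError("channel too noisy: giving up")
```

Here the code can only detect errors, so correction comes from time redundancy: the same frame is sent again until a copy arrives intact.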

VALIDATION OF FAULT-TOLERANCE
One of the most difficult tasks in the design of a fault-tolerant machine is to verify that it will meet
its reliability requirements. This requires creating a number of models. The first model is of the
error/fault environment that is expected.

Other models specify the structure and behavior of the design. It is then necessary to determine
how well the fault tolerance mechanisms work by analytic studies and fault simulations.
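
A toy fault simulation shows how an analytic study and a simulation can be checked against each other. The sketch assumes a very simple fault model (three independent modules, each working with a fixed probability, and a perfect voter) and compares a Monte Carlo estimate with the standard analytic result 3R^2 - 2R^3 for a triplicated, voted system. All names and parameters are illustrative.

```python
import random

def simulate_tmr(module_reliability: float, trials: int, seed: int = 0) -> float:
    """Estimate the probability that at least 2 of 3 modules work."""
    rng = random.Random(seed)           # seeded for repeatable experiments
    survived = 0
    for _ in range(trials):
        working = sum(rng.random() < module_reliability for _ in range(3))
        if working >= 2:                # the majority vote is still valid
            survived += 1
    return survived / trials

def analytic_tmr(module_reliability: float) -> float:
    """Closed-form reliability under the same assumptions."""
    r = module_reliability
    return 3 * r**2 - 2 * r**3

estimate = simulate_tmr(0.9, trials=100_000, seed=1)
expected = analytic_tmr(0.9)            # approximately 0.972
```

Agreement between the two numbers says only that the simulation matches the model; whether the model matches the real fault environment is a separate question.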

FOUR ASPECTS TO FAULT TOLERANCE

APPLICATIONS OF FAULT-TOLERANCE
Following the development of semiconductor technology, hardware components became
intrinsically more reliable, and the need to tolerate component defects diminished in
general-purpose applications.

Nevertheless, fault tolerance remained necessary in many safety-, mission- and business-critical
applications.

1. Safety-critical applications are those where loss of life or environmental disaster must be
avoided. Examples are nuclear power plant control systems, computer-controlled radiation
therapy machines, heart pacemakers, and military radar systems.

2. Mission-critical applications stress mission completion, such as in the case of an airplane or a
spacecraft.

3. Business-critical applications are those in which keeping a business operating is an issue.
Examples are banks' and stock exchanges' automated trading systems, web servers, and e-commerce.

