Fault-Tolerant Computing Overview

Fault-tolerant computing aims to build systems that can continue operating despite faults. It incorporates redundancy through additional hardware, software, data, or design diversity. Common approaches to hardware fault tolerance include fault masking using triple modular redundancy and dynamic recovery using spare components. Software faults can be tolerated using acceptance tests and redundant code blocks. Information redundancy adds check bits for error detection and correction. Fault-tolerant systems are validated through modeling and fault simulations. They find applications in safety-critical systems like medical devices and airplanes to avoid failures.

Uploaded by

Mayowa Sunusi

Fault-Tolerant Computing

INTRODUCTION
Fault-tolerant computing is the art and science of building computing systems that continue to
operate satisfactorily in the presence of faults. Fault tolerance is the ability of a system to continue
performing its intended function in spite of faults.

A fault-tolerant system may be able to tolerate one or more fault types, including:

i) transient, intermittent or permanent hardware faults,
ii) software and hardware design errors,
iii) operator errors, or
iv) externally induced upsets or physical damage.

An extensive methodology has been developed in this field over the past thirty years, and a number
of fault-tolerant machines have been developed, mostly dealing with random hardware faults, while
a smaller number deal with software, design and operator faults to varying degrees. A large amount
of supporting research has been reported.

Fault tolerance is associated with reliability, with successful operation, and with the absence of
breakdowns. A fault-tolerant system should be able to handle faults in individual hardware or
software components, power failures or other kinds of unexpected disasters and still meet its
specification.

Fault tolerance is needed because it is practically impossible to build a perfect system. The
fundamental problem is that, as the complexity of a system increases, its reliability drastically
deteriorates, unless compensatory measures are taken.
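
As a rough illustration of this effect, the sketch below assumes a system with no redundancy that works only if every one of its independent components works, so its reliability is the product of the component reliabilities. The function name and the figure of 0.999 per component are illustrative assumptions, not values from the text.

```python
def series_reliability(component_reliability: float, n_components: int) -> float:
    """Reliability of a system that needs all n independent components to work.

    Assumes identical components and statistically independent failures.
    """
    if not 0.0 <= component_reliability <= 1.0:
        raise ValueError("reliability must be a probability between 0 and 1")
    return component_reliability ** n_components


if __name__ == "__main__":
    # Even very reliable parts give a weak system once there are many of them.
    for n in (10, 100, 1000):
        print(f"{n:5d} components -> {series_reliability(0.999, n):.3f}")
```

Under these assumptions, adding components can only lower the result, which is why redundancy has to be introduced as a compensatory measure.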

Although designers do their best to have all the hardware defects and software bugs cleaned out of
a system before it goes on the market, history shows that such a goal is not attainable. It is
inevitable that some unexpected environmental factor is not taken into account, or some potential
user mistakes are not foreseen. Thus, even in the unlikely case that a system is designed and
implemented perfectly, faults are likely to be caused by situations out of the control of the
designers.

FAULT TOLERANCE AND REDUNDANCY

There are various approaches to achieving fault tolerance. Common to all of them is a certain
amount of redundancy. Redundancy is the provision of functional capabilities that would be
unnecessary in a fault-free environment. This can be a replicated hardware component, an
additional check bit attached to a string of digital data, or a few lines of program code verifying the
correctness of the program's results.

The idea of incorporating redundancy in order to improve the reliability of a system was pioneered by
John von Neumann in the early 1950s in his work "Probabilistic Logics and the Synthesis of Reliable
Organisms from Unreliable Components".

Two kinds of redundancy are possible: space redundancy and time redundancy.

Space redundancy provides additional components, functions, or data items that are unnecessary
for fault-free operation. It is further classified into hardware, software and information
redundancy, depending on the type of redundant resources added to the system.

In time redundancy, the computation or data transmission is repeated and the result is compared to
a stored copy of the previous result.
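
Time redundancy can be sketched in a few lines. The helper below is a minimal illustration: the name `run_with_time_redundancy` and the `max_attempts` parameter are hypothetical, and the sketch assumes that results can be compared with `==`. It repeats a computation until two consecutive runs agree, which helps against transient faults but not against a fault that corrupts every run in the same way.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_time_redundancy(compute: Callable[[], T], max_attempts: int = 3) -> T:
    """Repeat a computation and accept a result once two consecutive runs agree."""
    if max_attempts < 2:
        raise ValueError("at least two attempts are needed for a comparison")
    previous = compute()  # stored copy of the previous result
    for _ in range(max_attempts - 1):
        current = compute()
        if current == previous:
            return current  # two runs agree, accept the result
        previous = current  # disagreement: keep the newer result and try again
    raise RuntimeError("results never agreed within the allowed attempts")
```

The cost is paid in execution time rather than in extra hardware, which is the trade-off that distinguishes time redundancy from space redundancy.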

Hardware Fault-Tolerance
The majority of fault-tolerant designs have been directed toward building computers that
automatically recover from random faults occurring in hardware components. Hardware
redundancy is provided by incorporating extra hardware into the design to either detect or override
the effects of a failed component. The techniques employed to do this generally involve partitioning
a computing system into modules that act as fault containment regions. For example, instead of
having a single processor, we can use two or three processors, each performing the same function.
Each module is backed up with protective redundancy so that, if the module fails, others can assume
its function. Special mechanisms are added to detect errors and implement recovery.

Two general approaches to hardware fault recovery have been used: fault masking and dynamic
recovery.

Fault masking
Fault masking is a structural redundancy technique that completely masks faults within a set of
redundant modules. A number of identical modules execute the same functions, and their outputs
are voted to remove errors created by a faulty module. For example, instead of having a single
processor, we can use two or three processors, each performing the same function. By having two
processors, we can detect the failure of a single processor; by having three, we can use the majority
output to override the wrong output of a single faulty processor.
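
The voting step can be sketched as follows. This is a software illustration only (real voters are usually hardware circuits), and the function name `majority_vote` is a hypothetical choice. It returns the value agreed on by a strict majority of modules, or `None` when no such majority exists.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the output produced by a strict majority of modules, else None."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) // 2 else None


if __name__ == "__main__":
    print(majority_vote([7, 7, 9]))  # three modules: the faulty output 9 is outvoted
    print(majority_vote([7, 9]))     # two modules: the mismatch is detected, not corrected
```

With two modules a disagreement can only be detected, since neither output has a majority; with three, a single wrong output is masked.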

Triple modular redundancy (TMR) is a commonly used form of fault masking in which the circuitry is
triplicated and voted. The voting circuitry can also be triplicated so that individual voter failures are
corrected by the voting process as well. A TMR system fails whenever two modules in a redundant
triplet produce errors, so that the vote is no longer valid.
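
The failure condition above leads to a standard closed-form estimate. Assuming three identical modules that fail independently, each working with probability R, and a voter that never fails (assumptions added here, not stated in the text), the system works when at least two modules work, which gives 3R^2 - 2R^3. The function names below are illustrative.

```python
def tmr_reliability(module_reliability: float) -> float:
    """Probability that at least two of three independent modules work.

    Assumes identical modules, independent failures and a perfect voter.
    """
    r = module_reliability
    # All three work (r^3) plus exactly two work (3 * r^2 * (1 - r)).
    return 3 * r**2 - 2 * r**3


def tmr_gain(module_reliability: float) -> float:
    """Difference between TMR reliability and that of a single module."""
    return tmr_reliability(module_reliability) - module_reliability


if __name__ == "__main__":
    for r in (0.99, 0.9, 0.5, 0.4):
        print(f"R = {r:.2f}: TMR = {tmr_reliability(r):.4f}, gain = {tmr_gain(r):+.4f}")
```

Under this model TMR helps only when the individual modules are better than a coin flip: for R below 0.5, two simultaneous failures become likely enough that the triplet is less reliable than one module.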

Hybrid redundancy is an extension of TMR in which the triplicate modules are backed up with
additional spares, which are used to replace faulty modules, allowing more faults to be tolerated.

Voted systems require more than three times as much hardware as non-redundant systems, but
they have the advantage that computations can continue without interruption when a fault occurs,
allowing existing operating systems to be used.

Dynamic recovery
In the case of dynamic recovery, spare components are activated upon the failure of a currently
active component. Special mechanisms are required to detect faults in the modules, switch out a
faulty module, switch in a spare, and instigate those software actions (rollback, initialization, retry,
restart) necessary to restore and continue the computation.

In single computers, special hardware is required along with software to do this, while in
multicomputers, the function is often managed by the other processors.

Dynamic recovery is generally more hardware-efficient than voted systems, and it is therefore the
approach of choice in resource-constrained (e.g. low-power) systems. Its disadvantage is that
computational delays occur during fault recovery.
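
A minimal sketch of dynamic recovery with standby spares is shown below. It assumes that each module can signal a detected fault by raising an exception; the `ModuleFault` class and the `run_with_spares` helper are hypothetical names introduced for illustration. The retry on the next spare is where the recovery delay comes from.

```python
from typing import Callable, Sequence, Tuple


class ModuleFault(Exception):
    """Hypothetical signal raised by a module's error-detection mechanism."""


def run_with_spares(
    modules: Sequence[Callable[[int], int]], task_input: int
) -> Tuple[int, int]:
    """Run the task on the active module; on a detected fault, switch in a spare.

    Returns the result and the index of the module that produced it.
    """
    for index, module in enumerate(modules):
        try:
            return module(task_input), index
        except ModuleFault:
            # Switch out the faulty module and retry the task on the next spare.
            continue
    raise RuntimeError("all modules, including spares, have failed")
```

Only one module runs at a time, so the hardware overhead is the idle spares plus the detection and switching logic, rather than the threefold replication a voted system needs.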

Software Fault-Tolerance
Efforts to attain software that can tolerate software design faults (programming errors) have made
use of static and dynamic redundancy approaches similar to those used for hardware faults.
Programs are partitioned into blocks and acceptance tests are executed after each block. If an
acceptance test fails, a redundant code block is executed.
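
The block-plus-acceptance-test scheme can be sketched as follows. This is a simplified illustration, not a full implementation: the function names, the deliberately faulty `buggy_sort`, and the choice of sorting as the task are all assumptions made for the example.

```python
import copy
from collections import Counter
from typing import Callable, List, Sequence


def is_sorted_permutation(original: List[int], result: List[int]) -> bool:
    """Acceptance test: the result is ordered and holds exactly the input items."""
    ordered = all(a <= b for a, b in zip(result, result[1:]))
    return ordered and Counter(original) == Counter(result)


def buggy_sort(items: List[int]) -> List[int]:
    """Hypothetical primary block with a design fault: it drops the last item."""
    return sorted(items[:-1])


def simple_sort(items: List[int]) -> List[int]:
    """Redundant block, written independently of the primary."""
    return sorted(items)


def recovery_block(
    alternates: Sequence[Callable[[List[int]], List[int]]],
    acceptance_test: Callable[[List[int], List[int]], bool],
    data: List[int],
) -> List[int]:
    """Try each alternate on a fresh copy of the input until one passes the test."""
    for alternate in alternates:
        checkpoint = copy.deepcopy(data)  # a failed attempt cannot corrupt the input
        try:
            result = alternate(checkpoint)
        except Exception:
            continue  # a crash counts as a failed attempt
        if acceptance_test(data, result):
            return result
    raise RuntimeError("no alternate passed the acceptance test")
```

The approach is only as good as the acceptance test: a test that checks ordering but not completeness would have accepted the faulty primary's output.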

Hardware and Software Design Fault Tolerance

To tolerate design faults of both hardware and software, an approach called design diversity
combines hardware and software fault-tolerance by implementing a fault-tolerant computer system
using different hardware and software in redundant channels.

Each channel is designed to provide the same function, and a method is provided to identify if one
channel deviates unacceptably from the others. This is a very expensive technique, but it is used in
very critical aircraft control applications.
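
The channel comparison can be sketched in software as below. The example assumes three independently written square-root routines standing in for the diverse channels, with the median taken as the reference value; the function names and the tolerance are illustrative, and inputs are assumed to be non-negative.

```python
import math
from statistics import median
from typing import List, Sequence, Tuple


def sqrt_newton(x: float, iterations: int = 30) -> float:
    """Channel A: Newton's iteration (assumes x >= 0)."""
    guess = x if x > 1 else 1.0
    for _ in range(iterations):
        guess = 0.5 * (guess + x / guess)
    return guess


def sqrt_library(x: float) -> float:
    """Channel B: the standard library routine."""
    return math.sqrt(x)


def sqrt_power(x: float) -> float:
    """Channel C: exponentiation."""
    return x ** 0.5


def compare_channels(outputs: Sequence[float], tolerance: float) -> Tuple[float, List[int]]:
    """Return the median output and the indices of channels that deviate from it."""
    reference = median(outputs)
    deviating = [i for i, value in enumerate(outputs) if abs(value - reference) > tolerance]
    return reference, deviating
```

Because the channels share no code, a design fault in one of them is unlikely to appear in the others in the same form, which is the property that plain replication of a single design cannot offer.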

Information Redundancy
The best-known form of information redundancy is error detection and correction coding. Here,
extra bits (called check bits) are added to the original data bits so that an error in the data bits can
be detected or even corrected. The resulting error-detecting and error-correcting codes are widely
used today in memory units and various storage devices to protect against benign failures. Note that
these error codes (like any other form of information redundancy) require extra hardware to process
the redundant data (the check bits).
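
As a concrete instance, the sketch below uses the Hamming(7,4) code, chosen here as a well-known example rather than one named in the text. Three check bits are added to four data bits, and any single flipped bit in the seven-bit word can be located and corrected. The function names are illustrative.

```python
from typing import List, Sequence, Tuple


def hamming74_encode(data_bits: Sequence[int]) -> List[int]:
    """Encode 4 data bits as a 7-bit codeword laid out as p1 p2 d1 p4 d2 d3 d4."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword: Sequence[int]) -> Tuple[List[int], int]:
    """Correct up to one flipped bit; return the data bits and the error position.

    The position is 1-based, and 0 means that no error was detected.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_position = s1 + 2 * s2 + 4 * s4  # the syndrome spells out the position
    if error_position:
        c[error_position - 1] ^= 1  # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], error_position
```

This code corrects single-bit errors only: two flipped bits produce a non-zero syndrome that points at the wrong position, so stronger codes are needed where multiple errors are expected.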

Error-detecting and error-correcting codes are also used to protect data communicated over noisy
channels, which are channels that are subject to many transient failures. These channels can be
either the communication links among widely separated processors (e.g., the Internet) or among
locally connected processors that form a local network. If the code used for data communication is
capable of only detecting the faults that have occurred (but not correcting them), we can retransmit
as necessary, thus employing time redundancy.
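
The detect-and-retransmit combination can be sketched with a CRC-32 checksum from Python's standard `zlib` module. The frame layout (payload followed by a 4-byte checksum), the function names, and the `channel` callable that models the noisy link are assumptions made for this example.

```python
import zlib
from typing import Callable, Optional, Tuple


def make_frame(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 checksum (the check bits) to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check_frame(frame: bytes) -> Optional[bytes]:
    """Return the payload if the checksum matches, otherwise None."""
    payload, checksum = frame[:-4], frame[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") == checksum:
        return payload
    return None


def send_with_retransmit(
    payload: bytes, channel: Callable[[bytes], bytes], max_attempts: int = 5
) -> Tuple[bytes, int]:
    """Send a frame over a possibly noisy channel, retransmitting on detected errors.

    Returns the received payload and the number of transmissions used.
    """
    for attempt in range(1, max_attempts + 1):
        received = check_frame(channel(make_frame(payload)))
        if received is not None:
            return received, attempt
    raise RuntimeError("transmission failed on every attempt")
```

The checksum is the information redundancy and the repeated send is the time redundancy; a checksum can also miss some error patterns, so detection is very likely rather than guaranteed.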

VALIDATION OF FAULT-TOLERANCE
One of the most difficult tasks in the design of a fault-tolerant machine is to verify that it will meet
its reliability requirements. This requires creating a number of models. The first model is of the
error/fault environment that is expected.

Other models specify the structure and behavior of the design. It is then necessary to determine
how well the fault tolerance mechanisms work by analytic studies and fault simulations.
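
A toy version of this workflow is sketched below for a triplicated, voted system. The fault model (each module fails independently with a fixed probability per trial) and the assumption of a perfect voter are simplifications introduced for the example, and the function names are hypothetical. The simulation injects random module failures and its result is compared with the analytic value 3R^2 - 2R^3 that holds under the same assumptions.

```python
import random


def analytic_tmr(module_reliability: float) -> float:
    """Analytic model: probability that at least two of three modules work."""
    r = module_reliability
    return 3 * r**2 - 2 * r**3


def simulate_tmr(module_reliability: float, trials: int, seed: int = 0) -> float:
    """Fault simulation: estimate how often a voted triplet produces a valid vote."""
    rng = random.Random(seed)  # fixed seed so that a run can be reproduced
    successes = 0
    for _ in range(trials):
        # Inject faults: each module independently works with the given probability.
        working = sum(rng.random() < module_reliability for _ in range(3))
        if working >= 2:  # the vote is valid when at most one module is faulty
            successes += 1
    return successes / trials


if __name__ == "__main__":
    r = 0.9
    print(f"analytic:  {analytic_tmr(r):.4f}")
    print(f"simulated: {simulate_tmr(r, 50_000):.4f}")
```

Agreement between the two values checks the models against each other, not against reality: if the assumed fault environment is wrong, for example because failures are correlated, both figures are wrong together.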

FOUR ASPECTS TO FAULT TOLERANCE

APPLICATIONS OF FAULT-TOLERANCE
Following the development of semiconductor technology, hardware components became
intrinsically more reliable, and the need to tolerate component defects diminished in
general-purpose applications.

Nevertheless, fault tolerance remained necessary in many safety-, mission- and business-critical
applications.

1. Safety-critical applications are those where loss of life or environmental disaster must be
avoided. Examples are nuclear power plant control systems, computer-controlled radiation
therapy machines, heart pacemakers, and military radar systems.

2. Mission-critical applications stress mission completion, such as in the case of an airplane or a
spacecraft.

3. Business-critical applications are those in which keeping a business operating is an issue.
Examples are banks' and stock exchanges' automated trading systems, web servers, and e-commerce.

Common questions

How do fault masking and dynamic recovery differ in their approach to handling hardware faults?

Fault masking and dynamic recovery are two approaches to handling hardware faults, differing primarily in their methods and implications. Fault masking employs structural redundancy to completely mask faults within a set of redundant modules; multiple identical modules execute the same functions, and outputs are voted to mitigate errors from faulty ones. Triple modular redundancy (TMR) is a common fault masking technique using triplicated circuitry and voters. In contrast, dynamic recovery activates spare components upon detecting a fault, requiring mechanisms for fault detection, switching out faulty modules, and initiating recovery software actions. While dynamic recovery is more hardware-efficient, it entails computational delays during fault recovery, unlike fault masking, which allows for uninterrupted computation.

How does fault tolerance in computing systems improve their overall reliability?

Fault-tolerant computing systems improve overall reliability by continuing to operate satisfactorily in the presence of faults, through methods such as redundancy, fault masking, and dynamic recovery. Redundancy, whether in hardware or software, helps in handling unexpected faults by providing extra components or verifying program correctness. Techniques like fault masking shelter systems from faults by executing identical functions in redundant modules and voting on the outcomes, while dynamic recovery activates spare components to replace faulty ones. These measures ensure that systems can manage faults arising from hardware, software, or environmental errors, enhancing reliability despite increasing complexity.

What is design diversity in fault-tolerant systems, and why is it considered an expensive technique?

Design diversity in fault-tolerant systems involves implementing different hardware and software in redundant channels, each capable of providing the same function. A method is used to detect when one channel deviates unacceptably from the others. It is considered expensive because each redundant channel must be designed and built separately, with different hardware and software, so that it can perform the necessary functions independently of the others. Design diversity is used in extremely critical applications, such as aircraft control, where failure could result in catastrophic outcomes.

How do software fault tolerance techniques compare to those employed in hardware fault tolerance, and what challenges do they address?

Software fault tolerance techniques, like their hardware counterparts, use redundancy to address faults, but focus on software design faults (programming errors). Static and dynamic redundancy techniques are employed, such as partitioning programs into blocks, executing an acceptance test after each block, and executing a redundant code block when a test fails. Unlike hardware fault tolerance, which often relies on physical redundancy and fault containment, software fault tolerance addresses design faults that cannot be resolved by replicating hardware, thus keeping the software functioning.

In what scenarios is fault tolerance particularly necessary despite advances in semiconductor technology?

Fault tolerance remains particularly necessary in scenarios involving safety-critical, mission-critical, and business-critical applications. Safety-critical applications include systems where failure could lead to loss of life or environmental disasters, such as in nuclear power plant controls and medical devices like pacemakers. Mission-critical applications are those where the completion of a mission is paramount, as in aerospace technologies. Business-critical applications involve continuous business operations in financial transactions and e-commerce platforms. These applications require fault tolerance because failure can lead to catastrophic consequences despite the increased reliability of semiconductor technology.

In what ways does the increase in system complexity affect reliability, and how can fault tolerance strategies mitigate this issue?

As system complexity increases, reliability tends to drastically decrease because of the higher likelihood of faults from various sources, including design errors, user mistakes, and unforeseen environmental factors. Fault tolerance strategies mitigate these issues by incorporating redundancy, in both space and time, to counteract potential failures. By implementing fault tolerance techniques such as fault masking and dynamic recovery, systems can continue operating despite individual component failures, thus maintaining reliability in complex systems. These strategies act as compensatory measures, ensuring continued operation and adherence to specifications even when faults occur.

What are the key challenges in validating a fault-tolerant machine, and how can these challenges be addressed?

The key challenges in validating a fault-tolerant machine include verifying that it meets reliability requirements despite its complexity and potential variations in real-world fault environments. Addressing these challenges requires developing models of the expected error/fault environment and of the design's structure and behavior. Analytical studies and fault simulations are then used to evaluate how effectively fault tolerance mechanisms perform under these modeled conditions. This is inherently difficult due to the need for accurate models and the vast number of potential fault scenarios a system may encounter.

Why is redundancy considered crucial in fault-tolerant systems, and what are the different types of redundancy used?

Redundancy is crucial in fault-tolerant systems as it provides additional functional capabilities that compensate for faults, thus maintaining system operations. The different types of redundancy used include space redundancy, which involves adding extra components, functions, or data that are unnecessary for fault-free operation, and time redundancy, which involves repeating computations and comparing results to detect discrepancies. Space redundancy can be further classified into hardware, software, and information redundancy, depending on the type of redundant resource added. These redundancies allow the system to handle various faults, from hardware failures to environmental upsets, which enhances overall system reliability and robustness.

Why might dynamic recovery be preferred over fault masking in certain systems, despite its potential drawbacks?

Dynamic recovery might be preferred over fault masking in certain systems primarily due to its hardware efficiency, which makes it more suitable for resource-constrained systems such as those with low power availability. Although dynamic recovery involves computational delays during fault recovery, it requires less hardware than voted (fault-masking) systems, which need more than three times the hardware of a non-redundant system. This makes dynamic recovery advantageous in environments where minimizing hardware cost or power consumption is a priority, despite the trade-off of potential operational delays during fault recovery.

In what ways is information redundancy used for error detection and correction, and why is it important for data integrity?

Information redundancy is used for error detection and correction by adding extra bits, known as check bits, to the data bits. These check bits enable the detection, and in some codes the correction, of errors in the data, which is crucial for maintaining data integrity. Error-detecting and error-correcting codes built on these check bits are widely applied in memory units and storage devices to guard against benign failures. Such redundancy is also vital in data communication over noisy channels, where data can be retransmitted as necessary if the code is only capable of error detection. This redundancy preserves data accuracy, especially in critical applications like financial systems and space communication, where data integrity is paramount.
