Fault Tolerance Techniques Overview

This document discusses various fault tolerance techniques including coding techniques, software fault tolerance, network fault tolerance, and redundancy. It covers fault types like transient, intermittent and permanent faults. It describes fault detection methods like online and offline detection. It also discusses fault and error containment through techniques like redundancy, data diversity, and reversal checks.

Uploaded by

Luis Anderson

UNIT 3

Fault Tolerance Techniques

Introduction, failure causes, fault types, fault detection, fault and error containment, redundancy, data diversity, reversal checks, malicious or Byzantine failures

Coding technique

Software fault tolerance

Network fault tolerance

Roll No: 15

Fault Tolerance
Definition
Fault tolerance refers to a system's ability to deal with malfunctions.

Fault-tolerant systems are, ideally, systems capable of executing their tasks correctly regardless of either hardware failures or software errors.

Real Time and Fault Tolerance

Failure Causes
There are three causes of failure:
Errors in the specification or design,
Defects in the components,
Environmental effects.

Fault Types
Faults are classified according to their behavior:

1. Temporal behavior
2. Output behavior

A fault is said to be active when it is physically capable of generating errors, and benign when it is not.

Temporal Behavior
Transient faults:
These occur once and then disappear

Intermittent faults:
Intermittent faults are characterized by a fault
occurring, then vanishing again, then reoccurring, then
vanishing.

Permanent faults:
This type of fault is persistent: it continues to exist until the faulty component is repaired or replaced.

Fault Detection
Definition
There are two ways to determine that a processor is malfunctioning:
1. Online detection
2. Offline detection
Online detection goes on in parallel with normal system operation.
One way of doing this is to check for any behavior that is inconsistent with correct operation.

A monitor (called a watchdog processor) is associated with each processor, looking for signs that the processor is faulty.
The watchdog processor watches the data and address lines, as shown in the figure.

A second approach is to have multiple processors, which are supposed to put out the same result, and compare the results.

A discrepancy indicates the existence of a fault.

Online Detection using Watchdog Processor

The following actions are indicative of a faulty processor:

Branching to an invalid destination.
Fetching an opcode from a location containing data.
Writing into a portion of memory to which the process has no write access.
Fetching an illegal opcode.
Remaining inactive for more than a prescribed period.
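The last indicator in the list, inactivity beyond a prescribed period, can be sketched in software as a timeout check. This is only a minimal illustration: the class name `Watchdog`, its methods, and the injectable `clock` parameter are assumptions made for this example, and a real watchdog processor observes hardware lines rather than function calls.

```python
import time


class Watchdog:
    """Flags a monitored unit as suspect if it stays silent too long."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock            # injectable so the sketch is testable
        self.last_kick = clock()

    def kick(self):
        """Called by the monitored processor to show it is still alive."""
        self.last_kick = self.clock()

    def expired(self):
        """True if no kick arrived within the prescribed period."""
        return self.clock() - self.last_kick > self.timeout_s


# Usage with a fake clock so the behaviour is deterministic.
now = [0.0]
wd = Watchdog(timeout_s=1.0, clock=lambda: now[0])
now[0] = 0.5
healthy = not wd.expired()    # 0.5 s of silence is within the limit
now[0] = 2.0
timed_out = wd.expired()      # 2.0 s of silence exceeds the limit
```

The other indicators (invalid branch targets, illegal opcodes, forbidden writes) need access to the address and data lines and are not modelled here.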

Offline detection consists of running diagnostic tests.

When a processor is running such a test, it obviously cannot be executing the applications software.

Diagnostic tests can be scheduled just like ordinary tasks.

The greater the failure rate, the greater must be the frequency with which these tests are run.

Fault and Error containment

When a fault or error occurs in one part of the system, it can, if unchecked, spread through the system like an infectious disease.
A fault in one part of the system might, for example, cause large voltage swings in another; a fault-free processor can put out erroneous results as a result of using erroneous input from a faulty unit.

Faults and errors must therefore be prevented from spreading through the system. This is called fault and error containment.

The system is divided into

Fault Containment Zones (FCZ):
An FCZ is a subset of the system that operates
correctly despite arbitrary logical or electrical faults
outside the subset.
That is, the failure of some part of the computer
outside an FCZ cannot cause any element inside
that FCZ to fail.
Error Containment Zones (ECZ):
The function of an ECZ is to prevent errors from propagating across zone boundaries. This is typically achieved by voting redundant outputs.

Redundancy
Four types:

Hardware redundancy: The system is provided with far more hardware than it would need if all the components were perfectly reliable.
Software redundancy: The system is provided with different software versions of tasks, so that when one version of a task fails under certain inputs, another version can be used.
Time redundancy: The task schedule has some slack in it, so that some tasks can be rerun if necessary and still meet critical deadlines.
Information redundancy: The data are coded in such a way that a certain number of bit errors can be detected and/or corrected.

Hardware Redundancy
Two types: static (or masking) and dynamic redundancy.
Static: redundant components are used inside a system to hide the effects of faults, e.g. Triple Modular Redundancy (TMR).
TMR uses three identical subcomponents and majority voting circuits; the outputs are compared and, if one differs from the other two, that output is masked out.
Dynamic: redundancy supplied inside a component which indicates that the output is in error; it provides an error detection facility, and recovery must be provided by another component.
E.g. communications checksums and memory parity bits.

N-Modular Redundancy
N-modular redundancy (NMR) is a scheme for forward error recovery.

It works by using N processors instead of one, and voting on their output. N is usually odd.

The figure illustrates this scheme for N = 3.

One of two approaches is possible.
In design (a), there are N voters and the entire cluster produces N outputs. In design (b), there is just one voter.
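The single-voter arrangement of design (b) can be sketched as a majority function over the N module outputs. The function name `majority_vote` is an assumption for this illustration, and the sketch assumes the outputs are exactly comparable (hashable) values; real voters often have to tolerate small numerical differences.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of modules.

    Raises ValueError when no value has more than half of the votes,
    in which case the fault cannot be masked.
    """
    if not outputs:
        raise ValueError("no module outputs to vote on")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority among module outputs")
    return value


# N = 3: one faulty module is outvoted by the two correct ones.
tmr_result = majority_vote([42, 42, 17])
```

With N odd, a two-way tie between equally sized groups cannot occur, which is one reason N is usually chosen odd.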

N-Modular Redundancy

Software Redundancy
The system is provided with different software versions of each task, written independently by different teams of programmers.

If one version of a task fails under certain inputs, another version can be used.

Software Redundancy
N-Version Programming
Recovery Block Approach

N-Version Programming
The N-version software concept attempts to parallel the traditional hardware fault tolerance concept of N-way redundant hardware.

In an N-version software system, each module is made with up to N different implementations. Each variant accomplishes the same task, but hopefully in a different way.

Each version then submits its answer to a voter or decider, which determines the correct answer.

This system can hopefully overcome the design faults present in most software by relying upon the design diversity concept.
An important distinction in N-version software is the fact that the system could include multiple types of hardware using multiple versions of software.
The goal is to increase the diversity in order to avoid common mode failures.
Using N-version software, it is encouraged that each different version be implemented in as diverse a manner as possible, including different tool sets, different programming languages, and possibly different environments.
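As a rough sketch of the idea, the following runs three independently written versions of an integer square root and lets a voter choose the answer. All function names are hypothetical, and the third version contains a deliberately seeded design fault so that the masking effect is visible; in practice the versions would come from separate teams and tool sets rather than one file.

```python
import math
from collections import Counter


def isqrt_library(n):
    return math.isqrt(n)


def isqrt_binary_search(n):
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid * mid <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo


def isqrt_faulty(n):
    # Deliberately seeded design fault: wrong for perfect squares.
    r = math.isqrt(n)
    return r - 1 if r * r == n else r


def n_version(versions, n):
    """Run every version and let a majority voter pick the answer."""
    answers = []
    for version in versions:
        try:
            answers.append(version(n))
        except Exception:
            pass  # a crashed version simply casts no vote
    if answers:
        value, count = Counter(answers).most_common(1)[0]
        if count * 2 > len(versions):
            return value
    raise RuntimeError("versions disagree; no majority")


result = n_version([isqrt_library, isqrt_binary_search, isqrt_faulty], 49)
```

Here the two correct versions outvote the faulty one. If two versions shared the same design fault (a common mode failure), the voter would return the wrong answer, which is why diversity matters.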

Recovery Block Approach

The recovery block operates with an adjudicator which confirms the results of various implementations of the same algorithm.

In a system with recovery blocks, the system view is broken down into fault recoverable blocks.

The entire system is constructed of these fault tolerant blocks.

Each block contains at least a primary, secondary, and exceptional case code along with an adjudicator.

The adjudicator is the component which determines the correctness of the various blocks to try.

Upon first entering a unit, the adjudicator first executes the primary alternate.

If the adjudicator determines that the primary block failed, it then tries to roll back the state of the system and tries the secondary alternate.
If the adjudicator does not accept the results of any of the alternates, it then invokes the exception handler, which then indicates the fact that the software could not perform the requested operation.
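The control flow described above can be sketched as follows. This is a simplified illustration, not a reference implementation: the names `recovery_block`, `primary`, `secondary`, and `accept` are assumptions, rollback is modelled by working on a deep copy of the state, and the primary alternate contains a seeded fault.

```python
import copy


def recovery_block(state, alternates, acceptance_test):
    """Try each alternate in order, each on a fresh copy of the state.

    The untouched original plays the role of the recovery point: an
    alternate that is rejected never modifies the caller's state.
    """
    for alternate in alternates:
        trial = copy.deepcopy(state)   # roll back = discard the trial copy
        try:
            result = alternate(trial)
        except Exception:
            continue                   # a crash counts as a failed alternate
        if acceptance_test(result):
            return result
    # Exceptional case: no alternate was accepted.
    raise RuntimeError("all alternates rejected by the adjudicator")


def primary(data):
    data.pop()                         # seeded fault: loses an element
    data.sort()
    return data


def secondary(data):
    return sorted(data)


original = [3, 1, 2]


def accept(result):
    return result == sorted(result) and len(result) == len(original)


out = recovery_block(original, [primary, secondary], accept)
```

The acceptance test here checks ordering and length only; how strong the adjudicator's test is determines which faults the block can catch.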

Software Redundancy Structures

Time Redundancy
Achieves fault tolerance by performing an operation several times.

Timeouts and retransmissions in reliable point-to-point and group communication are examples of time redundancy.

This form of redundancy is useful in the presence of transient or intermittent faults. It is of no use with permanent faults.
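A minimal sketch of time redundancy is a bounded retry loop. The helper `retry` and the simulated `flaky_read` operation are hypothetical names for this example, and the sketch assumes `attempts` is at least 1.

```python
def retry(operation, attempts=3):
    """Run operation up to `attempts` times; re-raise the last failure."""
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except Exception as error:
            last_error = error
    raise last_error


calls = {"n": 0}


def flaky_read():
    # Simulated transient fault: the first two calls fail.
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient fault")
    return "data"


value = retry(flaky_read)
```

If the fault were permanent, every attempt would fail and the retries would only cost time, which matches the remark that time redundancy is of no use with permanent faults.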

Time Redundancy
1. Recovery Points

2. Backward Error Recovery

Information Redundancy
The basic idea of information redundancy is to provide more information than is strictly necessary and to use that extra information to check for errors.

We use coding all the time ourselves, while correcting for typographical errors.

For example, if we encounter the word "startegic", we will most likely unconsciously correct it to "strategic".

This was possible because (a) there is no such word as "startegic", and (b) "strategic" is the closest word that we can think of to "startegic".

The conditions (a) and (b) are at the basis of all coding theory.

All computer words are strings of 0s and 1s. Coding ensures that not all strings of 0s and 1s are legal (i.e., are valid).

When assessing a coding scheme, we want to know how many extra bits it adds to the words, and how many bit errors it can detect or correct.

We are interested in how much work it takes to encode.
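Conditions (a) and (b) can be illustrated with a toy code in which only four 5-bit strings are legal. The codeword set below is an assumption chosen for this example (its minimum Hamming distance is 3, so any single bit error can be corrected); `decode` picks the legal word closest to what was received.

```python
def hamming_distance(a, b):
    """Number of positions in which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


# Hypothetical code: only these four 5-bit words are legal.
CODEWORDS = ["00000", "01011", "10101", "11110"]


def decode(received):
    # Condition (a): a received word outside CODEWORDS reveals an error.
    if received in CODEWORDS:
        return received
    # Condition (b): correct it to the closest legal word.
    return min(CODEWORDS, key=lambda c: hamming_distance(c, received))


corrected = decode("01111")   # "01011" with one bit flipped
```

The price of this correction ability is three extra bits per two bits of data, which is the kind of overhead the assessment questions above are asking about.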

Information Redundancy structures

Repetition Codes
Parity coding
Checksum codes
Cyclic Redundancy check
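The four schemes listed above can each be sketched in a few lines. The helper names are assumptions for this illustration; the CRC uses `zlib.crc32` from the Python standard library, while the repetition, parity, and checksum functions are deliberately simple textbook variants rather than any particular standard.

```python
import zlib


def repetition_encode(bits, n=3):
    """Repeat every bit n times."""
    return "".join(b * n for b in bits)


def repetition_decode(coded, n=3):
    """Majority-vote each group of n bits."""
    groups = [coded[i:i + n] for i in range(0, len(coded), n)]
    return "".join("1" if g.count("1") * 2 > n else "0" for g in groups)


def even_parity_bit(bits):
    """Bit that makes the total number of 1s even."""
    return str(bits.count("1") % 2)


def checksum8(data):
    """Sum of all bytes, modulo 256."""
    return sum(data) % 256


encoded = repetition_encode("101")          # "111000111"
recovered = repetition_decode("110000111")  # one flipped bit is corrected
parity = even_parity_bit("1011")            # three 1s, so the parity bit is "1"
crc_ok = zlib.crc32(b"fault")
crc_bad = zlib.crc32(b"fauls")              # corrupted message, different CRC
```

Parity and checksums only detect errors, the repetition code can also correct them, and the CRC detects a much wider class of error patterns than the simple checksum at the cost of more computation.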

Data diversity
Data diversity is an approach that can be used in
association with any of the redundancy techniques
considered above.
Sometimes, hardware or software may fail for certain
inputs, but not for other inputs that are very close to
them.
So, instead of applying the same input data to the
redundant processors, we apply slightly different input
data to them.
Thus we have in some cases another line of defense
against failure.
This approach will only work if the sensitivity of the output to small changes in the input is low.
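A small sketch of data diversity, under the assumption that the computation tolerates slightly perturbed inputs: the hypothetical function `fragile` fails for one specific input but works for its neighbours, and `with_data_diversity` (also a name invented here) tries the original and two nearby inputs and combines the results that succeed.

```python
import statistics


def fragile(x):
    # Mathematically x + 1, but this form divides by zero at x == 1.
    return (x * x - 1.0) / (x - 1.0)


def with_data_diversity(func, x, eps=1e-6):
    """Evaluate func on x and on two nearby inputs; return the median."""
    results = []
    for candidate in (x, x - eps, x + eps):
        try:
            results.append(func(candidate))
        except ArithmeticError:
            pass  # this particular input triggered the fault
    if not results:
        raise ArithmeticError("all re-expressed inputs failed")
    return statistics.median(results)


approx = with_data_diversity(fragile, 1.0)   # close to 2.0
```

The answer is only approximate, so the technique suits computations where a small input perturbation produces a small output change.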

Reversal Checks
Introduction
If there is a simple relationship between the inputs and outputs of a system, it may then be possible to calculate the inputs given the outputs.
These can then be compared with the actual inputs as a check.

For example, consider a task that finds the square root of a number.
To see if the process is correct, we can square the output and check it against the original input. Or let the task consist of writing a block onto disk.
The reverse operation consists of reading this block from the disk after writing and comparing it to the input to make sure that the two are the same.
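Both examples can be sketched directly. The function names and tolerance values below are assumptions for this illustration; note also that on a real system the read-back may be served from an operating-system cache rather than from the disk itself, so this is a sketch of the idea and not a guarantee about the medium.

```python
import math
import os
import tempfile


def checked_sqrt(x, rel_tol=1e-9):
    """Compute a square root and verify it by squaring the result."""
    root = math.sqrt(x)
    if not math.isclose(root * root, x, rel_tol=rel_tol, abs_tol=1e-12):
        raise ArithmeticError("reversal check failed for sqrt")
    return root


def checked_write(path, block):
    """Write a block, then read it back and compare with the input."""
    with open(path, "wb") as f:
        f.write(block)
    with open(path, "rb") as f:
        if f.read() != block:
            raise IOError("read-back does not match written block")


root = checked_sqrt(2.0)
with tempfile.TemporaryDirectory() as tmp_dir:
    block_path = os.path.join(tmp_dir, "block.bin")
    checked_write(block_path, b"\x00\x01\x02")
    written_size = os.path.getsize(block_path)
```

The square-root check uses a tolerance because floating-point squaring rarely reproduces the input exactly.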

MALICIOUS OR BYZANTINE FAILURES

Introduction
Whenever a failure can cause a unit to behave
arbitrarily, malicious or Byzantine failure is said to
happen.
For correct operation, it is often the case that copies of
the same data as seen by various processors must be
consistent (i.e., the same).
When communication is limited to two-party messages,
the faulty units must be fewer than a third of the total
number of units if consistency is to be guaranteed.
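The one-third bound can be turned into a small calculation. The helper names are hypothetical, and the arithmetic assumes exactly the setting stated above: two-party messages, with the number of faulty units m required to satisfy 3m < n.

```python
def max_byzantine_faults(n_units):
    """Largest m such that 3 * m < n_units."""
    return max(0, (n_units - 1) // 3)


def min_units_for(m_faults):
    """Smallest n such that m_faults is fewer than a third of n."""
    return 3 * m_faults + 1


tolerated_with_4 = max_byzantine_faults(4)   # 1
tolerated_with_3 = max_byzantine_faults(3)   # 0
needed_for_2 = min_units_for(2)              # 7
```

So three units cannot guarantee consistency against even one arbitrarily behaving unit, while four can.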

Integrated failure handling

Introduction

When an error is detected, the system must respond swiftly to deal with it.
In the short term, the error might be masked by voting.
In the long term, the system will have to locate the failure that gave rise to the error and decide what to do with the failed unit.
Three options are usually available:
1. retry
2. disconnect
3. replace

Network fault tolerance:
Includes

Reliable communication protocols

Agreement protocols

Database commit protocols - Application:

Agreement in faulty systems

Introduction
Two army problem:
We'll first examine the case of good processors but faulty communication lines.
This is known as the two army problem.
Byzantine agreement:
The source processor broadcasts its initial value to all other processors.
Agreement: All nonfaulty processors agree on the same value.
Validity: If the source processor is nonfaulty, the common agreed upon value by all nonfaulty processors should be the initial value of the source.
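The two conditions can be written as a checker over the outcome of a protocol run. This does not implement an agreement protocol; it only tests whether a given set of decisions satisfies Agreement and Validity. The function name, the argument layout, and the processor labels are assumptions for this illustration.

```python
def check_byzantine_agreement(decisions, faulty, source, source_value):
    """Return (agreement, validity) for one protocol outcome.

    decisions: dict mapping each processor to the value it decided on
    faulty:    set of processors known (to the checker) to be faulty
    """
    good = [v for unit, v in decisions.items() if unit not in faulty]
    # Agreement: all nonfaulty processors decided on the same value.
    agreement = len(set(good)) <= 1
    # Validity: only constrains the outcome when the source is nonfaulty.
    validity = source in faulty or all(v == source_value for v in good)
    return agreement, validity


# Nonfaulty source "S" sent 1; P3 is faulty and decides arbitrarily.
ok = check_byzantine_agreement(
    {"P1": 1, "P2": 1, "P3": 0}, faulty={"P3"}, source="S", source_value=1)

# The nonfaulty processors agree with each other, but on the wrong value.
bad = check_byzantine_agreement(
    {"P1": 0, "P2": 0, "P3": 1}, faulty={"P3"}, source="S", source_value=1)
```

The second case shows that Agreement alone is not enough: the nonfaulty processors agree, yet Validity is violated.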

Checkpointing & Recovery

Checkpoint-recovery is a common technique for imbuing a program or system with fault tolerant qualities, and grew from the ideas used in systems which employ transaction processing.
It allows systems to recover after some fault interrupts the system and causes the task to fail or be aborted in some way.
While many systems employ the technique to minimize lost processing time, it can be used more broadly to tolerate and recover from faults in a critical application or task.
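A minimal application-level sketch of checkpoint-recovery is shown below. The function names, the checkpoint interval of 10 steps, the file name, and the simulated crash are all assumptions for this example; the state is serialized with `pickle` and written via a temporary file plus `os.replace` so that an interrupted write leaves the previous checkpoint in place.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then atomically swap it into place."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def load_checkpoint(path, default):
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default


def sum_to(n, path, crash_at=None):
    """Sum 0..n-1, checkpointing every 10 steps and resuming if possible."""
    state = load_checkpoint(path, {"i": 0, "total": 0})
    while state["i"] < n:
        if state["i"] == crash_at:
            raise RuntimeError("simulated fault")
        state["total"] += state["i"]
        state["i"] += 1
        if state["i"] % 10 == 0:
            save_checkpoint(path, state)
    return state["total"]


with tempfile.TemporaryDirectory() as tmp_dir:
    ckpt = os.path.join(tmp_dir, "ckpt.pkl")
    try:
        sum_to(100, ckpt, crash_at=57)       # fails partway through
    except RuntimeError:
        pass
    resumed_from = load_checkpoint(ckpt, None)["i"]
    total = sum_to(100, ckpt)                # restarts from the checkpoint
```

Only the work done since the last checkpoint (steps 50 to 56 here) is repeated after the fault, which is the lost processing time the technique aims to minimize.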

Micro checkpointing

A single checkpoint buffer is maintained per multithreaded ARMOR process.
The element state is checkpointed after each operation.
Checkpoints are committed to stable storage after processing a message.
There is no need to do process-wide checkpoints of stacks, heap, and so on.
The existing locking policy of element data removes the need to suspend all threads.

IRIX checkpointing

A facility for saving running processes and, at some other time, restarting the saved processes from the point already reached, without starting all over again.
A checkpoint image is saved in a set of disk files and can comprise:
A set of processes
All processes in the process group (a set of processes that constitute a logical job)
All processes in a process session (a set of processes started from the same physical or logical terminal)