Fault Tolerance in Embedded Systems

The document summarizes key points from a lecture on reliability and fault tolerance in embedded systems. It defines reliability as the probability a system will function as intended for a given time period under specified conditions. It discusses different types of faults like transient faults from glitches versus permanent faults from defective components. It also covers approaches to improve reliability like defining a fault model, assessing fault tolerance, and techniques like error correction codes and redundancy. Fault injection testing is described as a way to accelerate testing for faults by simulating them in software or hardware.

Uploaded by

Suraj Sonu M

Programming Embedded Systems

Lecture 9: Reliability and fault tolerance
Monday Feb 14, 2011
Philipp Rümmer
Uppsala University
[Link]@[Link]

Lecture outline

● Notion + definition of reliability
● Notion of fault tolerance
● Typical classes of faults in embedded systems
● Techniques to improve fault tolerance

Correctness vs. reliability

● Last lecture: correctness of software
  → Absolute notion, assessed using V&V
● In practice: every system will fail (sooner or later)
  ● Total correctness of software is usually too expensive
  ● Hardware faults (unavoidable)
● More relative notion: reliability

Reliability

● Probability R(t) that a device/system fulfils its intended function for a period of time t, given precisely stated conditions
● If T is the time (random variable) when the first failure occurs, then:
  R(t) = P(T > t)
● Assumes knowledge about the probability distribution

Reliability block model: parallelism (“or”)

[Figure: blocks 1 and 2 connected in parallel]

Reliability block model: series (“and”)

[Figure: blocks 1 and 2 connected in series]

● Compare with: fault tree analysis
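
The two block structures can be quantified under the assumption, not stated on the slides, that the blocks fail independently: a series system works only if every block works, and a parallel system fails only if every block fails. A minimal C sketch; the function names are illustrative.

```c
#include <stddef.h>

/* Series ("and"): the system works only if every block works.
   Assumes that the blocks fail independently of each other. */
double series_reliability(const double *r, size_t n)
{
    double result = 1.0;
    for (size_t i = 0; i < n; ++i)
        result *= r[i];
    return result;
}

/* Parallel ("or"): the system fails only if every block fails.
   Assumes that the blocks fail independently of each other. */
double parallel_reliability(const double *r, size_t n)
{
    double all_fail = 1.0;
    for (size_t i = 0; i < n; ++i)
        all_fail *= 1.0 - r[i];
    return 1.0 - all_fail;
}
```

For two blocks with reliabilities 0.9 and 0.8, the series arrangement gives 0.72 and the parallel arrangement gives 0.98.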

Mean time between failures (MTBF)

● E.g., “power supply A has MTBF of 40000 hours”
● Determined by testing N devices for T hours each, counting the number R of failures:
  MTBF = (N · T) / R
● Inverse: failure rate
  λ = 1 / MTBF

Lusser's equation

● Relationship between reliability and MTBF:
  R(t) = e^{-t/MTBF}
  (special case of Weibull distribution)
● Interesting consequence: devices will survive their MTBF only with likelihood
  R(MTBF) = e^{-1} ≈ 37%
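
As a worked illustration of these quantities, assume a constant failure rate, which is the exponential special case of the Weibull distribution. The test campaign numbers (100 devices, 2000 hours each, 5 failures) are hypothetical, chosen only so that they reproduce the 40000-hour MTBF of the power supply example. Note that R denotes the number of failures in the first line and R(t) the reliability function afterwards, following the slides.

```latex
\begin{align*}
\text{MTBF} &= \frac{N \cdot T}{R} = \frac{100 \cdot 2000\,\text{h}}{5} = 40000\,\text{h} \\
\lambda &= \frac{1}{\text{MTBF}} = 2.5 \times 10^{-5}\,\text{h}^{-1} \\
R(10000\,\text{h}) &= e^{-10000/40000} = e^{-0.25} \approx 0.78 \\
R(\text{MTBF}) &= e^{-1} \approx 0.37
\end{align*}
```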

Bathtubs

● Failure rate of hardware is considered to follow a “bathtub” distribution

[Figure: bathtub curve of failure rate over time, with the phases “infant mortality”, normal life (“useful life”), and wear-out]

Improving reliability of a system

● Define fault model describing what could go wrong
  ● E.g., bit-flips in memory, spurious interrupts, noisy sensor data
● Assess fault tolerance of system
  ● Can a fault lead to a failure? How critical?
  ● Testing, fault injection
● Measures to improve fault tolerance
  ● E.g., error correction codes, redundancy

Fault categories

● Software faults
● Environment faults
  ● E.g., wrong sensor data
● Internal hardware faults
  ● E.g., bit flips, defective components

Fault persistence

● Transient faults
  ● Isolated event during system execution
  ● E.g., overheating, glitch in a power line
● Intermittent faults
  ● Malfunction that occurs repeatedly (periodic or aperiodic)
  ● E.g., damaged sensor
● Permanent faults
  ● Permanently defective component
  ● E.g., stuck signal or memory cell
● Note: each of these could be caused by a software bug!

Typical faults in embedded systems

Fault model: memory faults

● Can affect internal RAM, ROM, registers, external RAM, etc.
● Soft error: memory contents are modified (→ transient)
● Hard error: memory cell is damaged, e.g., stuck at zero (→ permanent or intermittent)
● Many possible causes: electrostatic discharge, power surges, vibration, radiation (single-event upsets)

Testing memory faults

● Simple approach: let system run for a long time, observe failures
● Problem: faults might only occur under special circumstances
● Acceleration through fault injection

Memory fault injection

● Hwifi: hardware-implemented fault injection
  ● Can be manual or automatic
  ● E.g., through heavy-ion radiation
● Swifi: software-implemented fault injection
  ● Simulate faults by code instrumentation
  ● More systematic, more possibilities for optimisation
  ● Tools available to automate Swifi
  ● Related to mutation testing

Swifi example

for (;;) {
  if (GPIO_ReadInputDataBit(GPIOC, SwitchPin)) {
    ++count;
  } else if (count != 1) {
    GPIO_WriteBit(GPIOC, ON1Pin, Bit_RESET);
    GPIO_WriteBit(GPIOC, ON2Pin, Bit_RESET);
    count = 1;
  }

  // Instrumentation simulating a bit flip (e.g., triggered randomly)
  if (CONDITION)
    count ^= 1 << BIT;

  if (count == 10)           // 0.2 seconds
    GPIO_WriteBit(GPIOC, ON1Pin, Bit_SET);
  else if (count == 100) {   // 0.2 + 1.8 seconds
    GPIO_WriteBit(GPIOC, ON1Pin, Bit_RESET);
    GPIO_WriteBit(GPIOC, ON2Pin, Bit_SET);
  }
}

Swifi example (2)

● Instrumented code is tested
● If no failures occur, software is tolerant w.r.t. the injected fault
● Alternative method: compare outputs of the original and the instrumented software during tests
● If no difference in outputs can be observed, fault is tolerated
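
The output-comparison method can be sketched on a host machine, without real GPIO. In this hedged sketch the two boolean flags stand in for the ON1Pin and ON2Pin outputs, the initial value count = 1 is an assumption (the slide does not show the initialisation), and CONDITION and BIT are replaced by explicit parameters. All type and function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the controller state; on1/on2 model the
   ON1Pin/ON2Pin outputs instead of real GPIO calls. */
typedef struct {
    unsigned count;
    bool on1, on2;
} ctrl_state;

/* One loop iteration of the example logic. flip_bit < 0 means that no
   fault is injected; otherwise that bit of count is flipped. */
void ctrl_step(ctrl_state *s, bool switch_pressed, int flip_bit)
{
    if (switch_pressed) {
        ++s->count;
    } else if (s->count != 1) {
        s->on1 = false;
        s->on2 = false;
        s->count = 1;
    }

    if (flip_bit >= 0)                 /* the instrumentation */
        s->count ^= 1u << flip_bit;

    if (s->count == 10) {
        s->on1 = true;
    } else if (s->count == 100) {
        s->on1 = false;
        s->on2 = true;
    }
}

/* Runs the original and the instrumented logic on the same input trace,
   injecting a single bit flip at iteration fault_step. Returns true if
   the outputs never differ, i.e. the fault is tolerated on this trace. */
bool fault_tolerated(const bool *trace, size_t len,
                     size_t fault_step, int bit)
{
    ctrl_state ref = {1, false, false};   /* assumed initial state */
    ctrl_state inj = {1, false, false};
    for (size_t i = 0; i < len; ++i) {
        ctrl_step(&ref, trace[i], -1);
        ctrl_step(&inj, trace[i], i == fault_step ? bit : -1);
        if (ref.on1 != inj.on1 || ref.on2 != inj.on2)
            return false;
    }
    return true;
}
```

A verdict of "tolerated" only holds for the tested trace and the injected fault; other traces or bit positions may still expose a failure.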

Protection from memory faults

● Error-correcting memory
  ● RAM: error-correcting codes (ECC), traditionally Hamming codes
  ● ROM: checksums (e.g., CRC)
● Can be implemented in software or hardware (e.g., some Cortex-M3 devices have error-corrected flash memory)
● Memory “scrubbing”: fix errors in memory in the background
● Check-point schemes, runtime monitors, self-stabilising algorithms, ...

CPU/MCU defects

● Components of controller or system can be defective
  → Usually permanent fault
● Might be a byzantine fault
  ● Component continues operating in a faulty manner
  ● “Babbling idiot”
● Difficult to predict or inject
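
To illustrate how a Hamming code corrects a single bit flip, here is a minimal software sketch of the classic Hamming(7,4) code: 4 data bits are protected by 3 parity bits, and the syndrome computed on decoding gives the position of a flipped bit. The bit layout (parity at positions 1, 2 and 4) and the function names are choices made for this example, not something the slides prescribe.

```c
#include <stdint.h>

/* Returns the bit at 1-based position pos of the codeword. */
static unsigned bit_at(uint8_t cw, unsigned pos)
{
    return (cw >> (pos - 1)) & 1u;
}

/* Encodes the low 4 bits of nibble into a 7-bit codeword.
   Positions 1, 2, 4 hold parity; positions 3, 5, 6, 7 hold data. */
uint8_t hamming74_encode(uint8_t nibble)
{
    unsigned d1 = nibble & 1u, d2 = (nibble >> 1) & 1u;
    unsigned d3 = (nibble >> 2) & 1u, d4 = (nibble >> 3) & 1u;
    unsigned p1 = d1 ^ d2 ^ d4;   /* covers positions 3, 5, 7 */
    unsigned p2 = d1 ^ d3 ^ d4;   /* covers positions 3, 6, 7 */
    unsigned p4 = d2 ^ d3 ^ d4;   /* covers positions 5, 6, 7 */
    return (uint8_t)(p1 | (p2 << 1) | (d1 << 2) | (p4 << 3) |
                     (d2 << 4) | (d3 << 5) | (d4 << 6));
}

/* Decodes a 7-bit codeword, correcting at most one flipped bit. */
uint8_t hamming74_decode(uint8_t cw)
{
    unsigned s1 = bit_at(cw, 1) ^ bit_at(cw, 3) ^ bit_at(cw, 5) ^ bit_at(cw, 7);
    unsigned s2 = bit_at(cw, 2) ^ bit_at(cw, 3) ^ bit_at(cw, 6) ^ bit_at(cw, 7);
    unsigned s4 = bit_at(cw, 4) ^ bit_at(cw, 5) ^ bit_at(cw, 6) ^ bit_at(cw, 7);
    unsigned syndrome = s1 | (s2 << 1) | (s4 << 2);

    if (syndrome != 0)            /* syndrome = position of the bad bit */
        cw ^= (uint8_t)(1u << (syndrome - 1));

    return (uint8_t)(bit_at(cw, 3) | (bit_at(cw, 5) << 1) |
                     (bit_at(cw, 6) << 2) | (bit_at(cw, 7) << 3));
}
```

This variant corrects single-bit errors only; two flipped bits in one codeword are silently miscorrected, which is one reason scrubbing is used to repair soft errors before they accumulate.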

Detection of defects

● Built-in self-test (BIST); aka built-in test software (BITS)
  ● Permanently run software tests in the background (when system is idle)
● Built-in watchdogs
● When fault is detected, component can be switched off or replaced, fault can be reported

Spurious interrupts

● Defective components might erroneously trigger interrupts
● Can destroy real-time properties if load becomes too high
● Effects can be mitigated by redundancy
  ● Additional checks whether interrupts are genuine (e.g., confirmation flag set by component via DMA)
  ● Interrupts can be disabled if spurious interrupts are detected
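
The two mitigation measures for spurious interrupts can be combined in a small guard, sketched here as a host-side model rather than real interrupt code. The struct, the threshold SPURIOUS_LIMIT and the function name are hypothetical; on real hardware the confirmation flag would be written by the component (e.g., via DMA) and disabling would mask the interrupt line.

```c
#include <stdbool.h>

#define SPURIOUS_LIMIT 3u   /* hypothetical threshold */

typedef struct {
    volatile bool confirm_flag;  /* set by the component, e.g. via DMA */
    unsigned spurious_count;
    unsigned handled_count;
    bool irq_enabled;
} irq_guard;

/* Called on each interrupt. Returns true if the interrupt was confirmed
   as genuine and handled. */
bool irq_guard_on_interrupt(irq_guard *g)
{
    if (!g->irq_enabled)
        return false;

    if (g->confirm_flag) {
        g->confirm_flag = false;     /* consume the confirmation */
        ++g->handled_count;
        return true;
    }

    /* No confirmation: count as spurious, disable after too many. */
    if (++g->spurious_count >= SPURIOUS_LIMIT)
        g->irq_enabled = false;      /* stand-in for masking the IRQ */
    return false;
}
```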

Precision of sensors

● Most sensors exhibit some amount of noise
  ● E.g., GPS usually off a few meters
  ● Position sensors (like in the elevator lab) might count wrongly
● Improve precision by combining different sources of information
  ● Multiple sensors
  ● Expected sensor values, domain knowledge
  ● Combined using Kalman filters

Networking faults

● Various possible problems
  ● Transmission errors, dropped messages
  ● Defective (byzantine) component sends spurious messages
● Crucial: error-correcting protocols
● Can be tested for:
  ● Simulation of faulty channels
  ● Fuzzing techniques
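
As a rough idea of how a Kalman filter combines noisy readings, here is a one-dimensional sketch for a quantity assumed to be constant apart from some process noise. The struct and function names are illustrative, and real applications (e.g., position tracking) typically need a multi-dimensional model.

```c
typedef struct {
    double x;  /* current state estimate */
    double p;  /* variance (uncertainty) of the estimate */
} kalman1d;

/* Prediction for a constant-state model: the estimate is unchanged,
   its uncertainty grows by the process noise variance q. */
void kalman1d_predict(kalman1d *k, double q)
{
    k->p += q;
}

/* Measurement update with reading z whose noise variance is r.
   The gain weights the reading by how uncertain the estimate is. */
void kalman1d_update(kalman1d *k, double z, double r)
{
    double gain = k->p / (k->p + r);
    k->x += gain * (z - k->x);
    k->p *= 1.0 - gain;
}
```

Starting from x = 0 with variance 1, a reading of 10 with variance 1 moves the estimate to 5 and halves the variance. Readings from several sensors can be fed in one after another, each with its own variance r.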

General approaches to fault tolerance

N-version programming

● Implement N completely independent control systems
  ● Independent hardware
  ● Independently developed software
  ● Independent programmers
  ● But same specification
● Main idea: different systems will contain different kinds of bugs
  → Unlikely that all fail at the same time

N-version programming (2)

● Main idea: ... different kinds of bugs
● Not so clear whether this is actually true: studies show that even independent teams tend to make the same mistakes
● Nevertheless: this is one of the standard techniques

N-version programming (3)

● Different topologies possible
● Voting scheme: majority wins
● Master/slave: if master system fails, slave takes over
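
The voting topology can be sketched as a function that receives the outputs of the N versions and accepts a value only if a strict majority agrees on it. This sketch uses the Boyer-Moore majority vote with a verification pass, which is one possible choice and not taken from the slides; the function name and the use of int outputs are assumptions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stores the majority value in *result and returns true if more than
   half of the n outputs agree; returns false if there is no majority. */
bool majority_vote(const int *outputs, size_t n, int *result)
{
    int candidate = 0;
    size_t count = 0;

    /* Pass 1: find the only possible majority candidate. */
    for (size_t i = 0; i < n; ++i) {
        if (count == 0) {
            candidate = outputs[i];
            count = 1;
        } else if (outputs[i] == candidate) {
            ++count;
        } else {
            --count;
        }
    }

    /* Pass 2: verify that the candidate really has a strict majority. */
    size_t votes = 0;
    for (size_t i = 0; i < n; ++i)
        if (outputs[i] == candidate)
            ++votes;

    if (2 * votes > n) {
        *result = candidate;
        return true;
    }
    return false;
}
```

With three versions producing 42, 42 and 7, the voter returns 42. If all three disagree, it reports that no majority exists and the system has to fall back to some other safe action.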

Checkpoints

● Record system state at particular points during execution
  ● E.g., write to file, send to other system over network
● Can be used for diagnostic purposes, or after a system failure

Recovery blocks

● Extended version of checkpoints: split computation into recovery blocks, which can be rolled back if a fault is detected
● Standard version:

  ensure <acceptance test>
  by <primary alternate>
  else by <alternate 2>
  …
  else by <alternate n>
  else error
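
The ensure/by/else-by structure can be approximated in C with function pointers. In this sketch the checkpoint is simply a copy of a small state struct, which assumes the relevant state is cheap to copy and that alternates have no side effects outside it. All names, including the two example alternates and the acceptance test, are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct { int value; } rb_state;          /* hypothetical state */
typedef void (*rb_alternate)(rb_state *);
typedef bool (*rb_acceptance)(const rb_state *);

/* Tries the alternates in order. Returns true as soon as one passes the
   acceptance test; returns false ("else error") if all of them fail,
   leaving the state as it was on entry. */
bool recovery_block(rb_state *s, rb_acceptance accept,
                    const rb_alternate *alts, size_t n)
{
    const rb_state checkpoint = *s;      /* establish checkpoint */
    for (size_t i = 0; i < n; ++i) {
        alts[i](s);                      /* execute one alternate */
        if (accept(s))
            return true;                 /* checkpoint is discarded */
        *s = checkpoint;                 /* restore checkpoint */
    }
    return false;
}

/* Example alternates: a buggy primary and a simple fallback. */
void alt_buggy(rb_state *s)  { s->value = -1; }
void alt_simple(rb_state *s) { s->value *= 2; }

/* Example acceptance test: the result must not be negative. */
bool accept_non_negative(const rb_state *s) { return s->value >= 0; }
```

Calling recovery_block with the alternates alt_buggy and alt_simple on a state holding 21 rejects the first result, rolls back, and accepts 42 from the second alternate.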

Basic recovery schema

● Enter recovery block: establish checkpoint of relevant system state
● Execute one alternate
● Acceptance condition holds → Discard checkpoint
● Acceptance condition violated → Restore checkpoint

Properties of recovery blocks

● Can mitigate various faults
  ● Software faults in one of the alternates
  ● Transient hardware faults
  ● Transient erroneous sensor data
  ● Timing problems: first try expensive computation, abort if it takes too long and use a faster (but sub-optimal) variant

Properties of recovery blocks (2)

● Of course, introduces new problems
  ● Side-effects of alternates can be hard to undo
  ● Some computations might be impossible to repeat
  ● Checkpointing can be expensive
  ● Maybe no time for repetition
● Concept is related to transactional memory

Watchdogs

● One of the most important techniques
● Watchdog is a component that monitors the system
  ● If system does not react any more, watchdog restarts it
● Should be as independent as possible from system
  ● But: most micro-controllers have built-in watchdogs (often with independent clock)

Watchdogs (2)

● Typically: watchdog works like a timer
  ● Continuously counts up to some limit
  ● If limit is reached, system is restarted
  ● System has to reset the timer regularly to prevent restart
● STM32 Cortex-M3 has two built-in watchdogs
  ● “Independent watchdog” (own clock)
  ● “Window watchdog”

Window watchdog example

● 8-bit timer, counting downward
● Timer has to be refreshed within a certain “window”
● Too early or too late refresh → System is restarted
● Refreshing can be done using an interrupt
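
The window behaviour can be modelled in plain C to see both failure cases. This is a simplified software simulation under assumed semantics (refresh is allowed only once the down-counter has dropped to the window value or below, and reaching zero means the refresh came too late). It does not reproduce the register interface of the STM32 window watchdog, and all names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint8_t counter;     /* current value of the down-counter */
    uint8_t reload;      /* value loaded on a valid refresh */
    uint8_t window;      /* refresh allowed only if counter <= window */
    bool reset_fired;    /* models the system restart */
} window_wdg;

void wwdg_init(window_wdg *w, uint8_t reload, uint8_t window)
{
    w->counter = reload;
    w->reload = reload;
    w->window = window;
    w->reset_fired = false;
}

/* One timer tick: count down; reaching zero means the refresh came
   too late and the system is restarted. */
void wwdg_tick(window_wdg *w)
{
    if (w->reset_fired)
        return;
    if (w->counter > 0)
        --w->counter;
    if (w->counter == 0)
        w->reset_fired = true;
}

/* Refresh request from the monitored system. A refresh before the
   window opens is too early and also restarts the system. */
void wwdg_refresh(window_wdg *w)
{
    if (w->reset_fired)
        return;
    if (w->counter > w->window)
        w->reset_fired = true;
    else
        w->counter = w->reload;
}
```

With reload 100 and window 40, a refresh after 70 ticks is accepted, while a refresh right after initialisation, or no refresh for 100 ticks, triggers the restart.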

Watchdogs in general

● Watchdogs have to discriminate:
  ● Normal system execution, from
  ● System that hangs, or that is running rogue
● Difficult in general: byzantine system can do anything
● Interesting read: [Link]
