0% found this document useful (0 votes)
21 views5 pages

Fault Tolerance in Embedded Systems

The document summarizes key points from a lecture on reliability and fault tolerance in embedded systems. It defines reliability as the probability a system will function as intended for a given time period under specified conditions. It discusses different types of faults like transient faults from glitches versus permanent faults from defective components. It also covers approaches to improve reliability like defining a fault model, assessing fault tolerance, and techniques like error correction codes and redundancy. Fault injection testing is described as a way to accelerate testing for faults by simulating them in software or hardware.

Uploaded by

Suraj Sonu M
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
21 views5 pages

Fault Tolerance in Embedded Systems

The document summarizes key points from a lecture on reliability and fault tolerance in embedded systems. It defines reliability as the probability a system will function as intended for a given time period under specified conditions. It discusses different types of faults like transient faults from glitches versus permanent faults from defective components. It also covers approaches to improve reliability like defining a fault model, assessing fault tolerance, and techniques like error correction codes and redundancy. Fault injection testing is described as a way to accelerate testing for faults by simulating them in software or hardware.

Uploaded by

Suraj Sonu M
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Programming Embedded Systems Lecture outline

● Notion + definition of reliability


Lecture 9 ● Notion of fault tolerance
Reliability and fault tolerance ● Typical classes of faults in
Monday Feb 14, 2011 embedded systems
● Techniques to improve fault tolerance
Philipp Rümmer
Uppsala University
[Link]@[Link]

1/37 2/37

Correctness vs. reliability Reliability

● Last lecture: ● Probability R(t) that a device/system


Correctness of software fulfils its intended function for a
→ Absolute notion, assessed using V&V period of time t, given precisely
● In practice: stated conditions
Every system will fail (sooner or later) ● If T is time (random variable) when first
● Totally correct of software is usually too failure occurs, then:
expensive
● Hardware faults (unavoidable)
● More relative notion: reliability Assumes knowledge
about probability
distribution
3/37 4/37

Reliability block model: Reliability block model:


parallelism (“or”) series (“and”)

1 2

● Compare with: fault tree analysis


5/37 6/37

Mean time between failures


Lusser's equation
(MTBF)
● E.g., “power supply A has MTBF of ● Relationship between reliability and
40000 hours” MTBF:
● Determined by testing N devices for T
hours each, counting the number R of
failures (special case of Weibull distribution)
● Interesting consequence:
Devices will survive their MTBF only
with likelihood
● Inverse: failure rate

7/37 8/37
Bathtubs Improving reliability of a system
● Define fault model describing what could
● Failure rate of hardware is considered go wrong
to follow a “bathtub” distribution
● E.g., bit-flips in memory, spurious
interrupts, noisy sensor data
“Infant
Failure
mortality” Wear-out ● Assess fault tolerance of system
rate
● Can a fault lead to a failure? How critical?
● Testing, fault injection
Normal life
● Measures to improve fault tolerance
(“useful life”)
● E.g., error correction codes, redundancy
Time
9/37 10/37

Faults categories Fault persistence


● Transient faults
● Software faults ● Isolated event during system execution
● Environment faults ● E.g., overheating, glitch in a power line
● E.g., wrong sensor data ● Intermittent faults
● Internal hardware faults ● Malfunction that occurs repeatedly
● E.g., bit flips, defective components (periodic or aperiodic)
● E.g., damaged sensor Could each be
caused by
software bug!
● Permanent faults
● Permanently defective component
11/37 ● E.g., stuck signal or memory cell 12/37

Fault model: memory faults


● Can affect internal RAM, ROM,
registers, external RAM, etc.
● Soft error: memory contents are
Typical faults modified (→ transient)
in embedded systems ● Hard error: memory cell is damaged,
e.g., stuck at zero
(→ permanent or intermittent)
● Many possible causes: electrostatic
discharge, power surging, vibration,
radiation (single-event upsets)
13/37 14/37

Testing memory faults Memory fault injection


● Hwifi: hardware-implemented fault
● Simple approach:
injection
Let system run for a long time, observe
failures ● Can be manual or automatic
● E.g., through heavy-ion radiation
● Problem: faults might only occur under
special circumstances ● Swifi: software-implemented f.i.
● Acceleration through fault injection
● Simulate faults by code instrumentation
● More systematic, more possibilities for
optimisation
● Tools available to automate Swifi
15/37
● Related to mutation testing 16/37
Swifi example Swifi example (2)
for (;;) {
    if (GPIO_ReadInputDataBit(GPIOC, SwitchPin)) {
      ++count;
● Instrumented code is tested
    } else if (count != ­1) {
      GPIO_WriteBit(GPIOC, ON1Pin, Bit_RESET); ● If no failures occur, software is tolerant
      GPIO_WriteBit(GPIOC, ON2Pin, Bit_RESET);
      count = ­1; w.r.t. the injected fault
    }
Instrumentation ● Alternative method:
    if (CONDITION) simulating a bit flip
      count ^= 1 << BIT; (e.g., triggered randomly) Compare outputs of the original and
    if (count == 10)           // 0.2 seconds the instrumented software during tests
      GPIO_WriteBit(GPIOC, ON1Pin, Bit_SET);
    else if (count == 100) {   // 0.2 + 1.8 seconds
● If no difference in outputs can be
      GPIO_WriteBit(GPIOC, ON1Pin, Bit_RESET);
      GPIO_WriteBit(GPIOC, ON2Pin, Bit_SET);
observed, fault is tolerated
    }

17/37 18/37

Protection from memory faults CPU/MCU defects


● Error-correcting memory
● Components of controller or system
can be defective
● RAM: error-correcting codes (ECC),
→ Usually permanent fault
traditionally Hamming codes
● ROM: checksums (e.g., CRC)
● Might be a byzantine fault
● Can be implemented in software or ● Component continues operating in a
hardware (e.g., some CORTEX M3 faulty manner
have error-corrected flash memory) ● “Babbling idiot”
● Memory “scrubbing:” fix errors in ● Difficult to predict or inject
memory in the background
● Check-point schemes, runtime
monitors, self-stabilising algorithms, ...19/37 20/37

Detection of defects Spurious interrupts


● Built-in self-test (BIST);
aka built-in test software (BITS) ● Defective components might
erroneously trigger interrupts
● Permanently run software tests in the
background (when system is idle) ● Can destroy real-time properties if load
becomes too high
● Built-in watchdogs ● Effects can be mitigated by redundancy
● When fault is detected, component can ● Additional checks whether interrupts
be switched off or replaced, fault can are genuine (e.g., confirmation flag
be reported set by component via DMA)
● Interrupts can be disabled if spurious
interrupts are detected
21/37 22/37

Precision of sensors Networking faults


● Most sensors exhibit some amount of ● Various possible problems
noise ● Transmission errors, dropped messages
● E.g., GPS usually off a few meters ● Defective (byzantine) component sends
● Position sensors (like in the elevator spurious messages
lab) might count wrongly ● Crucial: error-correcting protocols
● Improve precision by combining
different sources of information
● Can be tested for:
● Simulation of faulty channels
● Multiple sensors
● Fuzzing techniques
● Expected sensor values, domain knowl.
● Combined using Kalman filters 23/37 24/37
N-version programming

● Implement N completely independent


control systems
General approaches ● Independent hardware
to fault tolerance ● Independently developed software
● Independent programmers
● But same specification
● Main idea: different systems will
contain different kinds of bugs
→ Unlikely that all fail at the same time
25/37 26/37

N-version programming (2) N-version programming (3)

● Main idea: ... different kinds of bugs ● Different topologies possible


● Not so clear whether this is actually ● Voting scheme:
true: Majority wins
Studies show that even independent ● Master/slave
teams tend to make the same mistakes If master system fails, slave takes
over

● Nevertheless: this is one of the


standard techniques

27/37 28/37

Checkpoints Recovery blocks

● Record system state at particular ● Extended version of checkpoints:


points during execution Split computation into recovery blocks,
● E.g., write to file, send to other system which can be rolled back if a fault is
over network detected
● Can be used for diagnostic purposes, or
● Standard version:
after a system failure ensure <acceptance test>
by <primary alternate>
else by <alternate 2>
… …
else by <alternate n>
else error
29/37 30/37

Basic recovery schema Properties of recovery blocks

Enter recovery block:


● Can mitigate various faults
establish checkpoint of Execute one
relevant system state alternate ● Software faults in one of the alternates
Acceptance
condition holds
● Transient hardware faults
Acceptance
condition
● Transient erroneous sensor data
violated ● Timing problems:
first try expensive computation, abort
Restore checkpoint Discard checkpoint if it takes too long and use a faster
(but sub-optimal) variant

31/37 32/37
Properties of recovery blocks (2) Watchdogs
● Of course, introduces new problems
● One of the most important techniques
● Side-effects of alternates can be hard
to undo ● Watchdog is a component that
● Some computations might be monitors the system
impossible to repeat ● If system does not react any more,
● Checkpointing can be expensive watchdog restarts it
● Maybe no time for repetition ● Should be as independent as possible
from system
● But: most micro-controllers have built-
● Concept is related to in watchdogs
transactional memory (often with independent clock)
33/37 34/37

Watchdogs (2) Window watchdog example

● Typically: watchdog works like a timer


● 8-bit timer, counting downward
● Continuously counts up to some limit
● Timer has to be refreshed within a
● If limit is reached, system is restarted
certain “window”
● System has to reset the timer regularly ● Too early or too
to prevent restart late refresh
● STM32 CORTEX M3 has two built-in → System is
watchdogs restarted
● “Independent watchdog” (own clock)
● Refreshing can be
● “Window watchdog”
done using
35/37
an interrupt 36/37

Watchdogs in general

● Watchdogs have to discriminate:


● Normal system execution, from
● System that hangs, or that is running
rogue
● Difficult in general:
Byzantine system can do anything
● Interesting read:
[Link]

37/37

You might also like