Programming Embedded Systems Lecture outline
● Notion + definition of reliability
Lecture 9 ● Notion of fault tolerance
Reliability and fault tolerance ● Typical classes of faults in
Monday Feb 14, 2011 embedded systems
● Techniques to improve fault tolerance
Philipp Rümmer
Uppsala University
[Link]@[Link]
1/37 2/37
Correctness vs. reliability Reliability
● Last lecture: ● Probability R(t) that a device/system
Correctness of software fulfils its intended function for a
→ Absolute notion, assessed using V&V period of time t, given precisely
● In practice: stated conditions
Every system will fail (sooner or later) ● If T is time (random variable) when first
● Totally correct of software is usually too failure occurs, then:
expensive
● Hardware faults (unavoidable)
● More relative notion: reliability Assumes knowledge
about probability
distribution
3/37 4/37
Reliability block model: Reliability block model:
parallelism (“or”) series (“and”)
1 2
● Compare with: fault tree analysis
5/37 6/37
Mean time between failures
Lusser's equation
(MTBF)
● E.g., “power supply A has MTBF of ● Relationship between reliability and
40000 hours” MTBF:
● Determined by testing N devices for T
hours each, counting the number R of
failures (special case of Weibull distribution)
● Interesting consequence:
Devices will survive their MTBF only
with likelihood
● Inverse: failure rate
7/37 8/37
Bathtubs Improving reliability of a system
● Define fault model describing what could
● Failure rate of hardware is considered go wrong
to follow a “bathtub” distribution
● E.g., bit-flips in memory, spurious
interrupts, noisy sensor data
“Infant
Failure
mortality” Wear-out ● Assess fault tolerance of system
rate
● Can a fault lead to a failure? How critical?
● Testing, fault injection
Normal life
● Measures to improve fault tolerance
(“useful life”)
● E.g., error correction codes, redundancy
Time
9/37 10/37
Faults categories Fault persistence
● Transient faults
● Software faults ● Isolated event during system execution
● Environment faults ● E.g., overheating, glitch in a power line
● E.g., wrong sensor data ● Intermittent faults
● Internal hardware faults ● Malfunction that occurs repeatedly
● E.g., bit flips, defective components (periodic or aperiodic)
● E.g., damaged sensor Could each be
caused by
software bug!
● Permanent faults
● Permanently defective component
11/37 ● E.g., stuck signal or memory cell 12/37
Fault model: memory faults
● Can affect internal RAM, ROM,
registers, external RAM, etc.
● Soft error: memory contents are
Typical faults modified (→ transient)
in embedded systems ● Hard error: memory cell is damaged,
e.g., stuck at zero
(→ permanent or intermittent)
● Many possible causes: electrostatic
discharge, power surging, vibration,
radiation (single-event upsets)
13/37 14/37
Testing memory faults Memory fault injection
● Hwifi: hardware-implemented fault
● Simple approach:
injection
Let system run for a long time, observe
failures ● Can be manual or automatic
● E.g., through heavy-ion radiation
● Problem: faults might only occur under
special circumstances ● Swifi: software-implemented f.i.
● Acceleration through fault injection
● Simulate faults by code instrumentation
● More systematic, more possibilities for
optimisation
● Tools available to automate Swifi
15/37
● Related to mutation testing 16/37
Swifi example Swifi example (2)
for (;;) {
if (GPIO_ReadInputDataBit(GPIOC, SwitchPin)) {
++count;
● Instrumented code is tested
} else if (count != 1) {
GPIO_WriteBit(GPIOC, ON1Pin, Bit_RESET); ● If no failures occur, software is tolerant
GPIO_WriteBit(GPIOC, ON2Pin, Bit_RESET);
count = 1; w.r.t. the injected fault
}
Instrumentation ● Alternative method:
if (CONDITION) simulating a bit flip
count ^= 1 << BIT; (e.g., triggered randomly) Compare outputs of the original and
if (count == 10) // 0.2 seconds the instrumented software during tests
GPIO_WriteBit(GPIOC, ON1Pin, Bit_SET);
else if (count == 100) { // 0.2 + 1.8 seconds
● If no difference in outputs can be
GPIO_WriteBit(GPIOC, ON1Pin, Bit_RESET);
GPIO_WriteBit(GPIOC, ON2Pin, Bit_SET);
observed, fault is tolerated
}
17/37 18/37
Protection from memory faults CPU/MCU defects
● Error-correcting memory
● Components of controller or system
can be defective
● RAM: error-correcting codes (ECC),
→ Usually permanent fault
traditionally Hamming codes
● ROM: checksums (e.g., CRC)
● Might be a byzantine fault
● Can be implemented in software or ● Component continues operating in a
hardware (e.g., some CORTEX M3 faulty manner
have error-corrected flash memory) ● “Babbling idiot”
● Memory “scrubbing:” fix errors in ● Difficult to predict or inject
memory in the background
● Check-point schemes, runtime
monitors, self-stabilising algorithms, ...19/37 20/37
Detection of defects Spurious interrupts
● Built-in self-test (BIST);
aka built-in test software (BITS) ● Defective components might
erroneously trigger interrupts
● Permanently run software tests in the
background (when system is idle) ● Can destroy real-time properties if load
becomes too high
● Built-in watchdogs ● Effects can be mitigated by redundancy
● When fault is detected, component can ● Additional checks whether interrupts
be switched off or replaced, fault can are genuine (e.g., confirmation flag
be reported set by component via DMA)
● Interrupts can be disabled if spurious
interrupts are detected
21/37 22/37
Precision of sensors Networking faults
● Most sensors exhibit some amount of ● Various possible problems
noise ● Transmission errors, dropped messages
● E.g., GPS usually off a few meters ● Defective (byzantine) component sends
● Position sensors (like in the elevator spurious messages
lab) might count wrongly ● Crucial: error-correcting protocols
● Improve precision by combining
different sources of information
● Can be tested for:
● Simulation of faulty channels
● Multiple sensors
● Fuzzing techniques
● Expected sensor values, domain knowl.
● Combined using Kalman filters 23/37 24/37
N-version programming
● Implement N completely independent
control systems
General approaches ● Independent hardware
to fault tolerance ● Independently developed software
● Independent programmers
● But same specification
● Main idea: different systems will
contain different kinds of bugs
→ Unlikely that all fail at the same time
25/37 26/37
N-version programming (2) N-version programming (3)
● Main idea: ... different kinds of bugs ● Different topologies possible
● Not so clear whether this is actually ● Voting scheme:
true: Majority wins
Studies show that even independent ● Master/slave
teams tend to make the same mistakes If master system fails, slave takes
over
● Nevertheless: this is one of the
standard techniques
27/37 28/37
Checkpoints Recovery blocks
● Record system state at particular ● Extended version of checkpoints:
points during execution Split computation into recovery blocks,
● E.g., write to file, send to other system which can be rolled back if a fault is
over network detected
● Can be used for diagnostic purposes, or
● Standard version:
after a system failure ensure <acceptance test>
by <primary alternate>
else by <alternate 2>
… …
else by <alternate n>
else error
29/37 30/37
Basic recovery schema Properties of recovery blocks
Enter recovery block:
● Can mitigate various faults
establish checkpoint of Execute one
relevant system state alternate ● Software faults in one of the alternates
Acceptance
condition holds
● Transient hardware faults
Acceptance
condition
● Transient erroneous sensor data
violated ● Timing problems:
first try expensive computation, abort
Restore checkpoint Discard checkpoint if it takes too long and use a faster
(but sub-optimal) variant
31/37 32/37
Properties of recovery blocks (2) Watchdogs
● Of course, introduces new problems
● One of the most important techniques
● Side-effects of alternates can be hard
to undo ● Watchdog is a component that
● Some computations might be monitors the system
impossible to repeat ● If system does not react any more,
● Checkpointing can be expensive watchdog restarts it
● Maybe no time for repetition ● Should be as independent as possible
from system
● But: most micro-controllers have built-
● Concept is related to in watchdogs
transactional memory (often with independent clock)
33/37 34/37
Watchdogs (2) Window watchdog example
● Typically: watchdog works like a timer
● 8-bit timer, counting downward
● Continuously counts up to some limit
● Timer has to be refreshed within a
● If limit is reached, system is restarted
certain “window”
● System has to reset the timer regularly ● Too early or too
to prevent restart late refresh
● STM32 CORTEX M3 has two built-in → System is
watchdogs restarted
● “Independent watchdog” (own clock)
● Refreshing can be
● “Window watchdog”
done using
35/37
an interrupt 36/37
Watchdogs in general
● Watchdogs have to discriminate:
● Normal system execution, from
● System that hangs, or that is running
rogue
● Difficult in general:
Byzantine system can do anything
● Interesting read:
[Link]
37/37