Introduction to Robust Systems
Subhasish Mitra
Stanford University
Email: subh@[Link]
1
Objective of this Talk
Brainstorm
What is a robust system ?
How can we build robust systems ?
Robust systems research directions
How can EE392U be beneficial ?
Not covered: Specific solutions
2
What is a Robust Design ?
Quote from
[Link]/robust_desig/[Link]
“Not just strong”
“Flexible ! Idiot-proof ! Simple ! Efficient !”
Consistent high-level performance
Wide range of changing conditions
Client & manufacturing related
Anticipated vs. unanticipated
3
Robust Computing System
[Figure: a Robust System maps Inputs to “Acceptable” Outputs]
Threats: defects, process variation, degraded transistors, radiation, noise, design errors, software failures, malicious attacks, human errors
Output properties: performance, power, data integrity, availability, security
4
Availability & Data Integrity
Availability: probability that the system is operational at time t
Telecom: 99.999% availability, i.e., about 5 mins./year downtime
Data integrity: no undetected errors
$20K not interpreted as $3,616 (a single flipped bit: 20,000 - 16,384 = 3,616)
“Beagle2 mission presumed to be lost” ([Link])
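The "five nines" figure above follows from simple arithmetic. A minimal sketch, assuming availability is a steady-state fraction of uptime and a 365-day year (the function name is illustrative, not from the talk):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability: float) -> float:
    """Expected downtime per year for an availability given as a fraction in [0, 1]."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


for label, a in [("three nines", 0.999), ("four nines", 0.9999), ("five nines", 0.99999)]:
    print(f"{label}: {downtime_minutes_per_year(a):.2f} min/year")
```

For 99.999% this gives roughly 5.26 minutes per year, which matches the telecom number quoted on the slide.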
5
Safety
“Active” safety for drive-by-wire systems
Implantable medical devices
Nano-robot assisted remote surgery
“Context-aware” “pro-active” healthcare system
6
Drive-by-wire a Reality
What about reliability ?
7
Security
Major adversaries
Security thefts
Viruses, hacks, spam
Terrorists
8
Power & Performance
9
Server Reliability Goals
MTBF = Mean Time Between Failures
Taken from Bossen, IRPS 2002
10
Causes of System Failures ?
Depends on who you talk to
Application domains
PCs vs. servers
Medical devices, automotive, …
System configurations (& costs)
Hardware costs & lifetime
Single vs. clusters
Application-specific vs. general-purpose
11
Windows XP Failures
[Murphy, ACM Queue, Nov. 2004]
5% Microsoft software bugs, 12% hardware,
83% 3rd party – what do you call a bug ?
[Figure: two pie charts. Hardware failures by component: BIOS, processor (1%), memory, display, disk, others. 3rd party driver crashes by category: firewall, general, modem, audio, anti-virus, CD-burning.]
12
Windows XP Failures: Observations
3rd party driver crashes – bug definition ?
Increasing hardware failures – aging hardware ?
Good processor reliability enablers
Short PC lifetime
Speed & voltage guardbands during design
Price: power & performance cost
Classical scaling was sufficient
Inexpensive test & reliability screens
BUT, progressively harder in sub-65nm
13
Causes of Server Unavailability – Data from the Past
[Figure: Total Outage Cause, split into Scheduled and Unscheduled]
For 24x7 must address both scheduled & unscheduled
[Figure: Unscheduled Outage Cause, split into Software, Hardware, and Other]
Often operator error predominates
Software as source of downtime
Ack: Lisa Spainhower, IBM
14
Server Failures: Observations
Most software bugs are “soft”
Heisenbugs – Gone after reboot / restart
Repair time is “key” here
Operator errors – major issue going forward
High hardware reliability
Enabled by hardware redundancy, BUT
Redundancy expensive
Hardware failure rates increasing
Performance & power scaling slowdown
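One way to see why repair time is "key" for soft bugs: under the textbook steady-state model, availability = MTBF / (MTBF + MTTR). The sketch below assumes that model, treats MTBF as the mean up time between failures, and uses made-up numbers; none of the values come from the talk.

```python
def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up, given mean up time and mean repair time."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf_hours must be positive and mttr_hours non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Same failure rate (one failure per 1000 hours), different repair times.
slow_repair = steady_state_availability(1000.0, 1.0)    # one-hour manual repair
fast_restart = steady_state_availability(1000.0, 0.01)  # 36-second automatic restart

print(f"1 h repair:   {slow_repair:.5f}")
print(f"36 s restart: {fast_restart:.5f}")
```

With the failure rate held fixed, cutting repair time from an hour to a quick restart moves this hypothetical system from about three nines to about five nines, without making failures any rarer.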
15
Why Worry About Hardware Reliability?
Major process variation
Worst-case design impractical
Perfect design verification + test not enough
Manufacturing process imperfect
Testing imperfect: Warranty failures
Transient errors during system operation
e.g., noise, radiation induced soft errors
“Aging”: e.g., slow transistors with time
16
Process variation: Power & Performance Impact
[Figure: Delay vs. Voltage, with max, typ, and min curves and a ∆ annotation]
Ack: Prof. Giovanni De Micheli
17
Bathtub Curve
[Figure: Failure Rate vs. Time, with three regions]
Infant Mortality Period (1-20 weeks): marginal parts due to defects, e.g., gate-to-source shorts, small opens, poor vias & contacts; mitigated by burn-in
Normal lifetime (~ 3-15 years): e.g., soft errors in memories, mitigated by Error Correcting Codes
Wearout: transistor degradation (e.g., PMOS threshold voltage shift), electromigration, oxide breakdown; mitigated by conservative design (overdesign) to avoid failures during intended product lifetime
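The slide mentions Error Correcting Codes for soft errors in memories without naming a code. As an illustration only, here is a sketch of the classic Hamming(7,4) single-error-correcting code; real memories typically use wider codes, and the function names here are hypothetical.

```python
def encode(data):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit Hamming codeword.

    Codeword positions 1..7 hold p1, p2, d1, p4, d2, d3, d4.
    """
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def decode(codeword):
    """Correct up to one flipped bit; return (data_bits, error_position or 0)."""
    c = list(codeword)  # work on a copy so the caller's list is untouched
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4  # 1-based position of the flipped bit, 0 if none
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], syndrome


stored = encode([1, 0, 1, 1])
stored[4] ^= 1  # simulate a soft error flipping the bit at position 5
print(decode(stored))  # ([1, 0, 1, 1], 5)
```

The code corrects any single flipped bit per codeword; two flips in the same word would be miscorrected, which is why practical memory ECC usually adds an extra parity bit for double-error detection.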
18
(Scary?) Bathtub: Future Technologies
[Figure: Failure Rate vs. Time, with three regions]
Infant mortality: advanced technologies, burn-in out of steam ?
Normal lifetime: advanced technologies, increasing transient errors
Wearout: advanced technologies, increasing wearout failures
Exciting opportunities for new system design techniques to cope with failures
19
Related EE392U Seminars
Larry Votta, Distinguished Engineer, SUN – Oct. 3
“Why Do Systems Fail ?” – Oct. 31
Lisa Spainhower, Distinguished Engineer, IBM
“Estimating the Risk of Releasing Software,” – Nov. 7
Brendan Murphy, Microsoft Research
“Reliable Design from Unreliable Components” – Nov. 14
Shekhar Borkar, Fellow, Intel
Columbia Disaster Talk – Nov. 28
Prof. Greg Kovacs, Stanford
20
How to Build Robust Systems ?
Avoidance
Conservative design
Design validation
Thorough hardware & software test
Infant mortality screen for hardware
Transient error avoidance
Proper interfaces to minimize operator errors
Correct-by-construction is simply not sufficient
Several challenges in future
21
How to Build Robust Systems?
Tolerance
Error detection during system operation
Permanent & correlated hardware failures ?
Bohrbugs vs. Heisenbugs ?
On-line monitoring & diagnostics
Self-recovery & Self-repair
Automated self-managing systems
Major Challenge: PROVE these WORK !
Classical fault-tolerance very expensive
Classical fault-tolerance inadequate
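As one example of classical fault tolerance, triple modular redundancy (TMR) runs three replicas and takes a majority vote. The sketch below is an illustration, not a technique the talk prescribes; the replica functions are hypothetical stand-ins for redundant hardware or software units.

```python
from collections import Counter


def majority_vote(outputs):
    """Return (voted_value, error_detected) for a list of replica outputs."""
    value, count = Counter(outputs).most_common(1)[0]
    if 2 * count <= len(outputs):
        raise RuntimeError("no majority: too many replicas disagree")
    return value, count != len(outputs)


def run_redundant(replicas, x):
    """Run every replica on the same input and vote on the results."""
    return majority_vote([replica(x) for replica in replicas])


def square(x):
    return x * x


def faulty_square(x):
    return x * x + 1  # hypothetical replica with a permanent fault


print(run_redundant([square, faulty_square, square], 3))  # (9, True)
```

This shows both concerns raised above. The cost is at least three times the hardware plus a voter, and a correlated failure defeats the vote: if two replicas share the same fault, the wrong value wins the majority.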
22
Fault-Tolerant Computing
23
High Availability Building Blocks
[Figure: building blocks grouped under Fault Tolerance and Fault Avoidance, resting on System Design and Technology]
Fault Tolerance: detect & isolate, recover, concurrent repair, spare/degrade, failure masking, data integrity
Fault Avoidance: design verification, system integration, SW reliability, integration
24
Related EE392U Seminars
Larry Votta, Distinguished Engineer, SUN – Oct. 3
System effects & error protection – Oct. 10
Prof. Ravi Iyer, University of Illinois at Urbana-Champaign
“Fault Tolerance in Space Environments” – Oct. 17
Dr. Philip Shirvani, nVidia
Trusted systems: Prof. Hector Garcia Molina – Oct. 24
“Why Do Systems Fail ?” – Oct. 31
Lisa Spainhower, Distinguished Engineer, IBM
“Estimating the Risk of Releasing Software,” – Nov. 7
Brendan Murphy, Microsoft Research
25
Robust Systems as Research Area – CRA Recommendations
Trouble-free systems
PCs – “zero administration”
Large-scale systems
Millions of users
Administered by single person
Self-diagnosing, self-healing, self-evolving
Dependable and survivable systems
Secure, safe, reliable, available
26
Importance in Revolutionary Nanotechnology
Revolutionary nanotechnologies
e.g., Molecular electronics
Well-acknowledged fact
Regular structures
Defect prone
Errors during normal operation
~ 5-10% faulty
Must be self-healing !
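A back-of-the-envelope calculation shows why a 5-10% fault rate rules out designs that need every device to work. The sketch assumes device faults are independent with the same probability, which is a simplification, and the block sizes are made up for illustration.

```python
from math import comb


def prob_all_good(n: int, p_faulty: float) -> float:
    """Probability that all n devices work, assuming independent faults."""
    return (1.0 - p_faulty) ** n


def prob_at_least_k_good(n: int, k: int, p_faulty: float) -> float:
    """Probability that at least k of n devices work (binomial tail)."""
    p_good = 1.0 - p_faulty
    return sum(
        comb(n, good) * p_good**good * p_faulty ** (n - good)
        for good in range(k, n + 1)
    )


# A block that needs 100 working devices, at a 5% fault rate.
print(f"no spares:       {prob_all_good(100, 0.05):.4f}")
print(f"20 spare devices: {prob_at_least_k_good(120, 100, 0.05):.4f}")
```

With no spares, a 100-device block works less than 1% of the time at a 5% fault rate. Redundancy only helps if the system can locate faulty devices and route around them, which is the self-healing requirement stated above.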
27
Broad Research Directions: Looking for Interested Students
Understand failures for various applications
PCs, large servers, embedded (e.g., cars, digital home), space systems, healthcare
Expertise required
Experimental data collection
Simulation & modeling
Circuits, architecture, systems, HCI
28
Broad Research Directions: Looking for Interested Students
New robust system design techniques
Failure avoidance & resilience support
Expertise required
Circuit / logic design
Architecture
Compiler & operating systems
Human-Computer interaction
29
Broad Research Directions: Looking for Interested Students
Robust system prototypes
PROVE that the system works !
How ?
Not just simulation
Build real system prototypes – How ?
Nanotechnology architectures
Built-in defect & fault-tolerance ?
Conventional methods work for very low failure rates
30