Building Robust Systems: Key Concepts

This document provides an introduction to robust systems. It defines a robust system as one that can maintain consistent high-level performance over a wide range of changing conditions. The document discusses how robustness relates to availability, data integrity, safety, security, power, and performance. It examines causes of system failures from different perspectives and provides examples of server reliability goals and Windows XP and server failure data. The document also outlines approaches for building robust systems through fault avoidance and fault tolerance techniques and lists related seminar topics for a course on robust systems.

Introduction to Robust Systems

Subhasish Mitra

Stanford University

Email: subh@[Link]

1
Objective of this Talk
Brainstorm
What is a robust system ?
How can we build robust systems ?
Robust systems research directions
How can EE392U be beneficial ?
Not covered: Specific solutions

2
What is a Robust Design ?
Quote from
[Link]/robust_desig/[Link]
“Not just strong”
“Flexible ! Idiot-proof ! Simple ! Efficient !”
Consistent high-level performance
Wide range of changing conditions
Client & manufacturing related
Anticipated vs. unanticipated

3
Robust Computing System
[Figure: a robust system maps inputs to “acceptable” outputs despite disturbances.]
Disturbances: defects, process variation, degraded transistors, radiation, noise, design errors, software failures, malicious attacks, human errors
“Acceptable” output attributes: performance, power, data integrity, availability, security

4
Availability & Data Integrity
Availability: Probability system operational at time t
Telecom: 99.999% 5 mins./ year downtime
Data integrity: no undetected errors
$20K not interpreted as $3,616
“Beagle2 mission presumed to be lost” ([Link])
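
The $20K vs. $3,616 example can be made concrete. In binary the two amounts differ in exactly one bit (20,000 - 3,616 = 16,384 = 2^14), which is what a single-bit upset would produce, although the slide does not state the cause. The sketch below assumes a single even-parity bit as the detection scheme; the names parity and flip_bit are illustrative and not from the talk.

```python
def parity(value: int) -> int:
    """Return 1 if the number of set bits in value is odd, else 0."""
    return bin(value).count("1") % 2


def flip_bit(value: int, bit: int) -> int:
    """Return value with the given bit position inverted."""
    return value ^ (1 << bit)


amount = 20_000
stored_parity = parity(amount)          # parity stored alongside the data

# One flipped bit (bit 14) turns 20000 into 3616.
corrupted = flip_bit(amount, 14)
single_flip_detected = parity(corrupted) != stored_parity

# Two flipped bits leave the parity unchanged, so the error goes undetected.
double_corrupted = flip_bit(flip_bit(amount, 14), 0)
double_flip_detected = parity(double_corrupted) != stored_parity

print(corrupted, single_flip_detected, double_flip_detected)
```

A single parity bit catches any odd number of flipped bits but misses even numbers, which is one reason stronger codes are used where "no undetected errors" is the goal.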

5
Safety
“Active” safety for drive-by-wire systems
Implantable medical devices
Nano-robot assisted remote surgery
“Context-aware” “pro-active” healthcare system

6
Drive-by-wire a Reality
What about reliability ?

7
Security
Major adversaries
Security thefts
Virus, hacks, spam,
Terrorists

8
Power & Performance

9
Server Reliability Goals

MTBF = Mean Time Between Failures

Taken from Bossen, IRPS 2002
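
As a rough illustration of how an MTBF figure relates to other common reliability measures, the sketch below assumes a constant failure rate (exponential lifetime model), which the slide does not state. The 50-year MTBF is a hypothetical value, not one of the goals from the original table, and the function names are illustrative.

```python
import math

HOURS_PER_YEAR = 8760  # 365 days * 24 hours


def fit_from_mtbf(mtbf_hours: float) -> float:
    """Failures In Time: expected failures per 10^9 device-hours."""
    return 1e9 / mtbf_hours


def survival_probability(mtbf_hours: float, mission_hours: float) -> float:
    """Probability of no failure over mission_hours, assuming a constant failure rate."""
    return math.exp(-mission_hours / mtbf_hours)


# Hypothetical server with a 50-year MTBF (assumption for illustration only).
mtbf = 50 * HOURS_PER_YEAR
print(f"FIT rate: {fit_from_mtbf(mtbf):.0f}")
print(f"P(no failure in 1 year): {survival_probability(mtbf, HOURS_PER_YEAR):.3f}")
```

Note that an MTBF of 50 years does not mean a unit lasts 50 years; under this model it still has roughly a 2% chance of failing within any given year.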
10
Causes of System Failures ?

Depends on who you talk to

Application domains
PCs vs. servers
Medical devices, automotive, …
System configurations (& costs)
Hardware costs & lifetime
Single vs. clusters
Application-specific vs. general-purpose

11
Windows XP Failures
[Murphy, ACM Queue, Nov. 2004]
5% Microsoft software bugs, 12% hardware,
83% 3rd party – What do you call a bug ?

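[Figure: two pie charts; per-slice percentages are not recoverable from this extraction.]
Hardware failures, by component: labels include BIOS, processor, memory, disk
3rd party driver crashes, by driver type: labels include display, firewall, general, modem, audio, anti-virus, CD-burning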

12
Windows XP Failures: Observations
3rd party driver crashes – bug definition ?
Increasing hardware failures – aging hardware?
Good processor reliability enablers
Short PC lifetime
Speed & voltage guardbands during design
Price: power & performance cost
Classical scaling was sufficient
Inexpensive test & reliability screens
BUT, progressively harder in sub-65nm

13
Causes of Server Unavailability – Data from the Past
[Figure: two charts; values are not recoverable from this extraction.]
Total Outage Cause: scheduled vs. unscheduled
For 24x7 must address both scheduled & unscheduled
Unscheduled Outage Cause: software, hardware, other
Software predominates as source of downtime
Other: often operator error
Ack: Lisa Spainhower, IBM
14
Server Failures: Observations
Most software bugs “soft”
Heisenbugs – Gone after reboot / restart
Repair time is “key” here
Operator errors – major issue going forward
High hardware reliability
Enabled by hardware redundancy, BUT
Redundancy expensive
Hardware failure rates increasing
Performance & power scaling slowdown
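
Why repair time is "key" can be seen from the standard steady-state availability formula, availability = MTBF / (MTBF + MTTR). The sketch below is a worked example under that formula; the MTBF and MTTR values are hypothetical and not taken from the talk. It also reproduces the 99.999% ≈ 5 minutes per year figure quoted on the availability slide.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures and mean time to repair."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_minutes_per_year(avail: float) -> float:
    """Expected downtime per year implied by a given availability."""
    return (1.0 - avail) * MINUTES_PER_YEAR


# "Five nines" corresponds to roughly 5 minutes of downtime per year.
print(f"{downtime_minutes_per_year(0.99999):.2f} min/year")

# Same failure frequency (hypothetical MTBF of 1000 h), two repair strategies:
slow_repair = availability(1000.0, 1.0)   # 1 hour to repair
fast_restart = availability(1000.0, 0.1)  # 6 minute automated restart
print(f"{downtime_minutes_per_year(slow_repair):.0f} min/year")
print(f"{downtime_minutes_per_year(fast_restart):.0f} min/year")
```

With failures equally frequent, cutting repair time by 10x cuts downtime by about 10x, which is why fast restart is effective against soft bugs that disappear on reboot.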

15
Why Worry About Hardware Reliability?
Major process variation
Worst-case design impractical
Perfect design verification + test not enough
Manufacturing process imperfect
Testing imperfect: Warranty failures
Transient errors during system operation
e.g., noise, radiation induced soft errors
“Aging”: e.g., slow transistors with time

16
Process variation: Power & Performance Impact
[Figure: delay vs. ∆ voltage, with min, typ, and max curves.]
Ack: Prof. Giovanni De Micheli
17
Bathtub Curve
[Figure: failure rate vs. time, with three regions.]
Infant mortality (1-20 weeks): marginal parts due to defects, e.g., gate-to-source shorts, small opens, poor vias & contacts; mitigated by burn-in
Normal lifetime (~ 3-15 years): e.g., soft errors in memories, mitigated by Error Correcting Codes
Wearout period: transistor degradation (e.g., PMOS threshold voltage shift), electromigration, oxide breakdown; mitigated by conservative design (overdesign) to avoid failures during intended product lifetime
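
The bathtub shape can be sketched numerically. One common way to model it (an assumption here, not something the slide specifies) is to add three hazard terms: a decreasing Weibull hazard for infant mortality, a constant hazard for random failures during normal lifetime, and an increasing Weibull hazard for wearout. All parameter values below are hypothetical and chosen only to make the shape visible.

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Weibull hazard rate h(t) = (shape/scale) * (t/scale)**(shape - 1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t_years: float) -> float:
    """Failure rate per year at age t_years (t_years must be > 0)."""
    infant = weibull_hazard(t_years, shape=0.3, scale=1000.0)  # decreasing with age
    random_failures = 0.01                                     # constant (e.g., soft errors)
    wearout = weibull_hazard(t_years, shape=5.0, scale=15.0)   # increasing with age
    return infant + random_failures + wearout


for age in (0.05, 1.0, 5.0, 12.0, 20.0):
    print(f"age {age:5.2f} years: failure rate {bathtub_hazard(age):.4f} per year")
```

A shape parameter below 1 gives a falling failure rate and a shape above 1 gives a rising one, so the sum is high early, flat in the middle, and high again late in life.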

18
(Scary?) Bathtub: Future Technologies
[Figure: failure rate vs. time, with infant mortality, normal lifetime, and wearout regions.]
Infant mortality: advanced technologies: burn-in out of steam ?
Normal lifetime: advanced technologies: increasing transient errors
Wearout: advanced technologies: increasing wearout failures
Exciting opportunities for new system design techniques to cope with failures
19
Related EE392U Seminars
Larry Votta, Distinguished Engineer, SUN – Oct. 3
“Why Do Systems Fail ?” – Oct. 31
Lisa Spainhower, Distinguished Engineer, IBM
“Estimating the Risk of Releasing Software,” – Nov. 7
Brendan Murphy, Microsoft Research
“Reliable Design from Unreliable Components” – Nov. 14
Shekhar Borkar, Fellow, Intel
Columbia Disaster Talk – Nov. 28
Prof. Greg Kovacs, Stanford

20
How to Build Robust Systems ?
Avoidance
Conservative design
Design validation
Thorough hardware & software test
Infant mortality screen for hardware
Transient error avoidance
Proper interfaces to minimize operator errors

Correct by Construction Simply Not sufficient

Several challenges in future

21
How to Build Robust Systems?
Tolerance
Error detection during system operation
Permanent & correlated hardware failures ?
Bohrbugs vs. Heisenbugs ?
On-line monitoring & diagnostics
Self-recovery & Self-repair
Automated self-managing systems
Major Challenge: PROVE these WORK !

Classical fault-tolerance very expensive

Classical fault-tolerance inadequate
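
The detect-and-recover idea on this slide, and the Bohrbug vs. Heisenbug distinction, can be sketched as a small retry wrapper: check each result, and re-execute on a detected error. A transient fault (Heisenbug) disappears on retry, while a deterministic one (Bohrbug) fails every attempt and must be escalated. All names here (run_with_recovery, make_heisenbug_task, bohrbug_task) are hypothetical, and the deterministic "glitch" counter stands in for a real transient fault.

```python
class UnrecoverableError(RuntimeError):
    """Raised when every attempt fails, i.e. the fault looks permanent."""


def run_with_recovery(task, is_valid, max_attempts=3):
    """Run task(), check the result, and retry on exceptions or invalid results.

    Returns (result, attempts_used). Raises UnrecoverableError if no attempt succeeds.
    """
    last_problem = "no attempts made"
    for attempt in range(1, max_attempts + 1):
        try:
            result = task()
        except Exception as exc:          # error detected via exception
            last_problem = f"exception: {exc!r}"
            continue
        if is_valid(result):              # error detection via a result check
            return result, attempt
        last_problem = f"invalid result: {result!r}"
    raise UnrecoverableError(f"gave up after {max_attempts} attempts ({last_problem})")


def make_heisenbug_task(failures_before_success):
    """Build a task that fails a fixed number of times, then returns 42."""
    calls = {"n": 0}

    def task():
        calls["n"] += 1
        if calls["n"] <= failures_before_success:
            raise RuntimeError("transient glitch")
        return 42

    return task


def bohrbug_task():
    """A deterministic bug: the same wrong answer on every run."""
    return -1


print(run_with_recovery(make_heisenbug_task(2), lambda r: r == 42))
try:
    run_with_recovery(bohrbug_task, lambda r: r == 42)
except UnrecoverableError as err:
    print("escalate:", err)
```

Retry only helps when the error is actually detected, which is why the slide lists error detection first; the validity check here is trivial, whereas in a real system it is usually the hard part.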
22
Fault-Tolerant Computing
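
One classical fault-tolerance scheme is triple modular redundancy (TMR): run three replicas and take a majority vote. The figure for this slide did not survive extraction, so this is offered only as an illustrative example of the classical approach, not as what the slide showed. The replica functions and names below are hypothetical.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of replicas, else raise."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError(f"no majority among outputs: {outputs!r}")
    return value


def run_redundant(replicas, x):
    """Run every replica on the same input and vote on the results."""
    return majority_vote([replica(x) for replica in replicas])


def good(x):
    return x * x


def faulty(x):
    return x * x + 1      # a replica with a fault


def other_fault(x):
    return 0              # a different, independent fault


# A single faulty replica is masked by the other two.
print(run_redundant([good, faulty, good], 7))

# Two replicas with the same (correlated) fault outvote the correct one.
print(run_redundant([faulty, good, faulty], 7))
```

The sketch shows both costs raised on the previous slide: three copies of the hardware for one result, and no protection when two replicas fail the same way, since the voter then confidently returns the wrong answer.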

23
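High Availability Building Blocks
[Figure: building blocks grouped under Fault Tolerance and Fault Avoidance; the grouping of individual blocks is not recoverable from this extraction.]
Labels in the figure: spare/degrade, concurrent repair, recover, failure masking, detect & isolate, data integrity, design verification, system integration, SW, reliability, integration, system design, technology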

24
Related EE392U Seminars
Larry Votta, Distinguished Engineer, SUN – Oct. 3
System effects & error protection – Oct. 10
Prof. Ravi Iyer, University of Illinois at Urbana-Champaign
“Fault Tolerance in Space Environments” – Oct. 17
Dr. Philip Shirvani, nVidia
Trusted systems: Prof. Hector Garcia Molina – Oct. 24
“Why Do Systems Fail ?” – Oct. 31
Lisa Spainhower, Distinguished Engineer, IBM
“Estimating the Risk of Releasing Software,” – Nov. 7
Brendan Murphy, Microsoft Research

25
Robust Systems as Research Area – CRA Recommendations
Trouble-free systems
PCs – “zero administration”
Large-scale systems
Millions of users
Administered by single person
Self-diagnosing, self-healing, self-evolving
Dependable and survivable systems
Secure, safe, reliable, available

26
Importance in Revolutionary Nanotechnology
Revolutionary nanotechnologies
e.g., Molecular electronics
Well-acknowledged fact
Regular structures
Defect prone
Errors during normal operation
~ 5-10% faulty
Must be self-healing !
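
A short calculation shows why a 5-10% fault rate rules out designs that need every device to work. The sketch below assumes faults are independent and that "~5-10% faulty" refers to individual devices; both are assumptions, as are the device counts. It compares the chance that all of n devices are good with the chance that at least k of n are good when spares are available.

```python
from math import comb


def p_all_good(n: int, p_fault: float) -> float:
    """Probability that all n devices are fault-free (independent faults)."""
    return (1.0 - p_fault) ** n


def p_at_least_k_good(n: int, k: int, p_fault: float) -> float:
    """Probability that at least k of n devices are fault-free (binomial model)."""
    p_good = 1.0 - p_fault
    return sum(
        comb(n, i) * p_good**i * p_fault ** (n - i)
        for i in range(k, n + 1)
    )


# No redundancy: 100 devices, all must work, 5% faulty.
print(f"all 100 good:        {p_all_good(100, 0.05):.4f}")

# With 20 spares and reconfiguration: any 100 good devices out of 120 suffice.
print(f"100 good out of 120: {p_at_least_k_good(120, 100, 0.05):.4f}")
```

Even a 100-device block almost never works without redundancy at these fault rates, while a modest number of spares recovers it, provided the system can find and route around the faulty devices by itself.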

27
Broad Research Directions: Looking for Interested Students
Understand failures for various applications
PCs, large servers, embedded (e.g., cars, digital home), space systems, healthcare
Expertise required
Experimental data collection
Simulation & modeling
Circuits, architecture, systems, HCI

28
Broad Research Directions: Looking for Interested Students
New robust system design techniques
Failure avoidance & resilience support
Expertise required
Circuit / logic design
Architecture
Compiler & operating systems
Human-Computer interaction

29
Broad Research Directions: Looking for Interested Students
Robust system prototypes
PROVE that the system works !
How ?
Not just simulation
Build real system prototypes – How ?
Nanotechnology architectures
Built-in defect & fault-tolerance ?
Conventional methods work for very low failure rates
30
