Recovery

Uploaded by

Shailendra Shende

Recovery

Computer system recovery:

Restore the system to a normal operational state

Process recovery:

Reclaim resources allocated to the process,

Undo modifications made to databases, and
Restart the process
Or restart the process from the point of failure and resume execution

Distributed system:
Provides
Enhanced performance, through concurrent execution of many processes
Increased availability

Distributed process recovery (cooperating processes):

Undo effect of interactions of failed process with other cooperating processes.
Every failed process would have to restart from an appropriate state.

Replication (hardware components, processes, data):

Main method for increasing system availability

System:
Set of hardware and software components
Designed to provide a specified service (i.e., meet a set of requirements)

System failure:
System does not meet its requirements, i.e., does not perform its services as specified

Error could lead to system failure

Erroneous System State:
State which could lead to a system failure by a sequence of valid state transitions
Error: the part of the system state which differs from its intended value

Error is a manifestation of a fault

Fault:
Anomalous physical condition, e.g., design errors, manufacturing problems, damage, external disturbances

Classification of failures
Process failure:
Behavior: process causes the system state to deviate from specification (e.g., incorrect computation, process stops executing)
Errors causing process failure: protection violation, deadlock, timeout, wrong user input, etc.
Recovery: abort the process, or restart it from a prior state

System failure:

Behavior: processor fails to execute

Caused by software errors or hardware faults (CPU, memory, or bus failure)
Recovery: system stopped and restarted in correct state
Assumption: fail-stop processors, i.e. system stops execution, internal state is lost

Secondary Storage Failure:

Behavior: stored data cannot be accessed
Errors causing failure: parity error, head crash, etc.
Recovery/Design strategies:
Reconstruct content from archive + log of activities
Design mirrored disk system

Communication Medium Failure:

Behavior: a site cannot communicate with another operational site
Errors/Faults: failure of switching nodes or communication links
Recovery/Design Strategies: reroute, error-resistant communication protocols

Backward and Forward Error Recovery

Failure recovery: restore an erroneous state to an error-free state
Approaches to failure recovery:
Forward-error recovery:
Remove errors in process/system state (if errors can be completely assessed)
Continue process/system forward execution

Backward-error recovery:
Restore process/system to previous error-free state and restart from there

Comparison: Forward vs. Backward error recovery

Backward-error recovery
(+) Simple to implement
(+) Can be used as general recovery mechanism
(-) Performance penalty
(-) No guarantee that fault does not occur again
(-) Some components cannot be recovered

Forward-error Recovery
(+) Less overhead
(-) Limited use, i.e. only when impact of faults understood
(-) Cannot be used as general mechanism for error recovery

Backward-Error Recovery: Basic approach

Principle: restore process/system to a known, error-free recovery point (checkpoint).

System model:
[Figure: CPU and main memory (MM), connected to secondary storage and stable storage]
Secondary storage: objects are brought into MM to be accessed and written back if modified
Stable storage: storage that maintains information in the event of system failure; stores logs and recovery points

Approaches:
(1) Operation-based approach
(2) State-based approach

(1) The Operation-based Approach

Principle:
Record all changes made to the state of a process (audit trail or log) such that the process can be returned to a previous state
Example: a transaction-based environment where transactions update a database
It is possible to commit or undo updates on a per-transaction basis
A commit indicates that the transaction on the object was successful and the changes are permanent
(1.a) Updating-in-place
Principle: every update (write) operation on an object creates a log record in stable storage that can be used to undo and redo the operation
Log content: object name, old object state, new object state
Implementation of a recoverable update operation:
Do operation: update object and write log record
Undo operation: log(old) -> object (undoes the action performed by a do)
Redo operation: log(new) -> object (redoes the action performed by a do)
Display operation: display log record (optional)
Problem: a do cannot be recovered if the system crashes after the object is written but before the log record is written
(1.b) The write-ahead log protocol
Principle: write the log record to stable storage before updating the object
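
The do/undo/redo operations and the write-ahead ordering can be sketched as follows. This is a minimal illustration, not a definitive implementation: the class name RecoverableStore, the in-memory list standing in for stable storage, the crash_before_update flag, and the use of None to mean "object did not exist" are all assumptions made for the example.

```python
class RecoverableStore:
    """Toy object store. `log` stands in for stable storage (an assumption)."""

    def __init__(self):
        self.objects = {}  # object name -> current state
        self.log = []      # log records: (object name, old state, new state)

    def do(self, name, new_state, crash_before_update=False):
        record = (name, self.objects.get(name), new_state)
        self.log.append(record)          # 1. write the log record first
        if crash_before_update:
            return record                # simulated crash: object not updated yet
        self.objects[name] = new_state   # 2. only then update the object
        return record

    def undo(self, record):
        # log(old) -> object
        name, old_state, _ = record
        if old_state is None:            # None means the object did not exist before
            self.objects.pop(name, None)
        else:
            self.objects[name] = old_state

    def redo(self, record):
        # log(new) -> object
        name, _, new_state = record
        self.objects[name] = new_state


store = RecoverableStore()
store.do("balance", 100)
pending = store.do("balance", 150, crash_before_update=True)
# The object still holds 100, but the log record survived the "crash",
# so recovery can either complete the update or cancel it.
store.redo(pending)   # object now holds the new state, 150
store.undo(pending)   # object is back to the old state, 100
```

Because the log record is written before the object, a crash between the two steps always leaves enough information to redo or undo the update, which is the case that updating-in-place cannot recover from.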

(2) State-based Approach

Principle: establish frequent recovery points (checkpoints) that save the entire state of the process
Actions:
Checkpointing (taking a checkpoint): saving the process state
Rolling back a process: restoring the process to a prior state

Note: a process should be rolled back to the most recent recovery point to minimize overhead and delays in the completion of the process
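
Checkpointing and rollback can be sketched as below. The class name CheckpointedProcess, the dictionary used as process state, and the in-memory list standing in for stable storage are assumptions for illustration only.

```python
import copy


class CheckpointedProcess:
    """Toy process whose whole state is saved at each recovery point."""

    def __init__(self, state):
        self.state = state
        self.checkpoints = []  # stands in for stable storage (an assumption)

    def checkpoint(self):
        # Save a deep copy so later changes do not alter the recovery point.
        self.checkpoints.append(copy.deepcopy(self.state))

    def rollback(self):
        # Restore the most recent recovery point.
        if not self.checkpoints:
            raise RuntimeError("no recovery point available")
        self.state = copy.deepcopy(self.checkpoints[-1])


proc = CheckpointedProcess({"counter": 0})
proc.checkpoint()
proc.state["counter"] = 5
proc.checkpoint()
proc.state["counter"] = 9   # work done after the last checkpoint
proc.rollback()             # failure: only the work since the last checkpoint is lost
```

Rolling back to the most recent recovery point loses only the update from 5 to 9; rolling back to the first one would also discard the work that led to 5.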
Shadow pages: special case of the state-based approach
Only a part of the system state is saved, to minimize recovery overhead
When an object is modified, the page containing the object is first copied to stable storage (shadow page)
If the process commits successfully: the shadow page is discarded and the modified page is made part of the database
If the process fails: the shadow page is used and the modified page is discarded
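
The shadow-page rules above can be sketched as follows. The class name ShadowPagedDatabase, the dictionaries standing in for the database and for stable storage, and the restriction to pages that already exist are assumptions of this example.

```python
class ShadowPagedDatabase:
    """Toy database of pages; `shadow` stands in for stable storage (an assumption)."""

    def __init__(self, pages):
        self.pages = dict(pages)  # page id -> contents
        self.shadow = {}          # page id -> copy saved before the first modification

    def write(self, page_id, contents):
        # Copy the page to the shadow area before its first modification.
        if page_id not in self.shadow:
            self.shadow[page_id] = self.pages[page_id]
        self.pages[page_id] = contents

    def commit(self):
        # Success: discard shadow pages; modified pages stay in the database.
        self.shadow.clear()

    def abort(self):
        # Failure: use the shadow pages and discard the modified pages.
        self.pages.update(self.shadow)
        self.shadow.clear()


db = ShadowPagedDatabase({1: "alpha", 2: "beta"})
db.write(1, "ALPHA")
db.abort()             # page 1 is restored from its shadow copy
db.write(2, "BETA")
db.commit()            # page 2 keeps its modified contents
```

Only the pages that are actually modified get copied, which is what distinguishes this from saving the entire process state at each checkpoint.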

Recovery in concurrent systems

Issue: if one of a set of cooperating processes fails and has to be rolled back to a recovery point, all processes it communicated with since the recovery point have to be rolled back.
Conclusion: in concurrent and/or distributed systems, all cooperating processes have to establish recovery points

Orphan messages and the domino effect

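[Figure: timelines of processes X, Y, and Z against time, with recovery points x1, x2, x3 on X, y1, y2 on Y, and z1, z2 on Z, and message m sent from Y to X]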

Case 1: failure of X after x3: no impact on Y or Z

Case 2: failure of Y after sending message m
Y is rolled back to y2
m becomes an orphan message (X has recorded receiving it, but Y has no record of sending it)
X is rolled back to x2

Case 3: failure of Z after z2

Y has to roll back to y1
X has to roll back to x1
Z has to roll back to z1

Domino effect: a single failure triggers a cascade of rollbacks
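
The rollback propagation behind the three cases can be sketched as a fixed-point computation: whenever a rollback undoes the sending of a message whose receipt is still recorded, the receiver must roll back too. The timeline below is hypothetical; the numeric times and the messages other than m are assumptions chosen so that the outcome matches the cases above, and the names CHECKPOINTS, MESSAGES, latest_checkpoint, and rollback_line are made up for the example.

```python
import math

# Hypothetical checkpoint times: x1, x2, x3 / y1, y2 / z1, z2.
CHECKPOINTS = {"X": [10, 50, 90], "Y": [20, 70], "Z": [10, 60]}

# Hypothetical messages: (sender, send time, receiver, receive time).
MESSAGES = [
    ("X", 20, "Z", 30),
    ("Y", 30, "X", 40),
    ("Z", 62, "Y", 65),
    ("Y", 75, "X", 80),  # plays the role of message m
]


def latest_checkpoint(checkpoints, before):
    """Most recent checkpoint strictly before `before` (0 = initial state)."""
    return max((c for c in checkpoints if c < before), default=0)


def rollback_line(checkpoints, messages, failed, fail_time):
    """Return the time each process is restored to (math.inf = not rolled back)."""
    restored = {p: math.inf for p in checkpoints}
    restored[failed] = latest_checkpoint(checkpoints[failed], fail_time)
    changed = True
    while changed:
        changed = False
        for sender, sent, receiver, received in messages:
            send_undone = sent > restored[sender]
            receive_kept = received <= restored[receiver]
            if send_undone and receive_kept:
                # Orphan message: the receiver must undo the receipt as well.
                restored[receiver] = latest_checkpoint(checkpoints[receiver], received)
                changed = True
    return restored


print(rollback_line(CHECKPOINTS, MESSAGES, "X", 100))  # only X rolls back, to x3
print(rollback_line(CHECKPOINTS, MESSAGES, "Y", 100))  # Y to y2, X to x2
print(rollback_line(CHECKPOINTS, MESSAGES, "Z", 100))  # all three back to x1, y1, z1
```

Each pass can only move a process to an earlier checkpoint, so the loop terminates; in the worst case every process ends up at its initial state.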

Lost messages

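[Figure: timelines of processes X and Y against time, with recovery points x1 and y1; message m is sent from X to Y, and Y fails after receiving it]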

Assume that x1 and y1 are the only recovery points for processes X and Y, respectively
Assume Y fails after receiving message m
Y rolled back to y1, X rolled back to x1
Message m is lost

Note: this case is indistinguishable from the case where message m is lost in the communication channel and processes X and Y are in states x1 and y1, respectively

Problem of livelock

Livelock: case where a single failure can cause an infinite number of rollbacks

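[Figure (a): timelines of X and Y with recovery points x1 and y1; messages m1 and n1 are exchanged, and Y fails before receiving n1]
[Figure (b): after the first rollback, messages m2 and n2 are exchanged and the delayed n1 arrives, leading to the 2nd rollback]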

Process Y fails before receiving message n1 sent by X

Y is rolled back to y1; it now has no record of sending message m1, which causes X to roll back to x1
When Y restarts, it sends out m2 and receives n1 (delayed)

When X restarts from x1, it sends out n2 and receives m2

Y has to roll back again, since there is no record of n1 being sent (X was rolled back to x1)

This causes X to be rolled back again, since it has received m2 and Y has no record of sending m2

The above sequence can repeat indefinitely

Consistent set of checkpoints

Checkpointing in distributed systems requires that all processes (sites) that interact with one another establish periodic checkpoints

All the sites save their local states: local checkpoints

All the local checkpoints, one from each site, collectively form a global checkpoint

The domino effect is caused by orphan messages, which in turn are caused by rollbacks

Strongly consistent set of checkpoints

Establish a set of local checkpoints (one for each process in the set) such that no information flow takes place (i.e., no orphan messages) during the interval spanned by the checkpoints

Consistent set of checkpoints

Similar to the consistent global state

Each message that is recorded as received in a checkpoint (state) should also be recorded as sent in another checkpoint (state)
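
A check for these two properties might look like the sketch below. It assumes a global checkpoint is given as one checkpoint time per process and that send and receive times are known; the function names and the sample message are hypothetical. "No information flow" is read here as no message crossing the checkpoint line in either direction, which is an interpretation: a message recorded as sent but not as received is the kind that becomes a lost message after rollback.

```python
def classify_messages(cut, messages):
    """cut: process -> checkpoint time; messages: (sender, sent, receiver, received)."""
    orphans, in_transit = [], []
    for msg in messages:
        sender, sent, receiver, received = msg
        send_recorded = sent <= cut[sender]
        receive_recorded = received <= cut[receiver]
        if receive_recorded and not send_recorded:
            orphans.append(msg)      # received in a checkpoint, never recorded as sent
        elif send_recorded and not receive_recorded:
            in_transit.append(msg)   # sent before the checkpoint, received after it
    return orphans, in_transit


def is_consistent(cut, messages):
    orphans, _ = classify_messages(cut, messages)
    return not orphans


def is_strongly_consistent(cut, messages):
    orphans, in_transit = classify_messages(cut, messages)
    return not orphans and not in_transit


messages = [("X", 3, "Y", 5)]                       # hypothetical message from X to Y
print(is_consistent({"X": 2, "Y": 6}, messages))    # False: the message is an orphan
print(is_consistent({"X": 4, "Y": 4}, messages))    # True
print(is_strongly_consistent({"X": 4, "Y": 4}, messages))  # False: message crosses the line
print(is_strongly_consistent({"X": 4, "Y": 6}, messages))  # True
```

Restoring a consistent set of checkpoints never creates orphan messages, so it cannot start a domino effect; a strongly consistent set additionally leaves no message to be lost.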