Checkpointing and Rollback Recovery Techniques

The document discusses recovery and consensus mechanisms in distributed systems, focusing on checkpointing and rollback recovery techniques. It details various methods such as uncoordinated, coordinated, and communication-induced checkpointing, along with log-based rollback recovery strategies including pessimistic, optimistic, and causal logging. Additionally, it explains the Koo-Toueg coordinated checkpointing algorithm and the Juang-Venkatesan algorithm for asynchronous checkpointing and recovery.

Uploaded by

vedaraj

UNIT IV

RECOVERY & CONSENSUS

Checkpointing and rollback recovery: Introduction – Background and definitions – Issues in failure recovery –
Checkpoint-based recovery – Log-based rollback recovery – Coordinated checkpointing algorithm – Algorithm
for asynchronous checkpointing and recovery. Consensus and agreement algorithms: Problem definition –
Overview of results – Agreement in a failure-free system – Agreement in synchronous systems with failures.

CHECKPOINTING AND ROLLBACK RECOVERY

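In rollback recovery, each process periodically saves its state on stable storage during failure-free execution. The saved state is called a checkpoint, and the procedure of restarting from a previously checkpointed state is called rollback recovery.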

Explain Checkpoint-based recovery in detail.

Checkpoint-based recovery

In checkpoint-based recovery, the state of each process and of the communication channels is checkpointed frequently so that, when a failure occurs, the system can be restored to a globally consistent set of checkpoints.

The three types of checkpoint-based rollback-recovery techniques are:

1. Uncoordinated checkpointing
2. Coordinated checkpointing
3. Communication-induced checkpointing

Uncoordinated Checkpointing

• Each process decides autonomously when to take its checkpoints.
• This eliminates synchronization overhead, since processes need not coordinate, and lets each process take checkpoints when it is most convenient or efficient.

Advantages

• Lower runtime overhead during normal execution.

Limitations

• Recovery may suffer from the domino effect, in which rollbacks cascade and can force processes back to their initial states.
• Recovery from a failure is slow, because processes must iterate to find a consistent set of checkpoints.
• Each process must maintain multiple checkpoints and periodically invoke a garbage collection algorithm to reclaim those that are no longer needed.
• It is not suitable for applications with frequent output commits, since each commit requires global coordination to compute the recovery line.
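The domino effect can be illustrated with a small rollback-propagation sketch. It is a minimal model and not part of the original notes: checkpoints are numbered from 0 (the initial state), interval k is the execution that follows checkpoint k, and each message is a hypothetical tuple (sender, send interval, receiver, receive interval). The function name `recovery_line` and the two-process message pattern are assumptions chosen for illustration.

```python
def recovery_line(latest, messages):
    """Roll processes back until no orphan message remains.

    latest:   dict process -> index of its latest checkpoint
    messages: list of (sender, send_interval, receiver, recv_interval)
    Returns a dict process -> checkpoint index to restart from.
    """
    line = dict(latest)
    changed = True
    while changed:
        changed = False
        for sender, send_iv, receiver, recv_iv in messages:
            # Restarting from checkpoint c undoes every interval >= c.
            send_undone = send_iv >= line[sender]
            recv_kept = recv_iv < line[receiver]
            if send_undone and recv_kept:
                # Orphan message: the receiver must undo the receipt too.
                line[receiver] = recv_iv
                changed = True
    return line


# Hypothetical zigzag pattern between two processes P and Q.
zigzag = [
    ("P", 2, "Q", 1),
    ("Q", 1, "P", 1),
    ("P", 1, "Q", 0),
    ("Q", 0, "P", 0),
]
result = recovery_line({"P": 2, "Q": 2}, zigzag)
print(result)
```

With this zigzag pattern every checkpoint is invalidated in turn and both processes return to checkpoint 0, their initial state. With no messages at all, both processes simply restart from their latest checkpoints.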
Coordinated checkpointing

In coordinated checkpointing, processes coordinate their checkpointing activities so that all local checkpoints form a consistent global state. Each process needs to maintain only one checkpoint on stable storage, which reduces the storage overhead and eliminates the need for garbage collection.

There are two types of coordinated checkpointing:

• Blocking checkpointing
• Non-blocking checkpointing

Blocking Checkpointing

• After a process takes a local checkpoint, it remains blocked until the entire checkpointing activity is complete, which prevents orphan messages.
• The coordinator takes a checkpoint and broadcasts a request message to all processes, asking them to take a checkpoint.
• When a process receives this message, it stops its execution and takes a tentative checkpoint.
• It then sends an acknowledgment message back to the coordinator.
• After the coordinator receives acknowledgments from all processes, it broadcasts a commit message to all processes.
• On receiving the commit message, each process atomically makes its tentative checkpoint permanent and then resumes its execution.

Non-blocking Checkpointing

The processes need not stop their execution while taking checkpoints. The difficulty is to keep a process from receiving application messages that would make its checkpoint inconsistent.

Example:

Message m is sent by P0 after it receives a checkpoint request from the checkpoint coordinator. Assume m reaches P1 before the checkpoint request does. This results in an inconsistent checkpoint, since checkpoint c1,x records the receipt of message m from P0, while checkpoint c0,x does not record m being sent from P0. To avoid such inconsistent checkpoints, a non-blocking checkpoint coordination protocol can use the snapshot algorithm of Chandy and Lamport, in which markers play the role of the checkpoint request messages. On FIFO channels the marker sent by P0 reaches P1 before m, so P1 takes its checkpoint before receiving m.
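The example above can be replayed in a few lines. This is a simplified sketch under stated assumptions, not the full Chandy and Lamport algorithm: the string "CKPT" stands for either the coordinator's request or a marker, P1's incoming traffic is modeled as a single ordered list, and the helper names `run_p1` and `is_consistent` are hypothetical.

```python
def run_p1(incoming):
    """Process P1's incoming items in arrival order and return the set of
    message ids that P1's checkpoint records as received."""
    delivered = set()
    for item in incoming:
        if item == "CKPT":            # checkpoint request or marker
            return set(delivered)     # checkpoint records receipts so far
        delivered.add(item)
    return set(delivered)


def is_consistent(sent_in_ckpt, received_in_ckpt):
    """A pair of checkpoints is consistent only if every recorded receipt
    also has a recorded send."""
    return received_in_ckpt <= sent_in_ckpt


# P0 takes its checkpoint before sending m, so c0,x records no send of m.
p0_sent_in_ckpt = set()

# Without markers: m overtakes the coordinator's request on its way to P1.
without_marker = run_p1(["m", "CKPT"])

# With markers on a FIFO channel: P0 sends the marker ahead of m.
with_marker = run_p1(["CKPT", "m"])

print(is_consistent(p0_sent_in_ckpt, without_marker))
print(is_consistent(p0_sent_in_ckpt, with_marker))
```

The first check fails because c1,x records m while c0,x does not; the second succeeds because the marker forces P1 to checkpoint before m is delivered.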
Communication-induced checkpointing

• Communication-induced checkpointing avoids the domino effect while allowing processes to take some of their checkpoints independently.
• Processes take two types of checkpoints.
• Checkpoints that a process takes independently are called local checkpoints.
• Checkpoints that a process is forced to take are called forced checkpoints.

Two types of communication-induced checkpointing:

1. Model-based checkpointing

• This prevents patterns of communications and checkpoints that could result in inconsistent states among the existing checkpoints.
• One such model can be maintained by taking an additional checkpoint before every message-receiving event that is not separated from its previous message-sending event by a checkpoint.
• Another method is to take a checkpoint immediately after every message-sending event.

2. Index-based checkpointing

This assigns monotonically increasing indexes to checkpoints, such that the checkpoints having the same index at different processes form a consistent state. The indexes are piggybacked on application messages so that a receiver can decide when it must take a forced checkpoint.
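One simple way to realize index-based checkpointing is sketched below. The rule used here is an assumption made for illustration and is not spelled out in the notes: every message carries the sender's current checkpoint index, and a receiver whose own index is smaller takes a forced checkpoint with the piggybacked index before delivering the message. The class and method names are hypothetical.

```python
class IndexedProcess:
    """Toy process that tracks checkpoint indexes only."""

    def __init__(self, name):
        self.name = name
        self.index = 0
        self.checkpoints = [(0, "initial")]   # (index, kind)

    def local_checkpoint(self):
        # Independent checkpoint: advance the index by one.
        self.index += 1
        self.checkpoints.append((self.index, "local"))

    def send(self):
        # The current index is piggybacked on every outgoing message.
        return self.index

    def receive(self, piggybacked_index):
        # Forced checkpoint before delivery if the sender is "ahead".
        if piggybacked_index > self.index:
            self.index = piggybacked_index
            self.checkpoints.append((self.index, "forced"))
        # ... the application message would be delivered here ...


p0, p1 = IndexedProcess("P0"), IndexedProcess("P1")
p0.local_checkpoint()          # P0 now has index 1
p1.receive(p0.send())          # P1 is forced to checkpoint with index 1
p1.receive(p0.send())          # same index again: no new checkpoint
print(p1.checkpoints)
```

Because P1 checkpoints before delivering the message, its index-1 checkpoint does not record a receipt that P0's index-1 checkpoint has not recorded as sent.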

Explain Log-based rollback recovery in detail.

Log-based rollback recovery

Log-based rollback recovery models the execution of a process as a sequence of deterministic state intervals, each starting with the execution of a non-deterministic event.

Deterministic and non-deterministic events

• A non-deterministic event can be the receipt of a message from another process or an event internal to the process.
• A message send event is not a non-deterministic event, because it is determined by the state in which it is executed.
• The information needed to replay a non-deterministic event is called its determinant. Determinants are logged so that the event can be re-executed identically during recovery.

For example, in the figure, the execution of process P0 is a sequence of four deterministic intervals. The first one starts with the creation of the process, while the remaining three start with the receipt of messages m0, m3, and m7, respectively. The send event of message m2 is uniquely determined by the initial state of P0 and by the receipt of message m0, and is therefore not a non-deterministic event.

Pessimistic logging

• Pessimistic logging protocols assume that a failure can occur after any non-deterministic event in the computation.
• Pessimistic protocols implement synchronous logging: the determinant of each non-deterministic event is logged to stable storage before the event is allowed to affect the computation.
• Processes also take periodic checkpoints to minimize the amount of work that has to be repeated during recovery.
• When a process fails, it is restarted from its most recent checkpoint and the logged determinants are used to recreate the pre-failure execution.

Consider the example in the figure. During failure-free operation, the logs of processes P0, P1, and P2 contain the determinants needed to replay messages {m0, m4, m7}, {m1, m3, m6}, and {m2, m5}, respectively. Suppose processes P1 and P2 fail as shown. They restart from checkpoints B and C and roll forward, using their determinant logs to deliver the same sequence of messages again. This guarantees that P1 and P2 repeat exactly their pre-failure execution and re-send the same messages.
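The restart-and-replay behaviour described above can be sketched as follows. This is a minimal single-process model under stated assumptions: messages are integers, the state update is an arbitrary order-sensitive function, and Python lists stand in for stable storage. The class name and its methods are hypothetical.

```python
class PessimisticProcess:
    """Logs each received message (its determinant) before applying it."""

    def __init__(self):
        self.state = 0
        self.delivered = 0                 # messages applied so far
        self.stable_log = []               # stands in for stable storage
        self.stable_checkpoint = (0, 0)    # (state, delivered)

    def deliver(self, msg):
        self.stable_log.append(msg)        # synchronous logging comes first
        self._apply(msg)

    def _apply(self, msg):
        # Deterministic and order-sensitive state transition.
        self.state = self.state * 31 + msg
        self.delivered += 1

    def checkpoint(self):
        self.stable_checkpoint = (self.state, self.delivered)

    def crash_and_recover(self):
        # Volatile state is lost; restart from the latest checkpoint ...
        self.state, self.delivered = self.stable_checkpoint
        # ... and replay the logged messages in their original order.
        for msg in self.stable_log[self.delivered:]:
            self._apply(msg)


p = PessimisticProcess()
p.deliver(3)
p.checkpoint()
p.deliver(5)
p.deliver(7)
state_before_failure = p.state
p.crash_and_recover()
print(p.state == state_before_failure)
```

Because every determinant reaches the log before its message is applied, replay always reconstructs the pre-failure state, at the cost of one stable-storage write per received message.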

Optimistic logging

• In these protocols, processes log determinants asynchronously to stable storage.
• Optimistic logging protocols assume that logging will be complete before a failure occurs.
• If a process fails before its determinants are logged, those determinants are lost, and any process whose state depends on the lost events becomes an orphan and must roll back.
• Pessimistic protocols need only keep the most recent checkpoint of each process, whereas optimistic protocols may need to keep multiple checkpoints for each process.
• The costs of optimistic logging are more complicated recovery, garbage collection, and slower output commit.

Consider the example shown in the figure. Suppose process P2 fails before the determinant of m5 is logged to stable storage. Process P1 then becomes an orphan process and must roll back to undo the effects of receiving the orphan message m6. The rollback of P1 further forces P0 to roll back to undo the effects of receiving message m7.
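The cascade in this example can be traced with a small dependency sketch. The dependency table below is an assumption read off the prose (m6 was sent after the receipt of m5, and m7 after the receipt of m6); the function name and data layout are hypothetical and do not represent a real protocol implementation.

```python
def find_orphans(lost, depends_on, receiver):
    """Return (orphan messages, orphan processes).

    lost:       set of messages whose determinants were never logged
    depends_on: message -> set of messages received before it was sent
    receiver:   message -> process that received it
    """
    orphan_msgs = set()
    changed = True
    while changed:
        changed = False
        for msg, deps in depends_on.items():
            # A message is an orphan if its send depended on a lost
            # receipt or on another orphan message.
            if msg not in orphan_msgs and deps & (lost | orphan_msgs):
                orphan_msgs.add(msg)
                changed = True
    return orphan_msgs, {receiver[m] for m in orphan_msgs}


orphan_msgs, orphan_procs = find_orphans(
    lost={"m5"},
    depends_on={"m6": {"m5"}, "m7": {"m6"}},
    receiver={"m6": "P1", "m7": "P0"},
)
print(sorted(orphan_msgs), sorted(orphan_procs))
```

Losing the single determinant of m5 makes both m6 and m7 orphans, so P1 and P0 must roll back even though neither of them failed.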

Causal logging

• This combines the advantages of both pessimistic and optimistic logging.
• Like optimistic logging, it does not require synchronous access to stable storage.
• Like pessimistic logging, it allows each process to commit output independently and never creates orphans.
• To achieve this, determinants that are not yet on stable storage are piggybacked on application messages, so that processes that causally depend on an event also hold its determinant.

Consider the example in the figure. Messages m5 and m6 are likely to be lost when P1 and P2 fail. Process P0, however, will be able to “guide” the recovery of P1 and P2, since it knows the order in which P1 should replay messages. Similarly, P0 knows the order in which P2 should replay message m2. The content of these messages is obtained from the sender log of P0.

Explain Koo–Toueg coordinated checkpointing algorithm in detail. (or)

Explain Coordinated checkpointing algorithm in detail.

• The Koo–Toueg algorithm is a coordinated checkpointing and recovery technique that takes a consistent set of checkpoints and avoids the domino effect and livelock problems during recovery.
• The algorithm has two parts:
1. Checkpointing algorithm
2. Recovery algorithm

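Checkpointing algorithm

The checkpointing algorithm makes the following assumptions:

• Processes communicate over FIFO channels.
• End-to-end protocols cope with message loss due to rollback recovery and communication failures.
• Only a single process invokes the algorithm at a time (single process initiation).
• No process fails during the execution of the algorithm.

The algorithm uses two kinds of checkpoints:

• Permanent checkpoint: a local checkpoint at a process that is part of a consistent global checkpoint.
• Tentative checkpoint: a temporary checkpoint that is made permanent on successful termination of the checkpointing algorithm.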

The algorithm is implemented in two phases:

Phase I:

• The initiating process Pi takes a tentative checkpoint and requests all other processes to take tentative checkpoints.
• Each process informs Pi whether it succeeded. A process says no to a request if it fails to take a tentative checkpoint.
• If Pi learns that all the processes have successfully taken tentative checkpoints, Pi decides that all tentative checkpoints should be made permanent; otherwise, Pi decides that all the tentative checkpoints should be discarded.
• After taking a tentative checkpoint, a process cannot send messages related to the underlying computation until it learns Pi's decision.

Phase II:

• Pi propagates its decision to all processes.
• On receiving the message from Pi, all processes act accordingly, so all processes reach the same decision: make the tentative checkpoints permanent or discard them.

Correctness of the algorithm

• Either all or none of the processes take a permanent checkpoint.
• No process sends a message after taking a tentative checkpoint until it receives Pi's decision, so the set of checkpoints never records a message as received without also recording it as sent.

Rollback recovery algorithm

This algorithm restores the system to a consistent state after a failure.

Phase I:

• The initiating process Pi sends a message to all other processes to check whether they are willing to restart from their previous checkpoints.
• A process may reply no to a restart request for any reason, for example because it is already participating in a checkpointing or recovery activity initiated by another process.
• If Pi learns that all processes are willing to restart from their previous checkpoints, Pi decides that all processes should roll back to their previous checkpoints.
• Otherwise, Pi aborts the rollback attempt and may attempt recovery at a later time.

Phase II:

• Pi propagates its decision to all processes.
• On receiving the message from Pi, all processes act accordingly.

Correctness

All processes restart from an appropriate state because, if they decide to restart, they resume execution from a consistent state: the checkpointing algorithm takes a consistent set of checkpoints.
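Both the checkpointing and the rollback recovery algorithms above follow the same two-phase pattern: collect replies, then propagate one decision. The sketch below simulates that pattern in a single Python program. It is an illustration under stated assumptions rather than the Koo–Toueg algorithm in full: message passing is replaced by direct method calls, every process participates in every run, and the names `Participant`, `run_checkpoint` and `run_rollback` are hypothetical.

```python
class Participant:
    def __init__(self, name, can_checkpoint=True, willing_to_restart=True):
        self.name = name
        self.can_checkpoint = can_checkpoint
        self.willing_to_restart = willing_to_restart
        self.state = 0
        self.tentative = None
        self.permanent = 0
        self.blocked = False      # True while application sends are held

    def take_tentative(self):
        if not self.can_checkpoint:
            return False          # replies "no" to the initiator
        self.tentative = self.state
        self.blocked = True       # no application sends until the decision
        return True

    def on_decision(self, commit):
        if commit and self.tentative is not None:
            self.permanent = self.tentative
        self.tentative = None     # discarded, or already made permanent
        self.blocked = False


def run_checkpoint(processes):
    votes = [p.take_tentative() for p in processes]   # Phase I
    decision = all(votes)
    for p in processes:                               # Phase II
        p.on_decision(decision)
    return decision


def run_rollback(processes):
    decision = all(p.willing_to_restart for p in processes)   # Phase I
    if decision:                                              # Phase II
        for p in processes:
            p.state = p.permanent
    return decision


pi, pj = Participant("Pi"), Participant("Pj")
pi.state, pj.state = 5, 7
first = run_checkpoint([pi, pj])          # all succeed: made permanent

pk = Participant("Pk", can_checkpoint=False)
pi.state = 9
second = run_checkpoint([pi, pj, pk])     # Pk says no: all discarded

rolled_back = run_rollback([pi, pj])      # both return to permanent state
print(first, second, rolled_back, pi.state, pj.state)
```

The second run shows the all-or-none property: a single negative reply leaves every process with its earlier permanent checkpoint.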

Explain Juang-Venkatesan algorithm for asynchronous checkpointing and recovery in detail.

Assumptions

• Communication channels are reliable.
• Messages are delivered in FIFO order.
• Channels have infinite buffers.
• Message transmission delay is arbitrary but finite.

Two types of log storage are maintained:

• Volatile log: takes a short time to access, but is lost if the processor crashes.
• Stable log: takes longer to access, but survives a processor crash.

Asynchronous checkpointing:

• After executing an event, a processor records a triplet (s, m, msgs_sent) in its volatile storage, where
s: state of the processor before the event
m: message whose arrival caused the event
msgs_sent: set of messages sent by the processor during the event.
• A local checkpoint consists of a set of such records, which are first stored in the volatile log and then moved to the stable log.

Recovery algorithm

Notations:

RCVD_{i←j}(CkPt_i): number of messages received by p_i from p_j, from the beginning of the computation to checkpoint CkPt_i

SENT_{i→j}(CkPt_i): number of messages sent by p_i to p_j, from the beginning of the computation to checkpoint CkPt_i

Idea:

• From the set of checkpoints, find a set of consistent checkpoints.
• This is done based on the number of messages sent and received.
• Recovery may involve multiple iterations of rollbacks by processors.
• Whenever a processor rolls back, all other processors must find out whether any message sent by the rolled-back processor has become an orphan message.
• Orphan messages are identified by comparing the number of messages sent to and received from neighboring processors. If RCVD_{i←j}(CkPt_i) > SENT_{j→i}(CkPt_j), then p_i has recorded the receipt of at least one message that p_j has not sent in its restored state, so p_i must roll back until the two numbers agree.
• When a processor restarts after a failure, it broadcasts a ROLLBACK message announcing that it has failed.
• Because the ROLLBACK message is broadcast, the recovery algorithm is initiated at all processors.
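The counting idea can be sketched with a centralized simulation. This is not the distributed Juang-Venkatesan protocol itself: ROLLBACK messages and rounds are replaced by a loop over all processor pairs, and each logged state simply carries cumulative SENT and RCVD counts. The helper names and the three-processor scenario are hypothetical, and state 0 of every processor is assumed to be the initial state with all counts equal to zero.

```python
def state(sent=None, rcvd=None):
    """One logged state with cumulative per-neighbor message counts."""
    return {"sent": sent or {}, "rcvd": rcvd or {}}


def jv_recover(logs, start):
    """logs[p]: list of logged states of processor p, oldest first.
    start[p]: index of the state p initially restarts from.
    Returns the index each processor finally rolls back to."""
    cur = dict(start)
    changed = True
    while changed:                      # repeat until no rollback occurs
        changed = False
        for i in logs:
            for j in logs:
                if i == j:
                    continue
                sent_ji = logs[j][cur[j]]["sent"].get(i, 0)
                # RCVD_{i<-j} > SENT_{j->i}: p_i holds orphan messages.
                while logs[i][cur[i]]["rcvd"].get(j, 0) > sent_ji:
                    cur[i] -= 1
                    changed = True
    return cur


logs = {
    "X": [state(), state(sent={"Y": 1}), state(sent={"Y": 2})],
    "Y": [state(), state(rcvd={"X": 1}),
          state(rcvd={"X": 2}, sent={"Z": 1})],
    "Z": [state(), state(rcvd={"Y": 1})],
}
# X failed and only its state 1 had reached the stable log;
# Y and Z start from their latest states.
final = jv_recover(logs, {"X": 1, "Y": 2, "Z": 1})
print(final)
```

Here the rollback of X makes Y's second receipt an orphan, and the resulting rollback of Y in turn makes Z's receipt an orphan, which illustrates why recovery may take several iterations.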
