0% found this document useful (0 votes)
5 views14 pages

Checkpointing and Rollback Recovery Techniques

The document discusses recovery and consensus mechanisms in distributed systems, focusing on checkpointing and rollback recovery techniques. It details various methods such as uncoordinated, coordinated, and communication-induced checkpointing, along with log-based rollback recovery strategies including pessimistic, optimistic, and causal logging. Additionally, it explains the Koo-Toueg coordinated checkpointing algorithm and the Juang-Venkatesan algorithm for asynchronous checkpointing and recovery.

Uploaded by

vedaraj
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
5 views14 pages

Checkpointing and Rollback Recovery Techniques

The document discusses recovery and consensus mechanisms in distributed systems, focusing on checkpointing and rollback recovery techniques. It details various methods such as uncoordinated, coordinated, and communication-induced checkpointing, along with log-based rollback recovery strategies including pessimistic, optimistic, and causal logging. Additionally, it explains the Koo-Toueg coordinated checkpointing algorithm and the Juang-Venkatesan algorithm for asynchronous checkpointing and recovery.

Uploaded by

vedaraj
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

UNIT IV

RECOVERY & CONSENSUS

Checkpointing and rollback recovery: Introduction – Background and definitions – Issues in failure recovery –
Checkpoint-based recovery – Log-based rollback recovery – Coordinated checkpointing algorithm – Algorithm
for asynchronous checkpointing and recovery. Consensus and agreement algorithms: Problem definition –
Overview of results – Agreement in a failure –free system – Agreement in synchronous systems with failures.

CHECKPOINTING AND ROLLBACK RECOVERY

The saved state is called a checkpoint, and the procedure of restarting from a previously checkpointed state is
called rollback recovery.

Explain Checkpoint-based recovery in detail.

Checkpoint-based recovery

In check point based recovery, the state of each process and the communication channel is checkpointed
frequently so that when a failure occurs, the system can be restored to a globally consistent set of checkpoints.

The three types of rollback-recovery techniques are:

1. Uncoordinated checkpointing
2. Coordinated checkpointing
3. Communication-induced checkpointing

Uncoordinated Checkpointing

 Here, each process has autonomy in deciding when to take checkpoints.


 This eliminates the synchronization overhead as there is no need for coordination between processes
and it allows processes to take checkpoints when it is most convenient or efficient.

Advantages

 Lower runtime overhead during normal execution.

Limitations

 Domino effect during a recovery


 Recovery from a failure is slow.
 Each process maintains multiple checkpoints and periodically invoke a garbage collection algorithm
 Not suitable for application with frequent output commits
Coordinated checkpointing

Coordinated checkpointing requires each process to maintain only one checkpoint on the stable storage,
reducing the storage overhead and eliminating the need for garbage collection.

There are two types of coordinated checkpoints:

 Blocking Checkpointing
 Non-blocking Checkpointing

Blocking Checkpointing

 After a process takes a local checkpoint, to prevent orphan messages, it remains blocked until the entire
checkpointing activity is complete.
 The coordinator takes a checkpoint and broadcasts a request message to all processes, asking them to
take a checkpoint.
 When a process receives this message, it stops its execution, takes a tentative checkpoint.
 It sends an acknowledgment message back to the coordinator.
 After the coordinator receives acknowledgments from all processes, it broadcasts a commit message to
all processes.
 After receiving the commit message, atomically makes the tentative checkpoint permanent and then
resumes its execution .

Non-blocking Checkpointing

The processes need not stop their execution while taking checkpoints.

Example:

Message m is sent by P0 after receiving a checkpoint request from the checkpoint coordinator. Assume m
reaches P1 before the checkpoint request. This situation results in an inconsistent checkpoint. Since checkpoint
c1,x shows the receipt of message m from P0, while checkpoint c0,x does not show m being sent from P0. To solve
inconsistent checkpoint, on-blocking checkpoint coordination protocol using this snapshot algorithm of Chandy
and Lamport in which markers play the role of the checkpoint request messages.
Communication-induced checkpointing

 Communication-induced checkpointing avoids the domino effect, while allowing processes to take
some of their checkpoints independently.
 In communication-induced checkpointing, processes take two types of checkpoints.
 The checkpoints that a process takes independently are called local checkpoints.
 The process is forced to take are called forced checkpoints.

Two types of communication-induced checkpointing:

1. Model based checkpointing

 This prevents patterns of communications and checkpoints that may result in inconsistent states among
the existing checkpoints.
 This model can be maintained by taking an additional checkpoint before every message-receiving event
that is not separated from its previous message-sending event by a checkpoint.
 Another method is by taking a checkpoint immediately after every message sending event.

2. Index-based checkpointing

This assigns monotonically increasing indexes to checkpoints, such that the checkpoints having the same index
at different processes form a consistent state.

Explain Log-based rollback recovery in detail.

Log-based rollback recovery

A log-based rollback recovery makes use of deterministic and nondeterministic events in a computation.

Deterministic and non-deterministic events

 A non-deterministic event can be the receipt of a message from another process or an event internal to
the process.
 Message send event is not a non-deterministic event.

For example, in Figure, the execution of process P0 is a sequence of four deterministic intervals. The first one
starts with the creation of the process, while the remaining three start with the receipt of messages m0, m3, and
m7, respectively. Send event of message m2 is uniquely determined by the initial state of P0 and by the receipt
of message m0, and is therefore not a non-deterministic event
Pessimistic logging

 Pessimistic logging protocols assume that a failure can occur after any nondeterministic event in the
computation.
 Pessimistic protocols implement as synchronous logging.
 The processes must take periodic checkpoints to minimize the amount of work that has to be repeated
during recovery.
 When a process fails, the process is restarted from the most recent checkpoint and the logged
determinants are used to recreate the prefailure execution.

Consider the example in Figure. During failure-free operation the logs of processes P0, P1, and P2 contain
the determinants needed to replay messages m0, m4, m7, m1, m3, m6, and m2, m5, respectively. Suppose
processes P1 and P2 fail as shown, restart from checkpoints B and C, and roll forward using their
determinant logs to deliver again the same sequence of messages. This guarantees that P1 and P2 will repeat
exactly their pre-failure execution and re-send the same messages.

Optimistic logging

 In these protocols, processes log determinants asynchronously to the stable storage.


 Optimistic logging protocol assume that logging will be complete before a failure occurs.
 Pessimistic protocols need only keep the most recent checkpoint of each process, whereas optimistic
protocols may need to keep multiple checkpoints for each process.
 The overheads in optimistic logging are complicated recovery, garbage collection, and slower output
commit.
Consider the example shown in Figure. Suppose process P2 fails before the m5 is logged to the stable storage.
Process P1 then becomes an orphan process and must roll back to undo the effects of receiving the orphan
message m6. The rollback of P1 further forces P0 to roll back to undo the effects of receiving message m7.

Casual Logging

 This combines the advantages of both pessimistic and optimistic logging.


 Like optimistic logging, it does not require synchronous access to the stable storage.
 Like pessimistic logging, it allows each process to commit output independently and never creates
orphans.

Consider the example in Figure . Messages m5 and m6 are likely to be lost on the failures of P1 and P2. process
P0 will be able to “guide” the recovery of P1 and P2 since it knows the order in which P1 should replay
messages. Similarly, P0 has the order in which P2 should replay message m2. The content of these messages is
obtained from the sender log of P0.

Explain Koo–Toueg coordinated check pointing algorithm in detail. (or)

Explain Coordinated checkpointing algorithm in detail.

 Coordinated check pointing and recovery technique that takes a consistent set of check pointing and
avoids domino effect and livelock problems during the recovery.
 This algorithm includes 2 parts:
1. check pointing algorithm
2. Recovery algorithm.

Checkpointing algorithm
The following are the assumptions made in checkpointing algorithm:

 FIFO channel
 end-to-end protocols
 single process initiation
 no process failures during the execution of the algorithm

The algorithm facilitates two kinds of checkpoints:

 Permanent checkpoints
 Tentative checkpoints

The algorithm is implemented in two phases:

Phase I:

 Initiating process Pi takes a tentative checkpoint and requests all other processes to take tentative
checkpoints.
 Every process cannot send messages after taking tentative checkpoint.
 All processes will finally have the single same decision: do or discard.
 A process says no to a request if it fails to take a tentative checkpoint.
 If Pi learns that all the processes have successfully taken tentative checkpoints, Pi decides that all
tentative checkpoints should be made permanent; otherwise, Pi decides that all the tentative checkpoints
should be discarded.

Phase II:

 Pi propagates its decision to all processes.


 On receiving the message from Pi ,all process act accordingly.

Correctness of the algorithm

 Either all or none of the processes take permanent checkpoint


 No process sends message after taking permanent checkpoint.

Rollback recovery algorithm

This algorithm restore the system state to a consistent state after a failure.

Phase I:

 An initiating process Pi sends a message to all other processes to check if they are willing to restart
from their previous checkpoints.
 A process may reply no to a restart request due to any reason.
 If Pi learns that all processes are willing to restart from their previous checkpoints, Pi decides that all
processes should roll back to their previous checkpoints.
 Otherwise, Pi aborts the rollback attempt and it may attempt a recovery at a later time.

Phase II:

 Pi propagates its decision to all processes.


 On receiving the message from Pi ,all process act accordingly.

Correctness

All processes restart from an appropriate state because, if they decide to restart, they resume execution from a
consistent state.

Explain Juang-Venkatesan algorithm for asynchronous checkpointing and recovery in detail.

Assumptions

 communication channels are reliable


 delivery messages in FIFO order
 infinite buffers
 message
 transmission delay is arbitrary but finite

Two type of log storage are maintained

 Volatile log: short time to access but lost if processor crash.


 Stable log: longer time to access but remained if crashed.

Asynchronous checkpointing:

 After executing an event, a processor records a triplet (s, m, msg_sent) in its volatile storage.
s: state of the processor before the event
m: message
msgs_sent: set of messages sent by the processor during the event.
 Local checkpoint consist of set of records, first are stored in volatile log, then moved to stable log.

Recovery algorithm

Notations:

𝑅𝐶𝑉𝐷𝑖←𝑗 (𝐶𝑘𝑃𝑡𝑖): number of messages received by 𝑝𝑖 from 𝑝𝑗 , from the beginning of computation to checkpoint
𝐶𝑘𝑃𝑡𝑖

𝑆𝐸𝑁𝑇𝑖→𝑗 (𝐶𝑘𝑃𝑡𝑖): number of messages sent by 𝑝𝑖 to 𝑝 , from the beginning of computation to checkpoint 𝐶𝑘𝑃𝑡i
Idea:

 From the set of checkpoints, find a set of consistent checkpoints


 This is done based on the number of messages sent and received.
 Recovery may involve multiple iterations of roll backs by processors.
 Whenever a processor rolls back, it is necessary for all other processors to find out if any message sent
by the rolled back processor has become an orphan message.
 The orphan messages are identified by comparing the number of messages sent to and received from
neighboring processors.
 When a processor restarts after a failure, it broadcasts a ROLLBACK message that it has failed.
 Because of the broadcast of ROLLBACK messages, the recovery algorithm is initiated at all processors.

You might also like