Checkpointing & Rollback in Distributed Systems


Checkpointing and Rollback Recovery in Distributed Computing
By,
[Link] MUTHU,
III-CSE-’B’.
INTRODUCTION
• In distributed computing, multiple processes run on different systems and communicate over a network.
• If one process or system fails, the whole application should not have to stop.
• Checkpointing and rollback recovery techniques are used to handle such failures.
➡ In simple words:
Checkpointing means “saving the current state”,
and rollback recovery means “restoring from that saved state after a failure.”
What is Checkpointing?
• Checkpointing means saving the current state of a process to stable storage.
• If a failure happens later, the process can restart from that saved point, not from the beginning.
• It is like the auto-save option in games or document editors.
Example:
Imagine you are typing a document in Google Docs.
If your laptop suddenly turns off, your writing is safe up to the last auto-save when you open the document again.
That auto-save point is a checkpoint.
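The auto-save idea can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `save_checkpoint` and `load_checkpoint` are hypothetical, the state is assumed to be a picklable object, and a local file stands in for stable storage.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, path):
    """Save `state` to `path` so that a crash never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory first.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # push the bytes to disk before renaming
    # Replace the old checkpoint in a single step.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Read back the most recently saved state."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

Writing to a temporary file and renaming it means the previous checkpoint stays intact until the new one is completely written.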
What is Rollback Recovery?
• Rollback recovery means that after a failure the system goes back (rolls back) to the last checkpoint and continues execution from there.
• Only the work done after the last checkpoint is lost.
Example:
In an online banking transaction, if the system crashes after debiting the amount but before confirmation, the bank’s server rolls back to the previous checkpoint. The transaction is canceled safely, and your money is not lost.
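The banking example can be sketched as follows. The `Bank` class, its method names, and the `crash_after_debit` flag are hypothetical and exist only to simulate a failure between the debit and the credit. The checkpoint is kept in memory here, whereas a real system would keep it on stable storage.

```python
import copy


class Bank:
    def __init__(self, balances):
        self.balances = dict(balances)
        self._saved = copy.deepcopy(self.balances)

    def checkpoint(self):
        # Save a copy of the current balances.
        self._saved = copy.deepcopy(self.balances)

    def rollback(self):
        # Discard everything done since the last checkpoint.
        self.balances = copy.deepcopy(self._saved)

    def transfer(self, src, dst, amount, crash_after_debit=False):
        self.balances[src] -= amount
        if crash_after_debit:
            raise RuntimeError("simulated crash before the credit")
        self.balances[dst] += amount


bank = Bank({"alice": 100, "bob": 50})
bank.checkpoint()
try:
    bank.transfer("alice", "bob", 30, crash_after_debit=True)
except RuntimeError:
    bank.rollback()  # the half-finished transfer is undone
```

After the rollback, the balances are the same as at the checkpoint, so the debited amount is not lost.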
Rollback Propagation (Domino Effect):
• When one process rolls back, it may force other processes it has exchanged messages with to roll back too, so that the system stays consistent. This is called rollback propagation.
• When this chain reaction keeps cascading, possibly all the way back to the initial state, it is called the Domino Effect.
Example:
In a bank, money is sent from Branch A → Branch B → Branch C. If Branch B fails because of a network problem and rolls back, then Branch A and Branch C must also roll back to keep all account balances correct and the system consistent.
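The chain reaction can be simulated with a small sketch. It makes several simplifying assumptions that are not part of the slides: all events carry integer timestamps from one global clock, every process has an initial checkpoint at time 0, and the function name `recovery_line` is hypothetical. A message becomes an "orphan" when its receiver still remembers receiving it but its sender has rolled back to before sending it.

```python
def recovery_line(checkpoints, messages, failed, fail_time):
    """Return, for each process, the checkpoint time it must roll back to."""
    # Processes that did not fail start out keeping their current state.
    line = {p: fail_time for p in checkpoints}
    # The failed process restarts from its latest checkpoint.
    line[failed] = max(t for t in checkpoints[failed] if t <= fail_time)

    changed = True
    while changed:
        changed = False
        for sender, t_send, receiver, t_recv in messages:
            send_undone = t_send > line[sender]
            recv_kept = t_recv <= line[receiver]
            if send_undone and recv_kept:
                # Orphan message: the receiver must roll back to before the receive.
                line[receiver] = max(t for t in checkpoints[receiver] if t < t_recv)
                changed = True
    return line


# Two processes whose checkpoints and messages are interleaved.
checkpoints = {"A": [0, 8, 16], "B": [0, 4, 12]}
messages = [("A", 2, "B", 3), ("B", 6, "A", 7), ("A", 10, "B", 11), ("B", 14, "A", 15)]
line = recovery_line(checkpoints, messages, failed="B", fail_time=20)
```

In this run B first rolls back to time 12, which forces A back to 8, which forces B back to 4, and so on until both processes are at their initial state. That is the domino effect.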
Types of Checkpointing Techniques
1. Uncoordinated (Independent) Checkpointing:
Each process takes its checkpoints independently, without communicating with the others.
It is simple to implement but can cause a domino effect, because the independently taken checkpoints may not form a consistent global state.
Example:
In a distributed weather monitoring system where each sensor saves its data independently, one sensor failure can disturb the consistency of the whole system.
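What it means for checkpoints to "not match" can be shown with a small check. This is a sketch under assumptions: events carry timestamps from one global clock, and the names `Message`, `orphan_messages`, and `is_consistent` are hypothetical. A set of checkpoints (one per process) is treated as inconsistent if some message was received before the receiver's checkpoint but sent after the sender's checkpoint.

```python
from collections import namedtuple

Message = namedtuple("Message", "sender sent_at receiver received_at")


def orphan_messages(cut, messages):
    """Messages recorded as received but not recorded as sent in the given checkpoints."""
    return [
        m for m in messages
        if m.sent_at > cut[m.sender] and m.received_at <= cut[m.receiver]
    ]


def is_consistent(cut, messages):
    return not orphan_messages(cut, messages)


# A sensor sends a reading at time 7; the aggregator receives it at time 8.
readings = [Message("sensor", 7, "aggregator", 8)]

# The sensor checkpointed at 5 (before sending), the aggregator at 9 (after receiving).
mismatched = is_consistent({"sensor": 5, "aggregator": 9}, readings)

# Here both checkpoints were taken before the message existed.
matched = is_consistent({"sensor": 5, "aggregator": 6}, readings)
```

In the first case the aggregator's checkpoint contains a reading that the sensor's checkpoint never sent, so the two checkpoints cannot be used together for recovery.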

2. Coordinated Checkpointing:
All processes coordinate with each other so that their checkpoints are taken together and form a consistent global state.
This avoids the domino effect, at the cost of extra coordination messages.
Example:
In online ticket booking, modules such as payment, seat booking, and confirmation take a checkpoint together.
If any failure occurs, the system can restore cleanly from that consistent point.
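One common way to coordinate is a two-phase scheme: every process first takes a tentative checkpoint, and the tentative checkpoints become permanent only if all processes succeeded. The sketch below runs in a single Python program and is an assumption-laden illustration, not a full protocol: the `Participant` class, the `healthy` flag, and `coordinated_checkpoint` are hypothetical, and messages in transit during the checkpoint are ignored.

```python
import copy


class Participant:
    def __init__(self, name, state, healthy=True):
        self.name = name
        self.state = state
        self.healthy = healthy
        self.tentative = None
        self.permanent = copy.deepcopy(state)

    def take_tentative(self):
        # Phase 1: save a tentative copy and report success or failure.
        if not self.healthy:
            return False
        self.tentative = copy.deepcopy(self.state)
        return True

    def commit(self):
        # Phase 2 (success): the tentative checkpoint becomes permanent.
        self.permanent = self.tentative
        self.tentative = None

    def abort(self):
        # Phase 2 (failure): keep the old permanent checkpoint.
        self.tentative = None


def coordinated_checkpoint(participants):
    """Commit a new checkpoint everywhere, or nowhere."""
    if all(p.take_tentative() for p in participants):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False
```

Because the checkpoint is committed by all modules or by none, the permanent checkpoints always belong to the same round.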

3. Communication-Induced Checkpointing:
Here, processes take some checkpoints on their own and are forced to take others when they communicate, based on information attached (piggybacked) to messages.
It avoids explicit coordination messages while still keeping the checkpoints consistent.
Example (Collaborative Document Editing):
In Google Docs, multiple users edit the same document, and their edits are saved automatically as they are made. If the system crashes, the latest edits are safe and the document stays consistent.
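A simple index-based variant of this idea can be sketched as follows. It is one possible scheme, not necessarily the one the slides have in mind: each process numbers its checkpoints, attaches its current checkpoint index to every message, and a receiver that sees a higher index takes a forced checkpoint before delivering the message. The `Process` class and its method names are hypothetical.

```python
import copy


class Process:
    def __init__(self, name):
        self.name = name
        self.state = []              # delivered payloads stand in for application state
        self.index = 0               # index of the latest checkpoint
        self.checkpoints = {0: []}   # checkpoint index -> saved state

    def _checkpoint(self, index):
        self.index = index
        self.checkpoints[index] = copy.deepcopy(self.state)

    def basic_checkpoint(self):
        # A checkpoint the process decides to take on its own.
        self._checkpoint(self.index + 1)

    def send(self, payload):
        # Piggyback the current checkpoint index on the message.
        return {"payload": payload, "index": self.index}

    def receive(self, message):
        if message["index"] > self.index:
            # Forced checkpoint, taken BEFORE the message is delivered.
            self._checkpoint(message["index"])
        self.state.append(message["payload"])


p, q = Process("p"), Process("q")
p.basic_checkpoint()              # p moves to checkpoint index 1
q.receive(p.send("edit-1"))       # q is forced to checkpoint index 1 first
```

The intent of such a scheme is that checkpoints sharing the same index can be used together during recovery, without any extra coordination messages.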

4. Log-Based Rollback Recovery:
This method uses both checkpoints and message/event logs.
It assumes the system’s behavior is piecewise deterministic (PWD): execution is a sequence of deterministic intervals, each started by a nondeterministic event such as receiving a message, so replaying the same logged events reproduces the same state.
After a failure, the system restores the last checkpoint and replays the logged events to reach the exact pre-failure state.
Example:
In databases, every transaction (such as insert, update, delete) is logged.
If a crash happens, the database replays those logs to recover completed transactions safely.
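Checkpoint plus replay can be sketched with a small account store. The sketch assumes that applying an event is deterministic and that the log survives the crash. The `LoggedStore` class and its method names are hypothetical, and the log is held in memory only to keep the example short.

```python
import copy


class LoggedStore:
    def __init__(self):
        self.balances = {}
        self._checkpoint = {}
        self._log = []  # events since the last checkpoint (assumed to survive a crash)

    def _apply(self, event):
        # Deterministic: the same event on the same state gives the same result.
        account, delta = event
        self.balances[account] = self.balances.get(account, 0) + delta

    def handle(self, event):
        self._log.append(event)  # log first, then apply
        self._apply(event)

    def checkpoint(self):
        self._checkpoint = copy.deepcopy(self.balances)
        self._log.clear()        # older events are no longer needed

    def recover(self):
        # Restore the checkpoint, then replay the logged events in order.
        self.balances = copy.deepcopy(self._checkpoint)
        for event in self._log:
            self._apply(event)


store = LoggedStore()
store.handle(("alice", 100))
store.checkpoint()
store.handle(("alice", -30))
store.handle(("bob", 30))
before_crash = copy.deepcopy(store.balances)

store.balances = {}  # simulate losing the in-memory state
store.recover()
```

Unlike recovery from a checkpoint alone, the work done after the last checkpoint is not lost here, because the log allows it to be replayed.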

ADVANTAGES:
1. Fault Tolerance:
• Helps the system recover automatically after a failure without restarting completely.
2. Less Re-computation:
• Only the work done after the last checkpoint is lost, saving time and effort.
3. Consistency:
• Maintains a consistent state across all distributed processes after recovery.
4. Less Data Loss:
• Because states are saved periodically, very little data is lost during a crash.
5. System Reliability:
• Increases overall dependability and stability of distributed applications.

DISADVANTAGES:
1. Overhead:
• Saving checkpoints frequently uses more CPU, memory, and storage space.
2. Coordination:
• Synchronizing checkpoints among multiple processes is difficult.
3. Domino Effect:
• In uncoordinated checkpointing, one rollback can cause many others to roll back too.
4. Storage Requirement:
• Large storage is needed to maintain multiple checkpoints and logs.
5. Delay:
• Checkpointing and recovery can slow down normal system performance.
