Checkpointing and
Rollback Recovery in
Distributed Computing
By,
[Link] MUTHU,
III-CSE-’B’.
INTRODUCTION
In distributed computing, multiple processes run on different
systems and communicate through a network.
If one process or system fails, the whole application shouldn’t
stop.
To handle this, we use Checkpointing and Rollback Recovery
techniques.
➡ In simple words:
Checkpointing means “saving the current state”
and Rollback Recovery means “restoring from that saved state
after a failure.”
What is Checkpointing?
Checkpointing means saving the current state of a process to stable storage.
If a failure happens later, the process can restart from that saved
point instead of from the beginning.
It’s like an auto-save option in games or documents.
Example:
Imagine you are typing a document in Google Docs.
Even if your laptop turns off suddenly, when you open it again,
your writing will be safe up to the last auto-save.
That auto-save point is called a Checkpoint.
What is Rollback Recovery?
Rollback Recovery means that when a system fails, it goes back (rolls back) to
the last checkpoint and continues execution from there.
So only the small work done after the last checkpoint will be lost.
Example:
In an online banking transaction, if the system crashes after debiting the amount but
before confirmation, the bank’s server rolls back to the previous checkpoint. This means
the transaction is canceled safely and your money is not lost.
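
The banking example above can be sketched in a few lines of Python. This is a minimal, single-process illustration and not a real banking system: the class name `CheckpointedAccount`, the `fail_before_confirm` flag, and the in-memory checkpoint are all assumptions made for the sketch (a real checkpoint would be written to stable storage).

```python
import copy


class CheckpointedAccount:
    """Toy process state with one in-memory checkpoint (hypothetical example)."""

    def __init__(self, balance):
        self.state = {"balance": balance, "completed": []}
        self._checkpoint = copy.deepcopy(self.state)

    def checkpoint(self):
        # Save a full copy of the current state.
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        # Discard everything done since the last checkpoint.
        self.state = copy.deepcopy(self._checkpoint)

    def transfer(self, amount, fail_before_confirm=False):
        self.checkpoint()  # checkpoint before starting the transaction
        try:
            self.state["balance"] -= amount  # debit
            if fail_before_confirm:
                raise RuntimeError("simulated crash before confirmation")
            self.state["completed"].append(amount)  # confirmation
        except RuntimeError:
            self.rollback()  # restore the state saved before the debit
            return False
        return True


account = CheckpointedAccount(balance=100)
ok_first = account.transfer(30)
ok_second = account.transfer(50, fail_before_confirm=True)
print(ok_first, ok_second, account.state)
# True False {'balance': 70, 'completed': [30]}
```

The second transfer fails after the debit, so the rollback restores the balance of 70 that was saved just before it started.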
Rollback Propagation (Domino Effect):
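Sometimes, when one process rolls back, it forces other connected processes
to roll back as well, so that no process remembers a message its sender has undone.
This is called rollback propagation. When the chain reaction keeps cascading,
possibly all the way back to the initial state, it is called the Domino Effect.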
Example:
In a bank, money is sent from Branch A → Branch B → Branch C. If Branch B fails
because of a network problem and rolls back, then Branch A and Branch C must also
roll back to keep all account balances correct and the system consistent.
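
To make rollback propagation concrete, the sketch below computes how far each process must roll back after a failure. It is a simplified model under stated assumptions: two hypothetical branches `A` and `B`, integer timestamps on a shared timeline, and a message counted as an "orphan" when its receipt is kept but its send has been undone. The function name `recovery_line` and the sample data are illustrative only.

```python
import math


def recovery_line(checkpoints, messages, failed):
    """Return, per process, the checkpoint time it must roll back to.

    checkpoints: {process: [checkpoint times, including 0 for the initial state]}
    messages:    [(sender, sent_at, receiver, received_at), ...]
    math.inf means "keeps its current state, no rollback needed".
    """
    line = {p: math.inf for p in checkpoints}
    line[failed] = max(checkpoints[failed])  # failed process uses its latest checkpoint
    changed = True
    while changed:
        changed = False
        for sender, sent_at, receiver, received_at in messages:
            # Orphan message: the receive is kept but the send was rolled back.
            if sent_at > line[sender] and received_at <= line[receiver]:
                line[receiver] = max(
                    t for t in checkpoints[receiver] if t < received_at
                )
                changed = True
    return line


# Checkpoints taken independently, with messages crossing between them.
checkpoints = {"A": [0, 3, 9], "B": [0, 6]}
messages = [
    ("B", 1, "A", 2),
    ("A", 4, "B", 5),
    ("B", 7, "A", 8),
    ("A", 10, "B", 11),
]
result = recovery_line(checkpoints, messages, failed="B")
print(result)
# {'A': 0, 'B': 0}
```

In this run B first returns to its checkpoint at time 6, which forces A back to 3, which forces B back to 0, which finally forces A back to 0: both branches lose all their work, which is the domino effect.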
Types of Checkpointing Techniques
1. Uncoordinated (Independent)
Checkpointing:
Each process takes its checkpoint independently
without communicating with others.
It’s simple to implement but can cause a domino
effect because the independent checkpoints may not
form a consistent global state.
Example:
In a distributed weather monitoring system, if
each sensor node saves its state independently,
the checkpoint restored after one node fails may
not match the states of the other nodes, which
disturbs overall system consistency.
2. Coordinated Checkpointing:
All processes exchange control messages so that
their checkpoints are taken together and form a
consistent global state.
Because every process can restart from the same
set of checkpoints, this avoids the domino effect.
Example:
In online ticket booking, all modules like
payment, seat booking, and confirmation take a
checkpoint together.
So, if any failure occurs, the system can restore
cleanly from that consistent point.
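
The ticket-booking example can be sketched as a blocking two-phase protocol: every module first takes a tentative checkpoint, and the checkpoints become permanent only when all modules have acknowledged. This is a single-program simulation under assumptions: the names `Process` and `coordinated_checkpoint` are hypothetical, method calls stand in for network messages, and in-flight messages are ignored.

```python
class Process:
    """One module taking part in a coordinated checkpoint (illustrative only)."""

    def __init__(self, name, state):
        self.name = name
        self.state = state
        self.permanent = dict(state)  # last committed checkpoint
        self.tentative = None
        self.blocked = False

    def take_tentative(self):
        self.blocked = True  # stop application work during the protocol
        self.tentative = dict(self.state)
        return True  # acknowledgement to the coordinator

    def commit(self):
        self.permanent = self.tentative
        self.tentative = None
        self.blocked = False

    def abort(self):
        self.tentative = None
        self.blocked = False

    def restore(self):
        self.state = dict(self.permanent)


def coordinated_checkpoint(processes):
    # Phase 1: ask every process for a tentative checkpoint.
    acks = [p.take_tentative() for p in processes]
    # Phase 2: commit only if all processes acknowledged.
    if all(acks):
        for p in processes:
            p.commit()
        return True
    for p in processes:
        p.abort()
    return False


modules = [
    Process("payment", {"paid": 1}),
    Process("seats", {"booked": 1}),
    Process("confirmation", {"sent": 1}),
]
committed = coordinated_checkpoint(modules)

modules[0].state["paid"] = 2  # second booking: paid, but crash before seat is booked
for m in modules:
    m.restore()  # every module returns to the same consistent point
restored = {m.name: m.state for m in modules}
print(committed, restored)
# True {'payment': {'paid': 1}, 'seats': {'booked': 1}, 'confirmation': {'sent': 1}}
```

Because all three modules restore checkpoints taken in the same round, no module remembers a payment that the others have forgotten.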
3. Communication-Induced
Checkpointing:
Here, each process takes checkpoints on its own,
and extra (forced) checkpoints are triggered when
processes communicate, based on information
attached (piggybacked) to messages.
It avoids separate coordination messages and keeps
the set of checkpoints consistent.
Example (Collaborative Document Editing):
In Google Docs, multiple users edit the same
document. Checkpoints are automatically saved
when users make changes. If the system crashes,
the latest edits are safe and the document stays
consistent.
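
One common way to realize this idea is an index-based scheme: each process numbers its checkpoints, attaches the current number to every message, and takes a forced checkpoint before delivering a message that carries a higher number. The sketch below is a simplified illustration of that rule, not the mechanism of any particular product; the class `CicProcess` and its fields are hypothetical.

```python
class CicProcess:
    """Process using an index-based communication-induced checkpoint rule."""

    def __init__(self, name):
        self.name = name
        self.index = 0               # current checkpoint index
        self.state = []              # payloads delivered so far
        self.checkpoints = {0: []}   # index -> saved copy of state

    def basic_checkpoint(self):
        # Checkpoint taken on the process's own schedule.
        self.index += 1
        self.checkpoints[self.index] = list(self.state)

    def send(self, payload):
        # Piggyback the current checkpoint index on the message.
        return {"payload": payload, "index": self.index}

    def receive(self, message):
        if message["index"] > self.index:
            # Forced checkpoint, taken BEFORE the message is delivered.
            self.index = message["index"]
            self.checkpoints[self.index] = list(self.state)
        self.state.append(message["payload"])


p, q = CicProcess("p"), CicProcess("q")
p.basic_checkpoint()              # p moves to index 1 on its own
q.receive(p.send("edit-1"))       # q is forced to take checkpoint 1 first
p.receive(q.send("reply-1"))      # same index, so no forced checkpoint
print(q.index, sorted(q.checkpoints), q.state)
# 1 [0, 1] ['edit-1']
```

Checkpoints that carry the same index then form a consistent set: q's checkpoint 1 was saved before it delivered the message p sent after p's checkpoint 1.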
4. Log-Based Rollback Recovery:
This method uses both checkpoints and
message/event logs.
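It assumes the system’s behavior is piecewise
deterministic (PWD): execution between two
nondeterministic events (such as receiving a
message) is deterministic, so replaying the same
logged events in the same order reproduces the
same state.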
After a failure, the system replays the logs from
the last checkpoint to restore the exact state.
Example:
In databases, every transaction (like insert,
update, delete) is logged.
If a crash happens, the database replays those
logs to recover completed transactions safely.
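
The database example can be sketched as a small key-value store that logs each operation before applying it. This is a toy model under assumptions: the class `LoggedStore` is hypothetical, and the checkpoint and log are plain Python objects standing in for stable storage that survives the crash.

```python
class LoggedStore:
    """Key-value store recovered from a checkpoint plus an operation log."""

    def __init__(self):
        self.data = {}    # volatile state, lost in a crash
        self.saved = {}   # last checkpoint (assumed to be on stable storage)
        self.log = []     # operations since the checkpoint (assumed stable)

    def _apply(self, op, key, value=None):
        if op == "delete":
            self.data.pop(key, None)
        else:  # "insert" or "update"
            self.data[key] = value

    def execute(self, op, key, value=None):
        self.log.append((op, key, value))  # log first, then apply
        self._apply(op, key, value)

    def checkpoint(self):
        self.saved = dict(self.data)
        self.log.clear()  # older log entries are no longer needed

    def recover(self):
        # Restore the checkpoint, then replay the log in its original order.
        self.data = dict(self.saved)
        for op, key, value in self.log:
            self._apply(op, key, value)


store = LoggedStore()
store.execute("insert", "a", 1)
store.checkpoint()
store.execute("update", "a", 2)
store.execute("insert", "b", 3)
store.execute("delete", "a")
before_crash = dict(store.data)

store.data = {}   # simulated crash: volatile state is lost
store.recover()
print(store.data == before_crash, store.data)
# True {'b': 3}
```

Replay works here because each operation is deterministic given the logged input, which is the piecewise deterministic assumption in miniature.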
ADVANTAGES:
1. Fault Tolerance:
• Helps the system recover automatically after a failure without restarting completely.
2. Reduced Re-computation:
• Only the work done after the last checkpoint is lost, saving time and effort.
3. Consistency:
• Maintains a consistent state across all distributed processes after recovery.
4. Minimal Data Loss:
• Because states are saved periodically, very little data is lost during a crash.
5. System Reliability:
• Increases overall dependability and stability of distributed applications.
DISADVANTAGES:
1. Overhead:
• Saving checkpoints frequently uses more CPU, memory, and storage space.
2. Coordination:
• Synchronizing checkpoints among multiple processes is difficult.
3. Domino Effect:
• In uncoordinated checkpointing, one rollback can cause many others to roll back too.
4. Storage Requirement:
• Large storage is needed to maintain multiple checkpoints and logs.
5. Delay:
• Checkpointing and recovery can slow down normal system performance.