Unit 4
Tags
Below are perfect 16-mark, exam-ready answers with short key points first + full
explanation.
Clear, structured, easy to write in the answer sheet.
4. What are the key assumptions underlying
while designing agreement algorithms? Brief
them.
Short Key Points
Agreement algorithms assume a well-defined system model.
Important assumptions:
1. Timing model – synchronous / asynchronous / partially synchronous.
2. Fault model – crash, omission, Byzantine.
3. Communication model – reliable channels, FIFO / non-FIFO, message delays.
4. Process behavior – deterministic execution.
5. Known system size – number of processes (n) and maximum faulty processes (f).
6. Authentication assumptions – signatures, no message forgery, etc.
Detailed Answer
Agreement algorithms (consensus) require processes in a distributed system to agree on a
single value. To guarantee correctness, several core system assumptions must be clarified.
1. Timing Model Assumptions
Agreement protocols rely on assumptions about time:
Synchronous systems:
Unit 4 1
Known upper bounds on message delay and processing time.
Easier to design consensus (e.g., phase-based, round-based algorithms).
Asynchronous systems:
No timing guarantees.
FLP impossibility: deterministic consensus impossible with even 1 crash failure.
Partially synchronous systems:
System behaves synchronously after some unknown global stabilization time.
Used in practical BFT protocols like PBFT.
2. Fault Model Assumptions
Algorithms must know what type of failures may occur:
Crash failures: Process stops and does nothing further.
Omission failures: Fails to send/receive some messages.
Byzantine failures: Arbitrary misbehavior — sending conflicting or corrupted messages.
Bound on failures: Maximum f faulty processes must be known in advance.
E.g., Byzantine agreement requires n > 3f.
3. Communication Model Assumptions
Reliable channels: messages are never lost, duplicated, or corrupted.
FIFO channels (optional): messages between two processes arrive in sending order.
No forgery: one process cannot impersonate another.
These assumptions affect the design of total order broadcast, consensus, and Byzantine
tolerance.
4. Process Behavior Assumptions
Processes are deterministic given the same inputs.
Shared algorithm logic — all processes follow identical code.
Use of randomization (for randomized consensus) is explicitly allowed in some models.
Unit 4 2
5. Knowledge of System Size
Every process knows:
Number of processes ( n ).
Upper bound on faults ( f ).
Consensus correctness conditions rely on these (e.g., n > 2f or n > 3f ).
6. Authentication Assumptions
Presence or absence of:
Message authentication codes,
Digital signatures,
Cryptographic hash chains, etc.
These assumptions affect correctness bounds (e.g., authenticated Byzantine systems require
only n > 2f).
Conclusion
Designing agreement algorithms requires a precise understanding of system timing, fault
type, communication reliability, and process behavior. These assumptions determine
possibility, complexity, and fault tolerance bounds of consensus algorithms.
5. Construct the rollback recovery process using
a coordinated checkpointing algorithm and
illustrate how Juang–Venkatesan asynchronous
checkpointing ensures consistency
Short Key Points
Rollback recovery uses checkpoints + logs to restore process state after a crash.
Coordinated checkpointing: All processes take checkpoints in a globally consistent
manner.
Avoids domino effect and orphan messages.
Unit 4 3
Juang–Venkatesan algorithm: Asynchronous, uses dependency vectors to create
consistent global snapshots without stopping the system.
Works even with frequent failures.
A. Rollback Recovery Using Coordinated Checkpointing
Concept
Coordinated checkpointing forces all processes to take a checkpoint such that the collection
forms a consistent global state.
A global state is consistent if no message is received before it is sent (no orphan messages).
Algorithm Steps
1. Initiation:
One process acts as a coordinator and sends a “CHECKPOINT REQUEST” to all
processes.
2. Quiescence or Blocking Phase:
Each process completes current events, blocks new messages temporarily, and takes a
local checkpoint.
3. Acknowledgment:
After checkpointing, each process sends “CHECKPOINT DONE” to the coordinator.
4. Commit:
If all processes respond correctly, the coordinator broadcasts a COMMIT
CHECKPOINT message.
All processes mark their checkpoints as permanent.
5. Abort (if failure occurs):
If a process fails during checkpointing, coordinator sends ABORT and older checkpoints
remain valid.
Advantages
No domino effect.
Recovery is quick — restart from last global checkpoint.
Unit 4 4
No orphan messages.
Disadvantages
Blocking processes temporarily → latency.
High coordination overhead.
B. Juang–Venkatesan Asynchronous Checkpointing
Algorithm
Goal
To obtain global consistent checkpoints without blocking all processes.
Main Idea
Uses:
Local checkpoints,
Message dependency tracking,
Checkpoint initiation tokens,
to ensure consistency.
Algorithm Operation
1. Checkpoint Initiation
Any process can initiate a checkpoint by taking a local checkpoint and sending a
CHECKPOINT TOKEN to all others.
2. Dependency Tracking
Every message carries the sender’s dependency vector.
If a process receives a message from a sender whose checkpoint is earlier, it also
takes a forced checkpoint.
3. Propagation
This ensures that if process A's state depends on process B’s state, then B also takes
a checkpoint to maintain consistency.
4. Termination
Unit 4 5
When all processes have taken necessary checkpoints, token circulation ends.
The set of checkpoints obtained is globally consistent.
Why It Works
Forced checkpoints ensure all message causal paths are included.
Eliminates inconsistent states (no missing messages).
No blocking required — useful for high-availability systems.
Advantages over coordinated checkpointing
Non-blocking.
Better suited for long-running or real-time distributed systems.
Handles frequent failures better.
Conclusion
Coordinated checkpointing ensures consistency through blocking, whereas Juang–Venkatesan
asynchronous checkpointing uses dependency tracking to achieve non-blocking, consistent
snapshots in a distributed system.
6. Illustrate in detail the different types of
failures in distributed systems
Short Key Points
Distributed systems can experience multiple failures:
1. Crash Failure
2. Omission Failure (Send/Receive)
3. Timing Failure
4. Arbitrary/Byzantine Failure
5. Network Failure (Partitioning)
6. Response & Value Failure
Unit 4 6
7. Security Failures
Detailed Answer
Distributed systems must handle failures that arise due to hardware faults, software bugs,
timing violations, or adversarial behavior. Key failure categories:
1. Crash Failures
A process stops execution unexpectedly but was working correctly before the crash.
Most common and easiest to handle.
Example: process stops due to power outage.
Detection: Timeout-based heartbeat monitoring.
2. Omission Failures
Failures to send or receive messages.
a. Send Omission
Process fails to send a message that it is supposed to send.
b. Receive Omission
Process fails to read a message that arrived.
Impact: Causes message loss; complicates coordination protocols like mutual exclusion and
consensus.
3. Timing Failures
Process or channel violates timing assumptions.
Examples:
Process takes too long to respond.
Message delay exceeds expected time bound.
Occurs in: Real-time systems, clock drift, unpredictable network latencies.
4. Response Failures
Unit 4 7
Process produces incorrect output.
Two types:
Value failure: Wrong output value.
State transition failure: Process moves to wrong next state.
5. Arbitrary / Byzantine Failures
The most severe failure model.
Process behaves arbitrarily, maliciously, or inconsistently.
Sends conflicting messages to different processes.
Hardest to detect and tolerate.
Requires:
Expensive algorithms like BFT, OM(f), PBFT.
Need n > 3f processes to tolerate f Byzantine failures (oral messages).
6. Network Failures
a. Message corruption
Data modified in transit.
b. Message loss
Packets dropped.
c. Network partition
Network splits into disconnected subsets.
Effect:
Can cause inconsistency, split-brain problems, leader election failure.
7. Security Failures
Unauthorized access.
Data tampering.
Denial-of-service attacks.
Unit 4 8
Man-in-the-middle attacks.
Relevant in cloud and internet-based distributed systems.
Conclusion
Understanding failure types is essential for designing reliable distributed algorithms such as
consensus, checkpointing, replication, and recovery protocols. Different failures require
different tolerance techniques (timeouts for crash, BFT for Byzantine, redundancy for
network issues).
If you want, I can combine all the Part-B
answers for Unit 3 into one fully formatted
exam notes PDF.
Unit 4 9