Distributed_Computing_series_2
Performance Metrics of Mutual Exclusion Algorithms
The performance of distributed mutual exclusion algorithms is generally measured using the following
four key metrics:
1. Message Complexity
Definition: The total number of messages required per Critical Section (CS) execution by a site.
Tip to memorize: How much network traffic does it create to enter the CS?
2. Synchronization Delay (SD)
Definition: The time interval required after one site leaves the CS and before the next site enters
the CS.
Tip to memorize: The gap or transition time between two consecutive CS executions.
3. Response Time
Definition: The total time interval a request waits for its CS execution to be over, measured from
the exact moment its request messages have been sent out.
Tip to memorize: Waiting Time + CS Execution Time.
4. System Throughput
Definition: The rate at which the system executes requests for the Critical Section.
Calculation: It depends on the Synchronization Delay (SD) and the average Critical Section
execution time (E).
Equation: System throughput = 1/(SD + E)
Comparison: Token-Based vs. Non-Token-Based Approach
Basis of Token-Based Approach Non-Token-Based Approach
Comparison
Basic Principle A unique token is circulated and Two or more successive rounds of
shared among the sites in the system. messages are exchanged among the
sites.
Condition to A site can enter the Critical Section A site enters the CS when an
Enter CS (CS) only if it possesses the assertion (defined on its local
unique token. variables) becomes true.
Basis of Token-Based Approach Non-Token-Based Approach
Comparison
Ensuring Mutual Guaranteed naturally because the Guaranteed by resolving conflicts
Exclusion token is strictly unique (only one through message exchange and
exists). priority mechanisms.
Conflict Uses sequence numbers to track Uses timestamps (Lamport-style
Resolution / requests and determine priority. logical clocks) to decide the priority of
Priority conflicting requests.
Holding the A site can hold the token and A site must request permission from
Right to CS repeatedly enter the CS until it sends other sites every time it wants to enter
the token to another site. the CS.
Special States Can have an "idle token state" (a Sites are simply in "requesting",
site holding the token is executing "executing", or "idle" states. There is
outside the CS). no idle token state.
Algorithm Suzuki–Kasami’s Broadcast Lamport’s Algorithm, Ricart–Agrawala
Examples Algorithm. Algorithm.
Main Issues in Implementing DSM Software
When designing a DSM system, the primary challenge is hiding the underlying message-passing
mechanism while maintaining efficiency. The main design issues are:
1. Semantics of Concurrent Access
Determining what semantics (rules) to allow when multiple processors access shared objects at the
same time.
These semantics must be clearly specified so programmers can write their application logic
appropriately.
2. Implementing Semantics (Replication Strategy)
Deciding the best way to implement concurrent access, usually through replication.
Requires deciding the degree of replication: Partial replication (at some sites) vs. Full replication
(at all sites).
Requires deciding the type of replication: Read-replication, write-replication, or both.
3. Placement of Replicas
Selecting the optimal locations for replication (if full replication is not used) to maximize system
efficiency and performance.
4. Locating Remote Data
Determining how to find the exact location of remote data that the application needs to access
(especially when full replication is not implemented).
5. Reducing Overhead
Minimizing communication delays and the number of hidden ("under the covers") messages
required to implement the shared data semantics.
Coordinated vs. Uncoordinated Checkpointing
This comparison focuses on how distributed systems save process states (checkpoints) to recover
from failures.
Basis of Coordinated Checkpointing Uncoordinated Checkpointing
Comparison
Basic Principle All processes synchronize to take Each process takes its checkpoints
their checkpoints at the same time. independently, without coordinating with
others.
Domino Effect Avoids the domino effect. This is Prone to the domino effect (cascaded
its primary advantage. rollbacks). This is its main disadvantage.
State Creates a globally consistent Only creates locally consistent states. A
Consistency state across the system with each globally consistent state (recovery line)
set of checkpoints. must be calculated after a failure.
Overhead High performance overhead. It Very low performance overhead.
during Normal requires extra messages for Processes do not need to communicate
Operation synchronization, and processes or wait for others to take checkpoints.
may need to block and wait.
Rollback & Simple and fast. The system Complex and potentially slow. It
Recovery simply rolls back to the last saved involves searching through stored
Process globally consistent state. checkpoints to find a consistent
recovery line, which may be far in the
past.
Storage Lower. Since the domino effect is Higher. Multiple checkpoints for each
Overhead avoided, older checkpoints can be process must be stored to allow for
discarded more aggressively. finding a recovery line.
Also Known As System-wide Checkpointing Independent Checkpointing
How Quorum-Based Algorithms Differ from Other Mutual
Exclusion Algorithms
Quorum-based mutual exclusion algorithms represent a significant departure from both non-token-
based (e.g., Lamport, Ricart-Agrawala) and token-based approaches. The core difference is that a site
does not need permission from all other sites to enter the critical section.
Here are the key points of differentiation:
1. The Permission Set: Subset vs. All
Quorum-Based: A site requests permission to enter the CS from only a pre-defined subset of
sites, called a quorum.
Other Approaches (Non-Token): A site must broadcast its request and receive permission from all
other sites in the system.
2. Message Complexity
Quorum-Based: This approach significantly reduces the message complexity. Instead of
sending messages to (N-1) sites, a site only communicates with the sites in its quorum. For
example, Maekawa's algorithm achieves a message complexity of approximately 3√N, which is
much better than the O(N) complexity of Lamport's or Ricart-Agrawala's algorithms.
Other Approaches: Message complexity is high as it is directly proportional to the number of sites
(N) in the system.
3. Conflict Resolution Mechanism
Quorum-Based: Conflict resolution is handled by the sites that are common to the intersecting
quorums of the competing sites. The "Intersection Property" of quorums guarantees that any two
quorums in the system have at least one site in common. This common site acts as the arbitrator,
ensuring only one request is granted at a time.
Other Approaches: In non-token-based systems, all sites participate in conflict resolution by
comparing timestamps of incoming requests.
4. Reply and Release Protocol
Quorum-Based: A site in a quorum can send only one REPLY message at any time. It cannot
grant permission to another site until it receives a RELEASE message from the site that it last gave
permission to. This effectively "locks" a site's vote.
Other Approaches: A site can often send multiple REPLY messages immediately if it is not
currently requesting or executing the CS itself.
Comparison Table
Feature Quorum-Based Approach Other Approaches (Non-Token & Token-
Based)
Permission A subset of sites (the quorum). All other sites (in non-token) or no one (if
Group you have the token).
Message Low (e.g., O(√N)). Its primary High (e.g., O(N)).
Complexity advantage.
Conflict Handled locally by the intersecting Handled globally by all sites comparing
Resolution sites of competing quorums. timestamps (non-token).
Resource More efficient and less network High network load due to broadcast
Usage load. messages.
Fault More resilient. The system can Less resilient (non-token). The failure of
Tolerance function even if some sites outside a single site can prevent any other site
a quorum fail. from entering the CS.
Maekawa's Algorithm for Mutual Exclusion
Maekawa's algorithm was the first and is a classic example of a quorum-based mutual exclusion
algorithm. Its primary goal is to reduce the message complexity required to achieve mutual exclusion,
making it more efficient than algorithms like Ricart-Agrawala which require communication with all other
sites.
1. Core Concepts
The algorithm is built upon the idea of Request Sets (Quorums).
Request Set (Ri): For each site Si , there is a specific subset of sites Ri (including Si itself) that
it must request permission from. This Ri is the quorum for site Si .
Intersection Property: The request sets are constructed in such a way that for any two different
sites, Si and Sj , their request sets must have at least one common site.
Ri ∩ Rj ≠ ∅ (This is the non-empty intersection property).
Arbitrator Site: This common site acts as an arbitrator or a "gatekeeper" to resolve concurrent
requests from Si and Sj , ensuring only one of them gets permission at a time.
Optimal Quorum Size: To minimize the size of the quorum, Maekawa showed that the size K for
a system of N sites is approximately √N . This leads to a message complexity of O(√N) instead of
O(N) .
2. The Algorithm in Detail
Example of Maekawa's Algorithm
Let's consider a system with N=7 sites {S1, S2, S3, S4, S5, S6, S7}. We can form quorums of size
K=3 that satisfy the intersection property.
Setup: Request Sets (Quorums)
R1 = {S1, S2, S4}
R2 = {S2, S3, S5}
R3 = {S3, S4, S6}
R4 = {S4, S5, S7}
R5 = {S1, S5, S6}
R6 = {S2, S6, S7}
R7 = {S1, S3, S7}
(Notice that any two sets have exactly one site in common. For example, R1 ∩ R2 = {S2})
Scenario:
1. Site S1 wants to enter the CS.
S1 sends REQUEST(1) to its quorum members: S1, S2, and S4.
Assuming S1, S2, and S4 are all Unlocked, they each send a REPLY back to S1.
S1, S2, and S4 now set their internal state to Locked/Voted.
S1 receives 3 REPLY messages (from S1, S2, S4). It now has permission and enters the CS.
2. While S1 is in the CS, Site R3 wants to enter.
S3 sends REQUEST(3) to its quorum members: S3, S4, and S6.
S3 sends a REPLY to itself and becomes Locked.
S6 is currently Unlocked, so it sends a REPLY to S3 and becomes Locked.
S4 (the intersection of R1 and R3) receives the REQUEST(3) . However, S4 is already Locked
because it gave its vote to S1.
Therefore, S4 does not reply to S3. It adds REQUEST(3) to its queue.
Result: S3 has received only two REPLY messages (from S3 and S6). It is still waiting for the
reply from S4, so S3 is blocked and cannot enter the CS. Mutual exclusion is achieved.
3. Site S1 exits the CS.
S1 sends RELEASE(1) to its quorum members: S1, S2, and S4.
S1, S2, and S4 receive the RELEASE and set their state to Unlocked/Free.
At Site S4: As soon as it becomes Unlocked, it checks its queue. It finds the pending
REQUEST(3) from S3.
S4 immediately sends a REPLY to S3 and sets its state back to Locked/Voted (now for S3).
4. S3 Enters the CS.
S3 now receives the final required REPLY from S4.
Having received permissions from its entire quorum {S3, S4, S6}, S3 enters the CS.
3. Analysis and Conclusion
Advantage: The primary advantage is low message complexity. In this example, entering the CS
required 2*(K-1) messages (REQUEST/REPLY) and (K-1) messages (RELEASE), for a total of
3*(K-1) = 3*2 = 6 messages. A non-quorum algorithm would have required 2*(N-1) = 2*6 =
12 messages.
Disadvantage: The basic algorithm is prone to deadlock. A circular wait can form where sites are
waiting for REPLY messages from other sites that are themselves waiting. This requires more
complex solutions, such as using timestamps to prioritize requests.
Lamport's Algorithm for Mutual Exclusion
Lamport's algorithm is a foundational non-token-based, permission-based algorithm for distributed
mutual exclusion. Its core idea is to use timestamps (from Lamport's logical clocks) to create a total
ordering of all requests for the Critical Section (CS). This ensures that requests are granted on a fair,
first-come, first-served basis.
1. Core Concepts & Requirements
Timestamps: Each request is assigned a unique timestamp (ts, i) , where ts is the local
logical clock time and i is the site ID. This tuple ensures a total ordering:
If timestamps are different, the lower timestamp has higher priority.
If timestamps are the same (a rare event), the lower site ID ( i ) gets higher priority.
Request Queue: Every site Si maintains a local request_queuei . This queue stores all CS
requests that the site is aware of (including its own) and is always kept sorted in ascending order
of timestamps.
Communication Channels: The algorithm requires that communication channels between any two
sites are FIFO (First-In, First-Out). This ensures that messages from site A to site B are
processed in the order they were sent.
Message Types:
1. REQUEST : To ask for permission to enter the CS.
2. REPLY : To grant permission.
3. RELEASE : To announce that the CS has been exited.
[Link] Algorithm in Detail
Example of Lamport's Algorithm
Let's consider a system with three sites: S1, S2, and S3. All request queues are initially empty.
Scenario:
1. S1 requests the CS:
S1's clock is at 0. It creates a timestamp (1, 1).
S1 adds (1, 1) to its queue: Q1 = [(1, 1)] .
S1 broadcasts REQUEST(1, 1) to S2 and S3.
2. S3 requests the CS (slightly later):
S3's clock is at 0. It creates a timestamp (1, 3).
S3 adds (1, 3) to its queue: Q3 = [(1, 3)] .
S3 broadcasts REQUEST(1, 3) to S1 and S2.
3. Message Reception and Queue Updates:
S2 receives REQUEST(1, 1) from S1. It adds it to its queue: Q2 = [(1, 1)] . S2 sends a
REPLY to S1.
S3 receives REQUEST(1, 1) from S1. It adds it to its queue and sorts it: Q3 = [(1, 1), (1,
3)] . S3 sends a REPLY to S1.
S1 receives REQUEST(1, 3) from S3. It adds it to its queue and sorts it: Q1 = [(1, 1), (1,
3)] . S1 sends a REPLY to S3.
S2 receives REQUEST(1, 3) from S3. It adds it to its queue and sorts it: Q2 = [(1, 1), (1,
3)] . S2 sends a REPLY to S3.
4. Checking Conditions to Enter CS:
Check for S1:
L1: Has S1 heard from everyone with a later timestamp? Yes, it received REQUEST(1, 3)
from S3 and a REPLY from S2. It is now aware of all other sites' intentions. (Condition
Met).
L2: Is S1's request (1, 1) at the top of its queue Q1 = [(1, 1), (1, 3)] ? Yes.
(Condition Met).
Conclusion: S1 enters the Critical Section.
Check for S3:
L1: Has S3 heard from everyone? Yes, it received replies/requests from S1 and S2.
(Condition Met).
L2: Is S3's request (1, 3) at the top of its queue Q3 = [(1, 1), (1, 3)] ? No, (1, 1)
has a higher priority. (Condition Fails).
Conclusion: S3 must wait.
5. S1 Releases the CS:
S1 exits the CS.
S1 removes (1, 1) from its queue: Q1 = [(1, 3)] .
S1 broadcasts a RELEASE message to S2 and S3.
6. S2 and S3 Process the Release:
S2 receives the RELEASE and removes (1, 1) from its queue: Q2 = [(1, 3)] .
S3 receives the RELEASE and removes (1, 1) from its queue: Q3 = [(1, 3)] .
7. S3 Re-checks Conditions:
L1: Still met.
L2: Is S3's request (1, 3) now at the top of its queue Q3 = [(1, 3)] ? Yes. (Condition Met).
Conclusion: S3 can now enter the Critical Section.
Analysis
Message Complexity: To enter the CS once, a site sends (N-1) REQUEST s, receives (N-1)
REPLY s, and sends (N-1) RELEASE s. The total is 3(N-1) messages, which is high.
Guarantees: The algorithm guarantees mutual exclusion, is free from deadlock, and ensures
fairness (no starvation).
Here are the exam-focused notes for Deadlock Handling Strategies in a Distributed Environment:
Introduction: Why is it complicated?
In distributed systems, handling deadlocks is highly complicated because:
1. No global state: No single site has accurate knowledge of the current, up-to-date state of the
entire system.
2. Communication delays: Every inter-site communication involves finite but unpredictable delays.
The Three Deadlock Handling Strategies
There are three primary strategies to handle deadlocks. In distributed systems, the first two are
generally impractical, making the third (Detection) the preferred choice.
1. Deadlock Prevention
How it works: Restrains how processes can request resources to ensure a deadlock can never
happen. This is commonly achieved by:
Forcing a process to acquire all needed resources simultaneously before it begins
execution.
Pre-empting a process that holds a needed resource.
Verdict: Highly inefficient and impractical in distributed systems because it severely limits
concurrency and resource utilization.
2. Deadlock Avoidance
How it works: A resource is granted to a process only if the resulting global system state is proven
to be "safe" (i.e., a state where a deadlock will definitely not occur).
Verdict: Impractical in distributed systems. Determining a "safe" global state requires global
knowledge of all resources and requests, which is impossible to maintain accurately due to
message delays and the lack of a global clock.
3. Deadlock Detection (The Best Approach)
How it works: Allows the system to run without strict restrictions. It periodically examines the
status of process-resource interactions to check for the presence of a cyclic wait (a deadlock
condition).
Verdict: This is the best and most practical approach for handling deadlocks in distributed
systems.
Deep Dive: Deadlock Detection & Resolution
Handling deadlocks using the detection approach entails addressing two basic issues:
A. Detection of existing deadlocks
Detection involves maintaining a Wait-For Graph (WFG) and searching it for cycles.
Wait-For Graph (WFG): A directed graph where nodes represent processes. A directed edge from
P 1 to P 2 exists if process P 1 is blocked and waiting for P 2 to release some resource.
Condition: A system is deadlocked if and only if there exists a directed cycle or knot in the WFG.
Correctness Criteria for Detection Algorithms:
A valid deadlock detection algorithm must satisfy two conditions:
1. Progress (No undetected deadlocks): The algorithm must detect all existing deadlocks in a finite
time. It should not wait for more events to occur once all deadlock dependencies have formed.
2. Safety (No false deadlocks): The algorithm should not report deadlocks that do not actually exist.
(False deadlocks are also called phantom deadlocks). Phantom deadlocks often happen in
distributed systems because sites might obtain out-of-date or inconsistent WFG information.
B. Resolution of detected deadlocks
Once a deadlock is detected, it must be resolved.
How it works: Resolution involves breaking the existing wait-for dependencies between
processes.
Action taken: It involves aborting or rolling back one or more deadlocked processes. The
resources held by the rolled-back process are then assigned to blocked processes so they can
resume execution.
Consistent vs. Inconsistent Global State
In distributed systems, a "global state" is a collection of the individual local states of all participating
processes along with the states of the communication channels. During failure recovery and
checkpointing, ensuring the system rolls back to a consistent state is crucial.
Feature Consistent Global State Inconsistent Global State
Definition A state that could legally occur during a A state that could never occur in a
normal, failure-free execution of the correct, failure-free execution. It is an
distributed computation. impossible physical state.
Feature Consistent Global State Inconsistent Global State
Causality Rule If a process's state reflects the receipt A process's state reflects the receipt
(Send vs. of a message, the sender's state must of a message, but the sender's state
Receive) reflect the sending of that message. does not reflect having sent it.
Orphan Contains no orphan messages Contains orphan messages.
Messages (messages received but not sent).
In-Transit May contain in-transit messages May also contain in-transit
Messages (messages sent but not yet received). messages, but the defining error is
This is perfectly valid and normal. the orphan messages.
System A valid state to which the system can The system cannot resume from this
Recovery safely roll back and resume execution. state because it violates the basic
laws of cause and effect.
Detailed Example
To understand this, let's look at the classic examples of processes exchanging messages and taking
checkpoints.
1. Example of a Consistent State (Figure a in PDF):
Scenario: Process P sends a message m to P . The system state is captured after P sends m ,
1 1 2 1 1
but before P receives it.
2
Why is it consistent? This represents a normal "in-transit" message. It is completely logical for a
message to be on the network (sent) but not yet processed by the receiver. Every message that
has been received has a corresponding "send" event recorded. The system can safely resume from
here.
2. Example of an Inconsistent State (Figure b in PDF):
Scenario: Process P sends message m to P . P receives it and takes a checkpoint. Later, P
1 2 2 2 1
fails and rolls back to a checkpoint taken before it ever sent m . 2
Why is it inconsistent? If we look at the combined global state now, process P 's state says, "I
2
received message m ", but process P 's state says, "I never sent message m ".
2 1 2
The Problem: This violates causality. You cannot receive a message that was never sent. This is
called an orphan message. Such a state is impossible in a failure-free correct computation, and
the system must take further actions (like rolling back P as well) to fix this inconsistency.
2
What is Log-Based Rollback Recovery?
Log-based rollback recovery (also known as message logging) is a fault-tolerance technique in
distributed systems. It combines periodic checkpointing with the logging of nondeterministic
events (such as the receipt of a message).
If a process fails, the system restores the most recent checkpoint and then "replays" the logged
messages in their original order to perfectly reconstruct the process's state right up to the moment of
failure.
The Core Principle: Piecewise Deterministic (PWD) Assumption
This entire method relies heavily on the Piecewise Deterministic (PWD) assumption.
What it means: It assumes that a process's execution is just a sequence of deterministic state
intervals.
The Trigger: Each interval is triggered by a nondeterministic event (like receiving an external
message or an interrupt).
The Guarantee: If a system captures the initial state (via a checkpoint) and records all the
nondeterministic events (in a log), it can deterministically recreate any pre-failure state by simply
feeding those exact same events back into the process in the exact same order.
How the Mechanism Works
1. Checkpointing: Processes periodically save their state to stable storage.
2. Logging (Determinants): During normal execution, whenever a nondeterministic event occurs
(like receiving a message), all the information needed to replay that event (called a determinant) is
recorded in a log file.
3. Failure Recovery:
The failed process is rolled back to its most recent local checkpoint.
The system reads the log file and replays the saved messages/events.
The process executes deterministically, returning exactly to the state it was in before it crashed.
Three Types of Log-Based Recovery
Depending on when and how the logs are written to stable storage, this approach is categorized into
three types:
1. Pessimistic Logging: Every event is immediately saved to stable storage before it is allowed to
affect the system state. It guarantees no orphan processes and makes recovery very fast, but adds
high latency during normal execution.
2. Optimistic Logging: Events are processed immediately and logged to volatile memory first, then
written to stable storage asynchronously later. It offers great performance during normal operation
but makes the recovery process much more complex.
3. Causal Logging: Strikes a balance between the two. It logs determinants by tracking causal
dependencies piggybacked on messages, reducing the performance hit of pessimistic logging
without the extreme recovery complexity of optimistic logging.
Advantages & Disadvantages
Advantages:
Avoids the Domino Effect: Failed processes can be brought completely up to date individually.
Other processes usually do not need to roll back just because one process failed.
More Recent State: Replaying logs allows the process to recover to a state much closer to the
crash than checkpointing alone.
Disadvantages:
Performance Overhead: Constantly saving message logs to stable storage consumes disk space
and network bandwidth.
Duplicate Messages: As mentioned in the PDF issues, when a process replays its log to handle
lost messages, it might accidentally resend messages that other processes have already received,
resulting in duplicate messages.
Lamport’s Bakery Algorithm
Lamport’s Bakery Algorithm is a classic solution for the n-process mutual exclusion problem in
shared memory systems. It is called the "Bakery Algorithm" because it mimics the way customers are
served in a bakery: you take a ticket number, and the person with the lowest number is served first.
1. Core Concept & Analogy
Taking a Number: A process wanting to enter the Critical Section (CS) picks a "ticket number"
(timestamp).
The Rule: The process with the lowest ticket number enters the CS first.
Tie-Breaking (Lexicographical Order): If two processes pick the same ticket number at the same
time, the process with the lower Process ID (PID) is given priority.
This is represented as the tuple: [ticket_number, PID].
2. Data Structures Used
The algorithm uses two shared arrays:
1. choosing[1...n] (Boolean): choosing[i] is true if process P is currently in the process of
i
picking its ticket number.
2. timestamp[1...n] (Integer): timestamp[i] stores the ticket number of process P . A value of 0
i
means the process is not requesting the CS.
3. The Algorithm Steps
Phase 1: Entry Section (Taking a Ticket)
1. Set choosing[i] = true .
2. Pick a number: timestamp[i] = max(timestamp[1...n]) + 1 .
3. Set choosing[i] = false .
Phase 2: The Waiting Loop (Checking Others)
For every other process j from 1 to n:
1. Wait if process j is currently picking a number ( choosing[j] is true).
2. Wait if process j has a valid ticket AND (P has a smaller ticket OR P has the same ticket but a
j j
smaller PID).
Phase 3: Critical Section
The process executes its Critical Section.
Phase 4: Exit Section
Set timestamp[i] = 0 (Releasing the ticket).
4. Requirements Satisfied
The algorithm is proven to satisfy the three requirements of the critical section problem:
1. Mutual Exclusion: Only one process can have the lowest [timestamp, PID] tuple at any time.
2. Progress: A process is guaranteed to enter the CS because ticket numbers always increase.
3. Bounded Waiting: A process i can be overtaken by another process j at most once after i has
picked its ticket number.
5. Performance & Complexity
Space Complexity: O(n) registers (requires an array of n timestamps). This is the lower bound;
you cannot have a more space-efficient algorithm for this problem.
Time Complexity: O(n) because a process must check the status of all n − 1 other processes
before entering the CS.
Disadvantage: In the worst case, it does not guarantee a bounded delay (very high wait times).
Types of Messages in Rollback Recovery
When a process fails and the system rolls back to a previous "recovery line" (a set of checkpoints),
several messages may end up in abnormal states. Understanding these is crucial for restoring a
consistent system state.
1. In-Transit Messages
Definition: Messages that have been sent but not yet received at the time the global state was
captured.
Status: These are not problematic. In a reliable network, they will eventually be delivered. A
consistent global state must include these to ensure they are handled in any legal execution.
2. Lost Messages
Definition: Messages whose "send" is recorded (it happened before the sender's checkpoint) but
the "receive" is undone (it happened after the receiver's checkpoint).
Problem: The receiver "forgets" it ever got the message.
Handling: Usually handled by message logging, where the sender replays the message from its
log during recovery.
3. Delayed Messages
Definition: Messages whose "receive" is not recorded because the receiving process was either
down or the message arrived after the process had already rolled back.
Problem: Similar to lost messages, the system must ensure these are eventually processed
correctly once the receiver is back online.
4. Orphan Messages
Definition: Messages whose "receive" is recorded but the "send" is undone due to rollback.
Problem: This is a serious error (violates causality). It creates a state where a message was
received but never sent.
Solution: These are avoided if the system rolls back to a consistent global state.
5. Duplicate Messages
Definition: Messages that are received more than once by a process.
Cause: This usually occurs during recovery when a process replays its message log to handle
lost messages, unintentionally resending a message that was already received before the failure.
6. Delayed Orphan Messages
Definition: A combination where the "send" event is undone due to a sender rolling back, but the
message is still traveling through the network.
Problem: When these messages finally arrive at their destination, they must be discarded
because their origin no longer exists in the current system state.
Summary Table
Message Type Send Event Receive Event Status/Problem
In-Transit Recorded Not yet happened Valid / Normal
Lost Recorded Undone by rollback Needs re-sending (logging)
Orphan Undone by rollback Recorded Impossible State (Causality error)
Duplicate Sent twice Received twice Result of log replaying
Delayed Orphan Undone Arrives post-rollback Must be discarded