Introduction to Distributed Systems
•What is a Distributed System?
• A collection of autonomous computing nodes (computers/processes).
• Interconnected, they communicate via message passing (no shared memory).
• Work together to achieve a common goal.
•Key Theoretical Challenges:
• Lack of a Global Clock: No single, universally agreed-upon time.
• Lack of Shared Memory: Processes cannot directly observe each other's state.
• Independent Failures: Parts can fail while others continue.
The Challenge of Time
Physical Clocks:
• Each node has its own local physical clock.
Problems:
• Clock Drift: Clocks run at slightly different rates and diverge over time.
• Synchronization Issues: Impossible to perfectly synchronize clocks across a network due to
variable delays.
• No Universal "Now": An event's perceived time varies between nodes.
Attempted Solutions: Protocols like NTP (Network Time Protocol) try to synchronize physical clocks,
but perfect synchronization remains elusive.
Consequence: Cannot rely on physical timestamps for accurate event ordering across the entire
system.
Network Time Protocol (NTP)
Network Time Protocol (NTP)
NTP (Network Time Protocol) is a networking protocol used for clock synchronization between computer
systems over packet-switched networks. The diagram above shows its hierarchical structure:
Key Components:
1. Stratum Levels:
• Stratum 0: Most accurate time sources (Atomic clocks, GPS satellites, radio clocks)
• Stratum 1: Primary time servers directly connected to Stratum 0 sources
• Stratum 2: Secondary servers that get time from Stratum 1 servers
• Stratum 3-15: Further hierarchical levels (lower stratum = higher accuracy)
2. How It Works:
• Time information flows from top to bottom in the hierarchy
• Each level synchronizes with the level above it
• Accounts for network delays to calculate accurate time
• Uses sophisticated algorithms to compensate for clock drift
Network Time Protocol (NTP)
4. Key Features:
• Port: Uses UDP port 123
• Accuracy: Can achieve millisecond-level precision
• Redundancy: Multiple time sources prevent single points of failure
• Automatic: Continuously adjusts for network delays and clock variations
• Scalable: Hierarchical design distributes load efficiently
5. Benefits:
• Keeps all network devices synchronized
• Essential for logging, security certificates, financial transactions
• Prevents time-related conflicts in distributed systems
• Built into most modern operating systems
NTP is one of the oldest internet protocols still in use and is crucial for maintaining accurate time
across global networks.
Cristian's Algorithm
Cristian's algorithm is a clock synchronization algorithm used in distributed systems. Its main purpose is to
synchronize the clocks of client machines to a time server. It’s most suitable for environments where network
latency is low and there is a reliable central time server.
Steps of Cristian’s Algorithm
1. Client Requests Time: The client sends a request to the time server at time T0 (according to its local
clock).
2. Server Responds: The server receives the request, notes its current time (TSERVER), and sends this time
back to the client.
3. Client Receives Time: The client receives the response at time T1 (again, according to its local clock).
4. Calculate Network Delay: The round-trip delay is calculated as T1−T0
5. Set Client Clock: The client sets its clock to the time reported by the server plus half of the round-trip
delay: T_CLIENT=T_SERVER+(T1−T0)/2
This formula assumes that network delays are roughly symmetric for the request and response.
Cristian's Algorithm
Advantages
• Simple & Fast: Only one round-trip communication is needed.
• Works well for small, low-latency networks.
• Low network traffic: Minimal messages.
Disadvantages
• Requires a trusted, accurate time server.
• Not resilient to network failures or high latencies.
• Vulnerable to malicious attacks if the time server is compromised.
Use Case
Cristian's algorithm is specifically used for allocating physical (hardware) clock synchronization in
distributed systems—not for logical clocks. It aims to keep the wall-clock time of computers in-sync
using a time reference, typically Coordinated Universal Time (UTC).
Berkeley's Algorithm
Berkeley's Algorithm is a clock synchronization method used in distributed systems where no
machine has an accurate external time source (such as a UTC server). It was developed to achieve
synchronization among multiple nodes that each may have their own drifting clocks.
How Berkeley's Algorithm Works
1. Leader Election: One node in the distributed system is chosen as the master or leader (using an
election algorithm like Chang and Roberts).
2. Polling Clocks: The master periodically polls all other nodes (called followers or slaves) to get
their current clock times.
3. Calculating Average Time:
• The master estimates each follower's time adjusting for network delay.
• Then, it calculates the average of all collected times including its own.
• Outliers (nodes with times significantly different) can be ignored.
Berkeley's Algorithm
4. Broadcast Time Adjustment:
• Instead of sending the absolute updated time, the master sends a correction (positive or
negative) to each follower.
• Each follower adjusts its clock by this delta.
• This avoids additional errors from network delay in broadcasting the time.
5. Clock Adjustments:
• Systems often avoid moving clocks backward (negative corrections) by slowing down clock
speed or briefly stopping it, to maintain monotonic time.
Logical Clocks
Purpose: Assign timestamps to events consistent with the happened-before relation, without
relying on physical time.
Lamport's Logical Clocks:
Mechanism: Each process maintains a counter.
• Increment counter for internal events.
• Send current counter with messages.
• On receive, update local counter to max(local_counter,message_counter)+1.
Property: If a→b, then C(a)<C(b).
Limitation: If C(a)<C(b), it doesn't necessarily mean a→b (could be concurrent).
Lamport's Logical Clocks
What is Lamport's Logical Clock?
Lamport's Logical Clock assigns timestamps to events in a distributed system such that if event A causally precedes
event B, then the timestamp of A is less than the timestamp of B.
The Three Rules:
1. Internal Event Rule
• Before executing any internal event, a process increments its logical clock by 1
• LC = LC + 1
2. Send Message Rule
• When sending a message, the sender includes its current logical clock value with the message
• The sender then increments its clock before the next event
3. Receive Message Rule
• Upon receiving a message with timestamp t, the receiver updates its clock:
• LC = max(current_LC, received_timestamp) + 1
• This ensures the receiver's clock is ahead of both its previous value and the sender's timestamp
Lamport's Logical Clocks
Key Benefits:
• Causal Ordering: Maintains the "happens-before" relationship
• No Physical Synchronization: Works without synchronized clocks
• Partial Ordering: Establishes ordering for causally related events
• Simplicity: Easy to implement and understand
Limitations:
• Cannot determine if two events are concurrent
• Clock values can grow large over time
• Only provides partial ordering, not total ordering
This algorithm is foundational for many distributed system protocols and is essential for maintaining
consistency in distributed databases, message ordering, and distributed mutual exclusion.
Logical Clocks
Vector Clocks:
Mechanism: Each process maintains a vector (array) of counters, one for each process
in the system.
• Increment own entry for internal events.
• Send entire vector with messages.
• On receive, update local vector by taking element-wise maximum with message
vector.
Properties:
If a→b, then V(a)<V(b).
Crucially, it can determine concurrency: a∥b if neither V(a)<V(b) nor V(b)<V(a).
States in Distributed Systems
System State Definition:
• The collection of local states of all processes.
• The states of all messages in transit across communication channels.
The Global State Problem:
• Challenge: Due to the absence of a global clock, it's impossible to capture an instantaneous snapshot
of the entire system.
• Inconsistent Global State: A naive collection of local states taken at roughly the same physical time
can be inconsistent (e.g., message received but not yet sent in the snapshot). This is an inconsistent
cut.
Consistent Global State (Consistent Cut):
• A snapshot of local states and in-transit messages where, if a message's receive event is included, its
corresponding send event must also be included.
• Represents a state the system could have actually been in.
Lamport’s Vector Clock
This algorithm extends Lamport's concept to capture a more precise causal relationship, allowing for the
detection of concurrency. Each process maintains a vector with one entry for every process in the system.
Algorithm for a process Piin a system of N processes:
Initialization: The local vector clock, Vi, is initialized to a vector of all zeros: Vi=[0,0,...,0].
On a Local Event: When an event occurs within process Pi:
a. Increment its own entry in the vector: Vi[i]=Vi[i]+1.
On Sending a Message: When process Pisends a message M:
a. Increment its own entry: Vi[i]=Vi[i]+1.
b. Send the message M along with a copy of its current vector clock, Vi.
On Receiving a Message: When process Pjreceives a message M with a vector timestamp VM:
a. For each entry k in the vector, update its own vector by taking the element-wise maximum:
Vj[k]=max(Vj[k],VM[k]) for k=1,...,N.
b. Increment its own entry: Vj[j]=Vj[j]+1.
Lamport’s Vector Clock
The Chandy-Lamport Snapshot Algorithm
This algorithm captures a consistent global state of an asynchronous distributed system. It relies on
special "marker" messages to coordinate the snapshot.
Algorithm:
1. Initiator Process: Any process can initiate the snapshot. The initiator process Pi performs the
following:
a. Records its own local state.
b. For each outgoing channel, it sends a marker message.
2. Non-Initiator Process: A process Pjreceiving a marker from channel C for the first time:
a. Records its own local state.
b. Records the state of channel C as empty.
c. For each of its outgoing channels, it sends a marker message.
d. d. Starts recording all messages arriving on its other incoming channels.
The Chandy-Lamport Snapshot Algorithm
3. Subsequent Marker Receipt: A process Pjreceiving a marker from channel C after it has
already recorded its state:
a. It stops recording messages on channel C.
b. Records the state of channel C as the set of all messages it has received on that
channel since it first recorded its own state and before receiving this marker.
4. Termination: The algorithm terminates when every process has received a marker on all
of its incoming channels. The recorded local states and channel states together form the
consistent global state.
Ricart-Agrawala Mutual Exclusion Algorithm
This is a permission-based algorithm to ensure that only one process can access a critical
section at a time. It uses a logical clock (like Lamport's) to order requests.
Algorithm for a process Pi:
1. Requesting the Critical Section: To enter the critical section, Pi creates a timestamped
request message (using its logical clock) and sends it to all other processes in the
system.
2. Receiving a Request: When process Pjreceives a request from Pi:
a. If Pj is not requesting or is not in the critical section, it immediately sends a REPLY
message back to Pi.
b. If Pjis already in the critical section, it defers the reply to Pi.
Ricart-Agrawala Mutual Exclusion Algorithm
c. If Pjis also requesting the critical section, it compares its own request's timestamp
with Pi's request timestamp.
i. If Pj's timestamp is smaller, it defers the reply.
ii. If Pi's timestamp is smaller, Pj sends an immediate REPLY and defers its own
entry.
3. Entering the Critical Section: Pi enters the critical section only after it has received a
REPLY from all other processes.
4. Exiting the Critical Section: After exiting, Pisends a REPLY to all deferred requests.
Difference Table
Aspect Ricart-Agrawala Chandy-Lamport
Mutual Exclusion (Critical Section Global State Recording
Purpose
control) (Snapshot)
Year 1981 1985
Avoid multiple processes entering Capture consistent global state in
Problem Solved
CS simultaneously async system
Type Permission-based algorithm Snapshot-based algorithm
Message Complexity 2*(N-1) per CS entry 1 Marker per channel
Synchronization for exclusive
Focus Consistency across processes
access
Distributed checkpointing,
Example Use Database write lock
recovery
Maekawa's Mutual Exclusion Algorithm
This is a quorum-based algorithm that reduces the message overhead of Ricart-Agrawala by
requiring a process to request permission from only a subset of other processes.
Algorithm for a process Pi:
1. Quorum Definition: Each process Piis a member of a quorum set Vi. The key property is
that any two quorum sets, Viand Vj, must have a non-empty intersection (Vi ∩ Vj≠ ∅).
2. Requesting the Critical Section: To enter the critical section, Pisends a timestamped
request message to all processes in its quorum set Vi.
Maekawa's Mutual Exclusion Algorithm
3. Receiving a Request: When a process Pjin Pi's quorum receives the request:
a. It checks its status. If it has not granted a "lock" to any other process, it grants a lock
to Piand sends a REPLY.
b. If it has already granted a lock to another process, it defers Pi's request.
4. Entering the Critical Section: Pienters the critical section only after receiving a REPLY
from every process in its quorum set Vi.
5. Exiting the Critical Section: After exiting, Pisends a "release" message to all processes in
its quorum, freeing the lock for the next requesting process.
Maekawa's Mutual Exclusion Algorithm
Maekawa's Mutual Exclusion Algorithm
Maekawa's Mutual Exclusion Algorithm
Maekawa's Mutual Exclusion Algorithm
Maekawa's Mutual Exclusion Algorithm
Maekawa's Mutual Exclusion Algorithm
Deadlock Detection & Resolution
What is a Deadlock?
• Definition: A situation where a set of processes are blocked indefinitely.
• Each process is waiting for a resource held by another process in the set.
• The four conditions for a deadlock:
1. Mutual Exclusion: Resources cannot be shared.
2. Hold and Wait: A process holds a resource while waiting for another.
3. No Preemption: Resources cannot be taken away.
4. Circular Wait: A chain of processes waiting for each other.
Challenges in Distributed Systems
• No Global State: No single process has a complete, real-time view of the system.
• Communication Delays: Messages take time, leading to outdated information.
• Scalability: Large number of processes and resources makes centralized solutions
inefficient.
Deadlock Detection (Wait-For Graph)
Wait-For Graph (WFG):
•A directed graph used to model resource dependencies.
•Nodes: Represent processes.
•Edges: A directed edge from Pito Pjmeans Piis waiting for a resource held by Pj.
•Deadlock Condition: A cycle in the Wait-For Graph indicates a deadlock.
Centralized Deadlock Detection
•Concept: A single coordinator node is responsible for detection.
Mechanism:
• Processes send their resource request/release info to the coordinator.
• Coordinator builds and maintains the global WFG.
• Coordinator periodically checks for cycles.
Pros: Simple to implement.
Cons:
• Single Point of Failure: If coordinator fails, detection stops.
• Performance Bottleneck: Coordinator can become overwhelmed.
• False Deadlocks: Outdated information can lead to incorrect cycle detection.
Distributed Deadlock Detection
Concept: No single coordinator; all processes cooperate.
Algorithm Example: Chandy-Misra-Haas
Probe Messages: Special messages sent to detect cycles.
Mechanism:
1. A waiting process sends a probe message to the process it's waiting for.
2. The probe is forwarded along the dependency chain (the WFG).
3. If a probe returns to its originator, a deadlock cycle is confirmed.
Pros: No single point of failure; more scalable.
Cons: More complex; can generate many messages.
Deadlock Resolution
Concept: Once a deadlock is detected, a strategy is needed to break the cycle.
1. Process Termination:
• Simplest method: Kill one or more processes involved in the deadlock.
• Resources are released.
• Drawback: Can lead to lost work and is often a drastic solution.
2. Resource Preemption:
• Forcibly take a resource from a process.
• The preempted process must be rolled back to a previous state.
• Drawback: Can be inefficient and complex to manage.
3. Process Rollback:
• Roll back one or more deadlocked processes to a previous checkpoint.
• This releases the held resources and breaks the cycle.
• Benefit: Saves a process's progress, but requires a robust check pointing mechanism.