0% found this document useful (0 votes)
10 views19 pages

Challenges in Distributed Systems Consistency

The document discusses the challenges of maintaining global consistency in distributed transactions, highlighting issues such as network latency, partial failures, and clock skew. It explains how distributed locks and clocks help address these challenges by ensuring mutual exclusion and establishing causality. Additionally, it explores the implications of the CAP theorem in cloud-based health records and contrasts the scalability and consistency approaches of NoSQL and NewSQL systems with traditional DDBMS.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
10 views19 pages

Challenges in Distributed Systems Consistency

The document discusses the challenges of maintaining global consistency in distributed transactions, highlighting issues such as network latency, partial failures, and clock skew. It explains how distributed locks and clocks help address these challenges by ensuring mutual exclusion and establishing causality. Additionally, it explores the implications of the CAP theorem in cloud-based health records and contrasts the scalability and consistency approaches of NoSQL and NewSQL systems with traditional DDBMS.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd

Distributed Systems: Challenges and Solutions

Conversation with Gemini

8) Explain the challenges of maintaining global consistency in distributed transactions. How do

distributed locks and clocks help address these challenges?

4. Cloud-Based Health Records: A hospital stores patient data across multiple data

centers with strict consistency requirements.

→ Discuss how CAP theorem constraints influence system design in this scenario

12) Discuss how NoSQL and NewSQL systems address scalability and consistency differently

from traditional DDBMS

Please the above question in Deep explanation for University assignment

With IEEE reference


This explanation covers the fundamental challenges of distributed transactions, the mechanics of
synchronization, and the design implications of the CAP theorem in a clinical setting.

1. Challenges of Maintaining Global Consistency

In a Distributed Database Management System (DDBMS), maintaining global consistency—ensuring all


nodes reflect the same state simultaneously—is inherently difficult due to the "fallacies of distributed
computing."

Primary Challenges:

Network Latency and Asynchrony: Communication between nodes is not instantaneous. Updates at
Node A take time to reach Node B, creating a window of inconsistency where different users see
different data.

Partial Failures: In a single-machine system, the system is either up or down. In a distributed system, a
subset of nodes or network links may fail. Coordinating a transaction when some participants are
unreachable is a major bottleneck.

Clock Skew: Physical clocks on different servers are never perfectly synchronized. This makes it difficult
to determine the "true" order of events (e.g., did Transaction A happen before Transaction B?).

The Split-Brain Problem: During a network partition, two groups of nodes might both think they are the
"leader," leading to divergent data states that are difficult to reconcile later.

2. Role of Distributed Locks and Clocks

To mitigate these challenges, distributed systems employ specialized synchronization mechanisms:

A. Distributed Locks

Locks ensure mutual exclusion across multiple nodes.


How they help: They prevent race conditions by ensuring only one process can modify a specific data
item at a time.

Implementation: Often managed by coordination services like Apache ZooKeeper or etcd. These use
consensus algorithms (like Paxos or Raft) to ensure that the "granting" of a lock is itself consistent across
the cluster.

Challenge: If a process holds a lock and crashes, the resource could be locked forever. This is solved
using Leases (locks with an expiration time).

B. Distributed Clocks

Since physical time is unreliable, systems use Logical Clocks and Hybrid Clocks.

Lamport & Vector Clocks: These do not track "time" in seconds but rather "causality." They help
determine if event A caused event B, allowing the system to order transactions correctly without a
global clock.

TrueTime (Google Spanner): Uses atomic clocks and GPS receivers to provide a tight bound on clock
uncertainty (ϵ). This allows the system to assign globally monotonic timestamps, effectively achieving
external consistency at a global scale.

3. Case Study: Cloud-Based Health Records & CAP Theorem

In a scenario where a hospital stores patient data across multiple data centers, the CAP Theorem
(Consistency, Availability, Partition Tolerance) dictates the architectural boundaries.

CAP Theorem Influence:

Partition Tolerance (P) is Mandatory: In a cloud environment spanning multiple data centers, network
partitions are inevitable. Therefore, the design choice is between CP (Consistency + Partition Tolerance)
or AP (Availability + Partition Tolerance).
+1

The Clinical Requirement (Consistency): Healthcare data (e.g., drug dosages, allergy alerts) requires
Strict Consistency. A doctor in Data Center A must see the prescription updated by a nurse in Data
Center B immediately.

The Design Trade-off (Sacrificing Availability): Because the hospital requires strict consistency, the
system must be a CP system.

Impact: If the network link between data centers fails, the system may reject updates or reads to
prevent "stale" or "conflicting" data. While this reduces "availability" (the system might be temporarily
read-only or error out), it ensures patient safety by preventing conflicting medical records.

4. NoSQL and NewSQL vs. Traditional DDBMS

The evolution of database systems reflects different priorities regarding the "Scalability vs. Consistency"
trade-off.

Feature Traditional DDBMS (RDBMS) NoSQL Systems NewSQL Systems

Scalability Vertical (Scale-up). Hard to scale across many nodes. Horizontal (Scale-out). Designed
for thousands of nodes. Horizontal (Scale-out). Combines NoSQL speed with SQL structure.

Consistency Strong (ACID). Immediate consistency via 2PC/3PC. Eventual (BASE). Prioritizes
availability; data syncs over time. Strong (ACID). Uses modern consensus (Raft/Paxos) for speed.

Data Model Rigid Schema (Tables/Rows). Flexible (Key-Value, Document, Graph). Relational (SQL)
with distributed sharding.

Example Oracle, MySQL, PostgreSQL. Cassandra, MongoDB, DynamoDB. Google


Spanner, CockroachDB, VoltDB.

Key Distinction:

NoSQL addressed the "Big Data" problem by relaxing ACID guarantees in favor of the BASE model
(Basically Available, Soft-state, Eventual consistency).
NewSQL emerged to bridge the gap, providing the horizontal scalability of NoSQL while maintaining the
ACID compliance required for mission-critical applications (like the health records case above).

IEEE References

E. Brewer, "CAP twelve years later: How the "rules" have changed," Computer, vol. 45, no. 2, pp. 23-29,
Feb. 2012, doi: 10.1109/MC.2012.37.

D. J. Abadi, "Consistency Tradeoffs in Modern Distributed Database System Design," Computer, vol. 45,
no. 2, pp. 37-42, Feb. 2012, doi: 10.1109/MC.2012.33.

J. Corbett et al., "Spanner: Google’s Globally Distributed Database," ACM Transactions on Computer
Systems (TOCS), vol. 31, no. 3, pp. 1-22, Aug. 2013.

Would you like me to create a comparison table for specific NoSQL databases (like MongoDB vs.
Cassandra) based on their CAP classification?

Only for 2 in deep

Maintaining global consistency in distributed transactions is a primary concern in Distributed Database


Management System (DDBMS) design. For your university assignment, this response provides an
academic breakdown of the challenges and the mechanisms used to solve them.

1. Challenges of Global Consistency in Distributed Systems

The difficulty in maintaining a "single version of the truth" across multiple nodes stems from three core
architectural hurdles:

A. The Absence of a Global State


In a single-machine system, the CPU and memory share a common bus and a single clock. In a
distributed system, there is no shared memory and no global physical clock. Because message
propagation takes a non-zero, variable amount of time, a node can only observe the state of another
node by receiving a message, which is already "stale" by the time it arrives.

B. Network Partitions and Asynchrony

Networks are inherently unreliable. A "Partition" occurs when nodes are alive but cannot communicate.
In such cases, if the system allows updates to continue, different nodes will diverge (the Split-Brain
problem). If it stops updates, the system loses availability.

C. Clock Skew and Drift

Even high-precision quartz oscillators in servers drift at different rates. This makes it impossible to rely
on "wall-clock time" to order transactions. For example, if Node A timestamps an event at [Link] and
Node B at [Link], it might appear B happened first, even if A actually triggered B.

2. Distributed Locks: Ensuring Mutual Exclusion

Distributed locks serve as a synchronization primitive to ensure that only one process or node can access
a shared resource at a given time across the entire network.

Mechanism: They implement Mutual Exclusion (Mutex) at a global scale. A node must "acquire" the lock
from a coordination service (like Apache ZooKeeper or etcd) before executing a transaction.

Addressing Challenges:

Race Conditions: By serializing access to data, locks prevent multiple nodes from overwriting the same
record simultaneously.

Fencing Tokens: To solve the problem of a "zombie" process (a node that thinks it still has a lock but was
actually timed out due to a network lag), distributed locks often use incrementing tokens. The storage
layer will only accept writes from the client with the highest token number.
3. Distributed Clocks: Ordering and Causality

Since physical clocks are unreliable, distributed systems use Logical Clocks and Physical Clock
Synchronization to establish the "Happened-Before" relationship.

A. Logical Clocks (Lamport and Vector Clocks)

Logical clocks do not track time in seconds but in sequences.

Lamport Clocks: Every message sent includes a counter. Upon receiving a message, a node updates its
counter to max(local_clock,message_clock)+1. This ensures a total ordering of events.

Vector Clocks: These track the state of every node in the system, allowing the detection of causality
violations and concurrent updates (conflicts).

B. Physical Sync (TrueTime and NTP)

Modern systems like Google Spanner use "TrueTime," which utilizes GPS and Atomic clocks.

How it helps: Instead of providing a single timestamp, it provides a confidence interval [earliest,latest].
By waiting out the maximum possible uncertainty (ϵ), the system can guarantee that a transaction
committed at time T

will always be seen as occurring before a transaction started at T

if T

2
>T

4. CAP Theorem in Cloud Health Records

In a hospital scenario with data spread across multiple data centers, the CAP Theorem (Consistency,
Availability, Partition Tolerance) dictates the following:

P (Partition Tolerance) is Non-Negotiable: Because data centers are connected via wide-area networks
(WAN), interruptions are guaranteed. The system must survive these.

Consistency (C) vs. Availability (A): For medical records, Consistency is a safety requirement. A doctor
must not see a "stale" medication list.

The Resulting Design: The system must be a CP system.

Design Influence: During a network failure between data centers, the system will block or error out
rather than serve potentially incorrect data. This ensures that a patient's allergy update in Data Center A
is not ignored by a prescription order in Data Center B.

IEEE References

E. Brewer, "CAP twelve years later: How the 'rules' have changed," Computer, vol. 45, no. 2, pp. 23-29,
2012, doi: 10.1109/MC.2012.37.

M. Kleppmann, "How to do distributed locking," Distributed Systems Research, 2016. [Online]. Available:
[Link]
J. Corbett et al., "Spanner: Google’s Globally Distributed Database," ACM Transactions on Computer
Systems (TOCS), vol. 31, no. 3, 2013.

Would you like me to expand on the specific implementation of the Two-Phase Commit (2PC) protocol
used in these transactions?

Distributed Locks in System Design

This video explains the mechanics of distributed locking and the Redlock algorithm, which is essential for
understanding how systems manage mutual exclusion when multiple nodes compete for the same
resource.

How Distributed Lock works | ft Redis | System Design - YouTube

ByteMonk · 36K views

Only for 8 in deep

For a university-level assignment, the challenges of maintaining global consistency must be framed
through the lens of the "Fallacies of Distributed Computing" and the fundamental constraints of state
coordination across decentralized nodes.

1. Challenges of Global Consistency in Distributed Transactions

In a distributed environment, "Global Consistency" means that all nodes observe the same data state at
the same logical time. Achieving this is exceptionally difficult due to three primary "impossibility"
factors:

A. The Asynchrony of Communication (Network Uncertainty)


Unlike a centralized system where components share a common bus, distributed nodes communicate
over unreliable networks.

Challenge: There is no upper bound on message delivery time. If Node A sends an update to Node B,
Node A cannot distinguish between a slow network, a crashed Node B, or a dropped packet.

Impact: This leads to the CAP Theorem trade-off: in the event of a network partition (P), a system must
choose between serving potentially stale data (Availability) or refusing to answer until all nodes agree
(Consistency).

B. Lack of a Global State and Memory

No single node has a complete, instantaneous view of the entire system.

Challenge: By the time a "snapshot" of the system is collected at a central coordinator, the state of the
individual nodes has likely already changed.

Impact: This makes "Atomic" operations (where either all nodes commit or none do) extremely
expensive in terms of latency, as it requires multiple rounds of communication (e.g., Two-Phase
Commit).

C. Clock Skew and Drift

Physical clocks are driven by quartz oscillators that drift at different rates due to temperature and
hardware variances.

Challenge: You cannot rely on "wall-clock time" to order transactions. If Transaction A happens at
[Link] on Server 1, and Transaction B happens at [Link] on Server 2, B might appear to happen first
even if A actually caused B.

Impact: Without a way to order events, the system cannot resolve conflicts accurately (e.g., "Last Write
Wins" becomes arbitrary).
2. Addressing Challenges with Distributed Locks

Distributed locks provide Mutual Exclusion across an entire cluster, ensuring that only one transaction
can modify a specific data object at a time.

How they help: They prevent Race Conditions. In a distributed environment, if two nodes try to update a
patient's record simultaneously, a distributed lock manager (like ZooKeeper or etcd) ensures only one
"owns" the resource.

Fencing Tokens: To solve the "zombie process" problem (where a node holds a lock but hangs, and the
lock is eventually timed out and given to someone else), distributed locks use Fencing Tokens. Every
time a lock is granted, it comes with a monotonically increasing number. The storage layer checks this
number; if it receives a write with an "old" token, it rejects it, preventing a delayed process from
corrupting data.

3. Addressing Challenges with Distributed Clocks

Distributed clocks are used to establish a Total Ordering of events and ensure Causality.

A. Logical Clocks (Lamport & Vector Clocks)

Logical clocks do not track "seconds" but rather "causality."

Mechanism: Every message sent between nodes includes a counter. When a node receives a message, it
updates its own counter to max(local,received)+1.

How they help: This creates a "Happen-Before" (→) relationship. If A→B, we know A's state could have
influenced B. This allows the system to reconstruct a consistent history of events even if physical times
were wrong.

B. TrueTime and Physical Synchronization


Modern systems (like Google Spanner) use high-precision physical clocks (Atomic/GPS).

Mechanism: Instead of returning a single time, the clock returns an interval [earliest,latest] representing
the uncertainty (ϵ).

How they help: To ensure global consistency, a transaction is not allowed to commit until the system is
certain that the "current time" has passed the "commit time" plus the uncertainty ϵ. This "Wait" ensures
that any subsequent transaction will strictly have a higher timestamp, achieving External Consistency.

4. Summary of Mechanisms vs. Challenges

Challenge Solution Mechanism

Concurrency/Race Conditions Distributed Locks Ensures only one node modifies data (Mutual
Exclusion).

Ordering of Events Logical Clocks Establishes "Happen-Before" causality.

Global Timestamping TrueTime / GPS Clocks Reduces uncertainty to allow global serializability.

Partial Failures Consensus Protocols (e.g., Paxos/Raft) used to manage locks/clocks reliably.

IEEE References

E. Brewer, "CAP twelve years later: How the 'rules' have changed," Computer, vol. 45, no. 2, pp. 23-29,
2012, doi: 10.1109/MC.2012.37.

L. Lamport, "Time, Clocks, and the Ordering of Events in a Distributed System," Communications of the
ACM, vol. 21, no. 7, pp. 558-565, 1978. (Foundational IEEE/ACM standard).

J. Corbett et al., "Spanner: Google’s Globally Distributed Database," ACM Transactions on Computer
Systems (TOCS), vol. 31, no. 3, 2013.

Would you like me to elaborate on the specific logic of the Raft Consensus Algorithm, which is often
used to implement these distributed locks?
Only for 12 in deep

In a university context, the transition from traditional Distributed Database Management Systems
(DDBMS) to NoSQL and NewSQL is framed as an evolution of the CAP Theorem trade-offs. While
traditional systems prioritize transactional integrity, modern systems address the explosive growth of
data and global distribution through distinct architectural shifts.

1. Traditional DDBMS: The ACID-Strong Model

Traditional DDBMS (e.g., Oracle, DB2, or distributed PostgreSQL) were designed for Strong Consistency
and ACID (Atomicity, Consistency, Isolation, Durability) compliance.

Scalability Approach: They primarily rely on Vertical Scaling (scaling up hardware). Horizontal scaling is
difficult because maintaining strict ACID properties across a network requires heavy coordination
protocols like Two-Phase Commit (2PC).

Consistency Mechanism: They use locking and synchronous replication. This ensures that a data item is
locked on all nodes until a transaction completes, preventing any other user from seeing an
intermediate state.

The Limitation: As the number of nodes increases, the overhead of 2PC grows exponentially, leading to
high latency and a "single point of failure" if the coordinator node goes down.

2. NoSQL: Prioritizing Scalability (The BASE Model)

NoSQL systems (e.g., Cassandra, MongoDB, DynamoDB) emerged to handle "Big Data" by relaxing the
strict requirements of traditional RDBMS.

Scalability Approach: They are designed for Horizontal Scalability (scaling out). Data is partitioned across
many cheap, commodity servers using Sharding.

Consistency Mechanism (BASE): Instead of ACID, NoSQL follows the BASE principle:
Basically Available: The system guarantees a response, even if it’s not the most recent data.

Soft-state: The state of the system may change over time without input (due to background sync).

Eventual Consistency: If no new updates are made, all replicas will eventually become consistent.

How it Addresses Scalability: By allowing "stale reads," NoSQL avoids the global locks and
synchronization bottlenecks of traditional systems, enabling near-linear performance gains as more
nodes are added.

3. NewSQL: Bridging the Gap (Distributed ACID)

NewSQL (e.g., Google Spanner, CockroachDB, VoltDB) is a class of modern relational databases that seek
to provide the NoSQL scale without sacrificing the ACID guarantees of traditional systems.

Scalability Approach: Like NoSQL, NewSQL uses horizontal sharding and a distributed architecture.
However, it often utilizes a non-blocking architecture and in-memory storage to reduce latency.

Consistency Mechanism: NewSQL uses modern consensus algorithms like Paxos or Raft instead of the
"heavy" Two-Phase Commit.

Consensus-Based Scaling: Instead of locking all nodes, a "majority" (quorum) of nodes must agree on a
transaction. This allows the system to remain consistent and available even if a minority of nodes fail.

TrueTime/Logical Clocks: Some NewSQL systems use high-precision clocks to order transactions globally
without needing a central coordinator, effectively solving the "Clock Skew" challenge discussed in
Question 8.

Summary Comparison Table


Feature Traditional DDBMS NoSQL NewSQL

Primary Goal Transactional Integrity High Availability & ScaleScalability + ACID

Scaling Type Vertical (Scale-up) Horizontal (Scale-out) Horizontal (Scale-out)

Consistency Model ACID (Strong) BASE (Eventual) ACID (Distributed Strong)

Typical Schema Rigid / Fixed Schema-less / Flexible Relational (SQL)

Coordination Heavy (2-Phase Commit) Minimal (Gossip protocols) Optimized (Paxos/Raft)

IEEE References

E. Brewer, "CAP twelve years later: How the 'rules' have changed," Computer, vol. 45, no. 2, pp. 23-29,
2012.

A. Pavlo and M. Aslett, "What's New with NewSQL?," SIGMOD Record, vol. 45, no. 2, pp. 45-55, 2016.

R. Cattell, "Scalable SQL and NoSQL Data Stores," SIGMOD Record, vol. 39, no. 4, pp. 12-27, 2011.

J. Corbett et al., "Spanner: Google’s Globally Distributed Database," ACM Trans. Comput. Syst., vol. 31,
no. 3, 2013.

Would you like me to analyze a specific NewSQL database like CockroachDB to show how its "Survival"
architecture compares to a NoSQL database like Cassandra?

Distributed Database Consistency Models This video provides a technical comparison of NoSQL and
NewSQL, specifically highlighting how they manage the trade-offs between performance and
consistency in large-scale deployments.

NoSQL and NewSQL: Tradeoffs Between Scalable Performance & Consistency - YouTube

ScyllaDB · 615 views


Only for 4 in deep

In a clinical scenario where patient health records are distributed across multiple data centers, the CAP
Theorem (Consistency, Availability, Partition Tolerance) serves as a critical framework for architectural
decision-making.

Because medical data is inherently sensitive and requires high accuracy to ensure patient safety, the
constraints of the CAP theorem directly dictate how the system handles failures.

1. Defining the CAP Constraints in Healthcare

The CAP theorem states that a distributed system can only provide two of the following three
guarantees at any given time:

Consistency (C): Every read receives the most recent write (or an error). In a hospital, this means if a
nurse updates a patient’s allergy in Data Center A, a doctor in Data Center B must see that update
immediately.

Availability (A): Every request receives a (non-error) response, even if the data is not the most recent.

Partition Tolerance (P): The system continues to operate despite the loss of communication between
nodes (network partition).

Why "P" is Mandatory

In a cloud-based environment spanning multiple geographic locations, network partitions are inevitable
(e.g., fiber cuts, router failures). Therefore, the system must be Partition Tolerant (P). This leaves the
designer with a binary choice during a network failure: CP or AP.

2. The Influence on System Design: Choosing CP


For health records, the system must prioritize Consistency over Availability, resulting in a CP system
design.

A. Prioritizing Safety (CP)

In a medical context, "Stale Data" (Availability over Consistency) can be fatal. If a network partition
occurs and Data Center B cannot confirm if Data Center A has updated a patient's medication list, the
system must not serve the old data.

Design Decision: The system will block the read/write request or return an error rather than risk
providing outdated information.

Reasoning: It is safer for a system to be temporarily "unavailable" (requiring a manual backup process
like paper records) than to allow a doctor to prescribe a drug that the patient is newly allergic to.

B. Impact on Transactional Logic

To maintain this strict consistency across data centers, the system design must incorporate:

Distributed Consensus: Using protocols like Paxos or Raft. A write is only successful if a majority
(quorum) of nodes across data centers agree.

Atomic Commits: Ensuring that updates to a patient's record happen in an "all-or-nothing" fashion
across all replicas.

C. Latency vs. Consistency (The PACELC Extension)

While CAP describes behavior during a partition, the PACELC theorem notes that even during normal
operation (E for Else), there is a trade-off between Latency (L) and Consistency (C).

Design Influence: Because the hospital requires strict consistency, the system will naturally have higher
latency because it must wait for multi-site synchronization before confirming a write.
3. Summary of Design Influences

Constraint Design Influence

Strict Consistency Implementation of Strong Consistency models; data must be synchronous.

Partition Event The system enters a Fail-Safe mode, preferring to deny service than to show stale
records.

Data Replication Synchronous replication across sites; a write is not "done" until mirrored in
multiple data centers.

User Experience Healthcare professionals are trained on "Down-time procedures" for when the
CP system blocks access.

IEEE References

E. Brewer, "CAP twelve years later: How the 'rules' have changed," Computer, vol. 45, no. 2, pp. 23-29,
2012, doi: 10.1109/MC.2012.37.

S. Gilbert and N. Lynch, "Perspectives on the CAP Theorem," Computer, vol. 45, no. 2, pp. 30-36, Feb.
2012, doi: 10.1109/MC.2012.33.

D. J. Abadi, "Consistency Tradeoffs in Modern Distributed Database System Design," Computer, vol. 45,
no. 2, pp. 37-42, 2012.

Would you like me to draft a section on the "Down-time Procedures" a hospital must implement when a
CP system becomes unavailable due to a network partition?

Gemini can make mistakes, so double-check it

You might also like