System Design
Designing Scalable and Reliable Distributed Systems
Programming Guides
1. What Is System Design?
System design is the process of defining the architecture, components, modules, interfaces, and data flows of a system
to satisfy specified requirements. For software engineers, it means deciding how to build systems that are:
Scalable: can handle growing load by adding resources.
Reliable: continues to work correctly even when components fail.
Available: serves requests with minimal downtime.
Maintainable: can be changed, debugged, and extended without heroic effort.
Efficient: uses computational resources proportionally to the problem.
System design is inherently about trade-offs. Every decision (SQL vs NoSQL, sync vs async, monolith vs microservices)
involves gaining something and giving up something else. Good system designers understand these trade-offs deeply
and make explicit, reasoned choices.
2. Scalability Fundamentals
Vertical vs Horizontal Scaling
Vertical scaling (scale up): add more CPU, RAM, or disk to a single machine. Simple - no code changes needed.
Limited: there is a maximum machine size, and a single machine is a single point of failure.
Horizontal scaling (scale out): add more machines and distribute the load. Theoretically unlimited, resilient to individual
machine failures, but requires the system to be designed for distribution (stateless, data partitioned, etc.).
Stateless vs Stateful Services
Stateless services do not store any session or user-specific data between requests. Any instance can handle any
request. This is the prerequisite for horizontal scaling - you can add or remove instances freely.
Page 1
Stateful services hold data in memory between requests (e.g., WebSocket connections, in-memory caches). They are
harder to scale because a request must reach the same instance. Session state should be externalized to Redis or a
database.
Load Balancing
A load balancer distributes incoming traffic across multiple backend instances. Common algorithms:
Algorithm Description Best for
Round Robin Cycle through servers in order Uniform request cost
Least Connections Route to server with fewest active conn Variable request cost
IP Hash Hash client IP to a server Sticky sessions
Weighted RR Cycle with weights (some servers get more) Heterogeneous servers
Random Pick a server randomly Simple, low overhead
3. Caching
Caching stores the result of expensive computations or I/O so subsequent requests can be served faster. A cache hit
avoids the expensive operation entirely. Caching is one of the highest-leverage optimizations in distributed systems.
Cache Strategies
Strategy How it works Trade-off
Cache-Aside App reads cache; on miss, reads DB and populates
App manages
cache cache; stale on DB update
Read-Through Cache reads DB on miss transparently Simpler app; cold start latency
Write-Through Write to cache and DB simultaneously Consistent; double write cost
Write-Behind Write to cache; DB written asynchronously Fast writes; risk of data loss
Refresh-Ahead Pre-populate cache before expiry Low latency; may cache unused data
Cache Invalidation
The hardest problem in caching: knowing when cached data is stale. Strategies:
TTL (Time-To-Live): expire cache entries after N seconds. Simple but may serve stale data until expiry.
Event-driven invalidation: the service that writes data publishes an event; cache consumers listen and invalidate affected
keys immediately.
Write-through: always write to cache and DB together. Cache is always fresh. Higher write cost.
Cache stampede prevention: when a popular key expires, many requests hit the DB simultaneously. Mitigate with
probabilistic early expiry or a distributed lock.
4. Databases at Scale
Replication
Page 2
Replication copies data to multiple nodes for redundancy and read scaling. Primary-Replica (master-slave): one primary
accepts writes; replicas get asynchronous copies and serve reads. Reads scale linearly; writes do not.
Multi-primary: multiple nodes accept writes. Conflicts must be resolved. Used for geographically distributed writes.
Sharding (Partitioning)
Sharding splits data across multiple database nodes. Each shard holds a subset of the data. Common strategies:
Range sharding: rows with keys 0-999 on shard 1, 1000-1999 on shard 2. Simple but causes hotspots if keys are not
evenly distributed.
Hash sharding: shard = hash(key) mod N. Even distribution but range queries span multiple shards.
Directory sharding: a lookup table maps keys to shards. Flexible but the directory is a bottleneck and single point of
failure.
CAP Theorem
CAP theorem states that a distributed data store can guarantee at most two of:
Consistency: every read receives the most recent write or an error.
Availability: every request receives a non-error response (may be stale).
Partition Tolerance: the system works even when network partitions split nodes.
Since network partitions are unavoidable in real systems, the real choice is CP (consistent but may be unavailable
during partition) vs AP (available but may return stale data). Most modern distributed databases document their
consistency model explicitly.
5. Message Queues and Event-Driven Architecture
Message queues decouple producers from consumers. A producer sends a message to a queue; one or more
consumers process it asynchronously. This improves resilience (consumer can be down), scalability (scale consumers
independently), and absorbs traffic spikes.
System Model Best for
RabbitMQ Queue (point-to-point or pub/sub) Task queues, RPC, routing
Apache Kafka Persistent distributed log Event streaming, audit log, replay
AWS SQS Managed queue Simple task queues, AWS ecosystem
Redis Streams In-memory persistent log Low-latency event streams
NATS Lightweight pub/sub High-throughput, low-latency messaging
Key patterns:
At-least-once delivery: messages are retried until acknowledged. Consumers must be idempotent (processing the same
message twice has no extra effect).
Page 3
Exactly-once semantics: guaranteed no duplicates. Complex and expensive. Kafka Transactions and AWS SQS FIFO
support this.
Dead letter queue (DLQ): messages that fail after N retries are sent to a DLQ for inspection and manual reprocessing.
6. Microservices vs Monolith
The choice between a monolith and microservices is one of the most consequential early architecture decisions.
Aspect Monolith Microservices
Deployment Single unit Independent per service
Scaling Scale everything Scale per service
Dev complexity Low initially High (networking, contracts)
Ops complexity Low High (K8s, tracing, CI/CD)
Data Single database Database per service
Team structure One team One team per service
Communication In-process function calls Network calls (REST, gRPC)
Best for Early-stage, small teams Large orgs, high scale
Start with a monolith. Extract services only when you have identified a clear scalability or team boundary need. The
biggest failure mode is premature microservices: the distributed system complexity arrives before the team and traffic
justify it.
7. Reliability Patterns
Circuit Breaker
When a downstream service starts failing, the circuit breaker stops sending requests to it for a recovery period, returning
a fast failure to callers. This prevents cascading failures where a slow service backs up queues and exhausts resources
throughout the system.
# States: CLOSED (normal) -> OPEN (failing) -> HALF-OPEN (testing)
# CLOSED: requests pass through. Track failure rate.
# OPEN: reject requests immediately (fast fail) for N seconds.
# HALF-OPEN: allow one probe request. If success -> CLOSED, else -> OPEN.
Retry with Exponential Backoff
Retrying immediately on failure can worsen an overloaded downstream service. Exponential backoff increases the wait
between retries: 1s, 2s, 4s, 8s, ... Add jitter (random +/- offset) to prevent the 'thundering herd' - all clients retrying at the
exact same moment.
import random, time
def retry_with_backoff(fn, max_retries=5, base=1.0, jitter=0.5):
for attempt in range(max_retries):
try:
return fn()
except Exception as e:
Page 4
if attempt == max_retries - 1: raise
wait = base * (2 ** attempt) + [Link](0, jitter)
[Link](wait)
Rate Limiting
Rate limiting prevents any single client from overwhelming the system. Algorithms: Token Bucket (allows bursts up to
bucket capacity), Leaky Bucket (smooth constant output rate), Fixed Window (N requests per window), Sliding Window
(more accurate, no boundary spike).
8. Observability: Metrics, Logs, Traces
Observability is the ability to understand the internal state of a system from its external outputs. The three pillars:
* Metrics: numeric time-series data (request rate, error rate, latency p99, CPU usage). Aggregated and stored in
Prometheus or Datadog. Visualized in Grafana dashboards.
* Log
Ela
The key metric framework is RED:
Rate - requests per second
Errors - error rate as a percentage
Duration - latency (p50, p95, p99)
For resources (databases, queues), use USE:
Utilization - % time the resource is busy
Saturation - queue depth or wait time
Errors - error count
9. CDN and Global Distribution
A Content Delivery Network (CDN) caches static assets (images, JS, CSS, HTML) at edge nodes geographically close
to users. Requests are served from the nearest edge, reducing latency from hundreds of milliseconds to single digits for
cached content.
Beyond static files, modern CDNs (Cloudflare Workers, Lambda@Edge) run dynamic code at the edge: A/B testing,
authentication, personalization, and API responses can be handled at the edge before reaching origin servers.
DNS-based routing (GeoDNS) directs users to the nearest regional data center, complementing CDNs for dynamic
workloads that cannot be fully edge-cached.
10. Putting It Together: Design an URL Shortener
Walk-through of a real system design to demonstrate applying these concepts:
Page 5
Requirements
* Shorten a long URL to a 7-character code (e.g., [Link]/abc1234).
* Re
Key Design Decisions
ID generation: use a distributed ID service (Snowflake) or a counter with base-62 encoding to generate short codes.
Avoid random IDs (collision probability).
Storage: PostgreSQL for URL mappings (10B rows, ~100 bytes each = 1TB). Add a read replica and shard by code
prefix at 10B rows.
Caching: hot URLs (top 1% get 80% of traffic). Cache in Redis with LRU eviction. A 10GB cache fits ~100M URL
mappings (100 bytes each).
Read path: Client -> CDN (cache hit?) -> Load Balancer -> Stateless API -> Redis (cache hit?) -> Postgres. Cached
reads avoid the database entirely.
Redirect: HTTP 301 (permanent, browser caches) or 302 (temporary, track clicks). Use 302 to measure traffic; use 301
for performance.
11. Conclusion
System design is a skill developed through deliberate practice. Start by deeply understanding the fundamentals: how
caching reduces load, how replication provides redundancy, how message queues decouple services, and how load
balancers distribute traffic.
For each design decision, ask: what are the failure modes? What is the bottleneck at 10x current load? What happens
when this component goes down?
The best system designers avoid premature optimization. They design for the current scale with clear extension points,
instrument everything so they can see what is actually happening, and evolve the architecture based on real data rather
than hypothetical future requirements.
Simple systems that work are worth infinitely more than elegant systems that are too complex to debug at 3am when
they fail.
12. API Gateway and BFF Pattern
An API Gateway is a single entry point for all client traffic. It handles cross-cutting concerns: authentication, rate limiting,
request routing, SSL termination, and response transformation. Clients talk to one URL; the gateway routes internally.
Backend for Frontend (BFF)
Page 6
The BFF pattern creates a dedicated backend per client type (web BFF, mobile BFF). Each BFF aggregates calls to
microservices and shapes the response for its specific client. This avoids the one-size-fits-all problem of a generic API:
Mobile clients need compact payloads; web clients may need richer data.
Mobile clients often need a single request to aggregate data that would otherwise require N separate API calls (solves
the N+1 problem at the network level).
13. Storage Architecture Patterns
CQRS (Command Query Responsibility Segregation)
CQRS separates read and write models. Commands (writes) update a normalized, consistent write store. Queries
(reads) use a denormalized read store optimized for the specific access patterns. The read store is populated
asynchronously from the write store via events.
Benefits: reads and writes scale independently; read models are optimized for each query without compromising write
integrity.
Event Sourcing
Instead of storing current state, store the full sequence of events that led to it. To get the current state, replay all events.
This gives a complete audit log, enables time-travel debugging, and supports deriving new read models from history.
Trade-off: querying current state requires replay (mitigated by snapshots); event schema evolution is complex.
# Event store record
{
"event_id": "uuid-1234",
"stream_id": "order-42",
"type": "OrderPlaced",
"timestamp": "2026-03-17T[Link]Z",
"payload": { "total": 59.99, "items": [...] }
}
# Reconstruct order: replay all events for stream order-42
14. Capacity Estimation Framework
Before designing a system, estimate its scale to drive the right decisions. Use powers of 2 and round numbers for
back-of-envelope estimates:
Unit Approximate value
1 million/day ~12 req/s
10 million/day ~120 req/s
100 million/day ~1,200 req/s
1 billion/day ~12,000 req/s
1 char (UTF-8) 1 byte
1 KB 1,000 bytes (~1 short paragraph)
Page 7
1 tweet ~280 bytes
1 user record ~200 bytes
1 photo (avg) ~300 KB
1 video min (1080p) ~100 MB
Example estimation for a Twitter-like service:
300M daily active users, each reads 100 tweets/day = 30B reads/day = 350K reads/s.
10M tweets/day = 120 writes/s.
300 bytes/tweet x 10M/day x 365 x 5 years = 5.5 TB/year for tweet storage.
These numbers tell you: you need heavy read optimization (cache), moderate write throughput (sharded DB), and a few
terabytes of storage per year.
15. Consistent Hashing
When you add or remove a node from a sharded system with naive hash-mod-N, almost all keys are remapped - every
cache entry becomes invalid, and the database sees a massive spike in cache misses. Consistent hashing solves this.
In consistent hashing, both servers and keys are mapped onto a ring (0 to 2^32). A key belongs to the first server
clockwise from its position on the ring. When a server is added or removed, only the keys between it and its predecessor
are remapped - on average 1/N of all keys, not all of them.
Virtual nodes (vnodes) improve distribution: each physical server is represented by many points on the ring, smoothing
out the load distribution. Used by DynamoDB, Cassandra, and most distributed caches.
# Simplified consistent hash ring
import hashlib
from sortedcontainers import SortedDict
class HashRing:
def __init__(self, vnodes=100):
[Link] = SortedDict()
[Link] = vnodes
def add_node(self, node: str):
for i in range([Link]):
key = self._hash(f'{node}:{i}')
[Link][key] = node
def get_node(self, key: str) -> str:
h = self._hash(key)
idx = [Link].bisect_left(h) % len([Link])
return [Link]()[idx]
def _hash(self, s: str) -> int:
return int(hashlib.md5([Link]()).hexdigest(), 16)
Page 8
15b. Database Connection Pooling
Opening a database connection is expensive: TCP handshake, authentication, and session setup can take 20-100ms.
Connection pooling reuses a fixed pool of open connections, eliminating this overhead for every request.
Key pool parameters:
min_size: connections kept alive even at low load. Avoids cold-start latency.
max_size: hard cap. Requests that arrive when the pool is full queue or fail fast.
max_idle_time: return idle connections to prevent stale connections and server-side timeouts.
connection_timeout: how long a request waits for a pool slot before failing.
PgBouncer (PostgreSQL) and ProxySQL (MySQL) are dedicated connection pooling proxies that sit between application
servers and the database, allowing thousands of application connections to multiplex onto a small number of real DB
connections.
# asyncpg connection pool (Python async PostgreSQL)
import asyncpg
pool = await asyncpg.create_pool(
dsn='postgresql://user:pass@localhost/mydb',
min_size=5,
max_size=20,
max_inactive_connection_lifetime=300, # 5 minutes
)
async def get_user(user_id: int):
async with [Link]() as conn: # borrow from pool
return await [Link]('SELECT * FROM users WHERE id=$1', user_id)
# connection automatically returned to pool
16. SLA, SLO, and SLI
Reliability is meaningless without clear targets. The SRE (Site Reliability Engineering) framework uses three related
concepts:
Term Full Name Meaning
SLI Service Level Indicator A metric that measures service behavior (e.g., reque
SLO Service Level Objective The target value or range for an SLI (e.g., p99 laten
SLA Service Level Agreement A contractual commitment to customers, often with f
Error Budget - 1 - SLO availability. The allowed amount of unreliab
Example: SLI = HTTP success rate. SLO = 99.9% success rate over 30 days. Error budget = 0.1% = 43.2 minutes of
downtime per month.
The error budget is a management tool: if you have budget remaining, ship new features aggressively. If you are burning
budget too fast, stop releases and focus on reliability. This makes the tension between velocity and reliability explicit and
Page 9
data-driven.
17. Conclusion
System design is a skill developed through deliberate practice. Start by deeply understanding the fundamentals: how
caching reduces load, how replication provides redundancy, how message queues decouple services, and how load
balancers distribute traffic.
For each design decision, ask: what are the failure modes? What is the bottleneck at 10x current load? What happens
when this component goes down?
Define SLIs and SLOs early. You cannot improve what you do not measure, and you cannot have a reliability
conversation without a shared definition of 'reliable'.
The best system designers avoid premature optimization. They design for the current scale with clear extension points,
instrument everything so they can see what is actually happening, and evolve the architecture based on real data rather
than hypothetical future requirements.
Simple systems that work are worth infinitely more than elegant systems that are too complex to debug at 3am when
they fail.
Page 10