0% found this document useful (0 votes)

10 views10 pages

System Design

System design involves defining the architecture and components of a system to meet requirements such as scalability, reliability, and maintainability. Key concepts include scalability methods (vertical vs horizontal), caching strategies, database management (replication and sharding), and the choice between microservices and monolithic architectures. Effective system design requires understanding trade-offs, implementing reliability patterns, and ensuring observability through metrics and logs.

Uploaded by

Pedro Henrique

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

10 views10 pages

System Design

Uploaded by

Pedro Henrique

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

System Design

Designing Scalable and Reliable Distributed Systems

Programming Guides

1. What Is System Design?

System design is the process of defining the architecture, components, modules, interfaces, and data flows of a system
to satisfy specified requirements. For software engineers, it means deciding how to build systems that are:

Scalable: can handle growing load by adding resources.

Reliable: continues to work correctly even when components fail.
Available: serves requests with minimal downtime.
Maintainable: can be changed, debugged, and extended without heroic effort.
Efficient: uses computational resources proportionally to the problem.

System design is inherently about trade-offs. Every decision (SQL vs NoSQL, sync vs async, monolith vs microservices)
involves gaining something and giving up something else. Good system designers understand these trade-offs deeply
and make explicit, reasoned choices.

2. Scalability Fundamentals

Vertical vs Horizontal Scaling

Vertical scaling (scale up): add more CPU, RAM, or disk to a single machine. Simple - no code changes needed.
Limited: there is a maximum machine size, and a single machine is a single point of failure.

Horizontal scaling (scale out): add more machines and distribute the load. Theoretically unlimited, resilient to individual
machine failures, but requires the system to be designed for distribution (stateless, data partitioned, etc.).

Stateless vs Stateful Services

Stateless services do not store any session or user-specific data between requests. Any instance can handle any
request. This is the prerequisite for horizontal scaling - you can add or remove instances freely.

Page 1
Stateful services hold data in memory between requests (e.g., WebSocket connections, in-memory caches). They are
harder to scale because a request must reach the same instance. Session state should be externalized to Redis or a
database.

Load Balancing
A load balancer distributes incoming traffic across multiple backend instances. Common algorithms:

Algorithm Description Best for

Round Robin Cycle through servers in order Uniform request cost

Least Connections Route to server with fewest active conn Variable request cost
IP Hash Hash client IP to a server Sticky sessions
Weighted RR Cycle with weights (some servers get more) Heterogeneous servers
Random Pick a server randomly Simple, low overhead

3. Caching

Caching stores the result of expensive computations or I/O so subsequent requests can be served faster. A cache hit
avoids the expensive operation entirely. Caching is one of the highest-leverage optimizations in distributed systems.

Cache Strategies

Strategy How it works Trade-off

Cache-Aside App reads cache; on miss, reads DB and populates

App manages
cache cache; stale on DB update
Read-Through Cache reads DB on miss transparently Simpler app; cold start latency
Write-Through Write to cache and DB simultaneously Consistent; double write cost
Write-Behind Write to cache; DB written asynchronously Fast writes; risk of data loss
Refresh-Ahead Pre-populate cache before expiry Low latency; may cache unused data

Cache Invalidation
The hardest problem in caching: knowing when cached data is stale. Strategies:

TTL (Time-To-Live): expire cache entries after N seconds. Simple but may serve stale data until expiry.

Event-driven invalidation: the service that writes data publishes an event; cache consumers listen and invalidate affected
keys immediately.

Write-through: always write to cache and DB together. Cache is always fresh. Higher write cost.

Cache stampede prevention: when a popular key expires, many requests hit the DB simultaneously. Mitigate with
probabilistic early expiry or a distributed lock.

4. Databases at Scale

Replication

Page 2
Replication copies data to multiple nodes for redundancy and read scaling. Primary-Replica (master-slave): one primary
accepts writes; replicas get asynchronous copies and serve reads. Reads scale linearly; writes do not.

Multi-primary: multiple nodes accept writes. Conflicts must be resolved. Used for geographically distributed writes.

Sharding (Partitioning)
Sharding splits data across multiple database nodes. Each shard holds a subset of the data. Common strategies:

Range sharding: rows with keys 0-999 on shard 1, 1000-1999 on shard 2. Simple but causes hotspots if keys are not
evenly distributed.

Hash sharding: shard = hash(key) mod N. Even distribution but range queries span multiple shards.

Directory sharding: a lookup table maps keys to shards. Flexible but the directory is a bottleneck and single point of
failure.

CAP Theorem
CAP theorem states that a distributed data store can guarantee at most two of:

Consistency: every read receives the most recent write or an error.

Availability: every request receives a non-error response (may be stale).
Partition Tolerance: the system works even when network partitions split nodes.

Since network partitions are unavoidable in real systems, the real choice is CP (consistent but may be unavailable
during partition) vs AP (available but may return stale data). Most modern distributed databases document their
consistency model explicitly.

5. Message Queues and Event-Driven Architecture

Message queues decouple producers from consumers. A producer sends a message to a queue; one or more
consumers process it asynchronously. This improves resilience (consumer can be down), scalability (scale consumers
independently), and absorbs traffic spikes.

System Model Best for

RabbitMQ Queue (point-to-point or pub/sub) Task queues, RPC, routing

Apache Kafka Persistent distributed log Event streaming, audit log, replay
AWS SQS Managed queue Simple task queues, AWS ecosystem
Redis Streams In-memory persistent log Low-latency event streams
NATS Lightweight pub/sub High-throughput, low-latency messaging

Key patterns:

At-least-once delivery: messages are retried until acknowledged. Consumers must be idempotent (processing the same
message twice has no extra effect).

Page 3
Exactly-once semantics: guaranteed no duplicates. Complex and expensive. Kafka Transactions and AWS SQS FIFO
support this.

Dead letter queue (DLQ): messages that fail after N retries are sent to a DLQ for inspection and manual reprocessing.

6. Microservices vs Monolith

The choice between a monolith and microservices is one of the most consequential early architecture decisions.

Aspect Monolith Microservices

Deployment Single unit Independent per service

Scaling Scale everything Scale per service
Dev complexity Low initially High (networking, contracts)
Ops complexity Low High (K8s, tracing, CI/CD)
Data Single database Database per service
Team structure One team One team per service
Communication In-process function calls Network calls (REST, gRPC)
Best for Early-stage, small teams Large orgs, high scale

Start with a monolith. Extract services only when you have identified a clear scalability or team boundary need. The
biggest failure mode is premature microservices: the distributed system complexity arrives before the team and traffic
justify it.

7. Reliability Patterns

Circuit Breaker
When a downstream service starts failing, the circuit breaker stops sending requests to it for a recovery period, returning
a fast failure to callers. This prevents cascading failures where a slow service backs up queues and exhausts resources
throughout the system.

# States: CLOSED (normal) -> OPEN (failing) -> HALF-OPEN (testing)

# CLOSED: requests pass through. Track failure rate.
# OPEN: reject requests immediately (fast fail) for N seconds.
# HALF-OPEN: allow one probe request. If success -> CLOSED, else -> OPEN.

Retry with Exponential Backoff

Retrying immediately on failure can worsen an overloaded downstream service. Exponential backoff increases the wait
between retries: 1s, 2s, 4s, 8s, ... Add jitter (random +/- offset) to prevent the 'thundering herd' - all clients retrying at the
exact same moment.

import random, time

def retry_with_backoff(fn, max_retries=5, base=1.0, jitter=0.5):

for attempt in range(max_retries):
try:
return fn()
except Exception as e:

Page 4
if attempt == max_retries - 1: raise
wait = base * (2 ** attempt) + [Link](0, jitter)
[Link](wait)

Rate Limiting
Rate limiting prevents any single client from overwhelming the system. Algorithms: Token Bucket (allows bursts up to
bucket capacity), Leaky Bucket (smooth constant output rate), Fixed Window (N requests per window), Sliding Window
(more accurate, no boundary spike).

8. Observability: Metrics, Logs, Traces

Observability is the ability to understand the internal state of a system from its external outputs. The three pillars:

* Metrics: numeric time-series data (request rate, error rate, latency p99, CPU usage). Aggregated and stored in
Prometheus or Datadog. Visualized in Grafana dashboards.
* Log
Ela

The key metric framework is RED:

Rate - requests per second
Errors - error rate as a percentage
Duration - latency (p50, p95, p99)

For resources (databases, queues), use USE:

Utilization - % time the resource is busy
Saturation - queue depth or wait time
Errors - error count

9. CDN and Global Distribution

A Content Delivery Network (CDN) caches static assets (images, JS, CSS, HTML) at edge nodes geographically close
to users. Requests are served from the nearest edge, reducing latency from hundreds of milliseconds to single digits for
cached content.

Beyond static files, modern CDNs (Cloudflare Workers, Lambda@Edge) run dynamic code at the edge: A/B testing,
authentication, personalization, and API responses can be handled at the edge before reaching origin servers.

DNS-based routing (GeoDNS) directs users to the nearest regional data center, complementing CDNs for dynamic
workloads that cannot be fully edge-cached.

10. Putting It Together: Design an URL Shortener

Walk-through of a real system design to demonstrate applying these concepts:

Page 5
Requirements
* Shorten a long URL to a 7-character code (e.g., [Link]/abc1234).
* Re

Key Design Decisions

ID generation: use a distributed ID service (Snowflake) or a counter with base-62 encoding to generate short codes.
Avoid random IDs (collision probability).

Storage: PostgreSQL for URL mappings (10B rows, ~100 bytes each = 1TB). Add a read replica and shard by code
prefix at 10B rows.

Caching: hot URLs (top 1% get 80% of traffic). Cache in Redis with LRU eviction. A 10GB cache fits ~100M URL
mappings (100 bytes each).

Read path: Client -> CDN (cache hit?) -> Load Balancer -> Stateless API -> Redis (cache hit?) -> Postgres. Cached
reads avoid the database entirely.

Redirect: HTTP 301 (permanent, browser caches) or 302 (temporary, track clicks). Use 302 to measure traffic; use 301
for performance.

11. Conclusion

System design is a skill developed through deliberate practice. Start by deeply understanding the fundamentals: how
caching reduces load, how replication provides redundancy, how message queues decouple services, and how load
balancers distribute traffic.

For each design decision, ask: what are the failure modes? What is the bottleneck at 10x current load? What happens
when this component goes down?

The best system designers avoid premature optimization. They design for the current scale with clear extension points,
instrument everything so they can see what is actually happening, and evolve the architecture based on real data rather
than hypothetical future requirements.

Simple systems that work are worth infinitely more than elegant systems that are too complex to debug at 3am when
they fail.

12. API Gateway and BFF Pattern

An API Gateway is a single entry point for all client traffic. It handles cross-cutting concerns: authentication, rate limiting,
request routing, SSL termination, and response transformation. Clients talk to one URL; the gateway routes internally.

Backend for Frontend (BFF)

Page 6
The BFF pattern creates a dedicated backend per client type (web BFF, mobile BFF). Each BFF aggregates calls to
microservices and shapes the response for its specific client. This avoids the one-size-fits-all problem of a generic API:

Mobile clients need compact payloads; web clients may need richer data.
Mobile clients often need a single request to aggregate data that would otherwise require N separate API calls (solves
the N+1 problem at the network level).

13. Storage Architecture Patterns

CQRS (Command Query Responsibility Segregation)

CQRS separates read and write models. Commands (writes) update a normalized, consistent write store. Queries
(reads) use a denormalized read store optimized for the specific access patterns. The read store is populated
asynchronously from the write store via events.

Benefits: reads and writes scale independently; read models are optimized for each query without compromising write
integrity.

Event Sourcing
Instead of storing current state, store the full sequence of events that led to it. To get the current state, replay all events.
This gives a complete audit log, enables time-travel debugging, and supports deriving new read models from history.

Trade-off: querying current state requires replay (mitigated by snapshots); event schema evolution is complex.

# Event store record

{
"event_id": "uuid-1234",
"stream_id": "order-42",
"type": "OrderPlaced",
"timestamp": "2026-03-17T[Link]Z",
"payload": { "total": 59.99, "items": [...] }
}
# Reconstruct order: replay all events for stream order-42

14. Capacity Estimation Framework

Before designing a system, estimate its scale to drive the right decisions. Use powers of 2 and round numbers for
back-of-envelope estimates:

Unit Approximate value

1 million/day ~12 req/s

10 million/day ~120 req/s
100 million/day ~1,200 req/s
1 billion/day ~12,000 req/s
1 char (UTF-8) 1 byte
1 KB 1,000 bytes (~1 short paragraph)

Page 7
1 tweet ~280 bytes
1 user record ~200 bytes
1 photo (avg) ~300 KB
1 video min (1080p) ~100 MB

Example estimation for a Twitter-like service:

300M daily active users, each reads 100 tweets/day = 30B reads/day = 350K reads/s.
10M tweets/day = 120 writes/s.
300 bytes/tweet x 10M/day x 365 x 5 years = 5.5 TB/year for tweet storage.

These numbers tell you: you need heavy read optimization (cache), moderate write throughput (sharded DB), and a few
terabytes of storage per year.

15. Consistent Hashing

When you add or remove a node from a sharded system with naive hash-mod-N, almost all keys are remapped - every
cache entry becomes invalid, and the database sees a massive spike in cache misses. Consistent hashing solves this.

In consistent hashing, both servers and keys are mapped onto a ring (0 to 2^32). A key belongs to the first server
clockwise from its position on the ring. When a server is added or removed, only the keys between it and its predecessor
are remapped - on average 1/N of all keys, not all of them.

Virtual nodes (vnodes) improve distribution: each physical server is represented by many points on the ring, smoothing
out the load distribution. Used by DynamoDB, Cassandra, and most distributed caches.

# Simplified consistent hash ring

import hashlib
from sortedcontainers import SortedDict

class HashRing:
def __init__(self, vnodes=100):
[Link] = SortedDict()
[Link] = vnodes

def add_node(self, node: str):

for i in range([Link]):
key = self._hash(f'{node}:{i}')
[Link][key] = node

def get_node(self, key: str) -> str:

h = self._hash(key)
idx = [Link].bisect_left(h) % len([Link])
return [Link]()[idx]

def _hash(self, s: str) -> int:

return int(hashlib.md5([Link]()).hexdigest(), 16)

Page 8
15b. Database Connection Pooling

Opening a database connection is expensive: TCP handshake, authentication, and session setup can take 20-100ms.
Connection pooling reuses a fixed pool of open connections, eliminating this overhead for every request.

Key pool parameters:

min_size: connections kept alive even at low load. Avoids cold-start latency.
max_size: hard cap. Requests that arrive when the pool is full queue or fail fast.
max_idle_time: return idle connections to prevent stale connections and server-side timeouts.
connection_timeout: how long a request waits for a pool slot before failing.

PgBouncer (PostgreSQL) and ProxySQL (MySQL) are dedicated connection pooling proxies that sit between application
servers and the database, allowing thousands of application connections to multiplex onto a small number of real DB
connections.

# asyncpg connection pool (Python async PostgreSQL)

import asyncpg

pool = await asyncpg.create_pool(

dsn='postgresql://user:pass@localhost/mydb',
min_size=5,
max_size=20,
max_inactive_connection_lifetime=300, # 5 minutes
)

async def get_user(user_id: int):

async with [Link]() as conn: # borrow from pool
return await [Link]('SELECT * FROM users WHERE id=$1', user_id)
# connection automatically returned to pool

16. SLA, SLO, and SLI

Reliability is meaningless without clear targets. The SRE (Site Reliability Engineering) framework uses three related
concepts:

Term Full Name Meaning

SLI Service Level Indicator A metric that measures service behavior (e.g., reque
SLO Service Level Objective The target value or range for an SLI (e.g., p99 laten
SLA Service Level Agreement A contractual commitment to customers, often with f
Error Budget - 1 - SLO availability. The allowed amount of unreliab

Example: SLI = HTTP success rate. SLO = 99.9% success rate over 30 days. Error budget = 0.1% = 43.2 minutes of
downtime per month.

The error budget is a management tool: if you have budget remaining, ship new features aggressively. If you are burning
budget too fast, stop releases and focus on reliability. This makes the tension between velocity and reliability explicit and

Page 9
data-driven.

17. Conclusion

For each design decision, ask: what are the failure modes? What is the bottleneck at 10x current load? What happens
when this component goes down?

Define SLIs and SLOs early. You cannot improve what you do not measure, and you cannot have a reliability
conversation without a shared definition of 'reliable'.

Simple systems that work are worth infinitely more than elegant systems that are too complex to debug at 3am when
they fail.

Page 10

Scalable System Design Patterns Guide
No ratings yet
Scalable System Design Patterns Guide
10 pages
02 Scalability
No ratings yet
02 Scalability
7 pages
Machine Learning System Design Guide
100% (1)
Machine Learning System Design Guide
24 pages
Cracking the System Design Interview Guide
100% (3)
Cracking the System Design Interview Guide
91 pages
System Design Terms Explained
No ratings yet
System Design Terms Explained
20 pages
System Design HLD LLD Guide 2026
No ratings yet
System Design HLD LLD Guide 2026
18 pages
System Design Handbook Overview
No ratings yet
System Design Handbook Overview
5 pages
Key Components of Analytics Architecture
No ratings yet
Key Components of Analytics Architecture
62 pages
System Design
No ratings yet
System Design
10 pages
System Design Fundamentals Course Overview
No ratings yet
System Design Fundamentals Course Overview
24 pages
Database Performance Solutions Overview
No ratings yet
Database Performance Solutions Overview
385 pages
27 System Design
No ratings yet
27 System Design
2 pages
Scalable Web Architecture Strategies
No ratings yet
Scalable Web Architecture Strategies
8 pages
Understanding System Design Concepts
No ratings yet
Understanding System Design Concepts
22 pages
Google System Design Cheat Sheet
No ratings yet
Google System Design Cheat Sheet
12 pages
High-Level Design and Scalability Planning
No ratings yet
High-Level Design and Scalability Planning
35 pages
System Design for Distributed Applications
No ratings yet
System Design for Distributed Applications
20 pages
System Design Interview Essentials
No ratings yet
System Design Interview Essentials
19 pages
Scaling from Zero to Millions of Users
No ratings yet
Scaling from Zero to Millions of Users
40 pages
System Design: Scalability, Reliability, CAP
No ratings yet
System Design: Scalability, Reliability, CAP
35 pages
Algomasterio System Design Interview Handbook
No ratings yet
Algomasterio System Design Interview Handbook
75 pages
System Design Intro
No ratings yet
System Design Intro
7 pages
Architectural Concerns in Distributed Systems
No ratings yet
Architectural Concerns in Distributed Systems
12 pages
Scalability Strategies for High Traffic Systems
No ratings yet
Scalability Strategies for High Traffic Systems
4 pages
Understanding Remote Procedure Calls and Protocols
No ratings yet
Understanding Remote Procedure Calls and Protocols
2 pages
Document 5
No ratings yet
Document 5
8 pages
Backend System Design Slides
No ratings yet
Backend System Design Slides
116 pages
HLD Notes on System Design Patterns
No ratings yet
HLD Notes on System Design Patterns
76 pages
FAANG System Design Interview Guide
No ratings yet
FAANG System Design Interview Guide
151 pages
Scaling Web Apps for Millions of Users
No ratings yet
Scaling Web Apps for Millions of Users
8 pages
Cloud Architecture Basics
No ratings yet
Cloud Architecture Basics
14 pages
System Design Concepts and Components
No ratings yet
System Design Concepts and Components
56 pages
Understanding API Scaling and Load Balancing
No ratings yet
Understanding API Scaling and Load Balancing
20 pages
Back-End Architecture Essentials with Redis
No ratings yet
Back-End Architecture Essentials with Redis
29 pages
Scalable Web Application Development Guide
No ratings yet
Scalable Web Application Development Guide
19 pages
NoSQL Database Overview and Concepts
No ratings yet
NoSQL Database Overview and Concepts
40 pages
Data Distribution Models Explained
No ratings yet
Data Distribution Models Explained
11 pages
Overview of System Design Principles
No ratings yet
Overview of System Design Principles
3 pages
02 System Design Cheatsheet
No ratings yet
02 System Design Cheatsheet
5 pages
Introduction to Distributed Systems
No ratings yet
Introduction to Distributed Systems
31 pages
Designing Reliable Data Systems
No ratings yet
Designing Reliable Data Systems
2 pages
System Design Roadmap for Beginners
No ratings yet
System Design Roadmap for Beginners
18 pages
System Design Fundamentals Explained
No ratings yet
System Design Fundamentals Explained
4 pages
System Design Interview Guide
No ratings yet
System Design Interview Guide
33 pages
Software Architecture Patterns
No ratings yet
Software Architecture Patterns
6 pages
System Design Interview Fundamentals
100% (6)
System Design Interview Fundamentals
412 pages
System Design Concepts and Techniques
No ratings yet
System Design Concepts and Techniques
128 pages
Back-End Architecture Essentials
No ratings yet
Back-End Architecture Essentials
30 pages
Overview of Network Protocols and CAP Theorem
No ratings yet
Overview of Network Protocols and CAP Theorem
35 pages
System Design Notes
No ratings yet
System Design Notes
42 pages
System Design Concepts
No ratings yet
System Design Concepts
36 pages
Building Scalable Backend Systems
No ratings yet
Building Scalable Backend Systems
11 pages
Slides+ +section+8+ +performance
No ratings yet
Slides+ +section+8+ +performance
56 pages
Data Visualization Techniques Explained
No ratings yet
Data Visualization Techniques Explained
32 pages
Neon Inc. SOC 3 Report 2024
No ratings yet
Neon Inc. SOC 3 Report 2024
12 pages
AWS Data Engineering Internship Report
No ratings yet
AWS Data Engineering Internship Report
10 pages
OFM Database Creation for Waterflood Projects
No ratings yet
OFM Database Creation for Waterflood Projects
9 pages
Introduction to Transaction Processing
No ratings yet
Introduction to Transaction Processing
56 pages
Photogrammetry in Real Estate Management
No ratings yet
Photogrammetry in Real Estate Management
5 pages
AI Chatbot for College Recommendations
No ratings yet
AI Chatbot for College Recommendations
3 pages
DCR Application Requirements for Aviation
No ratings yet
DCR Application Requirements for Aviation
5 pages
Essential SAP T-Codes for System Monitoring
No ratings yet
Essential SAP T-Codes for System Monitoring
5 pages
SAP SCM APO Course Overview
No ratings yet
SAP SCM APO Course Overview
3 pages
Non-Structured Data and NoSQL Databases
No ratings yet
Non-Structured Data and NoSQL Databases
43 pages
Computer Science Practical Exam Paper
No ratings yet
Computer Science Practical Exam Paper
10 pages
MCS-225 Hand Written
No ratings yet
MCS-225 Hand Written
21 pages
Mohammed Azizuddin: Business Analyst Profile
No ratings yet
Mohammed Azizuddin: Business Analyst Profile
5 pages
Anis Sarker: Machine Learning Expert
No ratings yet
Anis Sarker: Machine Learning Expert
2 pages
BS IT ICT Practical Exam 2022
No ratings yet
BS IT ICT Practical Exam 2022
2 pages
Dissertation 2653337
No ratings yet
Dissertation 2653337
44 pages
Java Programming Implementation Guide
No ratings yet
Java Programming Implementation Guide
40 pages
Advanced Excel Training Course Details
No ratings yet
Advanced Excel Training Course Details
6 pages
Tech Stack Scoring Framework Analysis
No ratings yet
Tech Stack Scoring Framework Analysis
1 page
Redis Database Tutorial Guide
No ratings yet
Redis Database Tutorial Guide
8 pages
Introduction to Data Engineering
No ratings yet
Introduction to Data Engineering
3 pages
B.Tech Database Management Exam 2021-22
No ratings yet
B.Tech Database Management Exam 2021-22
2 pages
Online Book Store Management System
No ratings yet
Online Book Store Management System
31 pages
Database Assignments and Trigger Exercises
No ratings yet
Database Assignments and Trigger Exercises
2 pages
Oracle Database Performance Tuning Course
No ratings yet
Oracle Database Performance Tuning Course
11 pages
College Management System Project Report
No ratings yet
College Management System Project Report
5 pages
Database Recovery Techniques Explained
No ratings yet
Database Recovery Techniques Explained
11 pages
Data Structure Teaching Plan 2020-21
No ratings yet
Data Structure Teaching Plan 2020-21
3 pages
Understanding Data Models and Languages
No ratings yet
Understanding Data Models and Languages
8 pages

System Design

Uploaded by

System Design

Uploaded by

System Design

Designing Scalable and Reliable Distributed Systems

1. What Is System Design?

Scalable: can handle growing load by adding resources.

Vertical vs Horizontal Scaling

Stateless vs Stateful Services

Algorithm Description Best for

Round Robin Cycle through servers in order Uniform request cost

Strategy How it works Trade-off

Cache-Aside App reads cache; on miss, reads DB and populates

Consistency: every read receives the most recent write or an error.

5. Message Queues and Event-Driven Architecture

System Model Best for

RabbitMQ Queue (point-to-point or pub/sub) Task queues, RPC, routing

Aspect Monolith Microservices

Deployment Single unit Independent per service

# States: CLOSED (normal) -> OPEN (failing) -> HALF-OPEN (testing)

Retry with Exponential Backoff

import random, time

def retry_with_backoff(fn, max_retries=5, base=1.0, jitter=0.5):

8. Observability: Metrics, Logs, Traces

The key metric framework is RED:

For resources (databases, queues), use USE:

9. CDN and Global Distribution

10. Putting It Together: Design an URL Shortener

Walk-through of a real system design to demonstrate applying these concepts:

Key Design Decisions

12. API Gateway and BFF Pattern

Backend for Frontend (BFF)

13. Storage Architecture Patterns

CQRS (Command Query Responsibility Segregation)

# Event store record

14. Capacity Estimation Framework

Unit Approximate value

1 million/day ~12 req/s

Example estimation for a Twitter-like service:

15. Consistent Hashing

# Simplified consistent hash ring

def add_node(self, node: str):

def get_node(self, key: str) -> str:

def _hash(self, s: str) -> int:

Key pool parameters:

# asyncpg connection pool (Python async PostgreSQL)

pool = await asyncpg.create_pool(

async def get_user(user_id: int):

16. SLA, SLO, and SLI

Term Full Name Meaning

You might also like