0% found this document useful (0 votes)
9 views35 pages

Week 12 System Design

The document outlines the design of a scalable global file storage and synchronization service, addressing challenges such as data consistency, durability, and efficient file sharing across devices. It details the architecture, including file chunking, deduplication, and sync protocols to manage file changes and conflicts. The document also discusses storage strategies, including blob storage and replication methods to ensure data resilience and availability.

Uploaded by

saltyfishalpha
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
9 views35 pages

Week 12 System Design

The document outlines the design of a scalable global file storage and synchronization service, addressing challenges such as data consistency, durability, and efficient file sharing across devices. It details the architecture, including file chunking, deduplication, and sync protocols to manage file changes and conflicts. The document also discusses storage strategies, including blob storage and replication methods to ensure data resilience and availability.

Uploaded by

saltyfishalpha
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Northeastern University

CS6650 Building Scalable Distributed Systems


Designing Large-Scale Systems

1
Week 12 – Designing Large-Scale Systems
• Designing a Global File Storage & Sync Service
• The Problem & Scale
• High-Level Architecture
• File Chunking & Deduplication
• Sync Protocol
• Storage & Replication
• Wrap-up & Trade-offs
• Video streaming service

2
01 · THE PROBLEM

The Problem: Your Files Are Trapped


• You work across multiple devices — laptop, phone, tablet, work PC.
• Scattered files across devices with no automatic synchronization
• But your files live in one place.
• Edit a report on your laptop;
• it doesn't exist on your phone.
• Data loss
• Your hard drive fails; your files are gone.
• Collaboration friction
• You need to share a 2 GB video; email won't cut it.

3
What a global storage service guarantees?
• Any file, on any device, anywhere in the world — always up to date
• Consistency, availability needed; demands geo-distribution.

• 11 nines of durability (99.999999999%) — your data outlives any


hardware failure
• Durability demands replication.

• Instant sharing via a link, regardless of file size or recipient's OS


• a consistent global namespace with fine-grained access control needed

4
01 · THE PROBLEM

The Scale of a Global Storage Service


Numbers that put the engineering challenge in perspective

2B+ 700 PB 1.2B 100ms


Google Drive Dropbox data Files uploaded Target sync
Active Users stored (est.) to Dropbox/day latency (P95) on fast N/Ws

Files must survive disk failures, datacenter outages, When you save a file on your laptop, your phone
Durability: even natural disasters. Target: 11 nines Consistency: must eventually see the same version — no stale
(99.999999999%) durability. reads, no silent data loss.

Two devices edit the same file offline. When they


Upload speed limited by last mile, not servers. Conflict
Performance: Download must use CDN edge nodes closest to user.
reconnect, who wins? How do you show both
Resolution: versions?
02 · ARCHITECTURE

High-Level System Architecture


Five layers from client to cold storage

Client Layer API Gateway Metadata Service Block / Blob Storage CDN / Edge

Desktop / Mobile / Web


Auth, rate limiting, File namespace, versions Chunked file blocks Global PoPs
Sync daemon + local
routing, TLS termination folder tree, permissions Content-addressed store Read acceleration
cache

Notification Service Dedup Engine Quota / Billing


SHA-256 fingerprint lookup; block-level
Push sync events to connected devices Per-user storage accounting in real time
deduplication
02 · ARCHITECTURE

The Sync Client — Inside the Desktop App


How Dropbox watches your filesystem and decides what to upload

Local Filesystem File Watching — OS Hooks


(~/Dropbox folder) On macOS: FSEvents API. Linux: inotify. Windows: ReadDirectoryChangesW. The OS tells the daemon
which files changed — no polling needed.
inotify / FSEvents / ReadDirectoryChangesW
Local SQLite Journal
Change Detector The client maintains a local database of (file_path → chunk_hashes, last_modified, size). On startup it
(file watcher thread) compares this to the filesystem to catch offline edits.

Upload Throttling
The daemon measures available bandwidth and caps upload rate to avoid saturating the user's
Chunker + Hasher connection. Configurable in Dropbox settings.
(SHA-256 per block)

Retry with Backoff


Network is unreliable. Every upload is retried with exponential backoff. Idempotent design: re-uploading
the same chunk produces the same result.
02 · ARCHITECTURE

The Metadata Service


The source of truth for your file namespace — not the bytes, just the map

Core Schema (simplified) Namespace Sharding


files At Google scale, one machine can't hold all file metadata. Shard by user_id
file_id UUID PK (or owner_id). All files of one user land on the same shard — consistent
folder tree traversal.
owner_id UUID FK
parent_folder UUID
name TEXT
Why Separate Metadata from Bytes?
is_deleted BOOL
created_at TIMESTAMPTZ Metadata operations (list folder, rename, move, delete) are frequent and
need low latency ACID transactions. Blob storage is optimized for large
sequential I/O. Different workloads → different systems.
file_versions
version_id UUID PK
file_id UUID FK
Version Retention
chunk_ids UUID[]
size_bytes BIGINT Every file edit creates a new version_id row — the old version is NOT
checksum TEXT (SHA-256) deleted. This enables 30-day version history (Dropbox free tier) or unlimited
created_at TIMESTAMPTZ history (paid). Storage cost is shared via dedup.
device_id UUID
03 · CHUNKING & DEDUP

File Chunking — The Foundation of Efficient Sync


Why we split files into blocks instead of transferring them whole

[Link] (8 MB) Single file — naive approach: re-upload entire 8 MB on every edit

Split into fixed-size or variable-size chunks

Chunk 1 Chunk 2
4 MB 4 MB
[UNCHANGED] [CHANGED]

Upload only Chunk 2 (4 MB) — saves 50% bandwidth. At scale: 90%+ savings from deduplication.

Simple. 4–8 MB blocks. Problem: inserting 1 byte at Content-Defined Chunking (Rabin fingerprint).
Fixed-size chunks: position 0 shifts every chunk boundary — all chunks Variable-size (CDC): Boundaries based on content, not position.
change. Insertion only affects nearby chunks.
03 · CHUNKING & DEDUP

Global Deduplication — One Copy for Everyone


Why Dropbox stores far less data than users think they're uploading

Client computes Query block store: EXISTS: store only NOT EXISTS: upload
SHA-256(chunk) does hash exist? pointer, skip upload chunk, store bytes

Concrete Example: The Viral Presentation


1,000,000 users each upload the same 10 MB Keynote template | Naive: 10 TB stored | With dedup: 10 MB stored (99.999%
savings)

Content-Addressed Storage Hash Collision Risk Cross-User Dedup & Privacy


The blob store is addressed by SHA-256(chunk). Two SHA-256 has 2^256 possible values. The probability Only hash-identical bytes are shared — no user can
identical files automatically share storage. Moving of two different chunks colliding is astronomically access another's data. Some services avoid cross-
or renaming a file only updates metadata — no byte small — effectively zero. Dropbox adds a secondary user dedup for compliance reasons (each user gets
movement. checksum as belt-and-suspenders. their own namespace).
03 · CHUNKING & DEDUP

End-to-End Upload Flow


From saving a file to all your devices seeing it — step by step
Client API Gateway Metadata Svc Block Store Notification Svc

1. POST /upload (file_id, chunk_hashes[])

2. Check existing chunks

3. Return: missing_chunks[]

4. Presigned URLs for missing chunks

5. PUT chunk bytes directly → Block Store

6. COMMIT (all chunks uploaded)

7. Insert new file_version row

8. Notify other devices


04 · SYNC PROTOCOL

The Sync Protocol — Keeping Devices Consistent


What happens when a file changes and how every device learns about it

Online Sync (Device Connected) Offline Sync (Device Reconnects)

1. File change detected on Device A 1. Device B edited file offline (no connectivity)

2. Client uploads new chunks (delta only) 2. Server has version_id=5 (edited by Device A)

3. Metadata commit: new version_id created 3. Device B had version_id=3 when it went offline

4. Server pushes notification via long-poll / WebSocket 4. Device B reconnects — sends local version_id=3

5. Device B receives notification with version_id 5. Server detects divergence: version gap detected

6. Device B fetches changed chunks from CDN/store 6. Server computes delta: versions 4 and 5 sent down

7. Device B applies changes to local file 7. Conflict check: did Device B edit same lines?

8. Both devices now on same version_id 8. If conflict → conflict resolution (next slide)
04 · SYNC PROTOCOL

Version Vectors — Detecting Conflicts


How the system knows whether two edits are concurrent or sequential

Version Vector Basics No Conflict — Sequential SAFE


A version vector tracks, per device, how many
edits that device has contributed.
Device B is simply behind. Server version dominates Device B's version. Server
VV = { DeviceA: 3, DeviceB: 1 } sends the missing deltas. Device B fast-forwards.

This means: DeviceA contributed 3 edits, DeviceB


contributed 1.
No Conflict — Local Ahead SAFE
Rules:
• Edit on DeviceA: increment DeviceA counter
• VV_x dominates VV_y if all counters of x >= y Device B has edits the server hasn't seen (offline edits on unrelated sections).
→ no conflict Device B's local changes are uploaded and become the new server version.
• Counters in neither direction → CONFLICT

Example:
Server: { A:3, B:1 } True Conflict — Concurrent Edits CONFLICT
Client: { A:3, B:2 }
→ Client has newer B edits, no conflict
Neither side dominates. Both Device A and Device B edited the same file in the
same region while disconnected. Must be resolved.
Server: { A:5, B:1 }
Client: { A:3, B:2 }
→ Neither dominates → CONFLICT
04 · SYNC PROTOCOL

Conflict Resolution Strategies


When two devices edit the same file simultaneously — who wins?
Whichever device committed to the server last has its version kept. The other version is silently
STRATEGY A discarded.

Used by: S3 (object-level), some key-value stores Simple but lossy


Last-Writer-Wins (LWW)
Pros: Simple, automatic
Cons: Data loss — silent discard of one version

STRATEGY B Both versions are kept. A conflict copy is created: 'report (Alice's conflicted copy 2024-01-15).pdf'

Used by: Dropbox, iCloud Drive Safe — used in


production
Fork Both Versions
Pros: No data loss
Cons: User must manually merge; folder gets cluttered

Track individual operations (insert char at pos 42). Mathematically merge concurrent operations
STRATEGY C without conflicts.
Best UX — complex to
Operational Transformation / Used by: Google Docs (OT), Figma (CRDT) build
CRDT Pros: Real-time, seamless
Design Insight: Most file sync services useCons: Complex;
Strategy requires
B (fork). structured
Google Docs usesdata modelC (OT) because it stores text documents — highly structured, line-level
Strategy
merges are possible.
04 · SYNC PROTOCOL

Delta Sync — Only Transferring What Changed


The rsync algorithm and how cloud storage services implement efficient delta transfers

Problem: Device B has version V1 of a 100 MB file. Server has version V2. Transfer only the diff.

V2 (Device B has): Block A Block B Block C Block D Block E Block F

V1 (Server has): Block A Block B Block C' Block D Block E' Block F

Transfer only C' and E' — saving ~66% bandwidth

Server computes weak checksums Device B identifies matching blocks


1 Server computes Adler-32 (weak, fast) rolling checksums of fixed- 2 Device B slides through its V1 file looking for blocks matching the
size blocks of V2. Sends the list of (block_offset, weak_checksum) server's checksums. Matching blocks are 'already have' — no
to Device B. transfer needed.

Delta computed on Device B Server reconstructs V2


3 Non-matching regions between matched blocks are the delta. 4 Server combines copied blocks from V1 (already has them) with
Device B sends only these byte ranges plus the instruction: 'copy the uploaded delta bytes to reconstruct V2. Verified with SHA-256
block A from local, insert delta, copy block D...' checksum.
04 · SYNC PROTOCOL

Real-Time Notification
Long-polling vs WebSocket vs Server-Sent Events for sync event delivery

Long-Polling WebSocket Server-Sent Events (SSE)


Client sends HTTP request. Server holds it Client upgrades HTTP connection to a One-directional persistent HTTP stream:
open until a change occurs (up to 30s). persistent full-duplex channel. Server can server → client. Client sends changes via
Server responds → client immediately push events at any time. No polling normal POST requests. Simpler than
makes next request. overhead. WebSocket.
PROS PROS PROS

Lowest latency, efficient for frequent Works with HTTP/2 multiplexing, simple
Simple, firewall-friendly, works everywhere
updates implementation
CONS CONS CONS

Higher latency (new connection per event), Stateful — harder to scale (connection
Half-duplex — upload path is separate
overhead per reconnect affinity needed)

Used by: Dropbox original approach Used by: Google Drive, modern Dropbox Used by: Used in some mobile clients

At Google scale: dedicated notification servers (not API servers) hold millions of open connections. Routed by consistent hashing on
user_id.
05 · STORAGE & REPLICATION

Blob Storage — Storing the Actual Bytes


Object stores, content addressing, and why we don't use regular databases for files

Hot Storage Object Store Design


NVMe SSD-backed object store
Content-Addressed Keys
Latency: < 5ms | Cost: $$$
Recently accessed files; active sync Key = SHA-256(chunk). Same bytes → same key → automatic dedup. Immutable
objects — never overwritten, only new versions created.

Warm Storage No Hierarchical Namespace


Unlike a filesystem, object stores have a flat namespace. Folders are just prefixes
HDD-backed distributed object store in keys. Metadata service owns the hierarchy, not blob store.
Latency: 20-50ms | Cost: $$ Multipart Upload
Files > 30 days old, infrequently accessed
Large chunks uploaded in parts (e.g. 5 MB parts for a 50 MB chunk). If part
upload fails, only that part is retried. Resumable by default.
Cold Storage Presigned URLs
AWS Glacier / Google Coldline API gateway issues time-limited pre-signed URLs. Client uploads/downloads
Latency: hours | Cost: $ directly to/from object store — gateway never proxies bytes.
Files > 1 year, compliance archival Immutability
Automated lifecycle: Last access > 30 days → move to warm. > 1 year → move Objects are never mutated. A new version of a file = new object keys. Simplifies
to cold. Zero user impact. replication: you only need to propagate writes, never updates.
05 · STORAGE & REPLICATION

Durability — Erasure Coding vs. Replication


How to survive disk failures, rack failures, and datacenter outages at low cost

Simple 3x Replication Erasure Coding (e.g. RS 6+3)

Data Shard 1 Data Shard 2 Data Shard 3


Primary — full copy of data

Data Shard 4 Data Shard 5 Data Shard 6

Replica 1 — full copy of data


Parity Shard P1 Parity Shard P2 Parity Shard P3

Replica 2 — full copy of data

Storage overhead: 3x (200 GB raw → 600 GB stored) Storage overhead: 1.5x (200 GB raw → 300 GB stored)
Survives: 2 disk failures Survives: any 3 shard failures (same durability!)
Pros: Simple, fast reads from any replica Pros: 50% less storage than 3x replication
Cons: Expensive — 200% storage overhead Cons: More CPU, higher reconstruction latency
05 · STORAGE & REPLICATION

Global Distribution — CDN & Geo-Replication


How a file saved in Boston reaches Tokyo in under 100ms

EU-West Global Architecture Decisions


(Frankfurt)
US-East
(Primary)
Regional Replication
AP-East Entire blob store replicated across 3+ geographically separated
(Tokyo) datacenters. Asynchronous replication — primary region commits, then
replicates. Eventual consistency across regions.
AP-SE
(Singapore) CDN for Downloads
Frequently accessed chunks cached at edge PoPs (Points of Presence).
US-West Cache hit → 10-20ms latency. Cache miss → fetch from nearest
(Oregon) datacenter and cache for next user.

Upload Routing
CDN Edge PoPs Client uploads to nearest datacenter to minimize upload latency.
Replication to other regions happens asynchronously post-commit —
NYC LAX LHR FRA SIN NRT
doesn't slow the user.
CAPACITY ESTIMATION

Back-of-Envelope: Dropbox at Scale


Estimating storage, bandwidth, and server requirements for 500M users

Assumptions Derived Metrics

Total users: 500 million Upload QPS (peak 3x avg): ~30,000 req/s

DAU: 100 million (20%) Upload bandwidth (peak): ~30 GB/s

Avg storage per user: 10 GB With 90% dedup savings: ~3 GB/s actual

Total raw storage: 500M × 10 GB = 5 PB Metadata reads (per open): ~1M RPS

With 3x replication: 15 PB stored Notification fanout/s (DAU): ~100M events/day

With EC (1.5x): 7.5 PB stored Chunk store lookup QPS: ~500K RPS

Files changed per DAU: 5 files/day Download bandwidth (CDN hit%): ~50 GB/s gross

Avg file size: 500 KB CDN cache hit rate: ~80% (popular files)

Avg chunk size (CDC): 4 MB (variable) Metadata DB size: ~2 TB (file records)

Daily upload volume: 100M × 5 × 500 KB = 250 TB Metadata shards needed: ~50 shards
COMPLETE ARCHITECTURE

Full System Architecture — All Components


From client filesystem to global CDN — the complete picture
Desktop Mobile Web
Client Client Browser

CDN / Edge
PoPs

Load Balancer API Gateway WebSocket


Geo-
+ DNS GeoDNS (Auth/Rate Limit) Notif Server
Replica
Regions

US-E
Metadata Dedup Upload Message Quota & EU-W
Service Engine Coordinator Queue (Kafka)Billing AP-SE

Metadata DB Block Store Redis Cache


(Sharded Postgres / Spanner) (Object Storage — S3/GCS) (Session / hot metadata)
06 · WRAP-UP

Key Design Trade-offs & Decisions


Every architectural choice is a balance — know the axes
Chunking Strategy Conflict Resolution
A: Fixed-size (simple) A: LWW (simple, data loss)
B: CDC Rabin (efficient) B: Fork copies (safe, manual)

CDC wins for sync efficiency; fixed-size wins for simplicity Fork for file sync; OT/CRDT for real-time collaboration

Durability Strategy Metadata Storage


A: 3x Replication (fast reads, 3x cost) A: Single RDBMS (simple ACID)
B: Erasure Coding (1.5x cost, slower reconstruct) B: Sharded + distributed

EC for warm/cold; replication for hot data Must shard at scale; Spanner offers distributed ACID

Sync Notification Upload Path


A: Polling (simple, high latency) A: API proxy (simple, bottleneck)
B: WebSocket push (low latency, stateful) B: Presigned URL direct (scalable)

WebSocket for real-time; polling acceptable for mobile battery Direct upload always — presigned URLs eliminate gateway bottleneck
KEY TAKEAWAYS

Designing a Global File Storage & Sync Service


01 Separate Metadata from Bytes 02 Content-Defined Chunking

Metadata (filenames, versions, permissions) and blob storage have fundamentally CDC + SHA-256 gives you efficient delta sync (only changed chunks re-uploaded)
different access patterns. Never store them in the same system. and automatic cross-user deduplication — the 90% bandwidth saving.

03 Dedup Before You Upload 04 Version Vectors for Conflicts

Check block store for existing hash before transferring bytes. At scale this saves Use vector clocks to detect whether version divergence is sequential (no conflict,
petabytes. The presigned URL pattern lets clients upload directly to blob store. safe to fast-forward) or concurrent (true conflict, must resolve).

05 Erasure Coding for Durability 06 CDN + Geo-Replication for Speed

EC (e.g. RS 6+3) achieves the same 11-nines durability as 3x replication at 1.5x Upload to nearest DC; async replicate globally. Cache popular chunks at edge
storage overhead vs 3x. Use EC for warm/cold tiers. PoPs. 80% CDN hit rate → most users never touch origin servers.
Video streaming service
• What we will cover today? 1 Video Transcoding Pipeline

2 CDN & Open Connect

3 Adaptive Bitrate Streaming

4 System Trade-offs & Scale

25
The problem
• The Problem: Delivering the Right Video, to Anyone, Anywhere,
Instantly
• Three core challenges:
• Volume at scale
• A popular title can trigger millions of simultaneous streams the moment it drops.
• No single server, datacenter, or network link can absorb that load.
• Network heterogeneity
• Your viewers are on 100 Mbps fibre, 4G LTE, 3G mobile, and congested hotel Wi-Fi —
simultaneously, unpredictably.
• The system must deliver the best possible picture to every viewer given their current
connection, in real time without making them wait!
• Storage and distribution at global scale
• Storing and routing hundreds of petabytes so that any piece of content reaches any user in
any country within milliseconds is a fundamental distributed systems problem

26
The Problem: Video Delivery at Global Scale
Three interlocking engineering challenges that constrain every design decision

260M 100M+ 15% 1,200


Paid subscribers Hours streamed Global internet Encoded variants
worldwide every single day traffic at peak per film title

STORE DIST NETWORK

Storage at Scale Global Distribution Heterogeneous Networks

A single 4K HDR film is 80–100 GB. Netflix 100M hours/day cannot come from one Viewers are on 0.3 Mbps 2G mobile to 100
stores 15,000+ titles, each in dozens of datacenter. A Tokyo viewer cannot wait for Mbps fibre simultaneously. The system must
quality variants. Total storage: hundreds of packets from Virginia — physics dictates never stall — and must maximise quality on
petabytes. architecture. every connection.

CS6650 · Designing a Video Streaming Service 2


High-Level Architecture
Three pipelines, three timescales — the key structural insight
INGESTION PIPELINE (offline — hours/days ahead)

Raw Video Encoding Farm Quality Origin Catalogue


Upload (1,200+ variants) Validation Object Store Metadata DB

DISTRIBUTION LAYER (daily pre-positioning)

Origin S3 CDN Edge Open Connect Cache


Store Servers (PoPs) Appliances (ISPs) Invalidation

CLIENT DELIVERY (real time — every 2–6 seconds)

Fetch ABR Segment Playback


Player Output
Manifest Algorithm Fetch (OCA) Buffer

KEY INSIGHT: The three pipelines operate at completely different timescales — design each one independently with the right tools.
CS6650 · Designing a Video Streaming Service 3
Video Transcoding — One Master, 1,200+ Variants
Why encoding happens offline, and why per-title profiling matters
Resolution Bitrate Codec Target Device

4K UHD 15–20 Mbps H.265/HEVC HDR screens

MASTER
1080p HD 5–8 Mbps H.264/AVC Smart TVs, PC
FILE
Encoding
4K RAW Farm
~100 GB 720p 3–5 Mbps H.264 Tablets

Studio
Delivery 480p 0.7–1 Mbps H.264 / VP9 Mobile (LTE)

240p 0.3 Mbps VP9 Mobile (2G/3G)

Per-title encoding: Netflix profiles each film individually — an action film with fast motion needs different encoder settings than a slow dialogue scene. Same
perceived quality, lower bitrate.
CS6650 · Designing a Video Streaming Service 4
Content Delivery — CDN & Open Connect
How Netflix ships servers inside ISP networks to eliminate backbone latency

The Origin-Server Problem Delivery Flow

A viewer in Tokyo streaming from a US datacenter experiences


~150ms round-trip latency per request. Video players request a AWS Origin
new segment every 2–6 seconds — this latency compounds into (S3 + Servers)
constant buffering regardless of connection speed.
nightly push

Netflix Open Connect (OCA)


Netflix ships physical server appliances to ISPs — free of OCA Server OCA Server
charge (ISP · Tokyo) (ISP · Sao Paulo)
Appliances pre-loaded nightly with most-popular local content

Viewer traffic never leaves the ISP's own network LOCAL LOCAL

Over 95% of Netflix traffic served from OCA, not cloud


Viewer (Japan) Viewer (Brazil)
1,000+ ISP partners in 1,000+ locations worldwide

CS6650 · Designing a Video Streaming Service 5


Adaptive Bitrate Streaming (ABR)
How the player adjusts quality every few seconds to fit the available network bandwidth

Core Mechanism ABR Quality Switching — Example Playback


Each bar = one 2-second segment
Video is pre-cut into 2–6 second segments. Every segment is
encoded at multiple quality levels. The player downloads one
segment at a time and can choose a different quality level for
each, based on current network conditions.

DASH (ISO standard) and HLS (Apple) both implement this: a


manifest file indexes all quality levels and segment URLs.
1080p 1080p
5M 5M
Network
drop!
720p 720p
3.5M 3.5M
ABR Decision Inputs
EWMA of recent segment download 480p
Throughput estimate 1M
speeds 360p
0.5M
If >30s: upgrade · if <10s: step down
Buffer level
immediately
Up one level at a time; drop fast to
Switching rules S1 S2 S3 S4 S5 S6
prevent stall

CS6650 · Designing a Video Streaming Service 6


Complete System — All Components Together
From studio upload to viewer screen — the distributed data path
INGESTION (offline)

Encoding Farm
Studio Master QC Validation Origin S3 Store Catalogue DB
(1,200+ variants)

DISTRIBUTION (daily)

manifest
CDN PoPs Open Connect Notification Message Queue feed
(Edge cache) Appliances Service (Kafka)

CLIENT (real time) fast path (95% of traffic)

Manifest ABR Segment Playback


Player
Request Algorithm Fetch (OCA) Buffer

CS6650 · Designing a Video Streaming Service 8


Distributed Systems Trade-offs
Every architectural decision is a trade-off — here are the critical six

Consistency vs Availability Centralised vs Edge Storage


Netflix uses eventual consistency for catalogue metadata (title availability may Origin store is the authoritative single source. Delivery is edge-first via OCA. 95%+
vary briefly by region). Result: 99.99% availability. For billing and account data, of reads hit the edge. Writes always go through origin. Classic read-heavy CDN
stronger consistency is enforced. pattern.

→ AP system for content · CP for transactions → Write to origin · read from edge

Fixed vs Adaptive Quality Stateless vs Stateful Clients


Fixed quality wastes bandwidth on fast connections and buffers on slow ones. ABR Segment URLs are standard HTTP URLs. Clients are fully stateless — any CDN node
adapts every 2–6 seconds, trading algorithm complexity for universal can serve any request. This enables trivial horizontal scaling, failover, and global
compatibility and rebuffer elimination. load balancing.

→ Adaptive always wins on heterogeneous networks → Stateless HTTP segments — scales horizontally

Push vs Pull Distribution Encode Once vs Re-encode


Netflix pushes popular content to OCA nightly (proactive). Clients pull segments All transcoding is offline. Per-title encoding profiles each film for optimal bitrate-
on demand (reactive). Proactive push drives cache hit rates on popular titles to quality. Re-encoding at delivery time is infeasible at 150 Tbps. Storage cost for
near 100%. 1,200 variants per title is amortised over millions of plays.

→ Push popular content · pull on cache miss → Offline, once — amortised over millions of streams

CS6650 · Designing a Video Streaming Service 9


Back-of-Envelope: Netflix at Scale
Deriving the numbers that justify every architectural decision

Assumptions Derived Metrics

Daily Active Users 100 million Peak concurrent streams


~30 million 100M × 90min / 24h × 3×
Avg watch time / DAU 90 min / day
Peak outbound bandwidth
Avg stream bitrate 5 Mbps (1080p)
~150 Tbps 30M × 5 Mbps

Peak traffic multiplier 3× average


OCA-served bandwidth
Titles in library 15,000 ~142 Tbps 95% of 150 Tbps served locally

Variants per title 1,200 files Total encoded storage


~72 PB 15k × 1,200 × 4 GB
Avg file size / variant ~4 GB
Daily upload volume
Total encoded storage 15k × 1,200 × 4 GB
~100 TB/day ~25 new titles / day × avg variants

CS6650 · Designing a Video Streaming Service 10


SUMMARY

Six Principles to Carry Forward


ENCODE CDN
01 02
Encode Once, Serve Many Co-Design with Infrastructure
All transcoding is offline. One 100 GB master → 1,200+ variants. Per-title encoding Netflix ships servers to ISPs. 95%+ traffic is local. The architecture fights physics by
maximises quality per bit. Delivery serves static files. eliminating distance — a CDN is not enough.

ABR DASH
03 04
Never Buffer — Gracefully Degrade Stateless HTTP = Free Scalability
2–6s segments + EWMA bandwidth + buffer zones = smooth playback on any Streaming video is HTTP file downloads, structured cleverly. Stateless GETs give
network. Quality degradation always beats stalling. CDN caching, failover, and horizontal scaling for free.

SCALE METRIC
05 06
Separate by Timescale One North-Star Metric
Ingestion (offline), distribution (daily), delivery (real-time) operate independently. Rebuffering ratio = % of time spent on the loading spinner. Netflix tunes every
Design each layer for its own timescale and tool set. algorithm, every CDN node, every encoder setting to reduce this number.
Thank you!
• Questions?

37

You might also like