Week 12 System Design
Week 12 – Designing Large-Scale Systems
• Designing a Global File Storage & Sync Service
• The Problem & Scale
• High-Level Architecture
• File Chunking & Deduplication
• Sync Protocol
• Storage & Replication
• Wrap-up & Trade-offs
• Video streaming service
01 · THE PROBLEM
What does a global storage service guarantee?
• Any file, on any device, anywhere in the world, always up to date
• Both consistency and availability are needed, which demands geo-distribution.
Durability: Files must survive disk failures, datacenter outages, even natural disasters. Target: 11 nines (99.999999999%) durability.
Consistency: When you save a file on your laptop, your phone must eventually see the same version: no stale reads, no silent data loss.
[Diagram: Client Layer, API Gateway, Metadata Service, Block / Blob Storage, CDN / Edge]
Upload Throttling: The daemon measures available bandwidth and caps upload rate to avoid saturating the user's connection. Configurable in Dropbox settings.
Chunker + Hasher (SHA-256 per block)
Single file (8 MB), naive approach: re-upload the entire 8 MB on every edit
Chunk 1 (4 MB) [UNCHANGED] | Chunk 2 (4 MB) [CHANGED]
Upload only Chunk 2 (4 MB): saves 50% bandwidth. At scale: 90%+ savings from deduplication.
Fixed-size chunks: Simple. 4–8 MB blocks. Problem: inserting 1 byte at position 0 shifts every chunk boundary, so all chunks change.
Variable-size (CDC): Content-Defined Chunking (Rabin fingerprint). Boundaries based on content, not position. Insertion only affects nearby chunks.
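The difference between the two chunking strategies can be seen in a small experiment. This is a minimal sketch, not a production chunker: a gear-style rolling hash stands in for the Rabin fingerprint, and the sizes (`MIN_SIZE`, `MAX_SIZE`, `MASK`, 64-byte fixed chunks) are hypothetical values scaled down from the 4–8 MB range so the demo runs on a few KB of random data.

```python
import hashlib
import random

# Hypothetical parameters, scaled down so the demo runs on a few KB of data.
MIN_SIZE = 16        # never cut a chunk shorter than this
MAX_SIZE = 256       # always cut once a chunk reaches this size
MASK = (1 << 6) - 1  # cut when the low 6 hash bits are zero (about 1 in 64 positions)

# Fixed table of pseudo-random 32-bit values, one per possible byte value.
_table_rng = random.Random(2024)
GEAR = [_table_rng.getrandbits(32) for _ in range(256)]


def fixed_chunks(data: bytes, size: int = 64) -> list[bytes]:
    """Split at fixed offsets: boundaries depend on position only."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def cdc_chunks(data: bytes) -> list[bytes]:
    """Split where a rolling hash of the most recent bytes hits a bit pattern."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Each step shifts older bytes out of the low bits, so the cut
        # decision depends only on the last few bytes of content.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        size = i - start + 1
        if (size >= MIN_SIZE and (h & MASK) == 0) or size >= MAX_SIZE:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def shared_fraction(old: list[bytes], new: list[bytes]) -> float:
    """Fraction of new chunks whose SHA-256 already exists among the old ones."""
    known = {hashlib.sha256(c).digest() for c in old}
    hits = sum(hashlib.sha256(c).digest() in known for c in new)
    return hits / len(new)


data_rng = random.Random(7)
original = bytes(data_rng.getrandbits(8) for _ in range(20_000))
edited = b"!" + original  # insert 1 byte at position 0

print("fixed:", shared_fraction(fixed_chunks(original), fixed_chunks(edited)))
print("cdc:  ", shared_fraction(cdc_chunks(original), cdc_chunks(edited)))
```

With fixed-size chunks the one-byte insertion leaves no chunk reusable, while the content-defined boundaries realign shortly after the edit, so most chunk hashes are already known to the block store.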
03 · CHUNKING & DEDUP
Dedup check:
• Client computes SHA-256(chunk)
• Query block store: does hash exist?
• EXISTS: store only pointer, skip upload
• NOT EXISTS: upload chunk, store bytes
3. Return: missing_chunks[]
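The hash-before-upload check can be sketched as follows. `BlockStore`, `missing`, `put`, and `upload` are hypothetical names for illustration; a real block store would be a remote service, not an in-memory dict.

```python
import hashlib


class BlockStore:
    """Hypothetical in-memory stand-in for the block store (hash -> bytes)."""

    def __init__(self) -> None:
        self._blocks: dict[str, bytes] = {}

    def missing(self, hashes: list[str]) -> list[str]:
        """Return the hashes the store does not hold yet (missing_chunks[])."""
        return [h for h in hashes if h not in self._blocks]

    def put(self, chunk: bytes) -> str:
        digest = hashlib.sha256(chunk).hexdigest()
        self._blocks[digest] = chunk
        return digest


def upload(store: BlockStore, chunks: list[bytes]) -> tuple[list[str], int]:
    """Upload only unknown chunks; return (file manifest, bytes actually sent)."""
    manifest = [hashlib.sha256(c).hexdigest() for c in chunks]
    needed = set(store.missing(manifest))
    sent = 0
    for chunk, digest in zip(chunks, manifest):
        if digest in needed:
            store.put(chunk)
            needed.discard(digest)  # identical chunks inside one file go up once
            sent += len(chunk)
    return manifest, sent


store = BlockStore()
v1 = [b"A" * 4, b"B" * 4]
v2 = [b"A" * 4, b"C" * 4]  # only the second chunk changed
manifest_v1, sent_v1 = upload(store, v1)
manifest_v2, sent_v2 = upload(store, v2)
print(sent_v1, sent_v2)
```

The manifest (an ordered list of chunk hashes) is what the metadata service would record for each file version; the bytes themselves are stored once per distinct hash.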
Online sync (Device A → Device B):
1. File change detected on Device A
2. Client uploads new chunks (delta only)
3. Metadata commit: new version_id created
4. Server pushes notification via long-poll / WebSocket
5. Device B receives notification with version_id
6. Device B fetches changed chunks from CDN/store
7. Device B applies changes to local file
8. Both devices now on same version_id
Offline edit and reconnect (Device B):
1. Device B edited file offline (no connectivity)
2. Server has version_id=5 (edited by Device A)
3. Device B had version_id=3 when it went offline
4. Device B reconnects and sends local version_id=3
5. Server detects divergence (version gap)
6. Server computes delta: versions 4 and 5 sent down
7. Conflict check: did Device B edit same lines?
8. If conflict → conflict resolution (next slide)
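The reconnect steps can be sketched as a version-gap check on the server. This is an illustrative sketch only: `FileHistory`, `delta_since`, and `on_reconnect` are hypothetical names, and the chunk hashes are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class FileHistory:
    """Hypothetical server-side record: version_id -> chunk hashes changed in it."""
    versions: dict[int, list[str]] = field(default_factory=dict)

    def delta_since(self, client_version: int) -> dict[int, list[str]]:
        """Versions the client has not seen yet, oldest first."""
        return {v: self.versions[v]
                for v in sorted(self.versions) if v > client_version}


def on_reconnect(history: FileHistory, client_version: int,
                 client_has_local_edits: bool) -> tuple[str, dict[int, list[str]]]:
    """Decide what to do when a device reports its last known version_id."""
    delta = history.delta_since(client_version)
    if not delta:
        return "up_to_date", delta
    if client_has_local_edits:
        return "needs_conflict_check", delta  # server and client both moved on
    return "fast_forward", delta              # client simply applies the delta


history = FileHistory({1: ["h1"], 2: ["h2"], 3: ["h3"], 4: ["h4"], 5: ["h5a", "h5b"]})
status, delta = on_reconnect(history, client_version=3, client_has_local_edits=True)
print(status, sorted(delta))
```

A single integer version_id can reveal a gap but cannot tell by itself whether the offline edit truly conflicts; that is the job of the conflict check in step 7.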
04 · SYNC PROTOCOL
Example:
Server: { A:3, B:1 }
Client: { A:3, B:2 }
→ Client has newer B edits, no conflict
True Conflict: Concurrent Edits
Neither side dominates. Both Device A and Device B edited the same file in the same region while disconnected. Must be resolved.
Server: { A:5, B:1 }
Client: { A:3, B:2 }
→ Neither dominates → CONFLICT
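The two examples above follow from a per-device comparison of the counters. A minimal sketch, assuming clocks are plain dicts and a missing device counts as 0 (`compare` and its return labels are hypothetical):

```python
def compare(server: dict[str, int], client: dict[str, int]) -> str:
    """Compare two vector clocks entry by entry."""
    devices = server.keys() | client.keys()
    server_ahead = any(server.get(d, 0) > client.get(d, 0) for d in devices)
    client_ahead = any(client.get(d, 0) > server.get(d, 0) for d in devices)
    if server_ahead and client_ahead:
        return "conflict"       # neither clock dominates: concurrent edits
    if client_ahead:
        return "client_newer"   # client dominates: safe to accept its version
    if server_ahead:
        return "server_newer"   # server dominates: client fast-forwards
    return "equal"


print(compare({"A": 3, "B": 1}, {"A": 3, "B": 2}))
print(compare({"A": 5, "B": 1}, {"A": 3, "B": 2}))
```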
STRATEGY B (fork): Both versions are kept. A conflict copy is created: 'report (Alice's conflicted copy 2024-01-15).pdf'
STRATEGY C (Operational Transformation / CRDT): Track individual operations (insert char at pos 42). Mathematically merge concurrent operations without conflicts. Best UX, complex to build.
Used by: Google Docs (OT), Figma (CRDT)
Pros: Real-time, seamless
Cons: Complex; requires structured data model
Design Insight: Most file sync services use Strategy B (fork). Google Docs uses Strategy C (OT) because it stores text documents: highly structured, line-level merges are possible.
Problem: Device B has version V1 of a 100 MB file. Server has version V2. Transfer only the diff.
V2 (server has): Block A, Block B, Block C', Block D, Block E', Block F (primed blocks changed since V1, so only C' and E' are transferred)
Real-Time Notification
Long-polling vs WebSocket vs Server-Sent Events for sync event delivery
Long-polling
• PROS: Simple, firewall-friendly, works everywhere
• CONS: Higher latency (new connection per event), overhead per reconnect
• Used by: Dropbox original approach
WebSocket
• PROS: Lowest latency, efficient for frequent updates
• CONS: Stateful, harder to scale (connection affinity needed)
• Used by: Google Drive, modern Dropbox
Server-Sent Events
• PROS: Works with HTTP/2 multiplexing, simple implementation
• CONS: Half-duplex, upload path is separate
• Used by: some mobile clients
At Google scale: dedicated notification servers (not API servers) hold millions of open connections. Routed by consistent hashing on
user_id.
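Routing by consistent hashing on user_id can be sketched with a hash ring. This is a minimal sketch under stated assumptions: the server names, the `HashRing` class, and the choice of 100 virtual nodes per server are all hypothetical.

```python
import bisect
import hashlib


class HashRing:
    """Consistent-hash ring with virtual nodes (illustrative sketch)."""

    def __init__(self, servers: list[str], vnodes: int = 100) -> None:
        self._ring: list[tuple[int, str]] = []
        for server in servers:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{server}#{i}"), server))
        self._ring.sort()
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def server_for(self, user_id: str) -> str:
        """Walk clockwise to the first ring point after the user's hash."""
        idx = bisect.bisect_right(self._points, self._hash(user_id)) % len(self._ring)
        return self._ring[idx][1]


servers = ["notify-1", "notify-2", "notify-3"]  # hypothetical server names
ring = HashRing(servers)
bigger = HashRing(servers + ["notify-4"])
users = [f"user-{n}" for n in range(1000)]
moved = [u for u in users if ring.server_for(u) != bigger.server_for(u)]
print(f"{len(moved)} of {len(users)} users re-routed after adding a server")
```

The property that matters for long-lived connections: when a server is added, the only users who reconnect elsewhere are those that move to the new server; everyone else keeps their existing connection.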
05 · STORAGE & REPLICATION
3x Replication
• Storage overhead: 3x (200 GB raw → 600 GB stored)
• Survives: 2 disk failures
• Pros: Simple, fast reads from any replica
• Cons: Expensive, 200% storage overhead
Erasure Coding (e.g. RS 6+3)
• Storage overhead: 1.5x (200 GB raw → 300 GB stored)
• Survives: any 3 shard failures (same durability!)
• Pros: 50% less storage than 3x replication
• Cons: More CPU, higher reconstruction latency
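The overhead figures follow directly from the shard counts. A small worked sketch, assuming a Reed-Solomon layout of 6 data + 3 parity shards (the function names are hypothetical):

```python
def replication(raw_gb: float, copies: int = 3) -> dict:
    """Full copies: every byte is stored `copies` times."""
    return {
        "stored_gb": raw_gb * copies,
        "overhead_factor": copies,
        "tolerated_losses": copies - 1,
    }


def erasure_coding(raw_gb: float, data_shards: int = 6, parity_shards: int = 3) -> dict:
    """RS(data+parity): any `data_shards` of the shards rebuild the object."""
    factor = (data_shards + parity_shards) / data_shards
    return {
        "stored_gb": raw_gb * factor,
        "overhead_factor": factor,
        "tolerated_losses": parity_shards,
    }


print(replication(200))
print(erasure_coding(200))
```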
Upload Routing: Client uploads to nearest datacenter to minimize upload latency. Replication to other regions happens asynchronously post-commit, so it doesn't slow the user.
CDN Edge PoPs: NYC, LAX, LHR, FRA, SIN, NRT
CAPACITY ESTIMATION
Storage and upload volume:
• Total users: 500 million
• Avg storage per user: 10 GB
• Total raw storage: 500M × 10 GB = 5 EB
• With EC (1.5x): 7.5 EB stored
• Files changed per DAU: 5 files/day
• Avg file size: 500 KB
• Daily upload volume: 100M DAU × 5 × 500 KB = 250 TB
Traffic:
• Upload QPS (peak 3x avg): ~30,000 req/s
• With 90% dedup savings: ~3 GB/s actual
• Metadata reads (per open): ~1M RPS
• Chunk store lookup QPS: ~500K RPS
• Download bandwidth (CDN hit%): ~50 GB/s gross
• CDN cache hit rate: ~80% (popular files)
• Metadata shards needed: ~50 shards
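The storage and upload arithmetic can be checked in a few lines. This sketch uses decimal units (1 TB = 1e9 KB, 1 EB = 1e9 GB) and assumes the 100M in the daily-upload line is the number of daily active users; note that 500M users × 10 GB works out to 5 EB of raw storage.

```python
USERS = 500_000_000
STORAGE_PER_USER_GB = 10
EC_FACTOR = 1.5
DAU = 100_000_000          # assumption: the "100M" in the daily upload line
FILES_CHANGED_PER_DAU = 5
AVG_FILE_KB = 500
SECONDS_PER_DAY = 86_400

raw_gb = USERS * STORAGE_PER_USER_GB
raw_eb = raw_gb / 1e9                  # GB -> EB
stored_eb = raw_eb * EC_FACTOR         # after erasure-coding overhead

daily_upload_kb = DAU * FILES_CHANGED_PER_DAU * AVG_FILE_KB
daily_upload_tb = daily_upload_kb / 1e9            # KB -> TB
avg_upload_gb_s = daily_upload_tb * 1000 / SECONDS_PER_DAY

print(f"raw storage:    {raw_eb} EB")
print(f"stored with EC: {stored_eb} EB")
print(f"daily upload:   {daily_upload_tb} TB")
print(f"avg upload:     {avg_upload_gb_s:.2f} GB/s")
```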
COMPLETE ARCHITECTURE
[Diagram: CDN / Edge PoPs; regions US-E, EU-W, AP-SE; Metadata Service, Dedup Engine, Upload Coordinator, Message Queue (Kafka), Quota & Billing]
• CDC wins for sync efficiency; fixed-size wins for simplicity
• Fork for file sync; OT/CRDT for real-time collaboration
• EC for warm/cold; replication for hot data
• Must shard at scale; Spanner offers distributed ACID
• WebSocket for real-time; polling acceptable for mobile battery
• Direct upload always: presigned URLs eliminate gateway bottleneck
KEY TAKEAWAYS
• Metadata (filenames, versions, permissions) and blob storage have fundamentally different access patterns. Never store them in the same system.
• CDC + SHA-256 gives you efficient delta sync (only changed chunks re-uploaded) and automatic cross-user deduplication: the 90% bandwidth saving.
• Check block store for existing hash before transferring bytes. At scale this saves petabytes. The presigned URL pattern lets clients upload directly to blob store.
• Use vector clocks to detect whether version divergence is sequential (no conflict, safe to fast-forward) or concurrent (true conflict, must resolve).
• EC (e.g. RS 6+3) achieves the same 11-nines durability as 3x replication at 1.5x storage overhead vs 3x. Use EC for warm/cold tiers.
• Upload to nearest DC; async replicate globally. Cache popular chunks at edge PoPs. 80% CDN hit rate → most users never touch origin servers.
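The presigned URL pattern mentioned in the takeaways can be illustrated with an HMAC-signed, expiring URL. This is a generic sketch of the idea, not the signing scheme of any particular cloud provider: the secret, the `blobstore.example` host, the `/blocks/` path, and the `presign` / `verify` helpers are all hypothetical.

```python
import hashlib
import hmac
import time
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"demo-secret"  # hypothetical key shared by the API tier and the blob store


def presign(chunk_hash: str, expires_at: int) -> str:
    """API tier: authorize one PUT of one chunk until `expires_at` (Unix time)."""
    message = f"PUT\n/blocks/{chunk_hash}\n{expires_at}".encode()
    signature = hmac.new(SECRET, message, hashlib.sha256).hexdigest()
    query = urlencode({"expires": expires_at, "sig": signature})
    return f"https://blobstore.example/blocks/{chunk_hash}?{query}"


def verify(url: str, now: int) -> bool:
    """Blob store: accept the upload only if the signature matches and is unexpired."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires_at = int(params["expires"][0])
        signature = params["sig"][0]
    except (KeyError, ValueError):
        return False
    message = f"PUT\n{parsed.path}\n{expires_at}".encode()
    expected = hmac.new(SECRET, message, hashlib.sha256).hexdigest()
    return hmac.compare_digest(signature, expected) and now <= expires_at


url = presign("abc123", int(time.time()) + 300)  # valid for five minutes
print(url)
print(verify(url, int(time.time())))
```

Because the blob store can validate the URL on its own, the chunk bytes never pass through the API gateway; the gateway only hands out short-lived permissions.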
Video streaming service
• What we will cover today: 1. Video Transcoding Pipeline
The problem
• The Problem: Delivering the Right Video, to Anyone, Anywhere, Instantly
• Three core challenges:
• Volume at scale
• A popular title can trigger millions of simultaneous streams the moment it drops.
• No single server, datacenter, or network link can absorb that load.
• Network heterogeneity
• Your viewers are on 100 Mbps fibre, 4G LTE, 3G mobile, and congested hotel Wi-Fi, simultaneously and unpredictably.
• The system must deliver the best possible picture to every viewer given their current connection, in real time, without making them wait.
• Storage and distribution at global scale
• Storing and routing hundreds of petabytes so that any piece of content reaches any user in any country within milliseconds is a fundamental distributed systems problem.
The Problem: Video Delivery at Global Scale
Three interlocking engineering challenges that constrain every design decision
• A single 4K HDR film is 80–100 GB. Netflix stores 15,000+ titles, each in dozens of quality variants. Total storage: hundreds of petabytes.
• 100M hours/day cannot come from one datacenter. A Tokyo viewer cannot wait for packets from Virginia: physics dictates architecture.
• Viewers are on 0.3 Mbps 2G mobile to 100 Mbps fibre simultaneously. The system must never stall, and must maximise quality on every connection.
KEY INSIGHT: The three pipelines operate at completely different timescales — design each one independently with the right tools.
CS6650 · Designing a Video Streaming Service 3
Video Transcoding — One Master, 1,200+ Variants
Why encoding happens offline, and why per-title profiling matters
[Diagram: Studio Delivery → MASTER FILE (4K RAW, ~100 GB) → Encoding Farm]
Resolution   Bitrate      Codec         Target Device
1080p HD     5–8 Mbps     H.264/AVC     Smart TVs, PC
720p         3–5 Mbps     H.264         Tablets
480p         0.7–1 Mbps   H.264 / VP9   Mobile (LTE)
Per-title encoding: Netflix profiles each film individually. An action film with fast motion needs different encoder settings than a slow dialogue scene. Same perceived quality, lower bitrate.
Content Delivery — CDN & Open Connect
How Netflix ships servers inside ISP networks to eliminate backbone latency
Viewer traffic never leaves the ISP's own network
[Diagram: Studio Master, QC Validation, Encoding Farm (1,200+ variants), Origin S3 Store, Catalogue DB; DISTRIBUTION (daily): CDN PoPs (edge cache), Open Connect Appliances, Notification Service, Message Queue (Kafka), manifest feed]
→ AP system for content · CP for transactions
→ Write to origin · read from edge
→ Adaptive always wins on heterogeneous networks
→ Stateless HTTP segments: scales horizontally
→ Push popular content · pull on cache miss
→ Offline, once: amortised over millions of streams
03 · ABR: Never Buffer, Gracefully Degrade
2–6s segments + EWMA bandwidth + buffer zones = smooth playback on any network. Quality degradation always beats stalling.
04 · DASH: Stateless HTTP = Free Scalability
Streaming video is HTTP file downloads, structured cleverly. Stateless GETs give CDN caching, failover, and horizontal scaling for free.
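The ABR recipe (EWMA bandwidth estimate plus buffer zones) can be sketched as a small controller. All thresholds here are assumptions for illustration: the bitrate ladder takes the upper end of each range from the transcoding table, and `alpha`, the 5 s / 15 s buffer zones, and the 0.8 safety factor are hypothetical tuning values, not figures from any real player.

```python
# Assumed ladder in kbps: upper end of each bitrate range from the transcoding table.
LADDER_KBPS = {"480p": 1000, "720p": 5000, "1080p": 8000}


class AbrController:
    """Pick the next segment's quality from throughput history and buffer level."""

    def __init__(self, alpha: float = 0.3,
                 low_buffer_s: float = 5.0, high_buffer_s: float = 15.0) -> None:
        self.alpha = alpha              # weight of the newest throughput sample
        self.low = low_buffer_s         # below this: panic zone
        self.high = high_buffer_s       # above this: comfortable zone
        self.estimate_kbps: float | None = None

    def observe(self, segment_bits: int, download_s: float) -> float:
        """Fold one segment download into the EWMA bandwidth estimate."""
        sample = segment_bits / download_s / 1000
        if self.estimate_kbps is None:
            self.estimate_kbps = sample
        else:
            self.estimate_kbps = (self.alpha * sample
                                  + (1 - self.alpha) * self.estimate_kbps)
        return self.estimate_kbps

    def choose(self, buffer_s: float) -> str:
        lowest = min(LADDER_KBPS, key=LADDER_KBPS.get)
        if self.estimate_kbps is None or buffer_s < self.low:
            return lowest               # degrade rather than risk a stall
        safety = 0.8 if buffer_s < self.high else 1.0
        budget = self.estimate_kbps * safety
        fitting = [name for name, kbps in LADDER_KBPS.items() if kbps <= budget]
        return max(fitting, key=LADDER_KBPS.get) if fitting else lowest


abr = AbrController()
abr.observe(segment_bits=40_000_000, download_s=4.0)  # fast download: 10,000 kbps
print(abr.choose(buffer_s=20.0))
abr.observe(segment_bits=4_000_000, download_s=4.0)   # network slows to 1,000 kbps
print(abr.choose(buffer_s=20.0))
print(abr.choose(buffer_s=3.0))
```

The EWMA smooths out a single slow segment instead of reacting to it immediately, while the low-buffer zone overrides the estimate entirely so that playback drops quality before it stalls.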
05 · SCALE: Separate by Timescale
Ingestion (offline), distribution (daily), delivery (real-time) operate independently. Design each layer for its own timescale and tool set.
06 · METRIC: One North-Star Metric
Rebuffering ratio = % of time spent on the loading spinner. Netflix tunes every algorithm, every CDN node, every encoder setting to reduce this number.
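The rebuffering ratio can be computed from a playback session log. A minimal sketch, assuming the session is recorded as (state, seconds) pairs; the event format and function name are hypothetical.

```python
def rebuffering_ratio(events: list[tuple[str, float]]) -> float:
    """Share of session time spent stalled (on the loading spinner)."""
    stalled = sum(seconds for state, seconds in events if state == "stalled")
    total = sum(seconds for _, seconds in events)
    return stalled / total if total else 0.0


session = [("playing", 60.0), ("stalled", 5.0), ("playing", 35.0)]
print(f"{rebuffering_ratio(session):.1%}")
```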
Thank you!
• Questions?