The Raft Consensus Algorithm
Originally by :Diego Ongaro John Ousterhout
Stanford University
Edited using content from the paper as well.
[Link]
What is Consensus?
● Agreement on shared state (single system image)
● Recovers from server failures autonomously
Minority of servers fail: no problem
Majority fail: lose availability, retain consistency
Servers
● Key to building consistent storage systems
September 2014 Raft Consensus Algorithm Slide 2
Inside a Consistent System
● TODO: eliminate single point of failure
● An ad hoc algorithm
“This case is rare and typically occurs as a result
of a network partition with replication lag.”
– OR –
● A consensus algorithm (built-in or library)
Paxos, Raft, …
● A consensus service
ZooKeeper, etcd, consul, …
September 2014 Raft Consensus Algorithm Slide 3
Replicated State Machines
Clients
z6
Consensus State Consensus State Consensus State
Module Machine Module Machine Module Machine
x 1 x 1 x 1
y 2 y 2 y 2
Servers
z 6 z 6 z 6
Log Log Log
x3 y2 x1 z6 x3 y2 x1 z6 x3 y2 x1 z6
● Replicated log replicated state machine
All servers execute same commands in same order
● Consensus module ensures proper log replication
● System makes progress as long as any majority of servers are up
● Failure model: fail-stop (not Byzantine), delayed/lost messages
September 2014 Raft Consensus Algorithm Slide 4
Paxos Protocol
● Leslie Lamport, 1989
● Nearly synonymous with consensus
“The dirty little secret of the NSDI community is that at
most five people really, truly understand every part of
Paxos ;-).” – NSDI reviewer
“There are significant gaps between the description of
the Paxos algorithm and the needs of a real-world
system…the final system will be based on an unproven
protocol.” – Chubby authors
September 2014 Raft Consensus Algorithm Slide 5
Raft’s Design for
Understandability
● We wanted the best algorithm for building real
systems
Must be correct, complete, and perform well
Must also be understandable
● “What would be easier to understand or explain?”
Fundamentally different decomposition than Paxos
Less complexity in state space
Less mechanism
September 2014 Raft Consensus Algorithm Slide 6
Raft Overview
1. Leader election
Select one of the servers to act as cluster leader
Detect crashes, choose new leader
2. Log replication (normal operation)
Leader takes commands from clients, appends them
to its log
Leader replicates its log to other servers (overwriting
inconsistencies)
3. Safety
How to keep log consistent even on leader crash.
Could crash half way between log replication.
Only servers with uptodate logs can be leaders
September 2014 Raft Consensus Algorithm Slide 7
Leader election
Time out (random numbers ) on not receiving any heart beat.
First server to time out will send out leader election requests.
If gets vote from majority then he considers himself as leader and
starts sending heartbeat msgs
For every heart beat received all reset their timeout value
Log replication
Client when wants to write sends to leader.
Leader writes to its log
Sends requests to other servers to write into log
After replication done then marks that it can be used by state machine
September 2014 Raft Consensus Algorithm Slide 8
Terms
Detect obsolete leader.
One term is for one leader. When leader crashes then next term.
No global view of terms. Each server has his own understanding of term numbe
Updated when servers talk to each other and ensures he is follower if leader.
If gets client request then he wont execute and just steps down.
September 2014 Raft Consensus Algorithm Slide 9
Core Raft Review
1. Leader election
Heartbeats and timeouts to detect crashes
Randomized timeouts to avoid split votes
Majority voting to guarantee at most one leader per term
2. Log replication (normal operation)
Leader takes commands from clients, appends them to its log
Leader replicates its log to other servers
Built-in consistency check simplifies how logs may differ
3. Safety
Only elect leaders with all committed entries in their logs
New leader defers committing entries from prior terms
September 2014 Raft Consensus Algorithm Slide 10
Leader Operation
● Client sends request to leader
● Leader appends to its log and sends it to followers
to append
● Once leader gets enough response from followers
he then executes on his state machine and tells
client. For next RPCS it also informs other followers
who also executes on their SM.
● Leader keeps connecting to followers till its done if
follower crashes meanwhile
Assumption: majority of servers always up.
September 2014 Raft Consensus Algorithm Slide 11
Log entries
September 2014 Raft Consensus Algorithm Slide 12
Inconsistencies: Can this happen?
September 2014 Raft Consensus Algorithm Slide 13
Conclusions
● Consensus widely regarded as difficult
● Raft designed for understandability
Easier to teach in classrooms
Better foundation for building practical systems
● Paper/thesis covers much more
Cluster membership changes (simpler in thesis)
Log compaction (expanded tech report/thesis)
Client interaction (expanded tech report/thesis)
Evaluation (thesis)
September 2014 Raft Consensus Algorithm Slide 14
Questions
[Link]
September 2014 Raft Consensus Algorithm Slide 15