The inherent complexity
of stream processing
Easy problem
i want this talk to be interactive...
going deep into technical details
please do not hesitate to jump in with any questions
struct PageView {
UserID id,
String url,
Timestamp timestamp
}
Implement:
function NumUniqueVisitors(
String url,
int startHour,
int endHour)
Unique visitors over a range
A
B
C
A
B
C
D
E
C
D
E
1 2 3 4
Hours
Pageviews
uniques for just hour 1 = 3
uniques for hours 1 and 2 = 3
uniques for 1 to 3 = 5
uniques for 2-4 = 4
Notes:
• Not limiting ourselves to current tooling
• Reasonable variations of existing tooling
are acceptable
• Interested in what’s fundamentally possible
Traditional
Architectures
Application Databases
Application Databases
Stream
processor
Queue
Synchronous
Asynchronous
synchronous
asynchronous
characterized by maintaining state incrementally as data comes in and serving queries off of
that same state
Approach #1
• Use Key->Set database
• Key = [URL, hour bucket]
• Value = Set of UserIDs
Approach #1
• Queries:
• Get all sets for all hours in range of
query
• Union sets together
• Compute count of merged set
Approach #1
• Lot of database lookups for large ranges
• Potentially a lot of items in sets, so lots of
work to merge/count
• Database will use a lot of space
Approach #2
Use HyperLogLog
interface HyperLogLog {
boolean add(Object o);
long size();
HyperLogLog merge(HyperLogLog... otherSets);
}
1 KB to estimate size up to 1B with only 2% error
Approach #2
• Use Key->HyperLogLog database
• Key = [URL, hour bucket]
• Value = HyperLogLog structure
it’s not a stretch to imagine a database that can do hyperloglog natively, so updates don’t
require fetching the entire set
Approach #2
• Queries:
• Get all HyperLogLog structures for all
hours in range of query
• Merge structures together
• Retrieve count from merged structure
Approach #2
• Much more efficient use of storage
• Less work at query time
• Mild accuracy tradeoff
Approach #2
• Large ranges still require lots of database
lookups / work
Approach #3
• Use Key->HyperLogLog database
• Key = [URL, bucket, granularity]
• Value = HyperLogLog structure
it’s not a stretch to imagine a database that can do hyperloglog natively, so updates don’t
require fetching the entire set
Approach #3
• Queries:
• Compute minimal number of database
lookups to satisfy range
• Get all HyperLogLog structures in range
• Merge structures together
• Retrieve count from merged structure
Approach #3
• All benefits of #2
• Minimal number of lookups for any range,
so less variation in latency
• Minimal increase in storage
• Requires more work at write time
example: 1 month there are ~720 hours, 30 days, 4 weeks, 1 month... adding all
granularities makes 755 stored values total instead of 720 values, only a 4.8% increase in
storage
Hard problem
struct Equiv {
UserID id1,
UserID id2
}
struct PageView {
UserID id,
String url,
Timestamp timestamp
}
Person A Person B
Implement:
function NumUniqueVisitors(
String url,
int startHour,
int endHour)
except now userids should be normalized, so if there’s equiv that user only appears once
even if under multiple ids
[“foo.com/page1”, 0]
[“foo.com/page1”, 1]
[“foo.com/page1”, 2]
...
[“foo.com/page1”, 1002]
{A, B, C}
{B}
{A, C, D, E}
...
{A, B, C, F, Z}
equiv can change ANY or ALL buckets in the past
[“foo.com/page1”, 0]
[“foo.com/page1”, 1]
[“foo.com/page1”, 2]
...
[“foo.com/page1”, 1002]
{A, B, C}
{B}
{A, C, D, E}
...
{A, B, C, F, Z}
A <-> C
Any single equiv could change any bucket
No way to take advantage of HyperLogLog
Approach #1
• [URL, hour] -> Set of PersonIDs
• PersonID -> Set of buckets
• Indexes to incrementally normalize
UserIDs into PersonIDs
will get back to incrementally updating userids
Approach #1
• Getting complicated
• Large indexes
• Operations require a lot of work
will get back to incrementally updating userids
Approach #2
• [URL, bucket] -> Set of UserIDs
• Like Approach 1, incrementally normalize
UserId’s
• UserID -> PersonID
offload a lot of the work to read time
Approach #2
• Query:
• Retrieve all UserID sets for range
• Merge sets together
• Convert UserIDs -> PersonIDs to
produce new set
• Get count of new set
this is still an insane amount of work at read time
overall
Approach #3
• [URL, bucket] -> Set of sampled UserIDs
• Like Approaches 1 & 2, incrementally
normalize UserId’s
• UserID -> PersonID
offload a lot of the work to read time
Approach #3
• Query:
• Retrieve all UserID sets for range
• Merge sets together
• Convert UserIDs -> PersonIDs to
produce new set
• Get count of new set
• Divide count by sample rate
Approach #3
if sampled
database contains [URL, hour bucket] -> set of sampled user ids
Approach #3
• Sample the user ids using hash sampling
• Divide by the sample rate at end to
approximate the unique count
can’t just do straight random sampling (imagine if only have 4 user ids that visit thousands of
times... a sample rate of 50% will have all user ids, and you’ll end up giving the answer of 8)
Approach #3
• Still need complete UserID -> PersonID index
• Still requires about 100 lookups into
UserID -> PersonID index to resolve queries
• Error rate 3-5x worse than HyperLogLog for
same space usage
• Requires SSD’s for reasonable throughput
Incremental UserID
normalization
Attempt 1:
• Maintain index from UserID -> PersonID
• When receive A <-> B:
• Find what they’re each normalized to, and
transitively normalize all reachable IDs to
“smallest” val
1 <-> 4
1 -> 1
4 -> 1
2 <-> 5
5 -> 2
2 -> 2
5 <-> 3 3 -> 2
4 <-> 5
5 -> 1
2 -> 1
3 -> 1 never gets produced!
Attempt 2:
• UserID -> PersonID
• PersonID -> Set of UserIDs
• When receive A <-> B
• Find what they’re each normalized to, and
choose one for both to be normalized to
• Update all UserID’s in both normalized sets
1 <-> 4
1 -> 1
4 -> 1
1 -> {1, 4}
2 <-> 5
5 -> 2
2 -> 2
2 -> {2, 5}
5 <-> 3 3 -> 2
2 -> {2, 3, 5}
4 <-> 5
5 -> 1
2 -> 1
3 -> 1
1 -> {1, 2, 3, 4, 5}
Challenges
• Fault-tolerance / ensuring consistency
between indexes
• Concurrency challenges
if using distributed database to store indexes and computing everything concurrently
when receive equivs for 4<->3 and 3<->1 at same time, will need some sort of locking so
they don’t step on each other
General challenges with
traditional architectures
• Redundant storage of information
(“denormalization”)
• Brittle to human error
• Operational challenges of enormous
installations of very complex databases
e.g. granularities, the 2 indexes for user id normalization... we know it’s a bad idea to store the same thing in multiple places... opens up
possibility of them getting out of sync if you don’t handle every case perfectly
If you have a bug that accidentally sets the second value of all equivs to 1, you’re in trouble
even the version without equivs suffers from these problems
Master
Dataset
Indexes for
uniques
over time
Stream
processor
No fully incremental approach works well!
Let’s take a completely different approach!
Some Rough
Definitions
Complicated: lots of parts
Some Rough
Definitions
Complex: intertwinement between separate functions
Some Rough
Definitions
Simple: the opposite of complex
Real World Example
2 functions: produce water of a certain strength, and produce water of a certain temperature
faucet on left gives you “hot” and “cold” inputs which each affect BOTH outputs - complex to
use
faucet on right gives you independent “heat” and “strength” inputs, so SIMPLE to use
neither is very complicated
Real World Example #2
recipe... heat it up then cool it down
Real World Example #2
I have to use two devices for the same task!!! temperature control!
wouldn’t it be SIMPLER if I had just one device that could regulate temperature from 0 degrees to 500 degrees?
then i could have more features, like “450 for 20 minutes then refrigerate”
who objects to this?
... talk about how mixing them can create REAL COMPLEXITY...
- sometimes you need MORE PIECES to AVOID COMPLEXITY and CREATE SIMPLICITY
- we’ll come back to this... this same situation happens in software all the time
- people want one tool with a million features... but it turns out these features interact with each other and create COMPLEXITY
ID Name
Location
ID
1 Sally 3
2 George 1
3 Bob 3
Location
ID
City State Population
1 NewYork NY 8.2M
2 San Diego CA 1.3M
3 Chicago IL 2.7M
Normalized schema
Normalization vs
Denormalization
so just a quick overview of denormalization, here’s a schema that stores user information and location information
each is in its own table, and a user’s location is a reference to a row in the location table
this is pretty standard relational database stuff
now let’s say a really common query is getting the city and state a person lives in
to do this you have to join the tables together as part of your query
Join is too expensive, so
denormalize...
you might find joins are too expensive, they use too many resources
ID Name Location ID City State
1 Sally 3 Chicago IL
2 George 1 NewYork NY
3 Bob 3 Chicago IL
Location ID City State Population
1 NewYork NY 8.2M
2 San Diego CA 1.3M
3 Chicago IL 2.7M
Denormalized schema
so you denormalize the schema for performance
you redundantly store the city and state in the users table to make that query faster, cause now it doesn’t require a join
now obviously, this sucks. the same data is now stored in multiple places, which we all know is a bad idea
whenever you need to change something about a location you need to change it everywhere it’s stored
but since people make mistakes, inevitably things become inconsistent
but you have no choice, you want to normalize, but you have to denormalize for performance
Complexity between robust data model
and query performance
you have to choose which one you’re going to suck at
Allow queries to be out of date by hours
Store every Equiv and PageView
Master
Dataset
Master
Dataset
Continuously recompute indexes
Indexes for
uniques
over time
Indexes = function(all data)
Equivs
PageViews
Compute
UserID ->
PersonID
Convert
PageView
UserIDs to
PersonIDs
Compute
[URL, bucket, granularity] -> HyperLogLog
Equivs
PageViews
Compute
UserID ->
PersonID
Convert
PageView
UserIDs to
PersonIDs
Compute
[URL, bucket, granularity] -> HyperLogLog
Iterative graph algorithm
Equivs
PageViews
Compute
UserID ->
PersonID
Convert
PageView
UserIDs to
PersonIDs
Compute
[URL, bucket, granularity] -> HyperLogLog
Join
Equivs
PageViews
Compute
UserID ->
PersonID
Convert
PageView
UserIDs to
PersonIDs
Compute
[URL, bucket, granularity] -> HyperLogLog
Basic aggregation
Sidenote on tooling
• Batch processing systems are tools to
implement function(all data) scalably
• Implementing this is easy
Person 1 Person 6
UserID normalization
UserID normalization
1 3
4
5
11
2
Initial Graph
1 3
4
5
11
2
Iteration 1
1 3
4
5
11
2
Iteration 2
1 3
4
5
11
2
Iteration 3 / Fixed Point
Conclusions
• Easy to understand and implement
• Scalable
• Concurrency / fault-tolerance easily
abstracted away from you
• Great query performance
Conclusions
• But... always out of date
Absorbed into batch views
Not
absorbed
Now
Time
Just a small percentage
of data!
Master
Dataset
Batch views
New Data
Realtime
views
Query
Get historical buckets from batch views
and recent buckets from realtime views
Implementing realtime
layer
• Isn’t this the exact same problem we faced
before we went down the path of batch
computation?
i hope you are looking at this and asking the question...
still have to compute uniques over time and deal with the equivs problem
how are we better off than before?
Approach #1
• Use the exact same approach as we did in
fully incremental implementation
• Query performance only degraded for
recent buckets
• e.g.,“last month” range computes vast
majority of query from efficient batch
indexes
Approach #1
• Relatively small number of buckets in
realtime layer
• So not that much effect on storage costs
Approach #1
• Complexity of realtime layer is softened by
existence of batch layer
• Batch layer continuously overrides realtime
layer, so mistakes are auto-fixed
Approach #1
• Still going to be a lot of work to implement
this realtime layer
• Recent buckets with lots of uniques will still
cause bad query performance
• No way to apply recent equivs to batch
views without restructuring batch views
Approach #2
• Approximate!
• Ignore realtime equivs
options for taking different approaches to problem without having to sacrifice too much
UserID ->
PersonID
(from batch)
Approach #2
Pageview
Convert UserID
to PersonID
[URL, bucket]
->
HyperLogLog
Approach #2
• Highly efficient
• Great performance
• Easy to implement
Approach #2
• Only inaccurate for recent equivs
• Intuitively, shouldn’t be that much
inaccuracy
• Should quantify additional error
Approach #2
• Extra inaccuracy is automatically weeded
out over time
• “Eventual accuracy”
Simplicity
Input:
Normalize/denormalize
Output:
Data model robustness
Query performance
Master
Dataset
Batch views
Normalized
Robust data model
Denormalized
Optimized for queries
Normalization problem
solved
• Maintaining consistency in views easy
because defined as function(all data)
• Can recompute if anything ever goes wrong
Human fault-tolerance
Complexity of Read/
Write Databases
Black box fallacy
people say it does “key/value”, so I can use it when I need key/value operations... and they
stop there
can’t treat it as a black box, that doesn’t tell the full story
Online compaction
• Databases write to write-ahead log before
modifying disk and memory indexes
• Need to occasionally compact the log and
indexes
Memory Disk
Key B
Key D
Key F
Value 1
Value 2
Value 3
Value 4
Key A Write A, Value 3
Write F, Value 4
Write-ahead log
Memory Disk
Write-ahead log
Key B
Key D
Key F
Value 1
Value 2
Value 3
Value 4
Key A
Value 5
Write A, Value 3
Write F, Value 4
Write A, Value 5
Memory Disk
Write-ahead log
Key B
Key D
Key F
Value 1
Value 2
Value 4
Key A
Value 5
Compaction
Online compaction
• Notorious for causing huge, sudden
changes in performance
• Machines can seem locked up
• Necessitated by random writes
• Extremely complex to deal with
Dealing with CAP
Application Databases
Synchronous
100Replica 1
100Replica 2
Network partition
What CAP theorem is