A Theorist’s Toolkit (CMU 18-859T, Fall 2013)
Lecture 03: Chernoff/Tail Bounds
September 16, 2013
Lecturer: Ryan O’Donnell Scribe: Elara Willett
1 INTRODUCTION
Define H ∼ Binomial(n, 12 ), i.e. H is the number of heads that occur when we
√
flip a coin n
n n
times. Recall that the mean of H is µ = 2 and the standard deviation is σ = 2 . Last time,
we applied the Berry-Esseen Theorem to get
√
n n φ(t) −t2
Pr H ≥ + t ≈ Φ(t) ≤ ≤e 2
2 2 t
where Φ(t) is the probability that the Gaussian distribution is at most t and t ≥ 1. This
tells us that the probability that we exceed the standard deviation decreases very rapidly.
√ h √ i
n
As an example, consider t = 10 ln n. Then we get P r H ≥ 2 + t 2 ≤ e−50 ln n = n150 .
n
However, the error term of our above approximation is ±O( √1n ), which is important in this
example because the error is so much larger than the bound. We will see in the analysis to
come that the bound does, indeed, hold in this example.
In this lecture, we will cover bounds on random variables given varying assumptions.
For a comprehensive reference on probabilistic techniques used in approximation algorithms,
including topics covered in this lecture and more, see Dubhashi and Panconesi’s book [DP+ ].
For another reference, see Mitzenmacher and Upfal’s book [MU05], which is an introduction
to probabilistic techniques used in algorithm analysis, including the topics covered in this
lecture.
2 BOUNDS BASED ONLY ON MEAN & VARIANCE
Say we have a random variable X. Generally speaking, the more we know about X, the
better bounds we can calculate for P [X ≥ t]. So we’ll begin by supposing we know only the
expectation E[X]. Assume X ≥ 0 always and X is not always 0.
Theorem 2.1. Markov Inequality. For all t > 0, P [X ≥ t · E[X]] ≤ 1t .
Proof. WOLOG assume E[X] = 1. (We can multiply X by a constant and we do not change
the probabilities.) For contradiction, suppose P [X ≥ t] > 1t . Then
E[X] ≥ t · P [X ≥ t] + 0 · P [X < t] > 1
which contradicts our assumption that E[X] = 1.
1
Here is an alternative proof, using the intuition from the depicted graph.
Proof. Define f (x) to be 0 for x < t and 1 for x ≥ t, so that P r[X ≥ t] = E[f (X)]. Define
g(x) = xt . Since g(x) ≥ f (x), we get
X 1
P r[X ≥ t] = E[f (X)] ≤ E[g(X)] = E =
t t
where the last equality holds because E[X] = 1.
We can prove another similar fact based only on expectation using a similar argument,
known as a ‘Markov-type’ or ‘average-type’ argument.
Fact 2.2. Suppose E[X] = and 0 ≤ X ≤ 1, then P r[X ≥ 2 ] ≥ 2 .
Proof. For contradiction, suppose P r[X ≥ 2 ] < 2 , then
h i h i
E[X] ≤ 1 · P r X ≥ + · Pr X < <1· + ·1=
2 2 2 2 2
which contradicts our assumption that E[X] = .
Now, we can move on to the scenerio where we know the expectation and the variance
of X. In this case we get the following bounds.
Theorem 2.3. Chebyshev’s Inequality. Let E[X] = µ and stddev[X] = σ 6= 0, then for all
t > 0, P [|X − µ| ≥ tσ] ≤ t12 .
Proof. WOLOG assume µ = 0. (We can subtract µ from the mean without changing the
standard deviation.) Also, WOLOG assume σ = 1. (We can multiply X by σ1 without
changing the probabilites.) Note that these assumptions imply E[X 2 ] = 1. Now, it is
sufficient to show P [|X| ≥ t] ≤ t12 . Well, the event |X| ≥ t is equivalent to the event
X 2 ≥ t2 . Since X 2 is nonnegative and t2 = t2 · E[X 2 ], we can apply Markov’s Inequality to
bound P [X 2 ≥ t2 ], which gives us the desired bound
1
P [|X| ≥ t] = P [X 2 ≥ t2 ] ≤
t2
2
Of course, as in the proof of Markov’s Inequality, we can alternatively prove this theorem
with a picture.
Proof. Again WOLOG assume µ = 0 and σ = 1. Now define f (x) to be 0 for x < t and 1
for x ≥ t and define g(x) = tx2 . As before, since g(x) ≤ f (x),
2
X 1
P r[|X| ≥ t] = E[f (X)] ≤ E[g(X)] = E 2 = 2
t t
3 BACK TO SUMS OF VARIABLES
Let X = X1 + X2 + · · · + Xn , and assume (for now) that the Xi ’s are independent. Given this
additional information about X, what bounds can we prove? In this section we’ll explore
Chernoff Bounds to answer this question.
3.1 Markov and Chebyshev on Sums
To start, let’s again consider H ∼ Binomial(n, 12 ). Recall that we are interested in bounding
n
√
n
√ n
√
n
P r[H ≥ 2 + t 2 ], where t = 10 ln n. We know that µ = 2 and σ = 2 , so we can apply
the inequalities form the previous section. Using only the mean, we can apply Markov’s
Inequality and get
" √ !!#
ln n 1 1
Pr H ≥ µ 1 + O √ ≤ √ =
n 1 + O( √lnnn ) e √1 )
1 + O( n
Using the mean and the standard deviation, we can apply Chebyshev’ Inequality and get
√ 1 1
P r[|X − µ| ≥ 10 ln n · σ] ≤ √ =
(10 ln n) 2 100 ln n
Chebyshev’s bound will still hold if we only have pairwise independence of the Wi ’s. Pairwise
independence means that for all pairs i 6= j, E[Xi Xj ] = E[Xi ]E[Xj ]. The mean will remain
the same as in the independence case because of linearity of independence. To see that the
variance will also remain unchanged, notice that pairwise independence gives us
" n # n n
X X X X X
E ( Xi )2 = E[Xi2 ] + 2 E[Xi Xj ] = E[Xi2 ] + 2 E[Xi ]E[Xj ]
i=1 i=1 i6=j i=1 i6=j
3
n n
Xi )2 ] − (E[ Xi ])2 , and by our calculation
P P
By definition, V ar[X1 + X2 + · · · + Xn ] = E[(
i=1 i=1
above this value is the same as in the case of independent variables, so variance is unchanged.
Since the mean and variance are the same in the cases of independence and pairwise inde-
pendence, Chebychev’s bound is the same for both. Therefore, we see that Chebychev’s
bound is strong in some cases and weak in others.
3.2 Fourth Moment Method
So far we have seen bounds based on the mean, e.g. the first moment, and the variance, e.g.
the second moment, so now we will look at bounds based on higher moments. The graph
below shows how a fourth-degree polynomial could more closely fit our function, f , than a
quadratic polynomial. An odd-degree polynomial could not upper bound f , so we do not
consider the third moment.
Let X = X1 + X2 + · · · + Xn , where the Xi ’s are independent identical random variables
and
+1 w.p. 1/2
Xi =
−1 w.p. 1/2
√ √
We will use the Fourth
√ Moment Method to upper bound P r[|X| ≥ 10 ln n · n].
4 4 2 4
Well, P r[|X| ≥ t n] = P [X ≥ t n ], and X is a nonnegative variable, so we can apply
Markov’s Inequality, namely,
√ E[X 4 ]
P r[|X| ≥ t n] = P [X 4 ≥ t4 n2 ] ≤ 4 2
tn
4
P 4
Now, let’s take a close look at X = Xi . When we multiple the variables out, we get
i
X X X X X
X4 = Xi4 + c Xi2 Xj2 + c0 Xi3 Xj + c00 Xi2 Xj Xk + c000 Xi Xj Xk Xl
i i6=j i6=j i6=j6=k i6=j6=k6=l
for some integers c, c0 , c00 , c000 . Since E[Xi ] = 0 and the variables are independent,
E[Xi Xj Xk Xl ] = E[Xi ]E[Xj ]E[Xk ]E[Xl ] = 0 · 0 · 0 · 0 = 0
Similarly, E[Xi3 Xj ] = E[Xi3 ]E[Xj ] = E[Xi3 ] · 0 = 0, and E[Xi2 Xj Xk ] = 0. Thus,
X X
E[X 4 ] = E[Xi4 ] + c E[Xi2 ]E[Xj2 ]
i i6=j
4
n
4
and if we have to calculate it, c = 2 2
= 3n2 − 3n. Thus,
√ E[X 4 ] 3n2 − 3n 3
P r[|X| ≥ t n] ≤ = ≤
tn2 t4 n2 t4
Furthermore, in our specific example,
√ √
3 1
P r[|X| ≥ 10 ln n · n] ≤ √ =Θ
(10 ln n)4 ln2 n
We could continue in√the same manner, applying the 2s Moment Method to get bounds
of the form P r[|X| ≥ t n] ≤ tC2ss for some constant t. Notice that as s goes to infinity, cs
also goes to infinity. Thus, to see which of these methods gives the best bound, we would
need to optimize over all possible s, geting s(t) for a given t. This would be quite difficult,
but it might give us a slightly better bound.
3.3 Chernoff Method
For this subsection, we assume X is the sum of independent identical variables.
√
Define c = eλ for some λ > 0. Let g(x) = cx /cu for a given u = t n. Looking at the
diagram below, it is clear that g is strictly increasing and g(u) = 1.
This tells us that the event that g ≥ 1 is equivalent to the event that X ≥ u, which
implies P [X ≥ u] = P r[g ≥ 1] = P r[eλX ≥ eλu ]. Since eλX is a nonegative random variable,
we can now apply Markov’s Inequality to get,
E[eλX ]
P [X ≥ u] = P r[eλX ≥ eλu ] ≤
eλu
This bound is a function of λ and is called the moment generating function of X. By
optimizing over λ > 0, we could determine the best bound achievable by this function for
any given u.
In order to use this strategy, we first need to get a function for E[eλX ]. Well,
E[exp(λX)] = E[exp(λX1 + λX2 + · · · + λXn )] = E[exp(λX1 )exp(λX2 ) · · · exp(λXn )]
By independence of the Xi ’s, we get
E[exp(λX1 )exp(λX2 ) · · · exp(λXn )] = E[exp(λX1 )] · E[exp(λX2 )] · · · E[exp(λXn )]
5
and since the Xi ’s and identical, we get
E[exp(λX1 )] · E[exp(λX2 )] · · · E[exp(λXn )] = (E[exp(λX1 )])n
Therefore, E[eλX ] = (E[exp(λX1 )])n . Now, E[exp(λX1 )] = 21 eλ + 21 e−λ , which is known as
the hyperbolic cosine. We can simplify this function by applying the Taylor Expansion for
2 3 2 3
ex . Namely, eλ = (1 + λ + λ2 + λ3! + · · ·) and e−λ = (1 − λ + λ2 − λ3! + · · ·). The odd terms
2 4 6
cancel each other out, so we get E[exp(λX1 )] = (1 + λ2 + λ4! + λ6! + · · ·). By Claim 3.1, this
2
series is upper bounded by the Taylor Expansion of ex for x = λ2 . Thus, we get
2
E[eλX ] eλ n/2 2
P [X ≥ u] ≤ λu
≤ λu
= eλ n/2−λu
e e
Now, we have an upper bound for P [X ≥ u] that we could optimize over λ for a given u.
For example, if we choose λ = nu then we get the bound:
u2
P [X ≥ u] ≤ e− 2n
√ √
Furthermore, for u = 10 ln n · n, we get P [X ≥ u] ≤ e−50 ln n = n150 . Recall from the
Introduction that this is precisely the bound we desired. For this particular random variable
and choice of t, we can deduce from the Berry-Esseen Theorem that this is about the best
bound we can get. In more general cases, this bound is still about the best we can do. See
[AS04] or [Slu77] for more details.
λ2 λ4 λ6 2 2 2
Claim 3.1. Let x = 1 + 2
+ 4!
+ 6!
+ · · · and y = 1 + ( λ2 ) + 12 ( λ2 )2 + 3!1 ( λ2 )2 + · · ·. then
x ≤ y.
∞ ∞ ∞
P (λ2 /2)i P (λ)2i P (λ)2i
Proof. Well, y = i!
= 2i i!
≥ (2i)!
=x, where the inequality follows from
i=0 i=0 i=0
(2i)! 2i(2i − 1) · · · (i + 1) 2i 2i − 1 i+1
≥ = ··· ≥ 2i
i! i(i − 1) · · · 1 i i−1 1
Note: Suppose our Xi ’s were not identical, but had some common structure, such as
1 w.p. pi
Xi =
0 w.p. 1 − pi
As long as the Xi ’s are independent, we could do a similar, yet somewhat more complicated,
analysis and get a good bound.
We will now formally state two variations on the bound we proved, the Hoeffding Bound
and the Chernoff [Link] bounds assume X = X1 + X2 + · · · + Xn with independent
Xi ’s and µ = E[X] = E[X]. These bounds are extremely useful and the Chernoff
i
Bound should be memorized.
6
Theorem 3.2. Hoeffding Bound. Suppose ai ≤ Xi ≤ bi always. Then for all t > 0,
2t2
P r[X ≥ µ + t] ≤ exp − P 2
i (bi − ai )
and the exact same bound holds for P r[X ≥ µ − t].
The Chernoff Bound, below, is a little stronger than the Hoeffding Bound, which is also
sometimes called the Chernoff Bound.
Theorem 3.3. Chernoff Bound. Suppose 0 ≤ Xi ≤ 1. Then for all > 0,
2
2
P r[X ≤ (1 − )µ] ≤ exp − µ , and P r[X ≥ (1 + )µ] ≤ exp − µ
2 2+
The Chernoff Bound is even useful in the case where you only know some range within
which µ falls. Suppose µL ≤ µ ≤ µH , then as you might expect,
2
2
P r[X ≤ (1 − )µL ] ≤ exp − µL , and P r[X ≥ (1 + )µH ] ≤ exp − µH
2 2+
4 RELAXING INDEPENDENCE
So far we have mostly considered only sums of independent variables. In this section we will
consider two special cases where we can prove good bounds about the sums of dependent
variables: negatively associated random variables and martingales.
4.1 Negatively Associated Random Variables
For intuition’s sake, negative association (NA) means that if one variable in a set of negatively
associated random variables is large than the others will be small. The following are examples
of negatively associated variables.
1. Uniformly at random, draw n points on the unit circle as depicted, and let X1 , X2 , ...Xn
be the arc lengths between these points. Notice that if you know that one arc length
is large, then the others must be small.
2. Independently throw n balls into n bins, and let Xi be the number of balls in bin i. If
there are a lot of balls in bin i, then there are few balls in the other bins.
7
3. Let x1 , x2 , ..., xn be any real numbers. Let X1 , ..., Xn be a random permutation of these
numbers. If we know Xi = xj for a high xj , than the other variables cannot take on
xj , so they will probably be comparatively low. Note, this setup can be used to model
sampling without replacement.
We will now formally define negative association.
Definition 4.1. Random variables X1 , X2 , ..., Xn are randomly associated if
E[f (Xi : i ∈ A)g(Xj : j ∈ B)] ≤ E[f (Xi : i ∈ A)]E[g(Xj : j ∈ B)]
for any disjoint sets A, B ⊆ {1, 2, ...n} and any nondecreasing functions, f, g.
This definition implies that E[Xi Xj ] ≤ E[Xi ]E[Xj ], or cov[Xi , Xj ] ≤ 0, but the converse
is not true. In general, Negative association is stronger than this claim about covariance.
See [JDP83] for more details.
By induction, one can show that negative association is enough to give,
E[exp(λX1 )exp(λX2 ) · · · exp(λXn )] ≤ E[exp(λX1 )] · E[exp(λX2 )] · · · E[exp(λXn )]
In our discussion of the Chernoff Method, we only needed independence to get this one fact.
Therefore, the Chernoff Bounds hold for negatively associated variables.
Here are a few interesting facts to ponder about negative association:
• Independence implies negative association.
• Negative association is closed under subsets and independent unions.
• Negative association is closed under sapplying nondecreasing functions to subsets of
variables.
4.2 Martingales
Suppose you bought a lottery ticket with numbers 12 33 5 60. Before the winning numbers
are revealed, you have an expected payoff of about 0, but after the first number is revealed
to be 12, your expected payoff is much higher. Your expected payoff becomes further refined
as more of the numbers are revealed. Finally, the last number is revealed and you know your
payoff exactly. We could model your series of guesses as a martingale.
Definition 4.2. Let X1 , X2 , ..., Xn be discrete random variables. Let f : Rn → R. Define
Yi = E[f (X1 , ..., Xn )|X1 , ..., Xi ]. Then Y0 , ..., Yn is a martingale with respect to X1 , ..., Xn .
So we have Y0 = E[f (X1 , ..., Xn )] = µ, Y1 = E[f (X1 , ..., Xn )|X1 ],..., Yn = f (X1 , ..., Xn ).
Given a property of the Yi ’s, i.e. bounded diffference, we can get a bound on f .
8
Theorem 4.3. Azuma’s Inequality or Method of Bounded Difference. Let X1 , ..., Xn be
independent. If f satisfies
f (X1 , ..., Xn ) − f (X1 , ..., Xi−1 , Xi0 , Xi+1 , ..., Xn ) ≤ ci
then
2t2
P r[f (X1 , ..., Xn ) ≥ µ + t] ≤ exp − P 2
i ci
and the same upper bound holds for P r[f (X1 , ..., Xn ) ≥ µ − t].
For more information on martingales see [McD89].
References
[AS04] N. Alon and J.H. Spencer. The Probabilistic Method, chapter Appendix A.2. Wiley-
Interscience series in discrete mathematics and optimization. Wiley, 2004.
[DP+ ] D Dubhashi, Alessandro Panconesi, et al. Concentration of measure for the analysis
of randomised algorithms.
[JDP83] Kumar Joag-Dev and Frank Proschan. Negative association of random variables
with applications. The Annals of Statistics, 11(1):286–295, 1983.
[McD89] Colin McDiarmid. On the method of bounded differences. Surveys in combina-
torics, 141(1):148–188, 1989.
[MU05] M. Mitzenmacher and E. Upfal. Probability and Computing: Randomized Algo-
rithms and Probabilistic Analysis. Cambridge University Press, 2005.
[Slu77] Eric V Slud. Distribution inequalities for the binomial law. The Annals of Proba-
bility, 5(3):404–412, 1977.