Chapter 7.
Statistical Estimation
7.2: Maximum Likelihood Examples
(From “Probability & Statistics with Applications to Computing” by Alex Tsun)
We spend an entire section just doing examples because maximum likelihood is such a fundamental concept
used everywhere (especially machine learning). I promise that the idea is simple: find θ that maximizes the
likelihood of the data. The computation and notation can be confusing at first though.
7.2.1 MLE Example (Poisson)
Example(s)
Let’s say $x_1, x_2, \ldots, x_n$ are iid samples from Poi(θ). (These values might look like $x_1 = 13$, $x_2 = 5$, $x_3 = 6$, etc...) What is the MLE of θ?
Solution Remember that we discussed that the sample mean might be a good estimate of θ. If we observed
20 events over 5 units of time, a good estimate for λ, the average number of events per unit of time, would
be 20/5 = 4. This turns out to be the maximum likelihood estimate!
Let’s follow the recipe provided in 7.1.
1. Compute the likelihood and log-likelihood of data. To do this, we take the following product
of the Poisson PMFs at each sample xi , over all the data points:
$$L(x \mid \theta) = \prod_{i=1}^{n} p_X(x_i \mid \theta) = \prod_{i=1}^{n} e^{-\theta} \frac{\theta^{x_i}}{x_i!}$$
Again, this is the probability of seeing $x_1$, then $x_2$, and so on. This function is pretty hard to differentiate, so to make it easier, let’s compute the log-likelihood instead, using the following identities:
$$\log(ab) = \log(a) + \log(b) \qquad \log(a/b) = \log(a) - \log(b) \qquad \log(a^b) = b \log(a)$$
In most cases, we’ll want to optimize the log-likelihood instead of the likelihood (since we don’t want
to use the product rule of calculus)!
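$$
\begin{aligned}
\ln L(x \mid \theta) &= \ln\left(\prod_{i=1}^{n} e^{-\theta} \frac{\theta^{x_i}}{x_i!}\right) && \text{[def of likelihood]} \\
&= \sum_{i=1}^{n} \ln\left(e^{-\theta} \frac{\theta^{x_i}}{x_i!}\right) && \text{[log of product is sum of logs]} \\
&= \sum_{i=1}^{n} \left[\ln(e^{-\theta}) + \ln(\theta^{x_i}) - \ln(x_i!)\right] && \text{[log of product is sum of logs]} \\
&= \sum_{i=1}^{n} \left[-\theta + x_i \ln\theta - \ln(x_i!)\right] && \text{[other log properties]}
\end{aligned}
$$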
2. Take the partial derivative(s) with respect to θ and set to 0. Solve the equation(s).
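Now we want to take the derivative of the log-likelihood with respect to θ. The derivative of −θ is just −1, and the derivative of $x_i \ln\theta$ is just $x_i/\theta$, because $x_i$ is a constant with respect to θ. The term $-\ln(x_i!)$ does not depend on θ at all, so its derivative is 0.
$$\frac{\partial}{\partial\theta} \ln L(x \mid \theta) = \sum_{i=1}^{n} \left[-1 + \frac{x_i}{\theta}\right]$$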
Now we set the derivative equal to 0 and solve for θ; the solution, written θ̂, is our estimate. After some algebra we get $\frac{1}{n}\sum_{i=1}^{n} x_i$, which is just the sample mean!
$$\sum_{i=1}^{n}\left[-1 + \frac{x_i}{\hat\theta}\right] = 0 \;\to\; -n + \frac{1}{\hat\theta}\sum_{i=1}^{n} x_i = 0 \;\to\; \hat\theta = \frac{1}{n}\sum_{i=1}^{n} x_i$$
3. Optionally, verify $\hat\theta_{MLE}$ is indeed a (local) maximizer by checking that the second derivative at $\hat\theta_{MLE}$ is negative (if θ is a single parameter), or the Hessian (matrix of second partial derivatives) is negative semi-definite (if θ is a vector of parameters).
We want to take the second derivative also, because otherwise we don’t know if this is a maximum or a minimum. We differentiate the first derivative $\sum_{i=1}^{n} [-1 + x_i/\theta]$ again with respect to θ. Because every $x_i \ge 0$ and $\theta^2$ is always positive, each term $-x_i/\theta^2$ is at most 0, so the second derivative is less than 0 for every θ > 0 (as long as at least one $x_i$ is positive). That means the log-likelihood is concave down everywhere, so anywhere the derivative is zero is a global maximum, and we’ve successfully found the global maximum of our likelihood function.
$$\frac{\partial^2}{\partial\theta^2} \ln L(x \mid \theta) = \sum_{i=1}^{n} \left[-\frac{x_i}{\theta^2}\right] < 0 \;\to\; \text{concave down everywhere}$$
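The three steps above can be sanity-checked numerically. Below is a minimal sketch, assuming a small made-up sample of counts (the list `xs` and all function names are illustrative, not from the text). It compares the closed-form estimate with a brute-force grid search over the log-likelihood, and also shows one practical reason to prefer the log-likelihood: on a larger sample the raw product of PMFs underflows to zero in floating point, while the sum of logs stays finite.

```python
import math

def poisson_likelihood(xs, theta):
    """Product of the Poisson PMFs e^{-theta} * theta^x / x! over the sample."""
    prod = 1.0
    for x in xs:
        prod *= math.exp(-theta) * theta**x / math.factorial(x)
    return prod

def poisson_log_likelihood(xs, theta):
    """Sum of [-theta + x*ln(theta) - ln(x!)] over the sample."""
    # math.lgamma(x + 1) equals ln(x!) without forming the factorial itself.
    return sum(-theta + x * math.log(theta) - math.lgamma(x + 1) for x in xs)

def score(xs, theta):
    """First derivative of the log-likelihood: sum of [-1 + x_i / theta]."""
    return sum(-1 + x / theta for x in xs)

def curvature(xs, theta):
    """Second derivative of the log-likelihood: sum of [-x_i / theta^2]."""
    return sum(-x / theta**2 for x in xs)

xs = [13, 5, 6, 9, 7]            # hypothetical counts, for illustration only
theta_hat = sum(xs) / len(xs)    # closed-form MLE: the sample mean

# Brute-force check: evaluate the log-likelihood on a grid 0.01, 0.02, ..., 20.00
# and keep the grid point with the largest value.
grid = [k / 100 for k in range(1, 2001)]
best_on_grid = max(grid, key=lambda t: poisson_log_likelihood(xs, t))

# With 1000 observations the product of PMFs is too small for a float,
# but the log-likelihood is still an ordinary negative number.
many = xs * 200
underflowed = poisson_likelihood(many, theta_hat)
stable = poisson_log_likelihood(many, theta_hat)

print("closed form:", theta_hat, "grid search:", best_on_grid)
print("score at MLE:", score(xs, theta_hat), "curvature:", curvature(xs, theta_hat))
print("raw likelihood:", underflowed, "log-likelihood:", stable)
```

The grid search is only a check, not a recommended way to compute the estimate; when a closed form exists, as it does here, the sample mean is all you need.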
7.2.2 MLE Example (Exponential)
Example(s)
Let’s say $x_1, x_2, \ldots, x_n$ are iid samples from Exp(θ). (These values might look like $x_1 = 1.354$, $x_2 = 3.198$, $x_3 = 4.312$, etc...) What is the MLE of θ?
Solution Now that we’ve seen one example, we’ll just follow the procedure given in the previous section.
1. Compute the likelihood and log-likelihood of data.
Since we have a continuous distribution, our likelihood is the product of the PDFs:
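$$L(x \mid \theta) = \prod_{i=1}^{n} f_X(x_i \mid \theta) = \prod_{i=1}^{n} \theta e^{-\theta x_i}$$
The log-likelihood is
$$\ln L(x \mid \theta) = \sum_{i=1}^{n} \ln\left(\theta e^{-\theta x_i}\right) = \sum_{i=1}^{n} \left[\ln(\theta) - \theta x_i\right]$$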
2. Take the partial derivative(s) with respect to θ and set to 0. Solve the equation(s).
$$\frac{\partial}{\partial\theta} \ln L(x \mid \theta) = \sum_{i=1}^{n} \left[\frac{1}{\theta} - x_i\right]$$
Now, we set the derivative to 0 and solve (here we replace θ with θ̂):
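$$\sum_{i=1}^{n} \left[\frac{1}{\hat\theta} - x_i\right] = 0 \;\to\; \frac{n}{\hat\theta} - \sum_{i=1}^{n} x_i = 0 \;\to\; \hat\theta = \frac{n}{\sum_{i=1}^{n} x_i}$$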
This is just the inverse of the sample mean! This makes sense because if the average waiting time was
1/2 hours, then the average rate per unit of time λ should be $\frac{1}{1/2} = 2$ per hour!
3. Optionally, verify $\hat\theta_{MLE}$ is indeed a (local) maximizer by checking that the second derivative at $\hat\theta_{MLE}$ is negative (if θ is a single parameter), or the Hessian (matrix of second partial derivatives) is negative semi-definite (if θ is a vector of parameters).
The second derivative of the log-likelihood just requires us to take one more derivative:
$$\frac{\partial^2}{\partial\theta^2} \ln L(x \mid \theta) = \sum_{i=1}^{n} \frac{-1}{\theta^2} < 0$$
Since the second derivative is negative everywhere, the function is concave down, and any critical point
is a global maximum!
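As a quick numerical check of this derivation, here is a minimal sketch that plugs the three illustrative values from the example statement into the formulas. The function and variable names are assumptions made for this illustration. It confirms that the first derivative vanishes at n divided by the sum of the samples, that the second derivative is negative there, and that a few nearby values of θ give a smaller log-likelihood.

```python
import math

def exp_log_likelihood(xs, theta):
    """Sum of [ln(theta) - theta * x_i] over the sample."""
    return sum(math.log(theta) - theta * x for x in xs)

xs = [1.354, 3.198, 4.312]       # the illustrative values from the example
theta_hat = len(xs) / sum(xs)    # closed-form MLE: n / (sum of x_i)

# First derivative at the estimate: sum of [1/theta - x_i]; should be ~0.
score_at_hat = sum(1 / theta_hat - x for x in xs)

# Second derivative: sum of [-1/theta^2] = -n / theta^2; negative for all theta > 0.
second_derivative = -len(xs) / theta_hat**2

# Values of theta on either side of the estimate should all score lower.
nearby = [theta_hat * factor for factor in (0.5, 0.9, 1.1, 2.0)]
all_lower = all(
    exp_log_likelihood(xs, t) < exp_log_likelihood(xs, theta_hat) for t in nearby
)

print("theta_hat:", theta_hat, "1 / sample mean:", 1 / (sum(xs) / len(xs)))
print("score at theta_hat:", score_at_hat)
print("second derivative:", second_derivative, "nearby values lower:", all_lower)
```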
7.2.3 MLE Example (Uniform)
Example(s)
Let’s say $x_1, x_2, \ldots, x_n$ are iid samples from (continuous) Unif(0, θ). (These values might look like
$x_1 = 2.325$, $x_2 = 1.1242$, $x_3 = 9.262$, etc...) What is the MLE of θ?
Solution It turns out our usual procedure won’t work on this example, unfortunately. We’ll explain why
once we run into the problem!
To compute the likelihood, we first need the individual density functions. Recall
$$f_X(x \mid \theta) = \begin{cases} \frac{1}{\theta} & 0 \le x \le \theta \\ 0 & \text{otherwise} \end{cases}$$
Let’s actually define an indicator function for whether or not some boolean condition A is true or false:
$$I_A = \begin{cases} 1 & A \text{ is true} \\ 0 & A \text{ is false} \end{cases}$$
This way, we can rewrite the uniform density in one line as (1/θ for 0 ≤ x ≤ θ and 0 otherwise):
$$f_X(x \mid \theta) = \frac{1}{\theta} I_{\{0 \le x \le \theta\}}$$
First, we take the product over all data points of the density at that data point, and plug in the density of
the uniform distribution. How do we simplify this? First of all, we notice that in every term in the product,
there is a $1/\theta$, so we multiply it by itself n times and get $1/\theta^n$. How do we multiply indicators? If we want the
product of 1’s and 0’s to be 1, they ALL have to be 1. So,
$$I_{\{0 \le x_1 \le \theta\}} \cdot I_{\{0 \le x_2 \le \theta\}} \cdots I_{\{0 \le x_n \le \theta\}} = I_{\{0 \le x_1, \ldots, x_n \le \theta\}}$$
and our likelihood is
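$$L(x \mid \theta) = \prod_{i=1}^{n} f_X(x_i \mid \theta) = \prod_{i=1}^{n} \frac{1}{\theta} I_{\{0 \le x_i \le \theta\}} = \frac{1}{\theta^n} I_{\{0 \le x_1, \ldots, x_n \le \theta\}}$$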
We could take the log-likelihood before differentiating, but this function isn’t too bad-looking, so let’s take
the derivative of this. The $I_{\{0 \le x_1, \ldots, x_n \le \theta\}}$ just says the function is $1/\theta^n$ when the condition is true and 0
otherwise. So our derivative will just be the derivative of $1/\theta^n$ when that condition is true and 0 otherwise.
$$\frac{d}{d\theta} L(x \mid \theta) = -\frac{n}{\theta^{n+1}} I_{\{0 \le x_1, \ldots, x_n \le \theta\}}$$
Now, let’s set the derivative equal to 0 and solve for θ.
$$-\frac{n}{\theta^{n+1}} = 0 \;\to\; \theta = \,???$$
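There seems to be no value of θ that solves this. What’s going on? Let’s plot the likelihood. First, we plot
just $1/\theta^n$ (not quite the likelihood), where θ is on the x-axis:
[Figure: graph of $1/\theta^n$ against θ]
Above is a graph of $1/\theta^n$, and so if we wanted to maximize this function, we should choose θ as small as possible (the function grows without bound as θ approaches 0). But
remember that the likelihood was $\frac{1}{\theta^n} I_{\{0 \le x_1, \ldots, x_n \le \theta\}}$, which can also be written as $\frac{1}{\theta^n} I_{\{x_{max} \le \theta\}}$, because all
the samples are ≤ θ if and only if the maximum is. Below is the graph of the actual likelihood:
[Figure: graph of the likelihood $\frac{1}{\theta^n} I_{\{x_{max} \le \theta\}}$ against θ]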
Notice that multiplying by the indicator function just kept the function as is when the condition was true,
$x_{max} \le \theta$, but zeroed it out otherwise. So now we can see that our maximum likelihood estimator should be
$\hat\theta_{MLE} = x_{max} = \max\{x_1, x_2, \ldots, x_n\}$, since it achieves the highest value.
Why? Remember $x_1, \ldots, x_n \sim$ Unif(0, θ), so θ has to be at least as large as the biggest $x_i$, because if it’s
not as large as the biggest $x_i$, then it would have been impossible for that uniform to produce that largest
$x_i$. For example, if our samples were $x_1 = 2.53$, $x_2 = 8.55$, $x_3 = 4.12$, our θ had to be at least 8.55 (the
maximum sample), because if it were 7 for example, then Unif(0, 7) could not possibly generate the sample
8.55.
So remember, the $1/\theta^n$ factor in our likelihood would have preferred as small a θ as possible, but subject to
$\theta \ge x_{max}$. Therefore the “compromise” was reached by making them equal!
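To see this compromise numerically, here is a minimal sketch that evaluates the uniform likelihood on a grid of candidate θ values, using the three sample values from the paragraph above. The function name and the grid are assumptions chosen for illustration. The likelihood is zero for every θ below the largest sample and strictly decreasing above it, so the best grid point is the sample maximum.

```python
def uniform_likelihood(xs, theta):
    """(1 / theta^n) times the indicator that every x_i lies in [0, theta]."""
    if theta <= 0:
        return 0.0
    # The product of the n indicators is 1 only if ALL samples are in range.
    all_in_range = all(0 <= x <= theta for x in xs)
    return 1 / theta ** len(xs) if all_in_range else 0.0

xs = [2.53, 8.55, 4.12]     # the samples used in the explanation above
theta_hat = max(xs)         # claimed MLE: the largest sample

# Candidate values 0.01, 0.02, ..., 20.00; the grid happens to contain 8.55.
grid = [k / 100 for k in range(1, 2001)]
best_on_grid = max(grid, key=lambda t: uniform_likelihood(xs, t))

print("MLE:", theta_hat, "best grid point:", best_on_grid)
print("likelihood at theta = 7:", uniform_likelihood(xs, 7.0))   # 8.55 is impossible
print("likelihood at theta = 8.55:", uniform_likelihood(xs, 8.55))
print("likelihood at theta = 9:", uniform_likelihood(xs, 9.0))
```

Note that a derivative-based optimizer would not help here, since the maximum sits at a jump in the likelihood rather than at a point where the derivative is zero.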
I’d like to point out this is a special case because the range of the uniform distribution depends on its
parameter(s) a, b (the range of Unif(a, b) is [a, b]). On the other hand, most of our distributions like Poisson
or Exponential have the same range no matter what the values of their parameters are. For example, the
range of Poi(λ) is always {0, 1, 2, . . . } and the range of Exp(λ) is always [0, ∞), independent of λ.
Therefore, most MLE problems will be similar to the first two examples rather than this complicated one!