Maximum Likelihood Estimation Explained

The document discusses Maximum Likelihood Estimation (MLE) as a method for estimating parameters of a model to best fit data, particularly in the context of Gaussian distributions and linear regression. It highlights the process of maximizing the likelihood function and the computational convenience of minimizing the negative log-likelihood instead. Additionally, it contrasts MLE with Maximum A Posteriori (MAP) estimation, which incorporates prior knowledge to mitigate risks associated with relying solely on sample data.

Uploaded by

motherpanda06
Copyright
© All Rights Reserved

Week 4 Lecture

Parameter Estimation: The Maximum Likelihood Estimation (MLE)
Content

o Maximum Likelihood Estimation
o The MLE and The Gaussian
o MLE in Linear Regression
o The Bayesian Way
Likelihood
o For a given set of data (here, shown as a histogram), which model fits the data best? How do we choose?

o The MLE aims to answer that question. It estimates the parameters of the true (but unknown) underlying distribution from the available data (in this case, the histogram).
Likelihood
o Let us denote by $w$ the vector of parameters that govern a parametric probability distribution
o For instance, the Normal distribution has 2 parameters, $\mu$ and $\sigma$; hence $w = (\mu, \sigma)$

o We saw in Week 1 how a model could be optimised on the training set by minimising a loss function. However, given the data above, we do not have pairs of $x$ and $y$, but only data on $x$ with no corresponding $y$ to train an algorithm.
o Given such a parametric model for a probability density function $p(x \mid w)$, e.g. a Gaussian distribution, and some sample data points (a training set without labels $y$), how are the parameters optimised?
Likelihood
o Aim: We seek the values of $w$ that maximise $p(X \mid w)$, where $X = \{x_1, \dots, x_N\}$ is the observed data. In other words, we want to find the $w$ that gives the model that fits our data best.
o The MLE aims to optimise the likelihood function which, for independent and identically distributed data points, is defined by:

$$L(w) = p(X \mid w) = \prod_{i=1}^{N} p(x_i \mid w)$$

o The likelihood gives the probability (density) of observing the data that we have (i.e., the histogram above), given a certain value of $w$. Maximum Likelihood Estimation aims to find the $w$ that maximises the likelihood $L(w)$.
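To make the idea concrete, the following minimal Python sketch evaluates the likelihood of a small sample under two candidate Gaussian models and compares them. The sample values, the two candidate parameter pairs, and the helper names `gaussian_pdf` and `likelihood` are illustrative assumptions, not part of the lecture; the product form assumes the data points are independent.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a Normal(mu, sigma^2) distribution evaluated at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def likelihood(data, mu, sigma):
    """Product of the per-point densities (assumes independent data points)."""
    result = 1.0
    for x in data:
        result *= gaussian_pdf(x, mu, sigma)
    return result

# Hypothetical sample, made up for illustration only.
data = [4.8, 5.1, 5.3, 4.9, 5.0, 5.2]

# Two candidate parameter vectors w = (mu, sigma).
candidate_a = (5.0, 0.2)
candidate_b = (3.0, 0.2)

L_a = likelihood(data, *candidate_a)
L_b = likelihood(data, *candidate_b)

# The candidate with the larger likelihood explains the sample better.
best = candidate_a if L_a > L_b else candidate_b
print(best)
```

Maximum Likelihood Estimation generalises this comparison from two candidates to all possible values of the parameters.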
Maximum Likelihood
o We thus want to find the $w_{MLE}$ such that:

$$w_{MLE} = \arg\max_{w} L(w)$$

where $L(w)$ denotes the likelihood.

o In practice, instead of maximising the likelihood, we choose to minimise the negative log-likelihood (either way gives rise to the same $w_{MLE}$)

o Why?
o 1) The likelihood is a product of many small probabilities and can give rise to numerical instabilities (underflow), so we want to turn the product into a sum.
o 2) Optimisation problems are conventionally posed as minimisation, which can be done using common algorithms such as gradient descent
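The numerical issue in point 1 can be reproduced with a short Python sketch. It assumes a synthetic sample of 2000 points drawn from a standard Normal distribution; the sample size, the seed, and the helper names are assumptions chosen for illustration.

```python
import math
import random

def gaussian_pdf(x, mu, sigma):
    """Density of a Normal(mu, sigma^2) distribution evaluated at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def gaussian_log_pdf(x, mu, sigma):
    """Logarithm of the same density, computed without exponentiating."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - ((x - mu) ** 2) / (2 * sigma ** 2)

random.seed(0)  # fixed seed so the run is repeatable
data = [random.gauss(0.0, 1.0) for _ in range(2000)]

# Likelihood as a raw product: every factor is below 0.4, so the
# product shrinks below the smallest representable float.
likelihood = 1.0
for x in data:
    likelihood *= gaussian_pdf(x, 0.0, 1.0)

# Negative log-likelihood as a sum: stays in a comfortable numeric range.
nll = -sum(gaussian_log_pdf(x, 0.0, 1.0) for x in data)

print(likelihood)  # 0.0, the product has underflowed
print(nll)         # a finite positive number
```

Once the product has collapsed to zero, every candidate parameter looks equally bad, whereas the sum of logarithms still distinguishes between them.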
Maximum Likelihood
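o We thus seek to find $w_{MLE}$, the $w$ that minimises the negative log-likelihood (NLL), such that:

$$w_{MLE} = \arg\min_{w} \left[ -\ln L(w) \right] = \arg\min_{w} \left[ -\sum_{i=1}^{N} \ln p(x_i \mid w) \right]$$

o The logarithm (commonly in base $e$) is a monotonically increasing function. Hence, the value of $w$ which gives a maximum of the likelihood $L(w)$ is identical to that which minimises the NLL.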
Maximum Likelihood and the Gaussian distribution

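o The Gaussian p.d.f. in one dimension for data point $x_i$ is

$$p(x_i \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$$

o The negative log-likelihood function is therefore:

$$-\ln L(\mu, \sigma) = \frac{N}{2}\ln(2\pi\sigma^2) + \frac{1}{2\sigma^2}\sum_{i=1}^{N}(x_i - \mu)^2$$
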
Maximum Likelihood and the Gaussian distribution

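o The maximum likelihood estimate (M.L.E.) for $\mu$ can be determined directly by differentiation of the NLL:

$$\frac{\partial}{\partial \mu}\left[-\ln L(\mu, \sigma)\right] = -\frac{1}{\sigma^2}\sum_{i=1}^{N}(x_i - \mu) = 0 \quad\Rightarrow\quad \mu_{MLE} = \frac{1}{N}\sum_{i=1}^{N} x_i$$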
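As a check on this result, the sketch below computes the sample mean and compares it with the value found by running gradient descent on the Gaussian NLL with respect to the mean. It is a minimal illustration under stated assumptions: the data are synthetic, the standard deviation is treated as known and fixed, and the learning rate, step count, and function names are arbitrary choices.

```python
import math
import random

def gaussian_nll(data, mu, sigma):
    """Negative log-likelihood of i.i.d. Gaussian data."""
    n = len(data)
    squared_error = sum((x - mu) ** 2 for x in data)
    return 0.5 * n * math.log(2 * math.pi * sigma ** 2) + squared_error / (2 * sigma ** 2)

random.seed(1)
data = [random.gauss(10.0, 2.0) for _ in range(500)]  # synthetic sample
sigma = 2.0                                           # assumed known

# Closed form obtained by setting the derivative of the NLL to zero.
mu_closed = sum(data) / len(data)

# Gradient descent on the NLL averaged over the sample.
mu_gd = 0.0
learning_rate = 1.0
for _ in range(200):
    grad = -sum(x - mu_gd for x in data) / (sigma ** 2 * len(data))
    mu_gd -= learning_rate * grad

print(mu_closed, mu_gd)
```

Both routes land on the same value, and moving the mean away from it in either direction increases the NLL.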
MLE in Linear Regression
o Regression setting: we observe pairs $(x_i, y_i)$, $i = 1, \dots, N$

o Assumption 1: The data can be modelled by a line: $y_i = w x_i + \epsilon_i$

o Assumption 2: Noise of each point is modelled by a Gaussian: $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$
MLE in Linear Regression

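o Again, we are interested in predicting $y$ given data on the features $x$

o For a single data point $(x_i, y_i)$ we are interested in $p(y_i \mid x_i, w)$, where $w$ is the slope of our linear model
o Since $p(y_i \mid x_i, w)$ is Normally distributed, we have

$$p(y_i \mid x_i, w) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - w x_i)^2}{2\sigma^2}\right)$$

o How do we find the $w$ that best explains our data?

o MLE!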
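The step from the likelihood to a squared-error loss can be written out as follows. This is a sketch under assumptions: the model is a line through the origin with slope $w$, so that $y_i = w x_i + \epsilon_i$; the noise terms are independent Gaussians with zero mean; and the noise variance $\sigma^2$ is fixed.

```latex
\begin{aligned}
\mathrm{NLL}(w) &= -\ln \prod_{i=1}^{N} p(y_i \mid x_i, w) \\
&= -\sum_{i=1}^{N} \ln\left[\frac{1}{\sqrt{2\pi\sigma^2}}
   \exp\left(-\frac{(y_i - w x_i)^2}{2\sigma^2}\right)\right] \\
&= \frac{N}{2}\ln(2\pi\sigma^2) + \frac{1}{2\sigma^2}\sum_{i=1}^{N}(y_i - w x_i)^2
\end{aligned}
```

The first term does not depend on $w$ and the factor $1/(2\sigma^2)$ is a positive constant, so minimising the NLL with respect to $w$ is equivalent to minimising the sum of squared residuals $\sum_{i}(y_i - w x_i)^2$.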
MLE in Linear Regression
o We introduce the loss function

$$\ell(w) = \frac{1}{N}\sum_{i=1}^{N}(y_i - w x_i)^2$$

o $\ell(w)$ is a convex parabola in $w$ → easy to minimise!

o This specific loss function is also known as the Squared Loss, or Ordinary Least Squares (OLS)
o The OLS loss can be minimised using gradient descent, Newton’s method, or in closed form
o To be completely accurate, the loss function for data point $i$ is $(y_i - w x_i)^2$. The cost function is the average of the loss over all data points
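Two of the minimisation routes mentioned above, the closed form and gradient descent, can be compared in a few lines of Python. This is a minimal sketch assuming a single-feature model without intercept ($y = w x$), synthetic data generated with a slope of 3, and an arbitrary learning rate and step count; none of these values come from the lecture.

```python
import random

random.seed(2)
true_w = 3.0  # slope used to generate the synthetic data
xs = [random.uniform(0.0, 5.0) for _ in range(200)]
ys = [true_w * x + random.gauss(0.0, 1.0) for x in xs]  # line plus Gaussian noise
n = len(xs)

def cost(w):
    """Average squared loss over the data set."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / n

# Closed form: set d(cost)/dw = 0 and solve for w.
w_closed = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# Gradient descent on the same cost.
w_gd = 0.0
learning_rate = 0.01
for _ in range(1000):
    grad = -2.0 * sum(x * (y - w_gd * x) for x, y in zip(xs, ys)) / n
    w_gd -= learning_rate * grad

print(w_closed, w_gd, cost(w_closed))
```

Because the cost is a convex parabola in the slope, gradient descent converges to the same minimiser as the closed-form solution.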
MLE Summary
o The MLE method aims to find the best parameters for a model to fit our data
o It does so by finding the parameters that maximise the probability of observing the
data that we have
o For computational purposes, it is often easier to minimise the Negative Log-Likelihood instead
o In Linear Regression, the estimate of $y$ can be modelled by a Gaussian, where the mean defines the line, and the variance defines the noise (or the spread of the data) around that line. The MLE for the mean gives us the Squared Loss function

o What might be a drawback of only relying on the MLE to obtain a model that will be used for prediction?
MLE Summary
o The Maximum Likelihood Estimator (MLE) gives an estimate based on the sample only (observed data) by maximising the likelihood function

o Risk: Since the MLE only considers the data given to us (the sample), if that sample is too small or not representative of the population, then the model built using the MLE will fit that sample well but may perform poorly on the population (which is what we care about when making predictions on new data).
MLE and MAP
o The Maximum A Posteriori estimate (MAP) uses the observed data (i.e., through the likelihood function) but also incorporates our prior knowledge of what the distribution of the parameter that we try to estimate might be

o This prior knowledge aims to address the possible risk of the MLE described above by bringing information that is not dictated by the sample

o Posterior ∝ Likelihood × Prior


o Risk: If our prior knowledge is wrong, we might make things worse.
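The effect of a prior can be illustrated with a minimal Python sketch for one specific case: estimating the mean of a Gaussian with known standard deviation, using a Gaussian prior on that mean. In this case the maximiser of Likelihood × Prior has a closed form. The sample values, the prior settings, and the function names are hypothetical and chosen only for illustration.

```python
def mle_mean(data):
    """MLE of a Gaussian mean: the sample average."""
    return sum(data) / len(data)

def map_mean(data, sigma, prior_mean, prior_sigma):
    """MAP estimate of a Gaussian mean with known sigma and a
    Normal(prior_mean, prior_sigma^2) prior on the mean."""
    n = len(data)
    numerator = prior_sigma ** 2 * sum(data) + sigma ** 2 * prior_mean
    denominator = n * prior_sigma ** 2 + sigma ** 2
    return numerator / denominator

# Hypothetical settings.
sigma = 1.0        # known noise level
prior_mean = 5.0   # where we believe the mean lies before seeing data
prior_sigma = 1.0  # how strongly we hold that belief

small_sample = [7.9, 8.4, 8.1]
large_sample = small_sample * 100  # same values, many more observations

mle_small = mle_mean(small_sample)
map_small = map_mean(small_sample, sigma, prior_mean, prior_sigma)
mle_large = mle_mean(large_sample)
map_large = map_mean(large_sample, sigma, prior_mean, prior_sigma)

print(mle_small, map_small)
print(mle_large, map_large)
```

With only three observations the MAP estimate is pulled noticeably towards the prior mean; with many observations the likelihood dominates and the MAP estimate approaches the MLE. If the prior mean were far from the truth, the same pull would be the source of the risk noted above.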
The Bayesian Way
Questions?
