Introduction to Machine learning
Sylvain Le Corff
2 / 68
Prerequisites: basics of probability and statistics.
Objectives:
+ Learn and apply the fundamental concepts of statistical learning;
+ Understand the basic theory underlying data science;
+ Implement (Python) the most classical learning algorithms;
+ Be able to read current research books and papers.
This course:
+ 4h of lecture/TD/TP per week;
+ Evaluation: 1/4 Moodle quiz + 3/4 exam.
3 / 68
References
+ Slides adapted from Claire Boyer’s lectures (thanks!).
+ Probabilistic Machine Learning: An Introduction by Kevin P.
Murphy
(Complete book & figures)
+M1 Lecture notes - statistical learning introduction and basic
topics on regression
(Lecture notes)
Material
+ Mini tests on Moodle for training.
+ Notebooks to create all illustrations of the lectures (Experiments)
+ Materials on Moodle and .
(Github)
Introduction to machine learning 4 / 68
1. Some applications
2. Learning settings
3. Mathematical framework
4. Some criterion for regression and supervised classification
Regression
Binary classification
Scoring function
5. Empirical risk
6. Case of a finite class
7. Overfitting
8. Cross-validation
Summary 5 / 68
1. Some applications
2. Learning settings
3. Mathematical framework
4. Some criterion for regression and supervised classification
Regression
Binary classification
Scoring function
5. Empirical risk
6. Case of a finite class
7. Overfitting
8. Cross-validation
Computer vision 6 / 68
Figure: Detection of plastic trash in videos, image classification. (ImageNet,
Krizhevsky et al., 2012), (Chagneux et al. 2022).
Computer vision 7 / 68
Figure: Automatic detections using Region Proposal Network (bounding
box design). (Faster-RCNN, Ren et al. 2015). Code is available at
[Link]
Vision + Natural Language Processing 8 / 68
Figure: Visual Question Answering (VQA) tasks. (Mutan, Ben-Younes et al. 2017).
Medical applications 9 / 68
Figure: Predicting the three-dimensional structure that a protein will
adopt based solely on its amino acid sequence—the structure prediction
component of the "protein folding problem". Performance of AlphaFold
on the CASP14 dataset. (AlphaFold, Jumper et al. 2022).
Challenges - reduce GHG emissions from cities 10 / 68
Figure: Selected opportunities to reduce GHG emissions from buildings
and cities using ML. (Tackling Climate Change with Machine Learning, Rolnick et al. 2022).
Challenges - reduce GHG emissions from land use 11 / 68
Figure: Selected opportunities to reduce GHG emissions from land use
using ML. (Tackling Climate Change with Machine Learning, Rolnick et al. 2022).
Many other applications 12 / 68
I Efficient simulation in physics-based problem (Fluid simulation,
Tompson et al., 2017), (Metamodels for building energy models, Cohen et al., 2021) .
I Natural language processing (Dynamic memory networks, Kumar A. et al.,
2016), (Dall-E, OpenAI).
I Medical diagnosis .
(Causal inference for medical diagnosis, Richens, J.G. et al., 2020)
I Generativel models for speech processing (WaveNet, van den Oord A. et
al., 2016).
I Many more...
Summary 13 / 68
1. Some applications
2. Learning settings
3. Mathematical framework
4. Some criterion for regression and supervised classification
Regression
Binary classification
Scoring function
5. Empirical risk
6. Case of a finite class
7. Overfitting
8. Cross-validation
Typical machine learning problems 14 / 68
I Classification: Assign a category to each input data.
I Regression: Predict a vector associated with each input data.
I Ranking: Order input data according to some criterion.
I Clustering: Partition input data into homogeneous regions.
I Dimensionality reduction or manifold learning: Transform
an initial representation of input data into a lower-dimensional
representation while preserving some properties.
Focus on classification and regression 15 / 68
Classification: (X , Y ) ∈ Rd × {1, . . . , M}
+ Learn whether an individual from Rd belongs to some class.
+ Focus with known number M of classes, Y = {1, . . . , M}.
+ Objective: define a function f : Rd → {1, . . . , M}, such that
f (X ) is the best prediction of Y in a given context.
Regression: (X , Y ) ∈ Rd × Rm
+ The observation associated with X is assumed to be given by
Y = f∗ (X ) + ε ,
where ε is a centered noise independent of X .
+ Objective: define the best estimator of f∗ in a given context.
Learning settings 16 / 68
The two main settings are supervised and unsupervised learning.
I Supervised learning:
I Training data: a set of labeled examples
I Prediction for all unseen points.
classification, regression, and ranking problems
I Unsupervised learning:
I Training data: a set of unlabeled examples
I Prediction for all unseen points.
clustering and dimensionality reduction
I Semi-supervised learning:
I Training data: a set of both labeled and unlabeled examples
I Prediction for all unseen points.
I and also Online learning, etc...
Summary 17 / 68
1. Some applications
2. Learning settings
3. Mathematical framework
4. Some criterion for regression and supervised classification
Regression
Binary classification
Scoring function
5. Empirical risk
6. Case of a finite class
7. Overfitting
8. Cross-validation
Supervised Learning 18 / 68
Supervised Learning Framework
+ Input measurement X ∈ X (often X ⊂ Rd ).
+ Output measurement Y ∈ Y.
+ The joint distribution of (X, Y ) is unknown.
+ Y ∈ {1, . . . , M} (classification) or Y ∈ Rm (regression).
+ A predictor is a measurable function in f : X → Y.
Training data
Dn = {(Xi , Yi )}16i6n i.i.d. with the same distribution as (X, Y ).
Goal
+ Construct a good predictor fbn from the training data.
+ Predict the output for input data not in the training dataset.
Mathematical framework 19 / 68
I We use a cost function to evaluate the quality of a prediction:
` : Y × Y → R+ is such that
`(y , y 0 ) = 0 if y = y 0
> 0 if y 6= y 0 .
I Interpretation: `(y , y 0 ) measures an error between the
prediction y 0 and the observation y .
Classical choices
+ Prediction loss: `(Y , f (X)) = 1Y 6=f (X) .
+ Quadratic loss: `(Y , X) = kY − f (X)k22 .
Risk 20 / 68
We assume for the whole lecture that
I data dn = {(x1 , y1 ), . . . , (xn , yn )} are realizations of a n-sample
Dn := {(X1 , Y1 ), . . . , (Xn , Yn )}, meaning that the (Xi , Yi ),
1 6 i 6 n, are i.i.d. copies of (X , Y ) taking value in X × Y.
I For a given cost function ` : Y × Y → R+ , the performance of
a predictor f : X → Y is measured as the average loss.
Risk of a predictor
The risk of a predictor f : X → Y is defined by
R := E [`(Y , f (X ))] .
Classical choices
+ Prediction loss: E[`(Y , f (X))] = P(Y 6= f (X)).
+ Quadratic loss: E[`(Y , f (X))] = E[kY − f (X)k22 ].
THE problem in Machine Learning 21 / 68
The problem is then to find a minimizer of the risk for a fixed cost
function `
f ? ∈ argminf :X →Y {R(f ) = E [`(Y , f (X ))]} .
Such a function f ? (if it exists) is called the optimal predictor for
the cost function ` and we define R? := R(f ? ).
I f ? generally depends on the unknown probability distribution
of (X , Y ), then f ? is unknown in practice.
I Using Dn , our work consists in finding a good estimate fn of
f ? , i.e. finding fn : X → Y such that R(fn ) ' R(f ? ).
I As fn depends on Dn , R(fbn ) is a random variable!
Choice of a cost function `? 22 / 68
I The proposed mathematical framework implies that a predictor
is performant with respect to a criterion (represented by the
cost function `).
I It means that a predictor f could be good for a cost function
`1 (R1 (f ) small) but not for another cost function `2 (R2 (f )
large).
I Crucial to choose a relevant cost function for the problem we
are faced.
I Can reflect a prior that you know on your problem
Summary 23 / 68
1. Some applications
2. Learning settings
3. Mathematical framework
4. Some criterion for regression and supervised classification
Regression
Binary classification
Scoring function
5. Empirical risk
6. Case of a finite class
7. Overfitting
8. Cross-validation
Regression 24 / 68
In regression (Y = R), the quadratic cost is often used, defined as
follows:
` : R × R → R+
(y , y 0 ) 7→ (y − y 0 )2 .
Define the quadratic risk for a machine or regression function
m : X → R:
R(m) := E (Y − m(X ))2 .
As seen in your previous statistics lectures, the optimal regression
function m? for the quadratic risk is
m? (X ) := E [Y |X ] .
Indeed, for all m,
R(m? ) = E (Y − m? (X ))2 6 E (Y − m(X ))2 =: R(m).
Binary classification 25 / 68
I The output can only take 2 values (Y ∈ {0, 1}).
I Note that the distribution of (X,Y) is entirely characterized by
the marginal distribution of X and the conditional distribution
of Y given X . More precisely, for all A ∈ B(Rd ), write
µX (A) = P(X ∈ A), and
r (X ) = E [Y |X ] = P (Y = 1|X ) .
The error probability or the risk for a classification rule
For a classifier g : Rd → {0, 1},
R(g ) = E 1g (X )6=Y = P(g (X ) 6= Y ).
Binary classification 26 / 68
Does an optimum exist?
The Bayes classifier g ? is defined as:
? 1 if P (Y = 1|X ) > P (Y = 0|X ) ,
g (X ) =
0 otherwise.
Equivalently,
? 1 if r (X ) > 1/2,
g (X ) =
0 otherwise,
Lemma
For any classification rule g : Rd → {0, 1}, one has
R(g ? ) 6 R(g ).
Binary classification 27 / 68
The Bayes risk
R? := R(g ? ) = inf P(g (X ) 6= Y ).
g :Rd →{0,1}
See blackboard
I g ? depends on the distribution of (X , Y ).
I The explicit solution requires to know E[Y |X].
I If not, we cannot know g ? and R? and we will use a n-sample,
i.e. n i.i.d. copies of (X , Y ) to retrieve information on those
two quantities.
Scoring function 28 / 68
I Still in the setting of binary classification,
I instead of a classification rule g : Rd → {0, 1}, we want to
find a function S : X → R.
Definition
I Perfect score: S is perfect if there exists s ? such that
P (Y = 1|S(X ) > s ? ) = 1 and P (Y = 0|S(X ) < s ? ) = 1.
I Random score: S is random if S(X ) and Y are independent.
Scoring function I 29 / 68
Link between a score and a classification rule For a given score
S and a threshold s, we obtain a classification rule:
1 if S(x) > s,
gs (x) =
0 otherwise.
gs (X ) = 0 gs (X ) = 1
Y =0 3 7
Y =1 7 3
Therefore, for any threshold, we define two types of errors:
α(s) := P(gs (X ) = 1|Y = 0) = P(S(X ) > s|Y = 0),
β(s) := P(gs (X ) = 0|Y = 1) = P(S(X ) < s|Y = 1).
One can also define the following related quantities
I the specificity: sp(s) = P(S(X ) < s|Y = 0) = 1 − α(s)
I the sensibility: se(s) = P(S(X ) > s|Y = 1) = 1 − β(s)
The ROC curve I 30 / 68
We can measure the performance of a score by visualizing errors
α(s) and β(s) on a same graph for all threshold s.
Definition
The ROC curve of a score S is the parametrized curve (x(s), y (s))
defined by
x(s) = α(s) = 1 − sp(s)
= P(gs (X ) = 1|Y = 0) = P(S(X ) > s|Y = 0)
y (s) = 1 − β(s) = se(s)
= P(gs (X ) = 1|Y = 1) = P(S(X ) > s|Y = 1).
ROC stands for "receiver operating characteristic".
The ROC curve - example of a linear classifier 31 / 68
For a linear classifier, the score writes S : x 7→ ω > x + b where
ω ∈ Rd and b ∈ R (see next lecture on discriminant analysis and
logistic regression).
For each choice of the threshold s ∗ , x(s ∗ ) = P(S(X ) > s ∗ |Y = 0)
and y (s ∗ ) = P(S(X ) > s ∗ |Y = 1) are unknown and replaced by the
empirical false positive rate (FPR) and true positive rate (TPR).
Pn
i=1 1S(Xi )>s ∗ ;Yi =0
FPR = Pn ,
1Y =0
Pn i=1 i
i=1 1S(Xi )>s ∗ ;Yi =1
TPR = Pn .
i=1 1Yi =1
The ROC curve - example of a linear classifier 32 / 68
Figure: 1000 data points are randomly generated (d = 2) with Gaussian
distribution. A Logistic regression model is used to label each data
randomly.
The ROC curve - example of a linear classifier 33 / 68
Figure: A ROC curve for a logistic regression classifier 200 additional data
points not in the training set. Simulations are inspired by (Roc curve with
scikit-learn) and can be found here [Link]
Area Under ROC I 34 / 68
The Area Under ROC (AUC) is often used to measure performance
of a score S. Note that
I Perfect score: AUC (S) = 1.
I Random score: AUC (S) = 1/2.
Proposition
Let (X1 , Y1 ) and (X2 , Y2 ) be two i.i.d. copies of (X , Y ). Then,
AUC (S) = P (S(X1 ) > S(X2 )|(Y1 , Y2 ) = (1, 0)) .
AUC(S) measures the probability that S correctly orders two
observations with different labels.
AUC(S) can be seen as a cost function for a score S;
Area Under ROC II 35 / 68
Question: does there exist an optimal score S ? for this cost
function?
Theorem (S Clémençon, G Lugosi, N Vayatis, 2008)
Let S ? (X ) = P(Y = 1|X ), then for any score S we have
AUC (S ? ) > AUC (S).
I The distribution of (X,Y) is unknown, so we do not have
access to S ? (x).
I We should find a good empirical estimate Sn of
S ? (X ) = P(Y = 1|X ).
Summary 36 / 68
Cost function `(y , f (x)) Risk E [`(y , f (x))] Optimum f ?
(y − f (x))2 E (Y − f (X ))2
Regression E [Y |X ]
Binary classif. 1y 6=f (x) P(Y 6= f (X )) Bayes rule
Scoring AUC (S) P(Y = 1|X ).
Summary 37 / 68
1. Some applications
2. Learning settings
3. Mathematical framework
4. Some criterion for regression and supervised classification
Regression
Binary classification
Scoring function
5. Empirical risk
6. Case of a finite class
7. Overfitting
8. Cross-validation
To the empirical risk 38 / 68
I Consider a n-sample Dn = (X1 , Y1 ), . . . , (Xn , Yn ), where
(Xi , Yi ), 1 6 i 6 n, are i.i.d. copies of (X , Y ).
I Let C be a class of potential classifiers.
Since the distribution of (X , Y ) is generally unknown, the
minimization of the risk is impossible in practice.
Goal
Therefore, given a cost function ` : Y × Y → R+ , we search a
predictor fn (x) = fn (x, Dn ) "close" to the optimal f ? defined by
f ? ∈ argmin R(f ) ,
where R(f ) = E [`(Y , f (X ))].
The problem is then to find fn? ∈ C such that R(fn? ) ' inf f ∈C R(f ).
Empirical risk minimization (ERM) 39 / 68
Empirical risk
Since the risk is an expectation, a first natural choice is to choose
the one that minimizes its empirical version:
n
b n (f ) = 1
X
R `(Yi , f (Xi )).
n
i=1
A simple reformulation reads as follows
? ?
Rn (fn ) − R = Rn (fn ) − inf R(f ) + inf R(f ) − R
b b
f ∈C f ∈C
| {z } | {z }
estimation error approximation error
I The estimation error is random. It reflects the discrepancy in terms
of P, between the chosen classifier and the "local champion" in C.
I The approximation error is deterministic and measures the closeness
in terms of P between the class C and the optimal choice f ? .
Some comments on Empirical risk minimization 40 / 68
C should be wide enough for the approximation error to be small.
C should not be too wide for the control of the estimation error.
Consider for instance that C is the set of all measurable functions
from Rd to {0, 1}. The approximation error is zero, but the
estimation error can be large: think of the choice
Yi , if x = Xi , 1 6 i 6 n ,
fn (x) =
0 otherwise,
for which the empirical risk is zero!
no generalization ability.
overfitting (to be continued).
Empirical risk minimization 41 / 68
Lemma
Define fn? ∈ argminf ∈C R
b n (f ) . Then,
1.
R(fn? ) − inf R(f ) 6 2 sup R
b n (f ) − R(f )
f ∈C f ∈C
and,
2.
b n (f ? ) − R(f ? ) 6 sup |R
R b n (f ) − R(f )|.
n n
f ∈C
Proof on blackboard
The objective is now to control supf ∈C |R
b n (f ) − R(f )|.
Summary 42 / 68
1. Some applications
2. Learning settings
3. Mathematical framework
4. Some criterion for regression and supervised classification
Regression
Binary classification
Scoring function
5. Empirical risk
6. Case of a finite class
7. Overfitting
8. Cross-validation
The case of a finite class of binary classifiers 43 / 68
I (X , Y ) ∈ Rd × {0, 1}.
I A natural choice for the cost function in this case is
`(y , f (x)) = 1y 6=f (x) .
I The risk can be then written for this loss function:
R(f ) = P(Y 6= f (X )).
I The optimal choice for the classifier is the Bayes rule
? 1 if P (Y = 1|X ) > P (Y = 0|X ) ,
g (X ) =
0 otherwise.
I The empirical risk is then
n
b n (f ) = 1
X
R 1f (Xi )6=Yi .
n
i=1
ERM and a finite class 44 / 68
Objective: Provide a theoretical control of supf ∈C |R
b n (f ) − R(f )|.
We need a uniform deviation of R
b n (f ) from its expectation R(f ).
Given f ∈ C,
n
X
nR
b n (f ) = 1f (Xi )6=Yi ,
i=1
L
so that nR
b n (f ) ∼ B(n, R(f )).
Therefore, we need a uniform control for binomial randm variables.
ERM and a finite class 45 / 68
Theorem (Hoeffding’s inequality)
Let X1 , . . . , Xn be independent real-valued random variables. Assume
that each Xi takes its values in [ai , bi ] (ai < bi) with probability one.
Then, for all t > 0,
n
" n # !
X X 2 Pn 2
P Xi − E Xi > t 6 e−2t / i=1 (bi −ai ) ,
i=1 i=1
and " n # !
n
X X 2 Pn 2
P Xi − E Xi 6 −t 6 e−2t / i=1 (bi −ai ) .
i=1 i=1
In particular,
n
" n
# !
X X 2 Pn 2
P Xi − E Xi >t 6 2e−2t / i=1 (bi −ai ) .
i=1 i=1
46 / 68
Proof of Hoeffding’s inequality by using Chernoff’s bounding
method and the following lemma.
Lemma
Let X be a real-valued random variable with E[X ] = 0 and
X ∈ [a, b] (a < b) with probability one. Then, for all s > 0,
h i 2 2 2
E esX 6 es (b −a )/8 .
Proof on blackboard
47 / 68
3. Getting back to the deviation of binomial r.v.
Corollary
Let X ∼ B(n, p), n > 1, p ∈ [0, 1]. Then for all t > 0,
2
P (|X − np| > t) 6 2e−2t /n
.
A union bound (|C| < ∞) leads to the following result.
Theorem
Assume that |C| is finite, with |C| 6 N. Then, for all t > 0,
2
P sup Rn (f ) − R(f ) > t 6 2Ne−2nt .
b
f ∈C
48 / 68
In the case where |C| is finite, with |C| 6 N,
2
P sup Rn (f ) − R(f ) > t 6 2Ne −2nt .
b
f ∈C
I The bound is universal.
I Borel-Cantelli: supf ∈C |R
b n (f ) − R(f )| → 0 almost surely.
I Consequence: R(fn? ) − inf f ∈C R(f ) → 0 almost surely.
⇒The estimation error tends to 0 almost surely
meaning that learning is asymptotically optimal.
I Bound on E[R(fn? )] − inf f ∈C R(f )?
49 / 68
4. From probability to expectation
Lemma (P to E)
Let Z be a random variable taking values in R+ . Assume that there
exists a constant C > 1 such that, for all t > 0,
2
P(Z > t) 6 C e−2nt . Then,
r
log(C e)
E[Z ] 6 .
2n
Combined with the previous result, this yields
r
E sup R b n (f ) − R(f ) 6 log (2eN) ,
f ∈C 2n
and r
log (2eN)
E[R(fn? )] − inf R(f ) 6 2 .
f ∈C 2n
ERM with a finite class 50 / 68
In the case where |C| is finite, with |C| 6 N,
r
? log (2eN)
E [R(fn )] − inf R(f ) 6 2 .
f ∈C 2n
Take-home message
For a finite class C, such that |C| 6 N
q
log (N)
Expectation of Estimation error = O n .
The next objective is to handle more complex classes of functions,
that would be the purpose of next sessions
Additional results at [A few notes]
Summary 51 / 68
1. Some applications
2. Learning settings
3. Mathematical framework
4. Some criterion for regression and supervised classification
Regression
Binary classification
Scoring function
5. Empirical risk
6. Case of a finite class
7. Overfitting
8. Cross-validation
Model complexity 52 / 68
Most of statistical learning algorithms depends on parameters λ.
Some examples are:
I number of input variables in linear and logistic models,
I penalty parameters for lasso and ridge regressions,
I depth for tree algorithms,
I number of nearest neighbors,
I number of iterations for boosting algorithms,
I The choice of theses parameters reveals crucial for the
performance of the machine...
Parameter λ often measures model complexity.
53 / 68
With a fixed sample size, varying the model complexity.
Bias/Variance 54 / 68
I Bias: difference between the expected value of the estimator
(model) and the true value being estimated!
h i
Bias(fˆ(x)) = E fˆ(x) − y
I A simpler model has a higher bias (naturally a simple model
will do some errors)!
I High bias can cause underfitting!
I Variance: deviation from the expected value of the estimates!
2
ˆ ˆ ˆ
Var(f (x)) = E f (x) − E(f (x))
I A more complex model has a higher variance!
I High variance can cause overfitting!
Ideally we want to optimize both of them.
Bias-Variance tradeoff 55 / 68
I When do we have high bias?
I We have high bias when the model (function) cannot model
the true data distribution well!
I This doesn’t depend on the training data size!
I Underfitting!
I When do we have high variance?
I We have high variance when there is a small amount of
training data and a very complex model!
I Overfitting!
I Variance decreases with larger training data, and increases
with more complicated classifiers!
Bias/Variance tradeoff 56 / 68
57 / 68
Take-home message
I High bias =⇒ high training and test errors!
I High variance =⇒ low training error, high test
errors!
58 / 68
Figure: Illustration of overfitting in regression. Here, λ controls the
regressor smoothness.
Summary 59 / 68
1. Some applications
2. Learning settings
3. Mathematical framework
4. Some criterion for regression and supervised classification
Regression
Binary classification
Scoring function
5. Empirical risk
6. Case of a finite class
7. Overfitting
8. Cross-validation
Generalization 60 / 68
Goal of supervised learning
I A trained classifier has to be generalizable: it must be able to
work on other data than the training dataset
I Generalizable means also “works without overfitting”.
I The empirical error on the training set is a poor estimate of
the generalization error (expected error on new data)!
If the model is overfitting, the generalization error can be
arbitrarily large!
I We would like to estimate the generalization error on new
data, which we do not have!
Validation sets 61 / 68
I Choose the model that performs best on a validation set
separate from the training set!
I Because we have not used the validation data at any point
during training, the validation set can be considered “new
data” and the error on the validation set is an estimation of
the generalization error!
Validation hold out 62 / 68
The simplest approach consists in splitting the data Dn into:
1. a learning or training set Dn,train used to learn the machine fn ,
2. a validation or test set Dn,test used to estimate the risk of fn .
Algorithm 1 Validation hold out
Inputs: Dn data, (T , V) a partition of {1, . . . , n}.
1. Learn the machine with
Dn,train = {(Xi , Yi ), i ∈ T } =⇒ fn,train ;
2. Compute
b n (fn,train ) = 1
X
R ` (Yi , fn,train (Xi )) .
|V|
i∈V
Remark
ntrain = |T | and ntest should be large enough.
Model selection 63 / 68
I What if we want to choose among k models?!
I Train each model on the train set!
I Compute the prediction error of each model on the validation
set!
I Pick the model with the smallest prediction error on the
validation set!
I What is the generalization error?!
I We don’t know!!
I Validation data was used to select the model!
I We have “cheated” and looked at the validation data: it is not
a good proxy for new, unseen data any more!
Model selection 64 / 68
I Hence we need to set aside part of the data, the test set, that
remains untouched during the entire procedure and on which
we’ll estimate the generalization error!
I Model selection: pick the best model!
I Model assessment: estimate its prediction error on new data!
Validation sets 65 / 68
I How much data should go in each of the training, validation
and test sets?!
I How do we know that we have enough data to evaluate the
prediction and generalization errors?!
I Empirical evaluation with sample re-use!
I Cross-validation
I Bootstrap (random sampling with replacement)
Cross-validation 66 / 68
I Cut the training set in K separate folds!
I For each fold, train on the (K − 1) remaining folds!
K-fold cross validation 67 / 68
Algorithm 2 K-fold CV
Inputs: Dn data, K an integer;
1. Define a random partition {I1 , . . . , IK } of {1, . . . , n};
2. For a fixed λ, for k = 1, . . . , K
I Itrain = {1, . . . , n} \ Ik and Itest = Ik ;
(λ)
I Learn the machine with Dn,train = {(Xi , Yi ), i ∈ T } =⇒ fn,k ;
I Compute the test error
(λ) 1
P (λ)
Errtest (fn,k ) = n−|I k| i∈Ik ` Yi , fn,k (Xi ) .
3. Choose
K
1 X (λ)
λ̂(CV ) ∈ argminλ∈Λ Errtest (fn,k ).
K
k=1
K-fold CV 68 / 68
I K has to be chosen by the user. Usually K = 5 or 10.
I Advantage of this method over repeated random sub-sampling
(bootstrap) is that all observations are used for both training
and validation, and each observation is used for validation
exactly once.
Leave-one-out CV
I When K = n, we obtain the leave-one-out (LOO) cross
validation, since at each iteration exactly one instance is left
out of the training sample.
I In general, the leave-one-out error is very costly to compute,
since it requires training n times on samples of size n − 1, but
for some algorithms it admits a very efficient computation.
I Exercise: LOO-CV in least squares regression