0% found this document useful (0 votes)

14 views68 pages

Intro

The document is an introduction to machine learning, covering fundamental concepts of statistical learning, learning algorithms, and applications in various fields such as computer vision and medical diagnosis. It outlines course objectives, prerequisites, evaluation methods, and includes a mathematical framework for supervised and unsupervised learning. Key topics discussed include classification, regression, overfitting, cross-validation, and the importance of choosing appropriate cost functions.

Uploaded by

zakariaboullam3

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

14 views68 pages

Intro

Uploaded by

zakariaboullam3

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

Introduction to Machine learning

Sylvain Le Corff
2 / 68

Prerequisites: basics of probability and statistics.

Objectives:
+ Learn and apply the fundamental concepts of statistical learning;

+ Understand the basic theory underlying data science;

+ Implement (Python) the most classical learning algorithms;

+ Be able to read current research books and papers.

This course:
+ 4h of lecture/TD/TP per week;

+ Evaluation: 1/4 Moodle quiz + 3/4 exam.

3 / 68

References
+ Slides adapted from Claire Boyer’s lectures (thanks!).
+ Probabilistic Machine Learning: An Introduction by Kevin P.
Murphy
(Complete book & figures)

+M1 Lecture notes - statistical learning introduction and basic

topics on regression
(Lecture notes)

Material
+ Mini tests on Moodle for training.

+ Notebooks to create all illustrations of the lectures (Experiments)

+ Materials on Moodle and .

(Github)
Introduction to machine learning 4 / 68

1. Some applications
2. Learning settings
3. Mathematical framework
4. Some criterion for regression and supervised classification
Regression
Binary classification
Scoring function
5. Empirical risk
6. Case of a finite class
7. Overfitting
8. Cross-validation
Summary 5 / 68

Figure: Detection of plastic trash in videos, image classification. (ImageNet,

Krizhevsky et al., 2012), (Chagneux et al. 2022).
Computer vision 7 / 68

Figure: Automatic detections using Region Proposal Network (bounding

box design). (Faster-RCNN, Ren et al. 2015). Code is available at
[Link]
Vision + Natural Language Processing 8 / 68

Figure: Visual Question Answering (VQA) tasks. (Mutan, Ben-Younes et al. 2017).
Medical applications 9 / 68

Figure: Predicting the three-dimensional structure that a protein will

adopt based solely on its amino acid sequence—the structure prediction
component of the "protein folding problem". Performance of AlphaFold
on the CASP14 dataset. (AlphaFold, Jumper et al. 2022).
Challenges - reduce GHG emissions from cities 10 / 68

Figure: Selected opportunities to reduce GHG emissions from buildings

and cities using ML. (Tackling Climate Change with Machine Learning, Rolnick et al. 2022).
Challenges - reduce GHG emissions from land use 11 / 68

Figure: Selected opportunities to reduce GHG emissions from land use

using ML. (Tackling Climate Change with Machine Learning, Rolnick et al. 2022).
Many other applications 12 / 68

I Efficient simulation in physics-based problem (Fluid simulation,

Tompson et al., 2017), (Metamodels for building energy models, Cohen et al., 2021) .
I Natural language processing (Dynamic memory networks, Kumar A. et al.,

2016), (Dall-E, OpenAI).

I Medical diagnosis .
(Causal inference for medical diagnosis, Richens, J.G. et al., 2020)

I Generativel models for speech processing (WaveNet, van den Oord A. et

al., 2016).

I Many more...
Summary 13 / 68

I Classification: Assign a category to each input data.

I Regression: Predict a vector associated with each input data.

I Ranking: Order input data according to some criterion.

I Clustering: Partition input data into homogeneous regions.

I Dimensionality reduction or manifold learning: Transform

an initial representation of input data into a lower-dimensional
representation while preserving some properties.
Focus on classification and regression 15 / 68

Classification: (X , Y ) ∈ Rd × {1, . . . , M}
+ Learn whether an individual from Rd belongs to some class.
+ Focus with known number M of classes, Y = {1, . . . , M}.
+ Objective: define a function f : Rd → {1, . . . , M}, such that
f (X ) is the best prediction of Y in a given context.

Regression: (X , Y ) ∈ Rd × Rm
+ The observation associated with X is assumed to be given by

Y = f∗ (X ) + ε ,

where ε is a centered noise independent of X .

+ Objective: define the best estimator of f∗ in a given context.
Learning settings 16 / 68

The two main settings are supervised and unsupervised learning.

I Supervised learning:
I Training data: a set of labeled examples
I Prediction for all unseen points.
classification, regression, and ranking problems
I Unsupervised learning:
I Training data: a set of unlabeled examples
I Prediction for all unseen points.
clustering and dimensionality reduction
I Semi-supervised learning:
I Training data: a set of both labeled and unlabeled examples
I Prediction for all unseen points.
I and also Online learning, etc...
Summary 17 / 68

Supervised Learning Framework

+ Input measurement X ∈ X (often X ⊂ Rd ).
+ Output measurement Y ∈ Y.
+ The joint distribution of (X, Y ) is unknown.
+ Y ∈ {1, . . . , M} (classification) or Y ∈ Rm (regression).
+ A predictor is a measurable function in f : X → Y.

Training data
Dn = {(Xi , Yi )}16i6n i.i.d. with the same distribution as (X, Y ).

Goal
+ Construct a good predictor fbn from the training data.
+ Predict the output for input data not in the training dataset.
Mathematical framework 19 / 68

I We use a cost function to evaluate the quality of a prediction:

` : Y × Y → R+ is such that

`(y , y 0 ) = 0 if y = y 0
> 0 if y 6= y 0 .

I Interpretation: `(y , y 0 ) measures an error between the

prediction y 0 and the observation y .

Classical choices
+ Prediction loss: `(Y , f (X)) = 1Y 6=f (X) .
+ Quadratic loss: `(Y , X) = kY − f (X)k22 .
Risk 20 / 68

We assume for the whole lecture that

I data dn = {(x1 , y1 ), . . . , (xn , yn )} are realizations of a n-sample
Dn := {(X1 , Y1 ), . . . , (Xn , Yn )}, meaning that the (Xi , Yi ),
1 6 i 6 n, are i.i.d. copies of (X , Y ) taking value in X × Y.
I For a given cost function ` : Y × Y → R+ , the performance of
a predictor f : X → Y is measured as the average loss.

Risk of a predictor
The risk of a predictor f : X → Y is defined by

R := E [`(Y , f (X ))] .

Classical choices
+ Prediction loss: E[`(Y , f (X))] = P(Y 6= f (X)).
+ Quadratic loss: E[`(Y , f (X))] = E[kY − f (X)k22 ].
THE problem in Machine Learning 21 / 68

The problem is then to find a minimizer of the risk for a fixed cost
function `

f ? ∈ argminf :X →Y {R(f ) = E [`(Y , f (X ))]} .

Such a function f ? (if it exists) is called the optimal predictor for

the cost function ` and we define R? := R(f ? ).

I f ? generally depends on the unknown probability distribution

of (X , Y ), then f ? is unknown in practice.
I Using Dn , our work consists in finding a good estimate fn of
f ? , i.e. finding fn : X → Y such that R(fn ) ' R(f ? ).
I As fn depends on Dn , R(fbn ) is a random variable!
Choice of a cost function `? 22 / 68

I The proposed mathematical framework implies that a predictor

is performant with respect to a criterion (represented by the
cost function `).
I It means that a predictor f could be good for a cost function
`1 (R1 (f ) small) but not for another cost function `2 (R2 (f )
large).

I Crucial to choose a relevant cost function for the problem we

are faced.
I Can reflect a prior that you know on your problem
Summary 23 / 68

In regression (Y = R), the quadratic cost is often used, defined as

follows:
` : R × R → R+
(y , y 0 ) 7→ (y − y 0 )2 .
Define the quadratic risk for a machine or regression function
m : X → R:
R(m) := E (Y − m(X ))2 .

As seen in your previous statistics lectures, the optimal regression

function m? for the quadratic risk is

m? (X ) := E [Y |X ] .
Indeed, for all m,
R(m? ) = E (Y − m? (X ))2 6 E (Y − m(X ))2 =: R(m).

Binary classification 25 / 68

I The output can only take 2 values (Y ∈ {0, 1}).

I Note that the distribution of (X,Y) is entirely characterized by
the marginal distribution of X and the conditional distribution
of Y given X . More precisely, for all A ∈ B(Rd ), write
µX (A) = P(X ∈ A), and

r (X ) = E [Y |X ] = P (Y = 1|X ) .

The error probability or the risk for a classification rule

For a classifier g : Rd → {0, 1},

R(g ) = E 1g (X )6=Y = P(g (X ) 6= Y ).
Binary classification 26 / 68

Does an optimum exist?

The Bayes classifier g ? is defined as:

? 1 if P (Y = 1|X ) > P (Y = 0|X ) ,
g (X ) =
0 otherwise.

Equivalently,

? 1 if r (X ) > 1/2,
g (X ) =
0 otherwise,

Lemma
For any classification rule g : Rd → {0, 1}, one has

R(g ? ) 6 R(g ).
Binary classification 27 / 68

The Bayes risk

R? := R(g ? ) = inf P(g (X ) 6= Y ).
g :Rd →{0,1}

See blackboard

I g ? depends on the distribution of (X , Y ).

I The explicit solution requires to know E[Y |X].
I If not, we cannot know g ? and R? and we will use a n-sample,
i.e. n i.i.d. copies of (X , Y ) to retrieve information on those
two quantities.
Scoring function 28 / 68

I Still in the setting of binary classification,

I instead of a classification rule g : Rd → {0, 1}, we want to
find a function S : X → R.

Definition
I Perfect score: S is perfect if there exists s ? such that

P (Y = 1|S(X ) > s ? ) = 1 and P (Y = 0|S(X ) < s ? ) = 1.

I Random score: S is random if S(X ) and Y are independent.

Scoring function I 29 / 68

Link between a score and a classification rule For a given score

S and a threshold s, we obtain a classification rule:

1 if S(x) > s,
gs (x) =
0 otherwise.

gs (X ) = 0 gs (X ) = 1
Y =0 3 7
Y =1 7 3

Therefore, for any threshold, we define two types of errors:

α(s) := P(gs (X ) = 1|Y = 0) = P(S(X ) > s|Y = 0),

β(s) := P(gs (X ) = 0|Y = 1) = P(S(X ) < s|Y = 1).

One can also define the following related quantities

I the specificity: sp(s) = P(S(X ) < s|Y = 0) = 1 − α(s)
I the sensibility: se(s) = P(S(X ) > s|Y = 1) = 1 − β(s)
The ROC curve I 30 / 68

We can measure the performance of a score by visualizing errors

α(s) and β(s) on a same graph for all threshold s.

Definition
The ROC curve of a score S is the parametrized curve (x(s), y (s))
defined by


 x(s) = α(s) = 1 − sp(s)
= P(gs (X ) = 1|Y = 0) = P(S(X ) > s|Y = 0)


 y (s) = 1 − β(s) = se(s)
= P(gs (X ) = 1|Y = 1) = P(S(X ) > s|Y = 1).


ROC stands for "receiver operating characteristic".

The ROC curve - example of a linear classifier 31 / 68

For a linear classifier, the score writes S : x 7→ ω > x + b where

ω ∈ Rd and b ∈ R (see next lecture on discriminant analysis and
logistic regression).

For each choice of the threshold s ∗ , x(s ∗ ) = P(S(X ) > s ∗ |Y = 0)

and y (s ∗ ) = P(S(X ) > s ∗ |Y = 1) are unknown and replaced by the
empirical false positive rate (FPR) and true positive rate (TPR).

Pn
i=1 1S(Xi )>s ∗ ;Yi =0
FPR = Pn ,
1Y =0
Pn i=1 i
i=1 1S(Xi )>s ∗ ;Yi =1
TPR = Pn .
i=1 1Yi =1
The ROC curve - example of a linear classifier 32 / 68

Figure: 1000 data points are randomly generated (d = 2) with Gaussian

distribution. A Logistic regression model is used to label each data
randomly.
The ROC curve - example of a linear classifier 33 / 68

Figure: A ROC curve for a logistic regression classifier 200 additional data
points not in the training set. Simulations are inspired by (Roc curve with
scikit-learn) and can be found here [Link]
Area Under ROC I 34 / 68

The Area Under ROC (AUC) is often used to measure performance

of a score S. Note that
I Perfect score: AUC (S) = 1.
I Random score: AUC (S) = 1/2.

Proposition
Let (X1 , Y1 ) and (X2 , Y2 ) be two i.i.d. copies of (X , Y ). Then,

AUC (S) = P (S(X1 ) > S(X2 )|(Y1 , Y2 ) = (1, 0)) .

AUC(S) measures the probability that S correctly orders two

observations with different labels.
AUC(S) can be seen as a cost function for a score S;
Area Under ROC II 35 / 68

Question: does there exist an optimal score S ? for this cost

function?
Theorem (S Clémençon, G Lugosi, N Vayatis, 2008)
Let S ? (X ) = P(Y = 1|X ), then for any score S we have

AUC (S ? ) > AUC (S).

I The distribution of (X,Y) is unknown, so we do not have

access to S ? (x).
I We should find a good empirical estimate Sn of
S ? (X ) = P(Y = 1|X ).
Summary 36 / 68

Cost function `(y , f (x)) Risk E [`(y , f (x))] Optimum f ?

(y − f (x))2 E (Y − f (X ))2

Regression E [Y |X ]

Binary classif. 1y 6=f (x) P(Y 6= f (X )) Bayes rule

Scoring AUC (S) P(Y = 1|X ).

Summary 37 / 68

I Consider a n-sample Dn = (X1 , Y1 ), . . . , (Xn , Yn ), where

(Xi , Yi ), 1 6 i 6 n, are i.i.d. copies of (X , Y ).
I Let C be a class of potential classifiers.
Since the distribution of (X , Y ) is generally unknown, the
minimization of the risk is impossible in practice.
Goal
Therefore, given a cost function ` : Y × Y → R+ , we search a
predictor fn (x) = fn (x, Dn ) "close" to the optimal f ? defined by

f ? ∈ argmin R(f ) ,

where R(f ) = E [`(Y , f (X ))].

The problem is then to find fn? ∈ C such that R(fn? ) ' inf f ∈C R(f ).
Empirical risk minimization (ERM) 39 / 68

Empirical risk
Since the risk is an expectation, a first natural choice is to choose
the one that minimizes its empirical version:
n
b n (f ) = 1
X
R `(Yi , f (Xi )).
n
i=1

A simple reformulation reads as follows

? ?
Rn (fn ) − R = Rn (fn ) − inf R(f ) + inf R(f ) − R
b b
f ∈C f ∈C
| {z } | {z }
estimation error approximation error

I The estimation error is random. It reflects the discrepancy in terms

of P, between the chosen classifier and the "local champion" in C.
I The approximation error is deterministic and measures the closeness
in terms of P between the class C and the optimal choice f ? .
Some comments on Empirical risk minimization 40 / 68

C should be wide enough for the approximation error to be small.

C should not be too wide for the control of the estimation error.

Consider for instance that C is the set of all measurable functions

from Rd to {0, 1}. The approximation error is zero, but the
estimation error can be large: think of the choice

Yi , if x = Xi , 1 6 i 6 n ,
fn (x) =
0 otherwise,

for which the empirical risk is zero!

no generalization ability.
overfitting (to be continued).
Empirical risk minimization 41 / 68

Lemma
Define fn? ∈ argminf ∈C R
b n (f ) . Then,
1.
R(fn? ) − inf R(f ) 6 2 sup R
b n (f ) − R(f )
f ∈C f ∈C

and,
2.
b n (f ? ) − R(f ? ) 6 sup |R
R b n (f ) − R(f )|.
n n
f ∈C

Proof on blackboard

The objective is now to control supf ∈C |R

b n (f ) − R(f )|.
Summary 42 / 68

I (X , Y ) ∈ Rd × {0, 1}.
I A natural choice for the cost function in this case is

`(y , f (x)) = 1y 6=f (x) .

I The risk can be then written for this loss function:

R(f ) = P(Y 6= f (X )).

I The optimal choice for the classifier is the Bayes rule

? 1 if P (Y = 1|X ) > P (Y = 0|X ) ,
g (X ) =
0 otherwise.
I The empirical risk is then
n
b n (f ) = 1
X
R 1f (Xi )6=Yi .
n
i=1
ERM and a finite class 44 / 68

Objective: Provide a theoretical control of supf ∈C |R

b n (f ) − R(f )|.

We need a uniform deviation of R

b n (f ) from its expectation R(f ).

Given f ∈ C,
n
X
nR
b n (f ) = 1f (Xi )6=Yi ,
i=1
L
so that nR
b n (f ) ∼ B(n, R(f )).

Therefore, we need a uniform control for binomial randm variables.

ERM and a finite class 45 / 68

Theorem (Hoeffding’s inequality)

Let X1 , . . . , Xn be independent real-valued random variables. Assume
that each Xi takes its values in [ai , bi ] (ai < bi) with probability one.
Then, for all t > 0,
n
" n # !
X X 2 Pn 2
P Xi − E Xi > t 6 e−2t / i=1 (bi −ai ) ,
i=1 i=1

and " n # !
n
X X 2 Pn 2
P Xi − E Xi 6 −t 6 e−2t / i=1 (bi −ai ) .
i=1 i=1

In particular,
n
" n
# !
X X 2 Pn 2
P Xi − E Xi >t 6 2e−2t / i=1 (bi −ai ) .
i=1 i=1
46 / 68

Proof of Hoeffding’s inequality by using Chernoff’s bounding

method and the following lemma.
Lemma
Let X be a real-valued random variable with E[X ] = 0 and
X ∈ [a, b] (a < b) with probability one. Then, for all s > 0,
h i 2 2 2
E esX 6 es (b −a )/8 .

Proof on blackboard
47 / 68

3. Getting back to the deviation of binomial r.v.

Corollary
Let X ∼ B(n, p), n > 1, p ∈ [0, 1]. Then for all t > 0,
2
P (|X − np| > t) 6 2e−2t /n
.

A union bound (|C| < ∞) leads to the following result.

Theorem
Assume that |C| is finite, with |C| 6 N. Then, for all t > 0,

2
P sup Rn (f ) − R(f ) > t 6 2Ne−2nt .
b
f ∈C
48 / 68

In the case where |C| is finite, with |C| 6 N,

2
P sup Rn (f ) − R(f ) > t 6 2Ne −2nt .
b
f ∈C

I The bound is universal.

I Borel-Cantelli: supf ∈C |R
b n (f ) − R(f )| → 0 almost surely.
I Consequence: R(fn? ) − inf f ∈C R(f ) → 0 almost surely.
⇒The estimation error tends to 0 almost surely
meaning that learning is asymptotically optimal.
I Bound on E[R(fn? )] − inf f ∈C R(f )?
49 / 68

4. From probability to expectation

Lemma (P to E)
Let Z be a random variable taking values in R+ . Assume that there
exists a constant C > 1 such that, for all t > 0,
2
P(Z > t) 6 C e−2nt . Then,
r
log(C e)
E[Z ] 6 .
2n
Combined with the previous result, this yields
r
E sup R b n (f ) − R(f ) 6 log (2eN) ,
f ∈C 2n

and r
log (2eN)
E[R(fn? )] − inf R(f ) 6 2 .
f ∈C 2n
ERM with a finite class 50 / 68

In the case where |C| is finite, with |C| 6 N,

r
? log (2eN)
E [R(fn )] − inf R(f ) 6 2 .
f ∈C 2n

Take-home message
For a finite class C, such that |C| 6 N
q
log (N)
Expectation of Estimation error = O n .

The next objective is to handle more complex classes of functions,

that would be the purpose of next sessions
Additional results at [A few notes]
Summary 51 / 68

Most of statistical learning algorithms depends on parameters λ.

Some examples are:
I number of input variables in linear and logistic models,
I penalty parameters for lasso and ridge regressions,
I depth for tree algorithms,
I number of nearest neighbors,
I number of iterations for boosting algorithms,
I The choice of theses parameters reveals crucial for the
performance of the machine...
Parameter λ often measures model complexity.
53 / 68

With a fixed sample size, varying the model complexity.

Bias/Variance 54 / 68

I Bias: difference between the expected value of the estimator

(model) and the true value being estimated!
h i
Bias(fˆ(x)) = E fˆ(x) − y

I A simpler model has a higher bias (naturally a simple model

will do some errors)!
I High bias can cause underfitting!
I Variance: deviation from the expected value of the estimates!
2
ˆ ˆ ˆ
Var(f (x)) = E f (x) − E(f (x))

I A more complex model has a higher variance!

I High variance can cause overfitting!

Ideally we want to optimize both of them.

Bias-Variance tradeoff 55 / 68

I When do we have high bias?

I We have high bias when the model (function) cannot model
the true data distribution well!
I This doesn’t depend on the training data size!
I Underfitting!
I When do we have high variance?
I We have high variance when there is a small amount of
training data and a very complex model!
I Overfitting!
I Variance decreases with larger training data, and increases
with more complicated classifiers!
Bias/Variance tradeoff 56 / 68
57 / 68

Take-home message
I High bias =⇒ high training and test errors!
I High variance =⇒ low training error, high test
errors!
58 / 68

Figure: Illustration of overfitting in regression. Here, λ controls the

regressor smoothness.
Summary 59 / 68

Goal of supervised learning

I A trained classifier has to be generalizable: it must be able to
work on other data than the training dataset
I Generalizable means also “works without overfitting”.

I The empirical error on the training set is a poor estimate of

the generalization error (expected error on new data)!
If the model is overfitting, the generalization error can be
arbitrarily large!
I We would like to estimate the generalization error on new
data, which we do not have!
Validation sets 61 / 68

I Choose the model that performs best on a validation set

separate from the training set!

I Because we have not used the validation data at any point

during training, the validation set can be considered “new
data” and the error on the validation set is an estimation of
the generalization error!
Validation hold out 62 / 68

The simplest approach consists in splitting the data Dn into:

1. a learning or training set Dn,train used to learn the machine fn ,
2. a validation or test set Dn,test used to estimate the risk of fn .

Algorithm 1 Validation hold out

Inputs: Dn data, (T , V) a partition of {1, . . . , n}.
1. Learn the machine with
Dn,train = {(Xi , Yi ), i ∈ T } =⇒ fn,train ;
2. Compute

b n (fn,train ) = 1
X
R ` (Yi , fn,train (Xi )) .
|V|
i∈V

Remark
ntrain = |T | and ntest should be large enough.
Model selection 63 / 68

I What if we want to choose among k models?!

I Train each model on the train set!
I Compute the prediction error of each model on the validation
set!
I Pick the model with the smallest prediction error on the
validation set!
I What is the generalization error?!
I We don’t know!!
I Validation data was used to select the model!
I We have “cheated” and looked at the validation data: it is not
a good proxy for new, unseen data any more!
Model selection 64 / 68

I Hence we need to set aside part of the data, the test set, that
remains untouched during the entire procedure and on which
we’ll estimate the generalization error!

I Model selection: pick the best model!

I Model assessment: estimate its prediction error on new data!
Validation sets 65 / 68

I How much data should go in each of the training, validation

and test sets?!
I How do we know that we have enough data to evaluate the
prediction and generalization errors?!
I Empirical evaluation with sample re-use!
I Cross-validation
I Bootstrap (random sampling with replacement)
Cross-validation 66 / 68

I Cut the training set in K separate folds!

I For each fold, train on the (K − 1) remaining folds!
K-fold cross validation 67 / 68

Algorithm 2 K-fold CV
Inputs: Dn data, K an integer;
1. Define a random partition {I1 , . . . , IK } of {1, . . . , n};
2. For a fixed λ, for k = 1, . . . , K
I Itrain = {1, . . . , n} \ Ik and Itest = Ik ;
(λ)
I Learn the machine with Dn,train = {(Xi , Yi ), i ∈ T } =⇒ fn,k ;
I Compute the test error
(λ) 1
P (λ)
Errtest (fn,k ) = n−|I k| i∈Ik ` Yi , fn,k (Xi ) .

3. Choose
K
1 X (λ)
λ̂(CV ) ∈ argminλ∈Λ Errtest (fn,k ).
K
k=1
K-fold CV 68 / 68

I K has to be chosen by the user. Usually K = 5 or 10.

I Advantage of this method over repeated random sub-sampling
(bootstrap) is that all observations are used for both training
and validation, and each observation is used for validation
exactly once.

Leave-one-out CV
I When K = n, we obtain the leave-one-out (LOO) cross
validation, since at each iteration exactly one instance is left
out of the training sample.
I In general, the leave-one-out error is very costly to compute,
since it requires training n times on samples of size n − 1, but
for some algorithms it admits a very efficient computation.
I Exercise: LOO-CV in least squares regression

Introduction to Machine Learning 2015
No ratings yet
Introduction to Machine Learning 2015
46 pages
CS-E4715: Supervised ML Overview
No ratings yet
CS-E4715: Supervised ML Overview
37 pages
Supervised Machine Learning Overview
No ratings yet
Supervised Machine Learning Overview
37 pages
Machine Learning Fundamentals Overview
No ratings yet
Machine Learning Fundamentals Overview
24 pages
Understanding Machine Learning Basics
No ratings yet
Understanding Machine Learning Basics
127 pages
DSA 5105: Intro to Machine Learning
No ratings yet
DSA 5105: Intro to Machine Learning
51 pages
Introduction to Deep Learning Concepts
No ratings yet
Introduction to Deep Learning Concepts
64 pages
Chapter - 01 - Introduction To ML
No ratings yet
Chapter - 01 - Introduction To ML
60 pages
Introduction to Machine Learning Concepts
No ratings yet
Introduction to Machine Learning Concepts
4 pages
Foundations of Machine Learning Overview
No ratings yet
Foundations of Machine Learning Overview
51 pages
Optimizing Feedforward Networks: Computational Systems Biology Deep Learning in The Life Sciences
No ratings yet
Optimizing Feedforward Networks: Computational Systems Biology Deep Learning in The Life Sciences
168 pages
Overview of Statistical Learning Theory
No ratings yet
Overview of Statistical Learning Theory
4 pages
Eml JHP Lec 01
No ratings yet
Eml JHP Lec 01
51 pages
Decision Tree Predictions in ML
No ratings yet
Decision Tree Predictions in ML
86 pages
Machine Learning Basics: Supervised Learning
No ratings yet
Machine Learning Basics: Supervised Learning
6 pages
Machine Learning Course Overview
No ratings yet
Machine Learning Course Overview
33 pages
Machine Learning Overview and Concepts
No ratings yet
Machine Learning Overview and Concepts
45 pages
DL Unit-1
No ratings yet
DL Unit-1
20 pages
Supervised Learning
No ratings yet
Supervised Learning
145 pages
Understanding Machine Learning Basics
No ratings yet
Understanding Machine Learning Basics
24 pages
Machine Learning Fundamentals Explained
No ratings yet
Machine Learning Fundamentals Explained
6 pages
Understanding Machine Learning Concepts
No ratings yet
Understanding Machine Learning Concepts
81 pages
Introduction to Machine Learning Concepts
No ratings yet
Introduction to Machine Learning Concepts
89 pages
Machine Learning Fundamentals Overview
No ratings yet
Machine Learning Fundamentals Overview
92 pages
BIM 40 AIML - Module3.3 - Machine Learning - Supervised Techniques Using Scikit Learn - v1
No ratings yet
BIM 40 AIML - Module3.3 - Machine Learning - Supervised Techniques Using Scikit Learn - v1
50 pages
Introduction to Machine Learning Concepts
100% (5)
Introduction to Machine Learning Concepts
134 pages
Statistical Learning Theory Course Overview
No ratings yet
Statistical Learning Theory Course Overview
24 pages
Introduction to Machine Learning Concepts
No ratings yet
Introduction to Machine Learning Concepts
60 pages
Supervised Learning in Machine Learning
No ratings yet
Supervised Learning in Machine Learning
20 pages
Understanding Machine Learning Concepts
No ratings yet
Understanding Machine Learning Concepts
85 pages
Introduction to Machine Learning Concepts
No ratings yet
Introduction to Machine Learning Concepts
17 pages
Understanding Machine Learning Basics
No ratings yet
Understanding Machine Learning Basics
48 pages
Comprehensive Guide to Machine Learning Models
No ratings yet
Comprehensive Guide to Machine Learning Models
59 pages
Machine Learning and AI Overview
No ratings yet
Machine Learning and AI Overview
38 pages
ML Notes
No ratings yet
ML Notes
25 pages
Introduction to Machine Learning Concepts
No ratings yet
Introduction to Machine Learning Concepts
103 pages
Supervised Learning: A Complete Guide
No ratings yet
Supervised Learning: A Complete Guide
50 pages
Data Science Engineering Overview
No ratings yet
Data Science Engineering Overview
53 pages
Introduction to Machine Learning Concepts
No ratings yet
Introduction to Machine Learning Concepts
15 pages
Simple Linear Regression Derivation Guide
No ratings yet
Simple Linear Regression Derivation Guide
8 pages
Introduction to Machine Learning Concepts
No ratings yet
Introduction to Machine Learning Concepts
70 pages
ML 1-5
No ratings yet
ML 1-5
16 pages
Machine Learning Algorithms Explained
No ratings yet
Machine Learning Algorithms Explained
77 pages
Week11 - COMP517 S2 2025
No ratings yet
Week11 - COMP517 S2 2025
50 pages
Machine Learning Overview and Applications
No ratings yet
Machine Learning Overview and Applications
80 pages
Supervised Learning: Linear Regression Guide
No ratings yet
Supervised Learning: Linear Regression Guide
9 pages
Machine Learning: Feature Scaling & Regression
No ratings yet
Machine Learning: Feature Scaling & Regression
33 pages
Machine Learning Fundamentals Explained
No ratings yet
Machine Learning Fundamentals Explained
24 pages
Introduction to Machine Learning Concepts
No ratings yet
Introduction to Machine Learning Concepts
19 pages
Deep Learning Fundamentals Explained
No ratings yet
Deep Learning Fundamentals Explained
56 pages
Machine Learning Concepts Explained
No ratings yet
Machine Learning Concepts Explained
34 pages
Introduction to Cognitive Science and Machine Learning
No ratings yet
Introduction to Cognitive Science and Machine Learning
14 pages
Machine Learning Fundamentals and Techniques
No ratings yet
Machine Learning Fundamentals and Techniques
20 pages
AIFinance 2a - Introduction To Supervised Learning
No ratings yet
AIFinance 2a - Introduction To Supervised Learning
42 pages
Machine Learning Course Overview
No ratings yet
Machine Learning Course Overview
60 pages
Machine Learning Course Syllabus
No ratings yet
Machine Learning Course Syllabus
56 pages
Machine Learning Error Types Explained
No ratings yet
Machine Learning Error Types Explained
21 pages
Machine Learning Techniques Overview
100% (1)
Machine Learning Techniques Overview
137 pages
Model Selection and Occam's Razor
No ratings yet
Model Selection and Occam's Razor
38 pages
COSC 6342 Machine Learning Homework 1
No ratings yet
COSC 6342 Machine Learning Homework 1
9 pages
Understanding Machine Learning Capacity
No ratings yet
Understanding Machine Learning Capacity
30 pages
Understanding Generalization Error in ML
No ratings yet
Understanding Generalization Error in ML
37 pages
Swarm Intelligence and PSO Insights
No ratings yet
Swarm Intelligence and PSO Insights
60 pages
Generalization in Selectivity Learning
100% (1)
Generalization in Selectivity Learning
14 pages
Introduction to Boosting Algorithms
No ratings yet
Introduction to Boosting Algorithms
14 pages
Machine Learning Fundamentals Overview
No ratings yet
Machine Learning Fundamentals Overview
151 pages
Deep Learning Practical Methodology Guide
No ratings yet
Deep Learning Practical Methodology Guide
19 pages
Data Mining & Business Analytics Insights
No ratings yet
Data Mining & Business Analytics Insights
4 pages
Kernel Classifiers: Theory & Algorithms
No ratings yet
Kernel Classifiers: Theory & Algorithms
382 pages
Statistical Learning Theory in ML
No ratings yet
Statistical Learning Theory in ML
35 pages
PAC Learning and VC Dimension in ML
No ratings yet
PAC Learning and VC Dimension in ML
19 pages
Data Science: Bias-Variance Tradeoff
No ratings yet
Data Science: Bias-Variance Tradeoff
20 pages
Machine Learning in Perovskite Discovery
No ratings yet
Machine Learning in Perovskite Discovery
18 pages
Introduction to AdaBoost Algorithm
No ratings yet
Introduction to AdaBoost Algorithm
14 pages
MLP Backpropagation Algorithm Overview
No ratings yet
MLP Backpropagation Algorithm Overview
34 pages
Deep Neural Networks with Noisy Labels
No ratings yet
Deep Neural Networks with Noisy Labels
13 pages
Machine Learning Classification Overview
No ratings yet
Machine Learning Classification Overview
98 pages
Machine Learning Concepts and Techniques
No ratings yet
Machine Learning Concepts and Techniques
62 pages
Loss Function Optimization in Neural Networks
100% (1)
Loss Function Optimization in Neural Networks
24 pages
Improved Random Forest for Traffic Classification
No ratings yet
Improved Random Forest for Traffic Classification
4 pages
Decision Tree Classification Overview
No ratings yet
Decision Tree Classification Overview
86 pages
Trace Fossils As Indicators of Sedimentary Environments Dirk Knaust and Richard G Bromley Eds Available All Format
No ratings yet
Trace Fossils As Indicators of Sedimentary Environments Dirk Knaust and Richard G Bromley Eds Available All Format
96 pages
PAC-Bayes and Generalization in Deep Learning
No ratings yet
PAC-Bayes and Generalization in Deep Learning
146 pages
Stability and Generalization in ML
No ratings yet
Stability and Generalization in ML
28 pages
SISSA TSDS PhD Admission Exam 2022
No ratings yet
SISSA TSDS PhD Admission Exam 2022
4 pages
Decision Trees in Python for ML
No ratings yet
Decision Trees in Python for ML
261 pages