Unit-I Machine Learning Basics

The document provides an overview of machine learning concepts, including definitions of tasks, performance measures, and types of learning such as supervised, unsupervised, and reinforcement learning. It discusses the importance of generalization, capacity, overfitting, and underfitting, along with techniques like regularization and cross-validation. Additionally, it touches on the no free lunch theorem and the role of hyperparameters in model training and evaluation.

Uploaded by

22h51a6710


Machine Learning Basics
Machine Learning

• A computer program is said to learn from experience 𝐸 with respect to some class of tasks 𝑇 and performance measure 𝑃, if its performance at tasks in 𝑇, as measured by 𝑃, improves with experience 𝐸 (Tom M. Mitchell, 1997)
• Example
 𝐸: the experience of playing thousands of
games
 𝑇: playing checkers game
 P: the fraction of games it wins against
human opponents
The task, T
• Tasks are usually described in terms of how the machine learning system should process an example $x \in \mathbb{R}^n$, where each entry $x_i$ is a feature
 Classification: learn $f: \mathbb{R}^n \to \{1, \dots, k\}$
o $y = f(x)$: assigns the input to the category with numerical code $y$
o Example: object recognition
 Classification with missing inputs: learn a
set of functions
o Example: medical diagnosis
The task, T
 Regression: learn $f: \mathbb{R}^n \to \mathbb{R}$
o Example: predict the claim amount for an insured person
 Transcription: unstructured
representation to discrete textual form
o Example: optical character recognition,
speech recognition
 Machine Translation: Sequence to sequence
o Example: translate English to French
The task, T
 Structured output: output is a vector or
data structure of values that are tightly
interrelated
o Example: parsing natural language
sentence into a tree that describes
grammatical structure by tagging nodes
of the tree as being verbs, nouns, etc.
o image segmentation: assigning a
pixel to a segment
o Image captioning
The task, T
 Anomaly detection: flag unusual or
atypical events
o Credit card fraud detection
 Synthesis and sampling: generate
examples that are similar to those in the
training data
o Generate textures for video games
o Speech synthesis
 Imputation of missing values: predict
the values of the missing entries
The task, T
 Denoising: $f: \tilde{x} \in \mathbb{R}^n \to x \in \mathbb{R}^n$
o Predict the clean example from the corrupted example
o $p(x \mid \tilde{x}) = ?$
 Density estimation or probability mass estimation
o $p_{model}(x): \mathbb{R}^n \to \mathbb{R}$ (x can be discrete or continuous)
o Example: $p_{model}(x_i \mid x_{-i})$
The performance measure, P
• Accuracy:
 The proportion of examples for which the model
produces the correct output
• Error rate (0-1 loss):
 The proportion of examples for which the model
produces incorrect output
• Average log-probability of some examples (for
density estimation)
• It is sometimes difficult to decide what should be measured
• It is sometimes impractical to measure an implicit performance metric
• Evaluate the performance measure using a
test set
The experience, E
• Supervised learning: $p(y \mid x)$
 Experience is a labeled dataset (or datapoints)
 Each datapoint has a label or target
• Unsupervised learning: $p(x)$
 Experience is an unlabeled dataset
 Clustering, learning probability distribution, denoising, etc.
• The line between supervised and unsupervised is often blurred
 $p(x) = \prod_{i=1}^{n} p(x_i \mid x_1, \dots, x_{i-1})$
 $p(y \mid x) = \dfrac{p(x, y)}{\sum_{y'} p(x, y')}$
The experience,
E
• Semi-Supervised learning:
 Some examples include supervision
targets, but others do not.
• Multi-instance learning:
A collection of examples is labeled as
containing an example of a class
• Reinforcement learning:
 Interaction with an environment
A dataset
• Design matrix X:
 Each row is an example
 Each column is a feature
 Example: iris dataset: $X \in \mathbb{R}^{150 \times 4}$
• Vector of labels y
 Example: 0 is a person, 1 is a car, 2 is a
cat, etc.
 The label can be a set of words
(e.g. transcription)
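Example: Linear regression
• Task: regression problem $\mathbb{R}^n \to \mathbb{R}$
 $\hat{y} = w^T x$
 Have a test set: $X^{(test)}, y^{(test)}$
• Performance measure:
 $MSE_{test} = \frac{1}{m} \sum_i \left(\hat{y}^{(test)} - y^{(test)}\right)_i^2 = \frac{1}{m} \left\| \hat{y}^{(test)} - y^{(test)} \right\|_2^2$
• Experience: $X^{(train)}, y^{(train)}$
 Minimize $MSE_{train}$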
Linear regression
• $\nabla_w MSE_{train} = 0$
• $\nabla_w \frac{1}{m} \left(X^{(train)} w - y^{(train)}\right)^T \left(X^{(train)} w - y^{(train)}\right) = 0$
• $X^{(train)T} X^{(train)} w = X^{(train)T} y^{(train)}$ (normal equations)
• We can solve for $\hat{y} = w^T x + b$ using a simple trick
• b is called the bias term (not to be confused with statistical bias, which will be discussed later)
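The normal equations can be tried out numerically. The sketch below is a minimal illustration, not the slides' own code: it assumes the "simple trick" is the common one of appending a constant feature equal to 1 so that the last weight plays the role of b, and the function names and synthetic data are hypothetical.

```python
import numpy as np

def fit_linear_regression(X, y):
    """Least-squares fit of y ≈ X w + b, i.e. a solution of the normal equations."""
    # Assumed trick: append a column of ones so the bias b becomes one more weight.
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
    # lstsq solves the same least-squares problem without forming an explicit inverse.
    w_aug, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
    return w_aug[:-1], w_aug[-1]

def mse(X, y, w, b):
    """Mean squared error of the predictions X w + b."""
    return float(np.mean((X @ w + b - y) ** 2))

# Hypothetical noise-free data generated from y = 2*x1 - 3*x2 + 0.5
rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 2))
y_train = X_train @ np.array([2.0, -3.0]) + 0.5

w, b = fit_linear_regression(X_train, y_train)
train_error = mse(X_train, y_train, w, b)
```

Because the data here are noise-free, the recovered weights should match the generating ones up to floating-point error; with noisy data the training MSE would stay above zero.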
Linear Regression

Figure 5.1
Capacity, overfitting
and underfitting
• Generalization: ability to perform well on
previously unobserved data
• i.i.d. assumptions:
 The examples in each dataset are independent from each other,
 and the train set and test set are identically distributed, drawn from the same probability distribution as each other.
 We call that shared underlying distribution the data generating distribution, denoted $p_{data}$
Capacity, overfitting
and underfitting
• For some fixed value w, the expected
training set error is exactly the same as the
expected test set error, because both
expectations are formed using the same
dataset sampling process.
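• In practice, w is chosen using the training set, so:
 expected test set error ≥ expected training set error
• We need the ability to:
 1. Make the training error small (failing at this is underfitting).
 2. Make the gap between training and test error small (failing at this is overfitting).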
Capacity, overfitting
and underfitting
• Underfitting occurs when the model is not
able to obtain a sufficiently low error value
on the training set.
• Overfitting occurs when the gap
between the training error and test
error is too large.
• Informally, a model’s capacity is its ability
to fit a wide variety of functions.
 Low capacity may cause underfitting
 High capacity may cause overfitting
Hypothesis space
• One way to control the capacity of a learning algorithm is by choosing its hypothesis space,
 the set of functions that the learning algorithm is allowed to select as being the solution.
 For example, linear regression has the set of all linear functions of its input as its hypothesis space
 To increase capacity, enlarge the hypothesis space (here the model is a polynomial of degree 9):
$\hat{y} = b + \sum_{i=1}^{9} w_i x^i$
Underfitting and
Overfitting in
Polynomial Estimation

[Figure panels: degree 1, degree 2, degree 9 (annotated "infinitely many")]
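The effect of polynomial degree on training error can be sketched in a few lines. This is an illustration under assumptions: the data are a hypothetical noisy quadratic sampled at 10 points, and the helper names are not from the slides.

```python
import numpy as np

def poly_features(x, degree):
    # Columns x^0, x^1, ..., x^degree; the x^0 column carries the bias b.
    return np.vander(x, degree + 1, increasing=True)

def train_mse(x, y, degree):
    """Training MSE of the least-squares polynomial fit of the given degree."""
    Phi = poly_features(x, degree)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return float(np.mean((Phi @ w - y) ** 2))

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 10)
y = x ** 2 + 0.05 * rng.normal(size=x.shape)   # hypothetical noisy quadratic

# Training error can only go down as the hypothesis space grows.
errors = {d: train_mse(x, y, d) for d in (1, 2, 9)}
```

With 10 points, the degree-9 model has enough capacity to pass through every training point, so its training error is essentially zero. That says nothing about its test error, which is where overfitting shows up.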
Capacity
• Representational capacity: finding the best
function within a family of functions
• Effective capacity may be less than the
representational capacity because it is hard to
find the best function
• Occam’s razor: Among competing
hypotheses that explain known observations
equally well, one should choose the
“simplest” one.
• Statistical learning theory shows that the
discrepancy between training error and
generalization error is bounded from above by a
quantity that grows as the model capacity grows
but shrinks as the number of training examples
increases
Generalization and
Capacity

Figure
High capacity: non-parametric models
• Nearest neighbor regression
 $\hat{y} = y_i$ where $i = \arg\min_i \left\| X_{i,:} - x \right\|_2^2$
Bayes error
• The error incurred by an oracle making predictions from the true distribution $p(x, y)$.
• This error is due to:
 There may still be noise in the distribution
 $y$ might be inherently stochastic
 $y$ may be deterministic but involves other variables besides $x$

[Figure: effect of training set size; a moderate amount of noise added to a degree-5 polynomial]
The no free lunch theorem
• Averaged over all possible data generating
distributions, every classification algorithm has
the same error rate when classifying previously
unobserved points.

• In some sense, no machine learning

algorithm is universally any better than
any other.

• The most sophisticated algorithm we can

conceive of has the same average performance
(over all possible tasks) as merely predicting
that every point belongs to the same class.
No free lunch
• The goal of machine learning research is not
to seek a universal learning algorithm or the
absolute best learning algorithm.
• Instead, our goal is to understand what kinds of distributions are relevant to the “real world”
 and what kinds of machine learning algorithms perform well on data drawn from the kinds of distributions we care about.
Regularization
• We can give a learning algorithm a preference for one solution in its hypothesis space over another.
 both functions are eligible, but one is preferred.
• Example: linear regression with weight decay:
$J(w) = MSE_{train} + \lambda w^T w$
• We are expressing a preference for the weights to have smaller L2 norm
 Larger $\lambda$ forces the weights to become smaller
Weight Decay

[Figure: the true function is quadratic, but here we use only models with degree 9]
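A small sketch of weight decay on degree-9 polynomial features follows. It assumes the criterion MSE_train + λ wᵀw and solves it in closed form; the data, the λ values and the function name are hypothetical, and for simplicity the constant (bias) column is penalized along with the other weights.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize (1/m) * ||X w - y||^2 + lam * w^T w in closed form."""
    m, n = X.shape
    # Setting the gradient to zero gives (X^T X / m + lam * I) w = X^T y / m.
    return np.linalg.solve(X.T @ X / m + lam * np.eye(n), X.T @ y / m)

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 10)
y = x ** 2 + 0.05 * rng.normal(size=x.shape)   # hypothetical noisy quadratic
X = np.vander(x, 10, increasing=True)          # degree-9 features plus a constant column

lams = (1e-6, 1e-3, 1.0)
# L2 norm of the fitted weights for each lambda, from weakest to strongest decay.
norms = [float(np.linalg.norm(ridge_fit(X, y, lam))) for lam in lams]
```

The weight norm shrinks as λ grows, which is the preference that the penalty term expresses.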
Regularization
• Regularization is any modification we
make to a learning algorithm that is
intended to reduce its generalization
error but not its training error.
• Regularization is one of the central concerns
of the field of machine learning
Hyperparameters
• The values of hyperparameters are not
adapted by the learning algorithm itself
• Polynomial regression hyperparameters:
 Degree of the polynomial
 𝜆 (control of the weight decay)
• To set the hyperparameters, we need a
validation set
 Typically, one uses about 80% of the
training data for training and 20% for
validation.
Cross validation
• It is used to estimate generalization error of a
learning algorithm A when the given dataset D
is too small for a simple train/test or train/valid
split to yield accurate estimation of
generalization error

• In k-fold cross-validation a partition of the dataset is formed by splitting it into k non-overlapping subsets.
 The test error may then be estimated by taking the average test error across k trials.
On trial i, the i-th subset of the data is used as the test set and the rest of the data is used as the training set.
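The k-fold procedure described above can be sketched as follows. This is a minimal illustration: the `fit` and `predict` callables, the squared-error metric and the synthetic data are assumptions chosen for the example, not part of the slides.

```python
import numpy as np

def k_fold_mse(X, y, k, fit, predict, seed=0):
    """Estimate generalization error as the average test MSE over k trials."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)   # k non-overlapping subsets
    errors = []
    for i in range(k):
        test_idx = folds[i]                               # i-th subset is the test set
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train_idx], y[train_idx])
        residual = predict(model, X[test_idx]) - y[test_idx]
        errors.append(float(np.mean(residual ** 2)))
    return float(np.mean(errors)), errors

# Hypothetical learner: ordinary least squares without a bias term.
fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
predict = lambda w, X: X @ w

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=40)
cv_error, fold_errors = k_fold_mse(X, y, 5, fit, predict)
```

Every example is used for testing exactly once, which is why this estimate is preferred over a single split when the dataset is small.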
Cross validation
Point estimation
• Provides the single “best” prediction of some
quantity of interest
• can be a single parameter or a vector of
parameters in some parametric model
• A point estimator or statistic is any function of the data (assuming i.i.d. samples):
$\hat{\theta}_m = g(x^{(1)}, x^{(2)}, \dots, x^{(m)})$
• Frequentist perspective: we assume that
the true parameter value θ is fixed but
unknown
• Function estimation is really just the same as
estimating a parameter θ; the function estimator
is simply a point estimator in function space.
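Statistical bias
• $bias(\hat{\theta}_m) = \mathbb{E}(\hat{\theta}_m) - \theta$, where the expectation is over the data $x^{(1)}, x^{(2)}, \dots, x^{(m)}$
• An estimator $\hat{\theta}_m$ is said to be unbiased if $bias(\hat{\theta}_m) = 0$, which implies that $\mathbb{E}(\hat{\theta}_m) = \theta$.
• An estimator $\hat{\theta}_m$ is said to be asymptotically unbiased if $\lim_{m \to \infty} bias(\hat{\theta}_m) = 0$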
Example: Bernoulli distribution
• Consider a set of samples $\{x^{(1)}, x^{(2)}, \dots, x^{(m)}\}$ that are i.i.d. according to a Bernoulli distribution with mean θ:
 $P(x^{(i)}; \theta) = \theta^{x^{(i)}} (1 - \theta)^{(1 - x^{(i)})}$
 A common estimator of θ is: $\hat{\theta}_m = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}$
Example: Gaussian distribution - mean
• Consider a set of samples $\{x^{(1)}, x^{(2)}, \dots, x^{(m)}\}$ that are i.i.d. according to a Gaussian distribution: $p(x^{(i)}) = \mathcal{N}(x^{(i)}; \mu, \sigma^2)$
• $\hat{\mu}_m = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}$
• The sample mean is an unbiased estimator of the Gaussian mean parameter.
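Example: Gaussian distribution - variance
• Biased estimator: $\hat{\sigma}_m^2 = \frac{1}{m} \sum_{i=1}^{m} \left(x^{(i)} - \hat{\mu}_m\right)^2$
• The sample variance is a biased estimator.
• Unbiased estimator: $\tilde{\sigma}_m^2 = \frac{1}{m - 1} \sum_{i=1}^{m} \left(x^{(i)} - \hat{\mu}_m\right)^2$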
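The difference between the two variance estimators can be checked by simulation. The sketch below is illustrative only: the sample size, the true variance and the number of trials are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
m, sigma2, trials = 5, 4.0, 200_000

# Each row is one dataset of m i.i.d. Gaussian samples.
samples = rng.normal(loc=1.0, scale=np.sqrt(sigma2), size=(trials, m))

biased = samples.var(axis=1, ddof=0)     # divides by m
unbiased = samples.var(axis=1, ddof=1)   # divides by m - 1

# Averaging over many datasets approximates the expectation of each estimator.
mean_biased = float(biased.mean())       # expected value is (m - 1) / m * sigma2
mean_unbiased = float(unbiased.mean())   # expected value is sigma2
```

The two estimators differ only by the factor m / (m - 1), so the bias of the first one vanishes as m grows: it is asymptotically unbiased.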
Variance
• How much do we expect an estimator to vary as a function of the data sample?
• $Var(\hat{\theta}) = ?$
• Standard error: $SE(\hat{\theta}) = ?$
• Just as we might like an estimator to
exhibit low bias we would also like it to
have relatively low variance.
Example: Bernoulli
distribution

• The variance of the estimator decreases as a

function of m, the number of examples in the
dataset.
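For the Bernoulli mean estimator, both the bias and the variance follow directly from the i.i.d. assumption. The short derivation below is a worked sketch using the standard facts that a Bernoulli variable has mean θ and variance θ(1 − θ).

```latex
\begin{align*}
\mathbb{E}[\hat{\theta}_m]
  &= \frac{1}{m} \sum_{i=1}^{m} \mathbb{E}\left[x^{(i)}\right] = \theta
  \quad\Rightarrow\quad \mathrm{bias}(\hat{\theta}_m) = 0, \\
\mathrm{Var}(\hat{\theta}_m)
  &= \frac{1}{m^2} \sum_{i=1}^{m} \mathrm{Var}\left(x^{(i)}\right)
   = \frac{1}{m}\,\theta(1 - \theta), \\
\mathrm{SE}(\hat{\theta}_m)
  &= \sqrt{\frac{\theta(1 - \theta)}{m}}.
\end{align*}
```

The 1/m factor is what makes the variance decrease as the number of examples grows.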
Trading off Bias and Variance

Figure (Goodfellow
2016)
Bias vs. variance
Maximum Likelihood
Estimation
• Rather than guessing that some function
might make a good estimator and then
analyzing its bias and variance, we would
like to have some principle from which we
can derive specific functions that are good
estimators for different models.

Maximum Likelihood Estimation
• Divide by m: the maximizer stays the same, and the objective becomes an expectation over the empirical distribution $\hat{p}_{data}$.
• MLE minimizes the dissimilarity between the empirical distribution $\hat{p}_{data}$ and the model distribution; the part of that dissimilarity that involves only $\hat{p}_{data}$ is independent of θ.
• Any loss consisting of a negative log-likelihood is a cross-entropy between the empirical distribution defined by the training set and the probability distribution defined by the model.
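The equations for these steps did not survive extraction. The following is a hedged reconstruction of the standard argument, written with the symbols used elsewhere in these slides ($p_{model}$, $\hat{p}_{data}$, m examples); it is a sketch of the usual derivation rather than a copy of the original slide.

```latex
\begin{align*}
\theta_{ML}
  &= \arg\max_{\theta} \prod_{i=1}^{m} p_{model}\left(x^{(i)}; \theta\right)
   = \arg\max_{\theta} \sum_{i=1}^{m} \log p_{model}\left(x^{(i)}; \theta\right) \\
  &= \arg\max_{\theta}\; \mathbb{E}_{x \sim \hat{p}_{data}} \log p_{model}(x; \theta)
  \qquad \text{(divide by } m \text{; same maximizer)} \\[4pt]
D_{KL}\left(\hat{p}_{data} \,\|\, p_{model}\right)
  &= \mathbb{E}_{x \sim \hat{p}_{data}}
     \left[ \log \hat{p}_{data}(x) - \log p_{model}(x; \theta) \right]
\end{align*}
```

The first term inside the KL divergence does not depend on θ, so minimizing the divergence is the same as minimizing $-\mathbb{E}_{x \sim \hat{p}_{data}} \log p_{model}(x; \theta)$, which is the cross-entropy mentioned above.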
Conditional log-likelihood
• Supervised learning
• Instead of producing a single prediction $\hat{y}$, we now think of the model as producing a conditional distribution $p(y \mid x)$.
• For linear regression, take $p(y \mid x) = \mathcal{N}(y; \hat{y}(x; w), \sigma^2)$, with the output of linear regression as the mean and the variance $\sigma^2$ fixed.
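Conditional log-likelihood for linear regression
• To maximize: $\sum_{i=1}^{m} \log p(y^{(i)} \mid x^{(i)}; \theta) = -m \log \sigma - \frac{m}{2} \log(2\pi) - \sum_{i=1}^{m} \frac{\left(\hat{y}^{(i)} - y^{(i)}\right)^2}{2\sigma^2}$
 $y^{(i)}$ comes from the dataset; $\hat{y}^{(i)}$ is the output of linear regression, used as the mean
• Same as minimizing: $MSE_{train} = \frac{1}{m} \sum_{i=1}^{m} \left(\hat{y}^{(i)} - y^{(i)}\right)^2$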
Properties of maximum likelihood
• The best estimator asymptotically, as the number of examples $m \to \infty$, in terms of its rate of convergence as $m$ increases.
• Consistency: $\mathrm{plim}_{m \to \infty}\, \hat{\theta}_m = \theta$
 Under appropriate conditions
o The true distribution must lie within the
model family
o The true distribution must correspond to
exactly one value of θ
Bayesian statistics
• the true parameter 𝜃 is unknown or
uncertain and thus is represented as a
random variable.

• It is different than the single point

estimate (frequentist statistics)
Maximum A Posteriori
(MAP) Estimation
• Add the prior: $\theta_{MAP} = \arg\max_{\theta} p(\theta \mid x) = \arg\max_{\theta} \log p(x \mid \theta) + \log p(\theta)$
• MAP Bayesian inference has the advantage of leveraging information that is brought by the prior and cannot be found in the training data.
 If this prior is given by $\mathcal{N}(w; 0, \frac{1}{\lambda} I^2)$, then the log-prior term is proportional to the familiar $\lambda w^T w$ weight decay penalty

Supervised learning examples
• Logistic regression
• Support vector machines
• Decision trees
Logistic regression

• There is no closed-form solution for the optimal weights.
• Instead, we must search for them by maximizing the log-likelihood.
• We can do this by minimizing the negative log-likelihood (NLL) using gradient descent.
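A minimal sketch of this search is shown below. It assumes the usual sigmoid model p(y = 1 | x) = σ(wᵀx) with labels in {0, 1}; the learning rate, the step count and the synthetic data are arbitrary choices made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(w, X, y):
    """Mean negative log-likelihood of labels y in {0, 1} under the sigmoid model."""
    p = sigmoid(X @ w)
    eps = 1e-12   # keeps log() finite when p is numerically 0 or 1
    return float(-np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)))

def fit_logistic(X, y, lr=0.5, steps=500):
    """Plain gradient descent on the mean NLL, starting from w = 0."""
    w = np.zeros(X.shape[1])
    history = []
    for _ in range(steps):
        grad = X.T @ (sigmoid(X @ w) - y) / len(y)   # gradient of the mean NLL
        w -= lr * grad
        history.append(nll(w, X, y))
    return w, history

# Hypothetical two-class data: the positive class is where x1 > x2.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] - X[:, 1] > 0).astype(float)

w, history = fit_logistic(X, y)
accuracy = float(np.mean((sigmoid(X @ w) > 0.5) == (y == 1)))
```

The NLL is convex in w, so gradient descent with a small enough step size decreases it at every iteration.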
Support vector machines
• Introduced by Vapnik in the 90s
• Widest street approach
• Decision rule
 Which side of the street?
o $w \cdot x + b \ge 0 \Rightarrow +$
o $w \cdot x + b < 0 \Rightarrow -$
 w is perpendicular to the median line of the street, but which w?
o We add the following scaling constraints: $w \cdot x_+ + b \ge 1$ and $w \cdot x_- + b \le -1$
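SVM
• $y_i = +1$ for + samples
• $y_i = -1$ for – samples
• $y_i (w \cdot x_i + b) \ge 1$: same equation for $x_+$ and $x_-$
• $y_i (w \cdot x_i + b) - 1 \ge 0$
• Constraints: $y_i (w \cdot x_i + b) - 1 = 0$ for $x_i$ in the gutter
• $width = (x_+ - x_-) \cdot \frac{w}{\|w\|} = \frac{(1 - b) - (-(1 + b))}{\|w\|} = \frac{2}{\|w\|}$
• Maximize the width of the street: $\max \frac{2}{\|w\|} \Rightarrow \max \frac{1}{\|w\|} \Rightarrow \min \frac{1}{2} \|w\|^2$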
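Lagrange multipliers
• $L = \frac{1}{2} \|w\|^2 - \sum_i \alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right]$, with $\alpha_i \ge 0$
• $\nabla_w L = w - \sum_i \alpha_i y_i x_i = 0 \Rightarrow w = \sum_i \alpha_i y_i x_i$
• $\frac{\partial L}{\partial b} = -\sum_i \alpha_i y_i = 0 \Rightarrow \sum_i \alpha_i y_i = 0$
• Replacing in $L$:
 $L = \frac{1}{2} \left(\sum_i \alpha_i y_i x_i\right) \cdot \left(\sum_j \alpha_j y_j x_j\right) - \left(\sum_i \alpha_i y_i x_i\right) \cdot \left(\sum_j \alpha_j y_j x_j\right) - \sum_i \alpha_i y_i b + \sum_i \alpha_i$
 $L = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j$
• Numerical analysts will solve this problem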
Kernel trick
• $\alpha_i \ne 0$ only for points in the gutter, also known as support vectors
• Decision for $x_j$: $w \cdot x_j + b = \sum_i \alpha_i y_i \, x_i \cdot x_j + b$
• We can change the representation of $x_i$ to be $\phi(x_i)$
• We do not need direct access to $\phi(x_i)$
 Instead we define $k(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$
 The function k is called kernel
Advantages of the kernel trick
• The kernel trick is powerful for two reasons.
 it allows us to learn models that are nonlinear as a function of $x$ using convex optimization techniques that are guaranteed to converge efficiently
 the kernel function often admits an implementation that is significantly more computationally efficient than naively constructing two $\phi(x)$ vectors and explicitly taking their dot product.
o In some cases, $\phi(x)$ can even be infinite dimensional
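The efficiency argument can be illustrated with a kernel whose feature map is easy to write down. The degree-2 polynomial kernel used here is an assumption chosen for the example (the slides do not name it), as are the function names.

```python
import numpy as np

def phi(x):
    """Explicit feature map: all pairwise products x_a * x_b (n^2 features)."""
    return np.outer(x, x).ravel()

def poly_kernel(x, z):
    """k(x, z) = (x . z)^2, computed without ever constructing phi."""
    return float(np.dot(x, z) ** 2)

rng = np.random.default_rng(0)
x, z = rng.normal(size=5), rng.normal(size=5)

explicit = float(np.dot(phi(x), phi(z)))   # builds two 25-dimensional vectors
implicit = poly_kernel(x, z)               # one 5-dimensional dot product
```

Both routes give the same number because the sum over a and b of x_a x_b z_a z_b equals (x · z)², but the kernel needs O(n) work instead of O(n²).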
Gaussian kernel
• Gaussian kernel
 Aka RBF: Radial Basis Function

• We can think of the Gaussian kernel as performing a

kind of template matching.
 A training example x associated with training
label y becomes a template for class y.
 When a test point x’ is near x according to
Euclidean distance, the Gaussian kernel has a
large response, indicating that x’ is very
similar to the x template.
 The model then puts a large weight on the
associated training label y.
• Overall, the prediction will combine many such
training labels weighted by the similarity of the
corresponding training examples.
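The template-matching view can be sketched with a similarity-weighted average of training labels. This is a deliberately simplified illustration and not the SVM decision function: it assumes the common unnormalized form k(u, v) = exp(−‖u − v‖² / (2σ²)), and the function names and data are hypothetical.

```python
import numpy as np

def gaussian_kernel(u, v, sigma=1.0):
    """Large when u and v are close in Euclidean distance, near zero otherwise."""
    return np.exp(-np.sum((u - v) ** 2, axis=-1) / (2.0 * sigma ** 2))

def kernel_prediction(X_train, y_train, x_new, sigma=1.0):
    """Average of training labels weighted by each template's similarity to x_new."""
    weights = gaussian_kernel(X_train, x_new, sigma)
    return float(weights @ y_train / weights.sum())

# Two hypothetical templates: class 0 at the origin, class 1 at (5, 5).
X_train = np.array([[0.0, 0.0], [5.0, 5.0]])
y_train = np.array([0.0, 1.0])

near_first = kernel_prediction(X_train, y_train, np.array([0.1, 0.0]))
near_second = kernel_prediction(X_train, y_train, np.array([5.0, 4.9]))
```

A test point close to a template receives almost all of its weight from that template's label, which is the behaviour the bullets above describe.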
SVM limitations
• High computational cost of training when the dataset is large.
• The current deep learning renaissance began when Hinton et al. (2006) demonstrated that a neural network could outperform the RBF kernel SVM on the MNIST benchmark.
Watch (SVM)
• https://[Link]/watch?v=_PwhiWxHK8o
K-Nearest Neighbors
• There is not even really a training stage or
learning process.
 We find the k-nearest neighbors to x in the
training data X. We then return the average
of the corresponding y values in the
training set
• In the case of classification, we can
average over one-hot code vectors
 We can then interpret the average over
these one-hot codes as giving a
probability distribution over classes
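The classification variant can be sketched as below. It is a minimal illustration assuming squared Euclidean distance and integer class labels; the function name and the toy data are hypothetical.

```python
import numpy as np

def knn_predict_proba(X_train, y_train, x, k, num_classes):
    """Average the one-hot codes of the k nearest training examples to x."""
    dists = np.sum((X_train - x) ** 2, axis=1)      # squared Euclidean distances
    nearest = np.argsort(dists)[:k]                 # indices of the k closest examples
    one_hot = np.eye(num_classes)[y_train[nearest]]
    return one_hot.mean(axis=0)                     # entries sum to 1

X_train = np.array([[0.0], [0.1], [0.2], [1.0], [1.1]])
y_train = np.array([0, 0, 0, 1, 1])

p_a = knn_predict_proba(X_train, y_train, np.array([0.15]), k=3, num_classes=2)
p_b = knn_predict_proba(X_train, y_train, np.array([0.7]), k=3, num_classes=2)
```

There is no fitting step: all the work happens at prediction time, which matches the remark that there is not really a training stage. For regression, the same neighbours' y values would be averaged directly.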
K-NN limitations
• It cannot learn that one feature is more discriminative than another.
 For example, imagine we have a regression task with $x \in \mathbb{R}^{100}$ drawn from an isotropic Gaussian distribution,
 but only a single variable is relevant to the output: $y = x_1$
 Nearest neighbor regression will not be able to detect this simple pattern.
Decision Trees
[Figure: how a decision tree might divide ℝ².]
• Each leaf requires at least one training example to define, so it is not possible for the decision tree to learn a function that has more local maxima than the number of training examples.
Decision trees
limitations
• It struggles to solve some problems that are easy even for logistic regression.
• If we have a two-class problem and the positive class occurs wherever $x_2 > x_1$
 the decision boundary is not axis-aligned.
 The decision tree will thus need to approximate the decision boundary with many splits, implementing a step function that constantly walks back and forth across the true decision function.
Unsupervised learning
• Density estimation
• Learning to draw samples from a distribution
• Learning to denoise data from some
distribution
• Clustering the data into groups of related
examples
• Simpler representation:
 Lower-dimensional representation
 Sparse representation
 Independent representation
Principal Components Analysis
• lower dimensionality
• no linear correlation
No linear correlation
• Let us consider the $m \times n$-dimensional design matrix $X$.
• We will assume that the data has a mean of zero, $\mathbb{E}[x] = 0$.
 If this is not the case, the data can easily be centered by subtracting the mean from all examples in a preprocessing step.
• The unbiased sample covariance matrix associated with X is given by:
 $Var[x] = \frac{1}{m - 1} X^T X$
• PCA finds $z = W^T x$ where $Var[z]$ is diagonal.
Decorrelation
• PCA finds $z = W^T x$ where $Var[z]$ is diagonal.
 Remember that $X^T X = W \Lambda W^T$
• $Var[z] = \frac{1}{m - 1} Z^T Z = \frac{1}{m - 1} W^T X^T X W = \frac{1}{m - 1} W^T W \Lambda W^T W = \frac{1}{m - 1} \Lambda$, since $W^T W = I$
• PCA disentangles the unknown factors of
variation underlying the data.
 this disentangling takes the form of finding a
rotation of the input space that aligns the
principal axes of variance with the basis of the
new representation space associated with z.
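The decorrelation property can be checked numerically. The sketch below assumes PCA is computed from the eigendecomposition XᵀX = WΛWᵀ of the centered design matrix; the mixing matrix used to create correlated data is a hypothetical choice.

```python
import numpy as np

def pca(X):
    """Return the rotated data Z (rows are z = W^T x) and the rotation W."""
    X_centered = X - X.mean(axis=0)                 # enforce the zero-mean assumption
    eigvals, W = np.linalg.eigh(X_centered.T @ X_centered)
    order = np.argsort(eigvals)[::-1]               # largest variance first
    W = W[:, order]
    return X_centered @ W, W

# Hypothetical correlated data: independent Gaussians mixed by a fixed matrix.
rng = np.random.default_rng(0)
A = np.array([[2.0, 0.0], [1.5, 0.5]])
X = rng.normal(size=(500, 2)) @ A.T

Z, W = pca(X)
cov_z = Z.T @ Z / (len(Z) - 1)   # unbiased sample covariance of z
```

The off-diagonal entries of `cov_z` vanish up to floating-point error, and W is orthogonal, so the transformation is a rotation of the input space.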
k-means clustering
• Divides the training set into k different
clusters of examples

• Provides a k-dimensional one-hot code vector $h$ representing an input $x$.
 If $x$ belongs to cluster $i$, then $h_i = 1$ and all other entries of the representation $h$ are zero.
 This is an example of a sparse
representation
k-means algorithm
• initialize k different centroids $\{\mu^{(1)}, \dots, \mu^{(k)}\}$ to different values,
• then alternate between two different steps until convergence:
 In one step, each training example is assigned to cluster $i$, where $i$ is the index of the nearest centroid $\mu^{(i)}$.
 In the other step, each centroid $\mu^{(i)}$ is updated to the mean of all training examples $x^{(j)}$ assigned to cluster $i$.
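The two alternating steps can be sketched as follows. This is a minimal illustration: initializing the centroids from randomly chosen data points and keeping an empty cluster's centroid unchanged are assumptions made for the example, not choices stated in the slides.

```python
import numpy as np

def k_means(X, k, num_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Assumed initialization: k distinct training examples.
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(num_iters):
        # Step 1: assign each example to the index of its nearest centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        # Step 2: move each centroid to the mean of the examples assigned to it.
        new_centroids = np.array([
            X[assign == i].mean(axis=0) if np.any(assign == i) else centroids[i]
            for i in range(k)
        ])
        if np.allclose(new_centroids, centroids):   # converged
            break
        centroids = new_centroids
    return centroids, assign

# Two hypothetical well-separated blobs around (0, 0) and (5, 5).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, size=(20, 2)),
               rng.normal(5.0, 0.1, size=(20, 2))])

centroids, assign = k_means(X, k=2)
h = np.eye(2)[assign]   # one-hot code vector for every example
```

Each row of `h` is the sparse one-hot representation of the corresponding input.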
How to evaluate clustering?
• Suppose we have images of
 Red trucks, gray trucks
 Red cars, gray cars
• One clustering algorithm may find a cluster of cars and a cluster of trucks
• Another learning algorithm may find a cluster of red vehicles and a cluster of gray vehicles
• A third algorithm may find 4 clusters …
 Red cars' similarity to gray trucks is the same as their similarity to gray cars!
• A distributed representation may have two attributes for each vehicle
Building a machine learning algorithm
• Simple recipe
 A dataset
 A cost function (e.g. MSE, negative log-
likelihood)
 A model (e.g. linear, non-linear)
 An optimization procedure (closed form,
gradient descent, stochastic gradient
descent … )
Exercise
• Link equations and definitions
Stochastic gradient
descent

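• Computing the gradient over the full training set is O(m) in the number of examples m
• Instead, sample a mini-batch and estimate the gradient from it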
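A minimal sketch of mini-batch stochastic gradient descent is given below, applied to the MSE cost of linear regression. The batch size, learning rate, step count and synthetic data are assumptions chosen for the example.

```python
import numpy as np

def sgd_linear_regression(X, y, batch_size=16, lr=0.05, steps=2000, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        # Sample a mini-batch: the cost of one step no longer depends on m.
        idx = rng.choice(len(X), size=batch_size, replace=False)
        Xb, yb = X[idx], y[idx]
        grad = 2.0 * Xb.T @ (Xb @ w - yb) / batch_size   # gradient of the mini-batch MSE
        w -= lr * grad
    return w

# Hypothetical noise-free linear data with known weights.
rng = np.random.default_rng(1)
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(1000, 3))
y = X @ w_true

w = sgd_linear_regression(X, y)
```

Each update touches only `batch_size` examples, so the cost per step stays constant even if the training set grows to millions of examples.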
Challenges motivating
deep learning
• Simple machine learning algorithms work
very well on a wide variety of important
problems.
• However, they have not succeeded in
solving the central problems in AI, such as
recognizing speech or recognizing objects.
• Generalizing to new examples becomes exponentially more difficult when working with high-dimensional data,
• Such data also often impose high
computational costs.
Curse of Dimensionality
• Many machine learning problems become exceedingly difficult when the number of dimensions in the data is high.
[Figure panels: one variable, 10 regions of interest; two variables, 10 regions of interest each; three variables, 10 regions of interest each]
• For $d$ dimensions and $v$ values to be distinguished along each axis, we seem to need $O(v^d)$ regions
Smoothness prior (aka local consistency)
• the function we learn should not change very much within a small region.
 KNN
 RBF kernel
 Decision trees
• to distinguish $O(k)$ regions in input space, all of these methods require $O(k)$ examples
[Figure: Voronoi diagram]
Smoothness prior is
not enough!
• Is there a way to represent a complex
function that has many more regions to be
distinguished than the number of training
examples?
 A checkerboard contains many
variations but there is a simple
structure to them.
• The answer is Yes:
 If we introduce some dependencies
between the regions
 via additional assumptions about the
underlying data generating distribution
Deep learning solution
• Deep learning assumes that the data was
generated by the composition of factors or
features, potentially at multiple levels in a
hierarchy
• These apparently mild assumptions allow an
exponential gain in the relationship between
the number of examples and the number of
regions that can be distinguished.
• Deep and distributed representations counter the
curse of dimensionality.
Manifold
• A manifold is a connected set of points
 that can be approximated well by considering
only a small number of degrees of freedom,
or dimensions, embedded in a higher-
dimensional space.
• A manifold is a set of points, associated
with a neighborhood around each
point.
 Transformations can be applied to move
on the manifold from one position to a
neighboring one.
• From any given point, the manifold locally appears to be a Euclidean space.
 In everyday life, we experience the surface
of the world as a 2-D plane, but it is in fact
a spherical manifold in 3-D space.
Manifold

[Figure: one-dimensional manifold embedded in two-dimensional space.]
Manifold
• we allow the dimensionality of the manifold
to vary from one point to another.
 This often happens when a manifold
intersects itself.
 For example, a figure eight is a manifold
that has a single dimension in most places
but two dimensions at the intersection at
the center.
Manifold learning
• Manifold learning algorithms assume that most of $\mathbb{R}^n$ consists of invalid inputs,
 interesting inputs occur only along a collection of manifolds
 Interesting variations in the output of the
learned function occurring only along
directions that lie on the manifold,
 or with interesting variations happening
only when we move from one manifold to
another.
the key assumption
Arguments supporting the manifold hypothesis
• The probability distribution over images, text strings, and sounds that occur in real life is highly concentrated.
[Figure: uniformly sampled images]
Arguments supporting the manifold hypothesis
• Examples we encounter are connected to each other by applying transformations to traverse the manifold.
[Figure: QMUL]
Conclusion
• The manifold assumption is at least
approximately correct.

• This concludes part I of the course

• You are now prepared to embark upon your
study of deep learning.

Congratulations

Machine Learning Concepts Overview
No ratings yet
Machine Learning Concepts Overview
85 pages
Sparse Modeling in Machine Learning
100% (1)
Sparse Modeling in Machine Learning
61 pages
DL Unit-1
No ratings yet
DL Unit-1
20 pages
Yoshua Bengio's Salary Insights
No ratings yet
Yoshua Bengio's Salary Insights
100 pages
Machine Learning Fundamentals Explained
No ratings yet
Machine Learning Fundamentals Explained
24 pages
Deep Learning and Machine Learning Basics
No ratings yet
Deep Learning and Machine Learning Basics
63 pages
DL Unit 1
No ratings yet
DL Unit 1
54 pages
Understanding Machine Learning Basics
No ratings yet
Understanding Machine Learning Basics
21 pages
Optimizing Feedforward Networks: Computational Systems Biology Deep Learning in The Life Sciences
No ratings yet
Optimizing Feedforward Networks: Computational Systems Biology Deep Learning in The Life Sciences
168 pages
Machine Learning Techniques Overview
No ratings yet
Machine Learning Techniques Overview
145 pages
Understanding Machine Learning Concepts
No ratings yet
Understanding Machine Learning Concepts
117 pages
Supervised Learning
No ratings yet
Supervised Learning
145 pages
Machine Learning Fundamentals Explained
No ratings yet
Machine Learning Fundamentals Explained
52 pages
Deep Learning Introduction and Algorithms
No ratings yet
Deep Learning Introduction and Algorithms
24 pages
DL Unit Ii
No ratings yet
DL Unit Ii
8 pages
Types of Machine Learning Explained
No ratings yet
Types of Machine Learning Explained
50 pages
Machine Learning Concepts Overview
No ratings yet
Machine Learning Concepts Overview
284 pages
DSA 5105: Intro to Machine Learning
No ratings yet
DSA 5105: Intro to Machine Learning
51 pages
ML Complete Study Guide With AKTU Important Questions
No ratings yet
ML Complete Study Guide With AKTU Important Questions
47 pages
Understanding Machine Learning Basics
No ratings yet
Understanding Machine Learning Basics
64 pages
Introduction to Machine Learning Concepts
No ratings yet
Introduction to Machine Learning Concepts
125 pages
Machine Learning Overview for COMS 4771
No ratings yet
Machine Learning Overview for COMS 4771
17 pages
Supervised Learning and Classification Methods
No ratings yet
Supervised Learning and Classification Methods
25 pages
Introduction to Machine Learning Concepts
No ratings yet
Introduction to Machine Learning Concepts
72 pages
Introduction to Machine Learning Concepts
No ratings yet
Introduction to Machine Learning Concepts
128 pages
Understanding Linear Regression in ML
No ratings yet
Understanding Linear Regression in ML
37 pages
Introduction to Cognitive Science and Machine Learning
No ratings yet
Introduction to Cognitive Science and Machine Learning
14 pages
Understanding Machine Learning Basics
No ratings yet
Understanding Machine Learning Basics
48 pages
Machine Learning Types and Techniques
No ratings yet
Machine Learning Types and Techniques
63 pages
Machine Learning Basics and Tasks
No ratings yet
Machine Learning Basics and Tasks
16 pages
CPR261 Lecture 8 - Loss Function and SVM
No ratings yet
CPR261 Lecture 8 - Loss Function and SVM
47 pages
Machine Learning Fundamentals Overview
No ratings yet
Machine Learning Fundamentals Overview
24 pages
Lect 4
No ratings yet
Lect 4
37 pages
Machine Learning Basics: Supervised Learning
No ratings yet
Machine Learning Basics: Supervised Learning
6 pages
Introduction to Machine Learning Concepts
No ratings yet
Introduction to Machine Learning Concepts
60 pages
Foundations of Machine Learning Overview
No ratings yet
Foundations of Machine Learning Overview
51 pages
Eml JHP Lec 01
No ratings yet
Eml JHP Lec 01
51 pages
Understanding Machine Learning Concepts
No ratings yet
Understanding Machine Learning Concepts
16 pages
Linier & Logistic
No ratings yet
Linier & Logistic
15 pages
Introduction to Supervised Learning
No ratings yet
Introduction to Supervised Learning
37 pages
Introduction to Machine Learning Concepts
No ratings yet
Introduction to Machine Learning Concepts
15 pages
Understanding Bias, Variance, and Estimators
100% (2)
Understanding Bias, Variance, and Estimators
79 pages
Understanding Machine Learning Basics
No ratings yet
Understanding Machine Learning Basics
57 pages
Learning Agents in AI: Chapter 6 Overview
No ratings yet
Learning Agents in AI: Chapter 6 Overview
42 pages
Introduction to Machine Learning Course
No ratings yet
Introduction to Machine Learning Course
59 pages
Introduction to Machine Learning Basics
No ratings yet
Introduction to Machine Learning Basics
28 pages
Coefficient of Determination (R²) Explained
No ratings yet
Coefficient of Determination (R²) Explained
72 pages
Understanding Machine Learning Basics
No ratings yet
Understanding Machine Learning Basics
63 pages
Introduction to Machine Learning Concepts
No ratings yet
Introduction to Machine Learning Concepts
70 pages
Understanding Machine Learning Basics
No ratings yet
Understanding Machine Learning Basics
24 pages
Chapter - 01 - Introduction To ML
No ratings yet
Chapter - 01 - Introduction To ML
60 pages
Understanding Machine Learning Basics
No ratings yet


