Machine Learning
Basics
Machine Learning
experience 𝐸 with respect to some class of
• A computer program is said to learn from
tasks 𝑇 and performance measure 𝑃, if its
performance at tasks in 𝑇, as measured
by 𝑃, improves with experience 𝐸 (Tom M.
Michell, 1997)
• Example
𝐸: the experience of playing thousands of
games
𝑇: playing checkers game
P: the fraction of games it wins against
human opponents
The task, T
• Tasks are usually described in terms of
how the machine learning should process
𝑥 ∈ ℝ𝑛 where each entry 𝑥𝑖
an example:
is a
feature
Classification: Learn 𝑓: ℝ𝑛 → 1, . . , k
o y = 𝑓(𝑥) : assigns the input to the
category with numerical code 𝑦
o Example: object recognition
Classification with missing inputs: learn a
set of functions
o Example: medical diagnosis
The task,
Regression: Learn 𝑓: ℝ → ℝ
T 𝑛
o Example: Predict claim amount for an
insured person
Transcription: unstructured
representation to discrete textual form
o Example: optical character recognition,
speech recognition
Machine Translation: Sequence to sequence
o Example: translate English to French
The task, T
Structured output: output is a vector or
data structure of values that are tightly
interrelated
o Example: parsing natural language
sentence into a tree that describes
grammatical structure by tagging nodes
of the tree as being verbs, nouns, etc.
o image segmentation: assigning a
pixel to a segment
o Image captioning
The task, T
Anomaly detection: flag unusual or
atypical events
o Credit card fraud detection
Synthesis and sampling: generate
examples that are similar to those in the
training data
o Genreate textures for video games
o Speech synthesis
Imputation of missing values: predict
the values of the missing entries
The task,
Denoising: 𝑓:
T 𝑥3 ∈ ℝ → 𝑥 ∈ ℝ 𝑛 𝑛
o Predict clean example from corrupted
example
o 𝑝 𝑥 𝑥5 =?
Density estimation or probability
o 𝑝𝑚𝑜𝑑𝑒𝑙 (𝒙): ℝ𝑛 → ℝ (x can be
mass estimation
discrete or continuous)
o Example: 𝑝 𝑚𝑜𝑑𝑒𝑙 (𝑥 𝑖 |𝒙 –𝒊 )
The performance
measure,
•Accuracy: P
The proportion of examples for which the model
produces the correct output
• Error rate (0-1 loss):
The proportion of examples for which the model
produces incorrect output
• Average log-probability of some examples (for
density estimation)
• It is difficult sometimes, to decide what should
be measured
• It is imparactical sometimes to measure an
implicit performance metric
• Evaluate the performance measure using a
test set
The experience,
• Elearning: 𝑝(𝑦|𝒙)
Supervised
Experience is a labeled dataset (or datapoints)
Each datapoint has a label or target
• Unsupervised learning: 𝑝(𝒙)
Experience is an unlabeled dataset
Clustering, learning probability distribution,
denoising, etc.
• The line between supervised and unsupervised
𝑝 𝒙 = ∏𝑛 𝑝(𝑥𝑖|𝑥1, … , 𝑥 𝑖 – 1 )
is often blurred
𝑝 𝑦 𝒙 = 𝑖=1 𝑝
𝒙,𝑦
𝑝
𝑦
∑ " 𝒙,𝑦"
The experience,
E
• Semi-Supervised learning:
Some examples include supervision
targets, but others do not.
• Multi-instance learning:
A collection of examples is labeled as
containing an example of a class
• Reinforcement learning:
Interaction with an environment
A
• dataset
Design matrix X:
Each row is an example
Each column is a feature
Example: iris dataset: 𝑋 ∈ ℝ150×4
• Vector of labels y
Example: 0 is a person, 1 is a car, 2 is a
cat, etc.
The label can be a set of words
(e.g. transcription)
Example: Linear
Task: regression problem ℝ
regression 𝑛
→ℝ
•
𝑦# = 𝒘𝑻 𝒙
1 𝒕𝒆𝒔𝒕, 𝒚
Have a test set: 𝑿
• Performance measure:
𝑀𝑆𝐸𝑡𝑒𝑠𝑡 0 𝒚1 𝒕𝒆𝒕𝒆𝒔𝒕 𝒕𝒆 2
𝒔𝒕 𝒔𝒕 𝑖
= −𝒚
�
𝒚1 𝑖 −
� 𝒕𝒆𝒔𝒕
= � 𝒚𝒕𝒆𝒔𝒕
1� 2
2
Experience
𝑿𝒕𝒓𝒂𝒊𝒏, 𝒚𝒕𝒓𝒂𝒊𝒏
•
𝑀𝑆𝐸𝑡𝑟𝑎𝑖𝑛
Minimize
Linear regression
𝛻𝒘 𝑀𝑆𝐸𝑡𝑟𝑎𝑖𝑛
= 10
•
• � 𝑿𝒕𝒓𝒂𝒊𝒏𝒘 − 𝒚𝒕𝒓𝒂𝒊𝒏 𝑇 𝑿𝒕𝒓𝒂𝒊𝒏𝒘 =
0
�
𝑚 𝑡r𝑎𝑖 �
(𝑿 𝑿 )𝒘 = 𝑿 𝒚 𝒕𝒓𝒂𝒊𝒏 (normal
−𝒚
𝒕𝒓𝒂𝒊𝒏 𝑻 𝒕𝒓𝒂𝒊𝒏 𝒕𝒓𝒂𝒊𝒏 𝑻
𝑛 𝒕𝒓𝒂𝒊𝒏
�
equations)
•
• We can solve for 𝑦O = 𝒘𝑻 𝒙 + 𝑏 using a
simple trick
• b is called the bias term (to not confound
with statistical bias that will be discussed
later)
Linear Regression
Figure
5.1
Capacity, overfitting
and underfitting
• Generalization: ability to perform well on
previously unobserved data
• i.i.d assumptions:
The examples in each dataset are
independent from each other,
and that the train set and test set are
identically distributed, drawn from the
same probability distribution as each
other.
We call that shared underlying
distribution, denoted 𝑝𝑑𝑎𝑡𝑎
distribution the data generating
Capacity, overfitting
and underfitting
• For some fixed value w, the expected
training set error is exactly the same as the
expected test set error, because both
expectations are formed using the same
dataset sampling process.
• In practice:
expected test set error > expected train
set error
• We need ability to:
1. Make the training error small.
(underfitting)
2. Make the gap between training and
Capacity, overfitting
and underfitting
• Underfitting occurs when the model is not
able to obtain a sufficiently low error value
on the training set.
• Overfitting occurs when the gap
between the training error and test
error is too large.
• Informally, a model’s capacity is its ability
to fit a wide variety of functions.
Low capacity may cause underfitting
High capacity may cause overfitting
Hypothesis space
• One way to control the capacity of a
learning algorithm is by choosing its
hypothesis space,
the set of functions that the learning
algorithm is allowed to select as being the
solution.
For example, linear regression has the set
of all linear functions of its input
To increase capacity: (model is
polynomial of degree 9)
9
𝑦O = 𝑏 + R 𝑤𝑖 𝑥𝑖
𝑖=1
Underfitting and
Overfitting in
Polynomial Estimation
Degree Degree Degree 9:
infinitely many
1 2
Capacity
• Representational capacity: finding the best
function within a family of functions
• Effective capacity may be less than the
representational capacity because it is hard to
find the best function
• Occam’s razor: Among competing
hypotheses that explain known observations
equally well, one should choose the
“simplest” one.
• Statistical learning theory shows that the
discrepancy between training error and
generalization error is bounded from above by a
quantity that grows as the model capacity grows
but shrinks as the number of training examples
increases
Generalization and
Capacity
Figure
High capacity: non-
parametric models
• Nearest neighbor
2
𝑦O = 𝑦𝑖
regression
2
𝑖 = argmin 𝑿𝒊,: −
where
𝒙
Bayes error
predictions from the true distribution 𝑝 𝒙, 𝑦
• The error incurred by an oracle making
.
• This error is due to:
There may still be noise in the distribution
𝑦 might be inherently stochastic
𝑦 may be deterministic but
involves other variables besides 𝒙
Training Set
Size moderate amount
of noise added to
a degree-5
polynomial
The no Free lunch
• theorem
Averaged over all possible data generating
distributions, every classification algorithm has
the same error rate when classifying previously
unobserved points.
• In some sense, no machine learning
algorithm is universally any better than
any other.
• The most sophisticated algorithm we can
conceive of has the same average performance
(over all possible tasks) as merely predicting
that every point belongs to the same class.
No free lunch
• The goal of machine learning research is not
to seek a universal learning algorithm or the
absolute best learning algorithm.
• Instead, our goal is to understand what
kinds of distributions are relevant to the
“real world”
and what kinds of machine learning
algorithms perform well on data drawn
distributions we care about.
Regularization
• We can give a learning algorithm a
preference for one solution in its
hypothesis space to another.
both functions are eligible, but one is
preferred.
• Example: linear regression with weight
𝐽 𝑤 = 𝑀𝑆𝐸𝑡𝑟𝑎𝑖𝑛 + 𝜆𝒘𝑻𝒘
decay:
• We are expressing a preference for the
weights to have smaller L2 norm
Larger 𝜆 forces the weights to become
Weight Decay
The true function is quadratic,
but here we use only models with
degree 9
Regularization
• Regularization is any modification we
make to a learning algorithm that is
intended to reduce its generalization
error but not its training error.
• Regularization is one of the central concerns
of the field of machine learning
Hyperparameters
• The values of hyperparameters are not
adapted by the learning algorithm itself
• Polynomial regression hyperparameters:
Degree of the polynomial
𝜆 (control of the weight decay)
• To set the hyperparameters, we need a
validation set
Typically, one uses about 80% of the
training data for training and 20% for
validation.
Cross validation
• It is used to estimate generalization error of a
learning algorithm A when the given dataset D
is too small for a simple train/test or train/valid
split to yield accurate estimation of
generalization error
• In k-fold cross-validation a partition of the
dataset is formed by splitting it into k non-
overlapping subsets.
The test error may then be estimated by
taking the average test error across k trials.
On trial i, the i-th subset of the data is used as
the test set and the rest of the data is used as
Cross validation
Point estimation
• Provides the single “best” prediction of some
quantity of interest
• can be a single parameter or a vector of
parameters in some parametric model
• A point estimator or statistic is any function of the
data (assuming i.i.d. samples):
𝜽3 = 𝑔(𝒙 𝟏 ,𝒙 𝟐 ,…,𝒙 𝒎 )
• Frequentist perspective: we assume that
the true parameter value θ is fixed but
unknown
• Function estimation is really just the same as
estimating a parameter θ; the function estimator
is simply a point estimator in function space.
Statistical
the data 𝒙 𝟏 , 𝒙
Expectation over
bias 𝟐 , … , 𝒙 𝒎
𝑏𝑖𝑎𝑠 𝜽Z > = D 𝜽Z > −𝜽
•
An estimator 𝜽Z > is said to be unbiased
if
𝑏𝑖𝑎𝑠 𝜽Z > = 0
𝑚→
which @
implies that D 𝜽Z > = 𝜽.
•
An estimator 𝜽Z >
lim
is said to be
asymptotically unbiased if
Example: Bernoulli
distribution
• Consider a set of samples {𝑥 1 ,𝑥 2 ,…,𝑥 𝑚 } that are i.i.d according to a Bernoulli
𝑃 𝑥 𝑖 ; = 𝜃𝑥 𝑖 1 − 𝜃
distribution with mean θ
𝜃
1−𝑥 𝑖
of 𝜃 is: 𝑚
A common estimator
1
𝜃𝑚 = � /
- 𝑖=1
𝑥
(𝑖)
�
Example: Gaussian
distribution - mean
Consider a set of samples {𝑥 1 , 𝑥 2 ,
…,𝑥 𝑚 } that are i.i.d
•
according to a a Gaussian distribution:
𝑝 𝑥 1𝑖 = 𝒩 𝑥 𝑖 , 𝜇, 𝜎2
𝜇̂𝑚 = 𝑥 (𝑖
𝑚 𝑖=
•
1 )
∑𝑚
The sample mean is an
unbiased estimator of
Gaussian mean parameter.
Example: Gaussian
distribution - variance
2
Biased estimator:𝑚𝜎#𝑚2 =
𝑖= 𝑥 − 𝑚
𝜇
•
1 𝑖
∑
1 𝑚
#
The sample variance is a
biased estimator.
• Unbiased
2
−
1
𝜎# 2
𝑚− ∑𝑖= 𝑥
estimator:
𝑚 𝑚
= 1 𝑚1 𝑖 𝜇
•
Variance
• how much we expect an estimator to
vary as a function of the data sample?
•
𝑣𝑎𝑟 𝜃g =?
•
Standard error 𝑆𝐸 𝜃g =?
• Just as we might like an estimator to
exhibit low bias we would also like it to
have relatively low variance.
Example: Bernoulli
distribution
• The variance of the estimator decreases as a
function of m, the number of examples in the
dataset.
Trading off Bias and Variance
Figure (Goodfellow
2016)
Bias vs. variance
Bias vs. variance
Maximum Likelihood
Estimation
• Rather than guessing that some function
might make a good estimator and then
analyzing its bias and variance, we would
like to have some principle from which we
can derive specific functions that are good
estimators for different models.
Same
Maximum Likelihood
Estimation
• Divide by
• m:
MLE minimizes the
Empirica
dissimilarity between the l
𝑝O𝑑𝑎𝑡𝑎 and the model
empirical distribution distributi
on
distribution
of 𝜃
Any loss consisting of a negative log- Independent
likelihood is a cross-entropy between
the empirical distribution defined by the
training set and the probability
distribution defined by model.
Conditional log-
• likelihood
Supervised
learning
• Instead of producing a single prediction 𝑦O,
distribution 𝑝(𝑦 | 𝒙).
we now think of the model as producing a
conditional Output of
linear
• For linear regression, regression as fixe
take: the mean d
Conditional log-likelihood
for linear regression
• To
maximize:
From
• Same as the
minimizing: dataset
Output of
linear
regression as
the mean
Properties of
maximum
number of examples 𝑚 → ∞, in terms of its
•
likelihood
The best estimator asymptotically, as the
rate of convergence as 𝑚 increases.
•
Consistency: plim 𝜃g𝑚 = 𝜃
𝑚→@
Under appropriate conditions
o The true distribution must lie within the
model family
o The true distribution must correspond to
exactly one value of θ
Bayesian statistics
• the true parameter 𝜃 is unknown or
uncertain and thus is represented as a
random variable.
• It is different than the single point
estimate (frequentist statistics)
Maximum A Posteriori
(MAP) Estimation
• Add the
prior:
• MAP Bayesian inference has the advantage
of leveraging information that is brought
by the prior and cannot be found in the
If this prior is given by 𝒩𝒘, 𝟎, 1 𝑰2 , then
training data.
�
�
𝜆𝑤
prior term
the log- is proportional to� the familiar
𝑤
�
weight decay penalty
Supervised
learning
•
examples
Logistic regression
• Support vector
machines
• Decision trees
Logistic regression
• There is no closed-form solution for the optimal
weights.
• Instead, we must search for them by maximizing
the log- likelihood.
• We can do this by minimizing the negative log-
likelihood (NLL) using gradient descent.
Support vector
• machines
Introduced by Vapnik in the
90s
• Widest street approach
• Decision rule
Which side of the street?
𝒘𝒙 + 𝑏 ≥ 0 ⇒
+
o
𝒘𝒙 + 𝑏 ≤
0⇒−
o
w is perpendicular to the median line of the street, but
which w?
o We add the following scaling constraints:
• 𝒘𝒙+ + 𝑏 ≥ 1
o 𝒘𝒙 – + 𝑏 ≤ −1
SVM
• y𝑖 = +1 for + samples
• y𝑖 = −1 for – samples
• 𝑦𝑖(𝒘𝒙𝒊 + 𝑏) ≥ 1 same equation for x+ and
x-
• 𝑦𝑖𝒘𝒙𝒊 + 𝑏 −1≥0
• w𝑖𝑑𝑡ℎ = 𝒙𝑦+𝑖 −𝒘𝒙 𝒊 +
𝒙 –b𝒘 𝑏 –(1+b) − 1 = 0
𝐰 1–
− � = �
= 𝒙
Constraints:
for 𝒊 in 𝒘 gutter2
the � �
• Maximize the width of the
ma ⇒ max ⇒
street: 2 𝒘
1
𝒘 22
x 𝒘 min
Lagrange multipliers
2 − ∑𝛼 𝑦𝑖 𝒘𝒙𝒊 + 𝑏 −1
𝐿 =2 1
𝑖
𝛻𝒘𝒘 𝛼𝑖− ∑𝛼≥ 𝑖 𝑦𝑖0
•
• 𝐿=𝒘 𝒙𝒊
𝒘 − ∑𝛼𝑖𝑦𝑖𝒙𝒊 =𝟎⇒𝒘=
∑𝛼𝑖𝑦𝑖𝒙𝒊
𝛛
𝐿
= ∑𝛼 𝑖
• Replacing
•
𝑦𝑖
in 𝐿 ∑𝛼
⇒ : 𝑖𝑦𝑖 = 0
𝛛𝑏
1
𝐿= 2
∑𝛼𝑖 𝑦𝑖 𝒙𝒊 ∑𝛼j 𝑦j 𝒙𝒋 − ∑𝛼𝑖 𝑦𝑖 𝒙𝒊 ∑𝛼j 𝑦j 𝒙𝒋
− ∑𝛼𝑖 𝑦𝑖 𝑏 +
1
∑𝛼𝑖
𝐿 = ∑𝛼𝑖 − 2 ∑𝛼𝑖 𝛼j 𝑦𝑖 𝑦j 𝒙𝒊 𝒙𝒋
• Numerical analysts will solve this problem
Kernel trick
𝛼𝑖 ≠ 0 only for points in the gutter, also
known as support vectors
•
Decision for 𝒙𝒋: 𝒘𝒙𝒋 + 𝑏 = ∑𝛼𝑖 𝑦𝑖 𝒙𝒊𝒙𝒋
+𝑏
•
We can change the representation of 𝒙𝒊
to be 𝜙 𝒙𝒊
•
• We do not need direct access to 𝜙 𝒙𝒊
Instead we define 𝑘 𝒙 𝒊 , 𝒙𝒋 = 𝜙 𝒙𝒊
𝜙 𝒙𝒋
The function k is called kernel
Advantages of the
kernel trick
• The kernel trick is powerful for two reasons.
nonlinear as a function of 𝒙 using convex
it allows us to learn models that are
optimization techniques that are
guaranteed to converge efficiently
the kernel function often admits an
implementation that is significantly more
constructing two 𝜙 𝒙 vectors and
computational efficient than naively
explicitly taking their dot product.
o In some cases, 𝜙 𝒙 can even
be infinite dimensional
Gaussian
• kernel
Gaussian kernel
Aka RBF: Radial Basis Function
• We can think of the Gaussian kernel as performing a
kind of template matching.
A training example x associated with training
label y becomes a template for class y.
When a test point x’ is near x according to
Euclidean distance, the Gaussian kernel has a
large response, indicating that x’ is very
similar to the x template.
The model then puts a large weight on the
associated training label y.
• Overall, the prediction will combine many such
training labels weighted by the similarity of the
corresponding training examples.
SVM
•
limitations
high computational cost of training when the
dataset is large.
• The current deep learning renaissance began
when Hinton et al. (2006) demonstrated that
a neural network could outperform the RBF
kernel SVM on the MNIST benchmark.
Watch (SVM)
• https:
//[Link]/watch?v=_PwhiWxHK8o
K-Nearest Neighbors
• There is not even really a training stage or
learning process.
We find the k-nearest neighbors to x in the
training data X. We then return the average
of the corresponding y values in the
training set
• In the case of classification, we can
average over one-hot code vectors
We can then interpret the average over
these one-hot codes as giving a
probability distribution over classes
K-NN limitations
• It cannot learn that one feature is
more discriminative than another.
regression task with 𝑥 ∈ ℝ100 drawn from
For example, imagine we have a
an isotropic Gaussian distribution,
but only a single variable is relevant to the
𝑦 = 𝑥1
output
Nearest neighbor regression will not be
able to detect this simple pattern.
Decision
Trees How a decision
divide ℝ2.
tree might
Each leaf requires at least one
training example to define,
so it is not possible for the
decision tree to
learn a function that has more local
maxima than the number of training
examples.
Decision trees
limitations
• It struggles to solve some problems that
are easy even for logistic regression.
positive class occurs wherever 𝑥2 > 𝑥1
• If we have a two-class problem and the
the decision boundary is not axis-aligned.
The decision tree will thus need to
approximate the decision boundary with
many splits, implementing a step function
that constantly walks back and forth across
the true decision function.
Unsupervided
• learning
Density estimation
• Learning to draw samples from a distribution
• Learning to denoise data from some
distribution
• Clustering the data into groups of related
examples
• Simpler representation:
Lower-dimensional representation
Sparse representation
Independent representation
Principal Components
Analysis
• lower
dimensionality
• no linear correlation
No linear
correlation
Let us consider the 𝑚×𝑛-dimensional design
matrix 𝑿.
•
D 𝒙 = 0.
• We will assume that the data has a mean of zero,
If this is not the case, the data can easily be
centered by subtracting the mean from all
examples in a preprocessing step.
• The unbiased sample covariance matrix
𝑉𝑎𝑟 𝒙 1 𝑿𝑻
associated with
𝑚−
= 𝑿 𝑻
PCA finds 𝒛 = 𝑾 𝒙
X is given by:
1
where 𝑉𝑎𝑟
• is
𝒛
diagonal.
Decorrelati
PCA finds 𝒛 = 𝑾 𝒙 where 𝑉𝑎𝑟𝒛
𝑻
on
• is
diagonal.
Remember 𝑻
that 𝑿
𝑉𝑎𝑟 𝒛 𝟏 𝒁 𝒁 = 𝑾 𝑿 𝑿𝑾 =
1 𝑻𝑿 = 𝑾𝚲𝑾𝑻
𝑻 𝑻
1
𝑚– 𝒎– 𝑚–
= 1 𝑾𝑻𝑾𝜦𝑾 𝑻𝑾 =
•
𝑉𝑎𝑟 𝒛 𝑚1– 𝚲 since 𝑾𝑻𝑾
𝟏 1
= =𝑰
•
1
• PCA disentangles the unknown factors of
variation underlying the data.
this disentangling takes the form of finding a
rotation of the input space that aligns the
principal axes of variance with the basis of the
new representation space associated with z.
k-means clustering
• Divides the training set into k different
clusters of examples
vector ℎ
• Provides a k-dimensional one-hot code
representing an input 𝑥.
If 𝑥 belongs to cluster 𝑖, then ℎ = 1 and
𝑖
all other entries of the representation ℎ are
zero.
This is an example of a sparse
representation
k-means algorithm
initialize k different centroids {𝜇(1), . . . ,
𝜇(𝑘)} to different values,
•
• then alternate between two different
steps until convergence:
assigned to cluster 𝑖, where 𝑖 is the index
In one step, each training example is
of the nearest centroid 𝜇(𝑖).
In the other step, each centroid 𝜇(𝑖) is
examples 𝑥(𝑗) assigned to cluster 𝑖.
updated to the mean of all training
How to evaluate
• clustering?
Suppose we have images of
Red trucks, gray trucks
Red cars, gray cars
• One clustering algorithm may find a cluster of
cars and a cluster of trucks
• Another learning algorithm may find a
cluster of red vehicules and a cluster of gray
vehicules
• A third algorithm may find 4 clusters …
Red cars similarity to gray trucks is same
as their similarity to gray cars!
• A distributed representation may have two
attributes for each vehicule
Building a machine
learning
• Simple recipe
algorithm
A dataset
A cost function (e.g. MSE, negative log-
likelihood)
A model (e.g. linear, non-linear)
An optimization procedure (closed form,
gradient descent, stochastic gradient
descent … )
Exercis
• Link equations e
and
definitions
Stochastic gradient
descent
Computing the gradient is
O(m) Sample a mini-batch
Challenges motivating
deep learning
• Simple machine learning algorithms work
very well on a wide variety of important
problems.
• However, they have not succeeded in
solving the central problems in AI, such as
recognizing speech or recognizing objects.
• Generalizing to new examples becomes
exponentially more difficult when working
with high- dimensional data,
• Such data also often impose high
computational costs.
Curse of
Many Dimensionality
machine learning
problems become
exceedingly difficult when
the number of dimensions
in the data is high.
one variable, two variables, three variables,
10 regions of 10 regions of 10 regions of
interest. interest each. interest each.
For 𝑑 dimensions and 𝑣 values to be distinguished
along each axis, we seem to need 𝑂(𝑣𝑑 ) regions
Smoothness prior (aka
local consistency)
• the function we learn
should not change very
much within a small region.
KNN
RBF kernel
Decision trees
• to distinguish 𝑂(𝑘)
of these require 𝑂(𝑘)
regions in input space, all
methods Vorono
examples i
diagra
m
Smoothness prior is
not enough!
• Is there a way to represent a complex
function that has many more regions to be
distinguished than the number of training
examples?
A checkerboard contains many
variations but there is a simple
structure to them.
• The answer is Yes:
If we introduce some dependencies
between the regions
via additional assumptions about the
underlying data generating distribution
Deep learning solution
• Deep learning assumes that the data was
generated by the composition of factors or
features, potentially at multiple levels in a
hierarchy
• These apparently mild assumptions allow an
exponential gain in the relationship between
the number of examples and the number of
regions that can be distinguished.
• Deep and distributed representations counter the
curse of dimensionality.
Manifold
• A manifold is a connected set of points
that can be approximated well by considering
only a small number of degrees of freedom,
or dimensions, embedded in a higher-
dimensional space.
• A manifold is a set of points, associated
with a neighborhood around each
point.
Transformations can be applied to move
on the manifold from one position to a
neighboring one.
• From any given point, the manifold locally
appears to be an Euclidean space.
In everyday life, we experience the surface
of the world as a 2-D plane, but it is in fact
a spherical manifold in 3-D space.
Manifold
One-dimensional manifold embedded in two
dimensional space.
Manifold
• we allow the dimensionality of the manifold
to vary from one point to another.
This often happens when a manifold
intersects itself.
For example, a figure eight is a manifold
that has a single dimension in most places
but two dimensions at the intersection at
the center.
Manifold learning
• Manifold learning algorithms assumes that
ℝ𝑛 consists of invalid inputs,
most of
interesting inputs occur only along a
collection of manifolds
Interesting variations in the output of the
learned function occurring only along
directions that lie on the manifold,
or with interesting variations happening
only when we move from one manifold to
another.
the key assumption
Arguments supporting
the manifold
•
hypothesis
the probability distribution over images, text
strings, and sounds that occur in real life is
highly concentrated.
Uniforml
y
Sample
d
Images
Arguments supporting
the manifold
•
hypothesis
Examples we encounter are connected to
each other by applying transformations to
traverse the manifold.
QMUL
Conclusion
• The manifold assumption is at least
approximately correct.
• This concludes part I of the course
• You are now prepared to embark upon your
study of deep learning.
Congratulations