Deep Learning
CS F425
BITS Pilani Prof. Pratik Narang
Department of CSIS
Pilani Campus
Regression models
Introduction
• Before we venture into deep learning, we need to cover the basics of neural
network training.
• Here, we will understand the entire training process, including:
o Defining simple neural network architectures
o Handling data
o Specifying a loss function
o Training the model
• Classic statistical learning techniques such as linear and softmax regression can
be thought of as linear neural networks.
Linear Regression
• Regression refers to a set of methods for modeling the relationship
between one or more independent variables and a dependent variable.
o The purpose of regression is most often to characterize the relationship
between the inputs and outputs.
o Machine learning, on the other hand, is most often concerned with prediction.
o We can use regression whenever we want to predict a numerical value.
o Predicting prices (of homes, stocks, etc.)
o Predicting length of stay (for patients in the hospital)
o Demand forecasting (for retail sales)
Linear Regression
• Linear regression flows from a few simple assumptions:
• The relationship between the independent variables 𝒙 and the
dependent variable 𝑦 is linear, i.e., 𝑦 can be expressed as a weighted
sum of the elements in 𝒙, given some noise on the observations.
• We will use $n$ to denote the number of examples. We index the data
examples by $i$, denoting each input as $\mathbf{x}^{(i)}$
and the corresponding label as $y^{(i)}$.
Basic Elements of Linear Regression
• Linear Model
• Loss Function
• Analytic Solution
• Gradient Descent
• Making Predictions with the Learned Model
Linear regression: linear model
• The linearity assumption says that the target (price) can be expressed as a
weighted sum of the features (area and age):
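$$\mathrm{price} = w_{\mathrm{area}} \cdot \mathrm{area} + w_{\mathrm{age}} \cdot \mathrm{age} + b$$
• $w_{\mathrm{area}}$ and $w_{\mathrm{age}}$ are called weights, and $b$ is called a bias (also offset or intercept).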
• The weights determine the influence of each feature on our prediction.
• The bias just says what value the predicted price should take when all of the
features take value 0.
• The equation above is an affine transformation of input features, which is
characterized by a linear transformation of features via weighted sum, combined
with a translation via the added bias.
Linear regression: linear model
• Given a dataset, our goal is to choose the weights 𝒘 and the bias 𝑏 such that, on
average, the predictions made by our model best fit the true observations.
• Models whose output prediction is determined by the affine transformation of
input features are linear models.
Linear regression: linear model
• In machine learning, we usually work with high-dimensional datasets.
• When our inputs consist of $d$ features, we express our prediction $\hat{y}$ as
$$\hat{y} = w_1 x_1 + \dots + w_d x_d + b$$
• Collecting all features into a vector $\mathbf{x} \in \mathbb{R}^d$ and all weights into a vector $\mathbf{w} \in \mathbb{R}^d$,
we can express our model using a dot product:
$$\hat{y} = \mathbf{w}^\top \mathbf{x} + b$$
Here the vector $\mathbf{x}$ corresponds to the features of a single data example.
• We refer to the features of our entire dataset of $n$ examples via the matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$,
where $\mathbf{X}$ contains one row for every example and one column for every feature.
• For a collection of features $\mathbf{X}$, the predictions $\hat{\mathbf{y}} \in \mathbb{R}^n$ can be expressed via the
matrix-vector product:
$$\hat{\mathbf{y}} = \mathbf{X}\mathbf{w} + b$$
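As a minimal sketch of the matrix-vector form, the NumPy snippet below computes predictions for a small design matrix. The feature values, weights, and bias are made-up illustrative numbers, not values from the lecture.

```python
import numpy as np

# Hypothetical data: n = 3 houses, d = 2 features (area, age).
X = np.array([[100.0, 5.0],
              [80.0, 10.0],
              [120.0, 2.0]])
w = np.array([2.0, -1.0])   # assumed weights, one per feature
b = 10.0                    # assumed bias

# Matrix-vector form: the scalar b is added to every example's prediction.
y_hat = X @ w + b

# Same result, one example at a time, using the dot-product form.
y_hat_loop = np.array([w @ x + b for x in X])
```

Both forms give one prediction per row of the design matrix; the matrix form simply avoids the explicit loop.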
Linear regression: linear model
• Given the features of a training dataset 𝑿 and corresponding (known) labels 𝑦, the goal
of linear regression is to find the weight vector 𝒘 and the bias term 𝑏 such that, given
features of a new data example sampled from the same distribution as 𝑿, the new
example’s label will (in expectation) be predicted with the lowest error.
• We would not expect to find a real-world dataset of $n$ examples where $y^{(i)}$ exactly
equals $\mathbf{w}^\top \mathbf{x}^{(i)} + b$ for all $1 \le i \le n$.
o Thus, even when we are confident that the underlying relationship is linear, we will
incorporate a noise term to account for such errors.
• Before searching for the best parameters (or model parameters) 𝒘 and 𝑏, we need two
more things:
1. A quality measure for some given model.
2. A procedure for updating the model to improve its quality.
Linear regression: loss function
• To think about how to fit data with our model, we need to determine a measure
of fitness.
• The loss function quantifies the distance between the real and predicted value of
the target.
o The loss will be a non-negative number where smaller values are better.
o Perfect predictions incur a loss of 0.
• The most popular loss function in regression problems is the squared error:
$$l^{(i)}(\mathbf{w}, b) = \frac{1}{2}\left(\hat{y}^{(i)} - y^{(i)}\right)^2$$
• The constant $\frac{1}{2}$ makes no difference but will prove to be notationally convenient, cancelling out
when we take the derivative of the loss.
Linear regression: loss function
• The empirical error is only a function of the model parameters.
• Consider the example below where we plot a regression problem for
a one-dimensional case.
• Large differences between estimates
$\hat{y}^{(i)}$ and observations $y^{(i)}$ lead to
even larger contributions to the loss,
due to the quadratic dependence.
Linear regression: loss function
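• To measure the quality of a model on the entire dataset of $n$ examples, we
average the losses on the training set (summing instead gives the same minimizer):
$$L(\mathbf{w}, b) = \frac{1}{n}\sum_{i=1}^{n} l^{(i)}(\mathbf{w}, b) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{2}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)^2$$
• When training the model, we want to find parameters $(\mathbf{w}^*, b^*)$ that
minimize the total loss across all training examples:
$$\mathbf{w}^*, b^* = \operatorname*{argmin}_{\mathbf{w}, b} L(\mathbf{w}, b)$$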
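The per-example squared error and its average over the dataset can be sketched in a few lines of NumPy. The function names and the toy data are assumptions made for illustration.

```python
import numpy as np

def squared_loss(y_hat, y):
    """Per-example loss: 0.5 * (y_hat - y)^2."""
    return 0.5 * (y_hat - y) ** 2

def mean_loss(X, y, w, b):
    """Average of the per-example losses over the dataset."""
    return squared_loss(X @ w + b, y).mean()

# Toy one-feature dataset where y = 2 * x exactly.
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])

perfect = mean_loss(X, y, np.array([2.0]), 0.0)  # predictions match the labels
worse = mean_loss(X, y, np.array([1.0]), 0.0)    # errors are 1, 2 and 3
```

The largest error (3) contributes 4.5 to the sum while the smallest (1) contributes only 0.5, which illustrates the quadratic dependence.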
Linear regression: analytic solution
• Linear regression can be solved analytically by applying a simple formula:
• Subsume the bias $b$ into the parameter $\mathbf{w}$ by appending a column consisting of all ones to the design matrix.
• Then our prediction problem is to minimize $\lVert \mathbf{y} - \mathbf{X}\mathbf{w} \rVert^2$.
• There is just one critical point on the loss surface, and it corresponds to the minimum of the loss over the entire domain.
• Taking the derivative of the loss with respect to $\mathbf{w}$ and setting it equal to zero yields the
analytic solution (assuming $\mathbf{X}^\top \mathbf{X}$ is invertible):
$$\mathbf{w}^* = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$$
• The requirement of an analytic solution is so restrictive that it would exclude almost all exciting
aspects of deep learning.
• Simple problems like linear regression may admit analytic solutions, but we should not
get used to such good fortune!
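A minimal sketch of the analytic solution on synthetic data follows. The "true" weights (2, -3.4), bias 4.2, and noise level are assumptions chosen for the example, and `np.linalg.solve` is used on the normal equations instead of forming the inverse explicitly, which is usually the numerically safer choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from an assumed model y = 2*x1 - 3.4*x2 + 4.2 + noise.
n = 200
X = rng.normal(size=(n, 2))
true_w, true_b = np.array([2.0, -3.4]), 4.2
y = X @ true_w + true_b + 0.01 * rng.normal(size=n)

# Append a column of ones so the bias becomes the last entry of w.
X1 = np.hstack([X, np.ones((n, 1))])

# Solve (X^T X) w = X^T y, which is equivalent to w = (X^T X)^{-1} X^T y.
w_star = np.linalg.solve(X1.T @ X1, X1.T @ y)
```

The first two entries of `w_star` estimate the weights and the last entry estimates the bias.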
Linear regression: gradient descent
• In cases where we cannot solve the models analytically, we can still train models
effectively in practice.
• The key technique for optimizing nearly any deep learning model is called
gradient descent.
• Gradient descent iteratively reduces the error by updating the parameters in the
direction that incrementally lowers the loss function.
[Figure: loss plotted against the value of a weight; gradient descent moves from the starting point to the point of convergence.]
Linear regression: gradient descent
• The most naïve application of gradient descent consists of taking the derivative
of the loss function, which is an average of the losses computed on every single
example in the dataset.
o This can be extremely slow: we must pass over the entire dataset before making a single update.
o Thus, we sample a random minibatch of examples every time we need to compute the
update. This variant is called minibatch stochastic gradient descent.
• In each iteration:
1. We first randomly sample a minibatch B consisting of a fixed number of training examples.
2. We then compute the derivative (gradient) of the average loss on the minibatch with
regard to the model parameters.
3. Finally, we multiply the gradient by a predetermined positive value η and subtract the
resulting term from the current parameter values.
Linear regression: gradient descent
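• We can express the update mathematically:
$$(\mathbf{w}, b) \leftarrow (\mathbf{w}, b) - \frac{\eta}{|B|} \sum_{i \in B} \partial_{(\mathbf{w}, b)} l^{(i)}(\mathbf{w}, b)$$
• $\mathbf{w}$ is the weight vector, $b$ is the bias, $\eta$ is a predetermined positive
value (the learning rate), the cardinality $|B|$ is the number of
examples in each minibatch (the batch size), and $\partial_{(\mathbf{w}, b)} l^{(i)}(\mathbf{w}, b)$
is the partial derivative of the loss on the $i$th example with respect to the parameters.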
• The overall procedure:
1. Randomly initialize the values of the model parameters.
2. Iteratively sample random minibatches from the data.
3. Update the parameters in the direction of the negative gradient.
Linear regression: gradient descent
• We can write this out explicitly as follows:
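$$\mathbf{w} \leftarrow \mathbf{w} - \frac{\eta}{|B|}\sum_{i \in B} \partial_{\mathbf{w}} l^{(i)}(\mathbf{w}, b) = \mathbf{w} - \frac{\eta}{|B|}\sum_{i \in B} \mathbf{x}^{(i)}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)$$
$$b \leftarrow b - \frac{\eta}{|B|}\sum_{i \in B} \partial_{b} l^{(i)}(\mathbf{w}, b) = b - \frac{\eta}{|B|}\sum_{i \in B} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)$$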
• The values of the batch size and learning rate are manually pre-
specified and not typically learned through model training.
o These parameters that are tunable but not updated in the training loop are
called hyperparameters.
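The minibatch update for linear regression can be sketched as below. This is an illustrative implementation under stated assumptions, not the course's reference code: the synthetic data, the hyperparameter values, the zero initialization, and the choice to form minibatches by reshuffling once per epoch are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from an assumed linear model plus small noise.
n = 1000
X = rng.normal(size=(n, 2))
y = X @ np.array([2.0, -3.4]) + 4.2 + 0.01 * rng.normal(size=n)

eta, batch_size, num_epochs = 0.1, 10, 5   # hyperparameters (assumed values)
w, b = np.zeros(2), 0.0                    # zero init for simplicity

for epoch in range(num_epochs):
    order = rng.permutation(n)             # reshuffle, then walk through minibatches
    for start in range(0, n, batch_size):
        idx = order[start:start + batch_size]
        err = X[idx] @ w + b - y[idx]       # w^T x^(i) + b - y^(i) for each i in B
        grad_w = X[idx].T @ err / len(idx)  # (1/|B|) * sum_i x^(i) * err_i
        grad_b = err.mean()                 # (1/|B|) * sum_i err_i
        w -= eta * grad_w
        b -= eta * grad_b
```

The learning rate and batch size are set before the loop and never updated inside it, which is what makes them hyperparameters rather than model parameters.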
Linear regression: gradient descent
• Linear regression happens to be a learning problem where there is only one minimum
over the entire domain.
• For more complicated models, like deep networks, the loss surfaces contain many
minima.
• Deep learning practitioners seldom struggle to find parameters that minimize the loss on
training sets.
• The more formidable task is to find parameters that will achieve
low loss on data that we have not seen before, a challenge called generalization.
Making Predictions with the Learned Model
• Given the learned linear regression model $\hat{\mathbf{w}}^\top \mathbf{x} + \hat{b}$, we can estimate
the price of a new house given its area $x_1$ and age $x_2$.
o Estimating targets given features is commonly called prediction or
inference.
Logistic Regression
Regression vs. Classification
• Regression estimates a continuous value
• Classification predicts a discrete category
Logistic regression
• This is just like linear regression, except that the values y we want to
predict take on only a small number of discrete values.
• For now, we will focus on the binary classification problem in which y
can take on only two values: 0 and 1.
• For instance, if we are trying to build a spam classifier for email, then
$x^{(i)}$ may be some features of an email, and $y^{(i)}$ may be 1 if it is
spam and 0 otherwise.
Logistic regression
• We could approach the classification problem ignoring the fact that y
is discrete-valued, and use our old linear regression algorithm to try
to predict y given x.
• However, this method can perform very poorly.
• Intuitively, it also doesn’t make sense for $\hat{y}$ to take values larger than
1 or smaller than 0 when we know that y ∈ {0, 1}.
Logistic regression
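• Let’s change the form of our hypothesis for the prediction $\hat{y}$.
• We will choose
$$\hat{y} = \sigma(\theta^\top x)$$
• where $\theta$ denotes the vector of model parameters and
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
• is called the logistic function or the sigmoid function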
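A minimal sketch of the logistic prediction, assuming the usual definition σ(z) = 1 / (1 + e^(−z)). The parameter values and the 0.5 decision threshold are assumptions made for illustration.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(theta, x):
    """y_hat = sigmoid(theta^T x), a value between 0 and 1."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

theta = [0.5, -1.0]                    # hypothetical parameters
p = predict_proba(theta, [2.0, 1.0])   # theta^T x = 0 here, so p = 0.5
label = 1 if p >= 0.5 else 0           # thresholding at 0.5 is an assumed convention
```

Unlike the linear regression output, the prediction stays between 0 and 1 for any input, so it can be read as a score for the class y = 1.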
The logistic/sigmoid function
Perceptron
Perceptron
• A perceptron takes several binary inputs, $x_1, x_2, \dots$,
and produces a single binary output.
• Computing the output? We introduce weights $w_1, w_2, \dots$,
which express the importance of each input. The output is 1 if the weighted
sum $\sum_j w_j x_j$ exceeds a threshold value, and 0 otherwise.
• By varying the weights and the threshold, we can
get different models of decision-making.
• A perceptron helps make decisions by weighing up
evidence!
Perceptron
• A complex network of perceptrons could make quite subtle decisions!
• The first layer of perceptrons is making three very simple decisions by weighing
the input evidence.
• The perceptrons in the second layer make decisions by weighing the results from
the first layer, and can thus decide at a more complex and abstract level.
• More complex decisions can be made by the perceptron in the third layer.
• In this way, a many-layer network of perceptrons can engage in sophisticated
decision making.
Perceptron – notations
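• Let's make the notation uniform!
• Instead of
$$\text{output} = \begin{cases} 0 & \text{if } \sum_j w_j x_j \le \text{threshold} \\ 1 & \text{if } \sum_j w_j x_j > \text{threshold} \end{cases}$$
• We will use
$$\text{output} = \begin{cases} 0 & \text{if } w \cdot x + b \le 0 \\ 1 & \text{if } w \cdot x + b > 0 \end{cases}$$
where $w \cdot x \equiv \sum_j w_j x_j$ and the bias is $b \equiv -\text{threshold}$.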
Perceptron – more uses!
• Another way perceptrons can be used is to compute the elementary
logical functions, such as AND, OR, and NAND
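As a small sketch, a perceptron that outputs 1 when w · x + b > 0 (and 0 otherwise) can reproduce these gates. The weights and biases below are hand-picked for illustration, not learned, and other choices work equally well.

```python
def perceptron(x, w, b):
    """Output 1 if w . x + b > 0, else 0."""
    total = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if total > 0 else 0

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]

# Hand-picked (not learned) weights and biases.
nand = [perceptron(x, (-2, -2), 3) for x in inputs]
and_ = [perceptron(x, (1, 1), -1.5) for x in inputs]
or_ = [perceptron(x, (1, 1), -0.5) for x in inputs]
```

Each list holds the gate's output for the four input pairs in the order given by `inputs`.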
Perceptron
• The computational universality of perceptrons does not imply that
perceptrons are merely a new type of NAND gate!
• We want to devise learning algorithms which can automatically tune
the weights and biases of a network of artificial neurons.
• Instead of explicitly laying out a circuit of NAND and other gates, our
neural networks can simply learn to solve problems, sometimes for
problems where it would be extremely difficult to directly design a
conventional circuit.
Sigmoid neurons
• How can we devise learning algorithms for a neural
network?
• Let’s say we have a network of perceptrons, and
we want to use it to learn to solve some problem.
• To see how learning might work, suppose we make
a small change in some weight (or bias) in the
network.
• What we want: this small change in weight should
cause only a small corresponding change in the
output from the network.
• This property will make learning possible. [WHY?]
Sigmoid neurons
• However… perceptrons won’t work that way.
• A small change in weights/bias of any single perceptron in the
network may cause the output of that perceptron to completely flip,
say from 0 to 1.
• To overcome this, we introduce sigmoid neurons – similar to
perceptrons, but modified so that small changes in their weights and
bias cause only a small change in their output.
Sigmoid neurons
• The sigmoid neuron has inputs $x_1, x_2, \dots$. But instead of being just 0
or 1, these inputs can take on any value between 0 and 1.
• The sigmoid neuron has a weight for each input, $w_1, w_2, \dots$, and an
overall bias, $b$. But the output is not 0 or 1. Instead, it's $\sigma(w \cdot x + b)$,
where $\sigma$ is called the sigmoid function, and is defined by:
$$\sigma(z) \equiv \frac{1}{1 + e^{-z}}$$
• So, the output of a sigmoid neuron with inputs $x_1, x_2, \dots$, weights
$w_1, w_2, \dots$, and bias $b$ is:
$$\frac{1}{1 + \exp\left(-\sum_j w_j x_j - b\right)}$$
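To make the contrast with the perceptron concrete, the sketch below applies the same tiny weight change to a perceptron and to a sigmoid neuron. It assumes the standard sigmoid σ(z) = 1 / (1 + e^(−z)); the input and weight values are made up so that the weighted sum sits right at the perceptron's decision boundary.

```python
import math

def perceptron(x, w, b):
    """Step output: 1 if w . x + b > 0, else 0."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0

def sigmoid_neuron(x, w, b):
    """Smooth output: sigmoid(w . x + b)."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

x, b = (1.0, 1.0), 0.0
w_before = (0.5, -0.501)   # weighted sum slightly below 0
w_after = (0.5, -0.499)    # a tiny change pushes it slightly above 0

step_change = perceptron(x, w_after, b) - perceptron(x, w_before, b)
smooth_change = sigmoid_neuron(x, w_after, b) - sigmoid_neuron(x, w_before, b)
```

The perceptron's output flips from 0 to 1, while the sigmoid neuron's output moves by well under one thousandth.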
Sigmoid function
• This shape is a smoothed-out version of a step function.
• It's the smoothness of the σ function that is the crucial fact, not its
detailed form.
• The smoothness of σ means that small changes $\Delta w_j$ in the weights
and $\Delta b$ in the bias will produce a small change $\Delta\text{output}$ in the
output from the neuron.
• Calculus tells us that $\Delta\text{output}$ is well approximated by
$$\Delta\text{output} \approx \sum_j \frac{\partial\,\text{output}}{\partial w_j}\,\Delta w_j + \frac{\partial\,\text{output}}{\partial b}\,\Delta b$$
• where the sum is over all the weights, $w_j$, and $\partial\,\text{output}/\partial w_j$ and $\partial\,\text{output}/\partial b$ denote partial
derivatives of the output with respect to $w_j$ and $b$, respectively.
• $\Delta\text{output}$ is a linear function of the changes $\Delta w_j$ and $\Delta b$ in the weights and bias.
• This linearity makes it easy to choose small changes in the weights and biases to achieve any
desired small change in the output.
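The linear approximation can be checked numerically for a single sigmoid neuron. This sketch uses the standard identity σ'(z) = σ(z)(1 − σ(z)), so that ∂output/∂w_j = s(1 − s)x_j and ∂output/∂b = s(1 − s) with s the current output; all numeric values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def output(w, b, x):
    """Output of a sigmoid neuron: sigmoid(w . x + b)."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

x, w, b = [0.5, 0.8], [0.3, -0.2], 0.1   # hypothetical inputs, weights, bias
dw, db = [0.01, -0.02], 0.005            # small changes to weights and bias

# Linear approximation: sum_j (d output / d w_j) * dw_j + (d output / d b) * db.
s = output(w, b, x)
approx = sum(s * (1 - s) * xj * dwj for xj, dwj in zip(x, dw)) + s * (1 - s) * db

# Exact change, obtained by recomputing the output with the perturbed parameters.
w_new = [wj + dwj for wj, dwj in zip(w, dw)]
exact = output(w_new, b + db, x) - s
```

For changes this small, `approx` and `exact` agree closely; the gap grows as the changes get larger, because the approximation is only first-order.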
Neural networks
“Feedforward” neural networks
• We studied neural networks where the output from one layer is used
as input to the next layer.
• Such networks are called feedforward neural networks.
• This means there are no loops in the network – information is always
fed forward, never fed back.
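Neural Networks – notations
• We use $w^l_{jk}$ to denote the weight for the connection from the $k$th neuron in the
$(l-1)$th layer to the $j$th neuron in the $l$th layer.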
Neural Networks – notations
• We use a similar notation for the network's biases and activations.
Explicitly, we use $b^l_j$ for the bias of the $j$th neuron in the $l$th layer. And
we use $a^l_j$ for the activation of the $j$th neuron in the $l$th layer.
Neural Networks – notations
• So, the activation $a^l_j$ of the $j$th neuron in the $l$th layer is related to the
activations in the $(l-1)$th layer by the equation:
$$a^l_j = \sigma\left(\sum_k w^l_{jk}\, a^{l-1}_k + b^l_j\right)$$
• where the sum is over all neurons $k$ in the $(l-1)$th layer.
Neural Networks – notations
• We rewrite this expression in matrix form.
• We define a weight matrix $w^l$ for each layer $l$, where the entries of
the weight matrix $w^l$ are the weights connecting to the $l$th layer of
neurons. That is, the entry in the $j$th row and $k$th column is $w^l_{jk}$.
• Similarly, for each layer $l$, we define a bias vector $b^l$, where the
components of the bias vector are just the values $b^l_j$, one component
for each neuron in the $l$th layer.
• We define an activation vector $a^l$ whose components are the
activations $a^l_j$.
• Finally:
$$a^l = \sigma\left(w^l a^{l-1} + b^l\right)$$
• Sometimes written as: $a^l = \sigma(z^l)$ with $z^l \equiv w^l a^{l-1} + b^l$, where $z^l$ is the weighted input to layer $l$ and $\sigma$ is applied elementwise.
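The layer-by-layer computation can be sketched as a short loop in NumPy, assuming the matrix form a^l = σ(w^l a^(l−1) + b^l) with σ applied elementwise. The layer sizes, the random initialization, and the function names are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(a, weights, biases):
    """Apply a^l = sigmoid(w^l a^{l-1} + b^l) layer by layer."""
    for w, b in zip(weights, biases):
        z = w @ a + b      # weighted input z^l
        a = sigmoid(z)     # activation a^l
    return a

rng = np.random.default_rng(0)
sizes = [3, 4, 2]          # hypothetical sizes: 3 inputs, 4 hidden, 2 outputs
# w^l has shape (neurons in layer l, neurons in layer l-1), so entry [j, k] is w^l_jk.
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=m) for m in sizes[1:]]

out = feedforward(np.array([0.2, 0.5, 0.1]), weights, biases)
```

Storing w^l with one row per neuron in layer l is what lets the sum over k be written as a single matrix-vector product.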
Exercise
• Compute (write the equation) for the weighted input and activations
for each layer
Computing the activations
Neural Networks – learning
• Training/testing data!
• Each training input x is a 28×28 = 784-dimensional vector
• Each entry in the vector represents the grey value for a single pixel in
the image.
• We'll denote the corresponding desired output by y=y(x), where y is a
10-dimensional vector.
• Example: if a particular training image, $x$, depicts a 6, then
$y(x) = (0,0,0,0,0,0,1,0,0,0)^\top$ is the desired output from the network.
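The desired output vector can be built with a small helper. The function name `one_hot` is an assumption made for this sketch, not a name from the slides.

```python
def one_hot(digit, num_classes=10):
    """Return a list with 1 at index `digit` and 0 elsewhere."""
    y = [0] * num_classes
    y[digit] = 1
    return y

y6 = one_hot(6)   # desired output for an image depicting a 6
```

Index 0 corresponds to the digit 0, so the digit 6 sets the seventh entry.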
Neural Networks – learning
• We need an algorithm which lets us find weights and biases so that the output
from the network approximates y(x) for all training inputs x.
• To quantify how well we're achieving this goal, we define a cost function:
$$C(w, b) \equiv \frac{1}{2n}\sum_x \lVert y(x) - a \rVert^2$$
• Here, $w$ denotes the collection of all weights in the network, $b$ all the biases, $n$ is
the total number of training inputs, $a$ is the vector of outputs from the network
when $x$ is input, and the sum is over all training inputs $x$.
• $C$ is called the quadratic cost function; it is also known as the mean squared error (MSE).
• The aim of our training algorithm will be to minimize the cost C(w,b) as a function
of the weights and biases.
• We can use gradient descent. It can be viewed as a way of taking small steps in
the direction which does the most to immediately decrease C.
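A minimal sketch of the quadratic cost, assuming it takes the usual form C = 1/(2n) · Σ_x ‖y(x) − a‖², where a is the network output for input x. The toy outputs and targets are made up for illustration.

```python
def quadratic_cost(outputs, targets):
    """C = 1/(2n) * sum over examples of ||y(x) - a||^2."""
    n = len(outputs)
    total = 0.0
    for a, y in zip(outputs, targets):
        total += sum((yj - aj) ** 2 for aj, yj in zip(a, y))
    return total / (2 * n)

targets = [[0, 1], [1, 0]]                           # hypothetical desired outputs
perfect = quadratic_cost(targets, targets)           # outputs equal the targets
off = quadratic_cost([[0.5, 0.5], [1, 0]], targets)  # first output is wrong by 0.5 per entry
```

The cost is zero only when every output matches its target, and it grows smoothly as the outputs move away, which is what makes it suitable for gradient descent.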
Neural Networks – learning
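• We use gradient descent to find the weights $w_k$ and biases $b_l$ which
minimize the cost.
• Gradient descent update rule, in terms of $w$ and $b$, with learning rate $\eta$:
$$w_k \leftarrow w_k - \eta \frac{\partial C}{\partial w_k}$$
$$b_l \leftarrow b_l - \eta \frac{\partial C}{\partial b_l}$$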
• By repeatedly applying this update rule we can "roll down the hill", and
hopefully find a minimum of the cost function.
• We compute derivatives for gradient descent using “backpropagation”.
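The "rolling down the hill" behaviour can be illustrated on a toy cost whose gradient is known in closed form. This sketch assumes the standard update v ← v − η∇C(v); the cost function, learning rate, and step count are made up, and in a real network the gradient would come from backpropagation rather than a hand-written formula.

```python
def gradient_descent(grad, v, eta=0.1, steps=100):
    """Repeatedly apply v <- v - eta * grad(v)."""
    for _ in range(steps):
        g = grad(v)
        v = [vi - eta * gi for vi, gi in zip(v, g)]
    return v

def grad_C(v):
    """Gradient of the toy cost C(v) = (v1 - 3)^2 + (v2 + 1)^2."""
    return [2 * (v[0] - 3), 2 * (v[1] + 1)]

v_min = gradient_descent(grad_C, [0.0, 0.0])   # should approach the minimum at (3, -1)
```

With this learning rate each step shrinks the distance to the minimum by a constant factor; a learning rate that is too large would instead overshoot and diverge.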
Thank you!
Content adapted from: Michael A. Nielsen, "Neural Networks and Deep Learning", Determination Press, 2015