UNIT-2
DEEP NETWORK
TOPICS
▪ History of Deep Learning
▪ A Probabilistic Theory of Deep Learning
▪ Backpropagation and Regularization
▪ Batch Normalization, VC Dimension and Neural Nets, Deep vs Shallow Networks
▪ Convolutional Networks
▪ Generative Adversarial Networks (GANs)
▪ Semi-Supervised Learning
History of Deep Learning
• 1. Early Foundations (1940s–1960s)
• 1943 – McCulloch & Pitts:
Proposed the first artificial neuron model (binary threshold unit). It
showed how a network of simple units could, in theory, perform
computations like logic gates.
• 1950 – Alan Turing:
Introduced the idea of learning machines and posed the Turing Test
for machine intelligence.
• 1958 – Perceptron (Frank Rosenblatt):
First trainable neural network. Could learn simple patterns but
failed on problems like XOR → highlighted by Minsky & Papert
(1969), leading to a decline in neural network research.
• 2. The First AI Winter (1970s)
• Neural networks lost popularity due to limitations (linear separability, lack
of computational power, and weak algorithms).
• Funding for AI dropped → shift towards symbolic AI and expert systems.
• 3. Revival through Backpropagation (1980s)
• 1986 – Rumelhart, Hinton, & Williams:
Popularized the Backpropagation algorithm, enabling multi-layer
perceptrons (MLPs) to learn complex nonlinear mappings.
• Neural networks began competing with traditional statistical methods.
• Applications in speech recognition, character recognition, and early
computer vision emerged.
• 4. The Deep Learning Era Begins (2000s)
• 2006 – Geoffrey Hinton ("Deep Belief Networks")
Introduced layer-wise pretraining using Restricted
Boltzmann Machines. This eased the optimization difficulties of
training deep networks and revived interest in deep architectures.
• Growth of large datasets (e.g., MNIST, ImageNet) and
GPU computing fueled progress.
• 5. Breakthroughs and Dominance (2010s–Present)
• 2012 – ImageNet Moment (AlexNet by Krizhevsky, Sutskever & Hinton):
Deep Convolutional Neural Network (CNN) drastically outperformed
others in image classification.
→ Sparked global adoption of deep learning.
• RNNs & LSTMs: Advanced sequence modeling for speech, language, and
time-series.
• GANs (2014, Goodfellow): Generated realistic images and media.
• Transformers (2017, Vaswani et al.): Revolutionized Natural Language
Processing (NLP).
→ Enabled large language models (LLMs) like GPT, BERT, etc.
• 6. Modern Deep Learning (2020s–Future)
• LLMs (GPT, PaLM, LLaMA, etc.): Foundation models
capable of text, code, reasoning, and multimodal tasks.
• Multimodal AI: Models that integrate text, vision, audio
(e.g., CLIP, DALL·E, Gemini).
• Efficient Training: Work on reducing computational cost
(quantization, distillation).
• AI Ethics & Safety: Growing focus on fairness,
interpretability, and responsible deployment.
A Probabilistic Theory of Deep
Learning
▪ As machine learning models grow in complexity and impact across various
domains, traditional deterministic approaches are increasingly limited in
their ability to handle real-world uncertainties.
▪ Probabilistic deep learning addresses these challenges by incorporating
probability into model architectures, enabling more robust and reliable
predictions in applications ranging from autonomous vehicles to medical
diagnostics.
▪ Deterministic deep learning models optimize a scalar-valued loss function,
providing a single prediction for each input. These models aim to minimize
prediction error by adjusting their parameters to reduce this loss metric,
resulting in precise, concrete outputs that align with their training data.
However, they lack the ability to express uncertainty about their
predictions, which can be crucial in applications where output confidence
is as important as the prediction itself.
• Probabilistic deep learning models, on the other hand, optimize a
probabilistic objective function, offering a more nuanced approach to
prediction.
• By characterizing predictions with probability distributions rather than
fixed values, these models can quantify uncertainty in their outputs. This
capability makes them particularly valuable in domains where decisions
must account for probabilistic outcomes, such as financial forecasting or
medical diagnosis.
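A minimal sketch of the difference, assuming a regression model that outputs a Gaussian mean and standard deviation instead of a single value. The function name `gaussian_nll` and the numbers are illustrative assumptions, not part of the original slides. The probabilistic objective here is the negative log-likelihood of the observed target under the predicted distribution.

```python
import math

def gaussian_nll(y, mean, std):
    """Negative log-likelihood of target y under N(mean, std^2)."""
    var = std ** 2
    return 0.5 * math.log(2 * math.pi * var) + (y - mean) ** 2 / (2 * var)

# Two hypothetical predictions with the same point estimate (mean = 2.0)
# for a target of 3.0, but with different stated uncertainty.
confident = gaussian_nll(y=3.0, mean=2.0, std=0.1)  # narrow distribution
cautious = gaussian_nll(y=3.0, mean=2.0, std=1.0)   # wide distribution

# The overconfident prediction is penalized far more for the same error.
print(confident > cautious)
```

A deterministic squared-error loss would score both predictions identically, because it only sees the point estimate.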
• The integration of probability into deep learning systems provides several
benefits. Probability-based approaches, such as Bayesian and graphical
models, let a network represent noisy real-world data and its own
uncertainty explicitly. Regularization also plays a crucial role in improving
generalization by reducing variance while keeping training error low;
batch normalization, covered later in this unit, has a mild regularizing
side effect of this kind.
Categories Of Probabilistic Models
• Generative models
• Discriminative models
• Graphical models
• Generative models:
Generative models aim to model the joint distribution of the input and
output variables. These models generate new data based on the
probability distribution of the original dataset. Generative models are
powerful because they can generate new data that resembles the
training data. They can be used for tasks such as image and speech
synthesis, language translation, and text generation.
• Discriminative models
The discriminative model aims to model the conditional distribution of
the output variable given the input variable. They learn a decision
boundary that separates the different classes of the output variable.
Discriminative models are useful when the focus is on making accurate
predictions rather than generating new data. They can be used for
tasks such as image recognition and speech recognition.
• Graphical models
These models use graphical representations to show the conditional
dependence between variables. They are commonly used for tasks such as
image recognition, natural language processing, and causal inference.
Naive Bayes Algorithm in Probabilistic Models
• The Naive Bayes algorithm is a widely used approach in
probabilistic models, demonstrating remarkable efficiency
and effectiveness in solving classification problems.
• By leveraging the power of the Bayes theorem and making
simplifying assumptions about feature independence, the
algorithm calculates the probability of the target class given
the feature set.
• This method has found diverse applications across various
industries, ranging from spam filtering to medical diagnosis.
Despite its simplicity, the Naive Bayes algorithm has proven to
be highly robust, providing rapid results in a multitude of real-
world problems.
The algorithm works as follows:
• Step 1: Collect a labeled dataset of samples, where each sample has a set of
features and a class label.
• Step 2: For each feature in the dataset, calculate the conditional probability
of the feature given the class. This is done by counting the number of times
the feature occurs in samples of the class and dividing by the total number
of samples in the class.
• Step 3: Calculate the prior probability of each class by counting the number of
samples in each class and dividing by the total number of samples in the
dataset.
• Step 4: Given a new sample with a set of features, calculate the posterior
probability of each class using the Bayes theorem and the conditional
probabilities and prior probabilities calculated in steps 2 and 3.
• Step 5: Select the class with the highest posterior probability as the predicted
class for the new sample.
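The procedure above can be sketched for categorical features as follows. This is a minimal illustration under stated assumptions: the toy spam/ham data, the function names, and the choice to use raw counts without smoothing are all hypothetical. Practical implementations usually add Laplace smoothing so that an unseen feature value does not force a probability of zero.

```python
from collections import Counter, defaultdict

def train_naive_bayes(samples, labels):
    """Estimate priors P(class) and likelihoods P(feature_j = v | class) by counting."""
    class_counts = Counter(labels)
    total = len(labels)
    priors = {c: n / total for c, n in class_counts.items()}
    # feature_counts[c][(j, v)] = number of class-c samples whose j-th feature is v
    feature_counts = defaultdict(Counter)
    for x, c in zip(samples, labels):
        for j, v in enumerate(x):
            feature_counts[c][(j, v)] += 1
    likelihoods = {
        c: {key: n / class_counts[c] for key, n in counts.items()}
        for c, counts in feature_counts.items()
    }
    return priors, likelihoods

def predict(priors, likelihoods, x):
    """Return (best class, posterior per class) using Bayes' theorem."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for j, v in enumerate(x):
            score *= likelihoods[c].get((j, v), 0.0)  # naive independence assumption
        scores[c] = score
    evidence = sum(scores.values())
    posteriors = {c: (s / evidence if evidence > 0 else 0.0) for c, s in scores.items()}
    return max(posteriors, key=posteriors.get), posteriors

# Hypothetical toy data: (first word, whether the message contains a link)
samples = [("offer", "link"), ("offer", "nolink"), ("hello", "nolink"),
           ("hello", "link"), ("hello", "nolink")]
labels = ["spam", "spam", "ham", "ham", "ham"]

priors, likelihoods = train_naive_bayes(samples, labels)
label, posteriors = predict(priors, likelihoods, ("offer", "link"))
print(label, posteriors)
```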
Backpropagation and Regularization
• The Backpropagation Algorithm
• Backpropagation works by propagating the error backward through
the network, layer by layer. This process involves several key steps:
• Forward pass: Input data is fed through the network to generate
predictions.
• Error calculation: The difference between the predicted output and
the actual target is computed.
• Backward pass: The error is propagated backwards through the
network.
• Gradient calculation: The algorithm calculates how much each
weight and bias contributes to the error.
• Parameter update: The weights and biases are adjusted to reduce
the error.
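The five steps can be traced on a deliberately tiny network. The sketch below assumes one input, one sigmoid hidden unit, one linear output, and a squared-error loss; the parameter values and names are illustrative only and are not taken from the slides.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(params, x):
    """Forward pass: returns the hidden activation and the prediction."""
    w1, b1, w2, b2 = params
    h = sigmoid(w1 * x + b1)   # hidden layer
    y_hat = w2 * h + b2        # linear output layer
    return h, y_hat

def loss(params, x, y):
    _, y_hat = forward(params, x)
    return 0.5 * (y_hat - y) ** 2

def backward(params, x, y):
    """Backward pass: gradients of the loss with respect to each parameter."""
    w1, b1, w2, b2 = params
    h, y_hat = forward(params, x)
    d_yhat = y_hat - y           # error at the output
    d_w2 = d_yhat * h            # gradient for the output weight
    d_b2 = d_yhat                # gradient for the output bias
    d_h = d_yhat * w2            # error propagated back to the hidden unit
    d_z = d_h * h * (1.0 - h)    # through the sigmoid derivative
    d_w1 = d_z * x               # gradient for the hidden weight
    d_b1 = d_z                   # gradient for the hidden bias
    return [d_w1, d_b1, d_w2, d_b2]

def update(params, grads, lr=0.1):
    """Parameter update: step against the gradient."""
    return [p - lr * g for p, g in zip(params, grads)]

params = [0.5, -0.3, 0.8, 0.1]   # w1, b1, w2, b2 (arbitrary starting values)
x, y = 1.5, 1.0                  # a single training example
loss_before = loss(params, x, y)
for _ in range(100):
    params = update(params, backward(params, x, y))
loss_after = loss(params, x, y)
print(loss_before, loss_after)
```

Each line of `backward` is one application of the chain rule, moving from the output back toward the input.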
• The Chain Rule and Partial Derivatives
• At the core of backpropagation is the chain rule from calculus. This
mathematical principle allows the algorithm to compute how changes in
one layer affect the previous layers, enabling the network to distribute the
error and make appropriate adjustments.
• Optimization Techniques
• While backpropagation provides the mechanism for computing gradients,
optimization algorithms determine how to use this information to update
the network's parameters effectively.
• Gradient Descent
• The most fundamental optimization algorithm is gradient
descent. It updates the parameters in the opposite
direction of the gradient, taking steps proportional to the
gradient's magnitude. There are several variants of gradient
descent:
• Batch Gradient Descent: Uses the entire dataset to
compute gradients.
• Stochastic Gradient Descent (SGD): Updates parameters
using a single training example at a time.
• Mini-batch Gradient Descent: A compromise between
batch and stochastic methods, using small batches of data.
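The three variants differ only in how many examples feed each gradient estimate. The sketch below makes that a single `batch_size` parameter, assuming a one-variable linear model fitted to noiseless data from y = 2x + 1. The model, data, and hyperparameters are illustrative assumptions.

```python
import random

def gradient(w, b, batch):
    """Gradient of the mean squared error for y ≈ w*x + b over one batch."""
    n = len(batch)
    gw = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    gb = sum(2 * (w * x + b - y) for x, y in batch) / n
    return gw, gb

def fit(data, batch_size, lr=0.05, epochs=200, seed=0):
    """batch_size = len(data): batch GD; 1: SGD; anything between: mini-batch GD."""
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            gw, gb = gradient(w, b, data[start:start + batch_size])
            w -= lr * gw   # step opposite to the gradient
            b -= lr * gb
    return w, b

data = [(x / 10, 2 * (x / 10) + 1) for x in range(-10, 11)]  # 21 points on y = 2x + 1

for name, size in [("batch", len(data)), ("stochastic", 1), ("mini-batch", 4)]:
    print(name, fit(data, size))
```

Smaller batches give more parameter updates per pass over the data, at the cost of noisier gradient estimates.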
• Advanced Optimization Algorithms
• Modern neural networks often employ more sophisticated
optimization techniques to overcome limitations of basic
gradient descent:
• Momentum: Accelerates convergence and helps the optimizer move
through shallow local minima and flat regions.
• AdaGrad: Adapts the learning rate for each parameter based
on historical gradients.
• RMSprop: Similar to AdaGrad but addresses some of its
limitations.
• Adam: Combines ideas from momentum and RMSprop for
efficient optimization.
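As a sketch of how Adam combines the two ideas, the code below keeps a momentum-style running mean of the gradient and an RMSprop-style running mean of its square. It assumes a single scalar parameter and the commonly used default coefficients; the test function f(w) = (w - 3)^2 is a hypothetical example.

```python
import math

def adam_minimize(grad_fn, w0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Minimize a scalar function given its gradient, using the Adam update rule."""
    w, m, v = w0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g        # first moment (momentum-like)
        v = beta2 * v + (1 - beta2) * g * g    # second moment (RMSprop-like)
        m_hat = m / (1 - beta1 ** t)           # bias correction for early steps
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

# f(w) = (w - 3)^2 has gradient 2 * (w - 3) and its minimum at w = 3.
w_star = adam_minimize(lambda w: 2 * (w - 3.0), w0=0.0)
print(w_star)
```

Setting `beta1 = 0` removes the momentum term and leaves an RMSprop-style update, which is one way to see how the methods relate.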
• Learning Rate Schedules
• The learning rate, which determines the size of parameter updates,
plays a crucial role in optimization. Learning rate schedules, such as
step decay, exponential decay, and cyclical learning rates, can
improve convergence and final performance.
• Challenges and Advanced Concepts
• Deep neural networks can suffer from vanishing or exploding
gradients, where the gradients become extremely small or large as
they propagate through many layers. Techniques like careful
initialization, normalized activation functions, and gradient clipping
help mitigate these issues.
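Two of the schedules and gradient clipping are simple enough to write out directly. The sketch below assumes the usual textbook forms; the default constants (`drop`, `every`, `k`) are illustrative choices, not recommended settings.

```python
import math

def step_decay(lr0, epoch, drop=0.5, every=10):
    """Multiply the learning rate by `drop` once every `every` epochs."""
    return lr0 * (drop ** (epoch // every))

def exponential_decay(lr0, epoch, k=0.1):
    """Shrink the learning rate smoothly as lr0 * exp(-k * epoch)."""
    return lr0 * math.exp(-k * epoch)

def clip_by_norm(grads, max_norm):
    """Rescale the gradient vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

print(step_decay(0.1, epoch=25))         # two drops have happened by epoch 25
print(exponential_decay(0.1, epoch=10))
print(clip_by_norm([3.0, 4.0], max_norm=1.0))
```

Clipping by norm keeps the direction of the gradient and limits only its length, which is why it helps with exploding gradients without changing where the update points.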
➢ Regularization: To prevent overfitting, various regularization
techniques are employed alongside backpropagation and
optimization:
• L1 and L2 regularization
• Dropout
• Early stopping
➢ Backpropagation and optimization form the backbone of
modern neural network training. As the field of deep learning
continues to advance, researchers are constantly refining
these techniques and developing new approaches to make
neural networks more efficient, accurate, and capable of
tackling increasingly complex problems.
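The three regularization techniques listed above can each be sketched in a few lines. The code below is a minimal illustration that assumes plain Python lists for weights and activations; the function names, the inverted-dropout formulation, and the patience-based stopping rule are common conventions chosen for the example.

```python
import random

def l1_penalty(weights, lam):
    """L1 term added to the loss: encourages sparse weights."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """L2 term added to the loss: discourages large weights."""
    return lam * sum(w * w for w in weights)

def dropout(activations, p_drop, rng, training=True):
    """Inverted dropout: zero each unit with probability p_drop and rescale the rest."""
    if not training or p_drop == 0.0:
        return list(activations)
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

def early_stopping_epoch(val_losses, patience):
    """Return the epoch with the best validation loss, stopping after
    `patience` consecutive epochs without improvement."""
    best, best_epoch, waited = float("inf"), -1, 0
    for epoch, val in enumerate(val_losses):
        if val < best:
            best, best_epoch, waited = val, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch

rng = random.Random(0)
print(l2_penalty([1.0, -2.0], lam=0.1))
print(dropout([1.0, 1.0, 1.0, 1.0], p_drop=0.5, rng=rng))
print(early_stopping_epoch([1.0, 0.8, 0.7, 0.75, 0.72, 0.9, 0.6], patience=2))
```

Rescaling by `1 / keep` during training means the layer needs no adjustment at inference time, when dropout is switched off.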
Batch Normalization
• Batch Normalization is used to reduce the problem of internal covariate
shift in neural networks. It works by normalizing the data within each
mini-batch: it calculates the mean and variance of the data in a batch and
adjusts the values so that they have a similar range. It then scales and
shifts the values so that the model can learn effectively.
• In traditional neural networks, as the input data propagates through the
network, the distribution of each layer's inputs changes during training.
This phenomenon is known as internal covariate shift, and it can slow down
the training process. Batch Normalization aims to reduce this issue by
normalizing the inputs of each layer.
➢ This process keeps the inputs to each layer of the network in a stable
range even if the outputs of earlier layers change during training. As a
result, training becomes faster and more stable.
➢ Need of Batch Normalization
• Batch Normalization keeps the outputs of each layer steady as the
model learns, which helps the model train faster and learn more effectively.
• Reduces the problem of internal covariate shift.
• Makes training faster and more stable.
• Allows use of higher learning rates.
• Helps mitigate vanishing or exploding gradients.
• Can act like a regularizer and sometimes reduces the need for dropout.
➢ Fundamentals of Batch Normalization
• Step 1: Compute the Mean and Variance of Mini-Batches
• For a mini-batch of activations $x_1, x_2, \ldots, x_m$, the mean $\mu_B$ and
variance $\sigma_B^2$ of the mini-batch are computed:
$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2$$
• Step 2: Normalization
• Each activation $x_i$ is normalized using the computed mean and
variance of the mini-batch. The normalization process subtracts the
mean $\mu_B$ from each activation and divides by the square root of
the variance $\sigma_B^2$, so that the normalized activations have
zero mean and unit variance:
$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$
• The small constant $\epsilon$ is added to the variance in the denominator for
numerical stability, particularly to prevent division by zero.
• Step 3: Scale and Shift the Normalized Activations
The normalized activations $\hat{x}_i$ are then scaled by a learnable
parameter $\gamma$ and shifted by another learnable parameter $\beta$:
$$y_i = \gamma \hat{x}_i + \beta$$
These parameters allow the model to learn the optimal scaling and shifting
of the normalized activations, giving the network additional flexibility.
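The three steps can be sketched for a single feature over one mini-batch. This minimal version assumes training-time behavior only (statistics computed from the current batch); the running averages that frameworks keep for inference are left out, and the function name and default values are illustrative.

```python
import math

def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Batch-normalize one feature across a mini-batch of activations xs."""
    m = len(xs)
    mu = sum(xs) / m                                   # Step 1: batch mean
    var = sum((x - mu) ** 2 for x in xs) / m           # Step 1: batch variance
    x_hat = [(x - mu) / math.sqrt(var + eps) for x in xs]  # Step 2: normalize
    return [gamma * xh + beta for xh in x_hat]         # Step 3: scale and shift

batch = [1.0, 2.0, 3.0, 4.0]
normalized = batch_norm(batch)
print(normalized)
print(batch_norm(batch, gamma=2.0, beta=0.5))
```

With `gamma = 1` and `beta = 0` the output has mean close to 0 and variance close to 1; other values let the layer recover whatever scale and offset training finds useful.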
➢ Benefits of Batch Normalization
• Faster Convergence: Batch Normalization reduces internal covariate
shift, allowing for faster convergence during training.
• Higher Learning Rates: With Batch Normalization, higher learning
rates can be used with less risk of divergence.
• Regularization Effect: Batch Normalization introduces a slight
regularization effect that can reduce the need for other regularization
techniques like dropout.
VC Dimension and Neural Nets
➢ The VC Dimension is a fundamental concept in statistical
learning theory that measures the capacity of a hypothesis
space: it is the size of the largest set of points that the model
can shatter, that is, classify correctly under every possible
assignment of labels.
• Higher VC Dimension → More complex model → Risk of
overfitting
• Lower VC Dimension → Simpler model → Risk of
underfitting
• A model with optimal VC Dimension strikes a balance
between bias and variance, ensuring better generalization on
unseen data.
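Shattering can be checked by brute force for very simple hypothesis classes. The sketch below assumes two classes of classifiers on the real line, thresholds and closed intervals, and tests whether they realize every labeling of a given point set. The function names and point sets are illustrative assumptions.

```python
def candidate_cuts(points):
    """Cut positions below, between, and above the sorted points."""
    xs = sorted(points)
    return [xs[0] - 1] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1]

def threshold_hypotheses(points):
    """Classifiers h_t(x) = 1 if x >= t, one per distinct cut position."""
    return [lambda x, t=t: int(x >= t) for t in candidate_cuts(points)]

def interval_hypotheses(points):
    """Classifiers h(x) = 1 if lo <= x <= hi (empty when lo > hi)."""
    cuts = candidate_cuts(points)
    return [lambda x, lo=lo, hi=hi: int(lo <= x <= hi) for lo in cuts for hi in cuts]

def shatters(hypotheses, points):
    """True if the hypotheses realize all 2^n labelings of the points."""
    realized = {tuple(h(x) for x in points) for h in hypotheses}
    return len(realized) == 2 ** len(points)

one, two, three = [0.0], [0.0, 1.0], [0.0, 1.0, 2.0]
print(shatters(threshold_hypotheses(one), one))        # thresholds on 1 point
print(shatters(threshold_hypotheses(two), two))        # thresholds on 2 points
print(shatters(interval_hypotheses(two), two))         # intervals on 2 points
print(shatters(interval_hypotheses(three), three))     # intervals on 3 points
```

Thresholds cannot label the left point 1 and the right point 0, and intervals cannot label the outer points 1 while labeling the middle point 0, which is consistent with VC dimensions of 1 and 2 for these two classes.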
➢ Why is VC Dimension Important
• Helps in evaluating model complexity.
• Helps avoid overfitting and underfitting.
• Provides insights into the learning capacity of algorithms.
• Essential for understanding generalization bounds.
➢ VC Dimension & Overfitting
• If a model has a very high VC Dimension, it can memorize the training data
but fail to generalize well.
• Example: A deep neural network with too many parameters might fit
noise instead of learning the underlying pattern.
➢ VC Dimension & Underfitting
• If a model has a low VC Dimension, it lacks the flexibility to learn complex
patterns.
• Example: A linear model trying to fit a highly nonlinear dataset.
➢ Finding the Right VC Dimension
• Machine learning models must balance their VC Dimension for optimal
accuracy and generalization.
• Techniques like regularization, pruning, and hyperparameter tuning help in
finding the right trade-off.
➢ Real-World Applications of VC Dimension
• Deep Learning & Neural Networks → Helps in designing optimal
architectures.
• Support Vector Machines (SVMs) → Used in choosing the right kernel and
margin.
• Decision Trees & Ensemble Methods → Guides in setting the depth and
complexity.
Deep vs Shallow Networks
| Shallow Neural Networks | Deep Neural Networks |
|---|---|
| Few layers (usually 1 hidden layer). | Many layers (multiple hidden layers). |
| Complexity is low. | Complexity is high. |
| Limited learning capacity. | Higher learning capacity. |
| Lower risk of overfitting. | Higher risk of overfitting. |
| Requires less data. | Requires more data for effective training. |
| Fewer parameters. | Many more parameters. |
| Requires fewer computational resources. | Requires more computational resources (e.g., GPUs). |
| Easier to interpret. | More difficult to interpret. |
Example of a deep network: Convolutional Neural Networks.
Convolutional Neural Networks
• A Convolutional Neural Network (CNN) is a specialized type of
artificial neural network (ANN), primarily designed to extract features
from grid-like data. This is particularly useful for visual data
such as images or videos, where spatial patterns play a crucial
role. CNNs are widely used in computer vision applications
due to their effectiveness in processing visual data.
• CNNs consist of multiple layers, such as the input layer,
convolutional layers, pooling layers, and fully connected layers.
➢ How Convolutional Layers Work
• Convolutional Neural Networks are neural networks that share their
parameters across spatial positions.
• Imagine you have an image. It can be represented as a cuboid having a
length and width (the dimensions of the image) and a height (the channels,
as images generally have red, green, and blue channels).
➢ Now imagine taking a small patch of this image and running a
small neural network, called a filter or kernel, on it, with, say, K
outputs, and representing them vertically.
• Now slide that neural network across the whole image. As a
result, we get another image with a different width,
height, and depth. Instead of just R, G, and B channels, we now
have more channels but a smaller width and height. This
operation is called convolution. If the patch size were the same
as that of the image, it would be a regular neural network.
Because the patch is small, we have fewer weights.
➢ Mathematical Overview of Convolution
• This section covers the mathematics involved in the
convolution process.
• Convolution layers consist of a set of learnable filters (or kernels) that have
small widths and heights and the same depth as the input volume (3 if
the input is an RGB image).
• For example, to run a convolution on an image with dimensions
34 x 34 x 3, the filters can have size a x a x 3, where a can be
3, 5, or 7, but is small compared to the image dimensions.
• During the forward pass, we slide each filter across the whole input
volume step by step. The step size is called the stride (typically 1 or 2,
sometimes 3 or 4 for high-dimensional images). At each position we
compute the dot product between the kernel weights and the
corresponding patch of the input volume.
• Sliding one filter produces a 2-D output. Stacking the outputs of all
filters gives an output volume whose depth equals
the number of filters. The network learns all the filters.
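The forward pass described above can be sketched with nested loops. This is a minimal, unoptimized illustration that assumes no padding, height x width x channel nested lists, and the sliding dot product used by most deep learning libraries (technically cross-correlation, since the kernel is not flipped). The helper `conv_output_size` uses the standard size formula floor((n + 2p - k) / s) + 1; all names are illustrative.

```python
def conv_output_size(n, k, stride=1, padding=0):
    """Spatial output size for input size n, kernel size k."""
    return (n + 2 * padding - k) // stride + 1

def conv2d(image, filters, stride=1):
    """image: H x W x C nested lists; filters: list of k x k x C nested lists."""
    H, W, C = len(image), len(image[0]), len(image[0][0])
    k = len(filters[0])
    out_h = conv_output_size(H, k, stride)
    out_w = conv_output_size(W, k, stride)
    out = [[[0.0] * len(filters) for _ in range(out_w)] for _ in range(out_h)]
    for f_idx, f in enumerate(filters):
        for i in range(out_h):
            for j in range(out_w):
                total = 0.0
                # Dot product between the kernel and the patch under it
                for di in range(k):
                    for dj in range(k):
                        for c in range(C):
                            total += image[i * stride + di][j * stride + dj][c] * f[di][dj][c]
                out[i][j][f_idx] = total
    return out

# A 34 x 34 input with a 3 x 3 filter and stride 1 gives a 32 x 32 output.
print(conv_output_size(34, 3))

# Small demonstration: 6 x 6 x 3 image of ones, two 3 x 3 x 3 filters of ones.
image = [[[1.0] * 3 for _ in range(6)] for _ in range(6)]
filters = [[[[1.0] * 3 for _ in range(3)] for _ in range(3)] for _ in range(2)]
result = conv2d(image, filters)
print(len(result), len(result[0]), len(result[0][0]))
```

The output depth equals the number of filters, and every output position reuses the same kernel weights, which is the parameter sharing that keeps the weight count low.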
➢ Layers Used to Build ConvNets
• A complete Convolutional Neural Network architecture is also known as a
ConvNet. A ConvNet is a sequence of layers, and every layer transforms one
volume to another through a differentiable function.
• Let’s take an example by running a ConvNet on an image of dimension 32 x
32 x 3.
• Input Layer: This is the layer in which we give input to our model. In a CNN,
the input is generally an image or a sequence of images. This layer
holds the raw input of the image with width 32, height 32, and depth 3.
• Convolutional Layer: This is the layer that extracts features
from the input. It applies a set of learnable filters,
known as kernels, to the input images. The filters/kernels are
small matrices, usually of shape 2x2, 3x3, or 5x5. Each filter slides over
the input image data and computes the dot product between the kernel
weights and the corresponding input image patch. The outputs of this
layer are referred to as feature maps. Suppose we use a total of 12
filters for this layer, with padding that preserves the width and height;
we then get an output volume of dimension 32 x 32 x 12.
• Activation Layer: By adding an activation function to the output of
the preceding layer, activation layers add nonlinearity to the
network. They apply an element-wise activation function to the
output of the convolution layer. Some common activation functions
are ReLU: max(0, x), Tanh, Leaky ReLU, etc. The volume remains
unchanged, hence the output volume has dimensions 32 x 32 x 12.
• Pooling layer: This layer is periodically inserted in the ConvNet. Its main
function is to reduce the size of the volume, which speeds up computation,
reduces memory use, and helps limit overfitting. Two common types of
pooling layers are max pooling and average pooling. If we use a max pool
with 2 x 2 filters and stride 2, the resultant volume will be of dimension
16x16x12.
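A minimal sketch of max pooling, assuming the same height x width x channel nested-list layout and no padding. The function name and the synthetic input volume are illustrative assumptions; the example reproduces the 32 x 32 x 12 to 16 x 16 x 12 reduction described above.

```python
def max_pool2d(volume, size=2, stride=2):
    """Max pooling applied independently to each channel of an H x W x C volume."""
    H, W, C = len(volume), len(volume[0]), len(volume[0][0])
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    return [
        [
            [
                max(
                    volume[i * stride + di][j * stride + dj][c]
                    for di in range(size)
                    for dj in range(size)
                )
                for c in range(C)
            ]
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

# Synthetic 32 x 32 x 12 volume whose values depend on position and channel.
volume = [[[float(i + j + c) for c in range(12)] for j in range(32)] for i in range(32)]
pooled = max_pool2d(volume, size=2, stride=2)
print(len(pooled), len(pooled[0]), len(pooled[0][0]))
```

Pooling has no learnable parameters and leaves the depth unchanged; only the width and height shrink.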
➢ Flattening: After the convolution and pooling layers, the resulting feature
maps are flattened into a one-dimensional vector so they can be
passed into a fully connected layer for classification or regression.
• Fully Connected Layers: These take the input from the previous layer and
compute the final classification or regression output.
➢ Output Layer: For classification tasks, the output from the fully connected
layers is fed into a function such as sigmoid or softmax, which
converts the score of each class into a probability.
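The flattening and output steps can be sketched as follows. This assumes the nested-list volume layout used for illustration elsewhere in this unit and shows only the final functions, not a trained fully connected layer; the logits are made-up numbers.

```python
import math

def flatten(volume):
    """Flatten an H x W x C nested list into a one-dimensional list."""
    return [v for row in volume for pixel in row for v in pixel]

def sigmoid(z):
    """Maps a single score to a probability, for binary classification."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    """Maps a list of class scores to probabilities that sum to 1."""
    m = max(logits)                            # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

volume = [[[0.0] * 12 for _ in range(16)] for _ in range(16)]
print(len(flatten(volume)))                    # 16 * 16 * 12 values

probs = softmax([2.0, 1.0, 0.1])               # hypothetical scores for three classes
print(probs, sum(probs))
```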
Generative Adversarial Networks
(GANs)
➢ Generative Adversarial Networks (GANs) help machines create new,
realistic data by learning from existing examples.
➢ They were introduced by Ian Goodfellow and his colleagues in 2014 and
have transformed how computers generate images, videos, music and more.
Unlike traditional models that only recognize or classify data, they take a
creative approach by generating entirely new content that closely resembles
real-world data.
➢ This ability has helped various fields such as art, gaming, healthcare and
data science.
➢ Architecture of GAN
A GAN consists of two main models that work together to create
realistic synthetic data:
➢ Generator Model
The generator is a deep neural network that takes random noise
as input to generate realistic data samples like images or text. It
learns the underlying data patterns by adjusting its internal
parameters during training through backpropagation. Its
objective is to produce samples that the discriminator classifies
as real.
• Generator Loss Function: The generator tries to minimize this loss. A common
form, consistent with the definitions below, is
$$J_G = -\frac{1}{m}\sum_{i=1}^{m} \log D(G(z_i))$$
where
$J_G$ measures how well the generator is fooling the discriminator.
$G(z_i)$ is the generated sample from random noise $z_i$.
$D(G(z_i))$ is the discriminator’s estimated probability that the generated
sample is real.
The generator aims to maximize $D(G(z_i))$, meaning it wants the discriminator to
classify its fake data as real (probability close to 1).
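A minimal numeric sketch of the generator objective, assuming the widely used non-saturating form J_G = -(1/m) Σ log D(G(z_i)). The discriminator outputs below are made-up probabilities, not results from a trained model, and the function name is illustrative.

```python
import math

def generator_loss(d_on_fake):
    """Mean of -log D(G(z_i)) over a batch of discriminator outputs on fake samples."""
    m = len(d_on_fake)
    return -sum(math.log(p) for p in d_on_fake) / m

fooled = generator_loss([0.9, 0.8, 0.95])    # discriminator mostly believes the fakes
caught = generator_loss([0.1, 0.2, 0.05])    # discriminator mostly rejects the fakes
print(fooled, caught)
```

The loss falls toward 0 as D(G(z_i)) approaches 1, so minimizing it is the same as pushing the discriminator to label generated samples as real.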
➢ Discriminator Model
• The discriminator acts as a binary classifier that distinguishes
between real and generated data. It learns to improve its
classification ability through training, refining its parameters to
detect fake samples more accurately. When dealing with image
data, the discriminator uses convolutional layers or other relevant
architectures, which help to extract features and enhance the
model’s ability to tell real samples from fake ones.
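Because the discriminator is a binary classifier, its objective can be sketched as a binary cross-entropy. The form below, -(1/m) Σ log D(x_i) - (1/m) Σ log(1 - D(G(z_i))), is the standard one and is stated here as an assumption, since the slides do not spell it out. The probabilities are made-up values.

```python
import math

def discriminator_loss(d_on_real, d_on_fake):
    """Binary cross-entropy: real samples should score near 1, fakes near 0."""
    real_term = -sum(math.log(p) for p in d_on_real) / len(d_on_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_on_fake) / len(d_on_fake)
    return real_term + fake_term

sharp = discriminator_loss(d_on_real=[0.9, 0.95], d_on_fake=[0.1, 0.05])
guessing = discriminator_loss(d_on_real=[0.5, 0.5], d_on_fake=[0.5, 0.5])
print(sharp, guessing)
```

A discriminator that outputs 0.5 everywhere cannot tell the two sources apart; in the idealized analysis of GAN training, that is where the game settles once the generator matches the data distribution.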
Semi-Supervised Learning
• Semi-supervised learning is a branch of machine learning that
combines supervised and unsupervised learning by using both labeled and
unlabeled data to train AI models for classification and regression tasks.
• Though semi-supervised learning is generally employed for the same use cases
in which one might otherwise use supervised learning methods, it’s
distinguished by various techniques that incorporate unlabeled data into
model training, in addition to the labeled data required for conventional
supervised learning.
• Semi-supervised learning methods are especially relevant in situations where
obtaining a sufficient amount of labeled data is prohibitively difficult or
expensive, but large amounts of unlabeled data are relatively easy to acquire.
In such scenarios, neither fully supervised nor unsupervised learning methods
will provide adequate solutions.
➢ Labeled data and machine learning
• Training AI models for prediction tasks like classification or
regression typically requires labeled data: annotated data
points that provide necessary context and demonstrate the
correct predictions (output) for each sample input. During
training, a loss function measures the difference (loss)
between the model’s predictions for a given input and the
“ground truth” provided by that input’s label.
Models learn from these labeled examples by using
techniques like gradient descent that update model weights
to minimize loss. Because this machine learning process
actively involves humans, it is called “supervised” learning.
• Properly labeling data becomes increasingly labor-
intensive for complex AI tasks. For example, to train an
image classification model to differentiate between
cars and motorcycles, hundreds (if not thousands) of
training images must be labeled “car” or “motorcycle”;
for a more detailed CV task, like object detection,
humans must not only annotate the object(s) each
image contains, but where each object is located; for
even more detailed tasks, like image segmentation,
data labels must annotate specific pixel-by-pixel
boundaries of different image segments for each
image.
• Labeling data can thus be particularly tedious for certain use cases.
In more specialized machine learning use cases, like drug discovery,
genetic sequencing or protein classification, data annotation is not
only extremely time-consuming, but also requires very specific
domain expertise.
• Semi-supervised learning offers a way to extract maximum benefit
from a scarce amount of labeled data while also making use of
relatively abundant unlabeled data.
Semi-supervised learning vs
supervised learning
➢ The primary distinction between semi- and fully supervised machine
learning is that the latter can only be trained using fully labeled datasets,
whereas the former uses both labeled and unlabeled data samples in the
training process. Semi-supervised learning techniques modify or
supplement a supervised algorithm—called the “base learner,” in this
context—to incorporate information from unlabeled examples. Labeled
data points are used to ground the base learner’s predictions and add
structure (like how many classes exist and the basic characteristics of
each) to the learning problem.
➢ The goal in training any classification model is for it to learn an
accurate decision boundary: a line (or, for data with more than two
dimensions, a “surface” or hyperplane) that separates data points of one
classification category from data points belonging to a different
classification category. Though a fully supervised classification model can
technically learn a decision boundary using only a few labeled data points,
it might not generalize well to real-world examples, making the model's
predictions unreliable.
• The classic “half-moons” dataset visualizes the shortcomings of supervised
models relying on too few labeled data points. Though the “correct”
decision boundary would separate each of the two half-moons, a
supervised learning model is likely to overfit the few labeled data points
available. The unlabeled data points clearly convey helpful context, but a
traditional supervised algorithm cannot process unlabeled data.
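The half-moons situation can be reproduced in a small sketch. It assumes a noiseless two-moons dataset generated in code, a 1-nearest-neighbor base learner, and self-training (pseudo-labeling) as the semi-supervised technique. These are illustrative choices, one of several ways to use unlabeled data, and all function names are hypothetical.

```python
import math

def make_moons(n_per_moon=20):
    """Two interleaving half circles; returns the points and their true labels."""
    angles = [math.pi * i / (n_per_moon - 1) for i in range(n_per_moon)]
    upper = [(math.cos(a), math.sin(a)) for a in angles]
    lower = [(1 - math.cos(a), 0.5 - math.sin(a)) for a in angles]
    return upper + lower, [0] * n_per_moon + [1] * n_per_moon

def nearest_label(point, labeled):
    """1-nearest-neighbor base learner: returns (distance, label) of the closest labeled point."""
    return min((math.dist(point, p), y) for p, y in labeled)

def self_train(labeled, unlabeled):
    """Repeatedly pseudo-label the unlabeled point the base learner is most
    confident about (the one closest to any labeled point)."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    while unlabeled:
        best = min(unlabeled, key=lambda u: nearest_label(u, labeled)[0])
        labeled.append((best, nearest_label(best, labeled)[1]))
        unlabeled.remove(best)
    return labeled

def accuracy(labeled, points, truth):
    hits = sum(nearest_label(p, labeled)[1] == t for p, t in zip(points, truth))
    return hits / len(points)

points, truth = make_moons()
seed = [(points[0], 0), (points[20], 1)]       # one labeled point per moon
unlabeled = [p for i, p in enumerate(points) if i not in (0, 20)]

supervised_acc = accuracy(seed, points, truth)                    # labels only
semi_acc = accuracy(self_train(seed, unlabeled), points, truth)   # labels + unlabeled
print(supervised_acc, semi_acc)
```

With only the two labeled points, the nearest-neighbor boundary cuts straight across both moons. Self-training lets labels spread along each moon through the unlabeled points, because neighboring points on the same moon are much closer to each other than to the other moon.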