Module II: Deep Networks
1. Deep Feedforward Networks
Definition:
Also known as multilayer perceptrons (MLPs), deep feedforward networks are the
foundational architecture in deep learning. They consist of multiple layers where information
flows in one direction—from input to output—without cycles.
Structure:
Input Layer: Receives the raw data.
Hidden Layers: Perform computations and feature transformations.
Output Layer: Produces the final prediction.
Mathematical Representation:
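In a common notation (the symbols below are assumptions, not taken from the text above), a network with L layers, weight matrices W^{(l)}, bias vectors b^{(l)}, and activation functions g^{(l)} computes:

```latex
\begin{aligned}
h^{(0)} &= x, \\
h^{(l)} &= g^{(l)}\!\left(W^{(l)} h^{(l-1)} + b^{(l)}\right), \qquad l = 1, \dots, L, \\
\hat{y} &= h^{(L)}.
\end{aligned}
```

Here h^{(0)} is the input layer, h^{(1)}, ..., h^{(L-1)} are the hidden layers, and h^{(L)} is the output layer.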
2. Example: Learning XOR
The XOR (exclusive OR) problem is a classic example demonstrating the necessity of
non-linear models.
Problem Statement:
Inputs: Two binary variables
Output: 1 if inputs are different, 0 if they are the same
Challenge:
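A single-layer perceptron cannot solve the XOR problem because the two output classes are
not linearly separable: no single straight line separates the inputs (0,1) and (1,0) from
(0,0) and (1,1).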
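The claim can be checked empirically with a small sketch. It searches a coarse, arbitrarily chosen grid of weights and biases for a single threshold unit and reports the best number of XOR cases it gets right. The grid and the helper name count_correct are illustrative assumptions; a grid search is a demonstration, not a proof.

```python
from itertools import product

# XOR truth table: inputs and target outputs.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 1, 1, 0]

def count_correct(w1, w2, b):
    """Number of XOR cases a single threshold unit classifies correctly."""
    preds = [1 if w1 * x1 + w2 * x2 + b > 0 else 0 for x1, x2 in X]
    return sum(int(p == t) for p, t in zip(preds, y))

# Candidate values for each weight and the bias: -2.0, -1.5, ..., 2.0.
grid = [k / 2 for k in range(-4, 5)]
best = max(count_correct(w1, w2, b) for w1, w2, b in product(grid, repeat=3))
print(f"best single-unit result on XOR: {best}/4")  # prints 3/4
```

No setting on the grid classifies all four cases, which matches the argument that a linear decision boundary cannot separate the XOR classes.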
Solution:
Introduce a hidden layer to capture the non-linear relationship.
Network Architecture:
Input Layer: 2 neurons
Hidden Layer: 2 neurons with non-linear activation (e.g., sigmoid)
Output Layer: 1 neuron with sigmoid activation
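To see that this 2-2-1 architecture is expressive enough, the sketch below sets the weights by hand instead of learning them. The specific values are assumptions picked for illustration: the first hidden unit behaves roughly like OR, the second like NAND, and the output unit like AND of the two.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hand-picked (not learned) parameters, for illustration only.
# Each entry is ((weight for x1, weight for x2), bias).
HIDDEN = [((20.0, 20.0), -10.0),    # h1 behaves like OR(x1, x2)
          ((-20.0, -20.0), 30.0)]   # h2 behaves like NAND(x1, x2)
OUTPUT = ((20.0, 20.0), -30.0)      # output behaves like AND(h1, h2)

def xor_net(x1, x2):
    """Forward pass of the 2-2-1 sigmoid network."""
    h = [sigmoid(w[0] * x1 + w[1] * x2 + b) for w, b in HIDDEN]
    w_out, b_out = OUTPUT
    return sigmoid(w_out[0] * h[0] + w_out[1] * h[1] + b_out)

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
predictions = [round(xor_net(x1, x2)) for x1, x2 in inputs]
print(predictions)  # [0, 1, 1, 0]
```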
Training:
Using backpropagation and gradient descent, the network adjusts weights to minimize
the error between predicted and actual outputs.
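A minimal training sketch for the 2-2-1 network follows. The cross-entropy loss, full-batch updates, learning rate, epoch count, and random seed are all assumptions made for illustration. Whether training reaches a correct XOR solution depends on the initialization, so the example only reports how the loss changes.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# XOR truth table as (inputs, target) pairs.
DATA = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]

def init_params(seed=0):
    """Small random weights, zero biases."""
    rng = random.Random(seed)
    def u():
        return rng.uniform(-1.0, 1.0)
    return {"W1": [[u(), u()], [u(), u()]], "b1": [0.0, 0.0],
            "W2": [u(), u()], "b2": 0.0}

def forward(p, x):
    """Return hidden activations and the network output for one input."""
    h = [sigmoid(p["W1"][j][0] * x[0] + p["W1"][j][1] * x[1] + p["b1"][j])
         for j in range(2)]
    y_hat = sigmoid(p["W2"][0] * h[0] + p["W2"][1] * h[1] + p["b2"])
    return h, y_hat

def mean_loss(p):
    """Mean binary cross-entropy over the four XOR cases."""
    total = 0.0
    for x, y in DATA:
        _, y_hat = forward(p, x)
        y_hat = min(max(y_hat, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        total -= y * math.log(y_hat) + (1.0 - y) * math.log(1.0 - y_hat)
    return total / len(DATA)

def train_step(p, lr):
    """One full-batch gradient descent step using backpropagation."""
    gW1, gb1 = [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0]
    gW2, gb2 = [0.0, 0.0], 0.0
    for x, y in DATA:
        h, y_hat = forward(p, x)
        d_out = y_hat - y  # dLoss/d(output pre-activation) for sigmoid + cross-entropy
        gb2 += d_out
        for j in range(2):
            gW2[j] += d_out * h[j]
            d_h = d_out * p["W2"][j] * h[j] * (1.0 - h[j])  # chain rule into hidden unit j
            gb1[j] += d_h
            for i in range(2):
                gW1[j][i] += d_h * x[i]
    n = len(DATA)
    p["b2"] -= lr * gb2 / n
    for j in range(2):
        p["W2"][j] -= lr * gW2[j] / n
        p["b1"][j] -= lr * gb1[j] / n
        for i in range(2):
            p["W1"][j][i] -= lr * gW1[j][i] / n

params = init_params(seed=0)
initial_loss = mean_loss(params)
for _ in range(5000):
    train_step(params, lr=0.5)
final_loss = mean_loss(params)
print(f"loss: {initial_loss:.3f} -> {final_loss:.3f}")
```

If the final outputs do not round to the XOR targets, trying a different seed is a reasonable first step, since small sigmoid networks can stall on this problem.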
3. Gradient-Based Learning
Concept:
Gradient-based learning involves optimizing the network's parameters by minimizing
a loss function using gradients.
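The idea can be sketched on a one-parameter problem. The loss (w - 3)^2, the learning rate, and the step count are assumptions chosen only so the behavior is easy to follow: each step moves the parameter a small distance against the gradient.

```python
def loss(w):
    """Toy loss with its minimum at w = 3."""
    return (w - 3.0) ** 2

def grad(w):
    """Derivative of the toy loss with respect to w."""
    return 2.0 * (w - 3.0)

def gradient_descent(w0, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from w0."""
    w = w0
    for _ in range(steps):
        w -= learning_rate * grad(w)
    return w

w_final = gradient_descent(w0=0.0)
print(round(w_final, 6))  # 3.0
```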
Loss Function:
For classification tasks, the cross-entropy loss is commonly used:
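In a common notation (assumed here), with N training examples, K classes, one-hot targets y_{ik}, and predicted class probabilities \hat{y}_{ik}, the loss can be written as follows. The second line is the special case of binary labels y_i in {0, 1}.

```latex
\begin{aligned}
\mathcal{L}_{\text{CE}} &= -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \log \hat{y}_{ik}, \\
\mathcal{L}_{\text{BCE}} &= -\frac{1}{N} \sum_{i=1}^{N} \Big[ y_i \log \hat{y}_i + (1 - y_i) \log\big(1 - \hat{y}_i\big) \Big].
\end{aligned}
```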
5. Architecture Design
Considerations:
Depth (number of layers): Deeper networks can model more complex functions but
are harder to train.
Width (number of neurons per layer): Wider layers can capture more features but
may lead to overfitting.
Activation Functions: Choice affects learning dynamics and performance.
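One concrete way to compare depth and width is to count trainable parameters. The sketch below assumes fully connected layers with one bias per neuron; the function name and the example layer sizes are hypothetical.

```python
def count_parameters(layer_sizes):
    """Weights plus biases of a fully connected network with the given layer sizes."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

xor_net_params = count_parameters([2, 2, 1])        # the 2-2-1 XOR network
deeper_params = count_parameters([100, 50, 50, 10])  # two narrower hidden layers
wider_params = count_parameters([100, 200, 10])      # one wide hidden layer
print(xor_net_params, deeper_params, wider_params)   # 9 8110 22210
```

In this example the wider network has far more parameters than the deeper one, which is one reason width is often associated with a higher risk of overfitting.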
Universal Approximation Theorem:
A feedforward network with a single hidden layer containing a finite number of neurons can
approximate any continuous function on compact subsets of R^n, under mild assumptions on
the activation function.
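Stated more formally, in one common form (the symbols are assumptions): let \sigma be a suitable activation function, for example continuous and non-polynomial, and let K be a compact subset of R^n. Then for every continuous f on K and every tolerance \varepsilon > 0 there exist a width N and parameters v_i, w_i, b_i such that:

```latex
\sup_{x \in K} \left| f(x) - \sum_{i=1}^{N} v_i \, \sigma\!\left(w_i^{\top} x + b_i\right) \right| < \varepsilon,
\qquad v_i, b_i \in \mathbb{R}, \; w_i \in \mathbb{R}^{n}.
```

The theorem guarantees that such parameters exist. It does not say how large N must be or whether gradient-based training will find them.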
6. Backpropagation and Differentiation Algorithms
Backpropagation:
An efficient algorithm to compute gradients of the loss function with respect to each weight
by applying the chain rule of calculus.
Steps:
1. Forward Pass: Compute activations for each layer.
2. Compute Loss: Evaluate the loss function, which measures the error between predicted and actual outputs.
3. Backward Pass: Propagate the error backward to compute gradients.
4. Update Weights: Adjust weights using the computed gradients.
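The four steps can be sketched on a tiny network with one hidden unit, and the backpropagated gradients can be compared against finite differences as a sanity check. The squared-error loss, parameter values, and learning rate are assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_fn(params, x, y):
    """Squared-error loss of a 1-1-1 sigmoid network."""
    w1, b1, w2, b2 = params
    h = sigmoid(w1 * x + b1)
    y_hat = sigmoid(w2 * h + b2)
    return 0.5 * (y_hat - y) ** 2

def backprop(params, x, y):
    """Gradients of loss_fn with respect to [w1, b1, w2, b2] via the chain rule."""
    w1, b1, w2, b2 = params
    # Step 1: forward pass, keeping the intermediate activations.
    h = sigmoid(w1 * x + b1)
    y_hat = sigmoid(w2 * h + b2)
    # Steps 2 and 3: differentiate the loss and propagate the error backward.
    d_yhat = y_hat - y                     # dL/dy_hat
    d_z2 = d_yhat * y_hat * (1.0 - y_hat)  # through the output sigmoid
    d_w2, d_b2 = d_z2 * h, d_z2
    d_h = d_z2 * w2                        # into the hidden activation
    d_z1 = d_h * h * (1.0 - h)             # through the hidden sigmoid
    d_w1, d_b1 = d_z1 * x, d_z1
    return [d_w1, d_b1, d_w2, d_b2]

def numerical_gradient(params, x, y, eps=1e-6):
    """Central finite differences, used only to check backprop."""
    grads = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((loss_fn(up, x, y) - loss_fn(down, x, y)) / (2.0 * eps))
    return grads

params = [0.5, -0.3, 0.8, 0.1]
x, y = 1.5, 1.0
analytic = backprop(params, x, y)
numeric = numerical_gradient(params, x, y)
max_diff = max(abs(a - n) for a, n in zip(analytic, numeric))

# Step 4: update the weights with one gradient descent step.
learning_rate = 0.1
updated = [p - learning_rate * g for p, g in zip(params, analytic)]
print(f"max gradient difference: {max_diff:.2e}")
```

Backpropagation obtains all four gradients from one forward and one backward pass, while the finite-difference check needs two loss evaluations per parameter. That cost difference is why backpropagation is the practical choice for large networks.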
b. Dataset Augmentation:
Increases training data diversity by applying transformations (e.g., rotation, scaling) to
existing data.
Helps the model generalize better.
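A minimal sketch of augmentation on a tiny image stored as a nested list follows. The choice of a horizontal flip and a 90-degree rotation, and the function names, are assumptions for illustration. Real pipelines usually apply random transformations on the fly.

```python
def flip_horizontal(image):
    """Mirror each row left to right."""
    return [row[::-1] for row in image]

def rotate_90_clockwise(image):
    """Rotate the image a quarter turn clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def augment(image):
    """Return the original image together with two transformed copies."""
    return [image, flip_horizontal(image), rotate_90_clockwise(image)]

image = [[1, 2, 3],
         [4, 5, 6]]
augmented = augment(image)
print(augmented[1])  # [[3, 2, 1], [6, 5, 4]]
print(augmented[2])  # [[4, 1], [5, 2], [6, 3]]
```

The label of each augmented copy stays the same as the original, so this only makes sense for transformations that do not change the class.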
c. Noise Robustness:
Introduce noise to inputs or weights during training to make the model more robust to
variations.
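Input noise injection can be sketched as follows. Gaussian noise and the standard deviation of 0.1 are assumptions; the notes do not specify a noise distribution.

```python
import random

def add_gaussian_noise(inputs, std=0.1, rng=None):
    """Return a noisy copy of the inputs; apply during training only."""
    rng = rng or random.Random()
    return [x + rng.gauss(0.0, std) for x in inputs]

clean = [0.0, 0.5, 1.0]
noisy = add_gaussian_noise(clean, std=0.1, rng=random.Random(0))
print(noisy)
```

A fresh noisy copy would typically be drawn at every training step, so the model never sees exactly the same input twice.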
d. Semi-Supervised Learning:
Combines a small amount of labeled data with a large amount of unlabeled data
during training.
e. Multitask Learning:
Trains the model on multiple related tasks simultaneously, leveraging shared
representations.
f. Early Stopping:
Monitors validation performance during training.
Stops training when performance on validation data starts to degrade.
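One way to implement this rule is with a patience counter, sketched below on a made-up list of validation losses. The patience parameter and the function name are assumptions; in practice the weights from the best epoch would also be saved and restored.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops once `patience` epochs have passed without improvement.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Hypothetical validation losses: improving, then degrading.
val_losses = [0.90, 0.70, 0.60, 0.62, 0.65, 0.70]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=2)
print(stop_epoch, best_epoch)  # 4 2
```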
g. Parameter Tying and Sharing:
Parameter Tying: Constrains certain parameters to be equal or close to one another (for example, by penalizing the distance between them).
Parameter Sharing: Uses the same parameters across different parts of the model
(common in CNNs).
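Parameter sharing in a convolutional layer can be sketched in one dimension: the same small kernel is reused at every position of the input. The signal and kernel values are assumptions for illustration.

```python
def conv1d(signal, kernel):
    """Slide one shared kernel across the signal (no padding, stride 1)."""
    k = len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]

signal = [1, 2, 3, 4, 5]
kernel = [1, 0, -1]  # the only trainable weights in this layer
print(conv1d(signal, kernel))  # [-2, -2, -2]
```

The layer produces three outputs from five inputs using 3 weights, whereas a fully connected layer of the same shape would need 5 * 3 = 15.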
h. Sparse Representations:
Encourages activations to be sparse, meaning most neurons are inactive (output zero)
for a given input.
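A common way to encourage sparsity, assumed here since the notes do not name a method, is to add an L1 penalty on the activations to the training loss. The sketch below computes that penalty and the fraction of inactive units for a small ReLU layer with made-up pre-activations.

```python
def relu(z):
    return max(0.0, z)

def l1_activation_penalty(activations, strength=0.01):
    """Penalty term added to the loss; grows with the total activation size."""
    return strength * sum(abs(a) for a in activations)

def sparsity(activations):
    """Fraction of units whose output is exactly zero."""
    return sum(1 for a in activations if a == 0.0) / len(activations)

pre_activations = [-1.2, 0.4, -0.3, 2.0, -0.7]
activations = [relu(z) for z in pre_activations]
penalty = l1_activation_penalty(activations)
print(activations)            # [0.0, 0.4, 0.0, 2.0, 0.0]
print(sparsity(activations))  # 0.6
```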
i. Bagging and Ensemble Methods:
Bagging: Trains multiple models on different bootstrap samples of the data (drawn with
replacement) and averages their predictions.
Ensemble Methods: Combine predictions from multiple models to improve
generalization.
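Bagging can be sketched with a deliberately trivial model that predicts the mean of its training sample. The data values, the number of models, and the function names are assumptions. The point is the procedure: resample with replacement, fit one model per sample, then average.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]

def fit_mean_model(sample):
    """A stand-in 'model' that always predicts the mean of its sample."""
    mean = sum(sample) / len(sample)
    return lambda: mean

def bagged_prediction(data, n_models=25, seed=0):
    """Average the predictions of models trained on bootstrap samples."""
    rng = random.Random(seed)
    models = [fit_mean_model(bootstrap_sample(data, rng)) for _ in range(n_models)]
    return sum(model() for model in models) / n_models

data = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]
prediction = bagged_prediction(data)
print(prediction)
```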
j. Dropout:
Randomly sets a fraction of activations to zero during training.
Prevents units from co-adapting too much.
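A minimal sketch of dropout on a list of activations follows. It uses the "inverted" variant, which rescales the surviving activations during training so that nothing needs to change at inference time. That scaling convention, the rate of 0.5, and the function signature are assumptions, not something the notes specify.

```python
import random

def dropout(activations, rate=0.5, training=True, rng=None):
    """Zero each activation with probability `rate` during training.

    Survivors are divided by (1 - rate) so the expected value is unchanged.
    At inference time the activations pass through untouched.
    """
    if not training or rate == 0.0:
        return list(activations)
    rng = rng or random.Random()
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

activations = [1.0] * 10
train_out = dropout(activations, rate=0.5, rng=random.Random(0))
eval_out = dropout(activations, rate=0.5, training=False)
print(train_out)  # each entry is either 0.0 or 2.0
print(eval_out)   # unchanged
```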