
ML DL Study Guide

The document is a comprehensive study guide on Machine Learning and Deep Learning, covering fundamental concepts, workflows, and various algorithms across six units. It includes topics such as supervised learning, classification, data preprocessing, feature engineering, and dimensionality reduction techniques like PCA and LDA. The guide also addresses specific algorithms like linear regression, logistic regression, and support vector machines, providing insights into their applications and evaluation metrics.

Uploaded by

anjurakshit616
Copyright
© All Rights Reserved
Machine Learning & Deep Learning — Complete Study Guide

Machine Learning & Deep Learning
Complete Study Guide — Units I to VI

Covering: Introduction • Supervised Learning • Classification • Deep Learning • CNNs • Recurrent Neural Networks

From Fundamentals to Advanced Concepts

ML Study Guide | Units I–VI


Unit I: Introduction to Machine Learning


Machine Learning (ML) is a subfield of Artificial Intelligence (AI) that enables systems to learn and
improve from experience automatically, without being explicitly programmed for every task. The
fundamental idea is to allow computers to find patterns in data and use those patterns to make
decisions or predictions.

1.1 Motivation and Role of Machine Learning


Traditional programming requires developers to write explicit rules. Machine Learning flips this
paradigm: instead of coding the rules, we provide data and the desired outputs, and the algorithm
learns the rules automatically.

Why Machine Learning?


• Problems too complex to hand-code rules (e.g., face recognition, language translation)
• Environments that change over time (e.g., spam filters, stock market prediction)
• Personalization at scale (e.g., Netflix recommendations, targeted advertising)
• Mining hidden patterns in massive datasets (e.g., genomics, fraud detection)
• Tasks that humans perform intuitively but cannot easily articulate (e.g., handwriting recognition)

Role in Computer Science and Problem Solving


Machine Learning bridges raw data and actionable intelligence. Its role spans:
Domain | ML Application | Example
Computer Vision | Image classification, object detection | Self-driving cars, medical imaging
Natural Language Processing | Text classification, translation, QA | ChatGPT, Google Translate
Finance | Fraud detection, credit scoring | Bank transaction monitoring
Healthcare | Disease prediction, drug discovery | Cancer detection from scans
Robotics | Motion planning, control | Industrial automation
Search Engines | Ranking, recommendation | Google Search, YouTube Suggest

1.2 Machine Learning Workflow


A well-structured ML workflow ensures reproducibility and quality. The standard pipeline consists of the
following stages:

Step 1: Problem Definition


• Define the business/research objective
• Determine if ML is the right approach
• Identify the type of ML task (classification, regression, clustering, etc.)


Step 2: Data Collection


• Gather data from databases, APIs, web scraping, sensors, surveys
• Ensure data quality, diversity, and sufficient volume
• Understand data provenance and ethical considerations

Step 3: Exploratory Data Analysis (EDA)


• Statistical summaries (mean, median, std, skewness, kurtosis)
• Visualizations: histograms, scatter plots, box plots, correlation heatmaps
• Identify outliers, missing values, and data imbalances

Step 4: Data Preprocessing


• Handle missing values (imputation or removal)
• Encode categorical variables (label encoding, one-hot encoding)
• Normalize/standardize numerical features
• Split data into training, validation, and test sets

Step 5: Feature Engineering & Selection


• Create new features from existing ones
• Remove irrelevant or redundant features
• Apply dimensionality reduction if needed

Step 6: Model Selection & Training


• Choose appropriate algorithm(s) based on problem type
• Train on training set, validate on validation set
• Tune hyperparameters (Grid Search, Random Search, Bayesian Optimization)

Step 7: Evaluation
• Evaluate on the held-out test set using appropriate metrics
• Check for overfitting or underfitting
• Perform cross-validation for robust estimates

Step 8: Deployment & Monitoring


• Serialize the model (pickle, ONNX, TensorFlow SavedModel)
• Build an API or integrate into production pipeline
• Monitor for data drift and model degradation over time
💡 Golden Rule: Never let your test set influence any design decision. It is a final, unbiased
estimator of real-world performance.

1.3 Paradigms of Learning


Machine Learning is broadly categorized into four learning paradigms based on how the algorithm
learns from data:

Supervised Learning
The algorithm learns a mapping from inputs X to outputs Y given labeled training examples (X, Y).


• Regression: Predict continuous values (e.g., house prices, temperature)


• Classification: Predict discrete class labels (e.g., spam/not-spam, digit 0–9)
• Examples: Linear Regression, Decision Trees, SVM, Neural Networks

Unsupervised Learning
No labels are provided. The algorithm discovers structure in unlabeled data.
• Clustering: Group similar data points (e.g., K-Means, DBSCAN, Hierarchical)
• Dimensionality Reduction: PCA, t-SNE, UMAP
• Density Estimation: Gaussian Mixture Models
• Anomaly Detection: Identify unusual patterns

Semi-Supervised Learning
A combination of a small amount of labeled data and a large amount of unlabeled data. Useful when
labeling is expensive (e.g., medical image annotation).

Reinforcement Learning
An agent learns to take actions in an environment to maximize cumulative reward. No explicit training
data — learning happens through trial and error.
• Key concepts: Agent, Environment, State, Action, Reward, Policy
• Examples: AlphaGo, OpenAI Five, robotic control, game playing

1.4 Data Preprocessing and Feature Engineering


Handling Missing Values
• Drop rows/columns: Drop a column when most of its values are missing (e.g., >50%); drop rows when only a few rows are affected
• Mean/Median/Mode Imputation: Simple but can distort distributions
• KNN Imputation: Uses k nearest neighbors to fill missing values
• Model-Based Imputation: Train a model to predict missing values
• Forward/Backward Fill: Useful for time series data

Encoding Categorical Variables


Technique | Use Case | Notes
Label Encoding | Ordinal categories | Assigns integer codes 0, 1, 2...
One-Hot Encoding | Nominal categories | Creates binary columns for each category
Target Encoding | High cardinality features | Replaces category with mean target value
Binary Encoding | High cardinality features | More compact than one-hot

Feature Scaling
• Min-Max Normalization: Scales to [0, 1] — sensitive to outliers
• Z-score Standardization: Subtracts mean, divides by std — zero mean, unit variance


• Robust Scaler: Uses median and IQR — resistant to outliers


• Log Transform: Handles right-skewed distributions (e.g., income data)
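
The four scaling options listed above can be compared on a small right-skewed sample. This is a minimal sketch using only the Python standard library; the helper names (min_max, z_score, robust) and the sample values are illustrative assumptions, and the quartiles use the "inclusive" method of statistics.quantiles, which is one of several conventions.

```python
import math
import statistics

def min_max(values):
    # Rescale to [0, 1]; a single extreme value stretches the range
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    # Subtract the mean and divide by the (population) standard deviation
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def robust(values):
    # Center on the median and divide by the interquartile range (IQR)
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - median) / (q3 - q1) for v in values]

# Hypothetical income-like data with one large outlier
incomes = [20.0, 30.0, 40.0, 50.0, 1000.0]

scaled_minmax = min_max(incomes)
scaled_z = z_score(incomes)
scaled_robust = robust(incomes)
logged = [math.log1p(v) for v in incomes]  # log(1 + x) compresses the long tail
```

With min-max scaling the four ordinary values are squeezed close to 0 by the outlier, while the robust scaler keeps them spread around 0 because the median and IQR ignore the extreme point.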

Feature Engineering
Feature engineering is the process of using domain knowledge to create new informative features from
raw data. Examples:
• From datetime: extract day, month, year, day-of-week, hour
• Polynomial features: x1^2, x2^2, x1*x2 for non-linear relationships
• Interaction features: combine two features (e.g., weight / height² gives BMI)
• Text features: TF-IDF, word embeddings from raw text
• Aggregation: compute statistics (mean, sum) per group

1.5 Feature Selection and Extraction Techniques


Feature Selection Methods
Feature selection reduces the number of input features by selecting a relevant subset.
• Filter Methods: Rank features by statistical correlation with the target (Chi-square, ANOVA,
Mutual Information). Fast but do not consider feature interactions.
• Wrapper Methods: Use a model to evaluate feature subsets (Recursive Feature Elimination
(RFE), Forward/Backward Selection). Accurate but computationally expensive.
• Embedded Methods: Feature selection is built into model training (L1 Lasso regularization,
feature_importances_ in Random Forest). Efficient and effective.

Feature Extraction
Unlike selection, extraction transforms the original features into a new lower-dimensional space.
• PCA (Principal Component Analysis): Projects data onto orthogonal axes of maximum variance
• LDA (Linear Discriminant Analysis): Maximizes class separability
• t-SNE / UMAP: Non-linear methods for visualization
• Autoencoders: Neural network-based non-linear feature extraction

1.6 Dimensionality Reduction: PCA


Principal Component Analysis (PCA) is the most widely used linear dimensionality reduction technique.
It transforms data into a new coordinate system where the axes (principal components) are ordered by
the amount of variance they explain.

Mathematical Intuition
• Step 1: Standardize the data (zero mean, unit variance)
• Step 2: Compute the covariance matrix of the features
• Step 3: Compute eigenvalues and eigenvectors of the covariance matrix
• Step 4: Sort eigenvectors by descending eigenvalue (variance explained)
• Step 5: Select top k eigenvectors (principal components)
• Step 6: Project original data onto the k-dimensional subspace
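
The six steps above can be sketched directly with NumPy. This is an illustrative implementation, not the one used by any particular library; the function name pca_fit_transform and the synthetic data are assumptions made for the example.

```python
import numpy as np

def pca_fit_transform(X, k):
    # Step 1: standardize each feature (zero mean, unit variance)
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 2: covariance matrix of the features (columns)
    cov = np.cov(X_std, rowvar=False)
    # Step 3: eigen-decomposition; eigh is suited to symmetric matrices
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Step 4: sort eigenpairs by descending eigenvalue
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Step 5: keep the top k eigenvectors as principal components
    components = eigvecs[:, :k]
    # Step 6: project the standardized data onto the k-dimensional subspace
    return X_std @ components, eigvals / eigvals.sum()

# Synthetic data: two strongly correlated features plus one independent feature
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, 2 * t + 0.1 * rng.normal(size=200), rng.normal(size=200)])

X_proj, explained_ratio = pca_fit_transform(X, k=2)
```

Because the first two features are nearly collinear, most of the variance should land on the first component, and explained_ratio shows how much each component contributes.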


Key Concepts
• Explained Variance Ratio: Percentage of total variance captured by each component
• Cumulative Explained Variance: Used to choose k (typically 95% threshold)
• Scree Plot: Plot of eigenvalues — look for 'elbow' in the curve

PCA in Python
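from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# X: feature matrix of shape (n_samples, n_features)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# A float n_components keeps enough components to retain 95% of the variance
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)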
💡 PCA is unsupervised — it does not use class labels. Use LDA when you want to maximize class
separability.

1.7 Dimensionality Reduction: LDA


Linear Discriminant Analysis (LDA) is a supervised dimensionality reduction and classification
technique. It finds the linear combinations of features that best separate the classes.

Objective
LDA maximizes the Fisher criterion: the ratio of between-class scatter to within-class scatter. This
ensures projected classes are well separated.

Key Concepts
• Between-class scatter matrix (SB): Measures how far class means are from the global mean
• Within-class scatter matrix (SW): Measures spread within each class
• LDA components: Eigenvectors of SW^(-1) * SB
• Maximum components: min(n_classes - 1, n_features)
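
Written out, the Fisher criterion and the two scatter matrices named above take the following standard form. The symbols w (projection direction), μ_c (mean of class c), μ (global mean), and n_c (number of samples in class c) are notation assumed here rather than taken from the guide.

```latex
J(\mathbf{w}) = \frac{\mathbf{w}^{\top} S_B \,\mathbf{w}}{\mathbf{w}^{\top} S_W \,\mathbf{w}},
\qquad
S_B = \sum_{c} n_c \,(\boldsymbol{\mu}_c - \boldsymbol{\mu})(\boldsymbol{\mu}_c - \boldsymbol{\mu})^{\top},
\qquad
S_W = \sum_{c} \sum_{\mathbf{x}_i \in c} (\mathbf{x}_i - \boldsymbol{\mu}_c)(\mathbf{x}_i - \boldsymbol{\mu}_c)^{\top}
```

Maximizing J(w) leads to the eigenvalue problem for SW^(-1) * SB listed in the key concepts.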

PCA vs LDA Comparison


Property | PCA | LDA
Type | Unsupervised | Supervised
Objective | Maximize variance | Maximize class separability
Uses Class Labels | No | Yes
Max Components | min(n_samples, n_features) | min(n_classes - 1, n_features)
Best For | Visualization, compression | Classification preprocessing


Unit II: Supervised Learning Algorithms


Supervised learning algorithms learn a function mapping input features to output labels using labeled
training data. This unit covers linear models, kernel-based methods, and probability-based classifiers.

2.1 Linear Regression


Linear Regression models the relationship between a dependent variable (target) and one or more
independent variables (features) as a linear function.

Simple Linear Regression


Models the relationship between one input feature x and the target y: y = β₀ + β₁x + ε, where β₀ is the
intercept, β₁ is the slope, and ε is the error term.
• Ordinary Least Squares (OLS) minimizes the sum of squared residuals
• β₁ = Σ(xᵢ - x̄)(yᵢ - ȳ) / Σ(xᵢ - x̄)²
• β₀ = ȳ - β₁x̄
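
The two OLS formulas above translate almost line for line into code. The sketch below uses plain Python; the function name fit_simple_ols and the noise-free sample data are assumptions chosen so the result is easy to check by hand.

```python
def fit_simple_ols(xs, ys):
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator and denominator of the slope formula
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta1 = sxy / sxx
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

# Data generated from y = 1 + 2x with no noise
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

beta0, beta1 = fit_simple_ols(xs, ys)
```

On this data the fit recovers the intercept 1 and slope 2 used to generate it; with noisy data the estimates would only approximate the true values.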

Multiple Linear Regression


Extends simple linear regression to multiple input features: y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ + ε. In
matrix form: y = Xβ + ε, with the least-squares estimate β = (XᵀX)⁻¹Xᵀy.

Polynomial Linear Regression


Captures non-linear relationships by adding polynomial features while keeping the model linear in
coefficients: y = β₀ + β₁x + β₂x² + β₃x³ + ...
💡 The model is still linear in its parameters (β), even though the relationship with x is non-linear.
This is why it is called Polynomial Linear Regression.

Evaluation Metrics for Regression


Metric | Formula | Interpretation
MAE | Mean(|yᵢ - ŷᵢ|) | Average absolute error; less sensitive to outliers
MSE | Mean((yᵢ - ŷᵢ)²) | Penalizes large errors heavily; in squared units
RMSE | √MSE | Same units as target; useful for interpretation
R² Score | 1 - SS_res / SS_tot | Proportion of variance explained; 1 = perfect, 0 = baseline
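
As a worked example of the four metrics, the sketch below computes them for a tiny set of predictions. The function name regression_metrics and the sample values are illustrative assumptions.

```python
import math

def regression_metrics(y_true, y_pred):
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(r) for r in residuals) / n
    mse = sum(r * r for r in residuals) / n
    rmse = math.sqrt(mse)
    # R^2 compares the model against always predicting the mean of y_true
    y_bar = sum(y_true) / n
    ss_res = sum(r * r for r in residuals)
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]

metrics = regression_metrics(y_true, y_pred)
```

Here the residuals are 1, 0, -1, 0, so MAE and MSE are both 0.5, and SS_res = 2 against SS_tot = 20 gives R² = 0.9.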

2.2 Logistic Regression


Despite its name, Logistic Regression is a classification algorithm. It models the probability that an input
belongs to a particular class using the sigmoid function.
• Sigmoid function: σ(z) = 1 / (1 + e^(-z)), where z = Xβ


• Output is a probability P(y=1|X), thresholded at 0.5 for binary classification


• Trained by minimizing binary cross-entropy (log loss)
• Decision boundary is linear in feature space
• Extended to multiclass via One-vs-Rest (OvR) or Softmax regression

Cost Function
Log-Loss: L = -[y·log(ŷ) + (1-y)·log(1-ŷ)]. Optimized using Gradient Descent or variants (SGD, Adam).
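
To make the sigmoid and the log-loss concrete, here is a minimal sketch for a single example. The coefficient values, the helper names, and the clipping constant eps (added to avoid log(0)) are assumptions for illustration only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, beta):
    # beta[0] is the intercept; the remaining entries pair with the features
    z = beta[0] + sum(b * x for b, x in zip(beta[1:], features))
    return sigmoid(z)

def log_loss(y, y_hat, eps=1e-12):
    # Clip the probability so the logarithm stays finite
    y_hat = min(max(y_hat, eps), 1 - eps)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

beta = [-1.0, 2.0]                  # hypothetical fitted coefficients
p = predict_proba([0.5], beta)      # z = -1 + 2 * 0.5 = 0
label = int(p >= 0.5)               # threshold at 0.5
loss_if_positive = log_loss(1, p)
```

A point exactly on the decision boundary (z = 0) gets probability 0.5, and its loss is ln 2 whichever class it truly belongs to.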

2.3 Ridge and Lasso Regression


Regularization techniques that add a penalty term to the loss function to prevent overfitting by
discouraging large coefficient values.

Ridge Regression (L2 Regularization)


• Loss = MSE + λΣβᵢ²
• Shrinks coefficients toward zero but never exactly zero
• Works well when many features have small effects
• Has closed-form solution: β = (XᵀX + λI)⁻¹Xᵀy
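
The closed-form solution above can be evaluated directly with NumPy. This sketch follows the formula as stated (no intercept, penalty added to XᵀX, which corresponds to a sum-of-squares loss); the function name ridge_closed_form and the toy data are assumptions.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    # Solve (X^T X + lam * I) beta = X^T y rather than forming the inverse
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])

beta_ols = ridge_closed_form(X, y, lam=0.0)    # lam = 0 reduces to ordinary least squares
beta_ridge = ridge_closed_form(X, y, lam=1.0)  # penalized coefficients
```

With lam = 0 the coefficients are [1, 2]; with lam = 1 they shrink to [0.875, 1.375], smaller in norm but not zero.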

Lasso Regression (L1 Regularization)


• Loss = MSE + λΣ|βᵢ|
• Can shrink coefficients to exactly zero — performs automatic feature selection
• Useful for sparse models with few important features
• No closed-form solution; solved iteratively

Elastic Net
Combines L1 and L2 penalties: Loss = MSE + λ₁Σ|βᵢ| + λ₂Σβᵢ². Balances feature selection (Lasso) and
coefficient shrinkage (Ridge).
💡 Choose Ridge when all features matter. Choose Lasso when you expect sparse solutions. Use
Elastic Net for high-dimensional correlated features.

2.4 Support Vector Machines (SVM)


Support Vector Machines find the optimal hyperplane that maximizes the margin between two classes.
Points closest to the boundary are called support vectors.

Linear SVM and Linear Classification


• Hyperplane: w·x + b = 0, where w is the weight vector and b is the bias
• Margin = 2 / ||w||; maximizing margin minimizes ||w||
• Hard-margin SVM: No misclassifications allowed (linearly separable data)
• Soft-margin SVM: Allows misclassifications via slack variables; controlled by C parameter
• Large C: Low bias, high variance (fits training data closely)
• Small C: High bias, low variance (larger margin, more misclassifications tolerated)


Kernel-Based Classification
When data is not linearly separable in the original feature space, the kernel trick maps data to a higher-
dimensional space where linear separation is possible, without explicitly computing the transformation.
Kernel | Formula | Best For
Linear | K(x,z) = xᵀz | Linearly separable data, high-dim text
Polynomial | K(x,z) = (xᵀz + c)^d | Non-linear boundaries; image classification
RBF (Gaussian) | K(x,z) = exp(-γ||x-z||²) | Most general purpose; non-linear data
Sigmoid | K(x,z) = tanh(αxᵀz + c) | Neural network-like decision boundaries
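
The kernel formulas can be evaluated on a pair of points to see how each one measures similarity. This is a plain-Python sketch; the default parameter values (c = 1, d = 2, γ = 0.5) and the sample vectors are arbitrary choices for illustration.

```python
import math

def linear_kernel(x, z):
    return sum(a * b for a, b in zip(x, z))

def polynomial_kernel(x, z, c=1.0, d=2):
    return (linear_kernel(x, z) + c) ** d

def rbf_kernel(x, z, gamma=0.5):
    # Depends only on the squared Euclidean distance between x and z
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

x = [1.0, 2.0]
z = [2.0, 0.0]

k_lin = linear_kernel(x, z)       # 1*2 + 2*0 = 2
k_poly = polynomial_kernel(x, z)  # (2 + 1)^2 = 9
k_rbf = rbf_kernel(x, z)          # exp(-0.5 * 5)
```

The RBF value is 1 for identical points and decays toward 0 as the points move apart; a larger γ makes that decay faster, which is why γ controls how local the decision boundary is.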

Hyperparameters
• C: Inverse regularization strength; larger C means weaker regularization (controls bias-variance tradeoff)
• γ (gamma): For RBF kernel — controls locality of decision boundary
• degree d: For polynomial kernel

2.5 Naive Bayes Classifiers


Naive Bayes is a probabilistic classifier based on Bayes' theorem with the 'naive' assumption that
features are conditionally independent given the class label.

Bayes' Theorem
P(y|X) = P(X|y) · P(y) / P(X), where P(y|X) is the posterior, P(X|y) is the likelihood, P(y) is the prior, and
P(X) is the evidence.
The naive independence assumption gives: P(X|y) = Π P(xᵢ|y), simplifying computation enormously.

Gaussian Naive Bayes


• Assumes each feature follows a Gaussian (normal) distribution within each class
• P(xᵢ|y) = (1/√(2πσ²)) · exp(-(xᵢ-μ)² / (2σ²))
• Best for continuous numerical features
• Estimates μ (mean) and σ² (variance) per feature per class from training data

Multinomial Naive Bayes


• Assumes features represent counts or frequencies (non-negative integers)
• Widely used in text classification (word counts as features)
• P(xᵢ|y) = (count(xᵢ, y) + α) / (Σ count(x, y) + α|V|) [Laplace smoothing]
• α=1 is Laplace (add-one) smoothing to handle zero probabilities
💡 Naive Bayes is surprisingly effective despite its simplifying assumptions. It is fast, scalable to
large datasets, and works well with limited training data.
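
The Laplace-smoothed estimate used by Multinomial Naive Bayes can be computed by hand for a tiny corpus. The sketch below is illustrative: the documents, the vocabulary, and the function name smoothed_word_probs are made up for the example.

```python
from collections import Counter

def smoothed_word_probs(docs, vocabulary, alpha=1.0):
    # Count every word occurrence across the documents of one class
    counts = Counter(word for doc in docs for word in doc)
    total = sum(counts[w] for w in vocabulary)
    denom = total + alpha * len(vocabulary)
    # (count + alpha) / (total count + alpha * |V|)
    return {w: (counts[w] + alpha) / denom for w in vocabulary}

# Hypothetical tokenized documents belonging to the "spam" class
spam_docs = [["win", "money", "now"], ["win", "prize"]]
vocabulary = ["win", "money", "now", "prize", "meeting"]

probs = smoothed_word_probs(spam_docs, vocabulary)
```

The word "meeting" never appears in the spam documents, yet it receives probability 1/10 instead of 0, so a single unseen word cannot zero out the whole product of likelihoods.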


Unit III: Supervised Learning — Classification


This unit covers advanced classification algorithms including instance-based learning, tree-based
methods, ensemble techniques, and comprehensive evaluation metrics.

3.1 K-Nearest Neighbor (KNN) Classifier


KNN is a non-parametric, lazy learning algorithm. It makes predictions by finding the k most similar
training examples to a new input and using their labels.

Algorithm
• Step 1: Choose the number of neighbors k
• Step 2: Compute the distance from the query point to all training points
• Step 3: Select the k nearest neighbors
• Step 4: For classification: take majority vote of neighbor labels
• Step 5: For regression: take average of neighbor values
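
The classification branch of the algorithm fits in a few lines of plain Python. This is a brute-force sketch using Euclidean distance; the function name knn_predict and the toy points are assumptions, and real implementations usually add faster neighbor search.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    # Step 2: distance from the query to every training point
    distances = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    # Step 3: keep the k closest points
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    # Step 4: majority vote among the neighbors
    return Counter(nearest_labels).most_common(1)[0][0]

train_X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 6.5), (6.5, 7.0)]
train_y = ["a", "a", "a", "b", "b", "b"]

pred_near_a = knn_predict(train_X, train_y, (1.2, 1.4))
pred_near_b = knn_predict(train_X, train_y, (6.4, 6.6))
```

For regression, the vote in the last step would be replaced by the average of the neighbors' target values.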

Distance Metrics
• Euclidean Distance: √Σ(xᵢ - yᵢ)² — most common; works for continuous features
• Manhattan Distance: Σ|xᵢ - yᵢ| — robust to outliers
• Minkowski Distance: (Σ|xᵢ - yᵢ|^p)^(1/p) — generalizes Euclidean and Manhattan
• Hamming Distance: Number of positions that differ — for categorical data

Choosing k
• Small k: Low bias, high variance (decision boundary too complex, noise-sensitive)
• Large k: High bias, low variance (smooth boundary, may miss patterns)
• Use cross-validation to select optimal k
💡 Always standardize features before KNN — otherwise features with larger scales dominate the
distance computation.

3.2 Decision Trees


Decision Trees are tree-structured classifiers that recursively split the feature space based on the
feature and threshold that best separates the target classes.

Key Concepts
• Root Node: Starting point; the best feature to split on globally
• Internal Nodes: Subsequent feature tests
• Leaf Nodes: Terminal nodes containing class labels or regression values
• Pruning: Remove branches that provide little classification power to reduce overfitting


Splitting Criteria
Criterion | Formula | Used In
Gini Impurity | 1 - Σpᵢ² | CART algorithm (scikit-learn default)
Information Gain | H(parent) - weighted avg H(children) | ID3, C4.5 algorithms
Entropy | -Σpᵢ·log₂(pᵢ) | Measure of impurity
Variance Reduction | Var(parent) - weighted avg Var(children) | Regression trees
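
The impurity measures can be checked on a small label set. The sketch below is plain Python; the function names and the example split are illustrative assumptions.

```python
import math
from collections import Counter

def class_probs(labels):
    n = len(labels)
    return [count / n for count in Counter(labels).values()]

def gini(labels):
    return 1.0 - sum(p * p for p in class_probs(labels))

def entropy(labels):
    return -sum(p * math.log2(p) for p in class_probs(labels))

def information_gain(parent, children):
    # Parent entropy minus the size-weighted average entropy of the children
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes"] * 4 + ["no"] * 4
left, right = ["yes"] * 4, ["no"] * 4   # a split that separates the classes perfectly

gain = information_gain(parent, [left, right])
```

A balanced two-class node has Gini impurity 0.5 and entropy 1 bit; the perfect split leaves both children pure, so the information gain equals the full parent entropy.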

Overfitting and Hyperparameters


• max_depth: Limit tree depth to prevent overfitting
• min_samples_split: Minimum samples required to split an internal node
• min_samples_leaf: Minimum samples required at a leaf node
• max_features: Number of features to consider for the best split

3.3 Ensemble Learning


Ensemble methods combine multiple models (often weak learners) to build a stronger one. The key insight:
a combination of diverse models often generalizes better than any single member.

Bagging (Bootstrap Aggregating)


• Train multiple models on different bootstrap samples (random sampling with replacement)
• Final prediction: majority vote (classification) or average (regression)
• Reduces variance without increasing bias
• Models are trained in parallel — fast and scalable
• Example: Random Forest

Boosting
• Train models sequentially, each correcting the errors of the previous one
• Later models focus more on previously misclassified examples
• Reduces bias and can also reduce variance
• More prone to overfitting than Bagging if not tuned properly
• Examples: AdaBoost, Gradient Boosting, XGBoost, LightGBM

Random Forest
Random Forest is an ensemble of Decision Trees trained via Bagging, with an additional
randomization: at each split, only a random subset of features is considered. This decorrelates the
trees, making the ensemble more powerful.
• Key hyperparameters: n_estimators (number of trees), max_features, max_depth
• Provides feature importance rankings via average impurity decrease
• Robust to noise and handles high-dimensional data well; native missing-value support depends on the implementation
• Strong baseline performance on many tabular datasets


AdaBoost (Adaptive Boosting)


• Trains a sequence of weak classifiers (typically shallow decision trees / stumps)
• Assigns a weight to each training sample, initially equal
• After each round, increase weights of misclassified samples
• Final model: weighted vote of all classifiers
• Classifier weight: αₜ = 0.5 · ln((1 - εₜ) / εₜ), where εₜ is the error rate
💡 Gradient Boosting generalizes AdaBoost by framing boosting as gradient descent in function
space. XGBoost and LightGBM are highly optimized implementations.
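
One AdaBoost round can be traced numerically. The classifier weight below follows the formula given above; the sample reweighting rule (multiply by exp(-α) when correct and exp(+α) when wrong, then renormalize) is the standard discrete AdaBoost update and is an assumption here, since the guide only says that misclassified weights are increased.

```python
import math

def classifier_weight(error_rate):
    # alpha_t = 0.5 * ln((1 - eps_t) / eps_t)
    return 0.5 * math.log((1 - error_rate) / error_rate)

def update_sample_weights(weights, correct, alpha):
    # Assumed update: shrink weights of correct samples, grow the rest, renormalize
    raw = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    total = sum(raw)
    return [w / total for w in raw]

weights = [0.25, 0.25, 0.25, 0.25]      # initial equal weights
correct = [True, True, True, False]     # hypothetical weak-learner results

error = sum(w for w, ok in zip(weights, correct) if not ok)  # weighted error = 0.25
alpha = classifier_weight(error)
new_weights = update_sample_weights(weights, correct, alpha)
```

After this round the single misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.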

3.4 Evaluation Metrics and Scores


Confusion Matrix
A confusion matrix is a table comparing predicted versus actual class labels. For binary classification:
(actual vs. predicted) | Predicted Positive | Predicted Negative
Actual Positive | True Positive (TP) | False Negative (FN)
Actual Negative | False Positive (FP) | True Negative (TN)

Core Metrics
Metric | Formula | Meaning
Accuracy | (TP+TN)/(TP+TN+FP+FN) | Overall fraction correct; misleading on imbalanced data
Precision | TP / (TP + FP) | Of predicted positives, what fraction is correct?
Recall (Sensitivity) | TP / (TP + FN) | Of actual positives, what fraction did we catch?
F1-Score | 2 · (Precision · Recall) / (Precision + Recall) | Harmonic mean of Precision and Recall
Specificity | TN / (TN + FP) | True Negative Rate
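
The core metrics follow directly from the four confusion-matrix counts. The sketch below uses made-up counts for an imbalanced problem (12 actual positives out of 100) to show why accuracy alone can mislead; the function name is an assumption.

```python
def classification_metrics(tp, fp, fn, tn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "specificity": tn / (tn + fp),
    }

# Hypothetical counts: 12 actual positives, 88 actual negatives
m = classification_metrics(tp=8, fp=2, fn=4, tn=86)
```

Accuracy comes out at 0.94, yet recall is only about 0.67: a third of the positives are missed, which the F1-score (about 0.73) reflects and accuracy hides.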

Cross-Validation
Cross-validation provides a more reliable estimate of model performance by using multiple train/test
splits.
• k-Fold CV: Split data into k folds; train on k-1, test on remaining; repeat k times
• Stratified k-Fold: Maintains class distribution in each fold (use for imbalanced data)
• Leave-One-Out CV (LOOCV): k = n; nearly unbiased but high-variance estimate, computationally expensive
• Repeated k-Fold: Repeat entire k-Fold process multiple times with different shuffles
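
The index bookkeeping behind plain k-Fold CV can be sketched as a small generator. This version does not shuffle or stratify, and the function name k_fold_indices is an assumption; library implementations offer both options.

```python
def k_fold_indices(n_samples, k):
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(10, 3))
```

Each sample appears in exactly one test fold, and the model would be trained k times, once per (train, test) pair, with the k scores averaged at the end.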


Multi-Class Averaging
Averaging Type | Description | Use When
Micro-Average | Pool all TPs, FPs, FNs across classes, then compute metric | Imbalanced classes; overall performance matters
Macro-Average | Compute metric per class, then take unweighted mean | Equal weight to each class; small classes matter
Weighted Average | Mean weighted by class support (sample count) | Report balanced view for imbalanced datasets
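
The difference between micro and macro averaging shows up clearly when one class dominates. The per-class counts below are invented for illustration, and precision is used as the example metric.

```python
def precision(tp, fp):
    return tp / (tp + fp)

# Hypothetical per-class counts: (true positives, false positives)
per_class = {"cat": (90, 10), "dog": (5, 5), "bird": (1, 3)}

# Macro: compute precision per class, then take the unweighted mean
macro_precision = sum(precision(tp, fp) for tp, fp in per_class.values()) / len(per_class)

# Micro: pool the counts across classes, then compute precision once
total_tp = sum(tp for tp, _ in per_class.values())
total_fp = sum(fp for _, fp in per_class.values())
micro_precision = precision(total_tp, total_fp)
```

The macro value (0.55) is pulled down by the two small, poorly predicted classes, while the micro value (about 0.84) is dominated by the large "cat" class.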


Unit IV: Introduction to Deep Learning


Deep Learning is a subset of Machine Learning that uses multi-layered neural networks to learn
hierarchical representations directly from raw data. It has achieved unprecedented performance in
vision, speech, language, and many other domains.

4.1 Evolution of AI
Era | Approach | Key Events
1950s-1980s | Symbolic AI / Expert Systems | Turing Test (1950), LISP, rule-based systems
1980s-2000s | Machine Learning (shallow) | SVM, Decision Trees, Random Forest rise
2006 | Deep Belief Networks | Hinton's breakthrough in pre-training deep networks
2012 | Deep Learning revolution | AlexNet wins ImageNet by large margin (CNNs)
2014-2016 | GANs, Seq2Seq, Attention | Generative models, Neural machine translation
2017-2020 | Transformers | BERT, GPT series transform NLP permanently
2020-Now | Foundation Models | GPT-4, Gemini, DALL-E, multimodal AI

4.2 Machine Learning vs Deep Learning


Aspect | Machine Learning | Deep Learning
Feature Engineering | Manual, domain-expertise required | Automatic; learned from raw data
Data Requirements | Works with small-medium datasets | Requires large datasets (millions+)
Compute | CPU sufficient | GPU/TPU often required
Interpretability | Higher (linear models, trees) | Lower (black box)
Training Time | Minutes to hours | Hours to days/weeks
Performance on Raw Data | Needs preprocessed features | Excels on images, text, audio
Feature Learning | No | Yes (hierarchical representations)


4.3 Deep Learning Types


• Feedforward Neural Networks (FNN/MLP): Dense layers, basic building block
• Convolutional Neural Networks (CNN): For structured grid data — images, video
• Recurrent Neural Networks (RNN/LSTM/GRU): For sequential data — time series, text
• Autoencoders: Unsupervised representation learning and generation
• Generative Adversarial Networks (GANs): Generate realistic data — images, audio
• Transformers: Attention-based; dominates NLP and increasingly vision
• Graph Neural Networks (GNN): Learning on graph-structured data

4.4 Applications of Deep Learning


• Computer Vision: Object detection (YOLO), image segmentation, facial recognition
• NLP: Machine translation, sentiment analysis, chatbots, text summarization
• Speech: Speech recognition (Whisper), text-to-speech synthesis
• Generative AI: DALL-E, Stable Diffusion, ChatGPT, Midjourney
• Healthcare: Medical image diagnosis, drug discovery, genomics
• Autonomous Systems: Self-driving cars, robotics, drone navigation

4.5 Deep Learning Frameworks


Framework | Language | Key Features
Keras | Python | High-level API; beginner-friendly; runs on TF/JAX backend
PyTorch | Python | Dynamic computation graph; Pythonic; research-preferred
TensorFlow | Python/C++ | Production-grade; TF Serving; TF Lite for mobile
Caffe | C++/Python | Fast, used in computer vision; older framework
Shogun | C++/Python | Traditional ML toolkit with some DL support; SVM-focused

4.6 Basic Tensor Operations


A tensor is the fundamental data structure in deep learning — a generalization of scalars, vectors, and
matrices to arbitrary dimensions (ranks).

Tensor Ranks
• Rank-0 (Scalar): Single number — e.g., loss value 0.42
• Rank-1 (Vector): 1D array — e.g., shape (128,) feature vector
• Rank-2 (Matrix): 2D array — e.g., shape (32, 128) batch of features
• Rank-3: 3D array — e.g., shape (32, 100, 300) batch of text sequences
• Rank-4: 4D array — e.g., shape (32, 224, 224, 3) batch of RGB images

PyTorch Tensor Operations


import torch

# Creating tensors
x = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
zeros = torch.zeros(3, 4)
rand = torch.randn(2, 3)  # samples from a standard normal distribution

# Operations
y = x + 2                 # scalar is broadcast to every element
y = x * x                 # element-wise product
y = torch.matmul(x, x.T)  # matrix product
y = x.view(1, 4)          # reshape to shape (1, 4)
y = x.sum()               # reduce over all elements
y = x.mean(0)             # mean along dimension 0

# GPU transfer
device = 'cuda' if torch.cuda.is_available() else 'cpu'
x = x.to(device)

4.7 Building a Neural Network


Using Keras
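import tensorflow as tf
from tensorflow import keras

# 784 inputs (e.g., flattened 28x28 images) -> 128 -> 64 -> 10 class probabilities
model = keras.Sequential([
    keras.layers.Dense(128, activation='relu', input_shape=(784,)),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dense(10, activation='softmax')
])

# sparse_categorical_crossentropy expects integer class labels
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(X_train, y_train, epochs=10, validation_split=0.2, batch_size=32)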

Using PyTorch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, 10)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.3)

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.relu(self.fc2(x))
        return self.fc3(x)  # raw logits; pair with nn.CrossEntropyLoss


Unit V: Convolutional Neural Networks (CNN)


Convolutional Neural Networks are the gold standard for processing grid-structured data, especially
images. They leverage the spatial structure of data through local connections and shared weights to
achieve exceptional efficiency and performance.

5.1 CNN Architecture Overview


A typical CNN follows a pattern of stacked feature extraction layers followed by classification layers:
• Input Layer → Convolutional Layers → Activation (ReLU) → Pooling Layers → ...repeat... →
Flatten → Fully Connected Layers → Output
Each convolutional layer learns to detect increasingly abstract features: edges and colors in early
layers, shapes and textures in middle layers, and high-level concepts (faces, objects) in deep layers.

5.2 Building Blocks


Convolution Layer
The core operation in a CNN. A learnable filter (kernel) of small spatial extent slides over the input,
computing the dot product between filter weights and local input patches.
• Filter/Kernel: Small 2D weight matrix (e.g., 3×3, 5×5) learned during training
• Feature Map: Output of applying a filter to the input
• Each filter detects a specific pattern (edge, texture, etc.)
• Multiple filters per layer → multiple feature maps → depth grows
• Output size formula: O = ⌊(I - K + 2P) / S⌋ + 1, where I=input size, K=kernel size, P=padding, S=stride
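
The output size formula is easy to check for a few common layer configurations. The sketch below applies it along one spatial dimension, using integer (floor) division when the stride does not divide evenly; the function name and the example layer settings are illustrative.

```python
def conv_output_size(i, k, p=0, s=1):
    # O = floor((I - K + 2P) / S) + 1
    return (i - k + 2 * p) // s + 1

same_3x3 = conv_output_size(224, 3, p=1, s=1)     # 3x3 kernel, padding 1: size preserved
valid_5x5 = conv_output_size(32, 5)               # 5x5 kernel, no padding: 32 -> 28
strided_7x7 = conv_output_size(224, 7, p=3, s=2)  # 7x7 kernel, stride 2: 224 -> 112
```

The first case is what "same" padding achieves for a 3×3 kernel, and the last shows how a stride of 2 roughly halves the spatial size.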

Activation Function — ReLU


Rectified Linear Unit: f(x) = max(0, x). Applied element-wise after each convolution.
• Introduces non-linearity, enabling CNNs to learn complex functions
• Computationally simple; does not suffer from vanishing gradient as much as Sigmoid/Tanh
• Can suffer from Dying ReLU (neurons stuck at zero); addressed by Leaky ReLU or ELU

Pooling Layer
Reduces the spatial dimensions of feature maps, providing a degree of invariance to small translations
and reducing computation.
• Max Pooling: Takes the maximum value in each region — preserves strongest activation
• Average Pooling: Takes the average value in each region — smoother, less common
• Global Average Pooling (GAP): Reduces each feature map to a single value — used before
final FC layer in modern architectures
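
The three pooling variants can be compared on a single 4×4 feature map. This is a plain-Python sketch with a 2×2 window and stride 2; the helper names pool2d and mean and the feature-map values are made up for the example.

```python
def mean(values):
    return sum(values) / len(values)

def pool2d(fmap, size=2, stride=2, reduce=max):
    rows, cols = len(fmap), len(fmap[0])
    out = []
    for r in range(0, rows - size + 1, stride):
        out_row = []
        for c in range(0, cols - size + 1, stride):
            # Collect the values inside the current window
            window = [fmap[r + dr][c + dc] for dr in range(size) for dc in range(size)]
            out_row.append(reduce(window))
        out.append(out_row)
    return out

fmap = [[1, 3, 2, 0],
        [4, 6, 1, 1],
        [0, 2, 9, 5],
        [1, 1, 3, 7]]

max_pooled = pool2d(fmap)                             # [[6, 2], [2, 9]]
avg_pooled = pool2d(fmap, reduce=mean)                # [[3.5, 1.0], [1.0, 6.0]]
global_avg = mean([v for row in fmap for v in row])   # one value for the whole map
```

Max pooling keeps the strongest activation in each window, average pooling smooths it, and global average pooling collapses the whole map to a single number (2.875 here).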

Padding and Strides


• Valid Padding (no padding): Output is smaller than input — border information lost
• Same Padding (zero-padding): Output has same spatial size as input — preserves borders
• Stride: Step size of the filter; stride=1 is standard; stride=2 roughly halves each spatial dimension


Fully Connected Layers


After the convolutional and pooling layers, the feature maps are flattened into a vector, and fully
connected (dense) layers perform the final classification using the extracted features as input.

5.3 Advanced Architectures


LeNet-5 (1998, LeCun et al.)
The pioneering CNN, designed for handwritten digit recognition (MNIST).
• Architecture: Conv(6@5×5) → AvgPool → Conv(16@5×5) → AvgPool → FC(120) → FC(84) →
Output(10)
• Used tanh/sigmoid activations; average pooling
• Introduced the pattern: Convolution → Pooling → Fully Connected

AlexNet (2012, Krizhevsky et al.)


Won the 2012 ImageNet competition (ILSVRC) by a wide margin, with a top-5 error of 15.3% against 26.2%
for the runner-up, sparking the deep learning revolution.
• 8 layers (5 conv + 3 FC); 60 million parameters
• Popularized ReLU activation, Dropout for regularization, and data augmentation
• Trained on dual GPUs for parallelism
• Used overlapping pooling (3×3 windows with stride 2)

VGG-16 (2014, Simonyan & Zisserman)


Very deep network with a simple, uniform architecture using only 3×3 convolutions.
• 16 weight layers (13 conv + 3 FC); 138 million parameters
• Key insight: Stack multiple small filters instead of one large filter for more depth
• 2 stacked 3×3 filters have the same receptive field as one 5×5 filter but fewer parameters
• Excellent for transfer learning — widely used as feature extractor
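
The receptive-field and parameter claim for stacked 3×3 filters can be verified with a little arithmetic. The sketch below assumes C input and C output channels per layer, stride 1, and ignores biases; both helper names are hypothetical.

```python
def conv_weights(kernel, channels, layers=1):
    """Weights (no biases) in `layers` stacked kernel x kernel convs mapping C -> C channels."""
    return layers * kernel * kernel * channels * channels

def stacked_receptive_field(kernel, layers):
    """Receptive field of `layers` stacked stride-1 convolutions."""
    return layers * (kernel - 1) + 1

C = 64
print(stacked_receptive_field(3, 2))  # 5: two 3x3 layers see a 5x5 region
print(conv_weights(3, C, layers=2))   # 73728 weights (18 * C^2)
print(conv_weights(5, C))             # 102400 weights (25 * C^2)
```

The stacked version also applies two non-linearities instead of one, which is part of why the deeper design tends to be more expressive.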

ResNet (2015, He et al.)


Introduced residual (skip) connections to enable training of very deep networks (100+ layers) without
vanishing gradients.
• Residual Block: H(x) = F(x) + x — adds the identity shortcut connection
• The network learns the residual F(x) = H(x) - x rather than the full mapping
• If the optimal function is close to identity, F(x) can be driven to zero
• ResNet-50, ResNet-101, ResNet-152 are standard variants
💡 The key insight of ResNets: it is easier to learn zero residuals than to fit an identity mapping from
scratch. Skip connections allow gradients to flow directly through the network.
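
The following NumPy sketch illustrates the residual formulation H(x) = F(x) + x. It assumes a simplified F(x) made of two dense layers with a ReLU in between; real ResNet blocks use convolutions, batch normalization, and a ReLU after the addition, so treat this as an illustration only.

```python
import numpy as np

def residual_block(x, w1, w2):
    """H(x) = F(x) + x, with the simplified residual F(x) = w2 @ relu(w1 @ x)."""
    f = w2 @ np.maximum(0.0, w1 @ x)
    return f + x  # identity shortcut

x = np.array([1.0, -2.0, 3.0])
zeros = np.zeros((3, 3))

# With all residual weights at zero, F(x) = 0 and the block is exactly the identity.
print(residual_block(x, zeros, zeros))
```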

5.4 Training and Optimization


Training Strategies
• Mini-batch Gradient Descent: Most common; typical batch sizes 32, 64, 128, 256
• Learning Rate Schedule: Start high, decay over time (step decay, cosine annealing, warmup)
• Optimizers: SGD with Momentum, Adam, AdamW, RMSProp
• Batch Normalization: Normalizes layer outputs; stabilizes and accelerates training
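
A learning rate schedule with linear warmup followed by cosine annealing could look like the sketch below. The function name lr_at and all default values (base_lr, warmup_steps, min_lr) are assumptions chosen for illustration.

```python
import math

def lr_at(step, total_steps, base_lr=0.1, warmup_steps=5, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay towards min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

for step in (0, 4, 5, 50, 100):
    print(step, round(lr_at(step, total_steps=100), 4))
```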

Regularization
• Dropout: Randomly zeros units during training; prevents co-adaptation of neurons
• L2 Weight Decay: Penalizes large weights; built into optimizers (weight_decay parameter)
• Data Augmentation: Random crops, flips, rotations, color jitter, Mixup, CutMix
• Early Stopping: Stop training when validation loss stops improving
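
Early stopping is usually implemented with a patience counter. The sketch below is framework-independent; the function name, the patience of 3 epochs, and the loss values are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the epoch index at which training would stop."""
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best, waited = loss, 0      # improvement: reset the counter
        else:
            waited += 1                 # no improvement this epoch
            if waited >= patience:
                return epoch
    return len(val_losses) - 1          # never triggered: ran all epochs

losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.63, 0.50]
print(early_stopping_epoch(losses))  # 5: three epochs without beating 0.60
```

In practice the weights from the best epoch (here epoch 2) are usually restored after stopping.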

Transfer Learning
Use a model pre-trained on a large dataset (e.g., ImageNet) and fine-tune it for a new task.
• Feature Extraction: Freeze all pre-trained layers; only train a new classification head
• Fine-Tuning: Unfreeze some/all layers and train on new data with a small learning rate
• Benefits: Works with small datasets, much less training time, better generalization
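# Keras transfer learning example
import tensorflow as tf

num_classes = 10  # placeholder: set to the number of classes in the new task
base_model = tf.keras.applications.ResNet50(weights='imagenet', include_top=False)
base_model.trainable = False  # Freeze pre-trained layers
x = tf.keras.layers.GlobalAveragePooling2D()(base_model.output)  # Pool feature maps to a vector
output = tf.keras.layers.Dense(num_classes, activation='softmax')(x)  # New classification head
model = tf.keras.Model(inputs=base_model.input, outputs=output)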


Unit VI: Recurrent Neural Networks (RNN)


Recurrent Neural Networks are designed for sequential data, where the order of inputs matters. They
maintain a hidden state that captures information from previous time steps, enabling the network to
model temporal dependencies.

6.1 Sequence Modeling and RNN Architecture


Unlike feedforward networks, RNNs have connections that form directed cycles, allowing information to
persist across time steps.

The RNN Update Equation


At each time step t: hₜ = tanh(Wₕₕ · hₜ₋₁ + Wₓₕ · xₜ + bₕ) and yₜ = Wₕᵧ · hₜ + bᵧ, where hₜ is the hidden
state, xₜ is the input, and yₜ is the output.
• Shared weights (Wₕₕ, Wₓₕ, Wₕᵧ) across all time steps — far fewer parameters
• Hidden state hₜ acts as a memory of what has been seen so far
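
The update equations translate almost directly into NumPy. The sketch below assumes a zero initial hidden state and randomly initialized weights; the sizes are arbitrary and chosen only to show the shapes.

```python
import numpy as np

def rnn_forward(xs, W_hh, W_xh, W_hy, b_h, b_y):
    """Run a vanilla RNN over a sequence, reusing the same weights at every step."""
    h = np.zeros(W_hh.shape[0])                  # assumed initial state h_0 = 0
    ys = []
    for x in xs:
        h = np.tanh(W_hh @ h + W_xh @ x + b_h)   # hidden state update
        ys.append(W_hy @ h + b_y)                # output at this step
    return np.stack(ys), h

rng = np.random.default_rng(0)
hidden, inp, out, T = 4, 3, 2, 5
params = (rng.normal(size=(hidden, hidden)), rng.normal(size=(hidden, inp)),
          rng.normal(size=(out, hidden)), np.zeros(hidden), np.zeros(out))
ys, h_T = rnn_forward(rng.normal(size=(T, inp)), *params)
print(ys.shape, h_T.shape)  # (5, 2) (4,)
```

The parameter count does not depend on the sequence length T, which is the practical meaning of weight sharing.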

Types of RNN Architectures by Input/Output


• One-to-Many: Single input → sequence output (e.g., image captioning)
• Many-to-One: Sequence input → single output (e.g., sentiment classification)
• Many-to-Many (Same length): Sequence → sequence (e.g., POS tagging, NER)
• Many-to-Many (Different length): Encoder-Decoder for machine translation

Bidirectional RNNs
Standard RNNs only process sequences in the forward direction. Bidirectional RNNs process the
sequence in both forward and backward directions, allowing each time step to use context from both
the past and the future.
• Forward RNN: Processes x₁ → x₂ → ... → xT
• Backward RNN: Processes xT → xT₋₁ → ... → x₁
• Hidden state at step t: hₜ = [h→ₜ ; h←ₜ] (concatenated)
• Not suitable for real-time/autoregressive generation (requires future context)
• Used in: BiLSTM taggers, ELMo, and many NLP tasks (NER, machine reading comprehension); BERT is also bidirectional but is a Transformer, not an RNN

6.2 Vanishing and Exploding Gradient Problem


Training RNNs with Backpropagation Through Time (BPTT) requires computing gradients through
many time steps. This leads to two critical problems:

Vanishing Gradients
• Gradients shrink exponentially as they flow back through many time steps
• Early layers (early time steps) receive near-zero gradient updates
• Network fails to learn long-range dependencies
• Caused by repeated multiplication by factors smaller than 1: the derivatives of tanh (at most 1) and sigmoid (at most 0.25), combined with the recurrent weights


Exploding Gradients
• Gradients grow exponentially — weight updates become huge, training diverges
• Easier to detect (NaN values) and fix with gradient clipping
• Gradient Clipping: Scale gradients down if their norm exceeds a threshold
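
Gradient clipping by global norm can be sketched in a few lines of NumPy. The function name and the threshold are assumptions for illustration; deep learning frameworks ship their own equivalents.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients together if their combined L2 norm exceeds max_norm."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))   # never scales gradients up
    return [g * scale for g in grads], total

grads = [np.array([3.0, 4.0]), np.array([12.0])]
clipped, norm = clip_by_global_norm(grads, max_norm=1.0)
print(norm)      # 13.0, the norm before clipping
print(clipped)   # same direction, combined norm of about 1.0
```

Because every gradient is multiplied by the same factor, the update direction is preserved and only its length changes.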
💡 The vanishing gradient problem is the primary motivation for LSTM and GRU architectures,
which use gating mechanisms to selectively remember and forget information.

6.3 Long Short-Term Memory (LSTM)


LSTMs were introduced by Hochreiter & Schmidhuber (1997) to solve the vanishing gradient problem.
They maintain both a short-term hidden state hₜ and a long-term cell state cₜ, regulated by three
learnable gates.

LSTM Gates
Gate | Formula | Function
Forget Gate | fₜ = σ(Wf·[hₜ₋₁, xₜ] + bf) | How much of previous cell state to keep (0=forget all, 1=keep all)
Input Gate | iₜ = σ(Wi·[hₜ₋₁, xₜ] + bi) | How much new information to add to cell state
Candidate Cell | c̃ₜ = tanh(Wc·[hₜ₋₁, xₜ] + bc) | New candidate values to add to cell state
Cell State | cₜ = fₜ * cₜ₋₁ + iₜ * c̃ₜ | Updated long-term memory (additive update = no gradient decay)
Output Gate | oₜ = σ(Wo·[hₜ₋₁, xₜ] + bo) | What part of cell state to output
Hidden State | hₜ = oₜ * tanh(cₜ) | Short-term output / next step input
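
The gate formulas can be combined into a single LSTM time step as sketched below. Storing the weights in dictionaries keyed by 'f', 'i', 'c', 'o' is an assumption made for readability; libraries typically fuse them into one matrix.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step; each W[k] has shape (hidden, hidden + input)."""
    z = np.concatenate([h_prev, x_t])         # [h_{t-1}, x_t]
    f = sigmoid(W["f"] @ z + b["f"])          # forget gate
    i = sigmoid(W["i"] @ z + b["i"])          # input gate
    c_tilde = np.tanh(W["c"] @ z + b["c"])    # candidate cell values
    c = f * c_prev + i * c_tilde              # additive cell state update
    o = sigmoid(W["o"] @ z + b["o"])          # output gate
    h = o * np.tanh(c)                        # new hidden state
    return h, c

rng = np.random.default_rng(0)
hidden, inp = 4, 3
W = {k: rng.normal(scale=0.1, size=(hidden, hidden + inp)) for k in "fico"}
b = {k: np.zeros(hidden) for k in "fico"}
h, c = np.zeros(hidden), np.zeros(hidden)
for x_t in rng.normal(size=(5, inp)):
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape, c.shape)  # (4,) (4,)
```

Here * in the table corresponds to element-wise multiplication, which is what NumPy's * does on arrays.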

Why LSTMs Work


• The cell state pathway uses additive updates — gradients can flow without vanishing
• Gates are differentiable — the entire model is trainable end-to-end via backprop
• Forget gate can learn to preserve information over many time steps
• The architecture enables capturing dependencies hundreds of steps apart

GRU — Gated Recurrent Unit


A simplified version of LSTM introduced by Cho et al. (2014), using only two gates (reset and update)
and no separate cell state. Often matches LSTM performance with fewer parameters.
• Reset Gate: Controls how much of the previous hidden state to use in the new candidate
• Update Gate: Blends old hidden state with new candidate (like forget+input gates combined)


6.4 Autoencoders
Autoencoders are neural networks trained to reconstruct their input through a bottleneck, learning a
compressed representation (encoding) of the data in the process.

Architecture
• Encoder: Input → ... → Bottleneck (latent code z)
• Decoder: Bottleneck z → ... → Reconstructed Output
• Trained to minimize reconstruction loss: ||x - x̂||²
• The bottleneck forces the encoder to learn the most important features
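
The effect of the bottleneck on the reconstruction loss can be illustrated with a linear autoencoder. As a simplifying assumption, the sketch below does not train the weights; it takes them from an SVD of the data, which gives the best linear encoder/decoder pair for a given bottleneck size. The data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 2))
X = latent @ rng.normal(size=(2, 4))     # 4-D inputs that really only have 2 degrees of freedom

def reconstruction_loss(X, k):
    """Mean ||x - x_hat||^2 for a linear autoencoder with a k-dimensional bottleneck."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    W = Vt[:k].T                         # shape (4, k): tied encoder/decoder weights
    z = X @ W                            # encoder: input -> latent code z
    x_hat = z @ W.T                      # decoder: z -> reconstruction
    return np.mean(np.sum((X - x_hat) ** 2, axis=1))

print(reconstruction_loss(X, 2))  # close to 0: a 2-D code captures everything
print(reconstruction_loss(X, 1))  # clearly larger: a 1-D code must discard information
```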

Variants
• Denoising Autoencoder: Input is corrupted, model learns to reconstruct clean version — better
representations
• Variational Autoencoder (VAE): Bottleneck is a learned distribution (μ, σ²) — enables generation
of new samples
• Sparse Autoencoder: Adds sparsity constraint on activations
• Recurrent Autoencoder: Uses LSTM/GRU layers for sequence encoding/decoding

6.5 Applications of RNNs


Language Modeling
Given a sequence of words, predict the next word. The foundation of text generation, autocomplete,
and early chatbots.
• Character-level LM: Predict next character (great for code, DNA sequences)
• Word-level LM: Predict next word token
• Perplexity is the standard metric: PP = exp(average cross-entropy)
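
Perplexity follows directly from the probabilities a model assigns to the correct next tokens. The sketch below uses made-up probabilities and the natural logarithm.

```python
import math

def perplexity(token_probs):
    """PP = exp(average cross-entropy), given the probability of each correct token."""
    avg_ce = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_ce)

# A model that always spreads its belief evenly over 4 options has perplexity of about 4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
# More confident correct predictions give a lower perplexity.
print(perplexity([0.9, 0.8, 0.95, 0.7]))
```

A perplexity of k can be read as the model being, on average, as uncertain as if it were choosing uniformly among k tokens.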

Speech Recognition
Converts audio waveforms to text. Modern systems use bidirectional LSTMs or Transformers.
• Connectionist Temporal Classification (CTC): Loss function for alignment-free sequence-to-sequence training
• End-to-end models: Listen-Attend-Spell (LAS), Whisper

Machine Translation
• Sequence-to-Sequence (Seq2Seq): Encoder LSTM encodes source sentence, Decoder LSTM
generates target
• Attention Mechanism: Allows decoder to focus on relevant encoder states at each step
• Teacher Forcing: During training, feed ground truth tokens to decoder to stabilize learning
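
One decoding step of attention can be sketched as follows. This assumes simple dot-product scoring, which is only one of several scoring functions used in Seq2Seq models, and the vectors are made up for illustration.

```python
import numpy as np

def attend(decoder_state, encoder_states):
    """Dot-product attention: weight encoder states by similarity to the decoder state."""
    scores = encoder_states @ decoder_state      # one score per source position, shape (T,)
    weights = np.exp(scores - scores.max())      # numerically stable softmax
    weights = weights / weights.sum()
    context = weights @ encoder_states           # weighted sum of encoder states
    return context, weights

encoder_states = np.array([[1.0, 0.0],
                           [0.0, 1.0],
                           [0.7, 0.7]])
decoder_state = np.array([0.0, 2.0])
context, weights = attend(decoder_state, encoder_states)
print(weights.round(3))   # the largest weight falls on the second source position
print(context.round(3))
```

The context vector is then fed to the decoder together with its own state when predicting the next target token.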

Other Applications
• Time Series Forecasting: Stock prices, weather prediction, energy consumption
• Sentiment Analysis: Classify text as positive/negative/neutral
• Named Entity Recognition (NER): Identify names, places, organizations in text
• Music Generation: Generate melodies and harmonies step by step
• Video Captioning: Combine CNN (visual) + LSTM (language) for description generation


💡 Modern NLP has largely moved from RNNs to Transformers (self-attention), which process
sequences in parallel and capture long-range dependencies more effectively. However, RNNs
remain essential for understanding the foundations of sequence modeling and are still used in
edge/embedded systems where memory efficiency matters.

Quick Reference: Comparing Sequence Models


Model | Long-range Dependencies | Parallelizable | Parameters | Best Use Case
Vanilla RNN | Poor | No | Low | Very short sequences
LSTM | Good | No | Medium | Medium-length sequences
GRU | Good | No | Low | Efficient sequence modeling
BiRNN/BiLSTM | Good (both dirs) | No | 2x LSTM | Classification, NER
Transformer | Excellent | Yes | High | NLP, Vision (ViT), modern tasks

End of Study Guide — Machine Learning & Deep Learning, Units I–VI
