UNIT II – SUPERVISED LEARNING
1. Introduction to Supervised Learning
Supervised learning is one of the most widely used paradigms in machine
learning. In supervised learning, the learning algorithm is provided with a
labeled dataset, where each training example consists of an input vector and
a corresponding target output. The objective of supervised learning is to
learn a mapping from inputs to outputs that can accurately predict the
output for unseen inputs.
Supervised learning problems are broadly classified into: - Regression
problems, where the output is continuous - Classification problems,
where the output is discrete or categorical
The success of supervised learning relies on the availability of high-quality
labeled data and appropriate model selection.
2. Linear Regression Models (Extremely Detailed)
Linear Regression is one of the most fundamental and widely used
supervised learning algorithms. It serves as the foundation for understanding
more complex models and optimization techniques used in machine
learning. Despite its simplicity, linear regression is powerful, interpretable,
and forms the basis for many real-world applications.
2.1 Problem Formulation
In linear regression, the objective is to model the relationship between an
input variable (or variables) and a continuous output variable. Given a
dataset consisting of input-output pairs:
(x₁, y₁), (x₂, y₂), …, (xₙ, yₙ)
where xᵢ represents the input features and yᵢ represents the corresponding
target value, the task is to learn a function that can predict y for unseen
values of x.
2.2 Simple Linear Regression
Simple linear regression considers only a single input variable. The model
assumes that the relationship between input and output can be
approximated by a straight line.
The hypothesis (model) is defined as:
h(x) = w₀ + w₁x
where: - w₀ is the intercept or bias term - w₁ is the slope or weight parameter
This equation represents a straight line in a two-dimensional plane.
2.3 Interpretation of Parameters
Intercept (w₀): Represents the predicted value of y when x = 0
Slope (w₁): Represents the rate of change of y with respect to x
These parameters are learned from data such that the line best fits the given
data points.
2.4 Error and Loss Function
To measure how well the model fits the data, an error function is defined. For
a single data point, the error is:
eᵢ = h(xᵢ) − yᵢ
A commonly used loss function is the squared error loss, defined as:
L = (h(xᵢ) − yᵢ)²
The squared error penalizes large deviations more heavily and results in a
smooth, differentiable cost function.
2.5 Cost Function for Linear Regression
For n training examples, the cost function (Mean Squared Error) is defined
as:
J(w₀, w₁) = (1/2n) Σ (h(xᵢ) − yᵢ)²
The factor 1/2 is included for mathematical convenience during
differentiation.
The objective of learning is to find values of w₀ and w₁ that minimize J(w₀,
w₁).
2.6 Geometric Interpretation
The cost function represents a surface over the parameter space (w₀, w₁).
For linear regression, this surface is a convex paraboloid. The global
minimum corresponds to the optimal parameter values.
Because the cost surface is convex, linear regression has a unique global
minimum, ensuring stable convergence.
2.7 Derivation of Normal Equation (Single Variable)
To find the optimal parameters analytically, we compute the partial
derivatives of the cost function with respect to w₀ and w₁ and set them to
zero.
∂J/∂w₀ = (1/n) Σ (h(xᵢ) − yᵢ)
∂J/∂w₁ = (1/n) Σ (h(xᵢ) − yᵢ)xᵢ
Setting both derivatives to zero yields two linear equations. Solving these
equations gives the optimal values of w₀ and w₁. These equations are known
as the normal equations.
2.8 Limitations of Simple Linear Regression
Can model only linear relationships
Sensitive to outliers
Poor performance for high-dimensional data
These limitations motivate the extension to multiple linear regression.
2.9 Multiple Linear Regression
Multiple linear regression considers more than one input feature. The
hypothesis function is:
h(x) = w₀ + w₁x₁ + w₂x₂ + … + wₙxₙ
This model represents a hyperplane in n-dimensional feature space.
2.10 Vector and Matrix Representation
Using vector notation, the hypothesis can be written as:
h(x) = wᵀx
where: - w = [w₀, w₁, w₂, …, wₙ]ᵀ - x = [1, x₁, x₂, …, xₙ]ᵀ
This compact representation simplifies mathematical analysis and
implementation.
2.11 Cost Function in Matrix Form
The cost function for multiple linear regression can be expressed as:
J(w) = (1/2n)(Xw − y)ᵀ(Xw − y)
where X is the design matrix and y is the target vector.
2.12 Derivation of Normal Equation (Matrix Form)
To minimize the cost function, we take the gradient with respect to w and set
it to zero:
∇J(w) = (1/n)Xᵀ(Xw − y) = 0
Solving for w yields:
w = (XᵀX)⁻¹Xᵀy
This closed-form solution provides the optimal parameters when XᵀX is
invertible.
2.13 Computational Considerations
Matrix inversion is computationally expensive for large datasets
Numerical instability may arise if XᵀX is ill-conditioned
These issues motivate the use of iterative optimization methods such as
gradient descent.
2.14 Practical Example (Conceptual)
Consider predicting house prices using features such as area, number of
rooms, and age of the building. Multiple linear regression models the
contribution of each feature to the final price.
2.15 Advantages and Disadvantages of Linear Regression
Advantages: - Simple and interpretable - Fast to train - Works well for
linearly related data
Disadvantages: - Cannot model complex patterns - Sensitive to outliers -
Assumes linearity
2.16 Summary of Linear Regression
Linear regression provides a strong foundation for supervised learning. Its
mathematical simplicity, interpretability, and analytical solutions make it an
essential starting point for understanding more advanced machine learning
models.
3. Gradient Descent Optimization
Gradient descent is an iterative optimization algorithm used to minimize the
cost function when a closed-form solution is computationally expensive.
3.1 Batch Gradient Descent
In batch gradient descent, model parameters are updated using the gradient
computed over the entire training dataset.
Update rule:
wⱼ := wⱼ − α ∂J/∂wⱼ
where α is the learning rate.
3.2 Stochastic Gradient Descent
Stochastic gradient descent updates parameters using one training example
at a time. This introduces noise in updates but often converges faster for
large datasets.
3.3 Convergence Properties
The convergence of gradient descent depends on the choice of learning rate.
A very small learning rate leads to slow convergence, while a very large
learning rate may cause divergence.
4. Bayesian Linear Regression
Bayesian linear regression introduces probability distributions over model
parameters instead of point estimates.
4.1 Motivation for Bayesian Approach
Bayesian methods provide uncertainty estimates and incorporate prior
knowledge into learning.
4.2 Prior and Posterior Distributions
Assuming a Gaussian prior over weights and Gaussian noise in observations,
the posterior distribution over weights is also Gaussian.
4.3 Predictive Distribution
The Bayesian framework produces a predictive distribution rather than a
single prediction, offering uncertainty quantification.
5. Linear Classification Models
Linear classification models aim to separate data into classes using a linear
decision boundary.
5.1 Discriminant Functions
A discriminant function assigns a score to each class, and the class with the
highest score is chosen.
6. Perceptron Algorithm
The perceptron is one of the earliest linear classification algorithms.
6.1 Perceptron Model
The perceptron computes a weighted sum of inputs and applies a step
function.
6.2 Learning Algorithm
Weights are updated iteratively when misclassification occurs.
6.3 Convergence Theorem
The perceptron converges if the data is linearly separable.
7. Logistic Regression
Logistic regression is a probabilistic discriminative model used for binary
classification.
7.1 Logistic Function
The logistic (sigmoid) function maps real-valued inputs to the range (0,1).
7.2 Cost Function and Derivation
The cost function is derived from maximum likelihood estimation.
7.3 Gradient Descent for Logistic Regression
Parameters are optimized using gradient descent.
8. Naive Bayes Classifier
Naive Bayes is a probabilistic generative model based on Bayes’ theorem.
8.1 Bayes’ Theorem
Bayes’ theorem relates conditional and marginal probabilities.
8.2 Naive Independence Assumption
Features are assumed to be conditionally independent given the class.
8.3 Types of Naive Bayes
Gaussian Naive Bayes
Multinomial Naive Bayes
Bernoulli Naive Bayes
9. Support Vector Machines
Support Vector Machines are maximum margin classifiers.
9.1 Maximum Margin Concept
SVM seeks a hyperplane that maximizes the margin between classes.
9.2 Hard Margin SVM
Applicable when data is perfectly separable.
9.3 Soft Margin SVM
Introduces slack variables to handle non-separable data.
9.4 Kernel Trick
Kernels allow SVMs to learn non-linear decision boundaries.
10. Decision Trees
Decision trees classify data by recursively partitioning the feature space.
10.1 Tree Structure
Nodes represent tests, branches represent outcomes, and leaves represent
class labels.
10.2 Splitting Criteria
Information Gain
Gini Index
10.3 Overfitting and Pruning
Pruning reduces model complexity and improves generalization.
11. Random Forests
Random forests are ensemble models composed of multiple decision trees.
11.1 Bagging Principle
Each tree is trained on a bootstrap sample of the data.
11.2 Feature Randomness
Random feature selection decorrelates trees.
11.3 Advantages
Random forests improve accuracy and robustness.
12. Comparative Analysis of Supervised Learning Models
Different supervised learning algorithms have different inductive biases and
suitability.
13. Case Studies and Applications
Supervised learning is used in spam detection, credit scoring, medical
diagnosis, and image classification.
14. Summary of Unit II
This unit presented an in-depth study of supervised learning techniques,
covering regression, classification, probabilistic models, margin-based
classifiers, and tree-based methods.
End of UNIT II