0% found this document useful (0 votes)
11 views9 pages

Supervised Learning: Linear Regression Guide

Unit II covers supervised learning, detailing its definition, types (regression and classification), and foundational algorithms such as linear regression and gradient descent. It explores various models including logistic regression, naive Bayes, support vector machines, decision trees, and random forests, along with their advantages and limitations. The unit concludes with a comparative analysis of supervised learning models and their applications in real-world scenarios.

Uploaded by

nivunivi26
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
11 views9 pages

Supervised Learning: Linear Regression Guide

Unit II covers supervised learning, detailing its definition, types (regression and classification), and foundational algorithms such as linear regression and gradient descent. It explores various models including logistic regression, naive Bayes, support vector machines, decision trees, and random forests, along with their advantages and limitations. The unit concludes with a comparative analysis of supervised learning models and their applications in real-world scenarios.

Uploaded by

nivunivi26
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd

UNIT II – SUPERVISED LEARNING

1. Introduction to Supervised Learning


Supervised learning is one of the most widely used paradigms in machine
learning. In supervised learning, the learning algorithm is provided with a
labeled dataset, where each training example consists of an input vector and
a corresponding target output. The objective of supervised learning is to
learn a mapping from inputs to outputs that can accurately predict the
output for unseen inputs.
Supervised learning problems are broadly classified into: - Regression
problems, where the output is continuous - Classification problems,
where the output is discrete or categorical
The success of supervised learning relies on the availability of high-quality
labeled data and appropriate model selection.

2. Linear Regression Models (Extremely Detailed)


Linear Regression is one of the most fundamental and widely used
supervised learning algorithms. It serves as the foundation for understanding
more complex models and optimization techniques used in machine
learning. Despite its simplicity, linear regression is powerful, interpretable,
and forms the basis for many real-world applications.

2.1 Problem Formulation


In linear regression, the objective is to model the relationship between an
input variable (or variables) and a continuous output variable. Given a
dataset consisting of input-output pairs:
(x₁, y₁), (x₂, y₂), …, (xₙ, yₙ)
where xᵢ represents the input features and yᵢ represents the corresponding
target value, the task is to learn a function that can predict y for unseen
values of x.
2.2 Simple Linear Regression
Simple linear regression considers only a single input variable. The model
assumes that the relationship between input and output can be
approximated by a straight line.
The hypothesis (model) is defined as:
h(x) = w₀ + w₁x
where: - w₀ is the intercept or bias term - w₁ is the slope or weight parameter
This equation represents a straight line in a two-dimensional plane.

2.3 Interpretation of Parameters


 Intercept (w₀): Represents the predicted value of y when x = 0
 Slope (w₁): Represents the rate of change of y with respect to x
These parameters are learned from data such that the line best fits the given
data points.

2.4 Error and Loss Function


To measure how well the model fits the data, an error function is defined. For
a single data point, the error is:
eᵢ = h(xᵢ) − yᵢ
A commonly used loss function is the squared error loss, defined as:
L = (h(xᵢ) − yᵢ)²
The squared error penalizes large deviations more heavily and results in a
smooth, differentiable cost function.

2.5 Cost Function for Linear Regression


For n training examples, the cost function (Mean Squared Error) is defined
as:
J(w₀, w₁) = (1/2n) Σ (h(xᵢ) − yᵢ)²
The factor 1/2 is included for mathematical convenience during
differentiation.
The objective of learning is to find values of w₀ and w₁ that minimize J(w₀,
w₁).

2.6 Geometric Interpretation


The cost function represents a surface over the parameter space (w₀, w₁).
For linear regression, this surface is a convex paraboloid. The global
minimum corresponds to the optimal parameter values.
Because the cost surface is convex, linear regression has a unique global
minimum, ensuring stable convergence.

2.7 Derivation of Normal Equation (Single Variable)


To find the optimal parameters analytically, we compute the partial
derivatives of the cost function with respect to w₀ and w₁ and set them to
zero.
∂J/∂w₀ = (1/n) Σ (h(xᵢ) − yᵢ)
∂J/∂w₁ = (1/n) Σ (h(xᵢ) − yᵢ)xᵢ
Setting both derivatives to zero yields two linear equations. Solving these
equations gives the optimal values of w₀ and w₁. These equations are known
as the normal equations.

2.8 Limitations of Simple Linear Regression


 Can model only linear relationships
 Sensitive to outliers
 Poor performance for high-dimensional data
These limitations motivate the extension to multiple linear regression.

2.9 Multiple Linear Regression


Multiple linear regression considers more than one input feature. The
hypothesis function is:
h(x) = w₀ + w₁x₁ + w₂x₂ + … + wₙxₙ
This model represents a hyperplane in n-dimensional feature space.
2.10 Vector and Matrix Representation
Using vector notation, the hypothesis can be written as:
h(x) = wᵀx
where: - w = [w₀, w₁, w₂, …, wₙ]ᵀ - x = [1, x₁, x₂, …, xₙ]ᵀ
This compact representation simplifies mathematical analysis and
implementation.

2.11 Cost Function in Matrix Form


The cost function for multiple linear regression can be expressed as:
J(w) = (1/2n)(Xw − y)ᵀ(Xw − y)
where X is the design matrix and y is the target vector.

2.12 Derivation of Normal Equation (Matrix Form)


To minimize the cost function, we take the gradient with respect to w and set
it to zero:
∇J(w) = (1/n)Xᵀ(Xw − y) = 0
Solving for w yields:
w = (XᵀX)⁻¹Xᵀy
This closed-form solution provides the optimal parameters when XᵀX is
invertible.

2.13 Computational Considerations


 Matrix inversion is computationally expensive for large datasets
 Numerical instability may arise if XᵀX is ill-conditioned
These issues motivate the use of iterative optimization methods such as
gradient descent.
2.14 Practical Example (Conceptual)
Consider predicting house prices using features such as area, number of
rooms, and age of the building. Multiple linear regression models the
contribution of each feature to the final price.

2.15 Advantages and Disadvantages of Linear Regression


Advantages: - Simple and interpretable - Fast to train - Works well for
linearly related data
Disadvantages: - Cannot model complex patterns - Sensitive to outliers -
Assumes linearity

2.16 Summary of Linear Regression


Linear regression provides a strong foundation for supervised learning. Its
mathematical simplicity, interpretability, and analytical solutions make it an
essential starting point for understanding more advanced machine learning
models.

3. Gradient Descent Optimization


Gradient descent is an iterative optimization algorithm used to minimize the
cost function when a closed-form solution is computationally expensive.

3.1 Batch Gradient Descent


In batch gradient descent, model parameters are updated using the gradient
computed over the entire training dataset.
Update rule:
wⱼ := wⱼ − α ∂J/∂wⱼ
where α is the learning rate.

3.2 Stochastic Gradient Descent


Stochastic gradient descent updates parameters using one training example
at a time. This introduces noise in updates but often converges faster for
large datasets.

3.3 Convergence Properties


The convergence of gradient descent depends on the choice of learning rate.
A very small learning rate leads to slow convergence, while a very large
learning rate may cause divergence.

4. Bayesian Linear Regression


Bayesian linear regression introduces probability distributions over model
parameters instead of point estimates.

4.1 Motivation for Bayesian Approach


Bayesian methods provide uncertainty estimates and incorporate prior
knowledge into learning.

4.2 Prior and Posterior Distributions


Assuming a Gaussian prior over weights and Gaussian noise in observations,
the posterior distribution over weights is also Gaussian.

4.3 Predictive Distribution


The Bayesian framework produces a predictive distribution rather than a
single prediction, offering uncertainty quantification.

5. Linear Classification Models


Linear classification models aim to separate data into classes using a linear
decision boundary.

5.1 Discriminant Functions


A discriminant function assigns a score to each class, and the class with the
highest score is chosen.

6. Perceptron Algorithm
The perceptron is one of the earliest linear classification algorithms.

6.1 Perceptron Model


The perceptron computes a weighted sum of inputs and applies a step
function.

6.2 Learning Algorithm


Weights are updated iteratively when misclassification occurs.

6.3 Convergence Theorem


The perceptron converges if the data is linearly separable.

7. Logistic Regression
Logistic regression is a probabilistic discriminative model used for binary
classification.

7.1 Logistic Function


The logistic (sigmoid) function maps real-valued inputs to the range (0,1).

7.2 Cost Function and Derivation


The cost function is derived from maximum likelihood estimation.

7.3 Gradient Descent for Logistic Regression


Parameters are optimized using gradient descent.

8. Naive Bayes Classifier


Naive Bayes is a probabilistic generative model based on Bayes’ theorem.

8.1 Bayes’ Theorem


Bayes’ theorem relates conditional and marginal probabilities.

8.2 Naive Independence Assumption


Features are assumed to be conditionally independent given the class.

8.3 Types of Naive Bayes


 Gaussian Naive Bayes
 Multinomial Naive Bayes
 Bernoulli Naive Bayes

9. Support Vector Machines


Support Vector Machines are maximum margin classifiers.
9.1 Maximum Margin Concept
SVM seeks a hyperplane that maximizes the margin between classes.

9.2 Hard Margin SVM


Applicable when data is perfectly separable.

9.3 Soft Margin SVM


Introduces slack variables to handle non-separable data.

9.4 Kernel Trick


Kernels allow SVMs to learn non-linear decision boundaries.

10. Decision Trees


Decision trees classify data by recursively partitioning the feature space.

10.1 Tree Structure


Nodes represent tests, branches represent outcomes, and leaves represent
class labels.

10.2 Splitting Criteria


 Information Gain
 Gini Index
10.3 Overfitting and Pruning
Pruning reduces model complexity and improves generalization.

11. Random Forests


Random forests are ensemble models composed of multiple decision trees.

11.1 Bagging Principle


Each tree is trained on a bootstrap sample of the data.

11.2 Feature Randomness


Random feature selection decorrelates trees.

11.3 Advantages
Random forests improve accuracy and robustness.
12. Comparative Analysis of Supervised Learning Models
Different supervised learning algorithms have different inductive biases and
suitability.

13. Case Studies and Applications


Supervised learning is used in spam detection, credit scoring, medical
diagnosis, and image classification.

14. Summary of Unit II


This unit presented an in-depth study of supervised learning techniques,
covering regression, classification, probabilistic models, margin-based
classifiers, and tree-based methods.

End of UNIT II

You might also like