Unit2 - Lecturenotes

The document covers supervised learning in machine learning, detailing its concepts, algorithms, and applications. It explains the distinctions between classification and regression algorithms, the steps involved in supervised learning, and specific models like Bayesian linear regression, gradient descent, and logistic regression. Additionally, it discusses linear classification models, including the perceptron algorithm and probabilistic models, emphasizing their practical applications in various fields.

Uploaded by

ddnandu3

AM3403 Machine Learning: Concepts and Application

LECTURE NOTES

UNIT-2

SUPERVISED LEARNING

Syllabus

Bayesian linear regression, gradient descent, Linear Classification Models: Discriminant
function, Perceptron algorithm, Support vector machine, Decision Tree, Random Forests,
Instance Based Learning: KNN. Probabilistic discriminative model: Logistic regression,
Probabilistic generative model: Naive Bayes, Maximum margin classifier

2.1 What is Supervised Machine Learning?

Supervised machine learning learns patterns and relationships between input and output data. It
is defined by its use of labeled data: a dataset containing many examples, each pairing input
features with a known target. Specifically, a supervised learning algorithm takes a known set of
input data and known responses to that data (outputs), and trains a model to generate reasonable
predictions for the response to new data. This process is referred to as training or fitting.

There are two types of supervised learning algorithms:

• Classification
• Regression

Classification Algorithms
Classification algorithms are used for predicting discrete outcomes. If the outcome can take two
possible values, such as True or False, Default or No Default, Yes or No, it is known as Binary
Classification. When the outcome contains more than two possible values, it is known as Multiclass
Classification. There are many machine learning algorithms that can be used for classification tasks:

• Logistic regression
• Support vector machines (SVM)
• Neural networks
• Naïve Bayes classifier
• Decision trees
• Discriminant analysis
• Nearest neighbors (kNN)
• Ensemble Classification
• Generalized Additive Model (GAM)

Regression Algorithms
Regression is a type of supervised machine learning where algorithms learn from the data to
predict continuous values such as sales, salary, weight, or temperature. For example, given a
dataset containing features of houses (such as lot size, number of bedrooms, number of baths, and
neighborhood) together with the price of each house, a regression algorithm can be trained to
learn the relationship between the features and the price.

Common regression algorithms include:

• Linear regression
• Nonlinear regression
• Generalized linear models
• Decision trees
• Neural networks
• Gaussian Process Regression
• Support Vector Machine Regression
• Ensemble Regression

Steps in Supervised Learning

Supervised learning involves training a model to learn patterns from labeled data and making
predictions on new inputs. While different algorithms have unique implementations, the overall
process follows a structured workflow:

1. Data Preparation

The first step in supervised learning is organizing the input data:

• The dataset consists of an input feature matrix X (where each row represents an observation and
each column represents a feature) and an output response vector Y.
• Missing values in X or Y should be handled appropriately, either by ignoring incomplete rows or
by imputing the missing data.
• The response variable Y varies based on the task:
o Regression: Y is a numeric vector.
o Classification: Y contains categorical labels, either binary or multi-class.

2. Choosing an Algorithm

The selection of a suitable learning algorithm depends on multiple factors, including:

 Training speed: Some models train faster than others, depending on complexity and dataset size.
 Memory usage: Resource-efficient algorithms are preferable for large datasets.
 Predictive accuracy: The model should generalize well to unseen data.
 Interpretability: Some models (e.g., decision trees) provide clear insights, while others (e.g.,
deep learning) act as black boxes.

3. Model Training (Fitting)

The training process involves applying the chosen algorithm to fit the model using the given
dataset. Common types of models include:

 Decision Trees
 Linear and Logistic Regression
 Support Vector Machines (SVM)
 Neural Networks
 k-Nearest Neighbors (k-NN)
 Naïve Bayes Classifier
 Ensemble Methods (e.g., Random Forest, Boosting)

Each algorithm has its own method for fitting a model to the training data.

4. Model Validation

To assess model performance, different validation techniques can be used:

 Resubstitution Error: Evaluating the model on the same training data.

 Cross-Validation: Splitting data into training and validation sets multiple times to
estimate performance on new data.
 Out-of-Bag Error: Specific to ensemble methods like bagging, evaluating performance
using data points not included in each subset during training.
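
As an illustration of the cross-validation idea above, the following is a minimal sketch of how k-fold train/validation splits could be generated. The function name `k_fold_indices` is hypothetical, and the sketch assumes the samples have already been shuffled; it is not tied to any particular library.

```python
def k_fold_indices(n_samples, k):
    """Return k (train_indices, validation_indices) pairs covering range(n_samples)."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(indices[start:start + size])
        start += size
    splits = []
    for i in range(k):
        validation = folds[i]
        # Every fold except fold i is used for training.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, validation))
    return splits


for train, validation in k_fold_indices(n_samples=10, k=3):
    print("train:", train, "validation:", validation)
```

Each sample appears in exactly one validation fold, so averaging the validation error over the k splits gives an estimate of performance on new data.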

5. Model Evaluation and Optimization

Once validated, the model can be fine-tuned for better accuracy, efficiency, or robustness. This
can involve:

 Adjusting hyperparameters (e.g., learning rate, tree depth).

 Pruning or regularizing the model to reduce complexity.
 Trying alternative algorithms for comparison.

For models that support optimization, compacting the model by removing unnecessary training
data or parameters can improve efficiency.

6. Making Predictions

After training and validating the model, it is used to make predictions on new data:

 For classification tasks, the model assigns labels to new observations.

 For regression tasks, the model predicts numerical values based on input features.

By following the above steps, a supervised learning model can be effectively developed,
validated, and applied to real-world problems.

2.2 Bayesian linear regression

Bayesian Linear Regression is an extension of standard linear regression that incorporates
probability distributions over parameters, allowing for uncertainty estimation in predictions.
Instead of finding a single best-fit line, it provides a posterior distribution over possible
regression models.

In standard linear regression, we model the relationship between input X and output y as:
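
A common way to write this model is shown below. The notation is an assumption made here for illustration: w is the weight vector, x an input vector, and ε Gaussian noise with variance σ².

```latex
y = \mathbf{w}^{\top}\mathbf{x} + \varepsilon,
\qquad \varepsilon \sim \mathcal{N}(0, \sigma^{2})
```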

In Bayesian Linear Regression, we treat the weights w as random variables with a prior distribution,
rather than fixed values. The goal is to compute the posterior distribution over w given the data.

Bayesian Linear Regression Formulas

To predict a new target value ŷ for an unseen input x in Bayesian Linear Regression (BLR), we compute
the predictive distribution, which gives both the expected value (mean prediction) and the uncertainty
(variance).
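
For reference, one standard textbook form of these quantities is sketched below. It assumes a zero-mean Gaussian prior with precision α and Gaussian noise with variance σ²; the symbols X (design matrix), y (target vector), m_N and S_N (posterior mean and covariance) are notation introduced here, not taken from the notes.

```latex
\begin{aligned}
\text{Prior:}\quad & p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{0},\, \alpha^{-1}\mathbf{I}) \\
\text{Likelihood:}\quad & p(\mathbf{y} \mid X, \mathbf{w}) = \mathcal{N}(\mathbf{y} \mid X\mathbf{w},\, \sigma^{2}\mathbf{I}) \\
\text{Posterior:}\quad & p(\mathbf{w} \mid X, \mathbf{y}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N,\, \mathbf{S}_N) \\
& \mathbf{S}_N^{-1} = \alpha\mathbf{I} + \sigma^{-2} X^{\top}X,
  \qquad \mathbf{m}_N = \sigma^{-2}\,\mathbf{S}_N X^{\top}\mathbf{y} \\
\text{Predictive:}\quad & p(y_* \mid \mathbf{x}_*, X, \mathbf{y})
  = \mathcal{N}\!\left(y_* \mid \mathbf{m}_N^{\top}\mathbf{x}_*,\;
    \sigma^{2} + \mathbf{x}_*^{\top}\mathbf{S}_N\mathbf{x}_*\right)
\end{aligned}
```

The predictive variance has two parts: the noise variance σ² and a term that reflects the remaining uncertainty about w, which shrinks as more data is observed.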
2.3 Gradient Descent

Gradient Descent is one of the most commonly used optimization algorithms for training machine
learning models; it works by minimizing the error between actual and expected results. In
mathematical terminology, optimization refers to the task of minimizing or maximizing an
objective function f(x) parameterized by x. Similarly, in machine learning, optimization is the
task of minimizing the cost function parameterized by the model's parameters.

Gradient-based methods locate a local minimum or local maximum of a function as follows:

o If we move in the direction of the negative gradient (away from the gradient) of the function
at the current point, we approach a local minimum of that function. This procedure is
gradient descent, also known as steepest descent.
o If we move in the direction of the positive gradient (towards the gradient) of the
function at the current point, we approach a local maximum of that function. This
procedure is known as gradient ascent.

The main objective of using a gradient descent algorithm is to minimize the cost function
using iteration. To achieve this goal, it performs two steps iteratively:

o Calculate the first-order derivative of the function to compute the gradient or slope of
that function at the current point.
o Move in the direction opposite to the gradient by a step equal to alpha times the
gradient, where alpha is the learning rate. It is a tuning parameter in the optimization
process that determines the length of the steps.

Working principle of Gradient Descent for linear regression

Consider a simple dataset with input (X) and output (Y) variables. The goal is to fit a line using
gradient descent.

Equation of a Line:

The objective is to determine optimal values for the intercept and slope by minimizing the error.

Loss Function:

We use the sum of squared residuals to measure how well the line fits the data:

Gradient Descent Update Rule:

To minimize the loss, we compute derivatives (gradients) of the loss function with respect to the
intercept and slope:

Using these gradients, we update the values iteratively:

This process repeats until the model converges to the optimal line.
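
In symbols, the line, loss function, gradients and update rule described above can be written as follows. The notation is assumed here for illustration: b is the intercept, m the slope, α the learning rate, and (x_i, y_i) the data points.

```latex
\begin{aligned}
\text{Line:}\quad & \hat{y}_i = b + m\,x_i \\
\text{Loss:}\quad & L(b, m) = \sum_{i=1}^{n} \bigl(y_i - (b + m\,x_i)\bigr)^{2} \\
\text{Gradients:}\quad & \frac{\partial L}{\partial b} = -2 \sum_{i=1}^{n} \bigl(y_i - (b + m\,x_i)\bigr),
\qquad \frac{\partial L}{\partial m} = -2 \sum_{i=1}^{n} x_i \bigl(y_i - (b + m\,x_i)\bigr) \\
\text{Update:}\quad & b \leftarrow b - \alpha\,\frac{\partial L}{\partial b},
\qquad m \leftarrow m - \alpha\,\frac{\partial L}{\partial m}
\end{aligned}
```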

Example

Consider a simple data set containing height and weight measurements for several different people.
Our objective is to fit a line to these data points using gradient descent. We start from the
generic equation for a line:

Predicted Height = Intercept + Slope * Weight

The goal is to find optimal values for the intercept and slope. For example, starting with
intercept = 0 and an initial guess for the slope, we use the sum of squared residuals as a loss
function to determine how well that initial line fits the data.

To find the optimal values for the intercept and slope, we plug the equation for the predicted
height into the sum of squared residuals:

Sum of squared residuals = Σ (Observed Height – [Intercept + Slope * Weight])²

Then we find the derivative of the sum of squared residuals with respect to the intercept and the
slope, and use these derivatives to take a step. We repeat this many times until the step size
becomes very small.
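
The procedure in this example can be sketched in a few lines of Python. This is a minimal illustration rather than the notes' own implementation: the function name `fit_line`, the data values, the starting slope, the learning rate and the stopping tolerance are all assumptions chosen so that the example is easy to check.

```python
def fit_line(weights, heights, learning_rate=0.01, max_steps=10000, tol=1e-9):
    """Fit Predicted Height = intercept + slope * weight by gradient descent."""
    intercept, slope = 0.0, 1.0  # arbitrary starting guess (assumption)
    for _ in range(max_steps):
        residuals = [h - (intercept + slope * w) for w, h in zip(weights, heights)]
        # Derivatives of the sum of squared residuals.
        grad_intercept = -2.0 * sum(residuals)
        grad_slope = -2.0 * sum(w * r for w, r in zip(weights, residuals))
        # Step size = learning rate * gradient; move against the gradient.
        step_intercept = learning_rate * grad_intercept
        step_slope = learning_rate * grad_slope
        intercept -= step_intercept
        slope -= step_slope
        # Stop once both steps have become very small.
        if max(abs(step_intercept), abs(step_slope)) < tol:
            break
    return intercept, slope


# Hypothetical data lying exactly on the line height = 1 + 2 * weight.
weights = [0.0, 1.0, 2.0, 3.0]
heights = [1.0, 3.0, 5.0, 7.0]
intercept, slope = fit_line(weights, heights)
print(round(intercept, 3), round(slope, 3))
```

If the learning rate is too large the steps overshoot and the loss grows instead of shrinking, so in practice the rate is tuned or reduced over time.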

The different types of gradient descent are:

• Batch Gradient Descent

o Updates weights after computing the gradient on the entire dataset
o Stable but computationally expensive

• Stochastic Gradient Descent

o Updates weights after each individual data point
o Much faster per update but introduces randomness (noisy updates)

• Mini-batch Gradient Descent

o Compromise between Batch and Stochastic Gradient Descent
o Uses a small batch of data to update weights

2.4 Linear Classification Models
Linear classification models are a class of machine learning models used for classifying data
points by drawing a linear decision boundary between different classes. These models assume
that classes can be separated by a straight line (in 2D), a plane (in 3D), or a hyperplane (in
higher dimensions).

Working Principle of Linear Classification

A linear classifier models the relationship between input features and class labels by computing a
weighted sum of input features and applying a threshold function to determine the class.
Mathematically, a linear classifier can be represented as:
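
One common way to write such a classifier is given below. The notation is assumed here for illustration: w is the weight vector, b the bias (threshold) term, and the two classes are labelled +1 and −1.

```latex
\hat{y} = \operatorname{sign}\!\left(\mathbf{w}^{\top}\mathbf{x} + b\right)
= \begin{cases}
+1 & \text{if } \mathbf{w}^{\top}\mathbf{x} + b \ge 0 \\
-1 & \text{otherwise}
\end{cases}
```

The decision boundary is the set of points where w^T x + b = 0.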
Types of Linear Classification Models

1. Perceptron
o A simple binary classifier based on a linear threshold function.
o Updates weights using the perceptron learning rule.
o Converges if data is linearly separable.
2. Logistic Regression
o Used for binary classification.
o Uses the sigmoid function to output probabilities.
o Optimized using Maximum Likelihood Estimation (MLE).
o Extended to Multinomial Logistic Regression for multiclass problems.
3. Linear Discriminant Analysis (LDA)
o Assumes Gaussian distributions for classes and equal covariance matrices.
o Finds a projection that maximizes class separability.
4. Support Vector Machine (SVM) with Linear Kernel
o Finds the optimal hyperplane that maximizes the margin between classes.
o Can handle non-linearly separable data using soft margin SVM.

Applications

Spam Detection (e.g., Logistic Regression for classifying emails as spam or not)
Sentiment Analysis (e.g., classifying movie reviews as positive or negative)

Medical Diagnosis (e.g., identifying disease presence based on patient features)

2.5 Discriminant Function

A discriminant function is a function used in pattern classifiers to partition the feature space
based on probabilities or equivalent functions, helping to determine the class to which a given
input belongs.

Linear Discriminant Analysis (LDA)

Linear Discriminant Analysis (LDA) is a supervised learning algorithm used for dimensionality
reduction and classification. It finds the optimal linear boundaries between different classes
while maximizing the separation between them. The objectives of LDA are:

• Maximize class separability: Find a projection that best separates different classes.
• Reduce dimensionality: Transform data into a lower-dimensional space while
retaining class information.
• Improve classification performance: The projected features can be a useful input to
models such as Naïve Bayes or Logistic Regression.

LDA projects high-dimensional data onto a lower-dimensional space by maximizing inter-class
variance while minimizing intra-class variance. This criterion is known as the Fisher
criterion.

The LDA algorithm works through the following steps:

a) Calculate the mean vector of each class.

b) Calculate the within-class scatter matrix and the between-class scatter matrix.
c) Use these matrices to compute the eigenvectors and eigenvalues, and project
the data onto the new space.
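
The quantities used in these steps are usually defined as follows. The symbols are assumptions introduced here: μ_c and N_c are the mean and size of class c, μ is the overall mean, S_W and S_B are the within-class and between-class scatter matrices, and w is a projection direction.

```latex
\begin{aligned}
\mathbf{S}_W &= \sum_{c} \sum_{\mathbf{x} \in c} (\mathbf{x} - \boldsymbol{\mu}_c)(\mathbf{x} - \boldsymbol{\mu}_c)^{\top} \\
\mathbf{S}_B &= \sum_{c} N_c\, (\boldsymbol{\mu}_c - \boldsymbol{\mu})(\boldsymbol{\mu}_c - \boldsymbol{\mu})^{\top} \\
J(\mathbf{w}) &= \frac{\mathbf{w}^{\top}\mathbf{S}_B\,\mathbf{w}}{\mathbf{w}^{\top}\mathbf{S}_W\,\mathbf{w}}
\end{aligned}
```

Maximizing J(w) leads to the eigenvectors of S_W^{-1} S_B with the largest eigenvalues, which is why step (c) computes eigenvectors and eigenvalues.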

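Benefits of using LDA:

a) LDA can be used directly for classification problems.
b) LDA is a powerful tool for dimensionality reduction.
c) Because it projects the data onto a lower-dimensional space, LDA can lessen the impact of
the "curse of dimensionality" on later modeling steps, although estimating its scatter
matrices still requires enough samples relative to the number of features.

2.6 Perceptron algorithm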

The Perceptron Algorithm is one of the simplest supervised learning algorithms for binary
classification.

It is a linear classifier that updates its weights based on misclassified data points using a
simple learning rule.

Working Principle of Perceptron

Given an input feature vector X and a weight vector W, the perceptron computes a weighted
sum and applies an activation function to classify the data.
The perceptron makes predictions using the formula:

The perceptron updates its weights whenever it misclassifies a point.
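
A minimal sketch of the perceptron is shown below. It assumes labels of +1 and −1, a step activation (predict +1 when w·x + b ≥ 0), and the classic update w ← w + η·y·x, b ← b + η·y applied only to misclassified points. The function names and the small AND-gate dataset are hypothetical choices for illustration.

```python
def predict(weights, bias, x):
    """Step activation applied to the weighted sum w.x + b."""
    activation = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if activation >= 0 else -1


def train_perceptron(X, y, learning_rate=1.0, max_epochs=100):
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, target in zip(X, y):
            if predict(weights, bias, x) != target:
                # Perceptron learning rule: move the boundary toward the point.
                weights = [w + learning_rate * target * xi for w, xi in zip(weights, x)]
                bias += learning_rate * target
                mistakes += 1
        if mistakes == 0:  # every point classified correctly: converged
            break
    return weights, bias


# Logical AND with labels +1 / -1 (linearly separable toy data).
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [-1, -1, -1, 1]
weights, bias = train_perceptron(X, y)
print([predict(weights, bias, x) for x in X])
```

On data that is not linearly separable (such as XOR) the loop never reaches zero mistakes and simply stops after `max_epochs`.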

Advantages

• Guaranteed to converge for linearly separable data.

• Simple and computationally efficient.

Disadvantages

• Fails on non-linearly separable data (e.g., XOR problem).

• Sensitive to learning rate and data order.

2.7 Probabilistic discriminative model

• The discriminative model aims to model the conditional distribution of the output
variable given the input variable.
• It directly estimates the posterior probability P(y∣X), which is the probability of a class
label y given an input X.
• These models focus on learning the decision boundary between classes rather than
modeling the underlying distribution of the data.
• These models are useful when the focus is on making accurate predictions rather than
generating new data.
• Logistic Regression, Softmax Regression (Multinomial Logistic Regression), and Conditional
Random Fields (CRFs) are examples of probabilistic discriminative models.

Logistic regression

Logistic regression is a statistical method used to analyze and predict a binary outcome (i.e., a
result with two possible values, such as Yes/No, 0/1, or Healthy/Sick). It helps determine
whether certain factors influence the likelihood of an event happening. For example, logistic
regression can be used to predict whether a patient has a disease based on blood test results. The
key components of logistic regression are:

Logistic Component:

• Instead of directly predicting the outcome (0 or 1), logistic regression models the log
odds of the outcome as a linear function of the predictors and converts it to a probability
using an S-shaped curve called the sigmoid function.
• This transformation ensures that the predicted probability always falls between 0 and 1.

Regression Component:

 It measures the relationship between the predictor variables (independent variables) and
the outcome (dependent variable).
 It helps quantify how much each predictor contributes to the likelihood of the event
happening.

Logistic regression is a form of regression analysis in which the outcome variable is binary
(dichotomous). It models such outcomes using predictor variables.
• Logistic component: Instead of modeling the outcome Y directly, the method models the log
odds of Y using the logistic function.
• Regression component: Quantifies the association between the outcome and the predictor
variables, and can be used to build predictive models as a function of the predictors.

Types of Logistic Regression:

• Simple Logistic Regression: Uses only one predictor variable.
• Multiple Logistic Regression: Uses multiple predictor variables.

Working principle of Logistic regression

 The logistic regression model does not predict the 0 or 1 outcome directly. Instead, it
estimates the probability of the outcome happening.
 It transforms the probability using the log odds (logit) function, making the relationship
between predictors and the outcome linear.
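
Written out, the two views of the model are equivalent. The notation below is assumed for illustration: p is the probability that y = 1, w the weight vector and b the intercept.

```latex
\begin{aligned}
p = P(y = 1 \mid \mathbf{x}) &= \sigma(\mathbf{w}^{\top}\mathbf{x} + b)
  = \frac{1}{1 + e^{-(\mathbf{w}^{\top}\mathbf{x} + b)}} \\
\operatorname{logit}(p) = \log\frac{p}{1 - p} &= \mathbf{w}^{\top}\mathbf{x} + b
\end{aligned}
```

The first line gives the S-shaped probability curve; the second shows that the log odds are linear in the predictors.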
Comparison with Linear Regression

 In linear regression, the outcome is a continuous number (e.g., predicting a person's weight).
 In logistic regression, the outcome is categorical (e.g., predicting whether a person has a
disease: Yes/No).

The key difference is how probabilities are modeled:

• A linear regression model would let the predicted probability change along a straight line,
so it can fall below 0 or rise above 1.

• Logistic regression assumes probability follows an S-shaped (sigmoid) curve, which stays
between 0 and 1.

While linear models are easy to interpret, logistic regression results are often explained in terms
of odds ratios, which can be less intuitive.
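
To make the sigmoid model concrete, here is a minimal sketch of simple (one-predictor) logistic regression trained by gradient descent on the average log loss. The function names, learning rate, number of steps and the tiny dataset are assumptions for illustration only.

```python
import math


def sigmoid(z):
    """S-shaped curve mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(xs, ys, learning_rate=0.1, steps=5000):
    """Fit P(y=1 | x) = sigmoid(w * x + b) by minimizing the average log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical test values (x) and disease labels (1 = sick, 0 = healthy).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0, 0, 1, 0, 1, 1]
w, b = train_logistic(xs, ys)
print("P(sick | x=1) =", round(sigmoid(w * 1.0 + b), 3))
print("P(sick | x=6) =", round(sigmoid(w * 6.0 + b), 3))
print("odds ratio per unit of x =", round(math.exp(w), 3))
```

The last line illustrates the odds-ratio interpretation: exp(w) is the factor by which the odds of the event change when x increases by one unit.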

2.9 Probabilistic Generative model

Generative models aim to model the joint distribution of the input and output variables. These
models generate new data based on the probability distribution of the original dataset.
Generative models are powerful because they can generate new data that resembles the training
data.

Naive Bayes

• Naïve Bayes is a probabilistic, generative model based on Bayes' theorem. It is widely
used for classification tasks and assumes that features are conditionally independent
given the class label (hence the term naïve).
• A Naive Bayes classifier is a program that predicts a class value given a set of
attributes.
• For each known class value:
1. Calculate probabilities for each attribute, conditional on the class value.
2. Use the product rule to obtain a joint conditional probability for the attributes.
3. Use Bayes' rule to derive conditional probabilities for the class variable.
4. Once this has been done for all class values, output the class with the highest
probability.

The above steps are explained mathematically as follows
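
In the usual notation (assumed here: c is a class value and x_1, ..., x_n are the attribute values), the steps correspond to:

```latex
\begin{aligned}
P(x_1, \dots, x_n \mid c) &= \prod_{i=1}^{n} P(x_i \mid c)
  && \text{(product rule under conditional independence)} \\
P(c \mid x_1, \dots, x_n) &= \frac{P(c)\,\prod_{i=1}^{n} P(x_i \mid c)}{P(x_1, \dots, x_n)}
  && \text{(Bayes' rule)} \\
\hat{c} &= \arg\max_{c}\; P(c) \prod_{i=1}^{n} P(x_i \mid c)
  && \text{(class with the highest probability)}
\end{aligned}
```

The denominator is the same for every class, so it can be ignored when choosing the most probable class.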

2.10 Maximum margin classifier

• The Maximum Margin Classifier is a fundamental concept in machine learning used for binary
classification.

• It is the basis of Support Vector Machines (SVMs) and aims to find a decision boundary that
maximizes the distance (margin) between the closest data points (support vectors) of each class.

Support Vector Machine

A support vector machine (SVM) is a supervised machine learning algorithm that classifies data
by finding an optimal line or hyperplane that maximizes the distance between the classes in an
N-dimensional space.

SVMs are used for both classification and regression tasks. Although they can handle
regression problems, they are particularly well suited to classification.

Support Vector Machine (SVM) Terminologies

• Hyperplane: A decision boundary separating different classes in feature space,
represented by the equation w·x + b = 0 in linear classification.
• Support Vectors: The closest data points to the hyperplane, crucial for determining the
hyperplane and margin in SVM.
• Margin: The distance between the hyperplane and the support vectors. SVM aims to
maximize this margin for better classification performance.
• Kernel: A function that implicitly maps data to a higher-dimensional space, enabling SVM
to handle non-linearly separable data.
• Hard Margin: A maximum-margin hyperplane that perfectly separates the data without
misclassifications.
• Soft Margin: Allows some misclassifications by introducing slack variables, balancing
margin maximization and misclassification penalties when data is not perfectly separable.
• C: A regularization parameter balancing margin maximization and misclassification
penalties. A higher C value enforces a stricter penalty for misclassifications.
• Hinge Loss: A loss function penalizing misclassified points or margin violations,
combined with regularization in SVM.
• Dual Problem: An equivalent formulation that solves for the Lagrange multipliers
associated with the support vectors, facilitating the kernel trick and efficient computation.

Support Vector Machine (SVM) working principle

• SVMs distinguish between two classes by finding the optimal hyperplane that maximizes
the margin between the closest data points of opposite classes.

• The number of features in the input data determines whether the hyperplane is a line (two
features), a plane (three features), or a hyperplane in an n-dimensional space.

• Since multiple hyperplanes can separate the classes, maximizing the margin enables the
algorithm to find the best decision boundary between them.

• The data points closest to the optimal hyperplane are known as support vectors, because
the margin boundaries pass through these points and they alone determine the maximal margin.

SVM can handle both linear and nonlinear classification tasks. When the data is not
linearly separable, kernel functions are used to transform the data into a higher-dimensional
space to enable linear separation. This application of kernel functions is known as
the “kernel trick”. The choice of kernel function, such as linear kernels, polynomial
kernels, radial basis function (RBF) kernels, or sigmoid kernels, depends on data
characteristics and the specific use case.

SVM Applications
• SVM has been used successfully in many real-world problems:

1. Text (and hypertext) categorization
2. Image classification
3. Bioinformatics (protein classification, cancer classification)
4. Hand-written character recognition
5. Detection of spam email

Problem 1: For the following figure, find a linear hyperplane (decision boundary)
that will separate the data.
Solution:
1. Define what an optimal hyperplane is: the one that maximizes the margin.
2. Extend the above definition to non-linearly separable problems by adding a penalty term
for misclassifications.
3. Map the data to a high-dimensional space where it is easier to classify with linear
decision surfaces, and reformulate the problem so that the data is mapped implicitly to this space.

Problem 2: From the following diagram, identify which data points (1, 2, 3, 4, 5)
are support vectors (if any), which have slack variables on the correct side of the classifier
(if any), and which have slack variables on the wrong side of the classifier (if any). State
which point will have the maximum penalty and why.
Solution:
• Data points 1 and 5 will have maximum penalty.
• Margin (m) is the gap between the data points and the classifier boundary, that is, the
minimum distance of any sample to the decision boundary. If the hyperplane is in canonical
form, the margin equals 1/||w||, the reciprocal of the length of the weight vector.
• Maximal margin classifier: A classifier in the family F that maximizes the margin. Maximizing
the margin is good according to intuition and PAC theory. It implies that only the support
vectors matter; other training examples are ignorable.
• What if the training set is not linearly separable? Slack variables can be added to allow
misclassification of difficult or noisy examples; the resulting margin is called a soft margin.
• A soft margin allows a few points to cross into the margin or over the hyperplane, allowing
misclassification.
• We penalize the crossover by looking at the number and distance of the misclassifications. This
is a trade-off between the hyperplane violations and the margin size. The total slack is
bounded through a cost parameter. Observations that lie far from the margin on the correct side
have little influence on the resulting classifier.
• All observations have an associated slack variable:
1. Slack variable = 0: the point is on the margin or on the correct side beyond it.
2. Slack variable > 0: the point is inside the margin or on the wrong side of the hyperplane.
3. C is the trade-off between the slack variable penalty and the margin.
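
The slack variables and the role of C can be illustrated with a short sketch. It assumes the standard soft-margin formulation, in which the slack of a point is max(0, 1 − y(w·x + b)) and the objective is ½‖w‖² + C·Σ slack. The hyperplane, the five points and the function names are hypothetical and are not the points from the diagram in Problem 2.

```python
def slack_variables(w, b, X, y):
    """Slack of each point: 0 if it is on or beyond its margin, positive otherwise."""
    slacks = []
    for x, label in zip(X, y):
        score = label * (sum(wi * xi for wi, xi in zip(w, x)) + b)
        slacks.append(max(0.0, 1.0 - score))  # hinge loss of the point
    return slacks


def soft_margin_objective(w, b, X, y, C):
    """0.5 * ||w||^2 (margin term) plus C times the total slack (penalty term)."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    return margin_term + C * sum(slack_variables(w, b, X, y))


# Hypothetical hyperplane x1 = 0 in canonical form, with margins at x1 = +1 and x1 = -1.
w, b = [1.0, 0.0], 0.0
X = [[2.0, 0.0], [1.0, 1.0], [0.5, 0.0], [-0.5, 0.0], [-2.0, 1.0]]
y = [1, 1, 1, 1, -1]

print(slack_variables(w, b, X, y))               # [0.0, 0.0, 0.5, 1.5, 0.0]
print(soft_margin_objective(w, b, X, y, C=1.0))  # 2.5
```

Here the second point lies exactly on its margin (slack 0), the third lies inside the margin on the correct side (slack between 0 and 1), and the fourth is on the wrong side of the hyperplane (slack greater than 1), so it receives the largest penalty.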
2.11 Decision Tree

A Decision Tree is a supervised learning algorithm used for both classification and regression.
It models decisions using a tree-like structure, where each internal node represents a feature test,
each branch represents an outcome of that test, and each leaf node represents a final decision
(class label or numerical value).

Key Concepts in Decision Trees

• Root Node: The starting point that represents the entire dataset.

• Internal Nodes: Represent feature-based decisions (splitting points).

• Branches: Represent possible outcomes of a decision.

• Leaf Nodes: Represent final predictions (class labels in classification, numerical values
in regression).

• Splitting: The process of dividing nodes based on feature values.

• Pruning: Removing unnecessary branches to prevent overfitting.

Working principle of Decision Tree

1. Select the best feature to split the dataset using a criterion (e.g., Information Gain, Gini
Index).

2. Split the dataset into subsets based on feature values.

3. Repeat recursively until:

a. A stopping condition is met (e.g., maximum depth, minimum samples per leaf).

b. All data points in a node belong to the same class (pure node).

4. Prune the tree (if necessary) to avoid overfitting.
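
Step 1 can be illustrated with a small sketch that scores candidate splits of one numeric feature by Gini index. The function names and the toy data are hypothetical; a full tree builder would apply the same search recursively to each resulting subset.

```python
from collections import Counter


def gini(labels):
    """Gini index of a set of class labels (0 means the node is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best split 'value < threshold'."""
    n = len(values)
    best = (None, gini(labels))  # no split: impurity of the unsplit node
    for threshold in sorted(set(values))[1:]:
        left = [lab for v, lab in zip(values, labels) if v < threshold]
        right = [lab for v, lab in zip(values, labels) if v >= threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        if weighted < best[1]:
            best = (threshold, weighted)
    return best


# Hypothetical feature values and class labels.
values = [1, 2, 3, 10, 11, 12]
labels = ["No", "No", "No", "Yes", "Yes", "Yes"]
print(gini(labels))                    # 0.5 for a 50/50 node
print(best_threshold(values, labels))  # (10, 0.0): both children are pure
```

Information Gain works the same way, with entropy in place of the Gini index.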

Example of a Decision Tree

Let us look at a sample decision tree given in Figure 13.2. The rectangular nodes, in our example
the Leave At, Road Block and Accident nodes, are the decision nodes. Leave At is the root of the
decision tree and has three arcs corresponding to the three values that Leave At can take, namely
10AM, 8AM, and 9AM. Road Block and Accident, the two internal nodes, are Boolean. The leaf
nodes correspond to the values of the target attribute, the time taken for travel, which
has the possible values Short, Medium and Long.

Building a Decision Tree

A decision tree is built by top-down tree construction and can then be refined by bottom-up
tree pruning. In top-down tree construction, all training examples start at the root, and we
recursively partition the examples by choosing one attribute each time. In bottom-up tree
pruning, we remove subtrees or branches, in a bottom-up manner, to improve the estimated
accuracy on new cases.

Decision Tree Classification Task

Now let us discuss the use of the decision tree for the classification task using the example of
determining whether a person will be a cheat or not.
The training data here consists of 10 samples (Figure 13.8). This example has three decision
attributes: Attribute1 is a categorical Boolean attribute, Attribute2 is a categorical
three-valued attribute, and Attribute3 is a continuous attribute. Continuous attributes increase
computational complexity, may result in prediction inaccuracy and can lead to overfitting of
data. Generally, continuous variables are converted into discrete intervals using “greater than
or equal to” and “less than” tests.

The decision tree corresponding to the training set is shown in Figure 13.8. Figure 13.9 shows
how induction uses the training samples to build the model, and how the learnt model is then
applied to deduce the class of the samples in the test set.

When we carry out deduction (Figure 13.10), we start at the root node of the decision tree, in
our case Refund. For the test example the Refund value is No, so we follow that path and reach
the decision node Marital Status, where the value for the example is Married. We follow
that path and end up with a target value of No. Therefore, for the test example, the person is
not a cheat.

Problem 1: Using the following feature tree, write decision rules for the majority class.

Solution:

Left Side: A feature tree combining two Boolean features. Each internal node
or split is labelled with a feature, and each edge emanating from a split is
labelled with a feature value. Each leaf therefore corresponds to a unique
combination of feature values. Also indicated in each leaf is the class
distribution derived from the training set
Right Side: A feature tree partitions the instance space into rectangular regions, one
for each leaf.
The leaves of the tree in the above figure could be labelled, from left to right, as ham
- spam - spam employing a simple decision rule called majority class.

• Left side: A feature tree with training set class distribution in the leaves.
• Right side: A decision tree obtained using the majority class decision rule.
2.12 Random Forest Tree
Random Forest is an ensemble learning method that combines multiple Decision Trees to
improve accuracy and reduce overfitting. It is used for classification and regression
tasks. Instead of relying on a single Decision Tree, Random Forest builds multiple trees and
– averages their predictions (for regression)
– uses majority voting (for classification).
Working principle of Random Forest
1. Bootstrap Sampling: Randomly select subsets of training data (with replacement).
2. Feature Selection: At each tree node, randomly choose a subset of features instead of
using all features.
3. Tree Construction: Grow a Decision Tree using the subset.
4. Ensemble Prediction:
– For Classification → Majority voting among trees.
– For Regression → Average prediction of all trees.

• Mathematical Representation
• Each tree in the Random Forest predicts an outcome hi(x) and the final output is
determined as:
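
A minimal sketch of the sampling and aggregation steps follows. In words, with T trees and predictions h_1(x), ..., h_T(x), classification returns the most frequent prediction and regression returns their mean. The function names and the example predictions are hypothetical; tree training itself is omitted.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement (one sample per tree)."""
    return [rng.choice(data) for _ in data]


def majority_vote(tree_predictions):
    """Classification: the most frequent class among the trees' predictions.

    Ties are resolved in favour of the class that was encountered first.
    """
    return Counter(tree_predictions).most_common(1)[0][0]


def average_prediction(tree_predictions):
    """Regression: the mean of the trees' numeric predictions."""
    return sum(tree_predictions) / len(tree_predictions)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
print(bootstrap_sample(list(range(10)), rng))

# Hypothetical outputs h_1(x), ..., h_5(x) from five trees for one input x.
print(majority_vote(["spam", "ham", "spam", "spam", "ham"]))  # spam
print(average_prediction([2.0, 4.0, 6.0]))                    # 4.0
```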

Advantages of Random Forest

* Can handle missing values (depending on the implementation)
* Reduces overfitting compared to a single Decision Tree
* Works well with high-dimensional data
* Can handle both classification & regression

Disadvantages:
– Can be computationally expensive with many trees.
– Harder to interpret than a single Decision Tree.
2.13 Instance-Based Learning
Instance-Based Learning (IBL) is a type of lazy learning in which the model memorizes
training instances instead of learning a general function. Predictions for new data points are
made by comparing them to the stored instances. The model does not build an explicit function
during training.
Working principle of Instance-Based Learning
1 Store all training examples in memory.
2 When a new query instance arrives, compute its similarity to stored examples.
3 Use a function (like majority voting or weighted average) to make a prediction.
Key Idea: Instead of generalizing from training data, IBL relies on similarity measures (e.g.,
Euclidean distance).
Types of Instance-Based Learning
• k-Nearest Neighbors (k-NN)
• Radial Basis Function (RBF) Networks
• Locally Weighted Regression (LWR)
k-Nearest Neighbors (k-NN)
k-Nearest Neighbors (k-NN) is a supervised learning algorithm used for classification and
regression. The algorithm stores all training data. It predicts a new data point based on the
majority vote of its k nearest neighbors (for classification) or the average value of its neighbors
(for regression). There is no explicit training phase: all computation happens at prediction time.

Workings of KNN algorithm

The K-Nearest Neighbors (KNN) algorithm operates on the principle of similarity, where it
predicts the label or value of a new data point by considering the labels or values of its K nearest
neighbors in the training data. A step-by-step explanation of how KNN works is given
below:
Step 1: Selecting the optimal value of K
• K represents the number of nearest neighbors that are considered when making a
prediction.
Step 2: Calculating distance
• To measure the similarity between the target point and the training data points, Euclidean
distance is used. The distance is calculated between the target point and each data point
in the dataset.
Step 3: Finding Nearest Neighbors
• The K data points with the smallest distances to the target point are the nearest
neighbors.
Step 4: Voting for Classification or Taking Average for Regression
• In a classification problem, the predicted class is determined by majority voting over
the class labels of the K nearest neighbors. The class with the most occurrences among
the neighbors becomes the predicted class for the target data point.
• In a regression problem, the predicted value is the average of the values of the K
nearest neighbors.
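Steps 2 to 4 can be sketched in a few lines of plain Python. This is a minimal illustration under the assumption of numeric features and Euclidean distance; the function name `knn_predict` and its `task` parameter are hypothetical and introduced only for this example.

```python
import math
from collections import Counter

def knn_predict(X_train, y_train, query, k=3, task="classification"):
    # Step 2: Euclidean distance from the query to every stored point.
    distances = [(math.dist(x, query), target)
                 for x, target in zip(X_train, y_train)]
    # Step 3: keep the k points with the smallest distances.
    neighbors = sorted(distances, key=lambda pair: pair[0])[:k]
    targets = [target for _, target in neighbors]
    # Step 4: majority vote for classification, mean for regression.
    if task == "classification":
        return Counter(targets).most_common(1)[0][0]
    return sum(targets) / len(targets)
```

For example, `knn_predict(X, y, [2, 2], k=3)` returns the majority label among the three training points closest to `[2, 2]`. Sorting all distances is the simplest approach; larger datasets usually call for a spatial index instead.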

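To measure the similarity between data points, k-NN typically uses a distance metric such as the
Euclidean distance, $d(x, y) = \sqrt{\sum_{j=1}^{n} (x_j - y_j)^2}$, where $n$ is the number of features.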
Choosing the Right k Value
• Small k (e.g., 1 or 3): More sensitive to noise, higher variance.
• Large k (e.g., 10 or 20): More stable, lower variance, but higher bias because it may smooth over local patterns.
• Best practice: Use cross-validation to find the optimal k.
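One simple form of cross-validation for choosing k is leave-one-out: each point is held out in turn and predicted from the rest. The sketch below assumes a small dataset with numeric features; `knn_classify`, `loocv_accuracy`, and `choose_k` are hypothetical helper names introduced for this example.

```python
import math
from collections import Counter

def knn_classify(X, y, query, k):
    # Majority vote among the k nearest stored points (Euclidean distance).
    ranked = sorted(zip(X, y), key=lambda pair: math.dist(pair[0], query))
    return Counter(label for _, label in ranked[:k]).most_common(1)[0][0]

def loocv_accuracy(X, y, k):
    # Hold out each point once and predict it from all the others.
    hits = 0
    for i in range(len(X)):
        X_rest = X[:i] + X[i + 1:]
        y_rest = y[:i] + y[i + 1:]
        hits += knn_classify(X_rest, y_rest, X[i], k) == y[i]
    return hits / len(X)

def choose_k(X, y, candidates):
    # Score every candidate k; on ties, the earliest candidate wins.
    scores = {k: loocv_accuracy(X, y, k) for k in candidates}
    best = max(candidates, key=lambda k: scores[k])
    return best, scores
```

On data containing a mislabeled point, k = 1 tends to score worse than a slightly larger k, because every neighbor of the noisy point copies its label. Leave-one-out is expensive on large datasets, where k-fold cross-validation is the usual substitute.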
