
Unit2 - Lecturenotes

The document covers supervised learning in machine learning, detailing its concepts, algorithms, and applications. It explains the distinctions between classification and regression algorithms, the steps involved in supervised learning, and specific models like Bayesian linear regression, gradient descent, and logistic regression. Additionally, it discusses linear classification models, including the perceptron algorithm and probabilistic models, emphasizing their practical applications in various fields.

Uploaded by

ddnandu3

AM3403 Machine Learning: Concepts and Application

LECTURE NOTES

UNIT-2

SUPERVISED LEARNING
Syllabus

Bayesian linear regression, gradient descent, Linear Classification Models: Discriminant
function – Perceptron algorithm – Support vector machine, Decision Tree, Random Forests,
Instance Based Learning – KNN. Probabilistic discriminative model – Logistic regression,
Probabilistic generative model – Naive Bayes, Maximum margin classifier

2.1 What is Supervised Machine Learning?

Supervised machine learning learns patterns and relationships between input and output data. It
is defined by its use of labeled data: a dataset containing many examples, each pairing input
features with a known target. Specifically, a supervised learning algorithm takes a known set of
input data and known responses to the data (output), and trains a model to generate reasonable
predictions for the response to new data. This process is referred to as Training or Fitting.

There are two types of supervised learning algorithms:

 Classification
 Regression
Classification Algorithms
Classification algorithms are used for predicting discrete outcomes. If the outcome can take two possible
values, such as True or False, Default or No Default, Yes or No, the task is known as Binary Classification.
When the outcome can take more than two possible values, it is known as Multiclass Classification.
Many machine learning algorithms can be used for classification tasks:

 Logistic regression
 Support vector machines (SVM)
 Neural networks
 Naïve Bayes classifier
 Decision trees
 Discriminant analysis
 Nearest neighbors (kNN)
 Ensemble Classification
 Generalized Additive Model (GAM)

Regression Algorithms
Regression is a type of supervised machine learning where algorithms learn from the data to
predict continuous values such as sales, salary, weight, or temperature. For example, given a
dataset containing features of houses (lot size, number of bedrooms, number of baths,
neighborhood, etc.) together with the price of each house, a regression algorithm can be trained
to learn the relationship between the features and the price.

Common regression algorithms include:

 Linear regression
 Nonlinear regression
 Generalized linear models
 Decision trees
 Neural networks
 Gaussian Process Regression
 Support Vector Machine Regression
 Ensemble Regression

Steps in Supervised Learning

Supervised learning involves training a model to learn patterns from labeled data and making
predictions on new inputs. While different algorithms have unique implementations, the overall
process follows a structured workflow:

1. Data Preparation

The first step in supervised learning is organizing the input data:

 The dataset consists of an input feature matrix X (where each row represents an observation and
each column represents a feature) and an output response vector Y.
 Missing values in X or Y should be appropriately handled, either by ignoring incomplete rows or
imputing missing data.
 The response variable Y varies based on the task:
o Regression: Y is a numeric vector.
o Classification: Y can be categorical, binary, or multi-class labels.

2. Choosing an Algorithm

The selection of a suitable learning algorithm depends on multiple factors, including:

 Training speed: Some models train faster than others, depending on complexity and dataset size.
 Memory usage: Resource-efficient algorithms are preferable for large datasets.
 Predictive accuracy: The model should generalize well to unseen data.
 Interpretability: Some models (e.g., decision trees) provide clear insights, while others (e.g.,
deep learning) act as black boxes.

3. Model Training (Fitting)

The training process involves applying the chosen algorithm to fit the model using the given
dataset. Common types of models include:

 Decision Trees
 Linear and Logistic Regression
 Support Vector Machines (SVM)
 Neural Networks
 k-Nearest Neighbors (k-NN)
 Naïve Bayes Classifier
 Ensemble Methods (e.g., Random Forest, Boosting)

Each algorithm has its own method for fitting a model to the training data.

4. Model Validation

To assess model performance, different validation techniques can be used:

 Resubstitution Error: Evaluating the model on the same training data.


 Cross-Validation: Splitting data into training and validation sets multiple times to
estimate performance on new data.
 Out-of-Bag Error: Specific to ensemble methods like bagging, evaluating performance
using data points not included in each subset during training.

5. Model Evaluation and Optimization

Once validated, the model can be fine-tuned for better accuracy, efficiency, or robustness. This
can involve:

 Adjusting hyperparameters (e.g., learning rate, tree depth).


 Pruning or regularizing the model to reduce complexity.
 Trying alternative algorithms for comparison.

For models that support optimization, compacting the model by removing unnecessary training
data or parameters can improve efficiency.

6. Making Predictions

After training and validating the model, it is used to make predictions on new data:

 For classification tasks, the model assigns labels to new observations.


 For regression tasks, the model predicts numerical values based on input features.

By following the above steps, a supervised learning model can be effectively developed,
validated, and applied to real-world problems.

2.2 Bayesian linear regression

Bayesian Linear Regression is an extension of standard linear regression that incorporates
probability distributions over parameters, allowing for uncertainty estimation in predictions.
Instead of finding a single best-fit line, it provides a posterior distribution over possible
regression models.

In standard linear regression, we model the relationship between input X and output y as:

$$y = Xw + \varepsilon$$

where w is the weight vector and ε is a noise term, usually assumed Gaussian.

In Bayesian Linear Regression, we treat the weights w as random variables with a prior distribution,
rather than fixed values. The goal is to compute the posterior distribution over w given the data,
which follows from Bayes' theorem:

$$p(w \mid X, y) \propto p(y \mid X, w)\, p(w)$$

Bayesian Linear Regression Formulas

To predict a new target value $\hat{y}$ for an unseen input $x_*$ in Bayesian Linear Regression (BLR), we compute
the predictive distribution, which gives both the expected value (mean prediction) and uncertainty
(variance):

$$p(\hat{y} \mid x_*, X, y) = \int p(\hat{y} \mid x_*, w)\, p(w \mid X, y)\, dw$$
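
The closed-form expressions depend on the choice of prior and noise model, which these notes do not fix. As a minimal sketch only, assume a single weight w with no intercept, a zero-mean Gaussian prior w ~ N(0, 1/alpha), and Gaussian noise with known precision beta (all three are assumptions made for illustration). The posterior over w is then Gaussian with precision alpha + beta·Σx² and mean beta·Σxy divided by that precision, and the predictive variance at a new input is 1/beta plus the input squared times the posterior variance.

```python
def blr_posterior(xs, ys, alpha=1.0, beta=25.0):
    """Posterior mean and variance of w for the model y = w * x + noise.

    Assumptions (illustrative, not from the notes):
    prior w ~ N(0, 1/alpha), noise ~ N(0, 1/beta) with beta known.
    """
    precision = alpha + beta * sum(x * x for x in xs)
    mean = beta * sum(x * y for x, y in zip(xs, ys)) / precision
    return mean, 1.0 / precision


def blr_predict(x_new, w_mean, w_var, beta=25.0):
    """Predictive mean and variance at x_new."""
    pred_mean = w_mean * x_new
    # Noise variance plus the uncertainty inherited from w.
    pred_var = 1.0 / beta + x_new * x_new * w_var
    return pred_mean, pred_var


xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 1.9, 4.2, 5.8]  # made-up data, roughly y = 2x
w_mean, w_var = blr_posterior(xs, ys)
y_mean, y_var = blr_predict(4.0, w_mean, w_var)
```

The predictive variance can never fall below the noise variance 1/beta, and it grows as the query moves away from the origin, which is where this one-weight model is most certain.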
2.3 Gradient Descent

Gradient Descent is one of the most commonly used optimization algorithms for training
machine learning models by minimizing the error between actual and expected results. In
mathematical terminology, optimization refers to the task of minimizing or maximizing
an objective function f(x) parameterized by x. Similarly, in machine learning, optimization is the
task of minimizing the cost function parameterized by the model's parameters.

Gradient-based methods find a local minimum or local maximum of a function as follows:

o If we move in the direction of the negative gradient (away from the gradient) of the function
at the current point, we approach a local minimum of that function. This procedure is
Gradient Descent, also known as steepest descent.
o If we move in the direction of the positive gradient (towards the gradient) of the
function at the current point, we approach a local maximum of that function. This
procedure is known as Gradient Ascent.

The main objective of using a gradient descent algorithm is to minimize the cost function
using iteration. To achieve this goal, it performs two steps iteratively:

o Calculate the first-order derivative of the function to obtain the gradient (slope) of
that function at the current point.
o Move in the direction opposite to the gradient, by a step of alpha times the gradient,
where alpha is the Learning Rate. It is a tuning parameter in the optimization process
that determines the length of the steps.
Working principle of Gradient Descent for linear regression

Consider a simple dataset with input (X) and output (Y) variables. The goal is to fit a line using
gradient descent.

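Equation of a Line:

$$\hat{y} = b + m x$$

where b is the intercept and m is the slope. The objective is to determine optimal values for
the intercept and slope by minimizing the error.

Loss Function:

We use the sum of squared residuals to measure how well the line fits the data:

$$L(b, m) = \sum_{i=1}^{n} \left( y_i - (b + m x_i) \right)^2$$

Gradient Descent Update Rule:

To minimize the loss, we compute derivatives (gradients) of the loss function with respect to the
intercept and slope:

$$\frac{\partial L}{\partial b} = -2 \sum_{i=1}^{n} \left( y_i - (b + m x_i) \right), \qquad \frac{\partial L}{\partial m} = -2 \sum_{i=1}^{n} x_i \left( y_i - (b + m x_i) \right)$$

Using these gradients, we update the values iteratively, with learning rate α:

$$b \leftarrow b - \alpha \frac{\partial L}{\partial b}, \qquad m \leftarrow m - \alpha \frac{\partial L}{\partial m}$$

This process repeats until the model converges to the optimal line.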
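
The loop described in this section can be sketched in a few lines of plain Python. The function name `fit_line`, the starting point (0, 0), the learning rate, the step count and the toy data are all illustrative choices rather than values from the notes.

```python
def fit_line(xs, ys, lr=0.01, steps=5000):
    """Fit y ≈ intercept + slope * x by gradient descent on the
    sum of squared residuals."""
    intercept, slope = 0.0, 0.0  # arbitrary starting point
    for _ in range(steps):
        residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
        # Derivatives of the sum of squared residuals.
        grad_intercept = -2.0 * sum(residuals)
        grad_slope = -2.0 * sum(r * x for r, x in zip(residuals, xs))
        # Step against the gradient, scaled by the learning rate.
        intercept -= lr * grad_intercept
        slope -= lr * grad_slope
    return intercept, slope


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated from y = 1 + 2x
intercept, slope = fit_line(xs, ys)
```

On this noise-free data the estimates approach intercept 1 and slope 2. A learning rate that is too large for the data scale makes the updates diverge instead, which is why it is treated as a tuning parameter.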

Example

Consider a simple dataset containing height and weight measurements for several people.
Our objective is to fit a line to these data points using gradient descent. We start
from the generic equation for a line:

Predicted Height = Intercept + Slope * Weight

The goal is to find optimal values for the intercept and slope. For example, we can start with
intercept = 0 and an initial value for the slope, then use the sum of squared residuals as a loss
function to determine how well that initial line fits the data.

To find the optimal values for the intercept and slope, we plug the equation for the predicted
height into the sum of squared residuals:

Sum of squared residuals = Σ (Observed height – [Intercept + Slope * Weight])²

Then we find the derivative of the sum of squared residuals with respect to the intercept and
slope and take a step. We repeat this many times until the step size becomes very small.

The different types of gradient descent are:

 Batch Gradient Descent


o Updates weights after computing the gradient on the entire dataset
o Stable but computationally expensive

 Stochastic Gradient Descent


o Updates weights after each individual data point
o Much faster but introduces randomness

 Mini batch Gradient Descent


o Compromise between Batch and Stochastic Gradient Descent
o Uses a small batch of data to update weights
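
The three variants differ only in how many data points feed each update, so one routine with a `batch_size` parameter can illustrate all of them. This is a sketch under assumptions: the function name, hyperparameters and data are hypothetical, and the gradient is averaged over the batch (rather than summed) so that one learning rate works for every batch size.

```python
import random


def sgd_fit_line(xs, ys, batch_size, lr=0.01, epochs=2000, seed=0):
    """Fit y ≈ intercept + slope * x.

    batch_size == len(xs) -> batch gradient descent
    batch_size == 1       -> stochastic gradient descent
    anything in between   -> mini-batch gradient descent
    """
    rng = random.Random(seed)
    intercept, slope = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the data in a new random order
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            pairs = [(ys[i] - (intercept + slope * xs[i]), xs[i]) for i in batch]
            # Mean gradient of the squared residuals over the batch.
            grad_intercept = -2.0 * sum(r for r, _ in pairs) / len(batch)
            grad_slope = -2.0 * sum(r * x for r, x in pairs) / len(batch)
            intercept -= lr * grad_intercept
            slope -= lr * grad_slope
    return intercept, slope


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated from y = 1 + 2x
batch_fit = sgd_fit_line(xs, ys, batch_size=len(xs))
mini_fit = sgd_fit_line(xs, ys, batch_size=2)
stochastic_fit = sgd_fit_line(xs, ys, batch_size=1)
```

With the same number of epochs, smaller batches perform more updates per pass over the data, at the cost of noisier individual steps.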
2.4 Linear Classification Models
Linear classification models are a class of machine learning models used for classifying data
points by drawing a linear decision boundary between different classes. These models assume
that classes can be separated by a straight line (in 2D), a plane (in 3D), or a hyper plane (in
higher dimensions).

Working Principle of Linear Classification

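A linear classifier models the relationship between input features and class labels by computing a
weighted sum of input features and applying a threshold function to determine the class.
Mathematically, a linear classifier can be represented as:

$$\hat{y} = f(w^\top x + b)$$

where w is the weight vector, b is the bias, and f is a threshold function that maps the weighted
sum to a class label.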
Types of Linear Classification Models

1. Perceptron
o A simple binary classifier based on a linear threshold function.
o Updates weights using the perceptron learning rule.
o Converges if data is linearly separable.
2. Logistic Regression
o Used for binary classification.
o Uses the sigmoid function to output probabilities.
o Optimized using Maximum Likelihood Estimation (MLE).
o Extended to Multinomial Logistic Regression for multiclass problems.
3. Linear Discriminant Analysis (LDA)
o Assumes Gaussian distributions for classes and equal covariance matrices.
o Finds a projection that maximizes class separability.
4. Support Vector Machine (SVM) with Linear Kernel
o Finds the optimal hyperplane that maximizes the margin between classes.
o Can handle non-linearly separable data using soft margin SVM.

Applications

Spam Detection (e.g., Logistic Regression for classifying emails as spam or not)
Sentiment Analysis (e.g., classifying movie reviews as positive or negative)

Medical Diagnosis (e.g., identifying disease presence based on patient features)

2.5 Discriminant Function

A discriminant function is a function used in pattern classifiers to partition the feature space
based on probabilities or equivalent functions, helping to determine the class to which a given
input belongs.

Linear Discriminant Analysis (LDA)

Linear Discriminant Analysis (LDA) is a supervised learning algorithm used for dimensionality
reduction and classification. It finds the optimal linear boundaries between different classes
while maximizing the separation between them. The objectives of LDA are:

 Maximize class separability: Find a projection that best separates different classes.
 Reduce dimensionality: Transform data into a lower-dimensional space while
retaining class information.
 Improve classification performance: Useful as a preprocessing step for models like Naïve
Bayes or Logistic Regression.

LDA projects high-dimensional data onto a lower-dimensional space by maximizing inter-class
variance while minimizing intra-class variance. This criterion is known as the Fisher
criterion.

LDA algorithm works based on the following steps:

a) The first step is to calculate the mean vector of each class.

b) The within-class scatter matrix and the between-class scatter matrix are calculated.
c) These matrices are then used to calculate the eigenvectors and eigenvalues, and the data is
projected onto the new space spanned by the leading eigenvectors.

Benefits of using LDA:


a) LDA is used for classification problems.
b) LDA is a powerful tool for dimensionality reduction.
c) By projecting the data onto a small number of dimensions, LDA can reduce the impact of the
"curse of dimensionality" on the classifier that follows it.
2.6 Perceptron algorithm

The Perceptron Algorithm is one of the simplest supervised learning algorithms for binary
classification.

It is a linear classifier that updates its weights based on misclassified data points using a
simple learning rule.

Working Principle of Perceptron

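Given an input feature vector X and a weight vector W, the perceptron computes a weighted
sum and applies an activation function to classify the data.
The perceptron makes predictions using the formula (taking the two class labels as +1 and −1):

$$\hat{y} = \begin{cases} +1 & \text{if } W \cdot X + b > 0 \\ -1 & \text{otherwise} \end{cases}$$

where b is a bias term. The perceptron updates its weights whenever it misclassifies a point.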
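
A common form of the perceptron learning rule adds the misclassified input, scaled by its label and a learning rate, to the weights: W ← W + η·y·X and b ← b + η·y. The sketch below assumes labels in {−1, +1}; the function names, learning rate and the AND-gate data are illustrative choices.

```python
def train_perceptron(samples, labels, lr=1.0, max_epochs=100):
    """Perceptron training with labels in {-1, +1}."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:  # misclassified (or on the boundary)
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
                mistakes += 1
        if mistakes == 0:  # a full pass with no errors: converged
            break
    return w, b


def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1


# Logical AND, which is linearly separable.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [-1, -1, -1, 1]
w, b = train_perceptron(X, y)
predictions = [predict(w, b, x) for x in X]
```

Replacing the labels with XOR (`[-1, 1, 1, -1]`) makes the loop run until `max_epochs` without ever reaching zero mistakes, since no line separates those classes.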

Advantages

 Guaranteed to converge for linearly separable data.
 Simple and computationally efficient.

Disadvantages

 Fails on non-linearly separable data (e.g., XOR problem).
 Sensitive to learning rate and data order.

2.7 Probabilistic discriminative model

 The discriminative model aims to model the conditional distribution of the output
variable given the input variable.
 It directly estimates the posterior probability P(y∣X), which is the probability of a class
label y given an input X
 These models focus on learning the decision boundary between classes rather than
modeling the underlying distribution of the data.
 These models are useful when the focus is on making accurate predictions rather than
generating new data.
 Logistic Regression, Softmax Regression (Multinomial Logistic Regression), and Conditional
Random Fields (CRFs) are examples of probabilistic discriminative models.

Logistic regression

Logistic regression is a statistical method used to analyze and predict a binary outcome (i.e., a
result with two possible values, such as Yes/No, 0/1, or Healthy/Sick). It helps determine
whether certain factors influence the likelihood of an event occurring. For example, logistic
regression can be used to predict whether a patient has a disease based on blood test results. The
key components of logistic regression are:

Logistic Component:

 Instead of directly predicting the outcome (0 or 1), logistic regression models the log
odds of the outcome using a special S-shaped curve called the sigmoid function.
 This transformation ensures that the predicted probability always falls between 0 and 1.

Regression Component:

 It measures the relationship between the predictor variables (independent variables) and
the outcome (dependent variable).
 It helps quantify how much each predictor contributes to the likelihood of the event
happening.
Logistic regression is a form of regression analysis in which the outcome variable is binary or
dichotomous. It is a statistical method used to model dichotomous or binary outcomes using
predictor variables.
• Logistic component: Instead of modeling the outcome, Y, directly, the method models the log
odds of Y using the logistic function.
• Regression component: Methods used to quantify the association between an outcome and
predictor variables. It can be used to build predictive models as a function of the predictors.

Types of Logistic Regression:


 Simple Logistic Regression: Uses only one predictor variable.
 Multiple Logistic Regression: Uses multiple predictor variables.

Working principle of Logistic regression


 The logistic regression model does not predict the 0 or 1 outcome directly. Instead, it
estimates the probability of the outcome happening.
 It transforms the probability using the log odds (logit) function, making the relationship
between predictors and the outcome linear.
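
In symbols, the model is p = 1 / (1 + e^(−(w·x + b))), which is equivalent to log(p / (1 − p)) = w·x + b. As a minimal sketch with one predictor, the code below fits w and b by gradient descent on the average log-loss (one way of carrying out maximum likelihood estimation) and then checks the log-odds identity numerically. The data, learning rate and step count are made up for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, steps=5000):
    """One-predictor logistic regression fitted by gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradient of the average log-loss with respect to w and b.
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0, 0, 1, 0, 1, 1]  # classes overlap, so the fitted weights stay finite
w, b = fit_logistic(xs, ys)

p = sigmoid(w * 3.5 + b)           # estimated probability at x = 3.5
log_odds = math.log(p / (1 - p))   # equals w * 3.5 + b up to rounding
```

The final two lines show why the relationship is called linear in the log odds: applying the logit to the predicted probability recovers the linear predictor.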
Comparison with Linear Regression

 In linear regression, the outcome is a continuous number (e.g., predicting a person's weight).
 In logistic regression, the outcome is categorical (e.g., predicting whether a person has a
disease: Yes/No).

The key difference is how probabilities are modeled:

 Linear regression assumes the outcome changes along a straight line as a predictor increases.


 Logistic regression assumes probability follows an S-shaped (sigmoid) curve.

While linear models are easy to interpret, logistic regression results are often explained in terms
of odds ratios, which can be less intuitive.

2.9 Probabilistic Generative model


Generative models aim to model the joint distribution of the input and output variables. These
models generate new data based on the probability distribution of the original dataset.
Generative models are powerful because they can generate new data that resembles the training
data.

Naive Bayes

• Naïve Bayes is a probabilistic, generative model based on Bayes' theorem. It is widely
used for classification tasks and assumes that features are conditionally independent
given the class label (hence the term naïve).
• A Naive Bayes Classifier is a program which predicts a class value given a set of
attributes.
• For each known class value,
1. Calculate probabilities for each attribute, conditional on the class value.
2. Use the product rule to obtain a joint conditional probability for the attributes.
3. Use Bayes rule to derive conditional probabilities for the class variable.
4. Once this has been done for all class values, output the class with the highest
probability.
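
The four steps can be sketched for categorical attributes using simple frequency counts. The function names and the tiny weather-style dataset are hypothetical, and no smoothing is applied, so an attribute value never seen with a class gives that class probability zero (practical implementations usually add Laplace smoothing).

```python
from collections import Counter, defaultdict


def train_nb(rows, labels):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, attribute index) -> value counts
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(label, i)][value] += 1
    return class_counts, value_counts


def predict_nb(row, class_counts, value_counts):
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for i, value in enumerate(row):
            # Steps 1-2: multiply the per-attribute conditionals P(attr | class).
            score *= value_counts[(label, i)][value] / count
        scores[label] = score
    # Step 3: normalise so the scores become posterior probabilities.
    norm = sum(scores.values())
    posteriors = {c: s / norm for c, s in scores.items()} if norm else scores
    # Step 4: output the class with the highest probability.
    return max(scores, key=scores.get), posteriors


rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"),
        ("rainy", "cool"), ("sunny", "cool"), ("rainy", "hot")]
labels = ["no", "no", "yes", "yes", "yes", "no"]
class_counts, value_counts = train_nb(rows, labels)
label, posteriors = predict_nb(("rainy", "mild"), class_counts, value_counts)
```

For the query `("rainy", "mild")` the unnormalised scores are 1/2 · 2/3 · 1/3 for "yes" and 1/2 · 1/3 · 1/3 for "no", so "yes" wins with posterior 2/3.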

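The above steps are explained mathematically as follows. For a class value C and attributes
x_1, ..., x_n:

$$P(C \mid x_1, \dots, x_n) = \frac{P(C) \prod_{i=1}^{n} P(x_i \mid C)}{P(x_1, \dots, x_n)}, \qquad \hat{C} = \arg\max_{C} \; P(C) \prod_{i=1}^{n} P(x_i \mid C)$$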


2.10 Maximum margin classifier

• The Maximum Margin Classifier is a fundamental concept in machine learning used for binary
classification.

• It is the basis of Support Vector Machines (SVMs) and aims to find a decision boundary that
maximizes the distance (margin) between the closest data points (support vectors) of each class.
Support Vector Machine

A support vector machine (SVM) is a supervised machine learning algorithm that classifies data
by finding an optimal line or hyperplane that maximizes the distance between each class in an
N-dimensional space.

Support Vector Machines (SVMs) are used for both classification and regression tasks. While they
can handle regression problems, SVMs are particularly well-suited for classification tasks.

Support Vector Machine (SVM) Terminologies


 Hyper plane: A decision boundary separating different classes in feature space,
represented by the equation wx + b = 0 in linear classification.
 Support Vectors: The closest data points to the hyper plane, crucial for determining the
hyper plane and margin in SVM.
 Margin: The distance between the hyper plane and the support vectors. SVM aims to
maximize this margin for better classification performance.
 Kernel: A function that maps data to a higher-dimensional space, enabling SVM to handle
non-linearly separable data.
 Hard Margin: A maximum-margin hyper plane that perfectly separates the data without
misclassifications.
 Soft Margin: Allows some misclassifications by introducing slack variables, balancing
margin maximization and misclassification penalties when data is not perfectly separable.
 C: A regularization term balancing margin maximization and misclassification penalties. A
higher C value enforces a stricter penalty for misclassifications.
 Hinge Loss: A loss function penalizing misclassified points or margin violations,
combined with regularization in SVM.
 Dual Problem: Involves solving for Lagrange multipliers associated with support vectors,
facilitating the kernel trick and efficient computation.

Support Vector Machine (SVM) working principle

• They distinguish between two classes by finding the optimal hyper plane that maximizes
the margin between the closest data points of opposite classes.

• The number of features in the input data determines whether the hyper plane is a line (in a
2-D space), a plane (in a 3-D space), or a hyper plane in an n-dimensional space.

• Since multiple hyper planes can be found to differentiate classes, maximizing the margin
between points enables the algorithm to find the best decision boundary between classes

• The data points closest to the optimal hyper plane are known as support vectors; the margin
boundaries pass through these points, so they determine the maximal margin.

• Hard Margin: A maximum-margin hyper plane that perfectly separates the data without
misclassifications.
• Soft Margin: Allows some misclassifications by introducing slack variables, balancing
margin maximization and misclassification penalties when data is not perfectly separable.
SVM can handle both linear and nonlinear classification tasks. When the data is not
linearly separable, kernel functions are used to transform the data into a higher-dimensional
space to enable linear separation. This application of kernel functions is known as
the “kernel trick”. The choice of kernel function, such as linear kernels, polynomial
kernels, radial basis function (RBF) kernels, or sigmoid kernels, depends on data
characteristics and the specific use case.

SVM Applications
• SVM has been used successfully in many real-world problems:

1. Text (and hypertext) categorization
2. Image classification
3. Bioinformatics (Protein classification, Cancer classification)
4. Hand-written character recognition
5. Determination of SPAM email.

Problem1 : For the following figure find a linear hyperplane (decision boundary)
that will separate the data.
Solution:
1. Define what an optimal hyper plane is: the one that maximizes the margin.
2. Extend the above definition to non-linearly separable problems: add a penalty term for
misclassifications.
3. Map data to a high-dimensional space where it is easier to classify with linear decision
surfaces: reformulate the problem so that data is mapped implicitly to this space.

Problem 2: From the following diagram, identify which data points (1, 2, 3, 4, 5)
are support vectors (if any), slack variables on correct side of classifier (if any) and
slack variables on wrong side of classifier (if any). Mention which point will have
maximum penalty and why?
Solution:
• Data points 1 and 5 will have maximum penalty.
• Margin (m) is the gap between data points & the classifier boundary. The margin is the
minimum distance of any sample to the decision boundary. If the hyperplane is in canonical
form, the margin equals 1/||w||, the reciprocal of the length of the weight vector.
• Maximal margin classifier: A classifier in the family F that maximizes the margin. Maximizing
the margin is good according to intuition and PAC theory. Implies that only support vectors
matter; other training examples are ignorable.
• What if the training set is not linearly separable? Slack variables can be added to allow
misclassification of difficult or noisy examples; the resulting margin is called soft.
• A soft margin allows a few data points to cross into the margin or over the hyperplane, allowing
misclassification.
• We penalize the crossover by looking at the number and distance of the misclassifications. This
is a trade-off between the hyperplane violations and the margin size. The slack variables are
bounded by some set cost. The farther they are from the soft margin, the less influence they have
on the prediction.
• All observations have an associated slack variable:
1. Slack variable = 0: the point is on the margin or beyond it on the correct side.
2. Slack variable > 0: the point is inside the margin or on the wrong side of the hyperplane.
3. C is the trade-off between the slack variable penalty and the margin.
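
For a hyperplane in canonical form and labels in {−1, +1}, the slack of a point is ξ = max(0, 1 − y·(w·x + b)), and the soft-margin objective is ½‖w‖² + C·Σξ. The sketch below evaluates both for a hypothetical hyperplane and five made-up points; it does not reproduce the figure from Problem 2, and it does not solve the optimisation itself.

```python
def slack_variables(w, b, points, labels):
    """xi_i = max(0, 1 - y_i * (w . x_i + b)) for labels in {-1, +1}."""
    slacks = []
    for x, y in zip(points, labels):
        functional_margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
        slacks.append(max(0.0, 1.0 - functional_margin))
    return slacks


def soft_margin_objective(w, b, points, labels, C=1.0):
    """0.5 * ||w||^2 + C * (sum of slack variables)."""
    norm_sq = sum(wi * wi for wi in w)
    return 0.5 * norm_sq + C * sum(slack_variables(w, b, points, labels))


# Hypothetical hyperplane x1 = 3 with margin boundaries at x1 = 2 and x1 = 4.
w, b = [1.0, 0.0], -3.0
points = [(5.0, 1.0), (4.0, 2.0), (3.5, 0.0), (2.5, 1.0), (1.0, 1.0)]
labels = [1, 1, 1, 1, -1]

slacks = slack_variables(w, b, points, labels)
objective = soft_margin_objective(w, b, points, labels, C=1.0)
```

Reading the slacks: 0 for points beyond or exactly on the margin (the second point is a support vector), a value between 0 and 1 for a point inside the margin but correctly classified (the third), and a value above 1 for a point on the wrong side of the hyperplane (the fourth), which therefore receives the largest penalty.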
2.11 Decision Tree

A Decision Tree is a supervised learning algorithm used for both classification and regression.
It models decisions using a tree-like structure, where each internal node represents a feature test,
each branch represents an outcome, each leaf node represents a final decision (class label or
numerical value).

Key Concepts in Decision Trees

• Root Node: The starting point that represents the entire dataset.

• Internal Nodes: Represent feature-based decisions (splitting points).

• Branches: Represent possible outcomes of a decision.

• Leaf Nodes: Represent final predictions (class labels in classification, numerical values
in regression).

• Splitting: The process of dividing nodes based on feature values.

• Pruning: Removing unnecessary branches to prevent overfitting.

Working principle of Decision Tree

1. Select the best feature to split the dataset using a criterion (e.g., Information Gain, Gini
Index).

2. Split the dataset into subsets based on feature values.

3. Repeat recursively until:

a. A stopping condition is met (e.g., maximum depth, minimum samples per leaf).

b. All data points in a node belong to the same class (pure node).

4. Prune the tree (if necessary) to avoid overfitting.
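
Step 1 is the heart of the algorithm. As a minimal sketch, the code below scores every "feature equals value" test by the weighted Gini index of the two resulting subsets and returns the best one. The function names and the small refund/marital-status style dataset are hypothetical; a full tree builder would apply `best_split` recursively to each subset until a stopping condition is met.

```python
from collections import Counter


def gini(labels):
    """Gini index: 1 - sum of squared class proportions (0 for a pure node)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Return (feature_index, value, weighted_gini) for the best equality test."""
    best = None
    n = len(rows)
    for f in range(len(rows[0])):
        for value in sorted(set(row[f] for row in rows)):
            left = [l for row, l in zip(rows, labels) if row[f] == value]
            right = [l for row, l in zip(rows, labels) if row[f] != value]
            if not left or not right:
                continue  # this test does not actually split the data
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (f, value, score)
    return best


rows = [("yes", "single"), ("no", "married"), ("no", "single"),
        ("yes", "single"), ("no", "divorced"), ("no", "single")]
labels = ["no", "no", "yes", "no", "yes", "yes"]
feature, value, score = best_split(rows, labels)
```

Here the first attribute gives the purest subsets (one side contains only "no" labels), so it would become the root of the tree. Swapping `gini` for an entropy function turns the same loop into an information-gain criterion.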

Example of a Decision Tree

Let us look at a sample decision tree given in Figure 13.2. The rectangular nodes, in our example
the Leave At, Road Block and Accident nodes, are the decision nodes. Leave At is the root of the
decision tree and has three arcs corresponding to the three values that Leave At can take, namely
10AM, 8AM, and 9AM. Road Block and Accident, the two internal nodes, are Boolean. The leaf
nodes correspond to the values of the target attribute, the time taken for travel, which
has the possible values Short, Long and Medium.
Building a Decision Tree

The decision tree can be constructed in two ways – top-down tree construction and bottom-up
tree construction. In the top-down tree construction, at the start, all training examples are at the
root. We then recursively partition the examples by choosing one attribute each time. On the
other hand, in the bottom-up tree pruning method we remove sub trees or branches, in a bottom-
up manner, to improve the estimated accuracy on new cases.

Decision Tree Classification Task

Now let us discuss the use of the decision tree for the classification task using an example of
predicting whether a person will be a cheat or not.
The training data here consists of 10 samples (Figure 13.8). This example has three decision
attributes: Attribute1 is a categorical Boolean attribute, Attribute2 is a categorical three-valued
attribute, and Attribute3 is a continuous attribute. Continuous variables as attributes increase
computational complexity, may result in prediction inaccuracy and can lead to overfitting of data.
Generally continuous variables are converted into discrete intervals using “greater than or equal
to” and “less than”.

The decision tree corresponding to the training set is shown in Figure 13.8. Figure 13.9 shows
how induction uses the training samples to build the model, and how we then apply the learnt
model to deduce the class of the samples in the test set.

When we carry out deduction (Figure 13.10), we start at the root node of the decision tree, in
our case Refund. For the example, the Refund value is No, so we follow that path and reach the
decision node Marital Status, where the value for the example is Married. We follow
that path and end up with a target value of No. Therefore, for the test example, the person is not a
cheat.
Problem 1: Using the following feature tree, write the decision rules given by the majority class

Solution:

Left Side: A feature tree combining two Boolean features. Each internal node
or split is labelled with a feature, and each edge emanating from a split is
labelled with a feature value. Each leaf therefore corresponds to a unique
combination of feature values. Also indicated in each leaf is the class
distribution derived from the training set
Right Side: A feature tree partitions the instance space into rectangular regions, one
for each leaf.
The leaves of the tree in the above figure could be labelled, from left to right, as ham
- spam - spam employing a simple decision rule called majority class.

• Left side: A feature tree with training set class distribution in the leaves.
• Right side: A decision tree obtained using the majority class decision rule.
2.12 Random Forest Tree
Random Forest is an ensemble learning method that combines multiple Decision Trees to
improve accuracy and reduce overfitting. It is used for classification and regression
tasks. Instead of relying on a single Decision Tree, Random Forest builds multiple trees and
– averages their predictions (for regression)
– uses majority voting (for classification).
Working principle of Random Forest
1. Bootstrap Sampling: Randomly select subsets of training data (with replacement).
2. Feature Selection: At each tree node, randomly choose a subset of features instead of
using all features.
3. Tree Construction: Grow a Decision Tree using the subset.
4. Ensemble Prediction:
– For Classification → Majority voting among trees.
– For Regression → Average prediction of all trees.
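
Steps 1 and 4 can be sketched without building real trees. In the code below the "trees" are stand-in threshold rules invented for illustration (a real forest would grow a Decision Tree on each bootstrap sample with a random feature subset at every node), and all function names are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(rows, labels, rng):
    """Draw len(rows) examples with replacement (step 1)."""
    idx = [rng.randrange(len(rows)) for _ in rows]
    return [rows[i] for i in idx], [labels[i] for i in idx]


def majority_vote(predictions):
    """Ensemble prediction for classification (step 4)."""
    return Counter(predictions).most_common(1)[0][0]


def average(predictions):
    """Ensemble prediction for regression (step 4)."""
    return sum(predictions) / len(predictions)


# Stand-in "trees": hypothetical threshold rules, not trained models.
trees = [
    lambda x: "spam" if x[0] > 2 else "ham",
    lambda x: "spam" if x[1] > 5 else "ham",
    lambda x: "spam" if x[0] + x[1] > 6 else "ham",
]

x = (3, 4)
votes = [tree(x) for tree in trees]
forest_label = majority_vote(votes)
forest_value = average([2.0, 3.0, 4.0])  # e.g. three regression trees' outputs

rng = random.Random(0)
rows = [(1, 2), (3, 4), (5, 6), (7, 8)]
labels = ["ham", "spam", "spam", "ham"]
sample_rows, sample_labels = bootstrap_sample(rows, labels, rng)
```

Because sampling is done with replacement, a bootstrap sample usually repeats some rows and omits others; the omitted rows are the "out-of-bag" points that can be used to estimate error.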

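• Mathematical Representation
• Each tree in the Random Forest predicts an outcome h_i(x), and the final output over T trees is
determined as:

$$\hat{y} = \operatorname{mode}\{h_1(x), \dots, h_T(x)\} \ \text{(classification)}, \qquad \hat{y} = \frac{1}{T} \sum_{i=1}^{T} h_i(x) \ \text{(regression)}$$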

Advantages of Random Forest


* Handles missing values
* Reduces over fitting compared to Decision Trees
* Works well with high-dimensional data
* Can handle both classification & regression
Disadvantages:
– Can be computationally expensive with many trees.
– Harder to interpret than a single Decision Tree.
2.13 Instance-Based Learning
Instance-Based Learning (IBL) is a type of lazy learning algorithm where the model memorizes
training instances instead of learning a general function. Predictions for new data points are
made by comparing them to stored instances. The model does not build an explicit function
during training.
Working principle of Instance-Based Learning
1. Store all training examples in memory.
2. When a new query instance arrives, compute its similarity to stored examples.
3. Use a function (like majority voting or weighted average) to make a prediction.
Key Idea: Instead of generalizing from training data, IBL relies on similarity measures (e.g.,
Euclidean distance).
Types of Instance-Based Learning
• k-Nearest Neighbors (k-NN)
• Radial Basis Function (RBF) Networks
• Locally Weighted Regression (LWR)
k-Nearest Neighbors (k-NN)
k-Nearest Neighbors (k-NN) is a supervised learning algorithm used for classification and
regression. The algorithm stores all training data. It predicts a new data point based on the
majority vote of its k nearest neighbors (for classification) or the average value of its neighbors
(for regression). There is no explicit training phase; it makes predictions only during inference.

Workings of KNN algorithm

The K-Nearest Neighbors (KNN) algorithm operates on the principle of similarity, where it
predicts the label or value of a new data point by considering the labels or values of its K nearest
neighbors in the training data. A step-by-step explanation of how KNN works is given
below:
Step 1: Selecting the optimal value of K
 K represents the number of nearest neighbors that needs to be considered while making
prediction.
Step 2: Calculating distance
 To measure the similarity between target and training data points, Euclidean distance
is used. Distance is calculated between each of the data points in the dataset and
target point.
Step 3: Finding Nearest Neighbors
 The k data points with the smallest distances to the target point are the nearest
neighbors.
Step 4: Voting for Classification or Taking Average for Regression
 In the classification problem, the class label is determined by majority voting among the
K nearest neighbors. The class with the most occurrences among the
neighbors becomes the predicted class for the target data point.
 In the regression problem, the predicted value is the average of the values of the
K nearest neighbors.

To measure the similarity between data points, k-NN typically uses a distance metric such as
the Euclidean distance:

$$d(x, z) = \sqrt{\sum_{j} (x_j - z_j)^2}$$
Choosing the Right k Value
• Small k (e.g., 1 or 3): More sensitive to noise, higher variance.
• Large k (e.g., 10 or 20): More stable, less variance, but may smooth over local patterns.
• Best practice: Use cross-validation to find the optimal k.
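
The four steps fit in a short function. This is a sketch only: the function names, the toy data and the default k = 3 are illustrative, ties are broken by the order in which `Counter` first saw the labels, and all training points are scanned for every query (real implementations often use spatial indexes instead).

```python
import math
from collections import Counter


def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def knn_predict(train_x, train_y, query, k=3, regression=False):
    """Predict by majority vote (classification) or mean (regression)
    over the k training points closest to the query."""
    # Steps 2-3: compute distances and keep the k nearest neighbours.
    neighbors = sorted(zip(train_x, train_y),
                       key=lambda pair: euclidean(pair[0], query))[:k]
    values = [y for _, y in neighbors]
    # Step 4: aggregate the neighbours' labels or values.
    if regression:
        return sum(values) / len(values)
    return Counter(values).most_common(1)[0][0]


train_x = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
class_y = ["A", "A", "A", "B", "B", "B"]
value_y = [1.0, 1.2, 0.9, 5.0, 5.5, 5.2]

predicted_class = knn_predict(train_x, class_y, (2, 2), k=3)
predicted_value = knn_predict(train_x, value_y, (2, 2), k=3, regression=True)
```

The query (2, 2) sits next to the first cluster, so its three nearest neighbours all carry label "A", and the regression estimate is the mean of their three values. Re-running with different k on held-out data is the cross-validation procedure recommended above.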
