
Module 4

The document provides an overview of classification and prediction techniques in machine learning, detailing their definitions, examples, and common algorithms. It outlines the process of solving classification problems, including data preparation, model selection, training, evaluation, and tuning, as well as various evaluation metrics and techniques. Additionally, it discusses decision tree construction, Naive Bayes classifiers, and their applications in real-world scenarios.

Uploaded by

rajeshshikari.09

Module-iv

Classification And Prediction


Problem Definition
Classification maps data into predefined categorical labels
(e.g., safe/risky, spam/not-spam).
Prediction builds continuous-valued functions to
forecast unknown or missing numerical values (e.g.,
predicting income, sales, or temperature).
Both techniques learn from labelled training data and are then
applied to new, unseen data.
Classification vs. Prediction: Definitions
 Classification (Categorical Prediction): A supervised
learning process that finds a model (classifier) to
categorise data objects into discrete classes.
o Goal: Assign a class label to new data.
o Examples: Email spam detection, loan approval
classification, medical diagnosis (disease/no-disease).
 Prediction (Numerical Prediction/Regression): A
process used to predict missing or unavailable numerical
data values for new observations.
o Goal: Estimate a continuous value.
o Examples: Predicting customer spending,
forecasting stock prices, estimating product
demand.
Usage Examples
 Classification:
o Finance: Determining if a loan application is "safe"
or "risky".
o Healthcare: Classifying tumour cells as
"malignant" or "benign".
o Marketing: Segmenting customers into "high-
value" or "low-value" groups.
 Prediction:
o Finance: Predicting the dollar amount a customer
will spend.
o Sales: Forecasting future sales trends based on
historical data.
o Technology: Estimating the remaining lifespan of a
machine component.
Synonyms and Key Terms
 Classification: Categorisation, Supervised Learning
(Classification), Class Label Prediction, Discrete
Classification.
 Prediction: Numerical Prediction, Regression Analysis,
Forecasting, Continuous Value Estimation.
Process Flow
1. Learning Step (Training): Algorithms analyse
historical data to build a model (e.g., decision trees,
neural networks).
2. Classification/Prediction Step (Testing): The model is
used to predict or classify new data, with accuracy
determined by comparing predicted results to known,
actual labels.
Common Algorithms:
 Classification: Decision Trees, Naïve Bayes, Support
Vector Machines (SVM), k-Nearest Neighbours (k-NN).
 Prediction: Linear Regression, Nonlinear Regression,
Artificial Neural Networks (ANN).
General Approaches for solving a classification problem
Solving a classification problem involves a structured, two-
step process: building a model using labelled training
data and subsequently using that model to predict class
labels for unseen data. The general approach
requires data preparation, algorithm selection (e.g.,
decision trees, SVM), model training, evaluation (using
accuracy, precision, F1-score), and tuning.
Key Steps for Solving a Classification Problem
 Data Preparation & Cleaning: Gather, clean, and pre-
process data by handling missing values and scaling
numerical features.
 Feature Selection & Engineering: Identify the most
relevant variables influencing the outcome.
 Data Splitting: Divide the dataset into training (e.g., 70-
80%) and testing (e.g., 20-30%) sets to ensure the model
generalises well.
 Model Selection: Choose an algorithm based on data
type and complexity. Common techniques include:
o Decision Trees: Creates a tree-based model for
decision-making.
o Logistic Regression: Suitable for binary
classification.
o K-Nearest Neighbours (KNN): A lazy learner that
classifies based on nearest neighbours.
o Support Vector Machines (SVM): Find the
optimal hyperplane for separation.
o Naive Bayes: Probabilistic classifier based on
Bayes' theorem.
 Model Training: The algorithm learns patterns from the
training data.
 Model Evaluation: Test the model on unseen data using
a confusion matrix to assess accuracy, precision, recall,
and F1 Score.
 Hyperparameter Tuning: Optimise model hyperparameters
(e.g., via cross-validation on the training data) to improve
performance.
Common Approaches
 Supervised Learning: The most common approach,
using labelled datasets.
 Lazy vs. Eager Learners: Lazy learners (e.g., KNN)
store data and wait for a test instance, while eager
learners (e.g., Decision Trees) build a model before
receiving test data.
 Binary vs. Multiclass/Multi-label: Selecting the
appropriate approach (e.g., binary for yes/no, multiclass
for multiple categories).

Evaluation of Classifiers
Classifier evaluation measures how well a classification model
predicts class labels on unseen data. Different problems care
about different kinds of errors, so there is no single
“best” metric.
Confusion Matrix (the foundation)
For binary classification:

                  Predicted Positive     Predicted Negative
Actual Positive   True Positive (TP)     False Negative (FN)
Actual Negative   False Positive (FP)    True Negative (TN)


Almost every evaluation metric comes from these four
numbers.
Common Evaluation Metrics
1. Accuracy
$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$
Good when: classes are balanced
Bad when: classes are imbalanced (e.g., fraud detection)
2. Precision
$\text{Precision} = \frac{TP}{TP + FP}$
Answers: “When the model predicts positive, how often is it
correct?”
Important when: false positives are costly (e.g., spam
filtering)
3. Recall (Sensitivity / True Positive Rate)
$\text{Recall} = \frac{TP}{TP + FN}$
Answers: “How many actual positives did we catch?”
Important when: false negatives are costly (e.g.,
disease detection)
4. F1-Score
$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
Use when: you need a balance between precision and recall
5. Specificity (True Negative Rate)
$\text{Specificity} = \frac{TN}{TN + FP}$
Often used alongside recall in medical or risk-sensitive
domains.
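The five metrics above can be computed directly from the four confusion-matrix counts. The following is a minimal sketch, assuming labels encoded as 1 (positive) and 0 (negative); the function names and the sample labels are hypothetical, chosen only for illustration.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, FN, FP, TN for a binary problem."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1
        elif a == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


def binary_metrics(tp, fn, fp, tn):
    """Return accuracy, precision, recall, F1 and specificity.

    A ratio with a zero denominator is reported as 0.0 (a convention
    assumed here, not a universal rule).
    """
    def ratio(num, den):
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "specificity": ratio(tn, tn + fp),
    }


# Hypothetical labels: 1 = positive, 0 = negative.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

counts = confusion_counts(actual, predicted)
metrics = binary_metrics(*counts)
print(counts)
print(metrics)
```

With these labels there are 3 true positives, 1 false negative, 1 false positive and 5 true negatives, so accuracy is 8/10 while specificity is 5/6.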
Threshold-Based Evaluation
Many classifiers output probabilities, not labels.
ROC Curve & AUC
 ROC curve: plots Recall (TPR) vs False Positive Rate
 AUC: probability that a random positive is ranked higher
than a random negative
Good for: overall ranking quality
Limitation: can be misleading with heavy class
imbalance
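The ranking interpretation of AUC can be checked on a small example by comparing every positive score with every negative score. This is only a sketch: the labels and scores are made up, ties are counted as half a win (a common convention assumed here), and the pairwise loop is quadratic, so it suits small illustrations rather than large datasets.

```python
def auc_by_ranking(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores (higher = more likely positive).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc = auc_by_ranking(labels, scores)
print(round(auc, 4))
```

Here 8 of the 9 positive/negative pairs are ordered correctly (only the positive scored 0.4 falls below the negative scored 0.7), giving an AUC of 8/9.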
Precision–Recall Curve
 Focuses on positive class performance
 More informative than ROC for imbalanced datasets
Multi-Class Classification
Metrics can be averaged:
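 Macro average: unweighted mean of the per-class metrics, so
all classes count equally
 Micro average: pools TP, FP and FN over all classes before
computing the metric, so frequent classes dominate
 Weighted average: mean of the per-class metrics, weighted by
the number of true instances of each class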
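The difference between the three averages shows up clearly on an imbalanced example. The sketch below computes precision only, on hypothetical labels for three classes; the function names are illustrative, and a class that is never predicted is given precision 0.0 by assumption.

```python
from collections import Counter


def per_class_precision(actual, predicted):
    """Precision of each class: correct predictions / all predictions of it."""
    classes = sorted(set(actual) | set(predicted))
    predicted_count = Counter(predicted)
    true_positive = Counter(p for a, p in zip(actual, predicted) if a == p)
    return {
        c: (true_positive[c] / predicted_count[c] if predicted_count[c] else 0.0)
        for c in classes
    }


def averaged_precision(actual, predicted):
    """Return (macro, micro, weighted) precision."""
    per_class = per_class_precision(actual, predicted)
    support = Counter(actual)  # number of true instances per class
    n = len(actual)
    macro = sum(per_class.values()) / len(per_class)
    # Micro: total TP over total predictions (single-label case).
    micro = sum(1 for a, p in zip(actual, predicted) if a == p) / n
    weighted = sum(per_class[c] * support[c] for c in per_class) / n
    return macro, micro, weighted


# Hypothetical imbalanced data: class "a" is frequent, class "c" is rare.
actual = ["a"] * 6 + ["b"] * 3 + ["c"]
predicted = ["a", "a", "a", "a", "a", "b", "b", "b", "a", "a"]

macro, micro, weighted = averaged_precision(actual, predicted)
print(round(macro, 4), round(micro, 4), round(weighted, 4))
```

The rare class "c" is never predicted correctly, which pulls the macro average down to about 0.46, while the micro average (0.70) and the weighted average (about 0.63) are dominated by the frequent class.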
Evaluation Strategies (How you test)
Train/Test Split
Simple, fast, but noisy.
Cross-Validation
 k-fold CV is the standard
 Reduces variance in evaluation
Stratified Sampling
Preserves class proportions — very important for imbalanced
data.
Classification techniques
1. Linear Classification Techniques
Logistic Regression
 Models probability using a sigmoid function
 Fast, interpretable, strong baseline
 Works best when the classes are roughly linearly separable
Linear Discriminant Analysis (LDA)
 Models each class by its mean, with a covariance shared
across classes
 Assumes normally distributed features
 Good when data is well-behaved and low-dimensional
2. Distance-Based Techniques
K-Nearest Neighbors (k-NN)
 Classifies based on the majority vote of nearest points
 No training phase (lazy learner)
 Sensitive to noise and feature scaling
3. Tree-Based Techniques
Decision Trees
 Uses feature splits to classify data
 Easy to interpret
 Prone to overfitting
Random Forest
 Ensemble of decision trees
 Reduces overfitting via bagging
 Strong general-purpose classifier
Gradient Boosting (XGBoost, LightGBM, CatBoost)
 Trees built sequentially to fix previous errors
 Very high accuracy
 Needs careful tuning
4. Probabilistic Techniques
Naive Bayes
 Based on Bayes’ theorem
 Assumes feature independence
 Very fast, works well for text classification
5. Margin-Based Techniques
Support Vector Machines (SVM)
 Finds the maximum-margin hyperplane
 Works well in high-dimensional spaces
 Kernel trick handles non-linear data
6. Neural Network-Based Techniques
Artificial Neural Networks (ANN)
 Multi-layer perceptrons for classification
 Handles complex patterns
 Requires large datasets
Deep Learning Models
 CNNs → images
 RNNs / Transformers → text & sequences
 State-of-the-art but computationally expensive
7. Rule-Based & Other Techniques
Rule-Based Classifiers
 IF–THEN rules
 Easy to understand
 Limited flexibility
Ensemble Methods
 Combine multiple classifiers
 Examples: Voting, Stacking, Bagging, Boosting
8. Binary vs Multi-Class Techniques
 Natively binary classifiers: Logistic Regression, SVM
(k-NN handles any number of classes directly)
 Multi-class strategies: One-vs.-Rest (OvR), One-vs.-One
(OvO), Softmax
Quick Comparison Table

Technique            Interpretability   Speed              Accuracy
Logistic Regression  High               Fast               Medium
k-NN                 Medium             Slow (test time)   Medium
Decision Tree        High               Fast               Medium
Random Forest        Medium             Medium             High
SVM                  Low                Medium             High
Neural Networks      Low                Slow               Very High

What is Decision Tree Construction?


Decision tree construction is the process of building a tree-
like model by repeatedly splitting the dataset based on
feature values, yielding subsets as pure as possible.
Main Steps in Decision Tree Construction
1. Start with the Root Node
 The entire dataset is placed at the root.
 Choose the best attribute to split the data.
2. Select the Best Splitting Attribute
The best attribute is chosen using an attribute selection
measure:
Common Measures:
 Information Gain (ID3)
 Gain Ratio (C4.5)
 Gini Index (CART)
These measures evaluate how well a feature separates the
classes.
3. Split the Dataset
 The dataset is divided into subsets based on the selected
attribute.
 Each branch represents an outcome of the test.
Example:
Is Outlook = Sunny?
├── Yes
└── No
4. Create Decision Nodes or Leaf Nodes
 If all instances in a subset belong to the same class,
create a leaf node.
 If not, repeat the process recursively on that subset.
5. Stopping Conditions
Tree growth stops when:
 All instances belong to the same class
 No attributes are left for splitting
 A predefined depth is reached
 Information gain becomes negligible
6. Pruning (Avoid Overfitting)
Pruning prevents or removes branches that add little
predictive value.
Types of Pruning:
 Pre-pruning (early stopping): stops tree growth early
 Post-pruning: removes branches after full tree construction
Attribute Selection Measures (Key Formulas)
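Entropy
$\text{Entropy}(S) = -\sum_{i} p_i \log_2 p_i$

Information Gain
$IG(S, A) = \text{Entropy}(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|} \, \text{Entropy}(S_v)$

Gini Index
$\text{Gini}(S) = 1 - \sum_{i} p_i^2$

Here $p_i$ is the proportion of instances in $S$ that belong to
class $i$, and $S_v$ is the subset of $S$ in which attribute $A$
takes the value $v$.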
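The three measures can be evaluated on a small hypothetical dataset to see how they behave. This is a sketch under stated assumptions: the attribute name "hours", the labels and the use of base-2 logarithms are choices made for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy(S) = -sum(p_i * log2(p_i)) over the classes present in S."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini(S) = 1 - sum(p_i ** 2)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy of S minus the size-weighted entropy of each subset S_v."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute], []).append(label)
    remainder = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


# Hypothetical data: 5 "high" rows (all pass) and 5 "low" rows (2 pass, 3 fail).
rows = [{"hours": "high"}] * 5 + [{"hours": "low"}] * 5
labels = ["pass"] * 5 + ["pass", "pass", "fail", "fail", "fail"]

print(round(entropy(labels), 4))
print(round(gini(labels), 4))
print(round(information_gain(rows, labels, "hours"), 4))
```

With 7 "pass" and 3 "fail" labels the entropy is about 0.881 and the Gini index is 0.42; splitting on "hours" leaves one pure subset, so the information gain is about 0.396.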

Algorithms for Decision Tree Construction

Algorithm   Measure Used
ID3         Information Gain
C4.5        Gain Ratio
CART        Gini Index

Advantages of Decision Tree Construction


 Easy to understand and interpret
 Handles both categorical and numerical data
 Requires little data preprocessing
Limitations
 Sensitive to noisy data
 Can overfit without pruning
 Small data changes may create different trees
Simple Flow of Construction
Start

Select the best attribute

Split data

Pure node?
├─ Yes → Leaf node
└─ No → Repeat
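The flow above can be sketched as a short recursive function in the style of ID3 (information gain on categorical attributes). It is an illustrative sketch rather than a full implementation: the dataset, the attribute names and the nested-dictionary tree representation are assumptions, and there is no pruning or handling of attribute values unseen during training.

```python
import math
from collections import Counter


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, attr):
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def build_tree(rows, labels, attributes, max_depth=3):
    # Stopping condition: every instance has the same class -> leaf node.
    if len(set(labels)) == 1:
        return labels[0]
    # Stopping conditions: no attributes left or depth limit -> majority leaf.
    if not attributes or max_depth == 0:
        return Counter(labels).most_common(1)[0][0]
    # Select the attribute with the highest information gain.
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    branches = {}
    for value in sorted({row[best] for row in rows}):
        keep = [i for i, row in enumerate(rows) if row[best] == value]
        branches[value] = build_tree(
            [rows[i] for i in keep],
            [labels[i] for i in keep],
            [a for a in attributes if a != best],
            max_depth - 1,
        )
    return {"attribute": best, "branches": branches}


def predict(tree, row):
    # Follow branches until a leaf (a plain class label) is reached.
    while isinstance(tree, dict):
        tree = tree["branches"][row[tree["attribute"]]]
    return tree


# Hypothetical training data.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["yes", "yes", "yes", "yes", "no", "no"]

tree = build_tree(rows, labels, ["outlook", "windy"])
print(tree)
print(predict(tree, {"outlook": "rainy", "windy": "yes"}))
```

On this data the root splits on "outlook" (the sunny subset is already pure), and the rainy branch is split again on "windy".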

Naive Bayes Classifier

The Naive Bayes Classifier is a probabilistic machine learning
algorithm used mainly for classification problems like spam detection,
sentiment analysis, and document classification.

It is based on Bayes’ Theorem and assumes that features are
independent of each other given the class (that’s why it’s called
“naive”).
1. Bayes’ Theorem

Bayes’ theorem formula:

$P(C \mid X) = \frac{P(X \mid C) \cdot P(C)}{P(X)}$

Where:

 P(C|X) → Posterior probability (probability of class C given
data X)
 P(X|C) → Likelihood (probability of data X given class C)
 P(C) → Prior probability of class C
 P(X) → Evidence (probability of data X)

In simple words:

It calculates the probability that a data point belongs to a particular
class.

Why “Naive”?
Because it assumes:

All features are independent of each other.

Example:
If you're classifying whether an email is spam:

 Word “free”
 Word “offer”
 Word “win”

Naive Bayes assumes these words occur independently, which is
not fully true in real life, yet the model often still works well in
practice.

Types of Naive Bayes

1. Gaussian Naive Bayes
o Used when features are continuous
o Assumes data follows a normal distribution
2. Multinomial Naive Bayes
o Used for text classification
o Based on word counts
3. Bernoulli Naive Bayes
o Used for binary features (0/1, yes/no)

Working Steps
1. Calculate prior probabilities for each class
2. Calculate likelihood probabilities for each feature
3. Apply Bayes theorem
4. Choose class with highest posterior probability

Applications
 Email spam detection
 Sentiment analysis
 Document classification
 Medical diagnosis
 Student performance prediction

Advantages
 Simple and fast
 Works well with large datasets
 Performs well in text classification
 Requires less training data
Disadvantages

 Assumes feature independence


 Not good when features are highly correlated
Example
Suppose we want to classify a message as Spam or Not Spam.

If:

 P(Spam) = 0.4
 P(Free | Spam) = 0.8
 P(Free | NotSpam) = 0.1

Then the model compares probabilities and chooses the higher one.
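The comparison can be written out explicitly. The sketch below assumes there are only two classes, so that P(NotSpam) = 1 − P(Spam) = 0.6, and that the message contains the word "Free"; the variable names are illustrative.

```python
# Values given in the example.
p_spam = 0.4
p_free_given_spam = 0.8
p_free_given_not_spam = 0.1

# Assumption: two classes only, so the priors sum to 1.
p_not_spam = 1.0 - p_spam

# Unnormalised scores: prior * likelihood for each class.
score_spam = p_spam * p_free_given_spam              # about 0.32
score_not_spam = p_not_spam * p_free_given_not_spam  # about 0.06

# Dividing by the evidence P(Free) turns the scores into posteriors.
evidence = score_spam + score_not_spam
posterior_spam = score_spam / evidence

prediction = "Spam" if score_spam > score_not_spam else "Not Spam"
print(prediction, round(posterior_spam, 3))
```

Since 0.32 > 0.06, the message is classified as Spam, with a posterior probability of about 0.84.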

Problem

Suppose we want to classify whether a student will pass or fail based
on two features:

 Hours Studied (Low / High)


 Attendance (Poor / Good)

We are given training data of 10 students:

Student   Hours   Attendance   Result
1         High    Good         Pass
2         High    Good         Pass
3         High    Poor         Pass
4         Low     Good         Pass
5         Low     Poor         Fail
6         Low     Poor         Fail
7         High    Poor         Pass
8         Low     Good         Pass
9         Low     Poor         Fail
10        High    Good         Pass

Now classify a new student:

Hours = High, Attendance = Poor

Step 1: Calculate Prior Probabilities


Total students = 10

Number of Pass = 7
Number of Fail = 3

P(Pass) = 7/10 = 0.7
P(Fail) = 3/10 = 0.3

Step 2: Calculate Likelihood Probabilities


For Class = Pass

Among 7 Pass students:

 High Hours = 5
 Low Hours = 2
 Poor Attendance = 2
 Good Attendance = 5

So,

P(High | Pass) = 5/7
P(Poor | Pass) = 2/7
For Class = Fail

Among 3 fail students:

 High Hours = 0
 Low Hours = 3
 Poor Attendance = 3
 Good Attendance = 0

So,

P(High | Fail) = 0/3 = 0
P(Poor | Fail) = 3/3 = 1

Step 3: Apply Naive Bayes Formula


We calculate a score proportional to P(Pass | High, Poor) using:

P(Pass) × P(High | Pass) × P(Poor | Pass)
= 0.7 × (5/7) × (2/7)
= 0.7 × 0.714 × 0.286
= 0.143

Now for Fail:

P(Fail) × P(High | Fail) × P(Poor | Fail)
= 0.3 × 0 × 1
= 0

Step 4: Compare Probabilities

Score for Pass given (High, Poor) = 0.143
Score for Fail given (High, Poor) = 0
These are unnormalised scores; dividing each by their sum (0.143)
gives posterior probabilities of 1 for Pass and 0 for Fail.
Since 0.143 > 0

Final Prediction = PASS
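The four steps can be reproduced from the training table with a few lines of code. This is a minimal sketch that counts frequencies directly and applies no smoothing; the function name and the tuple layout (hours, attendance, result) are assumptions made for illustration.

```python
from collections import Counter

# Training data from the table above: (hours, attendance, result).
data = [
    ("High", "Good", "Pass"), ("High", "Good", "Pass"),
    ("High", "Poor", "Pass"), ("Low", "Good", "Pass"),
    ("Low", "Poor", "Fail"), ("Low", "Poor", "Fail"),
    ("High", "Poor", "Pass"), ("Low", "Good", "Pass"),
    ("Low", "Poor", "Fail"), ("High", "Good", "Pass"),
]


def naive_bayes_scores(data, query):
    """Return prior * product of likelihoods for each class (unnormalised)."""
    class_counts = Counter(row[-1] for row in data)
    total = len(data)
    scores = {}
    for cls, n_cls in class_counts.items():
        score = n_cls / total  # Step 1: prior probability
        for i, value in enumerate(query):
            # Step 2: likelihood of this feature value within the class
            matches = sum(1 for row in data if row[-1] == cls and row[i] == value)
            score *= matches / n_cls
        scores[cls] = score  # Step 3: numerator of Bayes' theorem
    return scores


scores = naive_bayes_scores(data, ("High", "Poor"))
prediction = max(scores, key=scores.get)  # Step 4: highest score wins
print(scores, prediction)
```

The score for Pass is 0.7 × 5/7 × 2/7 ≈ 0.143 and the score for Fail is 0, matching the hand calculation.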

Important Observation (Zero Probability Problems)


Notice:

P(High | Fail) = 0

Because in training data, no failed student had High hours.

This is called the Zero Frequency Problem.
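A common remedy, not covered in the text above, is Laplace (add-one) smoothing: a small constant is added to every count so that no likelihood is exactly zero. The sketch below applies it to the counts from the worked example, assuming a smoothing constant of 1 and two possible values per feature; the function name is hypothetical.

```python
def smoothed_likelihood(count, class_total, n_values, alpha=1.0):
    """(count + alpha) / (class_total + alpha * n_values)."""
    return (count + alpha) / (class_total + alpha * n_values)


# Counts from the worked example; each feature has 2 possible values.
# Pass (7 students): High = 5, Poor = 2.  Fail (3 students): High = 0, Poor = 3.
p_high_pass = smoothed_likelihood(5, 7, 2)  # 6/9
p_poor_pass = smoothed_likelihood(2, 7, 2)  # 3/9
p_high_fail = smoothed_likelihood(0, 3, 2)  # 1/5, no longer zero
p_poor_fail = smoothed_likelihood(3, 3, 2)  # 4/5

score_pass = 0.7 * p_high_pass * p_poor_pass
score_fail = 0.3 * p_high_fail * p_poor_fail
posterior_pass = score_pass / (score_pass + score_fail)
print(round(score_pass, 4), round(score_fail, 4), round(posterior_pass, 4))
```

With smoothing, the Fail class receives a small non-zero score (0.048) instead of being ruled out entirely; Pass still wins with a score of about 0.156.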

Bayesian Belief Network


A Bayesian Belief Network (BBN), often just called a Bayesian
Network, Bayes Net, or Belief Network, is a type of probabilistic
graphical model. It represents a set of variables and their
conditional dependencies, which makes it a useful tool for
reasoning under uncertainty.

By mapping out cause-and-effect relationships and assigning
probabilities to them, BBNs allow you to calculate the likelihood of
certain events occurring based on the evidence you currently have.

Here is a breakdown of how they work, their core components, and
where they are used.

Core Components

A BBN consists of two main parts: a graphical structure and a set of
probabilities.

 Directed Acyclic Graph (DAG): This is the visual structure of
the network.
o Nodes: Each node represents a random variable. These
variables can be observable quantities (like a sensor
reading), latent variables (like a hidden disease), or
hypotheses.
o Directed Edges (Arrows): These represent conditional
dependencies. If an arrow points from Node $A$ to Node
$B$, $A$ is considered the "parent" of $B$, meaning the
state of $B$ depends directly on the state of $A$.
o Acyclic: The graph cannot have loops. You cannot start at
a node, follow the arrows, and end up back at the same
node.
 Conditional Probability Tables (CPTs): Every node in the
network has an associated CPT.
o If a node has no parents (an independent variable), the
table simply lists its prior probabilities.
o If a node has parents, the table quantifies the probability of
that node being in a specific state given the states of its
parents.

The Mathematical Foundation


BBNs are built on Bayes' Theorem, which provides a mathematical
rule for updating the probability of a hypothesis as more evidence or
information becomes available.

The true power of a BBN lies in how it simplifies complex probability
distributions. Instead of calculating the joint probability of every
single variable interacting with every other variable, a BBN assumes
that a variable is conditionally independent of all its non-descendants,
given its parents.

This allows the network to calculate the full joint probability
distribution using the chain rule for Bayesian networks:

$P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \text{Parents}(X_i))$

Where:

 $X_i$ represents a specific node (variable).
 $\text{Parents}(X_i)$ represents the direct causes of that node.
A Classic Example: The Wet Grass

Imagine you want to know why the grass in your yard is wet. A
simple BBN for this might involve four variables: Cloudy, Rain,
Sprinkler, and Wet Grass.

1. Cloudy is the root node. It has no parents, so its table holds
only a prior probability.
2. Cloudy is a parent to both Rain and Sprinkler. (If it's cloudy,
it's more likely to rain, and you are less likely to turn your
sprinkler on).
3. Both Rain and Sprinkler are parents to Wet Grass.

If you step outside and see that the grass is wet (your evidence), you
can use the network to work backward (infer) and calculate the
probability that it rained versus the probability that the sprinkler was
left on. If you then notice it is cloudy, the network will update the
probabilities, making rain the much more likely cause.
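The inference described above can be sketched by enumerating the joint distribution with the chain rule. The conditional probability values below are illustrative assumptions (the text gives no numbers), and exhaustive enumeration is only practical for small networks like this one.

```python
from itertools import product

# Assumed CPTs (illustrative values, not taken from the text).
P_CLOUDY = 0.5
P_SPRINKLER = {True: 0.1, False: 0.5}   # P(Sprinkler on | Cloudy)
P_RAIN = {True: 0.8, False: 0.2}        # P(Rain | Cloudy)
P_WET = {                               # P(Wet | Sprinkler, Rain)
    (True, True): 0.99, (True, False): 0.9,
    (False, True): 0.9, (False, False): 0.0,
}


def bernoulli(p_true, value):
    """Probability of a binary variable taking `value`."""
    return p_true if value else 1.0 - p_true


def joint(cloudy, sprinkler, rain, wet):
    """Chain rule: product of P(node | parents) over all four nodes."""
    return (bernoulli(P_CLOUDY, cloudy)
            * bernoulli(P_SPRINKLER[cloudy], sprinkler)
            * bernoulli(P_RAIN[cloudy], rain)
            * bernoulli(P_WET[(sprinkler, rain)], wet))


def query(target, evidence):
    """P(target = True | evidence), summing the joint over all assignments."""
    names = ("cloudy", "sprinkler", "rain", "wet")
    numerator = denominator = 0.0
    for values in product([True, False], repeat=4):
        world = dict(zip(names, values))
        if any(world[name] != value for name, value in evidence.items()):
            continue  # inconsistent with the evidence
        p = joint(*values)
        denominator += p
        if world[target]:
            numerator += p
    return numerator / denominator


print(round(query("rain", {"wet": True}), 3))
print(round(query("sprinkler", {"wet": True}), 3))
print(round(query("rain", {"wet": True, "cloudy": True}), 3))
print(round(query("sprinkler", {"wet": True, "cloudy": True}), 3))
```

Under these assumed numbers, seeing wet grass gives rain a probability of about 0.71 and the sprinkler about 0.43; adding the observation that it is cloudy raises rain to about 0.98 and lowers the sprinkler to about 0.13.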

Common Applications
Because BBNs are excellent at handling incomplete data and
modeling complex systems, they are heavily utilized across various
fields:

 Medical Diagnosis: Mapping symptoms to potential diseases to
calculate the most likely diagnosis.
 Robotics and AI: Helping autonomous systems make decisions
when sensor data is noisy or incomplete.
 Spam Filtering: Analyzing the occurrence of certain words to
determine the probability that an email is spam.
 Risk Management: Predicting the likelihood of financial
defaults, equipment failures, or project delays.

K-Nearest Neighbour Classification: Algorithm and Characteristics
K-Nearest Neighbors (KNN) is one of the simplest, yet most
intuitive, supervised machine learning algorithms. It is primarily used
for classification (though it can also be used for regression) and
operates on a very straightforward principle: similar things exist in
close proximity.

It works by finding the "k" closest data points (neighbours) to a
given input and making a prediction based on the majority class (for
classification) or the average value (for regression). Because KNN
makes no assumptions about the underlying data distribution, it is
a non-parametric, instance-based learning method.
K-Nearest Neighbors is also called a lazy learner algorithm because
it does not learn from the training set immediately; instead, it
stores the entire dataset and performs computations only at the time
of classification.
For example, consider data points that belong to two classes,
Category 1 and Category 2:
 KNN assigns the category based on the majority of nearby points.
The figure shows how KNN predicts the category of a new data
point based on its closest neighbours.
 The green points represent Category 1 and the red points represent
Category 2.
 The new data point checks its closest neighbours (circled points).
 Since the majority of its closest neighbours are red points (Category
2), KNN predicts that the new data point belongs to Category 2.
[Figure: KNN algorithm working visualization]

KNN works by using proximity and majority voting to make
predictions.
What is 'K' in K nearest Neighbour?
In the k-Nearest Neighbours algorithm k is just a number that tells the
algorithm how many nearby points or neighbors to look at when it
makes a decision.
Example: Imagine you're deciding which fruit it is based on its shape
and size. You compare it to fruits you already know.
 If k = 3, the algorithm looks at the 3 closest fruits to the new one.
 If 2 of those 3 fruits are apples and 1 is a banana, the algorithm
says the new fruit is an apple because most of its neighbors are
apples.

How to choose the value of k for KNN Algorithm?

 The value of k in KNN decides how many neighbours the
algorithm looks at when making a prediction.
 Choosing the right k is important for good results.
 If the data has lots of noise or outliers, using a larger k can make
the predictions more stable.
 But if k is too large, the model may become too simple and miss
important patterns; this is called underfitting.
 So k should be picked carefully based on the data.

Statistical Methods for Selecting k


 Cross-Validation: A good way to find the best value of k is
cross-validation. The dataset is divided into several parts (folds;
their number is separate from KNN's k). For each candidate k, the
model is trained on some of these parts and tested on the remaining
one, and this is repeated so that each part serves once as the test
set. The k value that gives the highest average accuracy across
these tests is usually the best one to use.
 Elbow Method: In the Elbow Method we draw a graph showing the
error rate or accuracy for different k values. As k increases, the
error usually drops at first, but after a certain point it stops
decreasing quickly. The point where the curve changes direction
and looks like an "elbow" is usually the best choice for k.
 Odd Values for k: It’s a good idea to use an odd number for k,
especially in classification problems with two classes. This helps
avoid ties when deciding which class is the most common among the
neighbours.
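Selecting k by cross-validation can be sketched with leave-one-out cross-validation, the special case in which every fold contains a single point. The data points, labels, candidate k values and function names below are hypothetical, and the example is deliberately tiny so that the effect of a too-large k is visible.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """Majority class among the k training points closest to `query`."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def loo_accuracy(data, k):
    """Leave-one-out accuracy: each point is predicted from all the others."""
    correct = 0
    for i, (features, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        if knn_predict(rest, features, k) == label:
            correct += 1
    return correct / len(data)


# Hypothetical data: two well-separated clusters of three points each.
data = [
    ((1.0, 1.0), "apple"), ((1.2, 0.8), "apple"), ((0.9, 1.3), "apple"),
    ((3.0, 3.2), "banana"), ((3.1, 2.9), "banana"), ((2.8, 3.0), "banana"),
]

accuracy_by_k = {k: loo_accuracy(data, k) for k in (1, 3, 5)}
best_k = max(accuracy_by_k, key=accuracy_by_k.get)
print(accuracy_by_k, best_k)
```

With k = 1 or k = 3 every held-out point is classified correctly, but with k = 5 each held-out point is outvoted by the three points of the other class, so the accuracy drops to 0. Ties between equally accurate k values are broken here in favour of the smallest candidate.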
Distance Metrics Used in KNN Algorithm

KNN uses distance metrics to identify the nearest neighbours, which
are then used for classification and regression tasks. The following
distance metrics are commonly used:

1. Euclidean Distance

Euclidean distance is defined as the straight-line distance between two
points in a plane or space. You can think of it like the shortest path
you would walk if you were to go directly from one point to another.

$\text{distance}(x, X_i) = \sqrt{\sum_{j=1}^{d} (x_j - X_{ij})^2}$

2. Manhattan Distance
This is the total distance you would travel if you could only move
along horizontal and vertical lines like a grid or city streets. It’s also
called "taxicab distance" because a taxi can only drive along the grid-
like streets of a city.
$d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$
3. Minkowski Distance
Minkowski distance is like a family of distances, which includes both
Euclidean and Manhattan distances as special cases.

$d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$

From the formula above, when p = 2 it becomes the Euclidean distance
formula, and when p = 1 it turns into the Manhattan distance formula.
Minkowski distance is therefore a flexible formula that can represent
either distance depending on the value of p.
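The relationship between the three metrics can be checked numerically. In this sketch the function names and the two sample points are illustrative; Euclidean and Manhattan distance are defined as the p = 2 and p = 1 cases of Minkowski distance.

```python
def minkowski(x, y, p):
    """Minkowski distance: (sum |x_i - y_i|^p)^(1/p)."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of features")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def euclidean(x, y):
    return minkowski(x, y, 2)


def manhattan(x, y):
    return minkowski(x, y, 1)


# Hypothetical points: the differences are 3 and 4 along the two axes.
a = (1.0, 2.0)
b = (4.0, 6.0)
print(euclidean(a, b), manhattan(a, b))
```

For these points the Euclidean distance is 5 (the hypotenuse of a 3-4-5 triangle) and the Manhattan distance is 7 (3 + 4).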
Working of KNN algorithm
The K-Nearest Neighbors (KNN) algorithm operates on the principle of similarity:
it predicts the label or value of a new data point by considering the labels or
values of its K nearest neighbors in the training dataset.
Step 1: Selecting the optimal value of K

 K represents the number of nearest neighbours that need to be
considered while making a prediction.

Step 2: Calculating distance

 To measure the similarity between the target and the training data
points, Euclidean distance is widely used. The distance is calculated
between each data point in the dataset and the target point.

Step 3: Finding Nearest Neighbors

 The k data points with the smallest distances to the target point are
nearest neighbours.

Step 4: Voting for Classification or Taking Average for


Regression

 When you want to classify a data point into a category like spam or
not spam, the KNN algorithm looks at the K closest points in the
dataset. These closest points are called neighbours. The algorithm
then looks at which category each neighbour belongs to and picks the
one that appears the most. This is called majority voting.
 In regression, the algorithm still looks for the K closest points, but
instead of voting for a class, it takes the average of the values of
those K neighbours. This average is the predicted value for the new
point.
[Figure: a test point is classified based on its nearest neighbours.
As the test point moves, the algorithm identifies the closest k data
points (k = 5 in the figure) and assigns the test point the majority
class label, which is the grey class in the figure.]
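The four steps can be sketched in a few lines. This is a minimal brute-force version using Euclidean distance (`math.dist`, available from Python 3.8); the training points, labels and function names are hypothetical.

```python
import math
from collections import Counter


def k_nearest(train, query, k):
    """Steps 2 and 3: sort by distance to the query and keep the k closest."""
    return sorted(train, key=lambda item: math.dist(item[0], query))[:k]


def knn_classify(train, query, k):
    """Step 4 (classification): majority vote among the k neighbours."""
    votes = Counter(label for _, label in k_nearest(train, query, k))
    return votes.most_common(1)[0][0]


def knn_regress(train, query, k):
    """Step 4 (regression): average of the k neighbours' values."""
    neighbours = k_nearest(train, query, k)
    return sum(value for _, value in neighbours) / len(neighbours)


# Hypothetical fruit data: features are (shape score, size score).
fruits = [
    ((1.0, 1.0), "apple"), ((1.2, 0.8), "apple"), ((0.9, 1.3), "apple"),
    ((3.0, 3.2), "banana"), ((3.1, 2.9), "banana"), ((2.8, 3.0), "banana"),
]
print(knn_classify(fruits, (1.5, 1.5), k=3))
print(knn_classify(fruits, (2.2, 2.2), k=3))

# Hypothetical regression data: one feature, numeric target.
prices = [((1.0,), 10.0), ((2.0,), 20.0), ((3.0,), 30.0), ((10.0,), 100.0)]
print(knn_regress(prices, (2.1,), k=3))
```

Every prediction sorts the whole training set, which illustrates why KNN becomes slow on large datasets.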
Applications of KNN
 Recommendation Systems: Suggests items like movies or
products by finding users with similar preferences.
 Spam Detection: Identifies spam emails by comparing new emails
to known spam and non-spam examples.
 Customer Segmentation: Groups customers by comparing their
shopping behavior to others.
 Speech Recognition: Matches spoken words to known patterns to
convert them into text.
Advantages of KNN
 Simple to use: Easy to understand and implement.
 No training step: No need to train as it just stores the data and
uses it during prediction.
 Few parameters: Only needs the number of neighbours (k)
and a distance metric to be set.
 Versatile: Works for both classification and regression problems.
Disadvantages of KNN
 Slow with large data: Needs to compare every point during
prediction.
 Struggles with many features: Accuracy drops when data has too
many features.
 Can overfit: It can overfit, especially when the data is
high-dimensional or noisy.

Prediction: Accuracy and Error measures


Prediction accuracy and error measures in data mining, such as
RMSE, MAE, and classification accuracy, evaluate model
performance by quantifying the difference between predicted and
actual values. Common techniques like k-fold cross-validation and
confusion matrices (precision, recall, F1-score) are used to assess
robustness, with RMSE often used for numerical prediction and
accuracy for classification.

Key Prediction Accuracy & Error Measures


 Root Mean Square Error (RMSE): The most frequently used metric
to assess numerical prediction errors; because errors are squared
before averaging, large errors are penalised more heavily.
 Mean Absolute Error (MAE): Measures the average magnitude of
errors, weighting all errors equally.
 Coefficient of Determination (R²): Indicates the fraction of variance
in the target that is explained by the model.
 Accuracy (Classification): Defined as the percentage of correct
predictions out of all predictions, calculated as
(TP + TN) / (TP + TN + FP + FN).


 Precision & Recall: Precision measures the quality of positive
predictions, while recall measures the proportion of actual positive
cases detected.
 F1-Score: The harmonic mean of precision and recall, balancing
both.
 Log Loss: Evaluates predicted probabilities rather than labels; it
heavily penalises predictions that assign a low probability to the
correct class.
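The numerical measures in this list can be computed as follows. This is a sketch using the standard definitions; the sample values are hypothetical, and the clipping constant `eps` in the log loss is an assumption added to avoid taking the logarithm of zero.

```python
import math


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean square error."""
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return math.sqrt(mse)


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def log_loss(labels, probabilities, eps=1e-15):
    """Binary log loss; `probabilities` are predicted P(label = 1)."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


# Hypothetical regression targets and predictions.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 12.0]
print(mae(actual, predicted), round(rmse(actual, predicted), 4),
      round(r_squared(actual, predicted), 4))

# Confident and correct versus confident and wrong.
print(round(log_loss([1, 0], [0.9, 0.1]), 4))
print(round(log_loss([1], [0.1]), 4))
```

The single error of 3 contributes 9 of the 11 units of squared error, which is why the RMSE (about 1.66) is noticeably larger than the MAE (1.25). The last two lines show that a confident wrong prediction costs far more log loss than a confident correct one.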
Model Evaluation Techniques
 Holdout Method: Randomly splits data into training and test sets to
calculate error.
 K-Fold Cross-Validation: Splits data into K subsets and trains K
times, each time testing on a different subset, to reduce the bias of
the estimate.
 Bootstrapping: Repeatedly samples the dataset with replacement for
estimation.
 Confusion Matrix: A table used to visualize the performance of
classification algorithms by comparing predicted and actual classes.
Evaluation Criteria for Models
 Accuracy: The ability of a model to correctly predict class labels or
numeric values.
 Speed: Computational cost to build and use the model.
 Robustness: The capability to make correct predictions on noisy
data or data with missing values.
 Scalability: Performance change relative to the amount of data.
 Interpretability: The ability to understand the reasoning behind the
model's predictions.
Evaluating the accuracy of a classifier or a predictor
Data mining is also referred to as knowledge mining from data,
knowledge extraction, data/pattern analysis, data archaeology, and
data dredging. This section covers techniques for evaluating the
accuracy of classifiers.

Hold Out

In the holdout method, the dataset is randomly divided into
three subsets:
 The training set is the subset of the dataset used to
build predictive models.
 The validation set is the subset of the dataset used to
assess the performance of the model built in the training phase. It
provides a test platform for fine-tuning the model's parameters
and selecting the best-performing model. Not all
modelling algorithms need a validation set.
 The test set (unseen examples) is the subset of the dataset used to
assess the likely future performance of the model. If a model
fits the training set much better than it fits the test set,
overfitting is the probable cause.
Typically, two-thirds of the data are allocated to the training set
and the remaining one-third to the test set.
Random Subsampling
Random subsampling is a variation of the holdout method in which
the holdout procedure is repeated K times. Each repetition randomly
splits the data into a training set and a test set, trains the model
on the training set, and obtains the mean square error (MSE) from the
predictions on the test set. Because the MSE depends on the split, a
single split is not a reliable estimate: a new split can give a new
MSE. The overall error is therefore calculated as the average over
the K repetitions:
$E = \frac{1}{K} \sum_{i=1}^{K} E_i$
Cross-Validation

 K-fold cross-validation is used when only a limited amount of data
is available, to achieve a less biased estimate of the performance
of the model.
 Here, we divide the data into K subsets of equal size.
 We build models K times, each time leaving out one of the subsets
from training and using it as the test set.
 If K equals the sample size, this is called "Leave-One-Out"
cross-validation.
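The fold construction can be sketched as a generator of index lists. This is a simplified illustration: the function name is hypothetical, the folds are taken in order without shuffling (in practice the data are usually shuffled first), and when the sample size is not divisible by K the first folds receive one extra item.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    indices = list(range(n_samples))
    # Fold sizes differ by at most one item.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


for train, test in k_fold_indices(10, 5):
    print("train:", train, "test:", test)

# Leave-One-Out: K equals the sample size, so each test set has one item.
leave_one_out = list(k_fold_indices(4, 4))
print(leave_one_out)
```

Each index appears in exactly one test set, so every sample is used for testing once and for training K − 1 times.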

Bootstrapping

 Bootstrapping is a technique for making estimates from the data by
averaging the estimates obtained from many resampled datasets.
 The bootstrapping method involves iteratively resampling a
dataset with replacement.
 With resampling, instead of estimating a statistic only once on the
complete data, we can estimate it many times.
 Repeating this multiple times yields a vector of estimates.
 From this vector, bootstrapping can compute the variance, expected
value, and other relevant statistics of the estimates.
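The resampling procedure can be sketched with the standard library. The sample values, the number of resamples and the fixed seed are arbitrary choices for illustration, and the function name is hypothetical.

```python
import random
import statistics


def bootstrap_estimates(data, statistic, n_resamples=1000, seed=0):
    """Return the statistic computed on each resample drawn with replacement."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    n = len(data)
    return [
        statistic([rng.choice(data) for _ in range(n)])
        for _ in range(n_resamples)
    ]


# Hypothetical sample.
sample = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 10.0, 12.0]
estimates = bootstrap_estimates(sample, statistics.mean)

boot_mean = statistics.mean(estimates)   # expected value of the estimate
boot_std = statistics.stdev(estimates)   # spread (standard error) of the estimate
print(round(boot_mean, 3), round(boot_std, 3))
```

The vector `estimates` holds one mean per resample; its average should lie close to the mean of the original sample, and its standard deviation indicates how much the estimate varies from sample to sample.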
Ensemble methods
Ensemble methods are used in data mining because they can improve
the predictive performance of machine learning models. A single model
may either overfit the training data or underperform on unseen instances.
Ensembles address these problems by aggregating several models so that
their individual errors balance out.
