Machine Learning Performance Metrics Guide

Uploaded by

Ayan Khajuria

Metrics

Rajeev Kumar

Department of Computer Science & Engineering

Delhi Technological University
E-mail: rajeevkumar@[Link]
Objective

To understand the different metrics used to evaluate the performance of ML models.

Acknowledgement
• Most of the material is taken from the Web, and links to the sources are placed on the slides themselves.
• I am sure I have been influenced by, and borrowed ideas from, other sources, and I apologize if I have failed to acknowledge them.

Performance Metrics in Machine Learning
• Evaluating the performance of a machine learning model is one of the most important steps in building an effective ML model.
• To evaluate the performance or quality of a model, different metrics are used; these are known as performance metrics or evaluation metrics.
• These performance metrics help us understand how well our model has performed on the given data.
• Based on these metrics, we can improve the model's performance, for example by tuning its hyper-parameters.
• Every ML model aims to generalize well to unseen data, and performance metrics measured on held-out data help determine how well the model generalizes.

(Photo/Content source: [Link])

Metrics for Classification
• In a classification problem, the category or class of each data point is identified based on training data.
• The model learns from the given dataset and then classifies new data into classes or groups based on that training.
• It predicts class labels as the output, such as Yes or No, 0 or 1, Spam or Not Spam, etc.
• Common metrics for classification:
– Confusion Matrix
– Accuracy
– Precision
– Recall
– F-Score
– AUC-ROC (Area Under the ROC Curve)

(Photo/Content source: [Link])

Confusion Matrix
• A confusion matrix is a tabular representation of the prediction outcomes of a classifier (here, a binary one). It describes the performance of the classification model on a set of test data for which the true values are known.

– True Positive (TP): The number of times the model predicts positive and the actual value is positive. You predicted a positive value, and it is actually positive.
– False Positive (FP): The number of times the model wrongly predicts negative values as positive. You predicted a positive value, and it is actually negative.
– True Negative (TN): The number of times the model predicts negative and the actual value is negative. You predicted a negative value, and it is actually negative.
– False Negative (FN): The number of times the model wrongly predicts positive values as negative. You predicted a negative value, and it is actually positive.

(Content source: [Link])
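The four cells of a binary confusion matrix can be counted directly from the true and predicted labels. The following is a minimal sketch: the function name `confusion_counts` and the label lists are made up for illustration and are not part of the original slides.

```python
from collections import Counter

def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, TN, FN) for a binary problem."""
    counts = Counter()
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            # Predicted positive: correct if the actual label is positive.
            counts["TP" if actual == positive else "FP"] += 1
        else:
            # Predicted negative: a miss if the actual label is positive.
            counts["FN" if actual == positive else "TN"] += 1
    return counts["TP"], counts["FP"], counts["TN"], counts["FN"]

# Hypothetical labels: 1 = positive, 0 = negative.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print(f"TP={tp} FP={fp} TN={tn} FN={fn}")
```

Every other classification metric in these slides can be derived from these four counts.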
Confusion Matrix Example
• Consider a confusion matrix made for a classifier that classifies
people based on whether they speak English or Spanish.

• Just from looking at the matrix, the performance of our model is not very clear.

(Content source: [Link])
Accuracy
The accuracy metric is one of the simplest classification metrics to implement. It is the ratio of the number of correct predictions to the total number of predictions.

In the previous example:

Accuracy = (86 + 79) / (86 + 79 + 12 + 10) = 0.8824 = 88.24%

It measures how many of all the predictions were made correctly.

(Photo/Content source: [Link])
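The accuracy calculation can be checked with a few lines of Python. This is a sketch that assumes the counts in the formula map to TP=86, TN=79, FP=12 and FN=10, which is how the precision and recall formulas on the later slides use them; the function name is illustrative.

```python
# Counts read off the worked example (the TP/TN/FP/FN assignment is inferred
# from the accuracy, precision and recall formulas in the slides).
tp, tn, fp, fn = 86, 79, 12, 10

def accuracy(tp, tn, fp, fn):
    """Correct predictions divided by all predictions."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0

acc = accuracy(tp, tn, fp, fn)
print(f"Accuracy = {acc:.4f} ({acc:.2%})")
```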

To Use or Not to Use Accuracy
When Accuracy Works Well:

• Balanced datasets where class distribution is relatively even. Accuracy provides a good overall
measure of performance.
• Simple classification tasks with few labels/classes that are well represented. Accuracy can effectively
compare basic models.
• Binary classification problems where positive and negative classes are defined appropriately.
• Multi-class problems where all classes are equally important to model.
• Model selection during cross-validation when classes are balanced. Accuracy can pick top performers.

When Other Metrics are Better:

• Imbalanced datasets where accuracy is misleading due to underrepresented classes. Use F1, precision, recall, etc.
• The costs of errors vary significantly between classes. Metrics such as precision, recall, and F1 make the individual error types visible.
• Multi-label classification where each sample has multiple labels. Accuracy has limitations.
• Graded relevance, such as in information retrieval. Ranking metrics like Average Precision are better.
• Probabilistic predictions where confidence scores matter. Log-loss and Brier score are more appropriate.
• Fraud detection where false positives and false negatives have different implications.

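To see why accuracy can mislead on imbalanced data, consider a sketch with a hypothetical test set of 95 negatives and 5 positives and a trivial "model" that always predicts the majority class. The data and variable names are invented for illustration.

```python
# Hypothetical imbalanced test set: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5
# A trivial "model" that always predicts the majority (negative) class.
y_pred = [0] * 100

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall on the positive class: how many actual positives were found.
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_pos = sum(t == 1 for t in y_true)
recall = true_pos / actual_pos

print(f"accuracy={accuracy:.2f}  recall={recall:.2f}")
```

The accuracy looks high even though the model never finds a single positive case, which is what recall exposes.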
Precision
Precision measures how many of the model's positive predictions are correct. It is the number of true positives divided by the total number of predicted positive values (true positives plus false positives).

In the previous example:

Precision = 86 / (86 + 12) = 0.8776 = 87.76%

Precision is an important evaluation metric in situations where minimizing false positives is the priority.
For example, in medical testing, false positive diagnoses can cause undue stress and lead to unnecessary procedures, so precision is valued.
(Photo/Content source: [Link])
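Precision can also be computed from label lists rather than from pre-counted cells. The sketch below rebuilds label lists that match the example's counts (assumed to be TP=86, FP=12, FN=10, TN=79); the function name and the zero-division convention are choices made here, not something the slides specify.

```python
def precision(y_true, y_pred, positive=1):
    """TP / (TP + FP): the fraction of predicted positives that are correct."""
    predicted_pos = [t for t, p in zip(y_true, y_pred) if p == positive]
    if not predicted_pos:
        return 0.0  # convention chosen here when nothing is predicted positive
    return sum(t == positive for t in predicted_pos) / len(predicted_pos)

# Labels rebuilt from the example's counts: 86 TP, 12 FP, 10 FN, 79 TN.
y_true = [1] * 86 + [0] * 12 + [1] * 10 + [0] * 79
y_pred = [1] * 86 + [1] * 12 + [0] * 10 + [0] * 79
p = precision(y_true, y_pred)
print(f"Precision = {p:.4f}")
```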
Recall (Sensitivity)
• Recall measures the model's ability to find the actual positive values: "How many of the actual positives does the model predict as positive?"
• It is the true positives divided by the total number of actual positive values (true positives plus false negatives).

In the previous example:

Recall = 86 / (86 + 10) = 0.8958 = 89.58%

• Recall is also an important evaluation metric, useful in situations where capturing all relevant instances (true positives) is the priority even if it results in some false positives. For example, in medical testing, it is important to identify all patients with a disease, even if some healthy ones test positive.

• In other words, high recall means few false negatives, while high precision means few false positives.

• Precision and recall often trade off: improving one may lower the other.
(Photo/Content source: [Link])
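The trade-off between precision and recall shows up when the decision threshold on a model's scores is moved. The following sketch uses hypothetical labels and scores, and the helper name `precision_recall_at` is invented for illustration.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when score >= threshold is labeled positive."""
    tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
    fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
    fn = sum(1 for t, s in zip(y_true, scores) if s < threshold and t == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical labels and model scores.
y_true = [1, 1, 0, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]

strict = precision_recall_at(y_true, scores, 0.75)   # few, confident positives
lenient = precision_recall_at(y_true, scores, 0.25)  # many positives
print("strict :", strict)
print("lenient:", lenient)
```

With the strict threshold every predicted positive is correct but half of the actual positives are missed; with the lenient threshold all positives are found at the cost of some false alarms.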
F1-Score
• It is the harmonic mean of Recall and Precision. It is useful when you need to take both Precision and Recall into account.

In the previous example:

F1-Score = (2 * 0.8776 * 0.8958) / (0.8776 + 0.8958) = 0.8866 = 88.66% (with Precision = 86/98 = 0.8776 and Recall = 86/96 = 0.8958)

• The F1 score provides a balance between precision and recall. When the costs of false positives and false negatives are comparable, optimizing F1 helps build models that keep both kinds of error low.

(Photo/Content source: [Link])
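A short sketch of the F1 calculation, using counts assumed from the worked example (TP=86, FP=12, FN=10). The second call is a made-up case showing how the harmonic mean is pulled toward the smaller of the two values.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Counts inferred from the example: TP=86, FP=12, FN=10.
tp, fp, fn = 86, 12, 10
p, r = tp / (tp + fp), tp / (tp + fn)
f1 = f1_score(p, r)
print(f"F1 = {f1:.4f}")

# The harmonic mean is dominated by the smaller value,
# unlike the arithmetic mean.
lopsided = f1_score(1.0, 0.1)
print(f"F1(1.0, 0.1) = {lopsided:.4f} vs arithmetic mean {(1.0 + 0.1) / 2:.2f}")
```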

To Use or Not to Use F1 Score
• When there is an uneven class distribution, accuracy is misleading. F1 focuses on the positive class and is less distorted by the imbalance.
• When both false positives and false negatives have similar costs. Optimizing F1 balances both errors.
• Classification problems where retrieving positive cases and avoiding false alarms have equal priority.
• Screening tests where false negatives and false positives should both be minimized.
• Information retrieval where comprehensive search and precise results are both needed.
• Fraud detection where all frauds must be caught and false accusations avoided.
• Classifier selection on imbalanced datasets, to pick the best precision-recall tradeoff.
• Model selection during cross-validation, to account for both precision and recall.
• As an alternative to accuracy for uneven or skewed class distributions.
• Multi-class classification where each class is considered equally important.

Receiver Operating Characteristic (ROC) curve
• An ROC curve, or receiver operating characteristic curve, is a graph that shows how well a classification model performs across all classification thresholds. It helps us see how the model's decisions change at different levels of certainty.
• The curve plots two quantities against each other: the true positive rate (TPR), how often the model correctly identifies positive cases, on the vertical axis, and the false positive rate (FPR), how often it mistakenly identifies negative cases as positive, on the horizontal axis.

• By looking at this graph, we can understand how good the model is and choose the threshold that gives us the right balance between correct and incorrect predictions.

(Photo/Content source: [Link])
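The points of an ROC curve can be generated by sweeping a threshold over the model's scores and recording the false positive rate and true positive rate at each step. This is a minimal sketch with hypothetical labels and scores; `roc_points` is an illustrative name, and the code assumes both classes are present in `y_true`.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, one per threshold, from strictest to most lenient.

    Assumes y_true contains at least one positive (1) and one negative (0).
    """
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / neg, tp / pos))
    return points

# Hypothetical labels and model scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
pts = roc_points(y_true, scores)
for fpr, tpr in pts:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Plotting these pairs with FPR on the horizontal axis and TPR on the vertical axis gives the curve; it always runs from (0, 0) to (1, 1).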

What Makes A Good ROC Curve?
• The ROC curve is valuable not only because it gives us an overview of our model's performance but also because it gives us an easy visual way to compare the performance of different classifiers with one another.

• A perfect classifier's curve hugs the left edge and the top of the chart. This is expected, as 'perfect' here means the classifier reaches TPR=1 while the FPR is still 0.

• On the other hand, a diagonal line implies that TPR=FPR for every classification threshold. In other words, the classifier is just making random guesses. The model is garbage!

(Photo/Content source: [Link])
Model Comparison using ROC
• When it comes to comparing models, the rule of thumb is that curves that fall above the ROC curve of a random classifier (the diagonal line) are good or decent. The higher up they are, the better.

• Anything below the diagonal line performs worse than random guessing, so it likely isn't worth any consideration.

• Obtaining a perfect or exactly random result likely indicates a problem. An exactly random result may indicate that your problem is not well-framed and may not have a solution without additional data.

(Photo/Content source: [Link])
AUC: Area Under the Curve
• AUC (sometimes written AUROC) is just the area underneath the entire ROC curve.
• AUC provides us with a nice, single measure of performance for our classifiers, independent of the exact classification threshold chosen.

• This allows us to compare models to each other without even looking at their ROC curves.

• AUC ranges in value from 0 to 1, with higher numbers indicating better performance. A perfect classifier will have an AUC of 1, while a perfectly random classifier will have an AUC of 0.5.

• A model that always scores a negative sample as more likely to be positive than a positive sample will have an AUC of 0, indicating severe failure on the modeling side. Scores above 0.5 indicate better-than-random performance, with values closer to 1 being better, while anything under 0.5 is worse than random guessing.

• At 0.73, our model's AUC isn't too shabby.

(Photo/Content source: [Link])
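An equivalent way to read AUC is as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes AUC through that pairwise ranking view rather than by integrating the curve; the data and the function name are hypothetical.

```python
def auc_by_ranking(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Assumes both classes are present.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and model scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

auc = auc_by_ranking(y_true, scores)
reversed_auc = auc_by_ranking(y_true, [1 - s for s in scores])  # ranking flipped
constant_auc = auc_by_ranking(y_true, [0.5] * len(y_true))      # uninformative scores
print(auc, reversed_auc, constant_auc)
```

Flipping the ranking mirrors the AUC around 0.5, and scores that carry no information land exactly on 0.5, matching the interpretation on the slide.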
Metrics for Regression
• Mean Absolute Error
• Mean Squared Error
• R2 Score
• Adjusted R2

(Photo/Content source: [Link])

Mean Absolute Error (MAE)
• Mean Absolute Error, or MAE, is one of the simplest metrics. It measures the average absolute difference between actual and predicted values, where absolute means ignoring the sign of the difference.

• To understand MAE, let's take the example of linear regression, where the model draws a best-fit line between the dependent and independent variables. To measure the error in a prediction, we calculate the difference between the actual value and the predicted value. To get the error for the complete dataset, we take the mean of the absolute differences over all samples.

• MAE is more robust to outliers than squared-error metrics. One limitation of MAE is that it is not differentiable at zero, which complicates gradient-based optimizers such as Gradient Descent. To overcome this limitation, another metric can be used: Mean Squared Error, or MSE.

(Photo/Content source: [Link])
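In code, MAE is the mean of |actual - predicted| over all samples. A minimal sketch with made-up values follows; the function name is illustrative.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |actual - predicted| over all samples."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical actual and predicted values.
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]
mae = mean_absolute_error(y_true, y_pred)
print(f"MAE = {mae}")
```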

Mean Squared Error
• Mean Squared Error, or MSE, is one of the most commonly used metrics for regression evaluation. It measures the average of the squared differences between the actual values and the values predicted by the model.

• Since errors are squared in MSE, it only takes non-negative values; it is zero only when every prediction is exact, so in practice it is positive.

• Moreover, because the differences are squared, large errors are penalized far more heavily than small ones, so a few outliers can lead to over-estimation of how bad the model is.

• MSE is often preferred over other regression metrics because it is differentiable everywhere and therefore easier to optimize.

(Photo/Content source: [Link])
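The effect of squaring can be seen by comparing MSE and MAE on the same predictions with and without a single large error. The values below are hypothetical and the function names are illustrative.

```python
def mean_squared_error(y_true, y_pred):
    """Average of (actual - predicted)^2 over all samples."""
    return sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / len(y_true)

def mean_absolute_error(y_true, y_pred):
    """Average of |actual - predicted| over all samples."""
    return sum(abs(a - p) for a, p in zip(y_true, y_pred)) / len(y_true)

y_true = [3.0, 5.0, 2.0, 7.0]
close = [2.5, 5.0, 4.0, 8.0]          # all errors are small
with_outlier = [2.5, 5.0, 4.0, 17.0]  # one prediction is off by 10

for name, y_pred in [("close", close), ("with outlier", with_outlier)]:
    mse = mean_squared_error(y_true, y_pred)
    mae = mean_absolute_error(y_true, y_pred)
    print(f"{name}: MSE={mse:.4f}  MAE={mae:.4f}")
```

One bad prediction raises MAE by a factor of a few, while MSE grows by well over an order of magnitude.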

R Squared Score
• The R squared score is also known as the Coefficient of Determination, and it is another popular metric for regression model evaluation.
• The R-squared metric enables us to compare our model with a constant baseline to determine the performance of the model.
• To select the constant baseline, we take the mean of the target values and draw a horizontal line at that mean.
• The R squared score is always less than or equal to 1, regardless of the scale of the target values. It is 0 for a model that does no better than the mean baseline and negative for one that does worse.
• R-squared is easy to understand, as it represents the proportion of the total variation in the data that the model can explain. For example, an R-squared value of 0.8 indicates that 80% of the variation in the dependent variable can be explained by the independent variables in the model.

(Photo/Content source: [Link])
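The comparison with the mean baseline can be written as R² = 1 - SS_res / SS_tot, where SS_res is the sum of squared errors of the model and SS_tot is the sum of squared errors of the mean line. The sketch below uses hypothetical values and an illustrative function name, and assumes the target values are not all identical (otherwise SS_tot is zero).

```python
def r2_score(y_true, y_pred):
    """1 - SS_res / SS_tot, with the mean of y_true as the constant baseline.

    Assumes y_true is not constant, so SS_tot is non-zero.
    """
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((a - p) ** 2 for a, p in zip(y_true, y_pred))
    ss_tot = sum((a - mean_y) ** 2 for a in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical actual and predicted values.
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]
r2 = r2_score(y_true, y_pred)
print(f"R^2 = {r2:.4f}")

# Predicting the mean everywhere is the constant baseline.
baseline = [sum(y_true) / len(y_true)] * len(y_true)
r2_baseline = r2_score(y_true, baseline)
print(f"R^2 of the mean baseline = {r2_baseline}")
```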

Interpretation of the R Squared Score

(Photo/Content source: [Link])
Adjusted R Squared
• Adjusted R squared, as the name suggests, is a refinement of the R squared score. Plain R² has a limitation: it never decreases when more terms are added, even if the model is not really improving, which may mislead data scientists.

• To overcome this issue, adjusted R squared is used. It is never higher than R², because it penalizes each additional predictor and only increases when a new predictor improves the model more than would be expected by chance.

Ra² = 1 − (1 − R²)(n − 1) / (n − k − 1)

• Here, n is the number of observations, k denotes the number of independent variables, and Ra² denotes the adjusted R².
Recommended Read: [Link]

(Photo/Content source: [Link])
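The penalty for extra predictors can be illustrated with the standard adjusted R² formula, 1 - (1 - R²)(n - 1)/(n - k - 1). The numbers below are a hypothetical scenario, not results from the slides, and the function name is illustrative.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1).

    n: number of observations, k: number of independent variables.
    """
    if n - k - 1 <= 0:
        raise ValueError("need n > k + 1 observations")
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical case: R^2 creeps up from 0.800 to 0.802 after adding
# five more predictors to a model fit on 50 observations.
small = adjusted_r2(0.800, n=50, k=3)
large = adjusted_r2(0.802, n=50, k=8)
print(f"k=3: adjusted R^2 = {small:.4f}")
print(f"k=8: adjusted R^2 = {large:.4f}")
```

Although plain R² went up slightly, the adjusted score goes down, signalling that the extra predictors did not earn their place.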
