Machine Learning Performance Metrics Guide

Uploaded by

Ayan Khajuria

Metrics

Rajeev Kumar

Department of Computer Science & Engineering


Delhi Technological University
E-mail: rajeevkumar@[Link]
Objective

To understand the different metrics used to evaluate the performance of ML models.

Acknowledgement
• Most of the material is taken from the Web, and links to the sources are placed on the
slides themselves.
• I am sure I have been influenced by, and have borrowed ideas from, other sources, and I
apologize if I have failed to acknowledge any of them.

Performance Metrics in Machine Learning
• Evaluating the performance of a machine learning model is an important step in
building an effective ML model.
• To evaluate the performance or quality of the model, different metrics are used, and these
metrics are known as performance metrics or evaluation metrics.
• These performance metrics help us understand how well our model has performed for the
given data.
• In this way, we can improve the model's performance by tuning the hyper-parameters.
• Each ML model aims to generalize well on unseen/new data, and performance metrics help
determine how well the model generalizes on the new dataset.

(Photo/Content source: [Link])


Metrics for Classification
• In a classification problem, the categories or classes of the data are
identified based on training data.
• The model learns from the given dataset and then classifies the
new data into classes or groups based on the training.
• It predicts class labels as the output, such as Yes or No, 0 or 1,
Spam or Not Spam, etc.
– Confusion Matrix
– Accuracy
– Precision
– Recall
– F-Score
– AUC-ROC (Area Under the ROC Curve)

(Photo/Content source: [Link])


Confusion Matrix
• A confusion matrix is a tabular representation of prediction outcomes of any binary
classifier, which is used to describe the performance of the classification model on a set
of test data when true values are known.

– True Positive: The number of times our actual positive values are equal to the predicted positive. You predicted a
positive value, and it is actually positive.
– False Positive: The number of times our model wrongly predicts negative values as positives. You predicted a
positive value, and it is actually negative.
– True Negative: The number of times our actual negative values are equal to predicted negative values. You
predicted a negative value, and it is actually negative.
– False Negative: The number of times our model wrongly predicts positive values as negatives. You predicted a
negative value, and it is actually positive.
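
The four counts above can be tallied directly from a list of true labels and a list of predicted labels. The following is a minimal sketch in plain Python; the function name `confusion_counts` and the toy labels are made up for illustration and are not taken from the slides.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for a binary classification problem."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if actual == positive:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, actually negative
    return tp, fp, tn, fn


# Hypothetical toy labels (1 = positive, 0 = negative).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print(f"TP={tp} FP={fp} TN={tn} FN={fn}")
```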

(Content source: [Link])
Confusion Matrix Example
• Consider a confusion matrix made for a classifier that classifies
people based on whether they speak English or Spanish.

• Just from looking at the matrix, the performance of our model is not very clear.

(Content source: [Link])
Accuracy
The accuracy metric is one of the simplest Classification
metrics to implement. It is the ratio of the number of
correct predictions to the total number of predictions.

In the previous example:

Accuracy = (86 + 79) / (86 + 79 + 12 + 10) = 0.8824 = 88.24%

It measures how many predictions are correctly made out of all.
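
As a quick check of the arithmetic, here is a small Python sketch. It assumes the four counts implied by the formulas on these slides (86 true positives, 79 true negatives, 12 false positives, 10 false negatives); the function name `accuracy` is only illustrative.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


# Counts assumed from the worked example on the slides.
acc = accuracy(tp=86, tn=79, fp=12, fn=10)
print(f"Accuracy = {acc:.4f}")
```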

(Photo/Content source: [Link])


To Use or Not to Use Accuracy
When Accuracy Works Well:

• Balanced datasets where class distribution is relatively even. Accuracy provides a good overall
measure of performance.
• Simple classification tasks with few labels/classes that are well represented. Accuracy can effectively
compare basic models.
• Binary classification problems where positive and negative classes are defined appropriately.
• Multi-class problems where all classes are equally important to model.
• Model selection during cross-validation when classes are balanced. Accuracy can pick top performers.

When Other Metrics are Better:

• Imbalanced datasets where accuracy is misleading due to underrepresented classes. Use F1, precision,
recall etc.
• The cost of errors varies significantly between classes. Metrics like F1 help account for both error types.
• Multi-label classification where each sample has multiple labels. Accuracy has limitations.
• Graded relevance such as information retrieval. Ranking metrics like Average Precision are better.
• Probabilistic predictions where confidence scores matter. Log-loss and Brier score are more
appropriate.
• Fraud detection where false positives and false negatives have different implications.
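
To illustrate why accuracy can mislead on imbalanced data, the sketch below uses a made-up dataset of 95 negatives and 5 positives and a degenerate "model" that always predicts the majority class. The data and variable names are hypothetical.

```python
# Hypothetical imbalanced data: 95 negatives (0) and 5 positives (1).
y_true = [0] * 95 + [1] * 5
# A degenerate "model" that always predicts the majority class.
y_pred = [0] * 100

correct = sum(1 for a, p in zip(y_true, y_pred) if a == p)
accuracy = correct / len(y_true)

# Recall: share of the actual positives that the model finds.
true_positives = sum(1 for a, p in zip(y_true, y_pred) if a == 1 and p == 1)
actual_positives = sum(y_true)
recall = true_positives / actual_positives

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
```

The accuracy looks high even though the model never identifies a single positive case, which is what the other metrics are meant to expose.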

Precision
Precision is used to calculate the model's ability to classify
positive values correctly. It is the true positives divided by
the total number of predicted positive values.

In the previous example:

Precision = 86 / (86 + 12) = 0.8776 = 87.76%

Precision is an important evaluation metric that is useful in situations where
minimizing false positives is the priority.
For example, in medical testing, false positive diagnoses can cause undue
stress and lead to unnecessary procedures, so precision is valued.
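
A minimal Python sketch of the precision calculation, assuming the counts used in the example (86 true positives, 12 false positives). The function name `precision` is illustrative.

```python
def precision(tp, fp):
    """True positives divided by all predicted positives."""
    predicted_positive = tp + fp
    return tp / predicted_positive if predicted_positive else 0.0


# Counts assumed from the worked example on the slides.
prec = precision(tp=86, fp=12)
print(f"Precision = {prec:.4f}")
```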
(Photo/Content source: [Link])
Recall (Sensitivity)
• Recall measures the model's ability to find the actual positive values: "Of all the actual
positives, how many does the model predict as positive?"
• It is the true positives divided by the total number of actual positive values.

In the previous example:

Recall = 86 / (86 + 10) = 0.8958 = 89.58%

• Recall is also an important evaluation metric. It is useful in situations where capturing all
relevant instances (true positives) is the priority, even if it results in some false positives.
For example, in medical testing, it is important to identify all patients with a disease, even
if some healthy ones test positive.

• In other words, high recall minimizes false negatives, while high precision minimizes false
positives.

• Precision and recall often trade off: improving one may lower the other.
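
A minimal Python sketch of the recall calculation, assuming the counts used in the example (86 true positives, 10 false negatives). The function name `recall` is illustrative.

```python
def recall(tp, fn):
    """True positives divided by all actual positives."""
    actual_positive = tp + fn
    return tp / actual_positive if actual_positive else 0.0


# Counts assumed from the worked example on the slides.
rec = recall(tp=86, fn=10)
print(f"Recall = {rec:.4f}")
```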
(Photo/Content source: [Link])
F1-Score
• It is the harmonic mean of Recall and Precision. It is
useful when you need to take both Precision and Recall
into account.

In the previous example:

F1-Score = (2 * 0.8776 * 0.8958) / (0.8776 + 0.8958) = 0.8866 = 88.66%

• F1 score provides a balance between precision and recall. When the costs of false
positives and false negatives are comparable, optimizing F1 helps build models
that minimize both errors simultaneously.
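
The harmonic mean can be sketched in Python as follows, assuming the same counts as the example (86 true positives, 12 false positives, 10 false negatives). The function name `f1_score_from_counts` is illustrative.

```python
def f1_score_from_counts(tp, fp, fn):
    """Harmonic mean of precision and recall, computed from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Counts assumed from the worked example on the slides.
f1 = f1_score_from_counts(tp=86, fp=12, fn=10)
print(f"F1 = {f1:.4f}")
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a model cannot reach a high F1 by maximizing only precision or only recall.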

(Photo/Content source: [Link])


To Use or Not to Use F1 Score
• When there is an uneven class distribution, accuracy is misleading. F1 accounts
for the class imbalance.
• When both false positives and false negatives have similar costs. Optimizing F1
balances both errors.
• Classification problems where retrieving positive cases and avoiding false alarms
have equal priority.
• Screening tests where false negatives and false positives should be minimized.
• Information retrieval where comprehensive search and precise results are both
needed.
• Fraud detection where all frauds must be caught and false accusations avoided.
• Classifier selection on imbalanced datasets, to pick the optimal precision-recall
tradeoff.
• Model selection during cross validation, to account for precision and recall.
• As an alternative to accuracy for uneven or skewed class distributions.
• Multi-class classification where each class is considered equally important.

Receiver Operating Characteristic (ROC) Curve
• An ROC curve, or receiver operating characteristic curve, is a graph
that shows how well a classification model performs. It helps us see how
the model behaves at different classification thresholds.
• The curve plots two quantities against each other: on the y-axis, how often
the model correctly identifies positive cases (true positive rate), and on the
x-axis, how often it mistakenly identifies negative cases as positive (false
positive rate).

• By looking at this graph, we can understand how good the model is and
choose the threshold that gives us the right balance between correct and
incorrect predictions.
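
One way to see where the points of an ROC curve come from is to sweep a threshold over the model's scores and record the false positive rate and true positive rate at each step. The sketch below does this in plain Python; the function name `roc_points`, the labels, and the scores are made up for illustration, and it assumes both classes are present in the labels.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) points, one per distinct threshold, starting at (0, 0).

    Assumes binary labels (1 = positive, 0 = negative) with at least one
    sample of each class.
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing is predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


# Hypothetical labels and model scores (higher score = more likely positive).
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(y_true, scores)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Lowering the threshold moves along the curve from (0, 0) toward (1, 1): each step either catches another positive (TPR rises) or lets in another false alarm (FPR rises).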

(Photo/Content source: [Link])


What Makes A Good ROC Curve?
• The ROC Curve is valuable not only because it
gives us an overview of our model's performance
but because it also gives us an easy visual to
compare the performance of different classifiers
to one another.

• A perfect classifier is one whose curve hugs the
left and top edges of the chart. This is expected, as
‘perfect’ here implies the classifier can reach
TPR=1 while the FPR is still 0.

• On the other hand, a diagonal line implies that
TPR=FPR for every classification threshold. In
other words, the classifier is just making random
guesses, and the model is of no use.

(Photo/Content source: [Link])
Model Comparison using ROC
• When it comes to comparing models, the rule of
thumb is that curves that fall above the ROC
Curve of a random classifier (the diagonal line)
are good or decent. The higher up they are, the
better.

• Anything below the diagonal line has worse
performance than random guessing, so likely isn't
worth any consideration.

• Obtaining a perfect or exactly random result
likely indicates a problem. An exactly random
result may indicate that your problem is not
well-framed, and may not have a solution without
additional data.

(Photo/Content source: [Link])
AUC: Area Under the Curve
• AUC (sometimes written AUROC) is just the area underneath
the entire ROC curve.
• AUC provides us with a nice, single measure of performance
for our classifiers, independent of the exact classification
threshold chosen.

• This allows us to compare models to each other without even
looking at their ROC curves.

• AUC ranges in value from 0 to 1, with higher numbers
indicating better performance. A perfect classifier will have an
AUC of 1, while a perfectly random classifier will have an AUC of 0.5.

• A model that always predicts that a negative sample is more
likely to have a positive label than a positive sample will have an
AUC of 0, indicating severe failure on the modeling side.
Scores above 0.5 indicate better-than-random performance, while
anything under 0.5 indicates performance worse than random guessing.

• At 0.73, our model's AUC isn't too shabby.
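
AUC can also be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes it that way by comparing every positive-negative pair, counting ties as half. This is one equivalent formulation, not necessarily how the slides' figure was produced; the function name `auc_by_pairs` and the data are hypothetical.

```python
def auc_by_pairs(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct ranking. Assumes both classes are present.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels and model scores (higher score = more likely positive).
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

auc = auc_by_pairs(y_true, scores)
perfect = auc_by_pairs([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1])
inverted = auc_by_pairs([0, 0, 1, 1], [0.9, 0.8, 0.3, 0.1])
print(f"AUC={auc:.4f}, perfect={perfect:.1f}, inverted={inverted:.1f}")
```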

(Photo/Content source: [Link])
Metrics for Regression
• Mean Absolute Error
• Mean Squared Error
• R² Score
• Adjusted R²

(Photo/Content source: [Link])


Mean Absolute Error (MAE)
• Mean Absolute Error, or MAE, is one of the simplest metrics. It measures the
average absolute difference between actual and predicted values, where absolute
means ignoring the sign of each difference.

• To understand MAE, let's take an example of Linear Regression, where the model
draws a best-fit line between the dependent and independent variables. For each
data point, the error is the difference between the actual value and the predicted
value. To get a single number for the complete dataset, we take the mean of the
absolute errors.

• MAE is more robust to outliers than squared-error metrics. One of the limitations
of MAE is that it is not differentiable at zero error, which makes it less convenient
for gradient-based optimizers such as Gradient Descent. To overcome this
limitation, another metric can be used, which is Mean Squared Error or MSE.
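
The calculation can be sketched in a few lines of Python. The function name `mean_absolute_error` and the sample values are made up for illustration.

```python
def mean_absolute_error(y_true, y_pred):
    """Mean of the absolute differences between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical actual and predicted values.
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

mae = mean_absolute_error(y_true, y_pred)
print(f"MAE = {mae}")
```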

(Photo/Content source: [Link])


Mean Squared Error
• Mean Squared Error, or MSE, is one of the most widely used metrics for Regression
evaluation. It measures the average of the squared differences between the values
predicted by the model and the actual values.

• Since the errors are squared, MSE is never negative, and it is zero only when
every prediction is exact.

• Moreover, because the differences are squared, large errors are penalized much more
heavily than small ones, so a few outliers can make the model look worse than it
typically is.

• MSE is often preferred over other regression metrics because it is
differentiable and hence easier to optimize.
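
The sketch below computes MSE in plain Python and, for comparison, shows how a single badly predicted point affects MSE far more than MAE. The function names and all values are hypothetical.

```python
def mean_squared_error(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Mean of the absolute differences, included here only for comparison."""
    return sum(abs(a - p) for a, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical actual and predicted values.
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]
mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)

# Add one outlier: actual 10.0, predicted 20.0 (an error of 10).
mse_outlier = mean_squared_error(y_true + [10.0], y_pred + [20.0])
mae_outlier = mean_absolute_error(y_true + [10.0], y_pred + [20.0])

print(f"without outlier: MAE={mae:.3f}  MSE={mse:.3f}")
print(f"with outlier:    MAE={mae_outlier:.3f}  MSE={mse_outlier:.3f}")
```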

(Photo/Content source: [Link])


R Squared Score
• The R squared score, also known as the Coefficient of Determination, is another
popular metric used for Regression model evaluation.
• The R-squared metric enables us to compare our model with a constant baseline
to determine the performance of the model.
• The constant baseline is the mean of the target values, drawn as a horizontal
line at the mean.
• The R squared score is always less than or equal to 1, regardless of the scale of
the values. It is 0 for a model that does no better than the mean baseline and
negative for a model that does worse.
• R-squared is easy to understand, as it represents the proportion of the total
variation in the data that the model can explain. For example, an R-squared value
of 0.8 indicates that 80% of the variation in the dependent variable can be
explained by the independent variables in the model.
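
The comparison against the mean baseline can be sketched as one minus the ratio of the model's squared error to the baseline's squared error. The function name `r_squared` and the sample values below are made up for illustration, and the sketch assumes the actual values are not all identical.

```python
def r_squared(y_true, y_pred):
    """1 - (model's squared error) / (mean baseline's squared error).

    Assumes y_true is not constant, so the baseline error is non-zero.
    """
    mean = sum(y_true) / len(y_true)
    ss_res = sum((a - p) ** 2 for a, p in zip(y_true, y_pred))  # model error
    ss_tot = sum((a - mean) ** 2 for a in y_true)               # baseline error
    return 1 - ss_res / ss_tot


# Hypothetical actual and predicted values.
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

r2 = r_squared(y_true, y_pred)
r2_baseline = r_squared(y_true, [sum(y_true) / len(y_true)] * len(y_true))
r2_bad = r_squared(y_true, [7.0, 2.5, 5.0, 3.0])
print(f"model={r2:.4f}  mean baseline={r2_baseline:.1f}  bad model={r2_bad:.4f}")
```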

Recommended Read: [Link]


[Link]#:~:text=6%20See%20Also-,Definition,line%20approximates%20the%20actual%20data.

(Photo/Content source: [Link])


Interpretation of the R Squared Score

(Photo/Content source: [Link]
[Link]#:~:text=6%20See%20Also-,Definition,line%20approximates%20the%20actual%20data.)
Adjusted R Squared
• Adjusted R squared, as the name suggests, is a refinement of the R squared score. R squared
has a limitation: its value does not decrease when more terms are added, even if the model is
not really improving, which may mislead data scientists.

• To overcome this issue, adjusted R squared is used. It is never higher than R², because it
penalizes each additional predictor and increases only when a new predictor brings a real
improvement.

• Ra2 = 1 - (1 - R²)(n - 1) / (n - k - 1)
• Here, n is the number of observations, k denotes the number of independent variables, and Ra2
denotes the adjusted R2.
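
The sketch below assumes the standard textbook form of the adjustment, 1 - (1 - R²)(n - 1) / (n - k - 1), with n and k as defined above. The function name `adjusted_r_squared` and the example numbers are hypothetical.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R², assuming the standard formula 1 - (1 - R²)(n - 1)/(n - k - 1).

    r2: ordinary R² score
    n:  number of observations
    k:  number of independent variables
    """
    if n - k - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Hypothetical case: the same R² of 0.8 reported with 5 and with 10 predictors.
adj_5 = adjusted_r_squared(r2=0.8, n=50, k=5)
adj_10 = adjusted_r_squared(r2=0.8, n=50, k=10)
print(f"k=5: {adj_5:.4f}   k=10: {adj_10:.4f}")
```

With R² held fixed, adding predictors lowers the adjusted score, which is how the metric discourages terms that do not improve the fit.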
Recommended Read: [Link]

(Photo/Content source: [Link])
