Model Evaluation Metrics Explained

The document provides an overview of various model evaluation metrics used in regression and classification, including Total Sum of Squares (TSS), Regression Sum of Squares (RSS), Error Sum of Squares (ESS), R-squared (R²), Adjusted R-squared, Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and Mean Absolute Percentage Error (MAPE). It also discusses accuracy, confusion matrix, precision, recall, F1 score, and their interpretations, use cases, and limitations. Each metric is explained with examples, highlighting when to use or avoid them based on the context of model evaluation.

Uploaded by

Hari Krishu

Model Evaluation & Tuning

🔢 Total Sum of Squares (TSS)

❓What is it?
TSS measures the total variation in the actual (observed) data values: how
far, in squared terms, your data points lie from their mean.

Example:

Imagine a group of penguins trying to hit a dartboard while blindfolded. TSS
tells you how wildly all those darts are spread from the center,
regardless of whether they’re hitting the bullseye.

📊 Interpretation:
High TSS = data is spread out; Low TSS = data is clustered around the
mean.

Formula: the sum of the squared differences between each actual value and
the average of all actual values, $TSS = \sum_i (y_i - \bar{y})^2$.

🌍 Use Case:
Used in regression to compute how much variation the model can
potentially explain.

✅ Use When:
You want to evaluate the total variance in the response variable.

❌ Don’t Use When:

You need model performance alone; it doesn't reflect model accuracy
directly.
🔢 Regression Sum of Squares (RSS or SSR)
❓What is it?
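RSS is the part of TSS explained by the regression model. Naming varies
between sources: many textbooks use RSS for the residual sum of squares, but
in this document RSS always means the regression (explained) sum of squares.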

Example:

Back to the penguins: RSS tells you how many darts landed closer to
where the coach told them to aim. Coach = regression line.

📊 Interpretation:
Higher RSS = model explains more variance.

Formula: the sum of the squared differences between each predicted value and
the average of all actual values, $RSS = \sum_i (\hat{y}_i - \bar{y})^2$.

🌍 Use Case:
Used to calculate R² and understand how well the model fits the data.

✅ Use When:
You want to know how much of the total variance is captured by the model.

❌ Don’t Use When:

You're dealing with classification problems or non-linear behavior.
🔢 Error Sum of Squares (ESS or SSE)
❓What is it?
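The part of TSS that the model couldn’t explain: the leftovers, the mess.
Some sources use ESS for the explained sum of squares; in this document ESS
always means the error (residual) sum of squares.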

Example:

Penguins missed the mark? ESS is the sum of all those missed throws
compared to the predicted location.

📊 Interpretation:
Lower ESS = better model accuracy.

Formula: the sum of the squared differences between each actual value and its
corresponding predicted value, $ESS = \sum_i (y_i - \hat{y}_i)^2$.

🌍 Use Case:
Helps measure prediction error, also used in computing RMSE and R².

✅ Use When:
You’re evaluating how far off your predictions are from actual values.

❌ Don’t Use When:

You care about percent-based error or need normalized metrics.
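
The three sums of squares can be checked numerically. The sketch below is a minimal illustration using made-up data and two hypothetical helpers (`fit_line`, `sums_of_squares`); it follows this document's naming (RSS = regression, ESS = error). The identity TSS = RSS + ESS shown at the end holds for an ordinary least-squares fit with an intercept, and is not guaranteed for arbitrary predictions.

```python
def fit_line(x, y):
    """Ordinary least-squares line: returns (slope, intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def sums_of_squares(y_true, y_pred):
    """Return (TSS, RSS, ESS) using this document's naming."""
    mean_y = sum(y_true) / len(y_true)
    tss = sum((y - mean_y) ** 2 for y in y_true)             # actual vs. mean
    rss = sum((p - mean_y) ** 2 for p in y_pred)             # predicted vs. mean
    ess = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))  # actual vs. predicted
    return tss, rss, ess


# Made-up data, roughly following y = 2x
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

slope, intercept = fit_line(x, y)
y_hat = [slope * xi + intercept for xi in x]
tss, rss, ess = sums_of_squares(y, y_hat)
print(f"TSS={tss:.4f}  RSS={rss:.4f}  ESS={ess:.4f}  RSS+ESS={rss + ess:.4f}")
```

Because the data is nearly linear, almost all of TSS ends up in RSS and very little in ESS.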

🔢 R-squared (R²)
❓What is it?
R² tells you what percentage of the variance in the target variable is
explained by the model.
📊 Interpretation:
● R² = 0.8 means the model explains 80% of the variance: its squared error
is 80% lower than always guessing the mean.
● R² = 1 → “You’re a genius!”
● R² = 0 → “You’re just guessing...”
● R² < 0 → “Please stop. You’re worse than guessing.”

🌍 Use Case:
Used to assess linear regression performance.

✅ Use When:
Comparing linear regression models or explaining model strength.

❌ Don’t Use When:

You're comparing models with different feature counts; use Adjusted R²
instead.
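
As a rough sketch, R² can be computed as 1 − ESS/TSS, using the error and total sums of squares defined earlier. The helper name `r_squared` and the data are illustrative assumptions; the three calls reproduce the "perfect", "just guessing the mean", and "worse than guessing" cases from the interpretation list.

```python
def r_squared(y_true, y_pred):
    """R² = 1 - ESS / TSS (ESS = error sum of squares, TSS = total)."""
    mean_y = sum(y_true) / len(y_true)
    tss = sum((y - mean_y) ** 2 for y in y_true)
    ess = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return 1.0 - ess / tss


y = [1.0, 2.0, 3.0, 4.0, 5.0]

print(r_squared(y, y))                          # perfect predictions -> 1.0
print(r_squared(y, [3.0] * 5))                  # always predict the mean -> 0.0
print(r_squared(y, [5.0, 4.0, 3.0, 2.0, 1.0]))  # worse than the mean -> -3.0
```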
🔢 Adjusted R-squared
❓What is it?
R², but smarter: it penalizes the model for using too many features that
don’t help.

Example:

You add 10 penguins to help, but they just confuse things. Adjusted R²
says: “Nope, no participation trophies!”

📊 Interpretation:
If Adjusted R² decreases when adding a variable, it means the new variable
isn’t helpful.

🌍 Use Case:
Best for feature selection in regression.

✅ Use When:
You’re comparing regression models with different numbers of predictors.
❌ Don’t Use When:
You’re using non-linear models or want raw fit measure (R² is simpler then).
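
The penalty can be made concrete with the standard formula Adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the number of samples and p the number of predictors. The sketch below uses invented R² values to show a case where adding ten weak features nudges R² up but pulls Adjusted R² down; the function name is a hypothetical helper.

```python
def adjusted_r_squared(r2, n_samples, n_features):
    """Adjusted R² = 1 - (1 - R²) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Illustrative numbers only: 50 samples, R² barely improves with 10 extra features
base = adjusted_r_squared(0.800, n_samples=50, n_features=5)
padded = adjusted_r_squared(0.805, n_samples=50, n_features=15)

print(f"5 features:  R²=0.800  adjusted={base:.4f}")
print(f"15 features: R²=0.805  adjusted={padded:.4f}")
```

Here the adjusted value falls even though raw R² rose, which is the signal that the extra features are not earning their place.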

🔢 Mean Absolute Error (MAE)

❓What is it?
The average absolute difference between actual and predicted values.

Example:

It’s like penguins measuring their missed distances: 2 feet, 3 feet, and 5 feet.
MAE just says: “Your average miss was 3.33 feet.”

📊 Interpretation:
● MAE = 0 → Perfect prediction

● Higher MAE → More average error

🌍 Use Case:
Useful when you care equally about over- and under-predictions.

✅ Use When:
You want a straightforward, interpretable error measure.

❌ Don’t Use When:

You want to penalize larger errors more heavily (use MSE or RMSE
instead).
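
A minimal sketch of MAE, reproducing the penguin example (misses of 2, 3, and 5 feet). The target and throw values are made up so that the absolute errors match the example; `mean_absolute_error` here is a hand-written helper, not a library call.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |actual - predicted|."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


targets = [10.0, 10.0, 10.0]
throws = [12.0, 7.0, 15.0]   # misses of 2, 3 and 5 feet

mae = mean_absolute_error(targets, throws)
print(f"MAE = {mae:.2f} feet")   # MAE = 3.33 feet
```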
🔢 Mean Squared Error (MSE)
❓What is it?
The average of squared differences between actual and predicted
values.

Example:

Penguins not only care about the miss, they square it to exaggerate the
pain! A 10-meter miss hurts 100x more than a 1-meter miss.

📊 Interpretation:
● MSE = 0 → Perfect prediction

● Larger errors are punished more due to squaring

🌍 Use Case:
Great for detecting large prediction errors.

✅ Use When:
You want to penalize big mistakes more heavily.

❌ Don’t Use When:

You need easily interpretable error units (use RMSE or MAE).
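
To see how squaring changes the picture, the sketch below compares two invented sets of predictions with the same MAE: one makes four small errors, the other makes a single large one. Both helper functions are illustrative, hand-written implementations.

```python
def mean_absolute_error(y_true, y_pred):
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average of (actual - predicted)²."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


actual = [0.0, 0.0, 0.0, 0.0]
steady = [1.0, 1.0, 1.0, 1.0]   # four errors of size 1
spiky = [0.0, 0.0, 0.0, 4.0]    # one error of size 4

for name, pred in [("steady", steady), ("spiky", spiky)]:
    print(name,
          "MAE =", mean_absolute_error(actual, pred),
          "MSE =", mean_squared_error(actual, pred))
# steady MAE = 1.0 MSE = 1.0
# spiky MAE = 1.0 MSE = 4.0
```

MAE cannot tell the two apart, while MSE rates the single large miss four times worse.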

🔢 Root Mean Squared Error (RMSE)

❓What is it?
The square root of MSE, so the error is in the original units (like ₹, km,
etc.)
Example:

If MSE says penguins missed by 400 (squared units), RMSE says, “That's
like a 20-meter miss in real life.”

📊 Interpretation:
● RMSE = 0 → Perfect

● Higher RMSE = More severe errors

🌍 Use Case:
Most common in regression for real-world, unit-friendly error analysis.

✅ Use When:
You want to balance interpretability and penalty for large errors.

❌ Don’t Use When:

You're comparing across models with different units or scales.
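
A short sketch of RMSE, with made-up values chosen so that the MSE is 400 squared units, matching the example above. The function name is a hypothetical helper built on `math.sqrt`.

```python
import math


def root_mean_squared_error(y_true, y_pred):
    """Square root of the mean squared error, in the original units."""
    mse = sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


actual = [100.0, 200.0, 300.0]
predicted = [120.0, 180.0, 320.0]   # every prediction is off by 20

print(root_mean_squared_error(actual, predicted))   # 20.0
```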

🔢 Mean Absolute Percentage Error (MAPE)

❓What is it?
The average error expressed as a percentage of actual values.

📊 Interpretation:
Lower % = better model.
But MAPE is undefined when actual = 0.

🌍 Use Case:
Useful for business forecasting (e.g., sales, revenue).
✅ Use When:
You care about relative error and want to compare models across different
scales.

❌ Don’t Use When:

● Data has zeros (can cause division errors)

● Very small actual values (can inflate percentages)
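
The two caveats above can be seen in a small sketch. The helper `mape` is a hand-written illustration; raising `ValueError` on a zero actual value is one possible design choice (an assumption here), and the data is invented.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    if any(y == 0 for y in y_true):
        raise ValueError("MAPE is undefined when an actual value is 0")
    total = sum(abs((y - p) / y) for y, p in zip(y_true, y_pred))
    return 100.0 * total / len(y_true)


# Errors of 10%, 10% and 0% average to about 6.67%
print(f"{mape([100.0, 200.0, 400.0], [110.0, 180.0, 400.0]):.2f}%")

# The same absolute miss of 1.0 looks enormous against a tiny actual value
print(f"{mape([0.1], [1.1]):.0f}%")   # about 1000%
```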

Accuracy
✅ What is Accuracy?
Accuracy is one of the most commonly used metrics to evaluate how well
a machine learning model performs. It simply measures the proportion of
correct predictions made by the model out of all predictions.

$$\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}$$

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

📌 Why is Accuracy Important?

Accuracy gives a quick and intuitive measure of how well a model is doing.
If a model has high accuracy, it means it is making correct predictions most
of the time, which is what we usually want.
It’s like a test score:

● If you answered 90 questions correctly out of 100, your accuracy is
90%.
● Similarly, if a model predicted 900 labels correctly out of 1,000, its
accuracy is 90%.

📈 Example of Accuracy
Let’s say a machine learning model predicts whether customers will buy a
product.
Out of 100 customers:

● It correctly predicted the behavior of 85 customers

● It made mistakes on 15 customers

$$\text{Accuracy} = \frac{85}{100} = 0.85 \text{, or } 85\%$$

💡 When to Use Accuracy

Accuracy works well when the data is balanced, meaning:

● The number of examples in each class (e.g., “Yes” and “No”) is
roughly the same.

● The cost of being wrong is not too high.

For example:

● Spam email classification where both spam and non-spam emails are
equally represented

● Image classification with a balanced number of images per category

🚫 When Accuracy Can Be Misleading
While accuracy is easy to understand, it can give a false sense of
performance in some cases, especially when the data is imbalanced.

Example:

Imagine a dataset where:

● 950 people are healthy

● 50 people have a disease

If a model always predicts "healthy," then:

● It will be correct for 950 out of 1000 cases

● Accuracy = 95%

But the model never identifies any diseased person, so it's not useful,
even though accuracy looks high.
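
The imbalanced example above can be reproduced in a few lines. This is a minimal sketch with a hand-written `accuracy` helper; labels use 1 for "diseased" and 0 for "healthy", which is an encoding chosen here for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the actual labels."""
    correct = sum(1 for y, p in zip(y_true, y_pred) if y == p)
    return correct / len(y_true)


y_true = [0] * 950 + [1] * 50   # 950 healthy, 50 diseased
y_pred = [0] * 1000             # a "model" that always predicts healthy

detected = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)

print(f"accuracy = {accuracy(y_true, y_pred):.2%}")   # accuracy = 95.00%
print(f"diseased patients detected = {detected}")     # 0
```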

🔍 Things Accuracy Doesn’t Tell You

● What kind of errors the model is making

● Which class the model struggles with

● How confident the model is in its predictions

So, accuracy is just a starting point, not the full story.

Confusion Matrix
📘 What is a Confusion Matrix?
A confusion matrix is a simple table used to measure how well a
classification model is performing. It compares the predictions made by the
model with the actual results and shows where the model was right or
wrong. This helps you understand where the model is making mistakes so
you can improve it.
🧠 Interpretation:
● True Positive (TP): The model predicted “Yes”, and it was actually
“Yes”

● True Negative (TN): The model predicted “No”, and it was actually
“No”

● False Positive (FP): The model predicted “Yes”, but it was actually
“No” (Type I Error)

● False Negative (FN): The model predicted “No”, but it was actually
“Yes” (Type II Error)

It also helps calculate key measures like accuracy, precision, and recall,
which give a better idea of performance, especially when the data is
imbalanced.
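
The four cells can be counted directly from paired labels. The sketch below is a small illustration with invented labels; `confusion_counts` is a hypothetical helper, and 1 stands for "Yes" (positive) while 0 stands for "No".

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, TN, FN) for a binary problem."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1   # predicted Yes, actually Yes
            else:
                fp += 1   # predicted Yes, actually No (Type I error)
        else:
            if actual == positive:
                fn += 1   # predicted No, actually Yes (Type II error)
            else:
                tn += 1   # predicted No, actually No
    return tp, fp, tn, fn


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print(f"TP={tp} FP={fp} TN={tn} FN={fn}")   # TP=3 FP=1 TN=3 FN=1
```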

Precision

✅ 1. What is Precision?
Precision is a metric used in classification problems to measure the
accuracy of positive predictions.

It answers the question:

“Out of all the instances the model predicted as positive, how
many were actually positive?”
🧪 Formula
$$\text{Precision} = \frac{TP}{TP + FP}$$

Where:

● TP (True Positives): The model correctly predicted positive cases.

● FP (False Positives): The model incorrectly predicted as positive
(actually negative).

🎯 Interpretation
Precision focuses on the quality of positive predictions.

● High precision = few false positives

● Low precision = many false positives

🧠 Real-World Use Cases

📬 Spam Email Detection
● Positive class = Spam

● If a model marks a legitimate email as spam (False Positive), it's
annoying to users.

● → Precision should be high.

🧪 COVID-19 Test
● Positive = Infected

● False positives can cause panic, unnecessary isolation.

● → High precision is desired (test only marks truly infected as
positive).

💳 Fraud Detection
● Positive = Fraud

● Marking too many legit transactions as fraud (FP) frustrates
customers.

● → High precision reduces customer inconvenience.

🔍 When to Focus on Precision?

● When False Positives are costly or harmful

● When the consequences of acting on incorrect positives are serious

● When trust in positive predictions is crucial

⚠️ Limitations of Precision
● Precision doesn’t consider false negatives
→ A model could have high precision but miss many actual positives
(low recall).

● It alone is insufficient for evaluating model performance.
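
As a small sketch of the formula, the helper below computes precision from TP and FP counts. The spam-filter numbers are invented, and returning 0.0 when the model makes no positive predictions is a convention assumed here (the ratio is otherwise undefined).

```python
def precision(tp, fp):
    """TP / (TP + FP); returns 0.0 if nothing was predicted positive (assumed convention)."""
    predicted_positive = tp + fp
    return tp / predicted_positive if predicted_positive else 0.0


# A spam filter flags 50 emails; 45 really are spam, 5 are legitimate
print(precision(tp=45, fp=5))   # 0.9
```

Note that the number of spam emails the filter missed never enters the calculation, which is exactly the limitation described above.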

Recall
📌 What is Recall

Recall is a performance metric used to measure how well a classification
model identifies actual positive cases. It answers the question:

"Out of all the actual positive cases, how many did the model
correctly predict?"

🧪 Formula
$$\text{Recall} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Negatives (FN)}}$$

Where:

● True Positive (TP): Model predicted positive and it is actually
positive.

● False Negative (FN): Model predicted negative but it is actually
positive.

🎯 When to Focus on Recall?

Recall is crucial when:

● False Negatives are costly.

● You care more about not missing positive cases than about being
overly precise.
✅ Use Cases:
● Medical diagnosis (e.g., cancer detection)

● Fraud detection

● Spam filtering

● Loan default prediction

🔄 Trade-off with Precision

Recall has a trade-off with Precision, which measures how many of the
predicted positives are actually positive.

● High Recall, Low Precision → You catch more positives, but also
more false alarms.

● High Precision, Low Recall → Fewer false positives, but you may
miss many true positives.
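
The trade-off can be illustrated by scoring the same examples at two different thresholds. The scores and labels below are made up, and `precision_recall_at` is a hypothetical helper that treats a score at or above the threshold as a positive prediction.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when score >= threshold means 'positive'."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

for threshold in (0.80, 0.25):
    p, r = precision_recall_at(scores, labels, threshold)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
# threshold=0.80  precision=1.00  recall=0.50
# threshold=0.25  precision=0.67  recall=1.00
```

Lowering the threshold catches every positive but lets two false alarms through; raising it removes the false alarms but misses half of the positives.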

False Positive Rate (FPR)

▶️ Definition:
FPR tells us what proportion of actual negatives were incorrectly predicted
as positives.

$$FPR = \frac{FP}{FP + TN}$$
🧠 Interpretation:
"Out of all real negative cases, how many did the model wrongly
mark as positive?"

⚠️ High FPR:
● The model is generating too many false alarms.

● Bad in email spam filtering or medical testing where false positives
can be costly.
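
A minimal sketch of the FPR formula, using invented screening numbers. The helper name and the choice to return 0.0 when there are no actual negatives are assumptions made for this illustration.

```python
def false_positive_rate(fp, tn):
    """FP / (FP + TN); returns 0.0 if there are no actual negatives (assumed convention)."""
    actual_negatives = fp + tn
    return fp / actual_negatives if actual_negatives else 0.0


# 950 healthy people are screened; 19 are wrongly flagged as diseased
fpr = false_positive_rate(fp=19, tn=931)
print(f"FPR = {fpr:.2%}")   # FPR = 2.00%
```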

F1 Score

📌 What is the F1 Score?

The F1 Score is the harmonic mean of Precision and Recall, and is used
as a single metric to evaluate a model's performance when both false
positives and false negatives are important.

Unlike arithmetic mean, the harmonic mean punishes extreme values
more. So if either precision or recall is low, the F1 score drops significantly.

$$F_1 = 2 \cdot \frac{P \cdot R}{P + R}$$

Where:

● $P = \text{Precision} = \frac{TP}{TP + FP}$
● $R = \text{Recall} = \frac{TP}{TP + FN}$
🧠 Why Harmonic Mean?
The harmonic mean is appropriate when we want to combine rates (like
Precision and Recall), especially when extreme values are undesirable.

🧾 Properties of F1 Score
| Property | Explanation |
| --- | --- |
| Range | [0, 1], where 1 is perfect precision and recall |
| Symmetric | F1 treats precision and recall equally; it doesn't favor one over the other |
| Insensitive to True Negatives (TN) | TN doesn’t appear in precision or recall formulas, so F1 ignores correct classification of negatives |
| Non-linear | Because of the harmonic mean, even small drops in precision or recall have large impacts on F1 |

📌 When to Use F1 Score
F1 Score is the preferred metric when:

● You cannot assume equal class distribution

● False positives and false negatives carry similar consequences

● The dataset is imbalanced, e.g., 90% negatives and 10% positives

● Examples include:

○ Disease diagnosis

○ Fraud detection

○ Spam detection

○ Rare event prediction

🛠️ Limitations of F1 Score
1. Ignores True Negatives
Doesn’t reflect model performance on negative class

2. Not Always Interpretable Alone

Two models may have the same F1 but different precision/recall
balances

3. Equal Weight to Precision and Recall

Not ideal if your problem requires favoring one (e.g., high recall over
precision)
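
To make the harmonic-mean behavior concrete, the sketch below compares F1 with the plain arithmetic mean for a lopsided model (high precision, very low recall). The numbers are invented, and `f1_score` here is a hand-written helper with F1 defined as 0.0 when both inputs are zero (an assumed convention).

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


p, r = 0.9, 0.1
print(f"arithmetic mean = {(p + r) / 2:.2f}")   # 0.50
print(f"F1 (harmonic)   = {f1_score(p, r):.2f}")  # 0.18

# A balanced model with the same arithmetic mean scores much higher
print(f"balanced F1     = {f1_score(0.5, 0.5):.2f}")   # 0.50
```

Both models average 0.5 arithmetically, but F1 only rewards the one where precision and recall are both reasonable.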
ROC (Receiver Operating Characteristic)

📘 What is ROC?
ROC stands for Receiver Operating Characteristic.

It is a graphical representation that illustrates the performance of a binary
classification model as the decision threshold is varied.

🔍 Origin of the Name

● The term “Receiver Operating Characteristic” comes from signal
detection theory during World War II.

● It was used to assess the ability of radar systems to distinguish
signal (enemy aircraft) from noise (birds, clouds, etc.).

🎯 What Does ROC Show?

The ROC curve plots:

● True Positive Rate (TPR) on the Y-axis

● False Positive Rate (FPR) on the X-axis

Each point on the ROC curve corresponds to a specific threshold
used to convert predicted probabilities into class labels
(positive/negative).
📈 Axes of ROC Curve
| Axis | Metric | Meaning |
| --- | --- | --- |
| X-axis | FPR (False Positive Rate) | Proportion of actual negatives wrongly classified as positive |
| Y-axis | TPR (True Positive Rate) | Proportion of actual positives correctly classified |

📊 How to Interpret ROC Curve

🔹 Ideal ROC Curve:
● Starts at (0, 0)

● Goes to (0, 1) — perfect TPR, zero FPR

● Ends at (1, 1)

The closer the curve follows the left-hand border and then the top border,
the better the classifier.

🔹 Random Classifier:
● Lies along the diagonal line from (0,0) to (1,1)

● Means TPR ≈ FPR — no real predictive power

🧠 Conceptual Intuition
Let’s say your model outputs probabilities for "positive" class. You can
convert them to labels using a threshold (e.g., 0.5):

● At high threshold (e.g., 0.9): Only very confident predictions are
labeled positive ⇒ low TPR, low FPR

● At low threshold (e.g., 0.1): Most samples are labeled positive ⇒ high
TPR, but also high FPR

The ROC curve is created by sweeping this threshold from 0 to 1 and
plotting (FPR, TPR) at each step.
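
The threshold sweep can be sketched in plain Python. This is an illustrative implementation under simple assumptions: `roc_points` is a hypothetical helper, each distinct score is used as a threshold, a score at or above the threshold counts as a positive prediction, and the data is invented.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs from the strictest threshold to the loosest."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]   # threshold above every score: nothing labeled positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]   # predicted probability of "positive"
labels = [1, 1, 0, 1, 0, 0]               # actual classes

for fpr, tpr in roc_points(scores, labels):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

The points run from (0, 0) at the strictest threshold to (1, 1) at the loosest, with TPR and FPR both rising as the threshold drops.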

✅ Why ROC is Useful

● Threshold-independent: It gives a full picture of model performance
across all thresholds

● Good for comparing multiple classifiers

● Especially helpful when positive and negative classes are fairly
balanced
❗ Important Notes
● ROC ignores class imbalance — for imbalanced datasets,
Precision-Recall Curve may be better

● ROC is most informative when your model outputs probabilities, not
hard labels

AUC (Area Under the Curve)

📘 What is AUC?
AUC stands for Area Under the Curve.
When we say “AUC” in machine learning, we almost always mean:

Area under the ROC curve (AUC-ROC)

It is a single scalar value that summarizes the overall performance of a
binary classifier across all classification thresholds.

🔍 What Does AUC Represent?

AUC answers this question:

“If I randomly pick one positive example and one negative
example, what is the probability that the model assigns a higher
probability to the positive example than to the negative one?”

So, a higher AUC means better model performance.

📈 AUC Scale and Interpretation
| AUC Score | Interpretation |
| --- | --- |
| 1 | Perfect classifier |
| 0.9–1.0 | Excellent |
| 0.8–0.9 | Very good |
| 0.7–0.8 | Acceptable |
| 0.6–0.7 | Poor |
| 0.5 | No discrimination (random guessing) |
| < 0.5 | Worse than random (inverted predictions) |

🧮 How is AUC Calculated?

Since it is area under the ROC curve, it can be computed via:

1. Trapezoidal Rule:

● Numerically integrates the area under the curve formed by (FPR,
TPR) points.

2. Ranking Approach:

● AUC is equivalent to the Wilcoxon-Mann-Whitney statistic

● It measures how well the classifier ranks a random positive
example higher than a random negative one
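
Both approaches can be sketched side by side and should agree on the same data. The code below is an illustration, not a reference implementation: all three function names are hypothetical helpers, the data is invented, and ties between a positive and a negative score are counted as half a win in the ranking version.

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs, using each distinct score as a threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_trapezoid(points):
    """Area under the piecewise-linear curve through the ROC points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def auc_ranking(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]

print(f"trapezoid: {auc_trapezoid(roc_points(scores, labels)):.4f}")   # 0.8889
print(f"ranking:   {auc_ranking(scores, labels):.4f}")                 # 0.8889
```

Here 8 of the 9 positive/negative pairs are ordered correctly, so both methods give 8/9.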
📊 Visual Understanding
Ideal ROC Curve:

● Reaches (0,1) quickly → AUC ≈ 1

Random Model:

● Diagonal line → AUC ≈ 0.5

Poor Model:

● Curve below diagonal → AUC < 0.5

🧠 Intuitive Example
Suppose:

● You have 100 spam emails (positives)

● And 100 non-spam emails (negatives)

● Your model ranks emails based on probability of being spam

If for every spam email, the model assigns a higher score than every
non-spam email → AUC = 1

If the model orders only about half of the spam/non-spam pairs correctly → AUC ~ 0.5

If it always ranks spam below non-spam → AUC = 0

✅ Why is AUC Useful?
● Threshold-independent: Unlike accuracy, it does not depend on a
specific classification threshold

● Compares models effectively, even with different probability
calibrations

● Less sensitive to class imbalance than accuracy, though it can still look optimistic on highly imbalanced data

❗ Things to Keep in Mind

| Point | Explanation |
| --- | --- |
| AUC is for ranking ability | It doesn’t evaluate probability calibration (e.g., how close to 0.8 is 0.8) |
| Not always meaningful for highly imbalanced data | Use Precision-Recall AUC in such cases |
| AUC can hide poor decisions at critical thresholds | Always combine AUC with confusion matrix or domain-specific analysis |

