Model Evaluation & Tuning
🔢 Total Sum of Squares (TSS)
❓What is it?
TSS measures the total variation in the actual (observed) data values: how
far your data points are from their mean, squared and summed.
Example:
Imagine a group of penguins trying to hit a dartboard while blindfolded. TSS
tells you how wildly all those darts are spread around their average landing
spot, regardless of whether they’re hitting the bullseye.
📊 Interpretation:
High TSS = data is spread out; Low TSS = data is clustered around the
mean.
TSS is the sum of the squared differences between each actual value and the
average of all actual values: $TSS = \sum_i (y_i - \bar{y})^2$.
🌍 Use Case:
Used in regression to compute how much variation the model can
potentially explain.
✅ Use When:
You want to evaluate the total variance in the response variable.
❌ Don’t Use When:
You need model performance alone; it doesn't reflect model accuracy
directly.
🔢 Regression Sum of Squares (RSS or SSR)
❓What is it?
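RSS is the part of TSS explained by the regression model. (Naming varies:
many textbooks call this SSR or the explained sum of squares and reserve
RSS for the residual sum of squares. Here RSS means the regression sum of
squares.)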
Example:
Back to the penguins: RSS tells you how many darts landed closer to
where the coach told them to aim. Coach = regression line.
📊 Interpretation:
Higher RSS = model explains more variance.
RSS is the sum of the squared differences between each predicted value and
the average of all actual values: $RSS = \sum_i (\hat{y}_i - \bar{y})^2$.
🌍 Use Case:
Used to calculate R² and understand how well the model fits the data.
✅ Use When:
You want to know how much of the total variance is captured by the model.
❌ Don’t Use When:
You're dealing with classification problems or non-linear behavior.
🔢 Error Sum of Squares (ESS or SSE)
❓What is it?
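The part of TSS that the model couldn’t explain: the leftovers, the mess.
(Naming varies: elsewhere ESS often means the explained sum of squares.
Here ESS means the error sum of squares, also written SSE.)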
Example:
Penguins missed the mark? ESS is the sum of all those missed throws
compared to the predicted location.
📊 Interpretation:
Lower ESS = better model accuracy.
ESS is the sum of the squared differences between each actual value and its
corresponding predicted value: $ESS = \sum_i (y_i - \hat{y}_i)^2$.
🌍 Use Case:
Helps measure prediction error, also used in computing RMSE and R².
✅ Use When:
You’re evaluating how far off your predictions are from actual values.
❌ Don’t Use When:
You care about percent-based error or need normalized metrics.
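The three sums of squares above can be checked against each other with a small sketch. This is only an illustration under stated assumptions: the data points are made up, and `fit_line` and `sums_of_squares` are hypothetical helper names, not functions from any library. For an ordinary least squares line with an intercept, TSS should equal the regression sum plus the error sum.

```python
def sums_of_squares(y_true, y_pred):
    """Return (TSS, regression SS, error SS) for actual and predicted values."""
    mean_y = sum(y_true) / len(y_true)
    tss = sum((y - mean_y) ** 2 for y in y_true)             # actual vs mean
    ssr = sum((p - mean_y) ** 2 for p in y_pred)             # predicted vs mean
    sse = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))  # actual vs predicted
    return tss, ssr, sse


def fit_line(x, y):
    """Ordinary least squares with one feature: returns (slope, intercept)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up data, roughly y = 2x
x = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

slope, intercept = fit_line(x, y)
y_hat = [slope * xi + intercept for xi in x]

tss, ssr, sse = sums_of_squares(y, y_hat)
print(f"TSS={tss:.3f}  explained={ssr:.3f}  unexplained={sse:.3f}")
print("TSS == explained + unexplained:", abs(tss - (ssr + sse)) < 1e-9)
```

The decomposition only holds exactly for a least squares fit that includes an intercept; for other models the two parts need not add up to TSS.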
🔢 R-squared (R²)
❓What is it?
R² tells you what percentage of the variance in the target variable is
explained by the model.
📊 Interpretation:
● R² = 0.8 means the model explains 80% of the variance: its squared error
is 80% lower than just guessing the mean every time.
● R² = 1 → “You’re a genius!”
● R² = 0 → “You’re just guessing...”
● R² < 0 → “Please stop. You’re worse than guessing.”
🌍 Use Case:
Used to assess linear regression performance.
✅ Use When:
Comparing linear regression models or explaining model strength.
❌ Don’t Use When:
You're comparing models with different feature counts (use Adjusted R²
instead).
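The interpretation bullets above follow from the usual definition, R² = 1 − (error sum of squares / total sum of squares). A minimal sketch, with made-up numbers and a hypothetical helper name `r_squared`:

```python
def r_squared(y_true, y_pred):
    """R² = 1 - SSE / TSS, where the baseline is always predicting the mean."""
    mean_y = sum(y_true) / len(y_true)
    tss = sum((y - mean_y) ** 2 for y in y_true)
    sse = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return 1 - sse / tss


y_true = [3.0, 5.0, 7.0, 9.0]                     # mean is 6.0

perfect = r_squared(y_true, y_true)                # every prediction exact
baseline = r_squared(y_true, [6.0] * 4)            # always guess the mean
worse = r_squared(y_true, [9.0, 7.0, 5.0, 3.0])    # predictions reversed

print(perfect, baseline, worse)  # 1.0 0.0 -3.0
```

The negative value shows why R² < 0 is possible: the reversed predictions have more squared error than the mean-only baseline.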
🔢 Adjusted R-squared
❓What is it?
R², but smarter: it penalizes the model for using too many features that
don’t help.
Example:
You add 10 penguins to help, but they just confuse things. Adjusted R²
says: “Nope, no participation trophies!”
📊 Interpretation:
If Adjusted R² decreases when adding a variable, it means the new variable
isn’t helpful.
🌍 Use Case:
Best for feature selection in regression.
✅ Use When:
You’re comparing regression models with different numbers of predictors.
❌ Don’t Use When:
You’re using non-linear models or want raw fit measure (R² is simpler then).
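The notes above do not spell out the formula, so the sketch below assumes the commonly used one: Adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1), with n samples and p predictors. The function name and the example R² values are made up for illustration.

```python
def adjusted_r_squared(r2, n_samples, n_features):
    """Adjusted R² using the common formula 1 - (1 - R²)(n - 1)/(n - p - 1)."""
    dof = n_samples - n_features - 1
    if dof <= 0:
        raise ValueError("need more samples than features + 1")
    return 1 - (1 - r2) * (n_samples - 1) / dof


# Hypothetical scenario: adding 10 weak features nudges R² from 0.80 to 0.81
small_model = adjusted_r_squared(0.80, n_samples=50, n_features=5)
large_model = adjusted_r_squared(0.81, n_samples=50, n_features=15)

print(f"5 features:  {small_model:.3f}")
print(f"15 features: {large_model:.3f}")
```

Here plain R² goes up with the extra features while the adjusted value goes down, which is the "no participation trophies" behavior described above.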
🔢 Mean Absolute Error (MAE)
❓What is it?
The average absolute difference between actual and predicted values.
Example:
It’s like penguins measuring their missed distances: 2 feet, 3 feet, and 5 feet.
MAE just says: “Your average miss was 3.33 feet.”
📊 Interpretation:
● MAE = 0 → Perfect prediction
● Higher MAE → More average error
🌍 Use Case:
Useful when you care equally about over- and under-predictions.
✅ Use When:
You want a straightforward, interpretable error measure.
❌ Don’t Use When:
You want to penalize larger errors more heavily (use MSE or RMSE
instead).
🔢 Mean Squared Error (MSE)
❓What is it?
The average of squared differences between actual and predicted
values.
Example:
Penguins not only care about the miss, they square it to exaggerate the
pain! A 10-meter miss hurts 100x more than a 1-meter miss.
📊 Interpretation:
● MSE = 0 → Perfect prediction
● Larger errors are punished more due to squaring
🌍 Use Case:
Great for detecting large prediction errors.
✅ Use When:
You want to penalize big mistakes more heavily.
❌ Don’t Use When:
You need easily interpretable error units (use RMSE or MAE).
🔢 Root Mean Squared Error (RMSE)
❓What is it?
The square root of MSE, so the error is in the original units (like ₹, km,
etc.).
Example:
If MSE says penguins missed by 400 (squared units), RMSE says, “That's
like a 20-meter miss in real life.”
📊 Interpretation:
● RMSE = 0 → Perfect
● Higher RMSE = More severe errors
🌍 Use Case:
Most common in regression for real-world, unit-friendly error analysis.
✅ Use When:
You want to balance interpretability and penalty for large errors.
❌ Don’t Use When:
You're comparing errors across targets with different units or scales (RMSE
is scale-dependent).
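MAE, MSE and RMSE can be compared side by side on the penguin misses of 2, 3 and 5 feet. This is a small sketch with hypothetical helper names; the target value of 10 is made up so that the misses come out to those distances.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: average size of the miss."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error: average squared miss."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: MSE brought back to the original units."""
    return math.sqrt(mse(y_true, y_pred))


y_true = [10.0, 10.0, 10.0]
y_pred = [8.0, 13.0, 5.0]   # misses of 2, 3 and 5

print(f"MAE  = {mae(y_true, y_pred):.2f}")
print(f"MSE  = {mse(y_true, y_pred):.2f}")
print(f"RMSE = {rmse(y_true, y_pred):.2f}")
```

RMSE comes out larger than MAE here because squaring gives the 5-foot miss more weight than the smaller ones.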
🔢 Mean Absolute Percentage Error (MAPE)
❓What is it?
The average error expressed as a percentage of actual values.
📊 Interpretation:
Lower % = better model.
But MAPE is undefined when actual = 0.
🌍 Use Case:
Useful for business forecasting (e.g., sales, revenue).
✅ Use When:
You care about relative error and want to compare models across different
scales.
❌ Don’t Use When:
● Data has zeros (can cause division errors)
● Very small actual values (can inflate percentages)
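Both caveats above show up directly in code. The sketch below is one possible way to write MAPE; raising an error on zero actuals is a design choice made here for illustration, not the only option, and the numbers are invented.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    if any(y == 0 for y in y_true):
        # Dividing by an actual value of 0 is undefined, so refuse early.
        raise ValueError("MAPE is undefined when an actual value is 0")
    ratios = [abs((y - p) / y) for y, p in zip(y_true, y_pred)]
    return 100 * sum(ratios) / len(ratios)


# Errors of 10%, 10% and 0% average out to about 6.67%
typical = mape([100, 200, 400], [110, 180, 400])

# A tiny actual value inflates the percentage: missing 0.01 by 0.10
inflated = mape([0.01], [0.11])

print(f"typical:  {typical:.2f}%")
print(f"inflated: {inflated:.0f}%")
```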
Accuracy
✅ What is Accuracy?
Accuracy is one of the most commonly used metrics to evaluate how well
a machine learning model performs. It simply measures the proportion of
correct predictions made by the model out of all predictions.
$$\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}$$

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
📌 Why is Accuracy Important?
Accuracy gives a quick and intuitive measure of how well a model is doing.
If a model has high accuracy, it means it is making correct predictions most
of the time, which is what we usually want.
It’s like a test score:
● If you answered 90 questions correctly out of 100, your accuracy is
90%.
● Similarly, if a model predicted 900 labels correctly out of 1,000, its
accuracy is 90%.
📈 Example of Accuracy
Let’s say a machine learning model predicts whether customers will buy a
product.
Out of 100 customers:
● It correctly predicted the behavior of 85 customers
● It made mistakes on 15 customers
$$\text{Accuracy} = \frac{85}{100} = 0.85 \text{ or } 85\%$$
💡 When to Use Accuracy
Accuracy works well when the data is balanced, meaning:
● The number of examples in each class (e.g., “Yes” and “No”) is
roughly the same.
● The cost of being wrong is not too high.
For example:
● Spam email classification where both spam and non-spam emails are
equally represented
● Image classification with a balanced number of images per category
🚫 When Accuracy Can Be Misleading
While accuracy is easy to understand, it can give a false sense of
performance in some cases, especially when the data is imbalanced.
Example:
Imagine a dataset where:
● 950 people are healthy
● 50 people have a disease
If a model always predicts "healthy," then:
● It will be correct for 950 out of 1000 cases
● Accuracy = 95%
But the model never identifies any diseased person, so it's not useful
even though accuracy looks high.
🔍 Things Accuracy Doesn’t Tell You
● What kind of errors the model is making
● Which class the model struggles with
● How confident the model is in its predictions
So, accuracy is just a starting point, not the full story.
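The healthy/diseased example can be reproduced in a few lines. This is a sketch with a hypothetical `accuracy` helper; labels are encoded as 0 = healthy and 1 = diseased purely for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the actual labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# 950 healthy (0) and 50 diseased (1) people
y_true = [0] * 950 + [1] * 50

# A "model" that always predicts healthy
y_pred = [0] * 1000

acc = accuracy(y_true, y_pred)
diseased_found = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

print(f"accuracy = {acc:.0%}, diseased people identified = {diseased_found}")
```

The score looks strong while the count of identified patients stays at zero, which is why accuracy alone is not enough on imbalanced data.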
Confusion Matrix
📘 What is a Confusion Matrix?
A confusion matrix is a simple table used to measure how well a
classification model is performing. It compares the predictions made by the
model with the actual results and shows where the model was right or
wrong. This helps you understand where the model is making mistakes so
you can improve it.
🧠 Interpretation:
● True Positive (TP): The model predicted “Yes”, and it was actually
“Yes”
● True Negative (TN): The model predicted “No”, and it was actually
“No”
● False Positive (FP): The model predicted “Yes”, but it was actually
“No” (Type I Error)
● False Negative (FN): The model predicted “No”, but it was actually
“Yes” (Type II Error)
It also helps calculate key measures like accuracy, precision and recall,
which give a better idea of performance, especially when the data is
imbalanced.
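The four cells can be counted directly from paired labels. A minimal sketch, assuming labels are encoded as 1 = "Yes" and 0 = "No"; `confusion_counts` is a hypothetical helper and the data is made up.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP and FN for a binary classifier."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            counts["TP"] += 1      # predicted Yes, actually Yes
        elif predicted != positive and actual != positive:
            counts["TN"] += 1      # predicted No, actually No
        elif predicted == positive:
            counts["FP"] += 1      # predicted Yes, actually No (Type I)
        else:
            counts["FN"] += 1      # predicted No, actually Yes (Type II)
    return counts


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

counts = confusion_counts(y_true, y_pred)
print(counts)  # {'TP': 3, 'TN': 3, 'FP': 1, 'FN': 1}
```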
Precision
✅ 1. What is Precision?
Precision is a metric used in classification problems to measure the
accuracy of positive predictions.
It answers the question:
“Out of all the instances the model predicted as positive, how
many were actually positive?”
🧪 Formula
$$\text{Precision} = \frac{TP}{TP + FP}$$
Where:
● TP (True Positives): The model correctly predicted positive cases.
● FP (False Positives): The model incorrectly predicted as positive
(actually negative).
🎯 Interpretation
Precision focuses on the quality of positive predictions.
● High precision = few false positives
● Low precision = many false positives
🧠 Real-World Use Cases
📬 Spam Email Detection
● Positive class = Spam
● If a model marks a legitimate email as spam (False Positive), it's
annoying to users.
● → Precision should be high.
🧪 COVID-19 Test
● Positive = Infected
● False positives can cause panic, unnecessary isolation.
● → High precision is desired (test only marks truly infected as
positive).
💳 Fraud Detection
● Positive = Fraud
● Marking too many legit transactions as fraud (FP) frustrates
customers.
● → High precision reduces customer inconvenience.
🔍 When to Focus on Precision?
● When False Positives are costly or harmful
● When the consequences of acting on incorrect positives are serious
● When trust in positive predictions is crucial
⚠️ Limitations of Precision
● Precision doesn’t consider false negatives
→ A model could have high precision but miss many actual positives
(low recall).
● It alone is insufficient for evaluating model performance.
Recall
📌 What is Recall?
Recall is a performance metric used to measure how well a classification
model identifies actual positive cases. It answers the question:
"Out of all the actual positive cases, how many did the model
correctly predict?"
🧪 Formula
$$\text{Recall} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Negatives (FN)}}$$
Where:
● True Positive (TP): Model predicted positive and it is actually
positive.
● False Negative (FN): Model predicted negative but it is actually
positive.
🎯 When to Focus on Recall?
Recall is crucial when:
● False Negatives are costly.
● You care more about not missing positive cases than about being
overly precise.
✅ Use Cases:
● Medical diagnosis (e.g., cancer detection)
● Fraud detection
● Spam filtering
● Loan default prediction
🔄 Trade-off with Precision
Recall has a trade-off with Precision, which measures how many of the
predicted positives are actually positive.
● High Recall, Low Precision → You catch more positives, but also
more false alarms.
● High Precision, Low Recall → Fewer false positives, but you may
miss many true positives.
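The trade-off can be seen by scoring the same predictions at two different thresholds. The sketch below uses invented labels and scores, and the helper `precision_recall` is hypothetical; returning 0.0 when a denominator is zero is a convention assumed here, not a universal rule.

```python
def precision_recall(labels, scores, threshold):
    """Precision and recall when scores >= threshold are called positive."""
    tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(labels, scores) if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # assumed convention
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # assumed convention
    return precision, recall


labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.95, 0.85, 0.60, 0.40, 0.70, 0.30, 0.20, 0.10]

strict = precision_recall(labels, scores, threshold=0.80)
loose = precision_recall(labels, scores, threshold=0.25)

print(f"threshold 0.80: precision={strict[0]:.2f}, recall={strict[1]:.2f}")
print(f"threshold 0.25: precision={loose[0]:.2f}, recall={loose[1]:.2f}")
```

The strict threshold gives high precision with low recall, and the loose threshold flips that around, matching the two bullets above.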
False Positive Rate (FPR)
▶️ Definition:
FPR tells us how many actual negatives were incorrectly predicted as
positives.
$$FPR = \frac{FP}{FP + TN}$$
🧠 Interpretation:
"Out of all real negative cases, how many did the model wrongly
mark as positive?"
⚠️ High FPR:
● The model is generating too many false alarms.
● Bad in email spam filtering or medical testing where false positives
can be costly.
F1 Score
📌 What is the F1 Score?
The F1 Score is the harmonic mean of Precision and Recall, and is used
as a single metric to evaluate a model's performance when both false
positives and false negatives are important.
Unlike arithmetic mean, the harmonic mean punishes extreme values
more. So if either precision or recall is low, the F1 score drops significantly.
$$F_1 = 2 \cdot \frac{P \cdot R}{P + R}$$

Where:
● $P = \text{Precision} = \frac{TP}{TP + FP}$
● $R = \text{Recall} = \frac{TP}{TP + FN}$
🧠 Why Harmonic Mean?
The harmonic mean is appropriate when we want to combine rates (like
Precision and Recall), especially when extreme values are undesirable.
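A quick comparison shows how the harmonic mean reacts to one weak component. The precision and recall values are made up, and `f1_score` here is a hand-written helper (returning 0.0 when both inputs are 0 is an assumed convention).

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # assumed convention when both are zero
    return 2 * precision * recall / (precision + recall)


balanced = f1_score(0.8, 0.8)
lopsided = f1_score(1.0, 0.1)
arithmetic_mean = (1.0 + 0.1) / 2

print(f"balanced  (P=0.8, R=0.8): F1 = {balanced:.2f}")
print(f"lopsided  (P=1.0, R=0.1): F1 = {lopsided:.2f}")
print(f"arithmetic mean of 1.0 and 0.1 = {arithmetic_mean:.2f}")
```

With perfect precision but very low recall, the arithmetic mean still looks moderate while F1 stays close to the weaker value.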
🧾 Properties of F1 Score
| Property | Explanation |
| --- | --- |
| Range | [0, 1], where 1 is perfect precision and recall |
| Symmetric | F1 treats precision and recall equally; it doesn't favor one over the other |
| Insensitive to True Negatives (TN) | TN doesn’t appear in precision or recall formulas, so F1 ignores correct classification of negatives |
| Non-linear | Because of the harmonic mean, even small drops in precision or recall have large impacts on F1 |
📌 When to Use F1 Score
F1 Score is the preferred metric when:
● You cannot assume equal class distribution
● False positives and false negatives carry similar consequences
● The dataset is imbalanced, e.g., 90% negatives and 10% positives
● Examples include:
○ Disease diagnosis
○ Fraud detection
○ Spam detection
○ Rare event prediction
🛠️ Limitations of F1 Score
1. Ignores True Negatives
Doesn’t reflect model performance on negative class
2. Not Always Interpretable Alone
Two models may have the same F1 but different precision/recall
balances
3. Equal Weight to Precision and Recall
Not ideal if your problem requires favoring one (e.g., high recall over
precision)
ROC (Receiver Operating Characteristic)
📘 What is ROC?
ROC stands for Receiver Operating Characteristic.
It is a graphical representation that illustrates the performance of a binary
classification model as the decision threshold is varied.
🔍 Origin of the Name
● The term “Receiver Operating Characteristic” comes from signal
detection theory during World War II.
● It was used to assess the ability of radar systems to distinguish
signal (enemy aircraft) from noise (birds, clouds, etc.).
🎯 What Does ROC Show?
The ROC curve plots:
● True Positive Rate (TPR) on the Y-axis
● False Positive Rate (FPR) on the X-axis
Each point on the ROC curve corresponds to a specific threshold
used to convert predicted probabilities into class labels
(positive/negative).
📈 Axes of ROC Curve
| Axis | Metric | Meaning |
| --- | --- | --- |
| X-axis | FPR (False Positive Rate) | Proportion of actual negatives wrongly classified as positive |
| Y-axis | TPR (True Positive Rate) | Proportion of actual positives correctly classified |
📊 How to Interpret ROC Curve
🔹 Ideal ROC Curve:
● Starts at (0, 0)
● Goes to (0, 1) — perfect TPR, zero FPR
● Ends at (1, 1)
The closer the curve follows the left-hand border and then the top border,
the better the classifier.
🔹 Random Classifier:
● Lies along the diagonal line from (0,0) to (1,1)
● Means TPR ≈ FPR — no real predictive power
🧠 Conceptual Intuition
Let’s say your model outputs probabilities for "positive" class. You can
convert them to labels using a threshold (e.g., 0.5):
● At high threshold (e.g., 0.9): Only very confident predictions are
labeled positive ⇒ low TPR, low FPR
● At low threshold (e.g., 0.1): Most samples are labeled positive ⇒ high
TPR, but also high FPR
The ROC curve is created by sweeping this threshold from 0 to 1 and
plotting (FPR, TPR) at each step.
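The threshold sweep can be sketched in plain Python. The labels and scores below are invented, `roc_points` is a hypothetical helper, and using each distinct score as a threshold (plus one threshold above all scores) is one common way to do the sweep, assumed here for illustration.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs, from the strictest threshold to the loosest."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Start above every score (nothing is positive), then lower step by step.
    thresholds = [float("inf")] + sorted(set(scores), reverse=True)
    points = []
    for thr in thresholds:
        tp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]

points = roc_points(labels, scores)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

The first point is (0, 0), where nothing is labeled positive, and the last is (1, 1), where everything is; the curve in between is what gets plotted.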
✅ Why ROC is Useful
● Threshold-independent: It gives a full picture of model performance
across all thresholds
● Good for comparing multiple classifiers
● Especially helpful when positive and negative classes are fairly
balanced
❗ Important Notes
● ROC ignores class imbalance — for imbalanced datasets,
Precision-Recall Curve may be better
● ROC is most informative when your model outputs probabilities, not
hard labels
AUC (Area Under the Curve)
📘 What is AUC?
AUC stands for Area Under the Curve.
When we say “AUC” in machine learning, we almost always mean:
Area under the ROC curve (AUC-ROC)
It is a single scalar value that summarizes the overall performance of a
binary classifier across all classification thresholds.
🔍 What Does AUC Represent?
AUC answers this question:
“If I randomly pick one positive example and one negative
example, what is the probability that the model assigns a higher
probability to the positive example than to the negative one?”
So, a higher AUC means better model performance.
📈 AUC Scale and Interpretation
| AUC Score | Interpretation |
| --- | --- |
| 1 | Perfect classifier |
| 0.9–1.0 | Excellent |
| 0.8–0.9 | Very good |
| 0.7–0.8 | Acceptable |
| 0.6–0.7 | Poor |
| 0.5 | No discrimination (random guessing) |
| < 0.5 | Worse than random (inverted predictions) |
🧮 How is AUC Calculated?
Since it is area under the ROC curve, it can be computed via:
1. Trapezoidal Rule:
● Numerically integrates the area under the curve formed by (FPR,
TPR) points.
2. Ranking Approach:
● AUC is equivalent to the Wilcoxon-Mann-Whitney statistic
● It measures how well the classifier ranks a random positive
example higher than a random negative one
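Both approaches can be sketched and compared on the same toy data. Everything here is illustrative: the function names are hypothetical, the data is made up, and counting tied scores as half a win is a common convention assumed for the ranking version.

```python
def auc_by_ranking(labels, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # assumed convention: a tie counts as half
    return wins / (len(pos) * len(neg))


def auc_by_trapezoid(labels, scores):
    """Area under the (FPR, TPR) points using the trapezoidal rule."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    thresholds = [float("inf")] + sorted(set(scores), reverse=True)
    points = []
    for thr in thresholds:
        tp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2   # one trapezoid per step
    return area


labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]

rank_auc = auc_by_ranking(labels, scores)
trap_auc = auc_by_trapezoid(labels, scores)
print(f"ranking: {rank_auc:.4f}  trapezoid: {trap_auc:.4f}")

# Sanity checks: a perfect ranking and a fully inverted one
perfect = auc_by_ranking([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
inverted = auc_by_ranking([1, 1, 0, 0], [0.1, 0.2, 0.8, 0.9])
print(perfect, inverted)  # 1.0 0.0
```

On this data 8 of the 9 positive/negative pairs are ordered correctly, and the trapezoidal area comes out to the same value.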
📊 Visual Understanding
Ideal ROC Curve:
● Reaches (0,1) quickly → AUC ≈ 1
Random Model:
● Diagonal line → AUC ≈ 0.5
Poor Model:
● Curve below diagonal → AUC < 0.5
🧠 Intuitive Example
Suppose:
● You have 100 spam emails (positives)
● And 100 non-spam emails (negatives)
● Your model ranks emails based on probability of being spam
If for every spam email, the model assigns a higher score than every
non-spam email → AUC = 1
If the model ranks about half of the spam/non-spam pairs incorrectly → AUC ~ 0.5
If it always ranks spam below non-spam → AUC = 0
✅ Why is AUC Useful?
● Threshold-independent: Unlike accuracy, it does not depend on a
specific classification threshold
● Compares models effectively, even with different probability
calibrations
● Robust against class imbalance, unlike accuracy
❗ Things to Keep in Mind
| Point | Explanation |
| --- | --- |
| AUC is for ranking ability | It doesn’t evaluate probability calibration (e.g., how close to 0.8 is 0.8) |
| Not always meaningful for highly imbalanced data | Use Precision-Recall AUC in such cases |
| AUC can hide poor decisions at critical thresholds | Always combine AUC with confusion matrix or domain-specific analysis |