UNIT-4:
Model Validation Techniques and
Traditional Interpretable Algorithms
TOPICS in UNIT-4
• Model Validation, Evaluation, and Hyperparameters; Model
selection and evaluation: Validation curve and learning curve;
Classification model validation: Confusion Matrix, Accuracy,
Sensitivity, Specificity; Regression model validation: Root Mean
Square Error, RSE, MAE, RAE, R2;
• Ante-Hoc Explainability Methods: Interpretable Machine
Learning Properties, Interpretable Models: Decision Trees, Rule-
Based Systems, Case Studies: Using Interpretable Models in Real-
World Scenarios
I. Model Validation, Evaluation, and Hyperparameters
• The key to creating great models is to
make sure that the model generalizes
well on unseen data.
• Figure 3.1 gives the most well-
established process that ensures models
do not overfit (or underfit) and
generalize well for classification and
regression.
• The labeled dataset can be divided
into training, validation, and test
sets from the original data. Primarily,
the test set should be representative of
the unseen real-world data in terms of
quality, distribution, class balance, etc.
• The validation set is used during the
training phase of the model to estimate
performance on data the model was not
fitted on and to tune the model's
hyperparameters.
• The test set, on the other hand, is used
after the model has been fully
trained to assess the model's
performance on completely unseen
data.
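The three-way split described above can be sketched in plain Python. This is only an illustration under stated assumptions: the helper name `train_val_test_split`, the 60/20/20 proportions, and the fixed seed are all chosen for the example, and the shuffle is purely random (no stratification to preserve class balance).

```python
import random

def train_val_test_split(samples, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle a copy of `samples` and cut it into train/validation/test lists.

    Hypothetical helper for illustration; the fractions and seed are assumptions.
    """
    if not 0 <= val_frac + test_frac < 1:
        raise ValueError("val_frac + test_frac must be in [0, 1)")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # reproducible shuffle
    n = len(shuffled)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

data = list(range(100))  # stand-in for 100 labeled examples
train, val, test = train_val_test_split(data)
print(len(train), len(val), len(test))  # 60 20 20
```

The test list is set aside first and never touched during tuning, which mirrors the role the slide assigns to the test set.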
I. Model Validation, Evaluation, and Hyperparameters
• In the absence of a separate validation set, whether to split the training data into train and
validation sets is a design choice that depends on the amount of labeled data and the
model capacity.
• Validation techniques like k-fold cross validation are employed when setting aside a separate
validation set is not feasible.
• The validation process plays a vital role in tuning or selecting the model
parameters. The choice of these parameters affects the model performance, and
hence explicitly understanding the options is critical from an explainability
standpoint.
• To compare and contrast machine learning models it is necessary to use the same
split of train, validation, and test sets to evaluate all the models (with
parameters) using the same performance metric(s).
• Interpretability is also one of the aspects that one should focus on along with
other metrics.
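As a rough sketch of the k-fold cross validation mentioned above, the generator below yields index lists for each fold. The name `k_fold_indices` is hypothetical, and the folds are contiguous and unshuffled, so the data is assumed to be in random order already.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation.

    Folds are contiguous and unshuffled; shuffle the data first if its
    order is not random. Hypothetical helper for illustration.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(10, 3))
for train_idx, val_idx in folds:
    print("validate on", val_idx, "train on", train_idx)
```

Each sample is used for validation exactly once, and the k validation scores are typically averaged to compare models or hyperparameter settings on the same splits.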
II. Model selection and evaluation
• Most machine learning algorithms have parameters that need to be
tuned for optimal performance on a given dataset.
• For example, a decision tree can have different values of “max depth”
and the models corresponding to each such value can exhibit a range of
performance values, measured as accuracy or precision.
• A validation set or cross-validation technique is used to tune these
parameters.
1. Validation Curve:
• A validation curve is a plot of a performance metric (a score) against
different values of a model hyperparameter.
• It is a graphical technique that can be used to measure the influence of
hyperparameters on the model’s performance.
• By looking at this curve, you can determine if the model is underfitting,
overfitting or just-right for some range of hyperparameter values.
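A validation curve can be computed by hand for a toy problem. The sketch below is an assumption-laden illustration, not the experiment behind the figures in this unit: it swaps in a 1-D k-nearest-neighbour classifier (hyperparameter `k`), synthetic data with 15% label noise, and a single hold-out validation set instead of cross-validation.

```python
import random

def knn_predict(train, x, k):
    """Majority vote among the k training points nearest to x (1-D features)."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0

def accuracy(train, data, k):
    """Fraction of points in `data` classified correctly using `train`."""
    return sum(knn_predict(train, x, k) == y for x, y in data) / len(data)

def make_data(n, rng, noise=0.15):
    """Synthetic data: label is 1 when x > 0.5, flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < noise:
            y = 1 - y
        data.append((x, y))
    return data

rng = random.Random(0)
train, val = make_data(80, rng), make_data(40, rng)

# One (training score, validation score) pair per hyperparameter value.
k_values = [1, 3, 5, 9, 15, 25]
curve = [(k, accuracy(train, train, k), accuracy(train, val, k)) for k in k_values]
for k, train_acc, val_acc in curve:
    print(f"k={k:2d}  train={train_acc:.2f}  validation={val_acc:.2f}")
```

With k = 1 every training point is its own nearest neighbour, so the training score is perfect regardless of label noise; a gap between that score and the validation score is the overfitting signature the slide describes.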
II. Model selection and evaluation: Validation Curve
• The validation curve as
in Fig. 3.2a for Decision
Tree shows that at “max
depth” of 4, the
classifier stabilizes to
give optimum AUC of
around 0.88.
• As the depth (and with it the
number of nodes) increases
further, the validation score
remains almost constant while
the training score keeps
increasing, indicating
overfitting.
II. Model selection and evaluation: Validation Curve
• The validation curve
as in Fig. 3.2b for
Logistic Regression
shows best
performance for the
parameter C at 0.1
with AUC value
around 0.842.
• As the C value
increases the
validation score
drops indicating the
region of
overfitting.
• The variance in
validation and
training scores is
very high in Logistic
Regression as
compared to Decision
Tree.
II. Model selection and evaluation: Learning Curve
• A learning curve explains the relationship between a performance metric,
such as accuracy for a classifier, and the number of training samples.
• The learning curve provides various diagnostic insights into the classifier, such as:
    ◦ How many training samples does the classifier/regressor need for an
      optimum performance score in training and validation?
    ◦ Are the samples representative of the domain?
    ◦ Does the bias or the variance introduce error in the classifier/regressor?
    ◦ Does the model have any overfitting / underfitting issues?
• The training and validation learning curves are plotted together so we can look at
the relative metrics to get the overall diagnosis for decision trees and logistic
regression.
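The idea of a learning curve can be sketched with a toy model trained on growing subsets of the data. Everything here is an assumption made for illustration: the nearest-centroid classifier, the synthetic 1-D data with 10% label noise, the chosen training sizes, and the use of a fixed hold-out set rather than cross-validation.

```python
import random

def fit_centroids(data):
    """Nearest-centroid 'model': mean feature value per class (1-D features)."""
    centroids = {}
    for label in {y for _, y in data}:
        xs = [x for x, y in data if y == label]
        centroids[label] = sum(xs) / len(xs)
    return centroids

def predict(centroids, x):
    """Return the label whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))

def accuracy(centroids, data):
    return sum(predict(centroids, x) == y for x, y in data) / len(data)

def make_data(n, rng, noise=0.1):
    """Synthetic data: label is 1 when x > 0.5, flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < noise:
            y = 1 - y
        data.append((x, y))
    return data

rng = random.Random(1)
pool, val = make_data(200, rng), make_data(100, rng)

sizes = [10, 25, 50, 100, 200]
curve = []
for n in sizes:
    model = fit_centroids(pool[:n])  # train on the first n samples only
    curve.append((n, accuracy(model, pool[:n]), accuracy(model, val)))

for n, train_acc, val_acc in curve:
    print(f"n={n:3d}  train={train_acc:.2f}  validation={val_acc:.2f}")
```

Reading the two columns side by side gives the diagnosis discussed above: a persistent gap points to variance error, while two low scores that have converged point to bias error.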
II. Model selection and evaluation: Learning Curve
• The learning curves in
Fig. 3.3a for Decision Tree
show that training and
validation curves are
separated.
• At about 600 samples, the
validation curve trends
downwards.
• There is a large variance
in the cross-validation as
compared to training
indicating variance errors
in predictions rather than
bias errors.
II. Model selection and evaluation: Learning Curve
• The learning curves in Fig. 3.3b for
Logistic Regression show both
training and validation curves
following similar trends and at about
600 samples, showing divergence.
• Similar to the Decision tree, logistic
regression also indicates variance
error.
• The training learning curve for
Logistic Regression also shows
variability and this indicates the bias
error.
• When compared with decision tree, it
can be concluded that the non-linear
decision tree algorithm performs
better indicating the presence of non-
linear boundaries.
• The variance in logistic regression is
more than that of decision tree.
III. CLASSIFICATION MODEL VISUALIZATION
• Model selection happens based on the agreed metrics that vary based on the domain and the
nature of the application.
• For example, in financial services, false negatives have to be minimized (recall-centric), while
in other applications such as fraud detection where there are fewer resources to investigate the
positive hits, false positive minimization becomes imperative (precision-centric).
• Many model governance teams consider model metrics and evaluation results along with the
actual model as an artifact that needs to be documented and reported.
• From a diagnostic and white-boxing perspective, understanding how the model performs in
various scenarios is critical.
• This section discusses some well-known model metrics and how they impact selection, especially
of the classification models.
1. Confusion matrix and Classification report
2. Receiver Operating Characteristic (ROC) and Area Under Curve (AUC)
3. Precision-Recall curve (PRC)
4. Discrimination Thresholds.
III. CLASSIFICATION MODEL VISUALIZATION: 1. Confusion matrix
• Precision, recall, F1-score, and accuracy are metrics commonly used to evaluate the
performance of classification models. They are derived from the confusion matrix, which
is a table that summarizes the performance of a classification model on a set of test data.
• True Positive (TP): Number of correctly predicted positive instances.
• True Negative (TN): Number of correctly predicted negative instances.
• False Positive (FP): Number of incorrectly predicted positive instances (Type I error).
• False Negative (FN): Number of incorrectly predicted negative instances (Type II error).
                   Predicted Negative     Predicted Positive
Actual Negative    True Negative (TN)     False Positive (FP)
Actual Positive    False Negative (FN)    True Positive (TP)
III. CLASSIFICATION MODEL VISUALIZATION: 1. Confusion matrix
Precision: Precision is the proportion of correctly predicted positive instances (True Positives) out of all
instances predicted as positive (True Positives + False Positives).
Recall (Sensitivity): Recall is the proportion of correctly predicted positive instances (True Positives) out of
all actual positive instances (True Positives + False Negatives).
F1-score: F1-score is the harmonic mean of precision and recall, providing a balance between the two
metrics. It's particularly useful when the classes are imbalanced.
Accuracy: Accuracy is the proportion of correctly predicted instances (True Positives + True Negatives) out
of all instances.
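The four definitions above translate directly into code. The function below is a minimal sketch: the name `classification_metrics`, the choice to return 0.0 when a denominator is zero, and the example counts are assumptions made for illustration.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return precision, recall, F1-score and accuracy from confusion-matrix counts.

    Returns 0.0 for a metric whose denominator is zero (a convention
    assumed here, not a universal rule).
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Illustrative counts, assumed for this example only.
m = classification_metrics(tp=8, tn=9, fp=2, fn=1)
for name, value in m.items():
    print(f"{name:9s} = {value:.3f}")
```

For these counts precision is 8/10, recall is 8/9, and accuracy is 17/20; the F1-score, 16/19, sits between precision and recall, as a harmonic mean must.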
CLASSIFICATION METRICS
• Recall is a metric that measures how often a machine learning model
correctly identifies positive instances (true positives) from all the
actual positive samples in the dataset. In other words, recall
answers the question: can an ML model find all instances of the
positive class?
  $\text{Recall} = \frac{TP}{TP + FN}$
• Precision is a metric that measures how often a machine learning model
correctly predicts the positive class. In other words, precision
answers the question: how often are the positive predictions
correct?
  $\text{Precision} = \frac{TP}{TP + FP}$
• F1 score is a useful metric for measuring the performance of classification
models when you have imbalanced data because it considers the type of
errors (false positives and false negatives) and not just the number of
predictions that were incorrect. This makes it useful in areas like fraud prevention.
  $F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
III. CLASSIFICATION MODEL VISUALIZATION: 1. Confusion matrix
1) Suppose we have a binary classification model for detecting spam emails. After evaluating the model,
we obtain the following confusion matrix. Compute precision, recall, F1-score, and accuracy, and
analyze the results.
                    Predicted Not Spam    Predicted Spam
Actual Not Spam     1150                  30
Actual Spam         20                    800
Ans: TN=1150, TP=800, FN=20, FP=30
• $\text{Precision} = \frac{TP}{TP + FP} = \frac{800}{830} \approx 0.964$ (out of all instances predicted as spam, 96.4% were actually spam)
• $\text{Recall} = \frac{TP}{TP + FN} = \frac{800}{820} \approx 0.976$ (correctly identified 97.6% of all actual spam emails)
• $F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \approx 0.970$ (indicates high precision and recall)
• $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} = \frac{1950}{2000} = 0.975$ (correctly classified 97.5% of all instances)
III. CONFUSION MATRIX
2) A healthcare AI model predicts whether a patient has a disease (positive) or
not (negative). Compute accuracy, precision, recall, and F1-score for the
following confusion matrix. Analyze the results.
              Disease    No Disease
Disease       80         90
No Disease    20         90
3) A financial fraud detection model predicts whether a transaction is fraudulent
(positive) or not (negative). Results on a test dataset are shown below. Compute
accuracy, precision, recall, and F1-score for the following confusion matrix. Analyze the
results.
             Not Fraud    Fraud
Not Fraud    950          10
Fraud        50           40
III. CONFUSION MATRIX: SOLUTION 2
4) For the following confusion matrix, compute accuracy,
recall, specificity, precision, and F1-score.
                  Predicted Class=1    Predicted Class=0
Actual Class=1    45                   32
Actual Class=0    15                   23
TP=45, TN=23, FP=15, FN=32
Recall/Sensitivity/TPR = $\frac{TP}{TP + FN} = \frac{45}{77} \approx 0.584$
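Assuming the rows of the matrix are the actual classes and the columns the predicted classes, with Class=1 as the positive class (which is what the TP/TN/FP/FN labels on the slide imply), the five requested metrics can be worked out as below. The specificity formula, TN / (TN + FP), is the standard definition and is not spelled out elsewhere in this unit.

```latex
\begin{align*}
\text{Accuracy}    &= \frac{TP + TN}{TP + TN + FP + FN} = \frac{45 + 23}{115} = \frac{68}{115} \approx 0.591 \\
\text{Recall}      &= \frac{TP}{TP + FN} = \frac{45}{45 + 32} = \frac{45}{77} \approx 0.584 \\
\text{Specificity} &= \frac{TN}{TN + FP} = \frac{23}{23 + 15} = \frac{23}{38} \approx 0.605 \\
\text{Precision}   &= \frac{TP}{TP + FP} = \frac{45}{45 + 15} = \frac{45}{60} = 0.750 \\
F_1                &= \frac{2\,TP}{2\,TP + FP + FN} = \frac{90}{137} \approx 0.657
\end{align*}
```

The model is right about three out of four times when it predicts Class=1, but it misses roughly 42% of the actual Class=1 instances, so recall is the weak point.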
Model Regression Metrics
• Model regression metrics are used to evaluate
the performance of regression models, which
predict continuous numerical values rather
than discrete categories.
• These metrics help assess how well a model's
predictions align with the actual target
values, providing insights into the accuracy,
error, and reliability of the model.
• A good model should have low error (i.e. the
predicted value should be as close as possible to
the actual value).
Sum of Squared Error:
• Sum of Squared Error: $SSE = \sum_{i=1}^{n} (a_i - p_i)^2$ [$a_i$ is the actual target (output) value
and $p_i$ is the predicted target (output) value]
• The Sum of Squared Error (SSE) measures the total
squared difference between actual and predicted
values in a regression model.
• It reflects the overall error magnitude and helps evaluate
how well a model fits the data.
• Real-World Example: In predicting housing prices, SSE
indicates how much the predicted prices deviate from the
actual prices across all properties.
• SSE: Useful for understanding the total error magnitude,
especially in comparing different models trained on the
same dataset.
Mean Squared Error:
• MSE tells us how far a model’s predictions are from the actual
values, on average. It squares the errors (differences
between actual and predicted values), so larger
mistakes are penalized more.
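• Mean Squared Error: $MSE = \frac{1}{n} \sum_{i=1}^{n} (a_i - p_i)^2 = \frac{SSE}{n}$, where $n$ is the number of samples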
• Penalizes Big Errors: Squaring the errors gives more weight
to larger mistakes, so MSE highlights where the model is
performing poorly.
• Measures Model Accuracy: A smaller MSE means the model
is making better predictions.
• MSE is preferred when evaluating how well the model performs
on average, or when comparing models on datasets of
different sizes.
Root Mean Squared Error:
• The Root Mean Squared Error (RMSE) measures the average
error in predictions, but unlike MSE, it expresses the error in the
same units as the target variable, making it easier to
interpret in real-world terms.
• Root Mean Squared Error: $RMSE = \sqrt{MSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (a_i - p_i)^2}$
• Real-World Example: Predicting Daily Temperature
• Goal: Predict daily high temperatures for the next week.
• Actual Temperatures (°C): [30, 32, 28, 33, 29, 31, 34]
• Predicted Temperatures (°C): [29, 33, 27, 32, 30, 31, 35]
• RMSE=0.926°C
• MSE gives the average squared error, which is less interpretable
because it’s in squared units (e.g., °C²).
• RMSE removes the squaring by taking the square root, making
the error easier to understand in real-world terms.
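The temperature example above can be checked with a few lines of Python. This is a plain-Python sketch using only the numbers given on the slide; the variable names are chosen for the example.

```python
import math

actual = [30, 32, 28, 33, 29, 31, 34]      # actual daily highs (°C)
predicted = [29, 33, 27, 32, 30, 31, 35]   # predicted daily highs (°C)

squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
mse = sum(squared_errors) / len(actual)    # in squared units (°C²)
rmse = math.sqrt(mse)                      # back in °C

print(f"MSE  = {mse:.3f}")   # MSE  = 0.857
print(f"RMSE = {rmse:.3f}")  # RMSE = 0.926
```

Six of the seven predictions are off by exactly 1 °C and one is exact, so the sum of squared errors is 6 and the RMSE of about 0.93 °C reads naturally as "typically off by just under one degree".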
Relative Squared Error: RSE
• The Relative Squared Error (RSE) measures how well a model
explains the variability of the target variable by comparing the model's
squared errors (SSE) to the total variability in the actual data (SST, the
sum of squared deviations of the actual values from their mean):
$RSE = \frac{SSE}{SST}$. It can be expressed as a ratio or a percentage,
making it easy to interpret.
• Scenario: You’re building a model to predict household electricity bills (in
₹) based on appliance usage.
• Actual Bills (₹): [3000, 3200, 2800, 3100, 2900]
• Predicted Bills (₹): [3100, 3150, 2700, 3050, 2950]
• Here SSE = 27,500 and SST = 100,000 (mean bill = ₹3000), so RSE = 27.5%.
This means the model's squared prediction error (SSE) amounts to 27.5% of
the total variability in the electricity bills. The remaining 72.5%
(100% - 27.5%) of the variability is explained by the model.
• Variability refers to how much the actual data points (electricity bills in
this case) deviate from their average (mean). Higher variability means
the bills are more spread out, while lower variability means they are
closer to the mean.
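Under the definition RSE = SSE / SST that the worked problems in this unit use, the value for the electricity-bill data above can be computed directly. This is a sketch with variable names chosen for the example.

```python
actual = [3000, 3200, 2800, 3100, 2900]      # actual bills (₹)
predicted = [3100, 3150, 2700, 3050, 2950]   # predicted bills (₹)

mean_actual = sum(actual) / len(actual)
sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))   # squared prediction error
sst = sum((a - mean_actual) ** 2 for a in actual)            # total variability around the mean
rse = sse / sst

print(sse, sst, rse)  # 27500 100000.0 0.275
```

The ratio is 0.275, so about 27.5% of the total variability is left unexplained. Its square root, about 0.524, is a different quantity (sometimes called the root relative squared error) and should not be read as a share of variability.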
Mean Absolute Error:
• The Mean Absolute Error (MAE) measures the average
magnitude of the errors in a model's predictions, without
considering their direction (positive or negative). It is easy to
interpret because it is in the same unit as the target
variable.
• Mean Absolute Error: $MAE = \frac{1}{n} \sum_{i=1}^{n} |a_i - p_i|$
• Predicting Monthly Taxi Fares: You are building a model
to predict taxi fares (in ₹) based on trip distance, time of
day, and other factors.
• Actual Fares (₹): [250, 300, 400, 350, 200]
• Predicted Fares (₹): [240, 310, 390, 360, 210]
• MAE = ₹10 means that, on average, the model's
predictions are ₹10 off from the actual taxi fares.
Relative Absolute Error:
• Relative Absolute Error (RAE) is a metric that
measures the total absolute error of a model’s
predictions relative to the total absolute
deviation of the actual data from its mean. It
expresses how well the model performs in comparison
to a simple baseline model that always predicts the
mean of the target variable.
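• Relative Absolute Error: $RAE = \frac{\sum_{i=1}^{n} |a_i - p_i|}{\sum_{i=1}^{n} |a_i - \bar{a}|}$, where $\bar{a}$ is the mean of the actual values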
• RAE is unitless and expressed as a ratio or percentage,
making it easy to compare across different datasets or
models.
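To make the contrast between MAE and RAE concrete, the sketch below reuses the taxi-fare numbers from the MAE slide. The RAE value it produces is computed here for illustration and does not appear on the slides.

```python
actual = [250, 300, 400, 350, 200]      # actual fares (₹)
predicted = [240, 310, 390, 360, 210]   # predicted fares (₹)

n = len(actual)
mean_actual = sum(actual) / n           # baseline prediction: always the mean (₹300)
abs_errors = [abs(a - p) for a, p in zip(actual, predicted)]

mae = sum(abs_errors) / n
rae = sum(abs_errors) / sum(abs(a - mean_actual) for a in actual)

print(mae)            # 10.0
print(round(rae, 3))  # 0.167
```

MAE carries the unit of the target (₹10), while RAE is unitless: a value of about 0.17 says the model's total absolute error is roughly one sixth of what the always-predict-the-mean baseline would incur.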
Coefficient of Determination (R2)
• The coefficient of
determination (R-squared
or R2), is a statistical metric
that explains how much of
the variability in the
dependent variable (the
variable you're trying to
predict) can be explained by
the independent variables
(the predictors) in a regression
model.
• R2 tells you how well the
model fits the data.
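In the notation used for the other metrics in this unit ($a_i$ actual, $p_i$ predicted, $\bar{a}$ the mean of the actual values), R2 can be written as one minus the relative squared error:

```latex
R^2 = 1 - \frac{SSE}{SST}
    = 1 - \frac{\sum_{i=1}^{n} (a_i - p_i)^2}{\sum_{i=1}^{n} (a_i - \bar{a})^2}
    = 1 - RSE
```

So R2 = 1 means perfect predictions, R2 = 0 means the model does no better than always predicting the mean, and a negative R2 means it does worse than that baseline.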
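Model Regression Metrics
($a_i$: actual target (output) value; $p_i$: predicted target (output) value; $\bar{a}$: mean of $a$)
i) Sum of Squared Error: $SSE = \sum_{i=1}^{n} (a_i - p_i)^2$
ii) Mean Squared Error: $MSE = \frac{SSE}{n}$
iii) Root Mean Squared Error: $RMSE = \sqrt{MSE}$
iv) Relative Squared Error: $RSE = \frac{SSE}{SST}$
v) Mean Absolute Error: $MAE = \frac{1}{n} \sum_{i=1}^{n} |a_i - p_i|$
vi) Relative Absolute Error: $RAE = \frac{\sum_{i=1}^{n} |a_i - p_i|}{\sum_{i=1}^{n} |a_i - \bar{a}|}$
vii) R-Squared (R2) = $1 - \frac{SSE}{SST}$
(where $SST = \sum_{i=1}^{n} (a_i - \bar{a})^2$ is the sum of squares total)
All of them range from 0 to $\infty$, except R2, which ranges from $-\infty$ to 1.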
01/09/2025 29
REGRESSION PROBLEMS
1) Compute SSE, MSE, RMSE, RSE, MAE, RAE and R2 on the following data:
Actual output value (a)    Predicted output value (p)
10                         12
20                         14
14                         18
16                         17
19                         34
21                         25
25                         27
Ans:
a      p      |a - p|    (a - p)²     |a - ā|    (a - ā)²
10     12     2          4            7.85       61.62
20     14     6          36           2.15       4.62
14     18     4          16           3.85       14.82
16     17     1          1            1.85       3.42
19     34     15         225          1.15       1.32
21     25     4          16           3.15       9.92
25     27     2          4            7.15       51.12
Sum           34         302 (SSE)    27.15      146.84 (SST)
i) SSE = 302
ii) MSE = 302 / 7 = 43.14
iii) RMSE = $\sqrt{43.14}$ = 6.56
iv) RSE = 302 / 146.84 = 2.05
v) MAE = 34 / 7 = 4.85
vi) RAE = 34 / 27.15 = 1.25
vii) R2 = 1 - 302 / 146.84 = -1.05
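Hand computations like the one above can be cross-checked with a small function. The sketch below is one possible implementation, with the name `regression_metrics` and the dictionary layout chosen for the example; it follows the definitions RSE = SSE / SST and R2 = 1 - SSE / SST that the answers on this slide use.

```python
import math

def regression_metrics(actual, predicted):
    """Compute SSE, MSE, RMSE, RSE, MAE, RAE and R2 for paired value lists."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(actual)
    mean_a = sum(actual) / n
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sst = sum((a - mean_a) ** 2 for a in actual)    # total squared deviation from the mean
    sae = sum(abs(a - p) for a, p in zip(actual, predicted))
    sad = sum(abs(a - mean_a) for a in actual)      # total absolute deviation from the mean
    if sst == 0:
        raise ValueError("relative metrics are undefined when all actual values are equal")
    mse = sse / n
    return {
        "SSE": sse, "MSE": mse, "RMSE": math.sqrt(mse),
        "RSE": sse / sst, "MAE": sae / n, "RAE": sae / sad,
        "R2": 1 - sse / sst,
    }

a = [10, 20, 14, 16, 19, 21, 25]
p = [12, 14, 18, 17, 34, 25, 27]
metrics = regression_metrics(a, p)
for name, value in metrics.items():
    print(f"{name:4s} = {value:.3f}")
```

Because the code keeps full precision for the mean, its output can differ from the hand-worked figures in the last digit (for example RMSE ≈ 6.57 and RSE ≈ 2.06); the conclusions are unchanged.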
ANALYSIS OF RESULTS OF PROBLEM-1
• SSE = 302: The model has a substantial total error, indicating a
significant difference between predicted and actual values.
• MSE = 43.14: The average squared error is relatively high,
suggesting a poor fit of the model to the data.
• RMSE = 6.56: The average prediction error is moderate, meaning
the model's predictions deviate by about 6.56 units on average.
• RSE = 2.05: The model's performance is worse than predicting the
mean, as its squared error is more than twice the total squared
deviation of the data from its mean.
• MAE = 4.85: The average absolute error is moderate, implying that
on average, the model's predictions are off by about 4.85 units.
• RAE = 1.25: The model's total absolute error is 1.25 times the total
absolute deviation of the data from its mean, indicating poor
predictive performance.
• R² = -1.05: The model performs worse than simply predicting the
mean of the data, indicating a very poor fit.
REGRESSION PROBLEMS
2) Compute SSE, MSE, RMSE, RSE, MAE, RAE and R2 on the following data:
• Original (a) = (-2, 1, -3, 2, 3, 5, 4, 6, 5, 6, 7)
• Predicted (p) = (-1, -1, -2, 2, 3, 4, 4, 5, 5, 7, 7)
Ans:
a      p      |a - p|    (a - p)²    |a - ā|    (a - ā)²
-2     -1     1          1           5.09       25.9
1      -1     2          4           2.09       4.36
-3     -2     1          1           6.09       37.08
2      2      0          0           1.09       1.18
3      3      0          0           0.09       0.0081
5      4      1          1           1.91       3.64
4      4      0          0           0.91       0.82
6      5      1          1           2.91       8.46
5      5      0          0           1.91       3.64
6      7      1          1           2.91       8.46
7      7      0          0           3.91       15.28
Sum           7          9 (SSE)     28.91      108.82 (SST)
i) SSE = 9
ii) MSE = 9 / 11 = 0.818
iii) RMSE = $\sqrt{0.818}$ = 0.904
iv) RSE = 9 / 108.82 = 0.082
v) MAE = 7 / 11 = 0.63
vi) RAE = 7 / 28.91 = 0.242
vii) R2 = 1 - 9 / 108.82 = 0.918
ANALYSIS OF RESULTS OF PROBLEM-2
• SSE = 9: The total squared error is low, suggesting the model has a
relatively small total discrepancy between predicted and actual values.
• MSE = 0.818: The average squared error is low, indicating that the
model’s predictions are close to the actual values on average.
• RMSE = 0.904: The root of MSE indicates that, on average, the model's
predictions are off by about 0.904 units, showing reasonable accuracy.
• RSE = 0.082: The relative squared error is very low, indicating that the
model's error is minimal compared to the total variance in the data,
reflecting a good fit.
• MAE = 0.63: The average absolute error is low, meaning the model's
predictions are typically within 0.63 units of the actual values.
• RAE = 0.242: The relative absolute error is small, showing that the
model's total absolute error is about a quarter of the total absolute
deviation of the data from its mean, suggesting good performance.
• R² = 0.918: The model explains 91.8% of the variance in the data, which
indicates a very strong fit and high predictive accuracy.