Introduction to Machine Learning
Course Teacher:
Dr. M. Shahidur Rahman
Professor, DoCSE, SUST
Model Performance and Evaluation Metrics
Topics covered:
Evaluation Metrics
Model Performance Evaluation
Model Selection
Model Performance and Evaluation Metrics
In classification, the simplest summary of a model's success is the
confusion matrix, which tabulates predicted classes against actual classes.
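As a minimal sketch of how the four cells of a binary confusion matrix can be counted, the snippet below tallies true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN). The function name `confusion_counts` and the label lists are made up for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) for a binary classification problem."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1   # predicted positive, actually positive
            else:
                fp += 1   # predicted positive, actually negative
        else:
            if actual == positive:
                fn += 1   # predicted negative, actually positive
            else:
                tn += 1   # predicted negative, actually negative
    return tp, fp, fn, tn


# Illustrative labels (not from the lecture): 1 = positive, 0 = negative.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")  # TP=3 FP=1 FN=1 TN=5
```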
Evaluation Metrics
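Accuracy: (TP + TN) / (TP + TN + FP + FN)
True positive rate (TPR), also called recall, hit rate, or sensitivity: TP / (TP + FN)
Precision, or positive predictive value: TP / (TP + FP)
F1 score: 2 * Precision * Recall / (Precision + Recall)
Here TP, TN, FP, and FN are the counts of true positives, true negatives,
false positives, and false negatives in the confusion matrix.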
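The four metrics listed here can be computed directly from confusion-matrix counts. The sketch below uses the standard definitions, stated in each docstring; the function names and the example counts are illustrative assumptions.

```python
def accuracy(tp, fp, fn, tn):
    """(TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + fp + fn + tn)


def recall(tp, fn):
    """True positive rate: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp, fp):
    """Positive predictive value: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Illustrative counts: 3 true positives, 1 false positive,
# 1 false negative, 5 true negatives.
tp, fp, fn, tn = 3, 1, 1, 5
print("accuracy :", accuracy(tp, fp, fn, tn))  # 0.8
print("recall   :", recall(tp, fn))            # 0.75
print("precision:", precision(tp, fp))         # 0.75
print("F1       :", f1_score(tp, fp, fn))      # 0.75
```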
Evaluation Metrics…
Specificity, or true negative rate (TNR): TN / (TN + FP)
Miss rate, or false negative rate (FNR): FN / (FN + TP)
False positive rate (FPR): FP / (FP + TN)
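These three rates follow from the same confusion-matrix counts. The sketch below, with illustrative function names and counts, also shows that FPR is the complement of specificity and that the miss rate is the complement of recall.

```python
def specificity(tn, fp):
    """True negative rate: TN / (TN + FP)."""
    return tn / (tn + fp) if (tn + fp) else 0.0


def miss_rate(fn, tp):
    """False negative rate: FN / (FN + TP)."""
    return fn / (fn + tp) if (fn + tp) else 0.0


def false_positive_rate(fp, tn):
    """FP / (FP + TN)."""
    return fp / (fp + tn) if (fp + tn) else 0.0


# Illustrative counts (not from the lecture).
tp, fp, fn, tn = 3, 1, 1, 5

tnr = specificity(tn, fp)
fnr = miss_rate(fn, tp)
fpr = false_positive_rate(fp, tn)
tpr = tp / (tp + fn)

print("TNR + FPR =", tnr + fpr)  # the two rates over actual negatives sum to 1
print("TPR + FNR =", tpr + fnr)  # the two rates over actual positives sum to 1
```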
Evaluation Metrics…
Accuracy and classification error are informative measures of success
when the data is balanced in terms of the classes.
When the data is imbalanced, i.e., one class makes up a much larger
proportion of the dataset than the other, these measures become
biased towards the majority class and give a misleading estimate of success.
In such cases, base measures, such as true positive rate (TPR), false
positive rate (FPR), true negative rate (TNR), and false negative rate (FNR),
become useful.
Metrics such as the F1 score combine the base measures to give an overall
measure of success.
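A small worked example makes the imbalance problem concrete. The dataset below is hypothetical: 990 negatives, 10 positives, and a trivial "classifier" that always predicts the majority class. Accuracy looks excellent while TPR and F1 reveal that no positive is ever found.

```python
# Hypothetical imbalanced dataset: 990 negatives (0) and 10 positives (1).
y_true = [0] * 990 + [1] * 10
# A trivial classifier that always predicts the majority (negative) class.
y_pred = [0] * 1000

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

accuracy = (tp + tn) / len(y_true)
tpr = tp / (tp + fn)
# Precision is undefined when nothing is predicted positive; 0.0 is used here.
precision = tp / (tp + fp) if (tp + fp) else 0.0
f1 = 2 * precision * tpr / (precision + tpr) if (precision + tpr) else 0.0

print("accuracy:", accuracy)  # 0.99
print("TPR     :", tpr)       # 0.0
print("F1      :", f1)        # 0.0
```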
Evaluation Metrics…
The curve that plots TPR against FPR for a classifier at various thresholds
is known as the receiver operating characteristic (ROC) curve.
Precision and recall can likewise be plotted at different thresholds, giving
the precision-recall curve (PRC).
The areas under these curves are known as auROC and auPRC, respectively,
and are popular metrics of performance.
In particular, auPRC is generally considered to be an informative metric
in the presence of imbalanced classes.
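To illustrate how a threshold sweep produces the points of both curves, the sketch below classifies each example as positive when its score is at least the threshold, then records (FPR, TPR) for the ROC curve and (recall, precision) for the PRC. The scores, labels, thresholds, and the function name `rates_at_threshold` are assumptions made for this example, and both classes must be present in the data.

```python
def rates_at_threshold(y_true, scores, threshold):
    """Predict positive when score >= threshold; return (TPR, FPR, precision)."""
    tp = fp = fn = tn = 0
    for label, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 1:
            fn += 1
        else:
            tn += 1
    tpr = tp / (tp + fn)  # recall
    fpr = fp / (fp + tn)
    # Convention assumed here: precision is 1.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    return tpr, fpr, precision


# Illustrative classifier scores (higher = more likely positive).
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

roc_curve = []  # (FPR, TPR) points
pr_curve = []   # (recall, precision) points
for threshold in [0.85, 0.65, 0.5, 0.1]:
    tpr, fpr, precision = rates_at_threshold(y_true, scores, threshold)
    roc_curve.append((fpr, tpr))
    pr_curve.append((tpr, precision))
    print(f"threshold={threshold}: TPR={tpr:.2f} FPR={fpr:.2f} precision={precision:.2f}")
```

Lowering the threshold moves along both curves: recall can only increase, while FPR increases and precision generally falls.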
ROC AUC
A perfect classifier would fall into
the top-left corner of the graph
with a TPR of 1 and an FPR of 0.
Based on the ROC curve, we
compute the ROC area under the
curve (ROC AUC) to characterize
the performance of a classification
model.
Higher ROC AUC means better
classification performance.
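One way to compute ROC AUC, sketched below under the assumption of binary labels with both classes present, is to sort examples by score, trace the ROC points from (0, 0) to (1, 1), and integrate with the trapezoidal rule. The function names and the data are illustrative, not a library API.

```python
def roc_points(y_true, scores):
    """Return ROC points (FPR, TPR), one per distinct score, from (0, 0) to (1, 1)."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    # Visit examples from the highest score to the lowest.
    ranked = sorted(zip(scores, y_true), reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last example with this score (handles ties).
        if i == len(ranked) - 1 or ranked[i + 1][0] != score:
            points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Illustrative scores: one negative (0.7) is ranked above one positive (0.6).
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = area_under_curve(roc_points(y_true, scores))
print("ROC AUC:", round(auc, 4))

# A classifier that ranks every positive above every negative reaches AUC = 1.
perfect_auc = area_under_curve(roc_points([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]))
print("perfect ROC AUC:", perfect_auc)
```

The first value equals 8/9 because 8 of the 9 positive-negative pairs are ranked correctly, which is the usual pairwise reading of ROC AUC.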
Regression Evaluation Metrics
Average prediction error:
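Mean absolute error (MAE): (1/n) Σ |y_i − ŷ_i|
Root mean squared error (RMSE): sqrt((1/n) Σ (y_i − ŷ_i)²)
Relative squared error (RSE), used to compare errors that are measured in
different units: Σ (y_i − ŷ_i)² / Σ (y_i − ȳ)²
Here y_i is the actual value, ŷ_i the predicted value, ȳ the mean of the
actual values, and n the number of examples.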
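The sketch below computes MAE, RMSE, and RSE using their standard definitions, restated in the docstrings. The target and prediction values are made up; RSE divides the model's squared error by that of a predictor that always outputs the mean, which makes it unit-free.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """(1/n) * sum of |y_i - yhat_i|."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    """Square root of (1/n) * sum of (y_i - yhat_i)^2."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true))


def relative_squared_error(y_true, y_pred):
    """Sum of squared errors divided by the squared error of the mean predictor."""
    mean_y = sum(y_true) / len(y_true)
    numerator = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    denominator = sum((y - mean_y) ** 2 for y in y_true)
    return numerator / denominator


# Illustrative values (not from the lecture).
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 8.0]

mae = mean_absolute_error(y_true, y_pred)
rmse = root_mean_squared_error(y_true, y_pred)
rse = relative_squared_error(y_true, y_pred)
print(f"MAE={mae:.3f} RMSE={rmse:.3f} RSE={rse:.3f}")
```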
Ratio for partitioning a dataset into training and
test datasets
In general, we don't want to allocate too much data to the test set, because
withheld examples cannot be used for training.
However, the smaller the test set, the less accurate the estimate of the
generalization error.
Dividing a dataset into training and test datasets is all about balancing this
tradeoff.
In practice, the most commonly used splits are 60:40, 70:30, or 80:20,
depending on the size of the initial dataset.
For large datasets, 90:10 or 99:1 splits are also common and appropriate.
For example, if the dataset contains more than 100,000 training examples, it
might be fine to withhold only 10,000 examples for testing in order to get a
good estimate of the generalization performance.
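A minimal sketch of such a split in plain Python is shown below. The helper name `train_test_split`, its parameters, and the toy dataset are assumptions for illustration; the examples are shuffled first so that the test set is not biased by the original ordering.

```python
import random


def train_test_split(examples, test_ratio=0.2, seed=0):
    """Shuffle the examples and hold out a fraction of them for testing."""
    if not 0.0 < test_ratio < 1.0:
        raise ValueError("test_ratio must be between 0 and 1")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    n_test = max(1, round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


# Toy dataset of 100 example ids, split 80:20.
data = list(range(100))
train, test = train_test_split(data, test_ratio=0.2)
print(len(train), len(test))  # 80 20
```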
Underfitting and overfitting
A model can suffer from underfitting (high bias), which means that the
model is not complex enough to capture the pattern in the training
data well and therefore also performs poorly on unseen data.
If a model is too complex for a given training dataset (it has too
many parameters), it tends to overfit the training data (high variance)
and does not generalize well to unseen data.
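The two failure modes can be illustrated with a toy experiment. The sketch below uses k-nearest-neighbor regression as a stand-in for model complexity (this is an assumption of the example, not the lecture's model): k = 1 memorizes every training point, while k equal to the training-set size always predicts the global mean. The data is hypothetical, a linear trend y = 2x with an alternating disturbance of +1 or -1 in the training targets.

```python
def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def mean_squared_error(train, data, k):
    """MSE of k-NN predictions (fitted on `train`) over the points in `data`."""
    return sum((y - knn_predict(train, x, k)) ** 2 for x, y in data) / len(data)


# Hypothetical training data: y = 2x plus an alternating +1 / -1 disturbance.
train = [(float(x), 2.0 * x + (1.0 if x % 2 == 0 else -1.0)) for x in range(10)]
# Unseen points lie between the training points and follow the clean trend y = 2x.
unseen = [(x + 0.5, 2.0 * (x + 0.5)) for x in range(9)]

results = {}
for k in (1, 2, len(train)):
    results[k] = (mean_squared_error(train, train, k),
                  mean_squared_error(train, unseen, k))
    print(f"k={k:2d}  train MSE={results[k][0]:.2f}  unseen MSE={results[k][1]:.2f}")
```

In this toy setup, k = 1 has zero training error but a nonzero error on unseen points (overfitting), k = 10 has a large error on both sets (underfitting), and the intermediate k = 2 gives the lowest error on the unseen points.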
Debugging algorithms with learning and
validation curves
Hyperparameter tuning
Validation techniques are meant to answer the question of how to select
a model with the right hyperparameter values.
Hyperparameters are parameters set before training a machine learning
model. They are not learned from the data but are configured manually to
optimize model performance.
Examples: learning rate (𝛼), number of trees, kernel type in an SVM.
Hyperparameter C is the inverse regularization parameter of the
LogisticRegression classifier; in the example shown, C=1 provides the best
performance.
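The difference between a learned parameter and a hyperparameter can be sketched with a deliberately tiny model. The example below is an assumption for illustration, not the LogisticRegression setup mentioned here: a one-dimensional ridge regression y ≈ w·x, where the weight w is learned from training data in closed form and the inverse regularization strength C (with penalty 1/C) is a hyperparameter chosen by its error on a separate validation set. All data values are made up.

```python
def fit_ridge_1d(xs, ys, lam):
    """Learn w for y ~ w * x by minimizing sum((y - w*x)^2) + lam * w^2."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def mse(w, xs, ys):
    """Mean squared error of the model y = w * x."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical noisy training data and clean validation data for y = 2x.
train_x, train_y = [1.0, 2.0, 3.0], [2.2, 3.9, 6.3]
val_x, val_y = [4.0, 5.0], [8.0, 10.0]

validation_error = {}
for C in [0.01, 0.1, 1, 10, 100]:      # candidate hyperparameter values
    w = fit_ridge_1d(train_x, train_y, lam=1.0 / C)   # w is learned from data
    validation_error[C] = mse(w, val_x, val_y)
    print(f"C={C:<6} learned w={w:.4f}  validation MSE={validation_error[C]:.4f}")

best_C = min(validation_error, key=validation_error.get)
print("best C:", best_C)
```

Plotting the validation error against C would give a validation curve for this toy model; which value of C comes out best depends entirely on the data.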
Holdout cross-validation
A common approach for estimating the generalization performance of ML
models is holdout cross-validation.
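A minimal sketch of a holdout split follows, assuming the common variant in which the data is separated into a training set (for fitting), a validation set (for choosing hyperparameters), and a test set (for the final performance estimate). The helper name `holdout_split`, the 60:20:20 ratios, and the toy data are illustrative assumptions.

```python
import random


def holdout_split(examples, val_ratio=0.2, test_ratio=0.2, seed=0):
    """Shuffle once, then carve out disjoint test, validation, and training sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    n_test = round(len(shuffled) * test_ratio)
    n_val = round(len(shuffled) * val_ratio)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


data = list(range(50))
train, val, test = holdout_split(data)
print(len(train), len(val), len(test))  # 30 10 10
```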
K-fold cross-validation
The validation process needs a
large number of labeled data
points for creating the training
set and the validation set.
Collecting a large labeled set is
usually difficult.
In such cases, instead of
physically separating the training
set and validation set, k-fold
cross-validation is used.
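The core of k-fold cross-validation is the index bookkeeping: the data is partitioned into k folds, and each fold serves as the validation set exactly once while the remaining folds form the training set. The generator below is a sketch with an illustrative name; it assumes the examples have already been shuffled.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) for each of k folds over indices 0..n-1."""
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(n=10, k=3))
for train_idx, val_idx in folds:
    print("train:", train_idx, "validation:", val_idx)
```

Averaging the validation score over the k folds gives the cross-validated performance estimate.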
K-fold cross-validation…
Once we have found satisfactory hyperparameter values, we can retrain
the model on the complete training dataset and obtain a final
performance estimate using the independent test dataset.
A typical value of k in k-fold cross-validation is k = 10.
A special case of k-fold cross-validation is the leave-one-out
cross-validation (LOOCV) method, where k = n, the number of training examples.
It is recommended for working with very small datasets.
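LOOCV can be sketched in a few lines: each example is held out once, the model is fitted on the remaining n - 1 examples, and the n errors are averaged. The "model" below is an assumption chosen for brevity, a predictor that always outputs the mean of its training targets, and the target values are made up.

```python
def loocv_mse(ys):
    """Leave-one-out estimate of the squared error of a mean predictor."""
    n = len(ys)
    squared_errors = []
    for i in range(n):
        training = ys[:i] + ys[i + 1:]               # all examples except the i-th
        prediction = sum(training) / len(training)   # "fit" on n - 1 examples
        squared_errors.append((ys[i] - prediction) ** 2)
    return sum(squared_errors) / n


targets = [1.0, 2.0, 3.0, 4.0]
estimate = loocv_mse(targets)
print("LOOCV MSE:", round(estimate, 4))
```

Because the model is refitted n times, LOOCV is affordable mainly when n is small.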
Model Selection
Use holdout or k-fold cross-validation to fine-tune the
performance of an ML model by varying its hyperparameter values.
Choose the model that performs best on a relevant criterion such as
accuracy.
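Putting the pieces together, the sketch below selects a hyperparameter by cross-validated accuracy. Everything in it is an illustrative assumption: the model is a one-dimensional k-nearest-neighbor classifier, the hyperparameter is the number of neighbors, the folds are interleaved by index, and the dataset is hypothetical.

```python
from collections import Counter


def knn_classify(train, x, k):
    """Majority vote among the k training points closest to x (1-D features)."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def cross_val_accuracy(data, k_neighbors, n_folds):
    """Mean validation accuracy over n_folds interleaved folds."""
    fold_scores = []
    for fold in range(n_folds):
        val = [p for i, p in enumerate(data) if i % n_folds == fold]
        train = [p for i, p in enumerate(data) if i % n_folds != fold]
        correct = sum(knn_classify(train, x, k_neighbors) == label for x, label in val)
        fold_scores.append(correct / len(val))
    return sum(fold_scores) / n_folds


# Hypothetical 1-D dataset: class 0 on the left, class 1 on the right,
# with two deliberately noisy labels at x = 3 and x = 8.
data = [(0, 0), (1, 0), (2, 0), (3, 1), (4, 0), (5, 0),
        (6, 1), (7, 1), (8, 0), (9, 1), (10, 1), (11, 1)]

candidates = [1, 3, 5]  # odd values avoid tied votes between two classes
cv_scores = {k: cross_val_accuracy(data, k, n_folds=4) for k in candidates}
best_k = max(candidates, key=lambda k: cv_scores[k])

for k in candidates:
    print(f"k={k}: cross-validated accuracy={cv_scores[k]:.3f}")
print("selected k:", best_k)
```

After selection, the chosen configuration would be retrained on the complete training set and evaluated once on the independent test set.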