MODEL EVALUATION
Performance metrics
When building machine learning models, it’s important to
understand how well they perform. Evaluation metrics help
us to measure the effectiveness of our models. Whether we
are solving a classification problem, predicting continuous
values or clustering data, selecting the right evaluation
metric allows us to assess how well the model meets our
goals. In this article, we will see commonly used evaluation
metrics and discuss how to choose the right metric for our
model.
Classification Metrics
Classification problems aim to predict discrete categories. To
evaluate the performance of classification models, we use the
following metrics:
1. Accuracy
Accuracy is a fundamental metric used for evaluating the
performance of a classification model. It tells us the proportion
of correct predictions made by the model out of all predictions.
While accuracy provides a quick snapshot, it can be misleading
in cases of imbalanced datasets. For example, in a dataset with
90% class A and 10% class B, a model predicting only class A
will still achieve 90% accuracy but it will fail to identify any class
B instances.
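The imbalance example above can be checked with a few lines of code. This is a minimal sketch in plain Python; the helper name `accuracy` and the toy labels are illustrative assumptions, not part of any library.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# 90 samples of class "A" and 10 of class "B", as in the example above.
y_true = ["A"] * 90 + ["B"] * 10
# A degenerate model that always predicts the majority class.
y_pred = ["A"] * 100

acc = accuracy(y_true, y_pred)
# Count how many class-B samples the model actually identified.
b_found = sum(1 for t, p in zip(y_true, y_pred) if t == "B" and p == "B")
print(acc, b_found)
```

The accuracy looks respectable even though no class B sample is found, which is why accuracy alone is a weak guide on imbalanced data.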
2. Precision
Precision measures how many of the positive predictions made by
the model are actually correct. It is useful when the cost of
false positives is high, such as in medical diagnoses where
predicting a disease when it’s not present can have serious
consequences.
$$\text{Precision} = \frac{TP}{TP + FP}$$
Where:
•TP = True Positives
•FP = False Positives
Precision helps ensure that when the model predicts a positive
outcome, it’s likely to be correct.
3. Recall
Recall or Sensitivity measures how many of the actual positive
cases were correctly identified by the model. It is important when
missing a positive case (false negative) is more costly than false
positives.
$$\text{Recall} = \frac{TP}{TP + FN}$$
Where:
•FN = False Negatives
In scenarios where catching all positive cases is important (like
disease detection), recall is a key metric.
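To make the two definitions concrete, the following sketch counts TP, FP and FN directly from label lists. The function name `precision_recall` and the toy data are assumptions chosen for illustration.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for the given positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when there are no predicted/actual positives.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# 4 actual positives and 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
# The model finds 2 of the positives and raises 1 false alarm.
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)
```

Here TP = 2, FP = 1 and FN = 2, so precision is 2/3 while recall is only 1/2: the model is fairly reliable when it says "positive" but misses half of the positive cases.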
4. F1 Score
The F1 Score is the harmonic mean of precision and recall. It is useful when
we need a balance between precision and recall as it combines both into a
single number. A high F1 score means the model performs well on both
metrics. Its range is [0,1].
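A model with high precision but low recall is rarely wrong when it predicts
the positive class, but it misses a large number of positive instances; the F1
score exposes this imbalance. The higher the F1 score, the better the model's
performance. It can be expressed mathematically in this way:
$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$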
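As a rough illustration of why the harmonic mean is used, the sketch below compares the F1 score with a plain arithmetic mean for a balanced and a lopsided model. The function name `f1_score` and the input values are illustrative assumptions.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A model that is equally good on both metrics.
balanced = f1_score(0.8, 0.8)
# A model with perfect precision that finds only 10% of the positives.
lopsided = f1_score(1.0, 0.1)
# The arithmetic mean would hide the poor recall of the lopsided model.
lopsided_arithmetic = (1.0 + 0.1) / 2

print(balanced, lopsided, lopsided_arithmetic)
```

The lopsided model scores about 0.18 on F1 but 0.55 on the arithmetic mean, so the harmonic mean only rewards models that do well on both metrics.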
5. Logarithmic Loss (Log Loss)
Log loss measures the uncertainty of the model’s predictions. It
is calculated by penalizing the model for assigning low
probabilities to the correct classes. This metric is used in
multi-class classification and is helpful when we want to assess
a model’s confidence in its predictions. If there are N samples,
each belonging to one of M classes, then we calculate the Log
loss in this way:
$$\text{Log Loss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \log(p_{ij})$$
Where:
•$y_{ij}$ = Actual class indicator (1 if sample $i$ belongs to class $j$, otherwise 0)
•$p_{ij}$ = Predicted probability that sample $i$ belongs to class $j$
The goal is to minimize Log Loss, as a lower Log Loss shows
that the model assigns higher probabilities to the correct classes.
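The calculation can be sketched in plain Python as follows. The function name `log_loss`, the clipping constant `eps` and the toy probabilities are assumptions for illustration; labels are given as class indices, so only the probability of the true class contributes to the sum.

```python
import math

def log_loss(y_true, probs, eps=1e-15):
    """Multi-class log loss.

    y_true: list of true class indices.
    probs:  list of per-class probability lists, one per sample.
    """
    total = 0.0
    for label, p in zip(y_true, probs):
        # Clip to avoid log(0); only the true class has a non-zero indicator.
        p_true = min(max(p[label], eps), 1 - eps)
        total += -math.log(p_true)
    return total / len(y_true)

y_true = [0, 1, 2]
# Both models predict the right class; they differ only in confidence.
confident = [[0.9, 0.05, 0.05], [0.05, 0.9, 0.05], [0.05, 0.05, 0.9]]
hesitant = [[0.4, 0.3, 0.3], [0.3, 0.4, 0.3], [0.3, 0.3, 0.4]]

loss_confident = log_loss(y_true, confident)
loss_hesitant = log_loss(y_true, hesitant)
print(loss_confident, loss_hesitant)
```

Both models would have the same accuracy, but the hesitant one receives a much higher log loss because it assigns less probability to the correct classes.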
6. Area Under Curve (AUC) and ROC Curve
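The ROC curve plots the True Positive Rate against the False
Positive Rate across classification thresholds, and AUC is
the area under that curve. It is useful for binary
classification tasks. The AUC value represents the
probability that the model will rank a randomly chosen
positive example higher than a randomly chosen negative
example. AUC ranges from 0 to 1, with higher values showing
better model performance; a value of 0.5 corresponds to
random ranking. The rates involved are defined below.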
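The ranking interpretation of AUC can be computed directly by comparing every positive example with every negative example. This is a minimal sketch for small datasets, with the function name `auc_by_ranking` and the toy scores as assumptions; ties are counted as half a win, which is a common convention.

```python
def auc_by_ranking(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # tie counts as half
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]

auc = auc_by_ranking(y_true, scores)
print(auc)
```

In this toy data 8 of the 9 positive-negative pairs are ordered correctly, so the AUC is 8/9. The pairwise loop is quadratic in the number of samples and is meant only to show the definition.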
1. True Positive Rate (TPR)
Also known as sensitivity or recall, the True Positive Rate
measures how many actual positive instances were correctly
identified by the model. It answers the question: "Out of all
the actual positive cases, how many did the model correctly
identify?"
Formula:
$$TPR = \frac{TP}{TP + FN}$$
Where:
•TP = True Positives (correctly predicted positive cases)
•FN = False Negatives (actual positive cases incorrectly
predicted as negative)
2. True Negative Rate (TNR)
Also called specificity, the True Negative Rate measures how
many actual negative instances were correctly identified by the
model. It answers the question: "Out of all the actual negative
cases, how many did the model correctly identify as negative?"
Formula:
$$TNR = \frac{TN}{TN + FP}$$
Where:
•TN = True Negatives (correctly predicted negative cases)
•FP = False Positives (actual negative cases incorrectly
predicted as positive)
3. False Positive Rate (FPR)
It measures how many actual negative instances were
incorrectly classified as positive. It’s a key metric when the cost
of false positives is high such as in fraud detection.
Formula:
$$FPR = \frac{FP}{FP + TN}$$
Where:
•FP = False Positives (incorrectly predicted positive cases)
•TN = True Negatives (correctly predicted negative cases)
4. False Negative Rate (FNR)
It measures how many actual positive instances were
incorrectly classified as negative. It answers: "Out of all the
actual positive cases, how many were misclassified as
negative?"
Formula:
$$FNR = \frac{FN}{FN + TP}$$
Where:
•FN = False Negatives (incorrectly predicted negative cases)
•TP = True Positives (correctly predicted positive cases)
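The four rates can be derived from the same confusion-matrix counts. The sketch below uses made-up counts and a hypothetical helper name `confusion_rates`; it also shows that the rates come in complementary pairs.

```python
def confusion_rates(tp, fp, tn, fn):
    """Return TPR, TNR, FPR and FNR from confusion-matrix counts."""
    return {
        "TPR": tp / (tp + fn),  # share of actual positives found
        "TNR": tn / (tn + fp),  # share of actual negatives found
        "FPR": fp / (fp + tn),  # share of actual negatives flagged as positive
        "FNR": fn / (fn + tp),  # share of actual positives missed
    }

# Illustrative counts: 50 actual positives and 50 actual negatives.
rates = confusion_rates(tp=40, fp=5, tn=45, fn=10)
print(rates)
```

Because TPR and FNR share the same denominator (all actual positives), they always sum to 1; the same holds for TNR and FPR over the actual negatives.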
Regression Metrics
In regression tasks, the goal is to predict a target variable
that takes continuous values. To evaluate the performance of
such a model, the following metrics are used:
1. Mean Absolute Error (MAE)
MAE calculates the average of the absolute differences
between the predicted and actual values. It gives a clear view
of the model’s prediction accuracy, but it doesn't show
whether the errors are due to over- or under-prediction. It is
simple to calculate and interpret, which makes it a good
starting point for model evaluation.
$$\text{MAE} = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i|$$
Where:
•$y_i$ = Actual value
•$\hat{y}_i$ = Predicted value
•$N$ = Number of samples
2. Mean Squared Error (MSE)
MSE calculates the average of the squared differences
between the predicted and actual values. Squaring the
differences ensures that larger errors are penalized more
heavily, which makes the metric sensitive to outliers. This is
useful when large errors are undesirable, but it can be
problematic when outliers are not relevant to the model’s purpose.
Formula:
$$\text{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$
Where:
•$y_i$ = Actual value
•$\hat{y}_i$ = Predicted value
•$N$ = Number of samples
3. Root Mean Squared Error (RMSE)
RMSE is the square root of MSE, bringing the metric back to
the original scale of the data. Like MSE, it heavily penalizes
larger errors but is easier to interpret as it’s in the same units
as the target variable. It’s useful when we want to know how
much our predictions deviate from the actual values in terms
of the same scale.
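Formula:
$$\text{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2}$$
Where:
•$y_i$ = Actual value
•$\hat{y}_i$ = Predicted value
•$N$ = Number of samples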
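To compare MAE, MSE and RMSE on the same predictions, here is a minimal sketch in plain Python. The function names and the toy values are assumptions for illustration; one prediction is deliberately far off to show how squaring reacts to a large error.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    return math.sqrt(mse(y_true, y_pred))

y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 30.0]  # the last prediction is off by 10

mae_value = mae(y_true, y_pred)
mse_value = mse(y_true, y_pred)
rmse_value = rmse(y_true, y_pred)
print(mae_value, mse_value, rmse_value)
```

The absolute errors are 2, 2, 3 and 10, giving an MAE of 4.25. The single large error contributes 100 of the 117 total squared error, so MSE (29.25) and RMSE (about 5.41) are pulled up more strongly than MAE.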
4. Root Mean Squared Logarithmic Error (RMSLE)
RMSLE is useful when the target variable spans a wide range
of values, such as prices or populations, because it compares
predictions and actual values on a logarithmic scale. Unlike
RMSE, it penalizes underestimation more than overestimation
of the same absolute size.
Formula:
$$\text{RMSLE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left(\log(1 + y_i) - \log(1 + \hat{y}_i)\right)^2}$$
Where:
•$y_i$ = Actual value
•$\hat{y}_i$ = Predicted value
•$N$ = Number of samples
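The asymmetry between under- and overestimation can be seen in a small sketch. The function name `rmsle` and the numbers are illustrative assumptions; the common log(1 + x) form is used so that zero values are handled.

```python
import math

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error using log(1 + x)."""
    squared = [
        (math.log1p(t) - math.log1p(p)) ** 2
        for t, p in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared) / len(squared))

# The same absolute error of 50, once below and once above the actual value.
under = rmsle([100.0], [50.0])
over = rmsle([100.0], [150.0])
print(under, over)
```

Both predictions miss by 50, yet the underestimate receives the larger penalty because the logarithm compresses large values more than small ones.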
5. R² (R-squared)
The R² score represents the proportion of the variance in the
dependent variable that is predictable from the independent
variables. An R² value close to 1 shows a model that explains
most of the variance while a value close to 0 shows that the
model does not explain much of the variability in the data. R² is
used to assess the goodness-of-fit of regression models.
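Formula:
$$R^2 = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2}$$
Where:
•$y_i$ = Actual value
•$\hat{y}_i$ = Predicted value
•$\bar{y}$ = Mean of the actual values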
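A short sketch of the R² calculation, assuming a hypothetical helper `r2_score` and toy data, shows the two reference points: a close fit scores near 1, and a model that always predicts the mean scores 0.

```python
def r2_score(y_true, y_pred):
    """R² = 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0, 5.0]

# Predictions that track the actual values closely.
good = r2_score(y_true, [1.1, 1.9, 3.2, 3.8, 5.0])
# A model that ignores the inputs and always predicts the mean (3.0).
mean_only = r2_score(y_true, [3.0] * 5)
print(good, mean_only)
```

The sketch assumes the actual values are not all identical, since the total sum of squares would otherwise be zero.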
Baseline models in deep learning, as in machine learning
generally, are simple models used to establish a reference
point for evaluating the performance of more complex
models. They are not necessarily deep learning models
themselves, but rather serve as a foundational benchmark
against which deep learning models can be compared.
Here's why baseline models are crucial:
Performance Benchmark: They provide a minimum threshold of performance that
any advanced model, including deep learning models, should ideally exceed.
This helps in assessing whether the complexity of a deep learning model is
justified by a significant improvement in results.
Identifying Issues: If a sophisticated deep learning model performs worse than
a simple baseline, it can signal problems with the data, feature engineering,
model architecture, or hyperparameter tuning.
Contextualizing Results: Baseline models offer a context for interpreting the
evaluation metrics of deep learning models. For instance, a high accuracy score
might seem impressive, but if a simple baseline achieves a similar score, the
deep learning model's value is diminished.
Examples of Baseline Models:
Classification:
Majority Class Classifier: Always predicts the most frequent class in the
training data. This is useful for understanding the impact of class imbalance.
Random Classifier: Assigns class labels randomly, providing a lower bound for
performance.
Regression:
Mean/Median Predictor: Predicts the mean or median of the target variable
from the training data for all inputs.
Last Value Predictor: In time series, predicts the current value as the next
value.
Deep Learning Specific Baselines:
A very simple neural network (e.g., a single-layer perceptron) can be used as
a baseline before building more intricate deep architectures like CNNs or
LSTMs.
Pre-trained models without fine-tuning, or models with minimal layers, can
also serve as baselines.
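Two of the baselines listed above, the majority class classifier and the mean predictor, can be sketched in a few lines. The function names and the toy data are hypothetical and serve only to show how little a baseline needs to learn.

```python
from collections import Counter

def majority_class_baseline(train_labels):
    """Return a predictor that always outputs the most frequent training label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return lambda _x: most_common

def mean_baseline(train_targets):
    """Return a predictor that always outputs the mean training target."""
    mean = sum(train_targets) / len(train_targets)
    return lambda _x: mean

# Classification: 70% of the training labels are "ham".
train_labels = ["spam"] * 3 + ["ham"] * 7
predict_label = majority_class_baseline(train_labels)
test_labels = ["ham", "ham", "spam", "ham"]
baseline_acc = sum(predict_label(None) == t for t in test_labels) / len(test_labels)

# Regression: the predictor ignores its input entirely.
predict_value = mean_baseline([2.0, 4.0, 6.0])
baseline_prediction = predict_value(None)
print(baseline_acc, baseline_prediction)
```

Any more complex model evaluated on the same test data should beat these numbers before its extra complexity is considered justified.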
Hyperparameters in deep learning are configuration variables set
manually before the training process begins to control how the model learns.
They are different from model parameters, which are learned
internally during training. Key examples include the learning
rate, batch size, and the number of layers and nodes in a
neural network, which collectively influence the model's
architecture, complexity, and performance. Finding the optimal
combination of hyperparameters is a critical step called
hyperparameter tuning.
What they are
External configurations: Hyperparameters are external to the model and
are set by the data scientist to define the learning process.
Pre-training settings: They are defined before training starts and are not
learned from the data during training.
Contrast with parameters: In contrast, parameters like weights and biases
are internal and are automatically adjusted by the model as it learns from
the data.
Common types of hyperparameters
Learning rate: This determines the step size for updating model parameters
during training. A high learning rate can lead to faster training but may
overshoot the optimal solution, while a low one may converge slowly.
Batch size: This is the number of training examples used in one iteration to
update the model's parameters. It affects training speed and the noise in the
updates.
Epochs: This is the number of times the entire training dataset is passed
through the model.
Network architecture: For neural networks, this includes the number of
hidden layers and the number of neurons in each layer.
Regularization hyperparameters: These are used to prevent overfitting, such
as the regularization strength (often written λ) in Lasso regression.
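The effect of the learning rate described above can be illustrated with gradient descent on a one-dimensional toy function. This is a sketch under simple assumptions: the objective f(w) = w², the starting point and the three learning rates are chosen only for illustration.

```python
def gradient_descent(learning_rate, steps=20, w0=5.0):
    """Minimize f(w) = w**2 (gradient 2*w) starting from w0; return the final w."""
    w = w0
    for _ in range(steps):
        w -= learning_rate * 2 * w
    return w

too_small = gradient_descent(0.01)  # moves toward 0, but slowly
reasonable = gradient_descent(0.3)  # converges close to the minimum at 0
too_large = gradient_descent(1.1)   # overshoots more on every step and diverges
print(too_small, reasonable, too_large)
```

After the same 20 steps, the small learning rate is still far from the minimum, the moderate one has essentially reached it, and the large one ends up farther away than where it started.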
Why they are important
Influence on performance: Hyperparameters have a significant impact on a
model's final accuracy and efficiency.
Guide the learning process: By setting hyperparameters, you guide the
model's development towards your specific goals.
Optimization is key: Choosing the right hyperparameters is crucial for
optimizing the model's performance.
Manual hyperparameter tuning is the process of manually
selecting and adjusting hyperparameters like the learning rate or
number of layers to find the best model performance. This involves
a trial-and-error approach, where a user sets initial
values, trains the model, observes the results, and then
modifies the hyperparameters based on experience and
intuition to improve performance. This is different from
automated methods, which use algorithms to search for
optimal hyperparameters.
How manual hyperparameter tuning works
Select initial values: Based on past experience, the user chooses initial
values for hyperparameters such as the number of layers, nodes per layer,
and learning rate.
Train the model: The model is trained with the chosen set of
hyperparameters.
Evaluate the performance: The model's performance is evaluated on a
validation set using relevant metrics.
Adjust hyperparameters: The user manually changes the hyperparameter
values. For example, if the model is overfitting, the user might increase the
dropout rate or reduce the number of layers.
Repeat: Steps 2-4 are repeated until satisfactory performance is achieved.
Key hyperparameters to tune
Learning rate: Controls how quickly the model updates its parameters.
Number of layers and nodes: Determines the size and depth of the neural
network.
Dropout rate: A regularization technique to prevent overfitting by randomly
ignoring nodes during training.
Batch size: The number of samples processed before the model is updated.
Pros and cons
Pros: Can be useful for experienced users who have good intuition about
how to set hyperparameters for a specific problem and can be a good way
to start.
Cons:
Time-consuming and inefficient: It can take a lot of time and many
iterations to find good hyperparameters.
Difficult to optimize: It becomes very difficult when dealing with multiple
hyperparameters that interact with each other.
Depends on user expertise: Results heavily depend on the user's
knowledge and can be suboptimal compared to automated methods.
Automatic hyperparameter tuning in deep learning uses
algorithms to find the best hyperparameter settings for a model, replacing
the manual and time-consuming trial-and-error process. The
process involves defining a search space of hyperparameters
(like learning rate, number of layers, or activation functions)
and then using an automated algorithm (e.g., Bayesian
optimization, grid search) to find the optimal combination
that maximizes model performance.
How it works
Define the search space: You must specify which hyperparameters to tune and
define a range or set of possible values for each.
Automated search: An algorithm then automatically runs numerous trials with
different hyperparameter combinations from your specified space.
Evaluate performance: For each combination, the model is trained and its
performance is evaluated, often using a validation set.
Find the optimum: The algorithm identifies and returns the set of
hyperparameters that resulted in the best performance, such as the highest
accuracy.
Common techniques
Grid search: Tries every possible combination of hyperparameters from a
predefined grid.
Random search: Randomly samples combinations from the search space.
Bayesian optimization: Uses previous results to intelligently choose the next
set of hyperparameters to test, often finding a good result with fewer trials.
Population-Based Training (PBT): An evolutionary method where multiple
models train simultaneously. Poorly performing models are replaced by
modified copies of better-performing ones, allowing hyperparameters to
evolve over time.
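Grid search and random search can be sketched in plain Python as follows. The function `validation_score` is a hypothetical stand-in for "train the model and evaluate it on a validation set", and the search space values are assumptions; in practice each call would be a full training run.

```python
import itertools
import random

def validation_score(learning_rate, num_layers):
    """Hypothetical stand-in for training a model and scoring it.

    The toy score peaks at learning_rate=0.01 and num_layers=3.
    """
    return -((learning_rate - 0.01) ** 2) * 1e4 - (num_layers - 3) ** 2

search_space = {
    "learning_rate": [0.001, 0.01, 0.1],
    "num_layers": [1, 2, 3, 4],
}

def grid_search(space):
    """Evaluate every combination in the grid; return (best_score, best_params)."""
    names = list(space)
    best = None
    for values in itertools.product(*(space[n] for n in names)):
        params = dict(zip(names, values))
        score = validation_score(**params)
        if best is None or score > best[0]:
            best = (score, params)
    return best

def random_search(space, n_trials, seed=0):
    """Evaluate n_trials randomly sampled combinations."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        params = {name: rng.choice(values) for name, values in space.items()}
        score = validation_score(**params)
        if best is None or score > best[0]:
            best = (score, params)
    return best

grid_best = grid_search(search_space)                  # 12 evaluations
random_best = random_search(search_space, n_trials=5)  # 5 evaluations
print(grid_best, random_best)
```

Grid search is guaranteed to find the best point on the grid but needs one evaluation per combination, while random search trades that guarantee for a fixed, smaller budget.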
Benefits of automation
Saves time: Automates a process that can take days, weeks, or even months
when done manually.
Improves performance: Can lead to better model performance by exploring a
wider range of possibilities.
Reduces manual effort: Eliminates the need for expert guesswork and extensive
manual iteration.
Debugging Strategies
Debugging deep learning models requires a systematic
approach due to the complexity of the models and the data
involved. Here are key strategies:
1. Data Verification:
•Inspect Data: Ensure data is loaded correctly, free of errors,
and properly preprocessed (normalization, scaling, handling
missing values).
•Check Labels: Verify the correctness of labels and address
class imbalances.
•Visualize Data: Use visualizations to identify outliers or
2. Code and Model Architecture Checks:
Start Simple: Begin with a basic model and a small dataset to establish a
baseline and ensure the training pipeline functions.
Verify Architecture: Double-check the implementation of the model
architecture, ensuring layers and connections are as intended. Tools like
Netron can help visualize network structures.
Weight Initialization: Experiment with different weight initialization
strategies if the model struggles to converge.
Regularization: Adjust regularization techniques (e.g., dropout, L1/L2) as
excessive regularization can lead to underfitting.
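The "Start Simple" advice can be turned into a quick sanity check: train the simplest possible model on a handful of points that it should fit exactly, and confirm that the loss falls to nearly zero. The sketch below uses a hand-written linear model in plain Python as an assumed stand-in for a real training pipeline.

```python
def train_linear(xs, ys, learning_rate=0.05, epochs=500):
    """Fit y = w*x + b with plain gradient descent on the MSE.

    Returns (w, b, loss_history).
    """
    w, b = 0.0, 0.0
    history = []
    n = len(xs)
    for _ in range(epochs):
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        history.append(sum(e * e for e in errors) / n)
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b, history

# Four noise-free points from y = 2x + 1; a working pipeline should fit them.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b, history = train_linear(xs, ys)
print(w, b, history[0], history[-1])
```

If the loss does not approach zero on such a tiny, perfectly learnable dataset, the problem is more likely in the training code than in the model capacity or the data.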
3. Training Process Monitoring:
Track Metrics: Monitor loss, accuracy, and other relevant metrics on both
training and validation sets. Look for signs of overfitting or underfitting.
Visualize Training Progress: Plotting loss curves and accuracy over epochs can
reveal issues like exploding/vanishing gradients or a stuck training process.
Monitor Activations and Gradients: Visualize histograms of activations and
gradients in different layers to identify potential problems like dead neurons or
unstable gradients.
Hyperparameter Tuning: Systematically explore different hyperparameters
(learning rate, batch size, optimizer) to find optimal configurations.
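The activation and gradient checks mentioned above can also be done numerically rather than visually. The following sketch assumes activations and gradients have already been exported as plain lists; the function names and the thresholds `low` and `high` are hypothetical and would need tuning for a real model.

```python
import math

def dead_unit_fraction(activations):
    """Fraction of units that output 0 for every sample.

    activations: list of per-sample lists of post-ReLU outputs for one layer.
    """
    n_units = len(activations[0])
    dead = sum(
        1 for j in range(n_units)
        if all(sample[j] == 0.0 for sample in activations)
    )
    return dead / n_units

def gradient_status(gradients, low=1e-6, high=1e3):
    """Classify a layer's gradient by its L2 norm (thresholds are assumptions)."""
    norm = math.sqrt(sum(g * g for g in gradients))
    if math.isnan(norm) or norm > high:
        return "exploding"
    if norm < low:
        return "vanishing"
    return "ok"

# Three samples, four units: units 0 and 2 never activate.
layer_acts = [
    [0.0, 1.2, 0.0, 0.3],
    [0.0, 0.7, 0.0, 0.0],
    [0.0, 2.1, 0.0, 0.9],
]
dead_fraction = dead_unit_fraction(layer_acts)
print(dead_fraction, gradient_status([1e-9, 2e-9]), gradient_status([0.1, -0.2]))
```

Logging these two numbers per layer at regular intervals gives an early warning before the problem becomes visible in the loss curve.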
4. Debugging Tools and Techniques:
Print Statements: Insert print statements at critical points to inspect variable
values, tensor shapes, and data flow.
Debuggers: Utilize integrated debuggers in IDEs or framework utilities (e.g.,
PyTorch's torch.autograd.set_detect_anomaly(True)) to step through code and
examine states.
Breakpoints: Set breakpoints to pause execution and inspect variables at
specific points.
Binary Search: When a bug is suspected in a sequence of operations, use
binary search to narrow down the problematic section.
Automated Testing: Implement unit tests for individual components of the
model and data pipeline.
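As an example of automated testing for a data pipeline component, the sketch below tests a small preprocessing function. The function `standardize` and both test cases are hypothetical; the tests are written as plain functions with assertions so they run without any test framework, although they would also be picked up by a runner such as pytest.

```python
import math

def standardize(values):
    """Hypothetical preprocessing step: scale to zero mean and unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var)
    if std == 0.0:
        # Constant feature: return zeros instead of dividing by zero.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

def test_zero_mean_unit_variance():
    out = standardize([1.0, 2.0, 3.0, 4.0])
    assert abs(sum(out) / len(out)) < 1e-12
    assert abs(sum(o * o for o in out) / len(out) - 1.0) < 1e-12

def test_constant_input_has_no_nan():
    out = standardize([5.0, 5.0, 5.0])
    assert out == [0.0, 0.0, 0.0]

# Run the tests directly; each raises AssertionError on failure.
test_zero_mean_unit_variance()
test_constant_input_has_no_nan()
```

Edge cases such as constant features are a frequent source of NaN values during training, so they are worth covering explicitly.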
5. Experiment Tracking and Documentation:
Document Experiments: Keep detailed records of model architectures,
hyperparameters, data versions, and results for each experiment.
Version Control: Use version control systems (e.g., Git) to manage code and
track changes, facilitating rollbacks if needed.