Deep Learning Model Evaluation Metrics
III-YEAR-AI&DS
2021 R
SUB CODE: AD3501
SUB NAME: DEEP LEARNING
UNIT IV
MODEL EVALUATION
AD3501/DEEP LEARNING/UNIT IV/AI&DS/SRRCET
To evaluate the performance or quality of the model, different metrics are used, and these metrics are known as
performance metrics or evaluation metrics.
These performance metrics help us understand how well our model has performed for the given data. In this way,
we can improve the model's performance by tuning the hyper-parameters. Each ML model aims to generalize well
on unseen/new data, and performance metrics help determine how well the model generalizes on the new dataset.
In a classification problem, the categories or classes of data are identified based on training data. The model
learns from the given dataset and then classifies new data into classes or groups based on that training. It
predicts class labels as the output, such as Yes or No, 0 or 1, Spam or Not Spam, etc. To evaluate the
performance of a classification model, different metrics are used, and some of them are as follows:
o Accuracy
o Confusion Matrix
o Precision
o Recall
o F-Score
o AUC(Area Under the Curve)-ROC
Accuracy
The accuracy metric is one of the simplest classification metrics to implement. It is the ratio of the number
of correct predictions to the total number of predictions:
\[ \text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} \]
It is good to use the accuracy metric when the target variable classes in the data are approximately balanced.
For example, if 60% of the images in a fruit image dataset are of Apple and 40% are of Mango, the classes are
roughly balanced, so an accuracy of, say, 97% on the Apple-versus-Mango task is a meaningful measure of the model.
Precision
The precision metric is used to overcome the limitation of accuracy. Precision is the proportion of positive
predictions that were actually correct. It is calculated as the ratio of true positives to all positive
predictions (true positives plus false positives):
\[ \text{Precision} = \frac{TP}{TP + FP} \]
Recall is similar to the precision metric; however, it measures the proportion of actual positives that were
identified correctly. It is calculated as the ratio of true positives to all actual positives, whether they
were correctly predicted as positive or incorrectly predicted as negative (true positives plus false negatives):
\[ \text{Recall} = \frac{TP}{TP + FN} \]
Recall determines the performance of a classifier with respect to false negatives, whereas precision gives
information about the performance of a classifier with respect to false positives. So, if we want to minimize
false negatives, recall should be as close to 100% as possible, and if we want to minimize false positives,
precision should be as close to 100% as possible.
The F-score or F1 Score is a metric to evaluate a binary classification model on the basis of the predictions made
for the positive class. It is calculated from precision and recall, and it is a single score that represents both
of them. The F1 Score is the harmonic mean of precision and recall, assigning equal weight to each of them:
\[ F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]
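The definitions of precision, recall, and F1 Score can be checked with a few lines of Python. This is only a minimal sketch: the function name `precision_recall_f1` and the counts (TP = 100, FP = 10, FN = 5) are assumptions chosen for illustration, not results from a real model.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)  # share of positive predictions that are correct
    recall = tp / (tp + fn)     # share of actual positives that were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Hypothetical counts, used only to illustrate the formulas.
precision, recall, f1 = precision_recall_f1(tp=100, fp=10, fn=5)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Because F1 is a harmonic mean, it always lies between precision and recall and is pulled toward the smaller of the two.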
AUC-ROC
Sometimes we need to visualize the performance of the classification model on charts; then, we can use the
AUC-ROC curve. It is one of the popular and important metrics for evaluating the performance of the
classification model.
First, let's understand the ROC (Receiver Operating Characteristic) curve. The ROC curve is a graph that
shows the performance of a classification model at different threshold levels.
The curve is plotted between two parameters, which are:
o True Positive Rate (TPR), which is the same as recall: TP / (TP + FN)
o False Positive Rate (FPR): FP / (FP + TN)
Regression is a supervised learning technique that aims to find the relationships between the dependent and
independent variables. A predictive regression model predicts a continuous numeric value. The metrics used
for regression are different from the classification metrics.
Mean Absolute Error or MAE is one of the simplest metrics. It measures the average absolute difference
between actual and predicted values, where absolute means taking the magnitude of a number regardless of its sign.
\[ MAE = \frac{1}{N} \sum |Y - Y'| \]
Here,
Y is the Actual outcome, Y' is the predicted outcome, and N is the total number of data points.
Mean Squared Error or MSE is one of the most suitable metrics for regression evaluation. It measures the
average of the squared differences between the values predicted by the model and the actual values.
\[ MSE = \frac{1}{N} \sum (Y - Y')^2 \]
Moreover, because the differences are squared, large errors are penalized much more heavily than small ones,
so a few large errors can lead to over-estimation of how bad the model is.
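The two error metrics can be computed directly from their definitions. The sketch below is a minimal illustration in plain Python; the function names and the four data points are assumptions made up for the example.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |Y - Y'| over all data points."""
    return sum(abs(y - y_hat) for y, y_hat in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average of (Y - Y')^2 over all data points."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / len(y_true)


# Hypothetical actual and predicted values.
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 8.0]

mae = mean_absolute_error(y_true, y_pred)  # (1 + 0 + 2 + 1) / 4
mse = mean_squared_error(y_true, y_pred)   # (1 + 0 + 4 + 1) / 4
print(mae, mse)
```

The single error of size 2 contributes 2 to the MAE sum but 4 to the MSE sum, which is how squaring gives large errors more weight.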
R Squared Score
The R squared score is also known as the Coefficient of Determination. The R-squared metric enables us to compare
our model with a constant baseline to determine the performance of the model. To select the constant baseline,
we take the mean of the data and draw the line at the mean.
\[ R^2 = 1 - \frac{MSE(\text{model})}{MSE(\text{baseline})} \]
The R squared score will always be less than or equal to 1, regardless of whether the values are large or
small.
Baseline models serve as a benchmark in an ML application. Their main goal is to put the results of trained
models into context.
• Encoder
• Hidden Vector
• Decoder
• Random Baseline Models: Data in the actual world isn't always reliable. A dummy classifier or
regressor is the optimal baseline model for these issues. This baseline model will inform you if your
machine learning model is learning or not.
• ML Baseline Models: If the data is predictable, you can create a baseline model that helps you
analyze which features are critical for prediction and which are not. These baseline models are
commonly used with feature engineering.
• Automated ML Baseline Models: It is the ultimate baseline model. It's an excellent model for
comparing your ML model. If your ML model outperforms the automated baseline model, it's a strong
indication that the model has the potential to become a product.
18. Define Hyperparameters
In neural networks, parameters are used to train the model and make predictions. There are two types of
parameters:
Model parameters are internal to the neural network – for example, neuron weights. They are estimated or
learned automatically from training samples. These parameters are also used to make predictions in a
production model.
Hyperparameters are external parameters set by the operator of the neural network, for example, selecting
which activation function to use or the batch size used in training. Hyperparameters have a huge impact
on the accuracy of a neural network. Their optimal values may differ from one problem to another, and it is
non-trivial to discover those values. The simplest way to select hyperparameters for a neural network model is
“manual search”, in other words, trial and error. New methods are evolving which use algorithms and
optimization methods to discover the best hyperparameters.
A hyperparameter is a parameter of the model whose value influences the learning process and whose value cannot
be estimated from the training data. Hyperparameters are configured externally before starting the model
learning/training process. Hyperparameter tuning is the process of finding the optimal hyperparameters for any given
machine learning algorithm.
PART B
In a classification problem, the categories or classes of data are identified based on training data. The
model learns from the given dataset and then classifies new data into classes or groups based on that
training.
It predicts class labels as the output, such as Yes or No, 0 or 1, Spam or Not Spam, etc. To evaluate the
performance of a classification model, different metrics are used, and some of them are as follows:
o Accuracy
o Confusion Matrix
o Precision
o Recall
o F-Score
o AUC(Area Under the Curve)-ROC
I. Accuracy
The accuracy metric is one of the simplest classification metrics to implement. It is the ratio of the
number of correct predictions to the total number of predictions:
\[ \text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} \]
It is good to use the accuracy metric when the target variable classes in the data are approximately
balanced. For example, if 60% of the images in a fruit image dataset are of Apple and 40% are of Mango,
the classes are roughly balanced, so an accuracy of, say, 97% on the Apple-versus-Mango task is a
meaningful measure of the model.
It is recommended not to use the accuracy measure when the target variable mostly belongs to
one class. For example, suppose there is a disease prediction model for which, out of 100
people, only five have the disease and 95 do not. In this case, if our model predicts that no one
has the disease (a bad prediction, since it misses every sick person), the accuracy will still
be 95%, which is misleading.
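The disease example can be reproduced in a few lines. This is a minimal sketch with hypothetical helper functions `accuracy` and `recall`; label 1 is assumed to mean "has the disease" and label 0 "does not".

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the actual labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that the model predicted as positive."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    actual_pos = sum(1 for t in y_true if t == positive)
    return true_pos / actual_pos


# 100 people: 5 have the disease (1), 95 do not (0).
y_true = [1] * 5 + [0] * 95
# A model that predicts "no disease" for everyone.
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)  # 95 of 100 predictions are correct
rec = recall(y_true, y_pred)    # 0 of 5 sick patients are found
print(acc, rec)
```

The accuracy looks high, while the recall shows that the model does not detect a single patient with the disease.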
II. Confusion Matrix
o In the matrix, columns are for the predicted values, and rows specify the actual values.
Here, the actual and predicted values each take two possible classes, Yes or No. So, if we are
predicting the presence of a disease in a patient, a prediction of Yes means the patient has
the disease, and a prediction of No means the patient doesn't have the disease.
o In this example, the total number of predictions is 165, out of which Yes was predicted 110
times and No was predicted 55 times. However, in reality, there are 60 cases in which patients
don't have the disease and 105 cases in which patients have the disease.
o True Positive (TP): The model predicted Yes, and the actual value is also Yes.
o True Negative (TN): The model predicted No, and the actual value is also No.
o False Positive (FP): The model predicted Yes, but the actual value is No.
o False Negative (FN): The model predicted No, but the actual value is Yes.
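The four cells can be counted directly from lists of actual and predicted labels. The sketch below is illustrative only: the text gives just the totals (165 predictions, 110 predicted Yes, 55 predicted No, 105 actual Yes, 60 actual No), so the individual cell counts used here (TP = 100, FN = 5, FP = 10, TN = 50) are an assumption chosen to be consistent with those totals. The function name `confusion_counts` is also hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, and FN for a binary classification problem."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted Yes, actually Yes
            else:
                fp += 1  # predicted Yes, actually No
        else:
            if actual == positive:
                fn += 1  # predicted No, actually Yes
            else:
                tn += 1  # predicted No, actually No
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}


# Assumed cell counts (1 = Yes, 0 = No), consistent with the stated totals.
y_true = [1] * 100 + [1] * 5 + [0] * 10 + [0] * 50
y_pred = [1] * 100 + [0] * 5 + [1] * 10 + [0] * 50

counts = confusion_counts(y_true, y_pred)
print(counts)
```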
III. Precision
The precision metric is used to overcome the limitation of accuracy. Precision is the
proportion of positive predictions that were actually correct. It is calculated as the ratio of
true positives to all positive predictions (true positives plus false positives):
\[ \text{Precision} = \frac{TP}{TP + FP} \]
Recall or Sensitivity
Recall is similar to the precision metric; however, it measures the proportion of actual
positives that were identified correctly. It is calculated as the ratio of true positives to all
actual positives, whether they were correctly predicted as positive or incorrectly predicted as
negative (true positives plus false negatives):
\[ \text{Recall} = \frac{TP}{TP + FN} \]
From the above definitions of Precision and Recall, we can say that recall determines the
performance of a classifier with respect to a false negative, whereas precision gives information
about the performance of a classifier with respect to a false positive.
So, if we want to minimize false negatives, recall should be as close to 100% as possible, and if
we want to minimize false positives, precision should be as close to 100% as possible.
IV. F-Scores
V. AUC-ROC
Sometimes we need to visualize the performance of the classification model on charts; then, we
can use the AUC-ROC curve. It is one of the popular and important metrics for evaluating the
performance of the classification model.
First, let's understand the ROC (Receiver Operating Characteristic) curve. The ROC curve is a
graph that shows the performance of a classification model at different threshold levels. The
curve is plotted between two parameters, which are:
o True Positive Rate (TPR), which is the same as recall: TP / (TP + FN)
o False Positive Rate (FPR): FP / (FP + TN)
AUC calculates the performance across all the thresholds and provides an aggregate measure.
The value of AUC ranges from 0 to 1. It means a model with 100% wrong prediction will have
an AUC of 0.0, whereas models with 100% correct predictions will have an AUC of 1.0.
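To make the threshold sweep concrete, the sketch below builds the ROC points by lowering the threshold one sample at a time and then measures the area under the curve with the trapezoidal rule. It is a minimal illustration, not a library implementation: the function names, labels, and scores are hypothetical, and the scores are assumed to be distinct (tied scores would need to be grouped).

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points obtained by sweeping the decision threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    # Visit samples from the highest score to the lowest; each step lowers the
    # threshold just enough to predict one more sample as positive.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]
    for i in order:
        if y_true[i] == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical labels (1 = positive) and model scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

auc = area_under_curve(roc_points(y_true, scores))
print(round(auc, 4))
```

In this example, 8 of the 9 possible (positive, negative) pairs are ranked correctly, which matches the computed AUC of 8/9.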
Regression is a supervised learning technique that aims to find the relationships between the
dependent and independent variables. A predictive regression model predicts a continuous
numeric value. The metrics used for regression are different from the classification metrics.
o Mean Absolute Error
o Mean Squared Error
o R2 Score
o Adjusted R2
Mean Absolute Error or MAE is one of the simplest metrics. It measures the average absolute
difference between actual and predicted values, where absolute means taking the magnitude of a
number regardless of its sign.
\[ MAE = \frac{1}{N} \sum |Y - Y'| \]
Here,
Y is the Actual outcome, Y' is the predicted outcome, and N is the total number of data points.
Mean Squared Error or MSE is one of the most suitable metrics for regression evaluation. It
measures the average of the squared differences between the values predicted by the model and
the actual values.
\[ MSE = \frac{1}{N} \sum (Y - Y')^2 \]
Moreover, because the differences are squared, large errors are penalized much more heavily than
small ones, so a few large errors can lead to over-estimation of how bad the model is.
The R squared score is also known as the Coefficient of Determination. The R-squared metric enables
us to compare our model with a constant baseline to determine the performance of the model. To
select the constant baseline, we take the mean of the data and draw the line at the mean.
\[ R^2 = 1 - \frac{MSE(\text{model})}{MSE(\text{baseline})} \]
The R squared score will always be less than or equal to 1, regardless of whether the values are
large or small.
Adjusted R squared, as the name suggests, is an improved version of the R squared score. R squared has a
limitation: its value does not decrease when more terms are added, even if the model is not improving,
and this may mislead data scientists.
To overcome this issue, adjusted R squared is used, which will always show a lower value
than R². This is because it penalizes an increasing number of predictors and only shows improvement if
there is a real improvement.
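\[ R^2_{adj} = 1 - \frac{(1 - R^2)(N - 1)}{N - k - 1} \]
Here, N is the total number of data points and k is the number of independent variables (predictors).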
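The sketch below computes R squared against the mean baseline and then adjusted R squared. It is only an illustration: it assumes the commonly used adjustment 1 - (1 - R²)(N - 1)/(N - k - 1), where N is the number of data points and k the number of predictors, and the function names and sample values are hypothetical.

```python
def r2_score(y_true, y_pred):
    """R squared: 1 - (model squared error) / (mean-baseline squared error)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


def adjusted_r2(r2, n_samples, n_predictors):
    """Assumed formula: 1 - (1 - R^2) * (N - 1) / (N - k - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)


# Hypothetical data from a model with a single predictor (k = 1).
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [1.1, 1.9, 3.2, 3.8, 5.0]

r2 = r2_score(y_true, y_pred)
adj_r2 = adjusted_r2(r2, n_samples=len(y_true), n_predictors=1)
print(round(r2, 4), round(adj_r2, 4))
```

Dividing both sums of squares by N gives the two mean squared errors, so `r2_score` is the same comparison against the constant mean baseline described above.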
Baseline models serve as a benchmark in an ML application. Their main goal is to put the results
of trained models into context.
Assume you begin working on a problem statement and complete all of the steps, including EDA, data
cleansing, and feature engineering.
You now begin working on your model. During model training, you discover that your model's
accuracy is 54%. So, without making much effort, you now have a 54% accuracy level, which is now
your base value.
You can now tag this as a baseline model, indicating that you will enhance this number after this. If
your model's accuracy level goes below 54% in the future, it means the model requires improvements.
Random Baseline Models: Data in the actual world isn't always reliable. A dummy
classifier or regressor is the optimal baseline model for these issues. This baseline model
will inform you if your machine learning model is learning or not.
ML Baseline Models: If the data is predictable, you can create a baseline model
that helps you analyze which features are critical for prediction and which are not. These
baseline models are commonly used with feature engineering.
Automated ML Baseline Models: It is the ultimate baseline model. It's an excellent
model for comparing your ML model. If your ML model outperforms the automated
baseline model, it's a strong indication that the model has the potential to become a
product.
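A dummy classifier of the kind mentioned under random baseline models can be as simple as always predicting the most frequent training label. The sketch below is a minimal, hypothetical version; the function names, the labels, and the sample identifiers are all made up for illustration.

```python
from collections import Counter


def fit_majority_baseline(train_labels):
    """Return a predictor that always outputs the most frequent training label."""
    majority_label = Counter(train_labels).most_common(1)[0][0]

    def predict(samples):
        # The baseline ignores the input features entirely.
        return [majority_label for _ in samples]

    return predict


def accuracy(y_true, y_pred):
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical data: 7 "ham" and 3 "spam" messages in the training set.
train_labels = ["ham"] * 7 + ["spam"] * 3
test_samples = ["msg1", "msg2", "msg3", "msg4", "msg5"]
test_labels = ["ham", "spam", "ham", "ham", "spam"]

baseline_predict = fit_majority_baseline(train_labels)
baseline_accuracy = accuracy(test_labels, baseline_predict(test_samples))
print(baseline_accuracy)
```

A trained model that cannot beat this number on the same test set is, in effect, not learning anything useful from the features.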
The key advantage of employing the baseline model is that it aids in data comprehension:
Analyze observations that are challenging to categorize: With the help of a baseline
model, you'll be able to figure out which observations are difficult to categorize.
Analyze the different classes: Likewise, if you're working on a multi-class classification
problem, a baseline model can show you which classes are simple to classify and which
are tough to classify.
Detect data with low signal strength: A baseline model with little or no predictive
power may indicate a weak signal or a poor fit.
Faster iteration
Baseline models also help improve the efficiency with which you can build the models.
Increase speed and performance: With a baseline model in place, you will have
detailed information on what to improve and develop. This makes it easy to see if the
changes you're making to your model are improving metrics or not. This enables you to
quickly discover initiatives that can enhance your KPIs.
Efficiency: If you build a baseline model, the amount of work you have to do on current
projects may reduce, allowing you to focus on other projects. The baseline model
facilitates efficiency and productivity.
Performance benchmark
Baseline models provide a suitable standard against which you can evaluate your real models.
Some performance measures, such as logarithmic loss, are more useful for comparing
models than for assessing a single model in isolation. This is because many performance
measures lack a fixed scale and instead take on different values depending on the
range of the outcome variable. A baseline can therefore help you determine when a
sophisticated model is required and when simple business logic is adequate.
Calculate the impact on key business parameters. Creating a simple baseline model can
also help you see what type of influence you might have on company indicators. This is
particularly true if your baseline model is stochastic as well.
In neural networks, parameters are used to train the model and make predictions. There are two types of
parameters:
Model parameters are internal to the neural network – for example, neuron weights. They are
estimated or learned automatically from training samples. These parameters are also used to
make predictions in a production model.
Hyperparameters are external parameters set by the operator of the neural network, for example,
selecting which activation function to use or the batch size used in training. Hyperparameters
have a huge impact on the accuracy of a neural network. Their optimal values may differ from
one problem to another, and it is non-trivial to discover those values.
The simplest way to select hyperparameters for a neural network model is “manual search” – in
other words, trial and error. New methods are evolving which use algorithms and optimization
methods to discover the best hyperparameters.
A hyperparameter is a parameter of the model whose value influences the learning process and whose
value cannot be estimated from the training data. Hyperparameters are configured externally before
starting the model learning/training process. Hyperparameter tuning is the process of finding the optimal
hyperparameters for any given machine learning algorithm.
2. Dropout – what percentage of neurons should be randomly “killed” (temporarily ignored) during
each training step to prevent overfitting.
3. Activation function – which function should be used to process the inputs flowing into each
neuron. The activation function can impact the network’s ability to converge and learn for
different ranges of input values, and also its training speed.
4. Weights initialization – it is necessary to set initial weights for the first forward pass. Two
basic options are to set weights to zero or to randomize them. However, naive choices can result in
a vanishing or exploding gradient, which makes it difficult to train the model. To mitigate this
problem, you can use a heuristic (a formula tied to the number of neurons feeding into and out of
each layer) to determine the weights. A common heuristic used for the Tanh activation is called
Xavier initialization.
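As an illustration of the weight initialization heuristic, the sketch below draws weights using the uniform variant of Xavier (Glorot) initialization, where each weight is sampled from U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out)). The function name, layer sizes, and seed are assumptions chosen for the example.

```python
import math
import random


def xavier_uniform(fan_in, fan_out, seed=0):
    """Return a fan_in x fan_out weight matrix drawn from U(-limit, limit)."""
    # The limit shrinks as the layer gets wider, which keeps the scale of
    # activations and gradients roughly stable from layer to layer.
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    rng = random.Random(seed)
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]


# Hypothetical layer with 4 inputs and 3 outputs.
weights = xavier_uniform(fan_in=4, fan_out=3)
limit = math.sqrt(6.0 / (4 + 3))
print(len(weights), len(weights[0]), round(limit, 4))
```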
Traditionally, hyperparameters were tuned manually by trial and error. This is still commonly done, and
experienced operators can “guess” parameter values that will achieve very high accuracy for deep
learning models.
However, there is a constant search for better, faster and more automatic methods to optimize
hyperparameters.
Pros: Very simple and effective with skilled operators
Cons: Not scientific; it is unknown whether the hyperparameters have been fully optimized
In automated hyperparameter tuning, the optimal set of hyperparameters is found by using an algorithm.
An automatic hyperparameter tuning technique involves methods in which the user defines a set of
hyperparameter combinations or a range for each hyperparameter, and the tuning algorithm runs the trials to
find the optimal set of hyperparameters for the model.
Grid Search
Grid search is slightly more sophisticated than manual tuning. It involves systematically testing
multiple values of each hyperparameter, by automatically retraining the model for each value of the
parameter.
For example, you can perform a grid search for the optimal batch size by automatically training the
model for batch sizes between 10-100 samples, in steps of 20.
The model will run 5 times and the batch size selected will be the one which yields highest accuracy.
Pros: Maps out the problem space and provides more opportunity for optimization
Cons: Can be slow to run for large numbers of hyperparameter values
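The batch size example can be sketched as a loop over the grid. In the code below, `evaluate` is a hypothetical stand-in for "train the model with this batch size and return its validation accuracy"; its formula is made up so that the example runs without a real model.

```python
def evaluate(batch_size):
    """Hypothetical stand-in for training a model and returning its accuracy."""
    return 0.9 - abs(batch_size - 50) / 500


# Batch sizes between 10 and 100 in steps of 20: 10, 30, 50, 70, 90.
batch_sizes = list(range(10, 100, 20))

# One training run per grid value.
results = {batch_size: evaluate(batch_size) for batch_size in batch_sizes}

# Keep the batch size with the highest score.
best_batch_size = max(results, key=results.get)
print(batch_sizes, best_batch_size)
```

With more than one hyperparameter, the grid becomes the Cartesian product of all value lists, which is why the number of runs grows quickly.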
Grid-search is used to find the optimal hyperparameters of a model which results in the most ‘accurate’
predictions.
Grid search is the simplest algorithm for hyperparameter tuning. Basically, we divide the domain of the
Random Search
According to a 2012 research study by James Bergstra and Yoshua Bengio, testing randomized values
of hyperparameters is actually more effective than manual search or grid search.
In other words, instead of testing systematically to cover “promising areas” of the problem space, it is
preferable to test random values drawn from the entire problem space.
Pros: According to the study, provides higher accuracy with fewer training cycles, for problems with
high dimensionality
Cons: Results are unintuitive, difficult to understand “why” hyperparameter values were chosen
Random search is similar to grid search, but instead of using all the points in the grid, it tests only a
randomly selected subset of these points.
The smaller this subset, the faster but less accurate the optimization. The larger this subset, the more
accurate the optimization but the closer to a grid search.
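A random search can be sketched as repeated sampling from the search space followed by picking the best trial. The code below is a minimal illustration that draws values from continuous ranges: the search space (dropout and learning rate), the toy `evaluate` function, and the seed are all hypothetical, and a real `evaluate` would train and validate a model.

```python
import random


def evaluate(params):
    """Hypothetical score; higher is better."""
    return 1.0 - abs(params["dropout"] - 0.3) - abs(params["learning_rate"] - 0.01)


def sample_params(rng):
    """Draw one random hyperparameter combination from the search space."""
    return {
        "dropout": rng.uniform(0.0, 0.5),
        # Sample the exponent uniformly, so the learning rate is log-uniform
        # between 1e-4 and 1e-1.
        "learning_rate": 10 ** rng.uniform(-4, -1),
    }


def random_search(n_trials, seed=0):
    rng = random.Random(seed)
    history = []
    for _ in range(n_trials):
        params = sample_params(rng)
        history.append((params, evaluate(params)))
    best_params, best_score = max(history, key=lambda item: item[1])
    return best_params, best_score, history


best_params, best_score, history = random_search(n_trials=20)
print(best_params, round(best_score, 4))
```

The number of trials plays the role of the subset size: fewer trials run faster but are less likely to land near the optimum.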
A debugger, also known as a debugging tool or debugging mode, can be used to easily identify and
correct bugs.
To effectively leverage the tool, we’ll need to run our program within the debugger, which allows us
to monitor it in real-time and see the error when it occurs.
We can pause the program while it’s running to pinpoint and investigate any issues that are occurring
and review our code line by line.
2. Log Everything
Make sure we’re logging every issue we encounter, as well as steps we take to address them and
ensure our program is running correctly.
Once we’ve documented the error, we can start mapping out potential scenarios and solutions. We
should keep track of all possible steps to take and the information we need to make a decision
regarding our errors.
This will also allow us to navigate different potential solutions.
3. Localize the Problem
The method of problem localization entails removing pieces of code line by line until we find the issue
that is interfering with our program.
While this is a somewhat painstaking and involved way of identifying the error that’s taking place, it
can be highly effective in determining what, exactly, is going wrong with our product.
Of course, we’ll need to keep repeating the process until we’ve tracked down the bugs.
4. Try to Replicate the Problem
By replicating the problem, we’ll find out what the nature of the problem is precisely. In fact, this can
lead to us creating better, cleaner code in general since we’re exercising the critical thinking skills
required to find the cause of an issue.
This, of course, demands a thorough investigation of the ins and outs of the product. But once we’ve
successfully reproduced the error that’s interfering with our product’s performance, usability, or
functionality, fixing the problem should require far less time.
In fact, most of the time, replicating the issue is the hard work, while resolving it takes only minutes.
5. Turn to the Community
It’s highly likely that any error we encounter is one others have encountered before us. It can be
very helpful to turn to a community associated with the language, framework, or another development
tool we’re using to find a solution for addressing the bug we’ve encountered.
Many development tools, such as languages like Python and frameworks like Ruby on Rails, have
huge, thriving communities, offering an abundance of support to developers within them.
6. Test, Test, and Test Again
The best way to spot bugs and successfully resolve them before they derail our app is by
repeatedly testing the product.
While the QA team will more thoroughly vet the product, developers themselves can script simple
tests during the development phase, such as unit tests, which individually test
different pieces of the code (units).
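As an example of a simple developer-written unit test, the sketch below tests one small function with Python's built-in `unittest` module. The function under test (`accuracy`) and the test cases are hypothetical and chosen to match the evaluation theme of this unit.

```python
import unittest


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


class AccuracyTests(unittest.TestCase):
    def test_all_correct(self):
        self.assertEqual(accuracy([1, 0, 1], [1, 0, 1]), 1.0)

    def test_half_correct(self):
        self.assertEqual(accuracy([1, 0, 1, 0], [1, 1, 1, 1]), 0.5)

    def test_length_mismatch_raises(self):
        with self.assertRaises(ValueError):
            accuracy([1, 0], [1])


# Run the tests programmatically so the script does not exit afterwards.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(AccuracyTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.testsRun, result.wasSuccessful())
```

Each test exercises one unit in isolation, so a failure points directly at the piece of code that needs debugging.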
There are several common methods and techniques used in debugging, including:
1. Code Inspection: This involves manually reviewing the source code of a software
system to identify potential bugs or errors.
2. Debugging Tools: There are various tools available for debugging such as debuggers,
trace tools, and profilers that can be used to identify and resolve bugs.
3. Unit Testing: This involves testing individual units or components of a software system to
identify bugs or errors.
4. Integration Testing: This involves testing the interactions between different
components of a software system to identify bugs or errors.
5. System Testing: This involves testing the entire software system to identify bugs or errors.
6. Monitoring: This involves monitoring a software system for unusual behaviour or
performance issues that can indicate the presence of bugs or errors.
7. Logging: This involves recording events and messages related to the software system, which
can be used to identify bugs or errors.
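The logging technique can be illustrated with Python's standard `logging` module. This is a minimal sketch: the logger name, the message format, and the `safe_mean` function are hypothetical, and the log is written to an in-memory buffer only so that the example is easy to inspect (a real program would usually log to a file or the console).

```python
import io
import logging

# Collect log records in memory for this demonstration.
log_stream = io.StringIO()
logger = logging.getLogger("debug_demo")  # hypothetical logger name
logger.setLevel(logging.DEBUG)
handler = logging.StreamHandler(log_stream)
handler.setFormatter(logging.Formatter("%(levelname)s:%(message)s"))
logger.addHandler(handler)


def safe_mean(values):
    """Average a list, logging the call and any failure instead of crashing."""
    logger.debug("safe_mean called with %d values", len(values))
    try:
        return sum(values) / len(values)
    except ZeroDivisionError:
        # logger.exception records the message together with the traceback.
        logger.exception("cannot average an empty list")
        return None


safe_mean([2.0, 4.0])
safe_mean([])
log_text = log_stream.getvalue()
print(log_text)
```

The recorded messages and traceback show which call failed and why, without needing to reproduce the failure under a debugger.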
Debugging Process: Steps involved in debugging are:
Problem identification and report preparation.
Assigning the report to a software engineer to verify that the defect is genuine.
Defect Analysis using modelling, documentation, finding and testing candidate flaws, etc.
Defect Resolution by making required changes to the system.
Validation of corrections.
The debugging process will always have one of two outcomes:
1. The cause will be found and corrected.
2. The cause will not be found.
Debugging is different from testing. Testing focuses on finding bugs, errors, etc whereas debugging
starts after a bug has been identified in the software.
Testing is used to ensure that the program is correct and does what it was supposed to do with a certain
minimum success rate.
Testing can be manual or automated. There are several different types of testing: unit testing,
integration testing, alpha and beta testing, etc. Debugging requires a lot of knowledge, skills, and
expertise.
It can be supported by some automated tools available but is more of a manual process as every bug
is different and requires a different technique, unlike a pre-defined testing mechanism.
Advantages of Debugging:
Disadvantages of Debugging
While debugging is an important aspect of software engineering, there are also some disadvantages
to consider:
Bayesian Optimization
Similar to sampling methods in statistics, the algorithm ends up with a list of possible hyperparameter
value sets and model functions, from which it predicts the optimal function across the entire problem
set.
Pros: The original study and practical experience from the industry show that Bayesian optimization
results in significantly higher accuracy compared to random search.
Cons: Like random search, results are not intuitive and are difficult to improve on, even by trained
operators.
The Bayesian optimization method takes a different approach. This method treats the search for the
optimal hyperparameters as an optimization problem.
When choosing the next hyperparameter combination, this method considers the previous evaluation
results.
It then applies a probabilistic function to select the combination that will probably yield the best results.
This method discovers a fairly good hyperparameter combination in relatively few iterations.
Data scientists choose a probabilistic model when the objective function is unknown. That is, there is
no analytical expression to maximize or minimize.
The data scientists apply the learning algorithm to a data set, use the algorithm’s results to define the
objective function, and take the various hyperparameter combinations as the input domain. The
probabilistic model is based on past evaluation results.
It estimates the probability of a hyperparameter combination’s objective function result:
P( result | hyperparameters )
This probabilistic model is a “surrogate” of the objective function. The objective function can be, for
instance, the root-mean-square error (RMSE). We calculate the objective function using the training data
with the hyperparameter combination. We try to optimize it (maximize or minimize, depending on the
objective function selected). Applying the probabilistic model to the hyperparameters is computationally
inexpensive compared to the objective function, so this method typically updates and improves the surrogate
probability model every time the objective function runs. Better hyperparameter predictions decrease the
number of objective function evaluations we need to achieve a good result. Gaussian processes, random forest
regression, and tree-structured Parzen estimators (TPE) are surrogate model examples.