UNIT IV
MODEL EVALUATION
Performance metrics - Baseline Models - Hyperparameters: Manual Hyperparameter –
Automatic Hyperparameter - Grid search - Random search - Debugging strategies.
4.1 PERFORMANCE METRICS
Determining our goals, in terms of which error metric to use, is a necessary first step because
the error metric will guide all of our future actions. We should also have an idea of what level of
performance we desire. Performance metrics are a part of every machine learning pipeline.
For most applications, it is impossible to achieve absolute zero error. The Bayes error defines
the minimum error rate that we can hope to achieve, even if we have infinite training data and can
recover the true probability distribution. This is because input features may not contain complete
information about the output variable or because the system might be intrinsically stochastic. We will
also be limited by having a finite amount of training data.
Training data can be limited for a variety of reasons:
1. The cost of collecting more data.
2. Data collection can require time, money or human suffering (for example, when it involves
invasive medical tests).
Several different performance metrics may be used to measure the effectiveness of a
complete application that includes machine learning components. These performance metrics are
usually different from the cost function used to train the model.
Metrics are different from loss functions. Loss functions show a measure of model
performance. They're used to train a machine learning model and they're usually differentiable in the
model's parameters. Metrics are used to monitor and measure the performance of a model (during
training and testing), and don't need to be differentiable.
Measures of performance need to satisfy several criteria:
1. They must coherently capture the aspect of performance of interest;
2. They must be intuitive enough to become widely used, so that the same measures are consistently
reported by researchers, enabling community-wide conclusions to be drawn.
3. They must be computationally tractable, to match the rapid growth in scale of modern data collection.
Regression metrics:
Regression models have continuous output. So, we need a metric based on calculating some
sort of distance between predicted and ground truth.
Following metrics are used for regression:
a) Mean Absolute Error (MAE),
b) Mean Squared Error (MSE),
c) Root Mean Squared Error (RMSE),
d) R² (R-Squared)
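The four regression metrics listed above can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation: the function name regression_metrics and the sample values are assumptions made for the example, and R² is computed as 1 minus the ratio of the residual sum of squares to the total sum of squares.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R^2 for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n          # mean absolute error
    mse = sum(e * e for e in errors) / n           # mean squared error
    rmse = math.sqrt(mse)                          # root mean squared error
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                     # undefined if y_true is constant
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}

# Hypothetical ground-truth values and predictions, chosen only for illustration
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

MAE and RMSE are in the same units as the target, which often makes them easier to interpret than MSE.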
Classification metrics:
Classification models have discrete output, so we need a metric that compares discrete
classes in some form. Classification Metrics evaluate a model's performance and tell user how good or
bad the classification is, but each of them evaluates it in a different way.
Classification metrics are Accuracy, Confusion Matrix, Precision and Recall, F1-score and AUC-ROC.
1. Accuracy:
Classification accuracy is perhaps the simplest metric to use and implement and is defined as
the number of correct predictions divided by the total number of predictions.
2. Confusion matrix:
Confusion Matrix is a tabular visualization of the ground-truth labels versus model predictions.
Each row of the confusion matrix represents the instances in a predicted class and each column
represents the instances in an actual class. Confusion Matrix is not exactly a performance metric but
sort of a basis on which other metrics evaluate the results.
A confusion matrix contains information about actual and predicted classification done by a
classification system. Performance of such systems is commonly evaluated using the data in the matrix.
Confusion matrix is also called a contingency table.
a) False positives: Examples predicted as positive, which are from the negative class.
b) False negatives: Examples predicted as negative, whose true class is positive.
c) True positives: Examples correctly predicted as pertaining to the positive class.
d) True negatives: Examples correctly predicted as belonging to the negative class.
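As a rough sketch of how the four cells are counted for a binary problem, the snippet below tallies them from two label lists and derives accuracy from the counts. The function name confusion_counts and the label lists are hypothetical, and the positive class is assumed to be encoded as 1.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1   # correctly predicted positive
        elif predicted == positive:
            fp += 1   # predicted positive, actually negative
        elif actual == positive:
            fn += 1   # predicted negative, actually positive
        else:
            tn += 1   # correctly predicted negative
    return tp, fp, fn, tn

# Hypothetical labels, chosen only for illustration
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)

# Accuracy is the share of correct predictions among all predictions
accuracy = (tp + tn) / (tp + fp + fn + tn)
print(tp, fp, fn, tn, accuracy)
```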
3. Precision:
Precision is the ratio of true positives to all examples predicted as positive, TP / (TP + FP). A
precision score close to 1 signifies that almost every example the model labeled positive really is
positive, that is, the model makes few false positives (Type-I errors). What it cannot measure is the
existence of Type-II errors, which are false negatives.
A low precision score (< 0.5) means your classifier has a high number of false positives, which
can be an outcome of class imbalance or untuned model hyperparameters.
4. Recall:
Recall is the ratio of true positives to all actual positives, TP / (TP + FN). The recall
metric focuses on Type-II errors (FN). A Type-II error occurs when we fail to reject a false null
hypothesis. Recall close to 1 signifies that your model missed almost none of the true positives.
A low recall score (< 0.5) means your classifier has a high number of false negatives,
which can be an outcome of class imbalance or untuned model hyperparameters.
5. F1-score: The F1-score metric uses a combination of precision and recall. In fact, the F1 score is the
harmonic mean of the two: F1 = 2 × Precision × Recall / (Precision + Recall).
A high F1 score indicates high precision as well as high recall. It presents a
good balance between precision and recall and is more informative than accuracy on imbalanced
classification problems. Low recall means that we found only a small share of the
actual positive cases. Low precision means that, among the cases we identified as positive
cases, we didn't get many of them right.
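The relationship between the three scores can be made concrete with a small sketch that starts from confusion-matrix counts. The function name and the counts are hypothetical, and returning 0.0 when a denominator is zero is a convention assumed here, not something the text prescribes.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    # Assumed convention: a score is 0.0 when its denominator is zero
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Hypothetical counts: 8 true positives, 2 false positives, 8 false negatives
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Because the harmonic mean is pulled toward the smaller value, the F1 score here lies closer to the recall of 0.5 than the arithmetic mean of the two scores would.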
In some applications, it is possible for the machine learning system to refuse to make
a decision. This is useful when the machine learning algorithm can estimate how
confident it should be about a decision, especially if a wrong decision can be harmful
and if a human operator is able to occasionally take over.
4.2 BASELINE MODELS
What is a baseline?
Before an athlete runs a marathon, they train for months. Training generally consists of phases,
but the only way to know if they're making any progress is to keep times of previous races.
Without training, an average person may be able to outrun a tortoise, but they're very unlikely to
be able to outrun a dog even after training.
This is because there is an underlying limit to how much better we can get.
A baseline is a method that uses heuristics, simple summary statistics, randomness, or machine
learning to create predictions for a dataset. You can use these predictions to measure the baseline's
performance (e.g., accuracy)-- this metric will then become what you compare any other machine
learning algorithm against.
A baseline can help us determine if the Model built provides any improvement over and above
the baseline or not.
A baseline should be the first model you make, in practice, academia and theory.
Here are a few baselines :
1. MNIST Logistic Regression Baseline
2. Baseline Approach - Linear Regression
3. Simple Exploration+Baseline - GA Customer Revenue
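One of the simplest baselines built from a summary statistic is to always predict the most frequent class in the training set. The sketch below assumes hypothetical label lists and a hypothetical helper name; any trained classifier should beat the accuracy it reports before being considered useful.

```python
from collections import Counter

def majority_class_baseline(train_labels):
    """Return a predictor that ignores its input and always outputs
    the most frequent training label."""
    most_common_label, _count = Counter(train_labels).most_common(1)[0]
    return lambda _features: most_common_label

# Hypothetical imbalanced labels: class 0 dominates
train_labels = [0] * 80 + [1] * 20
test_labels = [0] * 15 + [1] * 5

predict = majority_class_baseline(train_labels)
correct = sum(predict(None) == label for label in test_labels)
baseline_accuracy = correct / len(test_labels)
print(f"Baseline accuracy: {baseline_accuracy:.2f}")
```

On imbalanced data such a baseline can already reach a high accuracy, which is one reason to report precision and recall alongside it.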
4.3 HYPERPARAMETERS
A Machine Learning model is defined as a mathematical model with several parameters that
need to be learned from the data. By training a model with existing data, we can fit the model
parameters.
However, there is another kind of parameter, known as Hyperparameters, that cannot be directly
learned from the regular training process. They are usually fixed before the actual training process
begins. These parameters express important properties of the model such as its complexity or how fast
it should learn. This section explores various strategies to tune hyperparameters for machine
learning models.
4.3.1 Hyperparameter Tuning
Hyperparameter tuning is the process of selecting the optimal values for a machine
learning model’s hyperparameters. Hyperparameters are settings that control the learning process of
the model, such as the learning rate, the number of neurons in a neural network, or the kernel type in a
support vector machine. The goal of hyperparameter tuning is to find the values that lead to the best
performance on a given task.
What are Hyperparameters?
In the context of machine learning, hyperparameters are configuration variables that are set
before the training process of a model begins. They control the learning process itself, rather than being
learned from the data. Hyperparameters are often used to tune the performance of a model, and they
can have a significant impact on the model’s accuracy, generalization, and other metrics.
Different Types of Hyperparameters
Hyperparameters are distinct from model parameters, which are the weights and biases that are
learned from the data. Which hyperparameters matter depends on the model family, as the
following subsections show.
4.3.2 Hyperparameters in Neural Networks
Neural networks have several essential hyperparameters that need to be adjusted, including:
Learning rate: This hyperparameter controls the step size taken by the optimizer during each
iteration of training. Too small a learning rate can result in slow convergence, while too large
a learning rate can lead to instability and divergence.
Epochs: This hyperparameter represents the number of times the entire training dataset is
passed through the model during training. Increasing the number of epochs can improve the
model’s performance but may lead to overfitting if not done carefully.
Number of layers: This hyperparameter determines the depth of the model, which can have
a significant impact on its complexity and learning ability.
Number of nodes per layer: This hyperparameter determines the width of the model,
influencing its capacity to represent complex relationships in the data.
Architecture: This hyperparameter determines the overall structure of the neural
network, including the number of layers, the number of neurons per layer, and the
connections between layers. The optimal architecture depends on the complexity of the task
and the size of the dataset
Activation function: This hyperparameter introduces non-linearity into the model, allowing it
to learn complex decision boundaries. Common activation functions include
sigmoid, tanh, and Rectified Linear Unit (ReLU).
4.3.3 Hyperparameters in Support Vector Machine
We take into account some essential hyperparameters for fine-tuning SVMs:
C: The regularization parameter that controls the trade-off between the margin and the
number of training errors. A larger value of C penalizes training errors more heavily, resulting
in a smaller margin and a closer fit to the training data, which may lead to overfitting. A smaller
value of C allows for more training errors in exchange for a wider margin, which often generalizes
better but may underfit if C is too small.
Kernel: The kernel function that defines the similarity between data points. Different kernels
can capture different relationships between data points, and the choice of kernel can
significantly impact the performance of the SVM. Common kernels include linear, polynomial,
radial basis function (RBF), and sigmoid.
Gamma: The parameter that controls how far the influence of a single training example
reaches. A larger value of gamma means each example influences only its close neighbourhood,
giving a more complex decision boundary, while a smaller value means the influence reaches
farther, giving a smoother boundary. The choice of gamma is particularly important for RBF kernels.
4.3.4 Hyperparameters in XGBoost
The following essential XGBoost hyperparameters need to be adjusted:
learning_rate: This hyperparameter shrinks the contribution of each new tree added during
boosting. A larger learning rate can lead to faster convergence, but it may also
increase the risk of overfitting. A smaller learning rate may result in slower convergence but
can help prevent overfitting.
n_estimators: This hyperparameter determines the number of boosting trees to be trained.
A larger number of trees can improve the model’s accuracy, but it can also increase the risk
of overfitting. A smaller number of trees may result in lower accuracy but can help prevent
overfitting.
max_depth: This hyperparameter determines the maximum depth of each tree in the
ensemble. A larger max_depth can allow the trees to capture more complex relationships in
the data, but it can also increase the risk of overfitting. A smaller max_depth may result in less
complex trees but can help prevent overfitting.
min_child_weight: This hyperparameter determines the minimum sum of instance weight
(hessian) needed in a child node. A larger min_child_weight can help prevent overfitting by
requiring more data to influence the splitting of trees. A smaller min_child_weight may allow
for more aggressive tree splitting but can increase the risk of overfitting.
subsample: This hyperparameter determines the percentage of rows used for each tree
construction. A smaller subsample can improve the efficiency of training but may reduce the
model’s accuracy. A larger subsample can increase the accuracy but may make training more
computationally expensive.
Some other examples of model hyperparameters include:
1. The penalty in Logistic Regression Classifier i.e. L1 or L2 regularization
2. Number of Trees and Depth of Trees for Random Forests.
3. The learning rate for training a neural network.
4. Number of Clusters for Clustering Algorithms.
5. The k in k-nearest neighbors.
Fig:4.1 Hyperparameter tuning process
Grid search trains the machine learning algorithm for all combinations of hyperparameters;
this process should be guided by a performance metric, typically measured using the "cross-
validation" techniques on the training set.
Random search samples the search space and evaluates sets from a specified probability
distribution.
Bayesian optimization is an informed search algorithm, which means that each iteration of this
algorithm learns from the previous ones and the results of one iteration help in choosing the next one.
Bayesian optimization resembles the random search method in that it evaluates only a sample of the
search space, but it picks each new sample using the results of earlier ones.
Genetic algorithms (GA) are another option. When applying GA to the hyperparameter
optimization problem, each hyperparameter is represented by a chromosome and the value of the
hyperparameter is set to the decimal value of the representative chromosome. Every chromosome has
several genes represented in binary digit format.
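The binary encoding used by a genetic algorithm can be sketched as follows. The helper names decode and flip_gene, the four-gene chromosome, and the choice of max_depth as the encoded hyperparameter are all assumptions made for illustration; the single-bit mutation shown is a standard GA operator that the text above does not describe in detail.

```python
def decode(genes):
    """Interpret a list of binary genes as an unsigned decimal integer."""
    return int("".join(str(bit) for bit in genes), 2)

def flip_gene(genes, index):
    """Return a mutated copy of the chromosome with one gene flipped."""
    mutated = list(genes)
    mutated[index] = 1 - mutated[index]
    return mutated

# Hypothetical chromosome encoding a tree-depth hyperparameter
chromosome = [0, 1, 1, 0]
max_depth = decode(chromosome)              # binary 0110

mutated = flip_gene(chromosome, 0)          # binary 1110
mutated_max_depth = decode(mutated)

print(max_depth, mutated_max_depth)
```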
4.4 MANUAL HYPERPARAMETER:
To choose hyperparameters manually, one must understand the relationship between
hyperparameters, training error, generalization error and computational resources.
The goal of manual hyperparameter search is usually to find the lowest generalization error
subject to some runtime and memory budget.
Choosing them manually requires an understanding of what each hyperparameter does and of how
machine learning models achieve good generalization.
Goal of hyperparameter search:
a) Adjust effective capacity of model to match complexity of task
b) Capacity is controlled by
1. Representational capacity of model
2. Ability of learning algorithm to minimize the cost
3. Degree to which the cost function and training procedure regularize the model
A model with more layers and more hidden nodes per layer has higher representational capacity,
but the learning algorithm may not be able to learn every function the model can represent.
Generalization error, plotted as a function of a hyperparameter, typically follows a U-shaped
curve. Not every hyperparameter will be able to explore
the entire U-shaped curve. Many hyperparameters are discrete, such as the number of units in a layer or
the number of linear pieces in a max-out unit, so it is only possible to visit a few points along the curve.
Some hyperparameters are binary.
4.4.1 Learning Rate:
The learning rate is perhaps the most important hyperparameter. It controls model capacity in a
more complicated way than other hyperparameters. Effective capacity is highest when the
learning rate is correct for the optimization problem, not when it is especially large or small.
The amount by which the weights are updated during training is referred to as the step size
or the learning rate. Specifically, the learning rate is a configurable hyperparameter
used in the training of neural networks that has a small positive value, often in the
range between 0.0 and 1.0.
For example, if the learning rate is 0.1, then the weights in the network are updated by 0.1 ×
(estimated weight error), or 10% of the estimated weight error, each time the weights
are updated. The learning rate hyper-parameter controls the rate or speed at which
the model learns.
Learning rates are tricky because they end up being specific to the dataset and even
to other hyper-parameters. This creates a lot of overhead for finding the right setting
for hyper-parameters.
Large learning rates make the model learn faster, but they may cause
us to miss the minimum of the loss function and only reach its surroundings. In cases
where the learning rate is too large, the optimizer overshoots the minimum and the
updates lead to divergent behavior.
On the other hand, choosing lower learning rate values gives a better chance of finding
a local minimum, with the trade-off of needing a larger number of epochs and more time.
Momentum can accelerate learning on those problems where the high-dimensional
weight space that is being navigated by the optimization process has structures that
mislead the gradient descent algorithm, such as flat regions or steep curvature.
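The effect of the learning rate on convergence can be seen on a toy problem. The sketch below runs plain gradient descent (without momentum) on the one-dimensional loss f(w) = w², whose gradient is 2w; the function name, the starting point and the three learning rates are assumptions chosen only to show slow convergence, fast convergence and divergence.

```python
def gradient_descent(learning_rate, steps=20, w0=5.0):
    """Minimize f(w) = w**2 (gradient 2*w) and return the final w."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w                 # derivative of w**2
        w = w - learning_rate * grad   # standard gradient descent update
    return w

# Too small, reasonable, and too large learning rates (illustrative values)
final_w = {lr: gradient_descent(lr) for lr in (0.01, 0.1, 1.1)}
for lr, w in final_w.items():
    print(f"learning_rate={lr}: final w = {w:.4f}")
```

With the smallest rate the weight is still far from the minimum at 0 after 20 steps, with the middle rate it is close to 0, and with the largest rate each update overshoots so the magnitude of w grows instead of shrinking.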
4.5 AUTOMATIC HYPERPARAMETER:
In principle it is possible to develop hyperparameter optimization algorithms that wrap
a learning algorithm and choose its hyperparameters, thus hiding the hyperparameters
from the user. But hyperparameter optimization algorithms
have their own hyperparameters, such as the range of values to be explored for
each hyperparameter of the learning algorithm.
Manual hyperparameter tuning can work very well when the user has a good starting
point, such as one determined by others having worked on the same type of
application and architecture, or when the user has months or years of experience in
exploring hyperparameter values for neural networks applied to similar tasks.
However, for many applications, these starting points are not available. In these cases,
automated algorithms can find useful values of the hyperparameters.
Hyperparameter Tuning techniques
Models can have many hyperparameters, and finding the best combination of values can be
treated as a search problem. Three commonly used strategies for hyperparameter tuning are:
1. Grid Search CV
2. Randomized Search CV
3. Bayesian Optimization
4.6 GRID SEARCH:
Grid search is used to find the hyperparameter values of a model that result in the most accurate
predictions. When there are three or fewer hyperparameters, the common practice is to perform grid
search.
Grid search builds a model for every combination of hyperparameters specified and evaluates each
model.
The user selects a small set of values to be explored for each hyperparameter. Grid search then trains
a model for every joint specification of hyperparameter values.
Fig:4.2 Grid and random search
The hyperparameter values are arranged in a matrix-like structure (the grid), and the model is
trained on every combination of hyperparameter values. The model with
the best performance is then selected.
Grid search usually performs best when it is performed repeatedly, with each round using a finer
grid centered on the best values found in the previous round.
EXAMPLE PROGRAM
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
import numpy as np
from sklearn.datasets import make_classification
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=10, n_classes=2, random_state=42)
# Creating the hyperparameter grid
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space}
# Instantiating logistic regression classifier
logreg = LogisticRegression()
# Instantiating the GridSearchCV object
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)
# Assuming X and y are your feature matrix and target variable
# Fit the GridSearchCV object to the data
logreg_cv.fit(X, y)
# Print the tuned parameters and score
print("Tuned Logistic Regression Parameters: {}".format(logreg_cv.best_params_))
print("Best score is {}".format(logreg_cv.best_score_))
Output:
Tuned Logistic Regression Parameters: {'C': 0.006105402296585327}
Best score is 0.853
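When more than one hyperparameter is tuned, the grid contains every joint combination of values, so the number of runs is the product of the list lengths. The library-free sketch below makes this explicit. The function grid_search, the scoring function toy_score (a stand-in for a cross-validated score) and the value lists are all hypothetical.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Evaluate score_fn on every combination in param_grid and
    return the best parameters, best score and number of runs."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    n_evaluated = 0
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)       # in practice: a cross-validated score
        n_evaluated += 1
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score, n_evaluated

def toy_score(params):
    """Hypothetical score that peaks at C=1.0, gamma=0.1."""
    return -((params["C"] - 1.0) ** 2) - (params["gamma"] - 0.1) ** 2

param_grid = {"C": [0.1, 1.0, 10.0], "gamma": [0.01, 0.1, 1.0]}
best_params, best_score, n_evaluated = grid_search(param_grid, toy_score)
print(best_params, n_evaluated)
```

Three values for each of two hyperparameters give 3 × 3 = 9 runs; adding a third hyperparameter with three values would triple that, which is why grid search is usually reserved for a small number of hyperparameters.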
4.7 RANDOM SEARCH:
As its name suggests, random search uses random combinations of hyperparameters. This means
that not all of the parameter values are tried; instead, parameter values are sampled for a fixed
number of iterations.
While grid search looks at every possible combination of hyperparameters to find the best model,
random search only selects and tests a random combination of hyperparameters. This technique
randomly samples from a grid of hyperparameters instead of conducting an exhaustive search.
We can specify the total number of runs the random search should try before returning the best
model. Each hyperparameter is usually sampled independently of the others.
When there are many hyperparameters, random search may be preferred, because the number of
combinations that grid search must evaluate grows exponentially with the number of hyperparameters.
The main reason why random search finds good solutions faster than grid search is that there are
no wasted experimental runs, unlike in the case of grid search, when two values of a hyperparameter
(given the values of the other hyperparameters) would give the same result.
In the case of grid search, the other hyperparameters would have the same values for these two
runs, whereas with random search, they would usually have different values. Hence if the change
between these two values makes little difference in terms of validation set error, grid
search will unnecessarily repeat two equivalent experiments while random search will still give two
independent explorations of the other hyperparameters.
EXAMPLE PROGRAM
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier
# Generate a synthetic dataset for illustration
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, n_classes=2,
                           random_state=42)
# Lists or distributions from which each hyperparameter is sampled
param_dist = {
    "max_depth": [3, None],
    "max_features": randint(1, 9),
    "min_samples_leaf": randint(1, 9),
    "criterion": ["gini", "entropy"],
}
# Instantiating the classifier to be tuned
tree = DecisionTreeClassifier()
tree_cv = RandomizedSearchCV(tree, param_dist, cv=5)
tree_cv.fit(X, y)
print("Tuned Decision Tree Parameters: {}".format(tree_cv.best_params_))
print("Best score is {}".format(tree_cv.best_score_))
Output (values vary from run to run, since no random seed is set for the search):
Tuned Decision Tree Parameters: {'criterion': 'entropy', 'max_depth': None, 'max_features': 8,
'min_samples_leaf': 7}
Best score is 0.842
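The argument about wasted runs can be illustrated by counting how many distinct values of one hyperparameter each method tries within the same budget. In the sketch below, the hyperparameter names (learning rate, dropout), their ranges, the log-uniform sampling and the fixed seed are assumptions made for illustration.

```python
import random
from itertools import product

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Grid search: 3 values per hyperparameter, 3 x 3 = 9 runs
grid_lr = [1e-3, 1e-2, 1e-1]
grid_dropout = [0.1, 0.3, 0.5]
grid_runs = list(product(grid_lr, grid_dropout))

# Random search: also 9 runs, each hyperparameter sampled independently
# (learning rate log-uniform in [1e-3, 1e-1], dropout uniform in [0.1, 0.5])
random_runs = [
    (10 ** rng.uniform(-3, -1), rng.uniform(0.1, 0.5))
    for _ in range(len(grid_runs))
]

distinct_grid_lr = len({lr for lr, _ in grid_runs})
distinct_random_lr = len({lr for lr, _ in random_runs})
print(distinct_grid_lr, distinct_random_lr)
```

With the same budget of nine runs, the grid explores only three learning rates while random search explores nine. If dropout turns out to matter little, the grid has effectively repeated each learning rate three times, whereas every random run still tests a new learning rate.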
4.8 DEBUGGING STRATEGIES
When a machine learning system performs poorly, it is difficult to tell whether the poor performance is
intrinsic to the algorithm itself or whether there is an implementation bug. In most cases, we do not
know a priori what the intended behavior of the algorithm is.
The entire point of machine learning is that it will discover useful behavior we were not able to specify
ourselves.
If the classification test error rate is 5%, we cannot tell whether this is expected behavior or suboptimal
behavior.
A further difficulty is that most machine learning models have multiple parts that are each adaptive.
If one part is broken, the other parts can adapt and still achieve roughly acceptable performance. For
example, suppose that we are training a neural net with several layers parametrized by weights W and
biases b.
Suppose further that we have manually implemented the gradient descent rule for each parameter
separately, and we made an error in the update for the biases: b ← b − α, where α is the learning rate.
This erroneous update does not use the gradient at all. It causes the biases to constantly become
negative throughout learning, which is clearly not a correct implementation of any reasonable learning
algorithm. The bug may not be apparent just from examining the output of the model though. Depending
on the distribution of the input, the weights may be able to adapt to compensate for the negative biases.
4.8.1 Important Debugging Tests:
1. Visualize the model in action: Directly observing the machine learning model performing its
task will help to determine whether the quantitative performance numbers it achieves seem
reasonable. Evaluation bugs can be some of the most devastating bugs because they can mislead
us into believing the system is performing well when it is not.
2. Visualize the worst mistakes: Most models are able to output some sort of confidence
measure for the task they perform. The probability assigned to the most likely class thus gives an
estimate of the confidence the model has in its classification decision. Typically, maximum likelihood
training results in these values being overestimates rather than accurate probabilities of correct
prediction. Even so, viewing the examples that the model finds hardest to get right can often reveal
problems with the way the data has been preprocessed or labeled.
3. Reasoning about software using train and test error: If training error is low but test error is
high, then it is likely that the training procedure works correctly, and the model is overfitting for
fundamental algorithmic reasons. An alternative possibility is that the test error is measured
incorrectly due to a problem with saving the model after training then reloading it for test set
evaluation, or if the test data was prepared differently from the training data. If both train and test
error are high, then it is difficult to determine whether there is a software defect or whether the model
is underfitting due to fundamental algorithmic reasons.
4. Fit a tiny dataset: If there is high error on the training set, determine whether it is due to
underfitting or due to a software defect. Usually even small models can be guaranteed to be able to fit
a sufficiently small dataset, so failure to fit it points to a software defect.
5. Compare back-propagated derivatives to numerical derivatives: If we are using a software
framework that requires us to implement our own gradient computations, then a common source of error
is implementing this gradient expression incorrectly. One way to verify that these derivatives are
correct is to compare the derivatives computed by our implementation to the
derivatives computed by finite differences.
6. Monitor histograms of activations and gradients: It is often useful to visualize statistics of
neural network activations and gradients, collected over a large number of training iterations. The
pre-activation value of hidden units can tell us if the units saturate, or how often they do. It is useful
to compare the magnitude of parameter gradients to the magnitude of the parameters themselves.
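The comparison in test 5 can be sketched with a centered finite difference. The loss function, its hand-written gradient, the deliberately buggy gradient and the step size eps are all assumptions chosen for illustration; in a real model the analytic gradient would come from the back-propagation code under test.

```python
def loss(w):
    """Toy loss of two parameters: w0^2 * w1 + 3 * w1."""
    return w[0] ** 2 * w[1] + 3.0 * w[1]

def analytic_grad(w):
    """Hand-derived gradient of loss, the quantity being verified."""
    return [2.0 * w[0] * w[1], w[0] ** 2 + 3.0]

def buggy_grad(w):
    """Same gradient with a deliberate mistake: the '+ 3' term is missing."""
    return [2.0 * w[0] * w[1], w[0] ** 2]

def numerical_grad(f, w, eps=1e-5):
    """Centered finite differences: (f(w + eps) - f(w - eps)) / (2 * eps)."""
    grad = []
    for i in range(len(w)):
        w_plus = list(w)
        w_minus = list(w)
        w_plus[i] += eps
        w_minus[i] -= eps
        grad.append((f(w_plus) - f(w_minus)) / (2.0 * eps))
    return grad

w = [1.5, -2.0]
reference = numerical_grad(loss, w)
max_diff_ok = max(abs(a - n) for a, n in zip(analytic_grad(w), reference))
max_diff_bug = max(abs(a - n) for a, n in zip(buggy_grad(w), reference))
print(f"correct gradient diff: {max_diff_ok:.2e}, buggy gradient diff: {max_diff_bug:.2e}")
```

A correct gradient agrees with the numerical estimate up to small rounding error, while the buggy version differs by roughly the size of the missing term, which makes the defect easy to spot.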
PART A
1. Why can training data be limited?
2. Define learning rate.
3. What is grid search?
4. What are hyperparameters?
5. Define random search.
6. What is the main reason why random search finds good solutions faster than grid search?
7. When does manual hyperparameter tuning work well?
8. Define precision.
9. How is capacity controlled in manual hyperparameter tuning?
10. What criteria must measures of performance satisfy?
11. What are the metrics used for regression?
12. List the important debugging Tests.
PART B
1. Illustrate performance metrics in detail.
2. Examine hyperparameters and hyperparameter tuning techniques.
3. Explain in detail about debugging strategies.
4. Analyze grid search with a suitable example.
5. Summarize baseline models.