0% found this document useful (0 votes)
20 views31 pages

Evaluating Machine Learning Algorithms

The document discusses the evaluation of machine learning algorithms, emphasizing the importance of using separate training, validation, and test sets to assess generalization and avoid overfitting. It introduces key concepts such as the confusion matrix, accuracy metrics, and the Receiver Operating Characteristic (ROC) curve for comparing classifiers. Additionally, it addresses challenges with unbalanced datasets and suggests metrics like Matthew's Correlation Coefficient for more accurate evaluation.

Uploaded by

msec.ac.in
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
20 views31 pages

Evaluating Machine Learning Algorithms

The document discusses the evaluation of machine learning algorithms, emphasizing the importance of using separate training, validation, and test sets to assess generalization and avoid overfitting. It introduces key concepts such as the confusion matrix, accuracy metrics, and the Receiver Operating Characteristic (ROC) curve for comparing classifiers. Additionally, it addresses challenges with unbalanced datasets and suggests metrics like Matthew's Correlation Coefficient for more accurate evaluation.

Uploaded by

msec.ac.in
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

MODULE 2

Preliminaries  19

FIGURE 2.4 The volume of the unit hypersphere for different numbers of dimensions.

make predictions on data inputs. In the next section we consider how to evaluate how well
an algorithm actually achieves this.

2.2 KNOWING WHAT YOU KNOW: TESTING MACHINE LEARNING ALGO-


RITHMS
The purpose of learning is to get better at predicting the outputs, be they class labels or
continuous regression values. The only real way to know how successfully the algorithm has
learnt is to compare the predictions with known target labels, which is how the training is
done for supervised learning. This suggests that one thing you can do is just to look at the
error that the algorithm makes on the training set.
However, we want the algorithms to generalise to examples that were not seen in the
training set, and we obviously can’t test this by using the training set. So we need some
different data, a test set, to test it on as well. We use this test set of (input, target) pairs by
feeding them into the network and comparing the predicted output with the target, but we
don’t modify the weights or other parameters for them: we use them to decide how well the
algorithm has learnt. The only problem with this is that it reduces the amount of data that
we have available for training, but that is something that we will just have to live with.

2.2.1 Overfitting
Unfortunately, things are a little bit more complicated than that, since we might also want
to know how well the algorithm is generalising as it learns: we need to make sure that we
do enough training that the algorithm generalises well. In fact, there is at least as much
danger in over-training as there is in under-training. The number of degrees of variability in
most machine learning algorithms is huge — for a neural network there are lots of weights,
and each of them can vary. This is undoubtedly more variation than there is in the function
we are learning, so we need to be careful: if we train for too long, then we will overfit the
data, which means that we have learnt about the noise and inaccuracies in the data as well
as the actual function. Therefore, the model that we learn will be much too complicated,
and won’t be able to generalise.
Figure 2.5 shows this by plotting the predictions of some algorithm (as the curve) at
20  Machine Learning: An Algorithmic Perspective

FIGURE 2.5 The effect of overfitting is that rather than finding the generating function
(as shown on the left), the neural network matches the inputs perfectly, including the
noise in them (on the right). This reduces the generalisation capabilities of the network.

two different points in the learning process. On the left of the figure the curve fits the
overall trend of the data well (it has generalised to the underlying general function), but
the training error would still not be that close to zero since it passes near, but not through,
the training data. As the network continues to learn, it will eventually produce a much
more complex model that has a lower training error (close to zero), meaning that it has
memorised the training examples, including any noise component of them, so that is has
overfitted the training data.
We want to stop the learning process before the algorithm overfits, which means that
we need to know how well it is generalising at each timestep. We can’t use the training data
for this, because we wouldn’t detect overfitting, but we can’t use the testing data either,
because we’re saving that for the final tests. So we need a third set of data to use for this
purpose, which is called the validation set because we’re using it to validate the learning so
far. This is known as cross-validation in statistics. It is part of model selection: choosing the
right parameters for the model so that it generalises as well as possible.

2.2.2 Training, Testing, and Validation Sets


We now need three sets of data: the training set to actually train the algorithm, the validation
set to keep track of how well it is doing as it learns, and the test set to produce the final
results. This is becoming expensive in data, especially since for supervised learning it all has
to have target values attached (and even for unsupervised learning, the validation and test
sets need targets so that you have something to compare to), and it is not always easy to
get accurate labels (which may well be why you want to learn about the data). The area of
semi-supervised learning attempts to deal with this need for large amounts of labelled data;
see the Further Reading section for some references.
Clearly, each algorithm is going to need some reasonable amount of data to learn from
(precise needs vary, but the more data the algorithm sees, the more likely it is to have seen
examples of each possible type of input, although more data also increases the computational
time to learn). However, the same argument can be used to argue that the validation and
Preliminaries  21

FIGURE 2.6 The dataset is split into different sets, some for training, some for validation,
and some for testing.

test sets should also be reasonably large. Generally, the exact proportion of training to
testing to validation data is up to you, but it is typical to do something like 50:25:25 if you
have plenty of data, and 60:20:20 if you don’t. How you do the splitting can also matter.
Many datasets are presented with the first set of datapoints being in class 1, the next in
class 2, and so on. If you pick the first few points to be the training set, the next the test
set, etc., then the results are going to be pretty bad, since the training did not see all the
classes. This can be dealt with by randomly reordering the data first, or by assigning each
datapoint randomly to one of the sets, as is shown in Figure 2.6.
If you are really short of training data, so that if you have a separate validation set there
is a worry that the algorithm won’t be sufficiently trained; then it is possible to perform
leave-some-out, multi-fold cross-validation. The idea is shown in Figure 2.7. The dataset is
randomly partitioned into K subsets, and one subset is used as a validation set, while the
algorithm is trained on all of the others. A different subset is then left out and a new model
is trained on that subset, repeating the same process for all of the different subsets. Finally,
the model that produced the lowest validation error is tested and used. We’ve traded off
data for computation time, since we’ve had to train K different models instead of just one.
In the most extreme case of this there is leave-one-out cross-validation, where the algorithm
is validated on just one piece of data, training on all of the rest.

2.2.3 The Confusion Matrix


Regardless of how much data we use to test the trained algorithm, we still need to work
out whether or not the result is good. We will look here at a method that is suitable for
classification problems that is known as the confusion matrix. For regression problems things
are more complicated because the results are continuous, and so the most common thing
to use is the sum-of-squares error that we will use to drive the training in the following
chapters. We will see these methods being used as we look at examples.
The confusion matrix is a nice simple idea: make a square matrix that contains all the
possible classes in both the horizontal and vertical directions and list the classes along the
top of a table as the predicted outputs, and then down the left-hand side as the targets. So
for example, the element of the matrix at (i, j) tells us how many input patterns were put
22  Machine Learning: An Algorithmic Perspective

FIGURE 2.7 Leave-some-out, multi-fold cross-validation gets around the problem of data
shortage by training many models. It works by splitting the data into sets, training a
model on most sets and holding one out for validation (and another for testing). Different
models are trained with different sets being held out.

into class i in the targets, but class j by the algorithm. Anything on the leading diagonal
(the diagonal that starts at the top left of the matrix and runs down to the bottom right)
is a correct answer. Suppose that we have three classes: C1 , C2 , and C3 . Now we count the
number of times that the output was class C1 when the target was C1 , then when the target
was C2 , and so on until we’ve filled in the table:
Outputs
C1 C2 C3
C1 5 1 0
C2 1 4 1
C3 2 0 4
This table tells us that, for the three classes, most examples were classified correctly, but
two examples of class C3 were misclassified as C1 , and so on. For a small number of classes
this is a nice way to look at the outputs. If you just want one number, then it is possible
to divide the sum of the elements on the leading diagonal by the sum of all of the elements
in the matrix, which gives the fraction of correct responses. This is known as the accuracy,
and we are about to see that it is not the last word in evaluating the results of a machine
learning algorithm.

2.2.4 Accuracy Metrics


We can do more to analyse the results than just measuring the accuracy. If you consider
the possible outputs of the classes, then they can be arranged in a simple chart like this
Preliminaries  23

(where a true positive is an observation correctly put into class 1, while a false positive is an
observation incorrectly put into class 1, while negative examples (both true and false) are
those put into class 2):
True False
Positives Positives
False True
Negatives Negatives
The entries on the leading diagonal of this chart are correct and those off the diagonal are
wrong, just as with the confusion matrix. Note, however, that this chart and the concepts
of false positives, etc., are based on binary classification.
Accuracy is then defined as the sum of the number of true positives and true negatives
divided by the total number of examples (where # means ‘number of’, and TP stands for
True Positive, etc.):
#T P + #F P
Accuracy = . (2.2)
#T P + #F P + #T N + #F N
The problem with accuracy is that it doesn’t tell us everything about the results, since it
turns four numbers into just one. There are two complementary pairs of measurements that
can help us to interpret the performance of a classifier, namely sensitivity and specificity,
and precision and recall. Their definitions are shown next, followed by some explanation.

#T P
Sensitivity = (2.3)
#T P + #F N
#T N
Specificity = (2.4)
#T N + #F P
#T P
Precision = (2.5)
#T P + #F P
#T P
Recall = (2.6)
#T P + #F N

Sensitivity (also known as the true positive rate) is the ratio of the number of correct
positive examples to the number classified as positive, while specificity is the same ratio
for negative examples. Precision is the ratio of correct positive examples to the number of
actual positive examples, while recall is the ratio of the number of correct positive examples
out of those that were classified as positive, which is the same as sensitivity. If you look
at the chart again you can see that sensitivity and specificity sum the columns for the
denominator, while precision and recall sum the first column and the first row, and so miss
out some information about how well the learner does on the negative examples.
Together, either of these pairs of measures gives more information than just the accuracy.
If you consider precision and recall, then you can see that they are to some extent inversely
related, in that if the number of false positives increases (meaning that the algorithm is
using a broader definition of that class), then the number of false negatives often decreases,
and vice versa. They can be combined to give a single measure, the F1 measure, which can
be written in terms of precision and recall as:
precision × recall
F1 = 2 (2.7)
precision + recall
24  Machine Learning: An Algorithmic Perspective

FIGURE 2.8 An example of an ROC curve. The diagonal line represents exactly chance, so
anything above the line is better than chance, and the further from the line, the better. Of
the two curves shown, the one that is further away from the diagonal line would represent
a more accurate method.

and in terms of the numbers of false positives, etc. (from which it can be seen that it
computes the mean of the false examples) as:
#T P
F1 = . (2.8)
#T P + (#F N + #F P )/2

2.2.5 The Receiver Operator Characteristic (ROC) Curve


Since we can use these measures to evaluate a particular classifier, we can also compare clas-
sifiers – either the same classifier with different learning parameters, or completely different
classifiers. In this case, the Receiver Operator Characteristic curve (almost always known just
as the ROC curve) is useful. This is a plot of the percentage of true positives on the y axis
against false positives on the x axis; an example is shown in Figure 2.8. A single run of a
classifier produces a single point on the ROC plot, and a perfect classifier would be a point
at (0, 1) (100% true positives, 0% false positives), while the anti-classifier that got everything
wrong would be at (1,0); so the closer to the top-left-hand corner the result of a classifier
is, the better the classifier has performed. Any classifier that sits on the diagonal line from
(0,0) to (1,1) behaves exactly at the chance level (assuming that the positive and negative
classes are equally common) and so presumably a lot of learning effort is wasted since a fair
coin would do just as well.
In order to compare classifiers, or choices of parameters settings for the same classifier,
you could just compute the point that is furthest from the ‘chance’ line along the diagonal.
However, it is normal to compute the area under the curve (AUC) instead. If you only have
one point for each classifier, the curve is the trapezoid that runs from (0,0) up to the point
and then from there to (1,1). If there are more points (based on more runs of the classifier,
such as trained and/or tested on different datasets), then they are just included in order
along the diagonal line.
The key to getting a curve rather than a point on the ROC curve is to use cross-
validation. If you use 10-fold cross-validation, then you have 10 classifiers, with 10 different
Preliminaries  25

test sets, and you also have the ‘ground truth’ labels. The true labels can be used to produce
a ranked list of the different cross-validation-trained results, which can be used to specify
a curve through the 10 datapoints on the ROC curve that correspond to the results of this
classifier. By producing an ROC curve for each classifier it is possible to compare their
results.

2.2.6 Unbalanced Datasets


Note that for the accuracy we have implicitly assumed that there are the same number
of positive and negative examples in the dataset (which is known as a balanced dataset).
However, this is often not true (this can potentially cause problems for the learners as well,
as we shall see later in the book). In the case where it is not, we can compute the balanced
accuracy as the sum of sensitivity and specificity divided by 2. However, a more correct
measure is Matthew’s Correlation Coefficient, which is computed as:

#T P × #T N − #F P × #F N
M CC = p (2.9)
(#T P + #F P )(#T P + #F N )(#T N + #F P )(#T N + #F N )
If any of the brackets in the denominator are 0, then the whole of the denominator is
set to 1. This provides a balanced accuracy computation.
As a final note on these methods of evaluation, if there are more than two classes and
it is useful to distinguish the different types of error, then the calculations get a little more
complicated, since instead of one set of false positives and one set of false negatives, you
have some for each class. In this case, specificity and recall are not the same. However, it is
possible to create a set of results, where you use one class as the positives and everything
else as the negatives, and repeat this for each of the different classes.

2.2.7 Measurement Precision


There is a different way to evaluate the accuracy of a learning system, which unfortunately
also uses the word precision, although with a different meaning. The concept here is to treat
the machine learning algorithm as a measurement system. We feed in inputs and look at
the outputs that we get. Even before comparing them to the target values, we can measure
something about the algorithm: if we feed in a set of similar inputs, then we would expect to
get similar outputs for them. This measure of the variability of the algorithm is also known
as precision, and it tells us how repeatable the predictions that the algorithm makes are. It
might be useful to think of precision as being something like the variance of a probability
distribution: it tells you how much spread around the mean to expect.
The point is that just because an algorithm is precise it does not mean that it is ac-
curate – it can be precisely wrong if it always gives the wrong prediction. One measure
of how well the algorithm’s predictions match reality is known as trueness, and it can be
defined as the average distance between the correct output and the prediction. Trueness
doesn’t usually make much sense for classification problems unless there is some concept of
certain classes being similar to each other. Figure 2.9 illustrates the idea of trueness and
precision in the traditional way: as a darts game, with four examples with varying trueness
and precision for the three darts thrown by a player.
This section has considered the endpoint of machine learning, looking at the outputs,
and thinking about what we need to do with the input data in terms of having multiple
datasets, etc. In the next section we return to the starting point and consider how we can
start analysing a dataset by dealing with probabilities.
26  Machine Learning: An Algorithmic Perspective

FIGURE 2.9 Assuming that the player was aiming for the highest-scoring triple 20 in darts
(the segments each score the number they are labelled with, the narrow band on the
outside of the circle scores double and the narrow band halfway in scores triple; the outer
and inner ‘bullseye’ at the centre score 25 and 50, respectively), these four pictures show
different outcomes. Top left: very accurate: high precision and trueness, top right: low
precision, but good trueness, bottom left: high precision, but low trueness, and bottom
right: reasonable trueness and precision, but the actual outputs are not very good. (Thanks
to Stefan Nowicki, whose dartboard was used for these pictures.)
Preliminaries  27

FIGURE 2.10 A histogram of feature values (x) against their probability for two classes.

2.3 TURNING DATA INTO PROBABILITIES


Take a look at the plot in Figure 2.10. It shows the measurements of some feature x for
two classes, C1 and C2 . Members of class C2 tend to have larger values of feature x than
members of class C1 , but there is some overlap between the two classes. The correct class is
fairly easy to predict at the extremes of the range, but what to do in the middle is unclear.
Suppose that we are trying to classify writing of the letters ‘a’ and ‘b’ based on their height
(as shown in Figure 2.11). Most people write their ‘a’s smaller than their ‘b’s, but not
everybody. However, in this example, we have a secret weapon. We know that in English
text, the letter ‘a’ is much more common than the letter ‘b’ (we called this an unbalanced
dataset earlier). If we see a letter that is either an ‘a’ or a ‘b’ in normal writing, then there
is a 75% chance that it is an ‘a.’ We are using prior knowledge to estimate the probability
that the letter is an ‘a’: in this example, P (C1 ) = 0.75, P (C2 ) = 0.25. If we weren’t allowed
to see the letter at all, and just had to classify it, then if we picked ‘a’ every time, we’d be
right 75% of the time.
However, when we are asked to make a classification we are also given the value of x.
It would be pretty silly to just use the value of P (C1 ) and ignore the value of x if it might
help! In fact, we are given a training set of values of x and the class that each exemplar
belongs to. This lets us calculate the value of P (C1 ) (we just count how many times out of
the total the class was C1 and divide by the total number of examples), and also another
useful measurement: the conditional probability of C1 given that x has value X: P (C1 |X) .
The conditional probability tells us how likely it is that the class is C1 given that the value
of x is X. So in Figure 2.10 the value of P (C1 |X) will be much larger for small values of X
than for large values. Clearly, this is exactly what we want to calculate in order to perform
classification. The question is how to get to this conditional probability, since we can’t read
it directly from the histogram.
The first thing that we need to do to get these values is to quantise the measurement
x, which just means that we put it into one of a discrete set of values {X}, such as the
bins in a histogram. This is exactly what is plotted in Figure 2.10. Now, if we have lots of
examples of the two classes, and the histogram bins that their measurements fall into, we
can compute P (Ci , Xj ), which is the joint probability, and tells us how often a measurement
of Ci fell into histogram bin Xj . We do this by looking in histogram bin Xj , counting the
number of examples of class Ci that are in it, and dividing by the total number of examples
(of any class).
We can also define P (Xj |Ci ), which is a different conditional probability, and tells us

Common questions

Powered by AI

A confusion matrix provides detailed insights into the performance of a classification algorithm by showing the number of true positives, true negatives, false positives, and false negatives, allowing users to compute various metrics like precision, recall, sensitivity, and specificity . Simple accuracy does not offer this granularity, as it condenses information into one number, which can obscure important performance details, especially in imbalanced datasets .

The number of folds in cross-validation affects a model's evaluation by influencing the variance and computational cost. More folds (such as in leave-one-out cross-validation) provide a more robust estimate of model performance by reducing variance but increase computational load as more models need to be trained . Fewer folds cut down the computational time but may lead to less reliable performance estimates .

A validation set is used to monitor the performance of a machine learning algorithm and prevent overfitting by checking how well the algorithm generalizes during the training process . The training set is used to adjust the parameters or weights of the algorithm, while the test set is employed only once to evaluate the final performance and determine how well the algorithm can generalize to unseen data .

Cross-validation in model selection helps determine how well a model generalizes to unseen data by using separate subsets of data for training and validation in multiple iterations . It makes efficient use of limited data by testing the model across different groups of training/validation sets, reducing the likelihood of overfitting and ensuring a more accurate model assessment .

An ROC curve is a graph plotting the true positive rate against the false positive rate for a classifier, showcasing its performance across different threshold levels . It is used to compare classifiers by assessing the area under the curve (AUC); the closer the curve is to the top-left corner, the better the classifier's performance. Comparing ROC curves helps in selecting the best model among different classifiers or parameter settings .

Overfitting occurs when a machine learning model learns about the noise and inaccuracies in the training data instead of the actual underlying function, resulting in a model that is too complex and cannot generalize well to new data . A visualization can show a simple model fitting the data trend well and an overfitted model matching input data, including noise, thus failing to generalize .

Precision and recall are inversely related; precision measures the accuracy of positive predictions, while recall measures the ability to find all positive instances . Their relationship is crucial because improving precision often reduces recall and vice versa, influencing the overall effectiveness of a classifier. The F1 score combines precision and recall to provide a single metric to evaluate the balance between them .

Semi-supervised learning aims to address the challenge of requiring large amounts of labeled data in machine learning datasets . By leveraging both labeled and unlabeled data, semi-supervised learning reduces the burden of manual labeling, which can be costly and time-consuming, while still aiming to maintain model performance .

Matthew's Correlation Coefficient (MCC) is a metric used to evaluate the quality of binary classifications, especially in cases of imbalanced datasets . MCC considers all four confusion matrix categories, providing a comprehensive measure of correlation between observed and predicted classifications. It is preferred when a balanced perspective on true/false positives and negatives is necessary, as it provides a more nuanced assessment than accuracy .

Balanced accuracy is the average of sensitivity and specificity, providing a measure that accounts for imbalances between positive and negative class sizes in a dataset . Unlike standard accuracy, which can be misleading in unbalanced datasets due to its focus on the overall proportion of correctly classified instances, balanced accuracy gives equal importance to the classification performance on each class, offering a fairer evaluation of model performance .

You might also like