Overfitting
“Overfitting” is a problem that plagues all machine
learning methods. It occurs when a classifier fits the
training data too tightly and doesn’t generalize well
to independent test data.
Figure: the green line represents an overfitted
model and the black line represents a
regularized model. While the green line
best follows the training data, it is too
dependent on that data and it is likely to
have a higher error rate on new unseen
data, compared to the black line.
Figure: noisy (roughly linear) data is fitted to a linear function and a
polynomial function. Although the polynomial function is a
perfect fit, the linear function can be expected to generalize
better: if the two functions were used to extrapolate beyond the
fitted data, the linear function should make better predictions.
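The comparison described above can be reproduced with a minimal sketch. The data points, the noise values, and the helper names (`fit_line`, `fit_interpolating_polynomial`) are hypothetical and chosen only for illustration: a least-squares line is compared with a polynomial that passes through every training point.

```python
# Hypothetical, roughly linear data: y = 2x + 1 plus small hand-picked noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
noise = [0.3, -0.4, 0.2, -0.1, 0.4, -0.3]
ys = [2 * x + 1 + e for x, e in zip(xs, noise)]


def fit_line(xs, ys):
    """Ordinary least-squares line; returns a prediction function."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_interpolating_polynomial(xs, ys):
    """Polynomial of degree n-1 through every point (Lagrange form)."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict


def max_abs_error(model, xs, ys):
    """Largest absolute error of the model on the given points."""
    return max(abs(model(x) - y) for x, y in zip(xs, ys))


line = fit_line(xs, ys)
poly = fit_interpolating_polynomial(xs, ys)

# Error on the training points: the polynomial is a perfect fit.
train_err_line = max_abs_error(line, xs, ys)
train_err_poly = max_abs_error(poly, xs, ys)

# Error when extrapolating beyond the fitted data (noise-free target).
x_new = 7.0
y_true = 2 * x_new + 1
extrap_err_line = abs(line(x_new) - y_true)
extrap_err_poly = abs(poly(x_new) - y_true)

print("training error:", train_err_line, train_err_poly)
print("extrapolation error:", extrap_err_line, extrap_err_poly)
```

On this toy data the polynomial has zero training error but a far larger extrapolation error than the line, which is the pattern the figure describes.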
As an example, if the number of parameters is the
same as or greater than the number of observations,
then a model can perfectly predict the training data
simply by memorizing the data in its entirety. Such a
model, though, will typically fail severely when
making predictions on new data.
To lessen the chance or amount of overfitting,
several techniques are available (cross-validation,
early stopping, pruning,…).
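As a rough illustration of memorization and of how cross-validation exposes it, the sketch below uses a hypothetical dataset whose only feature is a unique ID, so it carries no real signal. The "model" is a lookup table with one entry per training instance; the names `train_memorizer` and `cross_validate` are made up for this example.

```python
from collections import Counter

# Hypothetical data: (unique ID, class). The ID says nothing about the class.
data = [(i, "yes") for i in range(6)] + [(i, "no") for i in range(6, 10)]


def train_memorizer(train):
    """Lookup table with one entry per training instance.

    Unseen inputs fall back to the majority class of the training data.
    """
    table = dict(train)
    fallback = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda x: table.get(x, fallback)


def accuracy(model, instances):
    """Fraction of instances the model classifies correctly."""
    return sum(model(x) == y for x, y in instances) / len(instances)


def cross_validate(data, k):
    """k-fold cross-validation; fold f holds out every k-th instance from f."""
    correct = 0
    for f in range(k):
        test = data[f::k]
        train = [d for i, d in enumerate(data) if i % k != f]
        model = train_memorizer(train)
        correct += sum(model(x) == y for x, y in test)
    return correct / len(data)


train_acc = accuracy(train_memorizer(data), data)
cv_acc = cross_validate(data, k=len(data))  # k = n, i.e. leave-one-out

print("training accuracy:", train_acc)
print("cross-validation accuracy:", cv_acc)
```

The memorizer is perfect on the data it was trained on, while the cross-validation estimate drops to what the majority-class fallback achieves.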
This is the numeric version of the weather problem, where
temperature and humidity are numbers and not nominal
values. If you think about how OneR works, when it comes to
making a rule on the attribute “temperature”, it’s going to make
a complex rule that branches 14 different ways for the 14
different instances of the dataset.
Each rule is going to have zero errors: it’s going to get it
exactly right. If we branch on temperature, we’re going to
get a perfect rule, with a total error count of zero. OneR
has a parameter that limits the complexity of the rule.
Open the numeric weather data. Run OneR with cross-validation. The
rule is based on the “outlook” attribute. Remove the outlook
attribute, and try it again. Now it branches on humidity. If humidity is
less than 82.5%, it’s a “yes” day; if it’s greater than 82.5%, it’s a “no”
day, and that gets 10 out of 14 instances correct.
Configure the classifier by clicking on it. We
see that there’s a parameter called
minBucketSize: the minimum bucket size
used for discretizing numeric attributes.
It’s set to 6 by default.
As minBucketSize increases, accuracy on the
training set steadily decreases. The rule is
largest when minBucketSize = 1, and it
decreases in size as minBucketSize increases.
Change that value to 1. The rule now branches
many different ways on the “temperature”
attribute. This rule is overfitted to the
dataset. It’s a very accurate rule on the
training data, but it won’t generalize well to
independent test data.
minBucketSize   Training set   Cross-validation (CV)
6               71.4%          50%
1               92.8%          35.7%
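The effect of a minimum bucket size can be sketched as follows. This is a simplified discretization written for illustration, not Weka's exact OneR algorithm, and the (value, class) pairs are hypothetical rather than the real weather data. A bucket is closed once its majority class has at least `min_bucket` instances, and neighbouring buckets that predict the same class are merged.

```python
from collections import Counter

# Hypothetical (value, class) pairs for one numeric attribute.
pairs = [(64, "yes"), (65, "no"), (68, "yes"), (69, "yes"),
         (70, "yes"), (71, "no"), (72, "no"), (75, "yes"),
         (80, "no"), (81, "yes"), (83, "yes"), (85, "no")]


def build_rule(pairs, min_bucket):
    """Return a list of (upper_bound, predicted_class, class_counts)."""
    ordered = sorted(pairs)
    buckets, current = [], Counter()
    for i, (value, label) in enumerate(ordered):
        current[label] += 1
        is_last = i == len(ordered) - 1
        # Never split between two instances that share the same value.
        can_split = is_last or ordered[i + 1][0] != value
        if can_split and (is_last or max(current.values()) >= min_bucket):
            majority = current.most_common(1)[0][0]
            buckets.append((value, majority, current))
            current = Counter()
    # Merge neighbouring buckets that predict the same class.
    rule = []
    for upper, majority, counts in buckets:
        if rule and rule[-1][1] == majority:
            rule[-1] = (upper, majority, rule[-1][2] + counts)
        else:
            rule.append((upper, majority, counts))
    return rule


def training_accuracy(rule, n):
    """Fraction of training instances that fall in their bucket's majority."""
    return sum(counts[majority] for _, majority, counts in rule) / n


rule_1 = build_rule(pairs, min_bucket=1)
rule_3 = build_rule(pairs, min_bucket=3)
size_1, size_3 = len(rule_1), len(rule_3)
acc_1 = training_accuracy(rule_1, len(pairs))
acc_3 = training_accuracy(rule_3, len(pairs))

print("min_bucket=1:", size_1, "branches, training accuracy", acc_1)
print("min_bucket=3:", size_3, "branches, training accuracy", acc_3)
```

On this toy data the smaller bucket size gives a larger rule with higher training accuracy, the same direction of change as in the table above. Only a held-out evaluation such as cross-validation can show whether the extra branches generalize.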
Open “diabetes” dataset. All the attributes are numeric,
and the class is either tested_negative or
tested_positive. Let’s run ZeroR to get a baseline figure
for this dataset which is 65% with cross-validation (cv).
Let’s run OneR with cv, with default parameter settings
– that is a value of 6 for OneR’s parameter that controls
rule complexity. We get 71.5%. We’re evaluating using
cross-validation, and OneR outperforms the baseline
accuracy by quite a bit – 71% versus 65%.
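For reference, the ZeroR baseline simply predicts the most frequent class. The sketch below assumes the commonly reported class counts for this dataset (500 tested_negative, 268 tested_positive); those counts are an assumption, not something stated in these notes, and `zero_r` is a hypothetical helper name.

```python
from collections import Counter


def zero_r(labels):
    """Return the majority class and the accuracy of always predicting it."""
    majority, count = Counter(labels).most_common(1)[0]
    return majority, count / len(labels)


# Assumed class distribution (not taken from the text above).
labels = ["tested_negative"] * 500 + ["tested_positive"] * 268

majority_class, baseline = zero_r(labels)
print(majority_class, round(baseline, 3))
```

Under that assumption the baseline comes out at about 65%, in line with the ZeroR figure quoted above.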
If we look at the rule, it
branches on “plas”. This is
the plasma-glucose
concentration. So depending
on which of these regions
the plasma-glucose
concentration falls into,
we’re going to predict a
negative or a positive
outcome. That seems like
quite a sensible rule.
Change OneR’s parameter to make it overfit.
We’ll configure OneR, find the minBucketSize
parameter, and change it to 1.
When we run OneR again, we get 57%
accuracy, quite a bit lower than the ZeroR
baseline of 65%. If you look at the rule, it’s
testing a different attribute, “pedi”, which
is the diabetes pedigree function.
You can see that this attribute has a lot of
different values, and it looks like we’re
branching on pretty well every single one. That
gives us poor performance when evaluated by
cross-validation.
If you were to evaluate it on the training set, you would
see 87.5% accuracy, which is very good for this dataset.
Of course, that figure is completely misleading. The
rule is strongly overfitted to the training dataset, and
doesn’t generalize well to independent test sets. That’s
a good example of overfitting.
minBucketSize   Training set   Cross-validation (CV)
6               76.4%          71.4%
1               87.5%          57.1%
Exercises
1. Open the [Link] dataset and inspect the data
using the Edit button of Weka’s Preprocess panel. What is the
maximum accuracy of rules based on temperature and
humidity respectively, in terms of the number of training
instances predicted correctly?
a) temperature : 12 correct instances
b) temperature : 13 correct instances
c) temperature : 14 correct instances
d) humidity: 10 correct instances
e) humidity : 12 correct instances
f) humidity : 14 correct instances
2. The following questions investigate the effect of
OneR’s minBucketSize parameter on performance and rule complexity by
drawing graphs where –B (minBucketSize) ranges from 1 to 10.
Open the [Link] dataset, go to the Classify tab, and select OneR. Make a
rough paper-and-pencil plot of accuracy on the training data (on the vertical
axis) against minBucketSize (on the horizontal axis) and compare it with the
graphs below. Which of these shapes do you get?
3. Make a rough pencil-and-paper plot of cross-
validation accuracy against minBucketSize. Which of
these do you get?
4. Using paper and pencil, plot the size of the rule
generated against minBucketSize. Which of these
plots do you get?
MODEL EVALUATION METRICS
IN CLASSIFICATION
Open the [Link] dataset. Use OneR
with default parameters and test with cross
validation.
In the classifier output, you can see detailed
accuracy by class and the metrics.
TP Rate = TP/(TP+FN)
= 4/9 = 0.444 for the yes class
= 2/5 = 0.400 for the no class
Weighted average (WA) = ((0.444*9)+(0.400*5))/14 = 0.429
FP Rate = FP/(FP+TN)
= 3/5 = 0.600 for the yes class
= 5/9 = 0.556 for the no class
WA = ((0.600*9)+(0.556*5))/14 = 0.584
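The arithmetic above can be checked with a short sketch. The four counts are the ones implied by the fractions in these notes for the yes class (TP = 4, FN = 5, FP = 3, TN = 2); the helper names `rates` and `weighted` are hypothetical.

```python
# Counts for the "yes" class, as implied by the fractions above.
TP, FN, FP, TN = 4, 5, 3, 2


def rates(tp, fn, fp, tn):
    """Return (TP rate, FP rate) for one class."""
    return tp / (tp + fn), fp / (fp + tn)


tpr_yes, fpr_yes = rates(TP, FN, FP, TN)
# For the "no" class the roles swap: its true positives are the yes-class TN.
tpr_no, fpr_no = rates(TN, FP, FN, TP)

# Class sizes used as weights: 9 actual "yes" and 5 actual "no" instances.
n_yes, n_no = TP + FN, FP + TN


def weighted(value_yes, value_no):
    """Average of per-class values weighted by class size."""
    return (value_yes * n_yes + value_no * n_no) / (n_yes + n_no)


wa_tpr = weighted(tpr_yes, tpr_no)
wa_fpr = weighted(fpr_yes, fpr_no)
print(round(wa_tpr, 3), round(wa_fpr, 3))
```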
Precision = TP/(TP+FP)
= 4/(4+3) = 0.571 for the yes class
= 2/(2+5) = 0.286 for the no class
WA = ((0.571*9)+(0.286*5))/14 = 0.469
Recall = TP/(TP+FN)
= 4/(4+5) = 0.444 for the yes class
= 2/(2+3) = 0.400 for the no class
WA = ((0.444*9)+(0.400*5))/14 = 0.429
F-measure = 2*Precision*Recall/(Precision+Recall)
= 2*0.571*0.444/(0.571+0.444) = 0.500 for the yes class
= 2*0.286*0.400/(0.286+0.400) = 0.333 for the no class
WA = ((0.500*9)+(0.333*5))/14 = 0.440
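Precision, recall and F-measure can be derived per class from a confusion matrix. The sketch below is one possible way to do it: the matrix holds the counts implied by the calculations in these notes, and the layout `confusion[actual][predicted]` and the function name `per_class` are choices made for this example.

```python
classes = ["yes", "no"]

# confusion[actual][predicted], using the counts implied by the notes.
confusion = {
    "yes": {"yes": 4, "no": 5},
    "no": {"yes": 3, "no": 2},
}


def per_class(c):
    """Return (precision, recall, F-measure) treating class c as positive."""
    tp = confusion[c][c]
    fn = sum(confusion[c][p] for p in classes if p != c)
    fp = sum(confusion[a][c] for a in classes if a != c)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Number of actual instances per class, used as weights.
support = {c: sum(confusion[c].values()) for c in classes}
total = sum(support.values())

results = {c: per_class(c) for c in classes}
# Weighted averages of precision, recall and F-measure, in that order.
weighted_avg = [
    sum(results[c][k] * support[c] for c in classes) / total
    for k in range(3)
]

for c in classes:
    print(c, [round(v, 3) for v in results[c]])
print("WA", [round(v, 3) for v in weighted_avg])
```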
MCC = Matthews correlation coefficient
MCC = (TP * TN - FP * FN) / √((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
Like most correlation coefficients, MCC ranges
between -1 and 1, where 1 is perfect
agreement between actuals and predictions,
zero is no better than chance, and -1 is
total disagreement.
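As a quick check of the formula, here is a minimal sketch that computes MCC from the four counts. The example counts are the yes-class values implied earlier in these notes (TP = 4, TN = 2, FP = 3, FN = 5). Returning 0 when the denominator is zero is a common convention and is an assumption here, not something the notes specify.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for a binary confusion matrix."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # assumed convention when a row or column is empty
    return (tp * tn - fp * fn) / denominator


mcc_weather = mcc(tp=4, tn=2, fp=3, fn=5)
mcc_perfect = mcc(tp=9, tn=5, fp=0, fn=0)
print(round(mcc_weather, 3), mcc_perfect)
```

A slightly negative value such as this one means the predictions agree with the actual classes a little less often than chance would.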
An ROC curve (receiver operating characteristic curve)
is a graph showing the performance of a classification
model across all classification thresholds. It plots the
TP rate against the FP rate.
PRC area is the area under the precision-recall
curve (PRC), which plots precision against recall
(sensitivity). It is often considered a more
appropriate measure for unbalanced datasets than the
ROC curve.
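The area under the ROC curve can also be read as the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one. The sketch below uses that pairwise interpretation, with ties counted as one half. The scores and labels are hypothetical, and `roc_auc` is a name made up for this example.

```python
def roc_auc(scores, labels, positive="yes"):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for s, y in zip(scores, labels) if y == positive]
    neg = [s for s, y in zip(scores, labels) if y != positive]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores (higher means "more likely yes").
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = ["yes", "yes", "no", "yes", "no", "no"]

auc = roc_auc(scores, labels)
print(round(auc, 3))
```

Here 8 of the 9 positive/negative pairs are ranked correctly, so the area is 8/9.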