Regression and Classification Trees Overview
Regression and Classification Trees Overview
Yeil Kwon
Korea University
International Winter Campus
Predictors
- Years: the number of years that he has played in the major leagues.
6.00 6.74
The tree has two internal nodes and three terminal nodes, or leaves.
The number in each leaf is the mean of the response for the observations that
fall there.
Yeil Kwon KU-IWC 3 / 36
Regression Trees
238
R3
Hits
R1 117.5
R2
1
1 4.5 24
Years
We first select the predictor xj and the cutpoint s such that splitting the
predictor space into two regions leads to the greatest possible reduction in
RSS.
For any j and s, let R1 (j, s) = {x|xj < s} and R2 (j, s){x|xj ≥ s}. we seek the
value of j and s that minimize the function
X X
f (j, s) = (yi − ŷR1 )2 + (yi − ŷR2 )2
i:xi ∈R1 (j,s) i:xi ∈R2 (j,s)
where ŷRm is the mean response for the observations in Rm (j, s) for m = 1, 2.
We repeat the process, looking for the best predictor and best cutpoint in
order to split the data further so as to minimize the RSS within each of the
resulting regions.
R5
R2 t4
X2
X2
R3
t2 R4
R1
t1 t3
X1 X1
X1 ≤ t1
|
X2 ≤ t2 X1 ≤ t3
X2 ≤ t4
R1 R2 R3
X2
X1
R4 R5
Grow a very large tree T0 , and then prune it back in order to obtain a subtree.
is as small as possible.
|T | indicates the number of terminal nodes of the tree T .
1. Use recursive binary splitting to grow a large tree on the training data,
stopping only when each terminal node has fewer than some minimum number
of observations.
2. Apply cost complexity pruning to the large tree in order to obtain a sequence
of best subtrees, as a function of α.
(a) Repeat Steps 1 and 2 on all but the k-th fold of the training data.
(b) Evaluate the mean squared prediction error on the data in the left-out
k-th fold, as a function of α.
(c) Average the results for each value of α, and pick α to minimize the
average error.
1.0
Training
Cross−Validation
Test
0.8
Mean Squared Error
0.6
0.4
0.2
0.0
2 4 6 8 10
Tree Size
Regression Trees: the predicted response is given by the mean response of the
obs that belong to the same terminal node.
In the classification setting, RSS cannot be used as a criterion for making the
binary splits.
The classification error rate is the fraction of the observations in that region
that do not belong to the most common class.
p̂mk represents the proportion of training observations in the m-th region that
are from the k-th class.
Gini index takes on a small value if all of the p̂mk ’s are close to zero or one
After splitting T into two subsets T1 (m=1, (−)) and T2 (m=2, (+)) with sizes
n1 and n2 (n = n1 + n2 ), the gini index of the split data is defined as
n1 n2
Ginisplit (T ) = Gini(T1 ) + Gini(T2 )
n n
At xj =55
0 10 3 2 7 2
Ginisplit = + 1− − = 0.420
10 10 10 10
At xj =105
2 2 2 2
6 3 3 4 0 4
Ginisplit = 1− − + 1− − = 0.300
10 6 6 10 4 4
After splitting T into two subsets T1 (m=1, (−)) and T2 (m=2, (+)) with sizes
n1 and n2 (n = n1 + n2 ), the gini index of the split data is defined as
n1 n2
Ginisplit (T ) = Gini(T1 ) + Gini(T2 )
n n
At xj =85
At xj =115
After splitting T into two subsets T1 (m=1, (−)) and T2 (m=2, (+)) with sizes
n1 and n2 (n = n1 + n2 ), the gini index of the split data is defined as
n1 n2
Ginisplit (T ) = Gini(T1 ) + Gini(T2 )
n n
xj 55 65 75 85 95 105 115 125
− + − + − + − + − + − + − + − +
Yes (k=1) 0 3 0 3 0 3 1 2 2 1 3 0 3 0 3 0
No (k=2) 0 7 1 6 2 5 3 4 3 4 3 4 4 3 5 2
Gini 0.420 0.400 0.375 0.400 0.300 0.375
At xj =85
2 2 2 2
4 1 3 6 2 4
Ginisplit = 1− − + 1− − = 0.417
10 4 4 10 6 6
At xj =115
2 2 2 2
7 3 4 3 0 3
Ginisplit = 1− − + 1− − = 0.343
10 7 7 10 3 3
Definition
K
X
D=− p̂mk ln p̂mk
k=1
D will take on a value near zero if p̂mk ’s are all near zero or near one.
Like the Gini index, the Cross-Entropy will take on a small value if the m-th
node is pure.
It turns out that the Gini index and the cross-entropy are quite similar
numerically.
A binary outcome HD for 303 patients who presented with chest pain.
Thal:a
|
Thal:a
|
0.6
Training
Cross−Validation
Test
0.5
0.4
Error
0.3
Yes Yes
0.1
No No
No Yes
5 10 15
Tree Size
Regression Trees
M
X
f (x) = cm · I(x ∈ Rm )
m=1
If the relationship between the features and the response is well approximated
by a linear model then the linear regression approach will outperform a
regression tree
2
1
1
X2
X2
0
0
−1
−1
−2
−2
−2 −1 0 1 2 −2 −1 0 1 2
X1 X1
2
2
1
1
X2
X2
0
0
−1
−1
−2
−2
−2 −1 0 1 2 −2 −1 0 1 2
X1 X1
Trees are very easy to explain to people: easier to explain than linear
regression.
Trees can easily handle qualitative predictors without the need to create
dummy variables.
Trees are very easy to explain to people: easier to explain than linear
regression.
This means that if we split the training data into two parts at random, and fit
a decision tree to both halves, the results that we get could be quite different.
In contrast, a procedure with low variance will yield similar results if applied
repeatedly to distinct data sets; for example, linear regression tends to have
low variance, if the ratio of n to p is moderately large.
These trees are grown deep, and are not pruned ⇒ each individual tree has
high variance, but low bias.
For a given test observation, we can record the class predicted by each of the B
trees,
Take a majority vote: the overall prediction is the most commonly occurring
class among the B predictions.
0.30
0.25
Error
0.20
0.15
Test: Bagging
Test: RandomForest
OOB: Bagging
0.10
OOB: RandomForest
Number of Trees
The test error (black and orange) is shown as a function of B, the number of
bootstrapped training sets used.
√
Random forests were applied with m = p.
The dashed line: the test error resulting from a single classification tree.
The green and blue: the OOB error, which in this case is considerably lower.
Yeil Kwon KU-IWC 26 / 36
Out-of-Bag Error Estimation
It turns out that there is a very straightforward way to estimate the test error
of a bagged model, without the need to perform cross-validation.
Recall that the key to bagging is that trees are repeatedly fit to bootstrapped
subsets of the observations.
One can show that on average, each bagged tree makes use of around
two-thirds of the observations.
The remaining one-third of the observations not used to fit a given bagged tree
are referred to as the out-of-bag (OOB) observations.
We can predict the response for the i-th observation using each of the trees in
which that observation was OOB.
This will yield around B/3 predictions for the ith observation.
In order to obtain a single prediction for the i-th observation, we can average
these predicted responses (if regression is the goal) or can take a majority vote
(if classification is the goal).
An OOB prediction can be obtained in this way for each of the n observations,
from which the overall OOB MSE (for a regression problem) or classification
error (for a classification problem) can be computed.
The resulting OOB error is a valid estimate of the test error for the bagged
model.
In the Figure of the OOB error on the Heart data, it can be shown that with
B sufficiently large, OOB error is virtually equivalent to LOOCV error.
Recall that one of the advantages of decision trees is the attractive and easily
interpreted diagram that results.
One can obtain an overall summary of the importance of each predictor using
In the case of bagging regression trees, we can record the total amount that the
RSS is decreased due to splits over a given predictor, averaged over all B trees.
In the context of bagging classification trees, we can add up the total amount
that the Gini index is decreased by splits over a given predictor, averaged over
all B trees.
Fbs
RestECG
ExAng
Sex
Slope
Chol
Age
RestBP
MaxHR
Oldpeak
ChestPain
Ca
Thal
0 20 40 60 80 100
Variable Importance
Variable importance is computed using the mean decrease in Gini index
The are expressed relative to the maximum.
But when building these decision trees, each time a split in a tree is
considered, a random sample of m predictors is chosen as split candidates from
the full set of p predictors.
In other words, in building a random forest, at each split in the tree, the
algorithm is not even allowed to consider a majority of the available predictors.
Suppose that there is one very strong predictor in the data set, along with a
number of other moderately strong predictors.
Then in the collection of bagged trees, most or all of the trees will use this
strong predictor in the top split.
Consequently, all of the bagged trees will look quite similar to each other.
Hence the predictions from the bagged trees will be highly correlated.
In particular, this means that bagging will not lead to a substantial reduction
in variance over a single tree in this setting.
Random forests overcome this problem by forcing each split to consider only a
subset of the predictors.
Therefore, on average (p − m)/p of the splits will not even consider the strong
predictor, and so other predictors will have more of a chance.
We can think of this process as decorrelating the trees, thereby making the
average of the resulting trees less variable and hence more reliable.
For instance, if a random forest is built using m = p, then this amounts simply
to bagging.
√
On the Heart data, random forests using m = p leads to a reduction in both
test error and OOB error over bagging.
– Goal: to predict cancer type based on the 500 genes that have the largest
variance in the training set.
We randomly divided the observations into a training and a test set, and
applied random forests to the training set for three different values of the
number of splitting variables m.
m=p
m=p/2
0.5
m= p
0.4
0.3
0.2
Number of Trees
The 15-class gene expression data set with p = 500 predictors
Each colored line corresponds to a different value of m, the number of
predictors available for splitting at each interior tree node.
√
Random forests (m = p) lead to a slight improvement over bagging (m = p).
A single classification tree has an error rate of 45.7 %.
Yeil Kwon KU-IWC 36 / 36