0% found this document useful (0 votes)
6 views36 pages

Regression and Classification Trees Overview

Uploaded by

342667910a
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
6 views36 pages

Regression and Classification Trees Overview

Uploaded by

342667910a
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Chap Chap 8.

Tree Based Model

Yeil Kwon

Korea University
International Winter Campus

Yeil Kwon KU-IWC 1 / 36


Regression Trees

Example: Predicting Baseball Players’ Salaries

Response variable : Salary


- We use log-transformed Salary (in thousands of dollars.)

Predictors
- Years: the number of years that he has played in the major leagues.

- Hits: the number of hits that he made in the previous year.

Yeil Kwon KU-IWC 2 / 36


Regression Trees

Years < 4.5


|

Hits < 117.5


5.11

6.00 6.74

A regression tree for predicting the log(salary) of a baseball player, based on


the number of years and the number of hits.

The tree has two internal nodes and three terminal nodes, or leaves.

The number in each leaf is the mean of the response for the observations that
fall there.
Yeil Kwon KU-IWC 3 / 36
Regression Trees

238

R3

Hits
R1 117.5

R2

1
1 4.5 24
Years

R1 = {x|Y ears < 4.5}


R2 = {x|Y ears ≥ 4.5, Hits < 117.5}, R3 = {x|Y ears ≥ 4.5, Hits ≥ 117.5}
The goal is to find boxes R1 , . . . , RM that minimize the RSS, given by
M
X X
RSS = (yi − ŷRm )2 ,
m=1 i∈Rm

where ŷRm is the mean of yi ’s within the m-th box.


Yeil Kwon KU-IWC 4 / 36
How to Find R1 , . . . , RJ ?

We take a top-down, greedy approach: Recursive Binary Splitting.

We first select the predictor xj and the cutpoint s such that splitting the
predictor space into two regions leads to the greatest possible reduction in
RSS.

For any j and s, let R1 (j, s) = {x|xj < s} and R2 (j, s){x|xj ≥ s}. we seek the
value of j and s that minimize the function
X X
f (j, s) = (yi − ŷR1 )2 + (yi − ŷR2 )2
i:xi ∈R1 (j,s) i:xi ∈R2 (j,s)

where ŷRm is the mean response for the observations in Rm (j, s) for m = 1, 2.

We repeat the process, looking for the best predictor and best cutpoint in
order to split the data further so as to minimize the RSS within each of the
resulting regions.

Yeil Kwon KU-IWC 5 / 36


Regression Trees: Example M = 5

R5

R2 t4

X2

X2
R3

t2 R4

R1

t1 t3

X1 X1

X1 ≤ t1
|

X2 ≤ t2 X1 ≤ t3

X2 ≤ t4

R1 R2 R3
X2
X1

R4 R5

Yeil Kwon KU-IWC 6 / 36


Algorithm: Building a Regression Tree

Grow a very large tree T0 , and then prune it back in order to obtain a subtree.

For each value of α there corresponds a subtree T ⊂ T0 such that


|T |
X X
f (j, s) = (yi − ŷRm )2 + α|T |
m=1 i:xi ∈Rm

is as small as possible.
|T | indicates the number of terminal nodes of the tree T .

The tuning parameter α controls a trade-off between the subtree’s complexity


and its fit to the training data.

We can select a value of α using a validation set or using cross-validation.

Return to the full data set to obtain the subtree corresponding to α.

Yeil Kwon KU-IWC 7 / 36


Algorithm: Building a Regression Tree

1. Use recursive binary splitting to grow a large tree on the training data,
stopping only when each terminal node has fewer than some minimum number
of observations.

2. Apply cost complexity pruning to the large tree in order to obtain a sequence
of best subtrees, as a function of α.

3. Use K-fold cross-validation to choose α. That is, divide the training


observations into K folds. For each k = 1, . . . , K:

(a) Repeat Steps 1 and 2 on all but the k-th fold of the training data.

(b) Evaluate the mean squared prediction error on the data in the left-out
k-th fold, as a function of α.

(c) Average the results for each value of α, and pick α to minimize the
average error.

Yeil Kwon KU-IWC 8 / 36


The Unpruned Trees

Years < 4.5


|

RBI < 60.5 Hits < 117.5

Putouts < 82 Years < 3.5


Years < 3.5
5.487 5.394 6.189
4.622 5.183
Walks < 43.5 Walks < 52.5
Runs < 47.5 RBI < 80.5
6.407 6.549 Years < 6.5
6.015 5.571
7.289
6.459 7.007

Yeil Kwon KU-IWC 9 / 36


Regression Tree Analysis

1.0
Training
Cross−Validation
Test
0.8
Mean Squared Error

0.6
0.4
0.2
0.0

2 4 6 8 10

Tree Size

Yeil Kwon KU-IWC 10 / 36


Classification Trees

To predict a qualitative response rather than a quantitative one.

Regression Trees: the predicted response is given by the mean response of the
obs that belong to the same terminal node.

Classification tree: we predict that each observation belongs to the most


commonly occurring class of obs in the region to which it belongs.

In the classification setting, RSS cannot be used as a criterion for making the
binary splits.

A natural alternative to RSS is the classification error rate.

We plan to assign an observation in a given region to the most commonly


occurring error rate class of training observations in that region.

The classification error rate is the fraction of the observations in that region
that do not belong to the most common class.

Yeil Kwon KU-IWC 11 / 36


Gini Index

A measure of total variance across the K classes


K
X K
X
Gini(T ) = p̂mk (1 − p̂mk ) = 1 − p̂2mk
k=1 k=1

p̂mk represents the proportion of training observations in the m-th region that
are from the k-th class.

Gini index takes on a small value if all of the p̂mk ’s are close to zero or one

Gini index is referred to as a measure of node purity.

Yeil Kwon KU-IWC 12 / 36


Example

Gini Index if k = 1 (Yes) and k = 2 (No).


K
X
Gini(T ) = 1 − p̂2mk = 1 − p̂2m1 − p̂2m2
k=1

After splitting T into two subsets T1 (m=1, (−)) and T2 (m=2, (+)) with sizes
n1 and n2 (n = n1 + n2 ), the gini index of the split data is defined as
n1 n2
Ginisplit (T ) = Gini(T1 ) + Gini(T2 )
n n

xj 55 65 75 85 95 105 115 125


− + − + − + − + − + − + − + − +
Yes (k=1) 0 3 0 3 0 3 1 2 2 1 3 0 3 0 3 0
No (k=2) 0 7 1 6 2 5 3 4 3 4 3 4 4 3 5 2
Gini 0.420 0.400 0.375 0.400 0.300 0.375

At xj =55
0 10 3 2 7 2
     
Ginisplit = + 1− − = 0.420
10 10 10 10

At xj =105
2 2   2  2 
6 3 3 4 0 4
   
Ginisplit = 1− − + 1− − = 0.300
10 6 6 10 4 4

Yeil Kwon KU-IWC 13 / 36


Example

Gini Index if k = 1 (Yes) and k = 2 (No).


K
X
Gini(T ) = 1 − p̂2mk = 1 − p̂2m1 − p̂2m2
k=1

After splitting T into two subsets T1 (m=1, (−)) and T2 (m=2, (+)) with sizes
n1 and n2 (n = n1 + n2 ), the gini index of the split data is defined as
n1 n2
Ginisplit (T ) = Gini(T1 ) + Gini(T2 )
n n

xj 55 65 75 85 95 105 115 125


− + − + − + − + − + − + − + − +
Yes (k=1) 0 3 0 3 0 3 1 2 2 1 3 0 3 0 3 0
No (k=2) 0 7 1 6 2 5 3 4 3 4 3 4 4 3 5 2
Gini 0.420 0.400 0.375 0.400 0.300 0.375

At xj =85

At xj =115

Yeil Kwon KU-IWC 14 / 36


Example

Gini Index if k = 1 (Yes) and k = 2 (No).


K
X
Gini(T ) = 1 − p̂2mk = 1 − p̂2m1 − p̂2m2
k=1

After splitting T into two subsets T1 (m=1, (−)) and T2 (m=2, (+)) with sizes
n1 and n2 (n = n1 + n2 ), the gini index of the split data is defined as
n1 n2
Ginisplit (T ) = Gini(T1 ) + Gini(T2 )
n n
xj 55 65 75 85 95 105 115 125
− + − + − + − + − + − + − + − +
Yes (k=1) 0 3 0 3 0 3 1 2 2 1 3 0 3 0 3 0
No (k=2) 0 7 1 6 2 5 3 4 3 4 3 4 4 3 5 2
Gini 0.420 0.400 0.375 0.400 0.300 0.375
At xj =85
2 2   2  2 
4 1 3 6 2 4
   
Ginisplit = 1− − + 1− − = 0.417
10 4 4 10 6 6

At xj =115
2 2   2  2 
7 3 4 3 0 3
   
Ginisplit = 1− − + 1− − = 0.343
10 7 7 10 3 3

Yeil Kwon KU-IWC 15 / 36


Cross-Entropy

Definition
K
X
D=− p̂mk ln p̂mk
k=1

Note that D ≥ 0, since 0 ≤ p̂mk ≤ 1.

D will take on a value near zero if p̂mk ’s are all near zero or near one.

Like the Gini index, the Cross-Entropy will take on a small value if the m-th
node is pure.

It turns out that the Gini index and the cross-entropy are quite similar
numerically.

Yeil Kwon KU-IWC 16 / 36


Example: Heart data set

A binary outcome HD for 303 patients who presented with chest pain.

An outcome value of “Yes” indicates the presence of heart disease based on an


angiographic test, while “No” means no heart disease.

There are 13 predictors including Age, Sex, Chol (a cholesterol measurement),


Thal (Thalium stress test), and other heart and lung function measurements.

Cross-validation results in a tree with six terminal nodes.

Decision trees can be constructed even in the presence of qualitative predictor


variables.

Yeil Kwon KU-IWC 17 / 36


Classification Tree

Thal:a
|

Ca < 0.5 Ca < 0.5

Slope < 1.5 Oldpeak < 1.1

MaxHR < 161.5 ChestPain:bc Age < 52 Thal:b RestECG < 1


ChestPain:a Yes
RestBP < 157 No Yes Yes
Yes No
Chol < 244 No Chol < 244 Sex < 0.5
MaxHR < 156 No Yes
MaxHR < 145.5 Yes
No
No No No No Yes
No Yes

Thal:a
|
0.6

Training
Cross−Validation
Test
0.5
0.4
Error

0.3

Ca < 0.5 Ca < 0.5


0.2

Yes Yes
0.1

MaxHR < 161.5 ChestPain:bc


0.0

No No
No Yes
5 10 15

Tree Size

Yeil Kwon KU-IWC 18 / 36


Trees Versus Linear Models

Linear Regression Model


p
X
f (x) = β0 + βj xj
j=1

Regression Trees
M
X
f (x) = cm · I(x ∈ Rm )
m=1

If the relationship between the features and the response is well approximated
by a linear model then the linear regression approach will outperform a
regression tree

If there is a highly non-linear and complex relationship between the features


and the response, then decision trees may outperform classical linear model
approaches.

Yeil Kwon KU-IWC 19 / 36


Trees Versus Linear Models

2
1

1
X2

X2
0

0
−1

−1
−2

−2
−2 −1 0 1 2 −2 −1 0 1 2

X1 X1
2

2
1

1
X2

X2
0

0
−1

−1
−2

−2

−2 −1 0 1 2 −2 −1 0 1 2

X1 X1

Yeil Kwon KU-IWC 20 / 36


Advantages and Disadvantages of Trees

Trees are very easy to explain to people: easier to explain than linear
regression.

Decision trees more closely mirror human decision-making.

Trees can be displayed graphically, and are easily interpreted even by a


non-expert

Trees can easily handle qualitative predictors without the need to create
dummy variables.

It tends to have lower level of predictive accuracy compared to the other


methods.

Yeil Kwon KU-IWC 21 / 36


Bagging (Bootstrap aggregation)

Trees are very easy to explain to people: easier to explain than linear
regression.

The decision trees discussed suffer from high variance.

This means that if we split the training data into two parts at random, and fit
a decision tree to both halves, the results that we get could be quite different.

In contrast, a procedure with low variance will yield similar results if applied
repeatedly to distinct data sets; for example, linear regression tends to have
low variance, if the ratio of n to p is moderately large.

Bootstrap aggregation, or bagging, is a general-purpose procedure for


reducing the variance of a statistical learning method.

Yeil Kwon KU-IWC 22 / 36


Bagging (Bootstrap aggregation)

Recall averaging a set of observations reduces variance.

To reduce the variance (and hence increase the prediction accuracy)


– to take many training sets from the population

– build a separate prediction model using each training set

– average the resulting predictions


in other words,
– generate B separate bootstrap training sets from the (single) training
data set

– calculate fˆ1 (x), fˆ2 (x), . . ., fˆB (x)

– average them in order to obtain a single low-variance statistical learning


model
B
1 X ˆb
fˆavg (x) = f (x).
B
b=1

Yeil Kwon KU-IWC 23 / 36


Bagging (Bootstrap aggregation)

To apply bagging to regression trees,

Construct B regression trees using B bootstrapped training sets, and average


the resulting predictions.

These trees are grown deep, and are not pruned ⇒ each individual tree has
high variance, but low bias.

Averaging these B trees reduces the variance.

Bagging has been demonstrated to give impressive improvements in accuracy


by combining together hundreds or even thousands of trees into a single
procedure.

Yeil Kwon KU-IWC 24 / 36


Bagging (Bootstrap aggregation)

To apply bagging to classification trees,

For a given test observation, we can record the class predicted by each of the B
trees,

Take a majority vote: the overall prediction is the most commonly occurring
class among the B predictions.

Yeil Kwon KU-IWC 25 / 36


Bagging and random forest results for the Heat Data

0.30
0.25
Error

0.20
0.15

Test: Bagging
Test: RandomForest
OOB: Bagging
0.10

OOB: RandomForest

0 50 100 150 200 250 300

Number of Trees

The test error (black and orange) is shown as a function of B, the number of
bootstrapped training sets used.

Random forests were applied with m = p.
The dashed line: the test error resulting from a single classification tree.
The green and blue: the OOB error, which in this case is considerably lower.
Yeil Kwon KU-IWC 26 / 36
Out-of-Bag Error Estimation

It turns out that there is a very straightforward way to estimate the test error
of a bagged model, without the need to perform cross-validation.

Recall that the key to bagging is that trees are repeatedly fit to bootstrapped
subsets of the observations.

One can show that on average, each bagged tree makes use of around
two-thirds of the observations.

The remaining one-third of the observations not used to fit a given bagged tree
are referred to as the out-of-bag (OOB) observations.

We can predict the response for the i-th observation using each of the trees in
which that observation was OOB.

This will yield around B/3 predictions for the ith observation.

Yeil Kwon KU-IWC 27 / 36


Out-of-Bag Error Estimation

In order to obtain a single prediction for the i-th observation, we can average
these predicted responses (if regression is the goal) or can take a majority vote
(if classification is the goal).

This leads to a single OOB prediction for the i-th observation.

An OOB prediction can be obtained in this way for each of the n observations,
from which the overall OOB MSE (for a regression problem) or classification
error (for a classification problem) can be computed.

The resulting OOB error is a valid estimate of the test error for the bagged
model.

In the Figure of the OOB error on the Heart data, it can be shown that with
B sufficiently large, OOB error is virtually equivalent to LOOCV error.

Yeil Kwon KU-IWC 28 / 36


Variable Importance Measures

Bagging typically results in improved accuracy over prediction using a single


tree.

However, it can be difficult to interpret the resulting model.

Recall that one of the advantages of decision trees is the attractive and easily
interpreted diagram that results.

However, when we bag a large number of trees, it is no longer possible to


represent the resulting statistical learning procedure using a single tree

Thus, bagging improves prediction accuracy at the expense of interpretability.

Yeil Kwon KU-IWC 29 / 36


Variable Importance Measures

One can obtain an overall summary of the importance of each predictor using

– the RSS (for bagging regression trees)

– the Gini index (for bagging classification trees).

In the case of bagging regression trees, we can record the total amount that the
RSS is decreased due to splits over a given predictor, averaged over all B trees.

In the context of bagging classification trees, we can add up the total amount
that the Gini index is decreased by splits over a given predictor, averaged over
all B trees.

A large value indicates an important predictor.

Yeil Kwon KU-IWC 30 / 36


A variable importance plot for the Heart data

Fbs

RestECG

ExAng

Sex

Slope

Chol

Age

RestBP

MaxHR

Oldpeak

ChestPain

Ca

Thal

0 20 40 60 80 100
Variable Importance
Variable importance is computed using the mean decrease in Gini index
The are expressed relative to the maximum.

Yeil Kwon KU-IWC 31 / 36


Results from Random Forest

Random forests provide an improvement over bagged trees by way of a small


tweak that decorrelates the trees.

As in bagging, we build a number of decision trees on bootstrapped training


samples.

But when building these decision trees, each time a split in a tree is
considered, a random sample of m predictors is chosen as split candidates from
the full set of p predictors.

The split is allowed to use only one of those m predictors.

A fresh sample of m predictors is taken at each split, and typically we choose



m ≈ p (4 out of the 13 for the Heart data).

Yeil Kwon KU-IWC 32 / 36


Results from Random Forest

In other words, in building a random forest, at each split in the tree, the
algorithm is not even allowed to consider a majority of the available predictors.

Suppose that there is one very strong predictor in the data set, along with a
number of other moderately strong predictors.

Then in the collection of bagged trees, most or all of the trees will use this
strong predictor in the top split.

Consequently, all of the bagged trees will look quite similar to each other.

Hence the predictions from the bagged trees will be highly correlated.

Unfortunately, averaging many highly (postively) correlated quantities does


not lead to as large of a reduction in variance as averaging many uncorrelated
quantities.

In particular, this means that bagging will not lead to a substantial reduction
in variance over a single tree in this setting.

Yeil Kwon KU-IWC 33 / 36


Results from Random Forest

Random forests overcome this problem by forcing each split to consider only a
subset of the predictors.

Therefore, on average (p − m)/p of the splits will not even consider the strong
predictor, and so other predictors will have more of a chance.

We can think of this process as decorrelating the trees, thereby making the
average of the resulting trees less variable and hence more reliable.

The main difference between bagging and random forests is the


choice of predictor subset size m.

For instance, if a random forest is built using m = p, then this amounts simply
to bagging.

On the Heart data, random forests using m = p leads to a reduction in both
test error and OOB error over bagging.

Yeil Kwon KU-IWC 34 / 36


Random Forest - Example

Using a small value of m in building a random forest will typically be helpful


when we have a large number of correlated predictors.

A high-dimensional biological data set consisting of expressions


– 4,718 genes from 349 patients (n = 4, 718, n = 349)

– Each of the patient samples has a qualitative label with 15 different


levels: either normal or 1 of 14 different types of cancer.

– Goal: to predict cancer type based on the 500 genes that have the largest
variance in the training set.
We randomly divided the observations into a training and a test set, and
applied random forests to the training set for three different values of the
number of splitting variables m.

We see that using 400 trees is sufficient to give good performance

As with bagging, random forests will not overfit if we B is sufficiently large


(for the error rate to have settled down).

Yeil Kwon KU-IWC 35 / 36


Results from Random Forest

m=p
m=p/2

0.5
m= p

Test Classification Error

0.4
0.3
0.2

0 100 200 300 400 500

Number of Trees
The 15-class gene expression data set with p = 500 predictors
Each colored line corresponds to a different value of m, the number of
predictors available for splitting at each interior tree node.

Random forests (m = p) lead to a slight improvement over bagging (m = p).
A single classification tree has an error rate of 45.7 %.
Yeil Kwon KU-IWC 36 / 36

You might also like