Decision Trees, Random Forests, Boosting
ISLR Chapter 8
Yuan YAO
Hong Kong University of Science and Technology
Department of Mathematics
Fall 2024
Best Machine Learning Algorithms in history?
I Boosting (ISLR chapter 8): CART, AdaBoost, Random
Forests (Leo Breiman), etc.
I Support Vector Machines (ISLR chapter 9): or kernel methods
(V. Vapnik), etc.
I Neural Networks: perceptrons (1950s), deep learning (2010s),
CNN/RNN/LSTM, ResNet/Transformers (> 2015), etc.
2
The basics of decision trees.
Regression Trees
Classification Trees
Bagging, random forests and boosting
Bagging
Random Forest
Boosting
3
About this chapter
I Decisions trees: splitting each variable sequentially, creating
rectangular regions.
I Making fitting/prediction locally at each hyper-rectangular
region.
I It is intuitive and easy to implement, may have good
interpretation.
I Generally of lower prediction accuracy (weak learners).
I “The Boosting problem” (Kearns & Valiant): Can a set of
weak learners create a single strong learner?
– Bagging, random forests and boosting ... make
fitting/prediction based on a number (ensemble) of trees.
– Bagging and Boosting are general methodologies, not just
limited to trees.
4
About this chapter
I Decisions trees: splitting each variable sequentially, creating
rectangular regions.
I Making fitting/prediction locally at each hyper-rectangular
region.
I It is intuitive and easy to implement, may have good
interpretation.
I Generally of lower prediction accuracy (weak learners).
I “The Boosting problem” (Kearns & Valiant): Can a set of
weak learners create a single strong learner?
– Bagging, random forests and boosting ... make
fitting/prediction based on a number (ensemble) of trees.
– Bagging and Boosting are general methodologies, not just
limited to trees.
4
About this chapter
I Decisions trees: splitting each variable sequentially, creating
rectangular regions.
I Making fitting/prediction locally at each hyper-rectangular
region.
I It is intuitive and easy to implement, may have good
interpretation.
I Generally of lower prediction accuracy (weak learners).
I “The Boosting problem” (Kearns & Valiant): Can a set of
weak learners create a single strong learner?
– Bagging, random forests and boosting ... make
fitting/prediction based on a number (ensemble) of trees.
– Bagging and Boosting are general methodologies, not just
limited to trees.
4
Interpretability vs. Prediction
High
Subset Selection
Lasso
Least Squares
Interpretability
Generalized Additive Models
Trees
Bagging, Boosting
Support Vector Machines
Low
Neural Networks
Low High
Flexibility
Figure: 2.7. As models become flexible, interpretability drops. Occam
Razor principle: Everything has to be kept as simple as possible, but not
simpler (Albert Einstein).
5
Outline
The basics of decision trees.
Regression Trees
Classification Trees
Bagging, random forests and boosting
Bagging
Random Forest
Boosting
The basics of decision trees. 6
Figure: CART and C4.5
The basics of decision trees. 7
Figure: Historical story about CART: dates back to Automatic
Interaction Detection ... in 1960s
The basics of decision trees. 8
Regression trees
I Trees can be applied to both regression and classification.
I CART refers to classification and regression trees.
I We first consider regression trees through an example of
predicting Baseball players’ salaries.
The basics of decision trees. 9
The Hitters data
I Response/Outputs: Salary.
I Predictors/Inputs/Covariates:
Years (the number of years that he has played in the major
leagues)
Hits (the number of hits that he made in the previous year).
I preparing data: remove the observations with missing data
and log-transformed the Salary (preventing heavy
right-skewness)
The basics of decision trees. 10
Years < 4.5
|
Hits < 117.5
5.11
6.00 6.74
Figure: 1. Next page
The basics of decision trees. 11
Figure 1. For the Hitters data, a regression tree for predicting the
log salary of a baseball player, based on the number of years that
he has played in the major leagues and the number of hits that he
made in the previous year. At a given internal node, the label (of
the form Xj < tk ) indicates the left-hand branch emanating from
that split, and the right-hand branch corresponds to Xj ≥ tk . For
instance, the split at the top of the tree results in two large
branches. The left-hand branch corresponds to Years< 4.5, and
the right-hand branch corresponds to Years≥ 4.5. The tree has two
internal nodes and three terminal nodes, or leaves. The number in
each leaf is the mean of the response for the observations that fall
there.
The basics of decision trees. 12
238
R3
Hits
R1 117.5
R2
1
1 4.5 24
Years
Figure: 2. The three-region partition for the Hitters data set from the
regression tree illustrated in Figure 1.
The basics of decision trees. 13
Estimation/prediction
I On Regions R1 , R2 , R3 , the mean-log-salary is 5.107, and
6.74.
I Our prediction for any players in R1 , R2 and R3 are,
respectively
1, 000 × e 5.107 = $165, 174, 1, 000 × e 5.999 = $402, 834, and
1, 000 × e 6.740 = $845, 346.
The basics of decision trees. 14
Estimation/prediction
I Trees involve a series of splittings of the data, each time by
one variable.
I The series of actions taken place sequentially creates a
tree-like results.
I As in Figure 1, the terminal nodes are the three indexed by
the numbers, which represent the regions R1 , R2 and R3 .
These regions constitute the final partition of the data.
I Terminal nodes are also called leaves.
I Each internal node represents a splitting,
I In Figure 1, the two internal nodes are indexed by Years < 4.5
and Hits < 117.5.
I The lines connecting nodes are called branches.
I Trees are typically drawn upside down.
The basics of decision trees. 15
Two step towards prediction
I Run the splitting according to input values sequentially, and
obtain final partition of the data in regions R1 , ..., RJ .
I For any new observation with covariates in region Rk , we
predict its response by the average of the reponses of the data
points in region Rk .
The basics of decision trees. 16
How to split
I Suppose we wish to partition a region R. In other words, we
wish to separate the data in region R into two parts, day R1
and R2 , according to one input values.
I What would be the optimal or efficient split in some sense?
I Only two parameters in the split:
1. Choice of the input variable to split,
2. the cutpoint of the split of that chose input.
I Imagine that this is the final split of R: R1 and R2 would be
leaves.
– And we would use the mean response of data in R1 and R2 to
predict the response of any new/old observations.
– We wish our choice of R1 and R2 would be optimal in the
sense of achieving minimum prediction error on the training
data in region R.
The basics of decision trees. 17
Recursive binary splitting
I A greedy algorithm (geedy means it is optimal at the current
step):
1. For j = 1, ..., p and all real value s, let
R1 (j, s) = {i ∈ R : Xj < s} and R2 (j, s) = {i ∈ R : Xj ≥ s}.
And let ŷ1 and ŷ2 be the mean response of all observations in
R1 (j, s) and R2 (j, s), respectively.
2. Consider the following prediction error:
X X
RSSnew = (yi − ŷ1 )2 + (yi − ŷ2 )2
i∈R1 (j,s) i∈R2 (j,s)
Choose the split which has the smallest prediction error. This
split is the optimal one, denoted as R1 and R2 .
I Continue the split till the final partition.
The basics of decision trees. 18
R5
R2 t4
X2
X2
R3
t2 R4
R1
t1 t3
X1 X1
X1 ≤ t1
|
X2 ≤ t2 X1 ≤ t3
X2 ≤ t4
R1 R2 R3
X2
X1
R4 R5
The basics of decision trees. 19
Figure 3. Top Left: A partition of two-dimensional feature space
that could not result from recursive binary splitting. Top Right:
The output of recursive binary splitting on a two-dimensional
example. Bottom Left: A tree corresponding to the partition in the
top right panel. Bottom Right: A perspective plot of the
prediction surface corresponding to that tree.
The basics of decision trees. 20
When to stop split
I The problem of when to stop.
I If too many steps of splitting: many leaves, too complex
model, small bias but large variance, may overfit.
I If too few steps of splitting: few leaves, too simple model,
large bias but small variance, may underfit.
The basics of decision trees. 21
One natural idea
I When splitting R into R1 and R2 , consider the RSS before the
split X
RSSold = (yi − ŷ )2
i∈R
where ŷ is the average of the response of data in R. With the
optimal split, the reduction of RSS is
RSSold − RSSnew
I We can pre-choose a threshold, h, and decide the worthiness
of the split.
– If the reduction is smaller than h, we do not do it, and stop
right there; then R is one terminal node (a leave).
– If the reduction is greater than h, we make the split, and
continue with next step.
The basics of decision trees. 22
One natural idea
I The idea is seemingly reasonable, but is too near-sighted.
I Only look at the effect of the current split.
I It is possible that even if the current split is not effective, the
future splits could be effective and, maybe, very effective.
The basics of decision trees. 23
Tree pruning
I Grow a very large tree.
I Prune the tree back to obtain a subtree.
I Objective: find the subtree that has the best test error.
I Use cross-validation to examine the test errors for a sequence
(parametrized by α below) of subtrees during the
growth/pruning, instead of all possible subtrees which is too
large a model space.
The basics of decision trees. 24
Cost complexity pruning
I Let T0 be the original (large) tree. Let T be any subtree. Use
|T0 | and |T | to denote their numbers of teminal nodes, which
represent complexity.
I Consider “Loss + Penalty”:
T X
X
(yi − ŷm )2 + α|T |
m=1 i∈Rm
where Rm are the terminal nodes of the subtree T , and the
mean response of Rm is ŷm ; α is tuning parameter.
I Denote the minimized subtree as Tα .
– If α = 0, no penalty the optimal tree is the original T0 .
– If α = ∞, the tree has no split at all. The predictor is just ȳ .
– The larger the α, the more penalty for model complexity.
The basics of decision trees. 25
Cost complexity pruning
I Just like Lasso, there exists efficient computation algorithm to
compute the entire sequence of Tα for all α.
I Use cross-validation to find the best α to minimize the test
error.
The basics of decision trees. 26
The algorithm
I 1. Use recursive binary splitting to grow a large tree on the
training data, stopping only when each terminal node has
fewer than some minimum number of observations.
I 2. Apply cost complexity pruning to the large tree in order to
obtain a sequence of best subtrees, as a function of α.
The basics of decision trees. 27
The algorithm (continued)
I 3. Use K -fold cross-validation to determine best α. That is,
divide the training observations into K folds. For each
k = 1, ..., K
(a) Repeat Steps 1 and 2 on all but the kth fold of the training
data.
(b) Evaluate the mean squared prediction error on the data in the
left-out k-th fold, as a function of α.
(c) Average the results for each value of α, and pick the α∗ to
minimize the average error.
I 4. Return the subtree from Step 2 that corresponds to the
chosen α∗ .
The basics of decision trees. 28
Years < 4.5
|
RBI < 60.5 Hits < 117.5
Putouts < 82 Years < 3.5
Years < 3.5
5.487 5.394 6.189
4.622 5.183
Walks < 43.5 Walks < 52.5
Runs < 47.5 RBI < 80.5
6.407 6.549 Years < 6.5
6.015 5.571
7.289
6.459 7.007
Figure: 4. Regression tree analysis for the Hitters data. The unpruned
tree that results from top-down greedy splitting on the training data is
shown.
The basics of decision trees. 29
1.0
Training
Cross−Validation
Test
0.8
Mean Squared Error
0.6
0.4
0.2
0.0
2 4 6 8 10
Tree Size
Figure: 5. Regression tree analysis for the Hitters data. The training,
cross-validation, and test MSE are shown as a function of the number of
terminal nodes in the pruned tree. Standard error bands are displayed.
The minimum cross-validation error occurs at a tree size of three.
The basics of decision trees. 30
Classification trees
I Regression has numerical responses; and classification has
qualitative responses.
I Recall that for regression trees, we chose to obtain the
greatest reduction of RSS.
RSS is using sum of squares to measure the error.
I For classification trees, one can follow the same line of
procedure as that of regression trees, but using error
measurements that are more appropriate for classification.
The basics of decision trees. 31
Classification error rates
I For a region R, let p̂k be the percentage of observations in
this region that belong to class k.
I We assign any new observation in region R as from the class
with largest p̂k , which is the so-called most commonly
occuring class in training data.
The basics of decision trees. 32
The impurity measure
I The classification error rate (for this region R) is
E = 1 − maxk p̂k .
I The Gini index is
K
X
G= p̂k (1 − p̂k )
k=1
I The cross-entropy is
K
X
D=− p̂k log(p̂k )
k=1
.
The basics of decision trees. 33
I Any of these three approaches might be used when pruning
the tree.
I If R is nearly pure, most of the observations are from one
class, then the Gini-index and cross-entropy would take
smaller values than classification error rate.
p̂1 = [0.5, 0.25, 0.25] ⇒ E = 0.5, G = 0.625, D = 1.0397
p̂2 = [0.5, 0.4, 0.1] ⇒ E = 0.5, G = 0.580, D = 0.9433
I Gini-index and cross-entropy are more sensitive to node purity.
I To evaluate the quality of a particular split, the Gini-index and
cross-entropy are more popularly used as error measurement
criteria than classification error rate. (E can’t distinguish p̂1
and p̂2 above, while G /D are more informative for p̂2 )
I The classification error rate is preferable if only prediction
accuracy of the final pruned tree is the goal.
The basics of decision trees. 34
Example: the Heart data.
I The aim is to use p = 13 predictors such as Age, Sex, and
Chol in order to predict whether an individual has heart
disease.
I n = 297 subjects, randomly split into 207 training and 90 test
observations.
I We apply decision trees to the Heart data.
The basics of decision trees. 35
Thal:a
|
Ca < 0.5 Ca < 0.5
Slope < 1.5 Oldpeak < 1.1
MaxHR < 161.5 ChestPain:bc Age < 52 Thal:b RestECG < 1
ChestPain:a Yes
RestBP < 157 No Yes Yes
Yes No
Chol < 244 No Chol < 244 Sex < 0.5
MaxHR < 156 No Yes
MaxHR < 145.5 Yes
No
No No No No Yes
No Yes
Thal:a
|
0.6
Training
Cross−Validation
Test
0.5
0.4
Error
0.3
Ca < 0.5 Ca < 0.5
0.2
Yes Yes
0.1
MaxHR < 161.5 ChestPain:bc
0.0
No No
No Yes
5 10 15
Tree Size
The basics of decision trees. 36
Figure 6. Heart data. Top: The unpruned tree. Bottom Left:
Cross-validation error, training, and test error, for different sizes of
the pruned tree. Bottom Right: The pruned tree corresponding to
the minimal cross-validation error.
The basics of decision trees. 37
Trees vs. Linear models
I For regression model:
Y = f (X ) +
I Linear model assumes
p
X
f (X ) = β0 + Xj βj
j=1
I Regression trees assume
M
X
f (X ) = cm 1(X ∈ Rm )
m=1
where R1 , ..., RM are rectagular partitions of the input space.
I If the underlying realation is close to linear, linear model is
better. Otherwise, regression trees are generally better.
(Useless comments)
The basics of decision trees. 38
2
2
1
1
X2
X2
0
0
−1
−1
−2
−2
−2 −1 0 1 2 −2 −1 0 1 2
X1 X1
2
2
1
1
X2
X2
0
0
−1
−1
−2
−2
−2 −1 0 1 2 −2 −1 0 1 2
X1 X1
The basics of decision trees. 39
Figure 7. Top Row: A two-dimensional classification example in
which the true decision boundary is linear, and is indicated by the
shaded regions. A classical approach that assumes a linear
boundary (left) will outperform a decision tree that performs splits
parallel to the axes (right). Bottom Row: Here the true decision
boundary is non-linear. Here a linear model is unable to capture
the true decision boundary (left), whereas a decision tree is
successful (right).
The basics of decision trees. 40
Advantages of Trees
I Trees are very easy to explain to people. In fact, they are even
easier to explain than linear regression!
I Some people believe that decision trees more closely mirror
human decision-making than do the regression and
classification approaches seen in previous chapters.
I Trees can be displayed graphically, and are easily interpreted
even by a non-expert (especially if they are small).
I Trees can easily handle qualitative predictors without the need
to create dummy variables.
The basics of decision trees. 41
Disadvantages of Trees
I Trees generally do not have the same level of predictive
accuracy as some of the other regression and classification
approaches seen in this book.
I Trees can be very non-robust. In other words, a small change
in the data can cause a large change in the final estimated
tree.
I However, by aggregating many decision trees, using methods
like bagging, random forests, and boosting, the predictive
performance of trees can be substantially improved. We
introduce these concepts in the next section.
The basics of decision trees. 42
Outline
The basics of decision trees.
Regression Trees
Classification Trees
Bagging, random forests and boosting
Bagging
Random Forest
Boosting
Bagging, random forests and boosting 43
Bagging (Boostrap Aggregating)
I A general purpose procedure to reduce variance of a learning
method.
I A model averaging technique.
I Decision tree is generally a high variance method. (Apply the
method based on different data based on same sampling
scheme would lead to very different result.)
I Average of iid random variables would have a reduced
variance σ 2 /n
Bagging, random forests and boosting 44
Statistical Estimates
I Model
yi = f (xi ) + i , i = 1, ..., n.
I Suppose a statistical learning method gives fˆ(·) based on the
training data (yi , xi ), i =, 1..., n.
I For example,
1. Linear model: fˆ(x) = β̂0 + β̂ T xi
PJ
2. KNN: fˆ(x) = j=1 ȳR̃j with least distance to K -cluster
partition. PJ
3. Decision tree: fˆ(x) = j=1 ȳRj with rectangular partition.
4. ...
Bagging, random forests and boosting 45
The procedure of Bagging
I Data (yi , xi ), i = 1, ..., n; and a learning method fˆ
I Draw a bootstrap sample from the data, and compute a fˆ1∗
based on this set of bootstrap sample.
I Draw another bootstrap sample from the data, and compute a
fˆ2∗ based on this set of bootstrap sample.
I ....
I Repeat B times, obtain fˆ1∗ , ...., fˆB∗ .
I Produce the learning method with bagging as
B
1 X ˆ∗
fb
B
b=1
Bagging, random forests and boosting 46
Summary of the Bagging
I Bagging is general-purpose.
I It works best for high variance low bias learning methods.
I When the trees are grown deep, and are not pruned, each
individual tree has high variance, but low bias.
I Averaging these trees reduces the variance.
I If the response is qualitative, we can take the majority vote
(not averaging) of the predicted class based on all learning
methods based on boostrap samples.
Bagging, random forests and boosting 47
Out-of-Bag (OOB) error estimation
I Estimation of test error for the bagged model.
I For each bootstrap sample, observation i is not bootstrap
sampled with probability (1 − 1/n)n ≈ 1/e.
I For each bootstrap sample, the number of observations not
taken into this bootstrap sample is n(1 − 1/n)n ≈ n/e. These
are referred to as out-of-bag (OOB) observations.
I For totally B bootstrap samples, about B/e times, the
bootstrap sample does not contain observation i.
I For each observation zi = (xi , yi ), construct its bagged
predictor by averaging (for regression) or taking majority vote
(for classification) of only those trees corresponding to
bootstrap samples in which zi did not appear, denoted as
fˆ−i
∗ (x ).
i
Bagging, random forests and boosting 48
Out-of-Bag (OOB) error estimation
I The OOB MSE is
n
X
∗
(yi − fˆ−i (xi ))2
i=1
I The OOB classification error is
Xn
∗
/ fˆ−i
I (yi ∈ (xi ))
i=1
I The resulting OOB error is a valid estimate of the test error
for the bagged model, since the response for each observation
is predicted using only the trees that were not fit using that
observation.
I It can be shown that with B sufficiently large, OOB error is
virtually equivalent to leave-one-out cross-validation error.
Bagging, random forests and boosting 49
Variable importance measures
I Bagging improves prediction accuracy at the expense of
interpretability.
I An overall summary of the importance of each predictor using
the RSS (for bagging regression trees) or the Gini index (for
bagging classification trees).
I Bagging regression trees, we can record the total amount that
the RSS is decreased due to splits over a given predictor,
averaged over all B trees.
I A large value indicates an important predictor.
I Bagging classification trees, we can add up the total amount
that the Gini index is decreased by splits over a given
predictor, averaged over all B trees.
Bagging, random forests and boosting 50
Fbs
RestECG
ExAng
Sex
Slope
Chol
Age
RestBP
MaxHR
Oldpeak
ChestPain
Ca
Thal
0 20 40 60 80 100
Variable Importance
Figure: 9. A variable importance plot for the Heart data. Variable importance is
computed using the mean decrease in Gini index, and expressed relative to the
maximum.
Bagging, random forests and boosting 51
Motivation of Random Forest
I An average of B i.i.d random variables, each with variance σ 2 ,
has variance B1 σ 2 .
I What if not independent but correlated?
I If the variables are simply i.d. (identically distributed but not
necessarily independent) with positive pairwise correlation ρ,
the variance of the average is
1−ρ 2
ρσ 2 + σ .
B
I The idea of random forests is to improve the variance
reduction of bagging by reducing the correlation between
trees, without increasing the variance too much.
Bagging, random forests and boosting 52
Random Forest
I Same as bagging decision trees, except ...
– When building these decision trees, each time a split in a tree
is considered, a random sample of m predictors is chosen as
split candidates from the full set of p predictors
√
– Typically m ≈ p.
Bagging, random forests and boosting 53
Random Forest
I Every step, the split is constrained on a small number m and
randomly selected inputs.
I Avoid all trees are too similar to each other.
I Too similar trees are too highly correlated, average highly
correlated trees cannot achieve large amount of variance
reduction.
I Extreme case: If all trees are the same, average of them is still
the same one.
I Averaging uncorrelated or low-correlated trees can achieve
large amount of variance reduction.
I Random forest produces less correlated trees.
I Random forest reduces to bagging if m = p.
Bagging, random forests and boosting 54
0.30
0.25
Error
0.20
0.15
Test: Bagging
Test: RandomForest
OOB: Bagging
0.10
OOB: RandomForest
0 50 100 150 200 250 300
Number of Trees
Figure: 8. Bagging and random forest results for the Heart data. The test error
(black and orange) is shown as a function of B, the number of bootstrapped training
√
sets used. Random forests were applied with m = p. The dashed line indicates the
test error resulting from a single classification tree. The green and blue traces show
the OOB error, which in this case is considerably lower
Bagging, random forests and boosting 55
m=p
m=p/2
0.5
m= p
Test Classification Error
0.4
0.3
0.2
0 100 200 300 400 500
Number of Trees
Figure: 10. Results from random forests for the 15-class gene expression data set
with p = 500 predictors. The test error is displayed as a function of the number of
trees. Each colored line corresponds to a different value of m, the number of
predictors available for splitting at each interior tree node. Random forests (m < p)
lead to a slight improvement over bagging (m = p). A single classification tree has an
error rate of 45.7%.
Bagging, random forests and boosting 56
Boosting
I General purpose for improving learning methods by combining
many weaker learners in attempt to produce a strong learner.
I Like bagging, boosting involves combining a large number of
weaker learners.
I The weaker learners are created sequentially (no bootstrap
involved).
I Bagging create large variance and possibly over-fit bootstrap
learners and try to reduce their variance by averaging.
I Boosting creates weak learners by sequentially coordinate
descent over hypothesis basis.
Bagging, random forests and boosting 57
Adaboost (Yoav Freund, Robert E. Schapire (1996))
(1)
I 1. Initialize the data weights {wn } by setting wn = 1/N for
n = 1, . . . , N.
I 2. For m = 1, . . . , M:
– (a) Fit a classifier fm (x) to the training data by minimizing the
weighted error function
N
X
Jm = wn(m) I(fm (xn ) 6= yn ), (1)
n=1
where I(fm (xn ) 6= yn ) is the indicator function and equals 1
when fm (xn ) 6= yn and 0 otherwise.
Bagging, random forests and boosting 58
Adaboost
I 2. (Continued)
– (b) Evaluate the quantities
PN (m)
wn I(fm (xn ) 6= yn )
m = n=1 PN (m)
(2)
n=1 wn
and then use these to evaluate
1 − m
αm = log (3)
m
– (c) Update the data weights
wn(m+1) = wn(m) exp{αm I(fm (xn ) 6= yn )} (4)
I 3. Make prediction by
M
!
X
FM (x) = sign αm fm (x) . (5)
m=1
Bagging, random forests and boosting 59
Illustration of Adaboost
Figure: Each figure shows the number m of base learners trained so far, along
with the decision boundary of the most recent base learner (dashed black line)
and the combined decision boundary of the ensemble (solid green line). Each
data point is depicted by a circle whose radius indicates the weight assigned to
that data point when training the most recently added base learner. Thus, for
instance, we see that points that are misclassified by the m = 1 base learner are
given greater weight when training the m = 2 base learner. Bishop (2006).
Bagging, random forests and boosting 60
Statistical view of Adaboost
I Consider the exponential error function
N
X
E = exp{−yn FM (xn )}, (6)
n=1
where FM (x) = 21 M
P
m=1 αm fm (x) and yn ∈ {−1, 1}.
I Our goal is to minimize E w.r.t. both αm and fm (x).
I Instead of doing a global error function miminization, we shall suppose
that the base classifiers f1 (x), . . . , fm−1 (x) are fixed, as are their
coefficients α1 , . . . , αm−1 , and so we are minimizing only w.r.t. αm and
fm (x).
N
X 1
E = exp −yn Fm−1 (xn ) − yn αm fm (xn )
n=1
2
N
X 1
= wn(m) exp{− yn αm fm (xn )} (7)
n=1
2
(m)
where wn = exp{−yn Fm−1 (xn )} can be viewed as constants.
Bagging, random forests and boosting 61
I Denote Tm and Mm as the sets of correctly classified data points by
fm (x) and the misclassified points, respectively.
I We have
X X
E = e −αm /2 wn(m) + e αm /2 wn(m)
n∈Tm n∈Mm
N
X N
X
= (e αm /2 − e −αm /2 ) wn(m) I(fm (xn ) 6= yn ) + e −αm /2 wn(m) .
n−1 n=1
(8)
– When we minimize this w.r.t. fm (x), the second term is
constant, and up to a multiplicative constant minimizing the
first term is equivalent to minimizing (1).
– Similarly, minimizing w.r.t. αm , we obtain (3) in which m is
defined by (2).
Bagging, random forests and boosting 62
I From (7), we see that, having found αm and fm (x), the
weights on the data points are updated using
(m+1) (m) 1
wn = wn exp − yn αm fm (xn ) .
2
– Making use of the fact that
yn fm (xn ) = 1 − 2I(fm (xn ) 6= yn ),
(m+1)
we see that the weight update wn follows
wn(m+1) = wnm exp(−αm /2) exp{αm I(fm (xn ) 6= yn )}.
– Because the term exp(−αm /2) is independent of n, we see
that it weights all data points by the same factor and so can
be discarded. Thus we obtain (4).
I Finally, once all the base classifiers are trained, new data
points are classified by evaluating the sign of the combined
function. Because the factor of 1/2 does not affect the sign,
this gives (5).
Bagging, random forests and boosting 63
Why exponential loss function?
I We have shown that the Adaboost algorithm minimizes the
exponential loss function (6). How to justify the minimization of the
exponential loss function?
I Consider the population version of the exponential loss function,
XZ
Ex,y [exp{−yF (x)}] = exp{−yF (x)}p(y |x)p(x)dx
y
I Taking the functional derivative w.r.t. F (x), we get
∂ X
Ey [exp{−yF (x)}] = − y exp{−yF (x)}p(y |x)p(x)
∂F (x) y
= {exp{F (x)}p(y = −1|x) − exp{−F (x)}p(y = 1|x)}p(x).
I Setting this equal to zero and rearranging, we obtain
1 p(y = 1|x)
F (x) = log ,
2 p(y = −1|x)
whose sign is the Bayes optimal classification.
Bagging, random forests and boosting 64
Convex Losses
Elements of Statistical Learning (2nd Ed.) Hastie,
c Tibshirani & Friedman 2009 Chap 10
3.0
Misclassification
Exponential
Binomial Deviance
2.5
Squared Error
Support Vector
2.0
Loss
1.5
1.0
0.5
0.0 −2 −1 0 1 2
y·f
FIGURE 10.4. Loss functions for two-class classi-
fication. The response is y = ±1; the prediction is
f , with class prediction sign(f ). The losses are mis-
classification: I(sign(f ) = y); exponential: exp(−yf );
binomial deviance: log(1 + exp(−2yf )); squared er-
ror: (y − f )2 ; and support vector: (1 − yf )+ (see Sec-
tion 12.3). Each function has been scaled so that it
passes through the point (0, 1).
Bagging, random forests and boosting 65
Friedman’s Gradient Boost algorithm
I Initialize F̂0 (x) toP
be a constant,
F̂0 (x) = arg minρ N i=1 L(yi , ρ).
I For m in 1, . . . , M do
1. Compute the negative gradient as the working response
ri = −
∂
∂Fm−1 (xi )
L(yi , Fm−1 (xi )) |
Fm−1 (xi )=F̂m−1 (xi )
(9)
2. Fit a regression model, ĝ (x), predicting ri from the covariates
xi .
3. Choose a gradient descent step size as
N
X
λ̂ = arg min L(yi , F̂m−1 (xi ) + λĝ (xi )) (10)
λ∈R
i=1
4. Update the estimate of F (x) as
F̂m (x) = F̂m−1 (x) + λ̂ĝ (x) (11)
Bagging, random forests and boosting 66
Algorithm for tree boosting in regression
I 1. Set f (x) = 0 and ri = yi for all i in the training set.
I 2. For b = 1, 2, ..., B, repeat:
1. Fit a tree with d splits (d + 1 terminal nodes) to the training
data (xi , ri ).
2. Update fˆ by adding in a shrunken version of the new tree:
fˆ(x) ← fˆ(x) + λfˆb (x)
3. Update the residuals,
ri ← ri − λfˆb (xi ) = yi − fˆ(xi ).
I 3. Output the boosted model fˆ. In fact,
B
X
fˆ(x) = λfˆb (x).
i=1
Bagging, random forests and boosting 67
Tuning parameters for boosting trees
I In regression, square loss minimization over fˆt is obtained by
fitting the residue ri
I There are three hyper-parameters:
– The number of trees M. Large M leads to overfit. (Note B is
not a tuning parameter for bagging)
– The learning rate λ.
– The number d in splits in each tree (the size of each tree).
Often d = 1 works well, in which case each tree is a stump,
consisting of a single split
Bagging, random forests and boosting 68
0.25
Boosting: depth=1
Boosting: depth=2
RandomForest: m= p
0.20
Test Classification Error
0.15
0.10
0.05
0 1000 2000 3000 4000 5000
Number of Trees
Figure: 8.11. Results from performing boosting and random forests on the 15-class
gene expression data set in order to predict cancer versus normal. The test error is
displayed as a function of the number of trees. For the two boosted models, λ = 0.01.
Depth-1 trees slightly outperform depth-2 trees, and both outperform the random
forest, although the standard errors are around 0.02, making none of these differences
significant. The test error rate for a single tree is 24%.
Bagging, random forests and boosting 69
Forward Stagewise Regression as Coordinate Descent
Consider a set of fixed basis functions {Tk }k=1,...,K .
I Initialze α̂k = 0, k = 1, . . . , K . Set > 0 to some small
constant, and M large.
I For m = 1 to M:
PN PK
– (β ∗ , k ∗ ) = arg minβ,k i=1 ` yi − l=1 α̂l Tl (xi ) − βTk (xi )
– α̂k ∗ ← α̂k ∗ + · sign(β ∗ ).
PK
Output FM (x) = k=1 α̂k Tk (x).
I
Bagging, random forests and boosting 70
Figure: Profiles of estimated coefficients from linear regression, for the
prostate data. The left panel shows the resultsP from the Lasso, for
different values of the bound parameter t = k |αk |. The right panel
shows the results of the forward stagewise linear regression, using
M = 220 consecutive steps of size = 0.01.
Bagging, random forests and boosting 71
Summary: Statistical view of boosting methods
Boosting methods have three important properties that contribute
to their success:
I they fit by functional gradient descent or coordinate descent
an additive model in a flexible set of basis functions.
I they use a suitable loss function for the fitting process.
I they regularize by forward stagewise fitting with shrinkage
(small step-size in coordinate descent) and early stopping, this
mimics an `1 (Lasso) penalty on the weights.
Bagging, random forests and boosting 72