0% found this document useful (0 votes)

5 views74 pages

Decision Trees and Boosting Techniques

The document discusses decision trees, random forests, and boosting as essential machine learning algorithms, highlighting their structure, functionality, and applications. It explains the concepts of regression and classification trees, the process of splitting data, and the importance of model interpretability versus prediction accuracy. Additionally, it covers techniques like bagging, boosting, and the challenges of overfitting and underfitting in tree-based models.

Uploaded by

yosonjer

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

5 views74 pages

Decision Trees and Boosting Techniques

Uploaded by

yosonjer

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

Decision Trees, Random Forests, Boosting

ISLR Chapter 8

Yuan YAO

Hong Kong University of Science and Technology

Department of Mathematics

Fall 2024
Best Machine Learning Algorithms in history?

I Boosting (ISLR chapter 8): CART, AdaBoost, Random

Forests (Leo Breiman), etc.
I Support Vector Machines (ISLR chapter 9): or kernel methods
(V. Vapnik), etc.
I Neural Networks: perceptrons (1950s), deep learning (2010s),
CNN/RNN/LSTM, ResNet/Transformers (> 2015), etc.

2
The basics of decision trees.
Regression Trees
Classification Trees

Bagging, random forests and boosting

Bagging
Random Forest
Boosting

3
About this chapter

I Decisions trees: splitting each variable sequentially, creating

rectangular regions.
I Making fitting/prediction locally at each hyper-rectangular
region.
I It is intuitive and easy to implement, may have good
interpretation.
I Generally of lower prediction accuracy (weak learners).
I “The Boosting problem” (Kearns & Valiant): Can a set of
weak learners create a single strong learner?
– Bagging, random forests and boosting ... make
fitting/prediction based on a number (ensemble) of trees.
– Bagging and Boosting are general methodologies, not just
limited to trees.

4
About this chapter

I Decisions trees: splitting each variable sequentially, creating

4
About this chapter

I Decisions trees: splitting each variable sequentially, creating

4
Interpretability vs. Prediction

High
Subset Selection
Lasso

Least Squares
Interpretability

Generalized Additive Models

Trees

Bagging, Boosting

Support Vector Machines

Low

Neural Networks

Low High

Flexibility

Figure: 2.7. As models become flexible, interpretability drops. Occam

Razor principle: Everything has to be kept as simple as possible, but not
simpler (Albert Einstein).

5
Outline

The basics of decision trees.

Regression Trees
Classification Trees

Bagging, random forests and boosting

Bagging
Random Forest
Boosting

The basics of decision trees. 6

Figure: CART and C4.5

The basics of decision trees. 7

Figure: Historical story about CART: dates back to Automatic
Interaction Detection ... in 1960s

The basics of decision trees. 8

Regression trees

I Trees can be applied to both regression and classification.

I CART refers to classification and regression trees.
I We first consider regression trees through an example of
predicting Baseball players’ salaries.

The basics of decision trees. 9

The Hitters data

I Response/Outputs: Salary.
I Predictors/Inputs/Covariates:
Years (the number of years that he has played in the major
leagues)
Hits (the number of hits that he made in the previous year).
I preparing data: remove the observations with missing data
and log-transformed the Salary (preventing heavy
right-skewness)

The basics of decision trees. 10

Years < 4.5
|

Hits < 117.5

5.11

6.00 6.74

Figure: 1. Next page

The basics of decision trees. 11

Figure 1. For the Hitters data, a regression tree for predicting the
log salary of a baseball player, based on the number of years that
he has played in the major leagues and the number of hits that he
made in the previous year. At a given internal node, the label (of
the form Xj < tk ) indicates the left-hand branch emanating from
that split, and the right-hand branch corresponds to Xj ≥ tk . For
instance, the split at the top of the tree results in two large
branches. The left-hand branch corresponds to Years< 4.5, and
the right-hand branch corresponds to Years≥ 4.5. The tree has two
internal nodes and three terminal nodes, or leaves. The number in
each leaf is the mean of the response for the observations that fall
there.

The basics of decision trees. 12

238

Hits
R1 117.5

1
1 4.5 24
Years

Figure: 2. The three-region partition for the Hitters data set from the
regression tree illustrated in Figure 1.

The basics of decision trees. 13

Estimation/prediction

I On Regions R1 , R2 , R3 , the mean-log-salary is 5.107, and

6.74.
I Our prediction for any players in R1 , R2 and R3 are,
respectively
1, 000 × e 5.107 = $165, 174, 1, 000 × e 5.999 = $402, 834, and
1, 000 × e 6.740 = $845, 346.

The basics of decision trees. 14

Estimation/prediction

I Trees involve a series of splittings of the data, each time by

one variable.
I The series of actions taken place sequentially creates a
tree-like results.
I As in Figure 1, the terminal nodes are the three indexed by
the numbers, which represent the regions R1 , R2 and R3 .
These regions constitute the final partition of the data.
I Terminal nodes are also called leaves.
I Each internal node represents a splitting,
I In Figure 1, the two internal nodes are indexed by Years < 4.5
and Hits < 117.5.
I The lines connecting nodes are called branches.
I Trees are typically drawn upside down.
The basics of decision trees. 15
Two step towards prediction

I Run the splitting according to input values sequentially, and

obtain final partition of the data in regions R1 , ..., RJ .
I For any new observation with covariates in region Rk , we
predict its response by the average of the reponses of the data
points in region Rk .

The basics of decision trees. 16

How to split

I Suppose we wish to partition a region R. In other words, we

wish to separate the data in region R into two parts, day R1
and R2 , according to one input values.
I What would be the optimal or efficient split in some sense?
I Only two parameters in the split:
1. Choice of the input variable to split,
2. the cutpoint of the split of that chose input.
I Imagine that this is the final split of R: R1 and R2 would be
leaves.
– And we would use the mean response of data in R1 and R2 to
predict the response of any new/old observations.
– We wish our choice of R1 and R2 would be optimal in the
sense of achieving minimum prediction error on the training
data in region R.
The basics of decision trees. 17
Recursive binary splitting

I A greedy algorithm (geedy means it is optimal at the current

step):
1. For j = 1, ..., p and all real value s, let
R1 (j, s) = {i ∈ R : Xj < s} and R2 (j, s) = {i ∈ R : Xj ≥ s}.
And let ŷ1 and ŷ2 be the mean response of all observations in
R1 (j, s) and R2 (j, s), respectively.
2. Consider the following prediction error:
X X
RSSnew = (yi − ŷ1 )2 + (yi − ŷ2 )2
i∈R1 (j,s) i∈R2 (j,s)

Choose the split which has the smallest prediction error. This
split is the optimal one, denoted as R1 and R2 .
I Continue the split till the final partition.

The basics of decision trees. 18

R2 t4

X2
R3

t2 R4

t1 t3

X1 X1

X1 ≤ t1
|

X2 ≤ t2 X1 ≤ t3

X2 ≤ t4

R1 R2 R3
X2
X1

R4 R5

The basics of decision trees. 19

Figure 3. Top Left: A partition of two-dimensional feature space
that could not result from recursive binary splitting. Top Right:
The output of recursive binary splitting on a two-dimensional
example. Bottom Left: A tree corresponding to the partition in the
top right panel. Bottom Right: A perspective plot of the
prediction surface corresponding to that tree.

The basics of decision trees. 20

When to stop split

I The problem of when to stop.

I If too many steps of splitting: many leaves, too complex
model, small bias but large variance, may overfit.
I If too few steps of splitting: few leaves, too simple model,
large bias but small variance, may underfit.

The basics of decision trees. 21

One natural idea

I When splitting R into R1 and R2 , consider the RSS before the

split X
RSSold = (yi − ŷ )2
i∈R

where ŷ is the average of the response of data in R. With the

optimal split, the reduction of RSS is

RSSold − RSSnew
I We can pre-choose a threshold, h, and decide the worthiness
of the split.
– If the reduction is smaller than h, we do not do it, and stop
right there; then R is one terminal node (a leave).
– If the reduction is greater than h, we make the split, and
continue with next step.
The basics of decision trees. 22
One natural idea

I The idea is seemingly reasonable, but is too near-sighted.

I Only look at the effect of the current split.
I It is possible that even if the current split is not effective, the
future splits could be effective and, maybe, very effective.

The basics of decision trees. 23

Tree pruning

I Grow a very large tree.

I Prune the tree back to obtain a subtree.
I Objective: find the subtree that has the best test error.
I Use cross-validation to examine the test errors for a sequence
(parametrized by α below) of subtrees during the
growth/pruning, instead of all possible subtrees which is too
large a model space.

The basics of decision trees. 24

Cost complexity pruning

I Let T0 be the original (large) tree. Let T be any subtree. Use

|T0 | and |T | to denote their numbers of teminal nodes, which
represent complexity.
I Consider “Loss + Penalty”:
T X
X
(yi − ŷm )2 + α|T |
m=1 i∈Rm

where Rm are the terminal nodes of the subtree T , and the

mean response of Rm is ŷm ; α is tuning parameter.
I Denote the minimized subtree as Tα .
– If α = 0, no penalty the optimal tree is the original T0 .
– If α = ∞, the tree has no split at all. The predictor is just ȳ .
– The larger the α, the more penalty for model complexity.
The basics of decision trees. 25
Cost complexity pruning

I Just like Lasso, there exists efficient computation algorithm to

compute the entire sequence of Tα for all α.
I Use cross-validation to find the best α to minimize the test
error.

The basics of decision trees. 26

The algorithm

I 1. Use recursive binary splitting to grow a large tree on the

training data, stopping only when each terminal node has
fewer than some minimum number of observations.
I 2. Apply cost complexity pruning to the large tree in order to
obtain a sequence of best subtrees, as a function of α.

The basics of decision trees. 27

The algorithm (continued)

I 3. Use K -fold cross-validation to determine best α. That is,

divide the training observations into K folds. For each
k = 1, ..., K
(a) Repeat Steps 1 and 2 on all but the kth fold of the training
data.
(b) Evaluate the mean squared prediction error on the data in the
left-out k-th fold, as a function of α.
(c) Average the results for each value of α, and pick the α∗ to
minimize the average error.
I 4. Return the subtree from Step 2 that corresponds to the
chosen α∗ .

The basics of decision trees. 28

Years < 4.5
|

RBI < 60.5 Hits < 117.5

Putouts < 82 Years < 3.5

Years < 3.5
5.487 5.394 6.189
4.622 5.183
Walks < 43.5 Walks < 52.5
Runs < 47.5 RBI < 80.5
6.407 6.549 Years < 6.5
6.015 5.571
7.289
6.459 7.007

Figure: 4. Regression tree analysis for the Hitters data. The unpruned
tree that results from top-down greedy splitting on the training data is
shown.

The basics of decision trees. 29

1.0
Training
Cross−Validation
Test

0.8
Mean Squared Error

0.6
0.4
0.2
0.0

2 4 6 8 10

Tree Size

Figure: 5. Regression tree analysis for the Hitters data. The training,
cross-validation, and test MSE are shown as a function of the number of
terminal nodes in the pruned tree. Standard error bands are displayed.
The minimum cross-validation error occurs at a tree size of three.

The basics of decision trees. 30

Classification trees

I Regression has numerical responses; and classification has

qualitative responses.
I Recall that for regression trees, we chose to obtain the
greatest reduction of RSS.
RSS is using sum of squares to measure the error.
I For classification trees, one can follow the same line of
procedure as that of regression trees, but using error
measurements that are more appropriate for classification.

The basics of decision trees. 31

Classification error rates

I For a region R, let p̂k be the percentage of observations in

this region that belong to class k.
I We assign any new observation in region R as from the class
with largest p̂k , which is the so-called most commonly
occuring class in training data.

The basics of decision trees. 32

The impurity measure

I The classification error rate (for this region R) is

E = 1 − maxk p̂k .

I The Gini index is

K
X
G= p̂k (1 − p̂k )
k=1

I The cross-entropy is
K
X
D=− p̂k log(p̂k )
k=1
.
The basics of decision trees. 33
I Any of these three approaches might be used when pruning
the tree.
I If R is nearly pure, most of the observations are from one
class, then the Gini-index and cross-entropy would take
smaller values than classification error rate.

p̂1 = [0.5, 0.25, 0.25] ⇒ E = 0.5, G = 0.625, D = 1.0397

p̂2 = [0.5, 0.4, 0.1] ⇒ E = 0.5, G = 0.580, D = 0.9433

I Gini-index and cross-entropy are more sensitive to node purity.
I To evaluate the quality of a particular split, the Gini-index and
cross-entropy are more popularly used as error measurement
criteria than classification error rate. (E can’t distinguish p̂1
and p̂2 above, while G /D are more informative for p̂2 )
I The classification error rate is preferable if only prediction
accuracy of the final pruned tree is the goal.
The basics of decision trees. 34
Example: the Heart data.

I The aim is to use p = 13 predictors such as Age, Sex, and

Chol in order to predict whether an individual has heart
disease.
I n = 297 subjects, randomly split into 207 training and 90 test
observations.
I We apply decision trees to the Heart data.

The basics of decision trees. 35

Thal:a
|

Ca < 0.5 Ca < 0.5

Slope < 1.5 Oldpeak < 1.1

MaxHR < 161.5 ChestPain:bc Age < 52 Thal:b RestECG < 1

ChestPain:a Yes
RestBP < 157 No Yes Yes
Yes No
Chol < 244 No Chol < 244 Sex < 0.5
MaxHR < 156 No Yes
MaxHR < 145.5 Yes
No
No No No No Yes
No Yes

Thal:a
|
0.6

Training
Cross−Validation
Test
0.5
0.4
Error

0.3

Ca < 0.5 Ca < 0.5

0.2

Yes Yes
0.1

MaxHR < 161.5 ChestPain:bc

0.0

No No
No Yes
5 10 15

Tree Size

The basics of decision trees. 36

Figure 6. Heart data. Top: The unpruned tree. Bottom Left:
Cross-validation error, training, and test error, for different sizes of
the pruned tree. Bottom Right: The pruned tree corresponding to
the minimal cross-validation error.

The basics of decision trees. 37

Trees vs. Linear models

I For regression model:

Y = f (X ) +
I Linear model assumes
p
X
f (X ) = β0 + Xj βj
j=1
I Regression trees assume
M
X
f (X ) = cm 1(X ∈ Rm )
m=1
where R1 , ..., RM are rectagular partitions of the input space.
I If the underlying realation is close to linear, linear model is
better. Otherwise, regression trees are generally better.
(Useless comments)
The basics of decision trees. 38
2

2
1

1
X2

X2
0

0
−1

−1
−2

−2
−2 −1 0 1 2 −2 −1 0 1 2

X1 X1
2

2
1

1
X2

X2
0

0
−1

−1
−2

−2

−2 −1 0 1 2 −2 −1 0 1 2

X1 X1

The basics of decision trees. 39

Figure 7. Top Row: A two-dimensional classification example in
which the true decision boundary is linear, and is indicated by the
shaded regions. A classical approach that assumes a linear
boundary (left) will outperform a decision tree that performs splits
parallel to the axes (right). Bottom Row: Here the true decision
boundary is non-linear. Here a linear model is unable to capture
the true decision boundary (left), whereas a decision tree is
successful (right).

The basics of decision trees. 40

Advantages of Trees

I Trees are very easy to explain to people. In fact, they are even
easier to explain than linear regression!
I Some people believe that decision trees more closely mirror
human decision-making than do the regression and
classification approaches seen in previous chapters.
I Trees can be displayed graphically, and are easily interpreted
even by a non-expert (especially if they are small).
I Trees can easily handle qualitative predictors without the need
to create dummy variables.

The basics of decision trees. 41

Disadvantages of Trees

I Trees generally do not have the same level of predictive

accuracy as some of the other regression and classification
approaches seen in this book.
I Trees can be very non-robust. In other words, a small change
in the data can cause a large change in the final estimated
tree.
I However, by aggregating many decision trees, using methods
like bagging, random forests, and boosting, the predictive
performance of trees can be substantially improved. We
introduce these concepts in the next section.

The basics of decision trees. 42

Outline

The basics of decision trees.

Regression Trees
Classification Trees

Bagging, random forests and boosting

Bagging
Random Forest
Boosting

Bagging, random forests and boosting 43

Bagging (Boostrap Aggregating)

I A general purpose procedure to reduce variance of a learning

method.
I A model averaging technique.
I Decision tree is generally a high variance method. (Apply the
method based on different data based on same sampling
scheme would lead to very different result.)
I Average of iid random variables would have a reduced
variance σ 2 /n

Bagging, random forests and boosting 44

Statistical Estimates

I Model
yi = f (xi ) + i , i = 1, ..., n.
I Suppose a statistical learning method gives fˆ(·) based on the
training data (yi , xi ), i =, 1..., n.
I For example,
1. Linear model: fˆ(x) = β̂0 + β̂ T xi
PJ
2. KNN: fˆ(x) = j=1 ȳR̃j with least distance to K -cluster
partition. PJ
3. Decision tree: fˆ(x) = j=1 ȳRj with rectangular partition.
4. ...

Bagging, random forests and boosting 45

The procedure of Bagging

I Data (yi , xi ), i = 1, ..., n; and a learning method fˆ

I Draw a bootstrap sample from the data, and compute a fˆ1∗
based on this set of bootstrap sample.
I Draw another bootstrap sample from the data, and compute a
fˆ2∗ based on this set of bootstrap sample.
I ....
I Repeat B times, obtain fˆ1∗ , ...., fˆB∗ .
I Produce the learning method with bagging as
B
1 X ˆ∗
fb
B
b=1

Bagging, random forests and boosting 46

Summary of the Bagging

I Bagging is general-purpose.
I It works best for high variance low bias learning methods.
I When the trees are grown deep, and are not pruned, each
individual tree has high variance, but low bias.
I Averaging these trees reduces the variance.
I If the response is qualitative, we can take the majority vote
(not averaging) of the predicted class based on all learning
methods based on boostrap samples.

Bagging, random forests and boosting 47

Out-of-Bag (OOB) error estimation

I Estimation of test error for the bagged model.

I For each bootstrap sample, observation i is not bootstrap
sampled with probability (1 − 1/n)n ≈ 1/e.
I For each bootstrap sample, the number of observations not
taken into this bootstrap sample is n(1 − 1/n)n ≈ n/e. These
are referred to as out-of-bag (OOB) observations.
I For totally B bootstrap samples, about B/e times, the
bootstrap sample does not contain observation i.
I For each observation zi = (xi , yi ), construct its bagged
predictor by averaging (for regression) or taking majority vote
(for classification) of only those trees corresponding to
bootstrap samples in which zi did not appear, denoted as
fˆ−i
∗ (x ).
i

Bagging, random forests and boosting 48

Out-of-Bag (OOB) error estimation

I The OOB MSE is

n
X
∗
(yi − fˆ−i (xi ))2
i=1
I The OOB classification error is
Xn
∗
/ fˆ−i
I (yi ∈ (xi ))
i=1
I The resulting OOB error is a valid estimate of the test error
for the bagged model, since the response for each observation
is predicted using only the trees that were not fit using that
observation.
I It can be shown that with B sufficiently large, OOB error is
virtually equivalent to leave-one-out cross-validation error.
Bagging, random forests and boosting 49
Variable importance measures

I Bagging improves prediction accuracy at the expense of

interpretability.
I An overall summary of the importance of each predictor using
the RSS (for bagging regression trees) or the Gini index (for
bagging classification trees).
I Bagging regression trees, we can record the total amount that
the RSS is decreased due to splits over a given predictor,
averaged over all B trees.
I A large value indicates an important predictor.
I Bagging classification trees, we can add up the total amount
that the Gini index is decreased by splits over a given
predictor, averaged over all B trees.

Bagging, random forests and boosting 50

Fbs

RestECG

ExAng

Sex

Slope

Chol

Age

RestBP

MaxHR

Oldpeak

ChestPain

Thal

0 20 40 60 80 100
Variable Importance

Figure: 9. A variable importance plot for the Heart data. Variable importance is
computed using the mean decrease in Gini index, and expressed relative to the
maximum.

Bagging, random forests and boosting 51

Motivation of Random Forest

I An average of B i.i.d random variables, each with variance σ 2 ,

has variance B1 σ 2 .
I What if not independent but correlated?
I If the variables are simply i.d. (identically distributed but not
necessarily independent) with positive pairwise correlation ρ,
the variance of the average is
1−ρ 2
ρσ 2 + σ .
B
I The idea of random forests is to improve the variance
reduction of bagging by reducing the correlation between
trees, without increasing the variance too much.

Bagging, random forests and boosting 52

Random Forest

I Same as bagging decision trees, except ...

– When building these decision trees, each time a split in a tree
is considered, a random sample of m predictors is chosen as
split candidates from the full set of p predictors
√
– Typically m ≈ p.

Bagging, random forests and boosting 53

Random Forest

I Every step, the split is constrained on a small number m and

randomly selected inputs.
I Avoid all trees are too similar to each other.
I Too similar trees are too highly correlated, average highly
correlated trees cannot achieve large amount of variance
reduction.
I Extreme case: If all trees are the same, average of them is still
the same one.
I Averaging uncorrelated or low-correlated trees can achieve
large amount of variance reduction.
I Random forest produces less correlated trees.
I Random forest reduces to bagging if m = p.

Bagging, random forests and boosting 54

0.30
0.25
Error

0.20
0.15

Test: Bagging
Test: RandomForest
OOB: Bagging
0.10

OOB: RandomForest

0 50 100 150 200 250 300

Number of Trees

Figure: 8. Bagging and random forest results for the Heart data. The test error
(black and orange) is shown as a function of B, the number of bootstrapped training
√
sets used. Random forests were applied with m = p. The dashed line indicates the
test error resulting from a single classification tree. The green and blue traces show
the OOB error, which in this case is considerably lower

Bagging, random forests and boosting 55

m=p
m=p/2

0.5
m= p

Test Classification Error

0.4
0.3
0.2

0 100 200 300 400 500

Number of Trees

Figure: 10. Results from random forests for the 15-class gene expression data set
with p = 500 predictors. The test error is displayed as a function of the number of
trees. Each colored line corresponds to a different value of m, the number of
predictors available for splitting at each interior tree node. Random forests (m < p)
lead to a slight improvement over bagging (m = p). A single classification tree has an
error rate of 45.7%.

Bagging, random forests and boosting 56

Boosting

I General purpose for improving learning methods by combining

many weaker learners in attempt to produce a strong learner.
I Like bagging, boosting involves combining a large number of
weaker learners.
I The weaker learners are created sequentially (no bootstrap
involved).
I Bagging create large variance and possibly over-fit bootstrap
learners and try to reduce their variance by averaging.
I Boosting creates weak learners by sequentially coordinate
descent over hypothesis basis.

Bagging, random forests and boosting 57

Adaboost (Yoav Freund, Robert E. Schapire (1996))

(1)
I 1. Initialize the data weights {wn } by setting wn = 1/N for
n = 1, . . . , N.
I 2. For m = 1, . . . , M:
– (a) Fit a classifier fm (x) to the training data by minimizing the
weighted error function
N
X
Jm = wn(m) I(fm (xn ) 6= yn ), (1)
n=1

where I(fm (xn ) 6= yn ) is the indicator function and equals 1

when fm (xn ) 6= yn and 0 otherwise.

Bagging, random forests and boosting 58

Adaboost
I 2. (Continued)
– (b) Evaluate the quantities
PN (m)
wn I(fm (xn ) 6= yn )
m = n=1 PN (m)
(2)
n=1 wn
and then use these to evaluate
1 − m

αm = log (3)
m
– (c) Update the data weights
wn(m+1) = wn(m) exp{αm I(fm (xn ) 6= yn )} (4)
I 3. Make prediction by
M
!
X
FM (x) = sign αm fm (x) . (5)
m=1

Bagging, random forests and boosting 59

Illustration of Adaboost

Figure: Each figure shows the number m of base learners trained so far, along
with the decision boundary of the most recent base learner (dashed black line)
and the combined decision boundary of the ensemble (solid green line). Each
data point is depicted by a circle whose radius indicates the weight assigned to
that data point when training the most recently added base learner. Thus, for
instance, we see that points that are misclassified by the m = 1 base learner are
given greater weight when training the m = 2 base learner. Bishop (2006).

Bagging, random forests and boosting 60

Statistical view of Adaboost
I Consider the exponential error function
N
X
E = exp{−yn FM (xn )}, (6)
n=1

where FM (x) = 21 M
P
m=1 αm fm (x) and yn ∈ {−1, 1}.
I Our goal is to minimize E w.r.t. both αm and fm (x).
I Instead of doing a global error function miminization, we shall suppose
that the base classifiers f1 (x), . . . , fm−1 (x) are fixed, as are their
coefficients α1 , . . . , αm−1 , and so we are minimizing only w.r.t. αm and
fm (x).
N
X 1
E = exp −yn Fm−1 (xn ) − yn αm fm (xn )
n=1
2
N
X 1
= wn(m) exp{− yn αm fm (xn )} (7)
n=1
2
(m)
where wn = exp{−yn Fm−1 (xn )} can be viewed as constants.
Bagging, random forests and boosting 61
I Denote Tm and Mm as the sets of correctly classified data points by
fm (x) and the misclassified points, respectively.
I We have
X X
E = e −αm /2 wn(m) + e αm /2 wn(m)
n∈Tm n∈Mm
N
X N
X
= (e αm /2 − e −αm /2 ) wn(m) I(fm (xn ) 6= yn ) + e −αm /2 wn(m) .
n−1 n=1
(8)

– When we minimize this w.r.t. fm (x), the second term is

constant, and up to a multiplicative constant minimizing the
first term is equivalent to minimizing (1).
– Similarly, minimizing w.r.t. αm , we obtain (3) in which m is
defined by (2).

Bagging, random forests and boosting 62

I From (7), we see that, having found αm and fm (x), the
weights on the data points are updated using

(m+1) (m) 1
wn = wn exp − yn αm fm (xn ) .
2

– Making use of the fact that

yn fm (xn ) = 1 − 2I(fm (xn ) 6= yn ),
(m+1)
we see that the weight update wn follows
wn(m+1) = wnm exp(−αm /2) exp{αm I(fm (xn ) 6= yn )}.
– Because the term exp(−αm /2) is independent of n, we see
that it weights all data points by the same factor and so can
be discarded. Thus we obtain (4).
I Finally, once all the base classifiers are trained, new data
points are classified by evaluating the sign of the combined
function. Because the factor of 1/2 does not affect the sign,
this gives (5).
Bagging, random forests and boosting 63
Why exponential loss function?

I We have shown that the Adaboost algorithm minimizes the

exponential loss function (6). How to justify the minimization of the
exponential loss function?
I Consider the population version of the exponential loss function,
XZ
Ex,y [exp{−yF (x)}] = exp{−yF (x)}p(y |x)p(x)dx
y

I Taking the functional derivative w.r.t. F (x), we get

∂ X
Ey [exp{−yF (x)}] = − y exp{−yF (x)}p(y |x)p(x)
∂F (x) y

= {exp{F (x)}p(y = −1|x) − exp{−F (x)}p(y = 1|x)}p(x).

I Setting this equal to zero and rearranging, we obtain

1 p(y = 1|x)
F (x) = log ,
2 p(y = −1|x)
whose sign is the Bayes optimal classification.
Bagging, random forests and boosting 64
Convex Losses

Elements of Statistical Learning (2nd Ed.) Hastie,

c Tibshirani & Friedman 2009 Chap 10

3.0
Misclassification
Exponential
Binomial Deviance

2.5
Squared Error
Support Vector

2.0
Loss
1.5
1.0
0.5
0.0 −2 −1 0 1 2

y·f

FIGURE 10.4. Loss functions for two-class classi-

ﬁcation. The response is y = ±1; the prediction is
f , with class prediction sign(f ). The losses are mis-
classiﬁcation: I(sign(f ) = y); exponential: exp(−yf );
binomial deviance: log(1 + exp(−2yf )); squared er-
ror: (y − f )2 ; and support vector: (1 − yf )+ (see Sec-
tion 12.3). Each function has been scaled so that it
passes through the point (0, 1).

Bagging, random forests and boosting 65

Friedman’s Gradient Boost algorithm

I Initialize F̂0 (x) toP

be a constant,
F̂0 (x) = arg minρ N i=1 L(yi , ρ).
I For m in 1, . . . , M do
1. Compute the negative gradient as the working response

ri = −
∂
∂Fm−1 (xi )
L(yi , Fm−1 (xi )) |
Fm−1 (xi )=F̂m−1 (xi )
(9)

2. Fit a regression model, ĝ (x), predicting ri from the covariates

xi .
3. Choose a gradient descent step size as
N
X
λ̂ = arg min L(yi , F̂m−1 (xi ) + λĝ (xi )) (10)
λ∈R
i=1

4. Update the estimate of F (x) as

F̂m (x) = F̂m−1 (x) + λ̂ĝ (x) (11)
Bagging, random forests and boosting 66
Algorithm for tree boosting in regression

I 1. Set f (x) = 0 and ri = yi for all i in the training set.

I 2. For b = 1, 2, ..., B, repeat:
1. Fit a tree with d splits (d + 1 terminal nodes) to the training
data (xi , ri ).
2. Update fˆ by adding in a shrunken version of the new tree:
fˆ(x) ← fˆ(x) + λfˆb (x)
3. Update the residuals,
ri ← ri − λfˆb (xi ) = yi − fˆ(xi ).

I 3. Output the boosted model fˆ. In fact,

B
X
fˆ(x) = λfˆb (x).
i=1

Bagging, random forests and boosting 67

Tuning parameters for boosting trees

I In regression, square loss minimization over fˆt is obtained by

fitting the residue ri
I There are three hyper-parameters:
– The number of trees M. Large M leads to overfit. (Note B is
not a tuning parameter for bagging)
– The learning rate λ.
– The number d in splits in each tree (the size of each tree).
Often d = 1 works well, in which case each tree is a stump,
consisting of a single split

Bagging, random forests and boosting 68

0.25
Boosting: depth=1
Boosting: depth=2
RandomForest: m= p

0.20
Test Classification Error

0.15
0.10
0.05

0 1000 2000 3000 4000 5000

Number of Trees

Figure: 8.11. Results from performing boosting and random forests on the 15-class
gene expression data set in order to predict cancer versus normal. The test error is
displayed as a function of the number of trees. For the two boosted models, λ = 0.01.
Depth-1 trees slightly outperform depth-2 trees, and both outperform the random
forest, although the standard errors are around 0.02, making none of these differences
significant. The test error rate for a single tree is 24%.

Bagging, random forests and boosting 69

Forward Stagewise Regression as Coordinate Descent

Consider a set of fixed basis functions {Tk }k=1,...,K .

I Initialze α̂k = 0, k = 1, . . . , K . Set > 0 to some small
constant, and M large.
I For m = 1 to M:
PN PK
– (β ∗ , k ∗ ) = arg minβ,k i=1 ` yi − l=1 α̂l Tl (xi ) − βTk (xi )
– α̂k ∗ ← α̂k ∗ + · sign(β ∗ ).
PK
Output FM (x) = k=1 α̂k Tk (x).
I

Bagging, random forests and boosting 70

Figure: Profiles of estimated coefficients from linear regression, for the
prostate data. The left panel shows the resultsP from the Lasso, for
different values of the bound parameter t = k |αk |. The right panel
shows the results of the forward stagewise linear regression, using
M = 220 consecutive steps of size = 0.01.

Bagging, random forests and boosting 71

Summary: Statistical view of boosting methods

Boosting methods have three important properties that contribute

to their success:
I they fit by functional gradient descent or coordinate descent
an additive model in a flexible set of basis functions.
I they use a suitable loss function for the fitting process.
I they regularize by forward stagewise fitting with shrinkage
(small step-size in coordinate descent) and early stopping, this
mimics an `1 (Lasso) penalty on the weights.

Bagging, random forests and boosting 72

Understanding Decision Tree Learning
No ratings yet
Understanding Decision Tree Learning
15 pages
Understanding Decision Trees in ML
No ratings yet
Understanding Decision Trees in ML
82 pages
Decision Trees and Nearest Neighbors
No ratings yet
Decision Trees and Nearest Neighbors
8 pages
5 Review Paper
No ratings yet
5 Review Paper
7 pages
Decision Trees in Machine Learning
No ratings yet
Decision Trees in Machine Learning
36 pages
Understanding CART: Decision Trees Explained
No ratings yet
Understanding CART: Decision Trees Explained
25 pages
Tree-Based Methods & SVM Overview
No ratings yet
Tree-Based Methods & SVM Overview
79 pages
ML 3
No ratings yet
ML 3
21 pages
Classifying Employee Retention with Trees
No ratings yet
Classifying Employee Retention with Trees
36 pages
Understanding Decision Trees in ML
No ratings yet
Understanding Decision Trees in ML
11 pages
ML Unit-5
No ratings yet
ML Unit-5
120 pages
Decision Tree Learning Overview
No ratings yet
Decision Tree Learning Overview
11 pages
Overview of Common Classification Algorithms
No ratings yet
Overview of Common Classification Algorithms
12 pages
Decision Tree Learning Overview
No ratings yet
Decision Tree Learning Overview
33 pages
Understanding Decision Trees in ML
No ratings yet
Understanding Decision Trees in ML
67 pages
Decision Tree and Random Forest Documentation
No ratings yet
Decision Tree and Random Forest Documentation
15 pages
Understanding Decision Tree Algorithms
100% (1)
Understanding Decision Tree Algorithms
57 pages
Tree-Based Methods in Machine Learning
No ratings yet
Tree-Based Methods in Machine Learning
21 pages
Big Mart Sales Prediction with ML
No ratings yet
Big Mart Sales Prediction with ML
66 pages
Understanding Decision Tree Structure
No ratings yet
Understanding Decision Tree Structure
57 pages
Introduction to Decision Tree Learning
No ratings yet
Introduction to Decision Tree Learning
96 pages
Understanding Decision Tree Learning
No ratings yet
Understanding Decision Tree Learning
10 pages
Tree-Based Methods for Regression & Classification
No ratings yet
Tree-Based Methods for Regression & Classification
33 pages
Tree-Based Methods in Statistics
No ratings yet
Tree-Based Methods in Statistics
42 pages
Decision Tree Algorithms in Machine Learning
No ratings yet
Decision Tree Algorithms in Machine Learning
54 pages
Understanding Decision Tree Algorithm
No ratings yet
Understanding Decision Tree Algorithm
9 pages
Introduction to Decision Trees in Data Mining
No ratings yet
Introduction to Decision Trees in Data Mining
30 pages
Machine Learning: Decision Trees Overview
No ratings yet
Machine Learning: Decision Trees Overview
82 pages
Understanding Decision Trees in Data Mining
No ratings yet
Understanding Decision Trees in Data Mining
13 pages
Understanding Decision Trees in Classification
100% (2)
Understanding Decision Trees in Classification
16 pages
Decision Trees and Random Forests Explained
No ratings yet
Decision Trees and Random Forests Explained
16 pages
Decision Trees and Ensemble Methods
No ratings yet
Decision Trees and Ensemble Methods
69 pages
ML Unit 3
No ratings yet
ML Unit 3
35 pages
Understanding Decision Trees and Overfitting
No ratings yet
Understanding Decision Trees and Overfitting
24 pages
C2.1 DecisionTrees Slides 4beamer
No ratings yet
C2.1 DecisionTrees Slides 4beamer
76 pages
Regression Trees and Random Forests Explained
No ratings yet
Regression Trees and Random Forests Explained
34 pages
Tree-Based Methods for Prediction Analysis
No ratings yet
Tree-Based Methods for Prediction Analysis
51 pages
Understanding Decision Trees in ML
No ratings yet
Understanding Decision Trees in ML
52 pages
Class 3
No ratings yet
Class 3
6 pages
Decision Tree Learning Overview
No ratings yet
Decision Tree Learning Overview
22 pages
Decision Trees for Regression Explained
No ratings yet
Decision Trees for Regression Explained
27 pages
Understanding Decision Trees in ML
No ratings yet
Understanding Decision Trees in ML
3 pages
Tree-Based Methods for Regression & Classification
No ratings yet
Tree-Based Methods for Regression & Classification
81 pages
Understanding Bayesian Classification
No ratings yet
Understanding Bayesian Classification
52 pages
Lecture 5-Decision Tree Based Learning
No ratings yet
Lecture 5-Decision Tree Based Learning
14 pages
Understanding Decision Trees in ML
No ratings yet
Understanding Decision Trees in ML
14 pages
Understanding Decision Tree Classification
No ratings yet
Understanding Decision Tree Classification
6 pages
Decision Trees in Business Analytics
No ratings yet
Decision Trees in Business Analytics
24 pages
Tree-Based Methods for Regression & Classification
No ratings yet
Tree-Based Methods for Regression & Classification
36 pages
DSA 5102: Decision Trees & Ensembling
No ratings yet
DSA 5102: Decision Trees & Ensembling
34 pages
Understanding Decision Trees in AI
No ratings yet
Understanding Decision Trees in AI
3 pages
Decision Tree Classifier Overview
No ratings yet
Decision Tree Classifier Overview
50 pages
ISB - Week 5 - Descriptive and Predictive Analytics - Decision Trees - Summary Deck
No ratings yet
ISB - Week 5 - Descriptive and Predictive Analytics - Decision Trees - Summary Deck
43 pages
Decision Tree Analysis in AI & ML
No ratings yet
Decision Tree Analysis in AI & ML
29 pages
Decision Tree Uodated Final
No ratings yet
Decision Tree Uodated Final
28 pages
Tree-Based Machine Learning Methods
No ratings yet
Tree-Based Machine Learning Methods
64 pages
Decision Tree Algorithm Explained
No ratings yet
Decision Tree Algorithm Explained
14 pages
Understanding Decision Trees in ML
No ratings yet
Understanding Decision Trees in ML
43 pages
Understanding Decision Tree Algorithms
No ratings yet
Understanding Decision Tree Algorithms
29 pages
Supervised Learning: Regression & Classification
No ratings yet
Supervised Learning: Regression & Classification
74 pages
Understanding Support Vector Machines
No ratings yet
Understanding Support Vector Machines
78 pages
Nonlinear Regression Techniques Explained
No ratings yet
Nonlinear Regression Techniques Explained
68 pages
Model Assessment and Selection Techniques
No ratings yet
Model Assessment and Selection Techniques
83 pages
PMJAY 2.0 BIS User Manual Overview
No ratings yet
PMJAY 2.0 BIS User Manual Overview
46 pages
Internet Risks and Safety Tips for Youth
No ratings yet
Internet Risks and Safety Tips for Youth
3 pages
AJP CT-I Question Bank
No ratings yet
AJP CT-I Question Bank
5 pages
Smartphone Networking Simulation Lab 1.1
No ratings yet
Smartphone Networking Simulation Lab 1.1
6 pages
Insights from an Industrial Visit
No ratings yet
Insights from an Industrial Visit
5 pages
Chapter 9: LibreOffice Base Overview
No ratings yet
Chapter 9: LibreOffice Base Overview
2 pages
Optimal Binary Search Trees Analysis
No ratings yet
Optimal Binary Search Trees Analysis
24 pages
Power BI Customization & Interactivity Guide
No ratings yet
Power BI Customization & Interactivity Guide
69 pages
Git Version Control Overview
No ratings yet
Git Version Control Overview
67 pages
IBM Sterling Agent Architecture Overview
No ratings yet
IBM Sterling Agent Architecture Overview
10 pages
EASA Part-66 Exam Questions of Module 05 Avionics - Part VI
83% (6)
EASA Part-66 Exam Questions of Module 05 Avionics - Part VI
22 pages
XONE:22 Parts Identification Guide
No ratings yet
XONE:22 Parts Identification Guide
4 pages
Nonlinear Finite Element Analysis Guide
100% (1)
Nonlinear Finite Element Analysis Guide
29 pages
Sign Language Recognition System Report
No ratings yet
Sign Language Recognition System Report
29 pages
Cybersecurity Strategies Post-Ransomware
No ratings yet
Cybersecurity Strategies Post-Ransomware
17 pages
Capgemini PC Autopilot Setup Guide
No ratings yet
Capgemini PC Autopilot Setup Guide
20 pages
ManoVet X-ray Solutions for Vets
No ratings yet
ManoVet X-ray Solutions for Vets
4 pages
Theory of Computation Exam Questions
No ratings yet
Theory of Computation Exam Questions
28 pages
Understanding Data Cubes in OLAP
No ratings yet
Understanding Data Cubes in OLAP
27 pages
Boston Housing Price Prediction Model
No ratings yet
Boston Housing Price Prediction Model
3 pages
ISO8583 Message Parser and Converter
No ratings yet
ISO8583 Message Parser and Converter
10 pages
CMOS NOT Gate Design and Analysis
No ratings yet
CMOS NOT Gate Design and Analysis
4 pages
Angular 6 Update and Features Guide
No ratings yet
Angular 6 Update and Features Guide
6 pages
ST950 Stratos OS Upgrade Quick Guide
No ratings yet
ST950 Stratos OS Upgrade Quick Guide
3 pages
DBMS Exam Questions for 2025
No ratings yet
DBMS Exam Questions for 2025
2 pages
SQL and Python Data Analysis Guide
No ratings yet
SQL and Python Data Analysis Guide
6 pages
OSS Components for Customer Solutions
No ratings yet
OSS Components for Customer Solutions
17 pages
Dutch Flag Algorithm Explained
No ratings yet
Dutch Flag Algorithm Explained
5 pages
Class XII Computer Science Syllabus 2025
No ratings yet
Class XII Computer Science Syllabus 2025
72 pages
Shopify Resume of React Developer
No ratings yet
Shopify Resume of React Developer
2 pages