Variable Importance in Binary Trees
Variable Importance in Binary Trees
Hemant Ishwaran∗
Department of Quantitative Health Sciences
Cleveland Clinic, 9500 Euclid Avenue, Cleveland, OH 44195
e-mail: [Link]@[Link]
Contents
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 519
2 Binary regression trees . . . . . . . . . . . . . . . . . . . . . . . . . . . 521
3 A surrogate VIMP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 522
3.1 Comparison to random forests . . . . . . . . . . . . . . . . . . . . 522
3.2 Subtrees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 523
3.3 Prediction error . . . . . . . . . . . . . . . . . . . . . . . . . . . . 525
4 A theoretical framework for VIMP . . . . . . . . . . . . . . . . . . . . 526
5 Forests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 528
5.1 VIMP for forests . . . . . . . . . . . . . . . . . . . . . . . . . . . 529
5.2 Implications for random forests . . . . . . . . . . . . . . . . . . . 530
6 Paired importance values and associations . . . . . . . . . . . . . . . . 530
6.1 Paired association values for forests . . . . . . . . . . . . . . . . . 532
7 Air pollution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 533
8 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 535
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 536
1. Introduction
Here Bm (x) are 0-1 basis functions that, because of the recursive nature of T ,
partition X .
An important class of binary trees are those associated with Classification
and Regression Trees (CART), a widely used method in regression and multi-
class data settings [1]. In CART, a binary tree is grown using recursive splitting
based on node impurity. The best split for a node is found by searching over
all possible binary splits for each variable, and identifying the split for a vari-
able leading to the greatest reduction in tree node impurity. Binary trees are
also implicitly used in Random Forests [2], a popular ensemble technique used
for prediction. Rather than using a single tree, random forests constructs an
ensemble predictor by averaging over a collection of binary trees. Each tree is
grown from an independent bootstrap sample of the data, where at each node
of the tree a randomly selected subset of variables of size mtry are chosen as
candidate variables to split on. Averaging over trees, in combination with the
randomization used in growing a tree, enables random forests to approximate a
rich class of functions while maintaining low generalization error. This enables
random forests to adapt to the data, automatically fitting higher order interac-
tions and non-linear effects, while at the same time keeping overfitting in check.
This has led to a great interest in the method and applications in many fields.
While CART and random forests are often used for exploratory data analy-
sis, they can also be used to select variables and reduce dimensionality. This is
done by ranking variables by some measure of importance and removing those
variables with low rank. Variable importance (VIMP) was originally defined in
CART using a measure involving surrogate variables (see Chapter 5 of [1]). Def-
initions in terms of mean overall improvement in node impurity for a tree have
also been proposed. In regression trees, node impurity is measured by mean
squared error, whereas in classification problems, the Gini index is used [3]. The
most popular VIMP method to date, however, adopts a prediction error ap-
proach involving “noising-up” a variable. In random forests, for example, VIMP
for a variable xv is the difference between prediction error when xv is noised
up by permuting its value randomly, compared to prediction error under the
original predictor [2, 4, 5, 6]. This type of definition has become quite popu-
lar. For example, there is a growing trend in Bioinformatics where importance
values from forests are used as a filtering method to reduce dimensionality in
high-throughput genomic data [7, 8, 9, 10].
Although VIMP has been studied in general settings [11], there is still a
great need for a systematic theoretical study focusing on trees. In this paper we
provide such a treatment for the class of binary regression trees and their forests.
This applies to data settings where the outcome Y is continuous. We define
rigorously VIMP as well as a measure of association between pairs of variables,
H. Ishwaran/Variable importance in trees and forests 521
and derive theoretical properties for such values. We find that something we call
a maximal subtree and its node mean squared error play a fundamental role in
our results.
It should be noted carefully that the definition for VIMP adopted here differs
from that used in random forests. The permutation method used by random
forests is complex and proved too difficult to analyze theoretically in any detail.
This motivated us to look for an approach simpler to study, but with similar key
features. Hence, we have adopted a simpler definition whereby VIMP is defined
as the difference between prediction error under noising and without noising,
where noising up a variable corresponds to what can be thought of as a random
left-right daughter node assignment. Section 3 provides a rigorous definition of
this process and discusses in detail how this method compares with random
forests.
Here Lm are the number of splits used to define Bm (x). Each split l involves a
variable, denoted by xl(m) for the l(m)-th coordinate of x, and a splitting value
for this variable denoted by cl,m . The sl,m are ±1 binary values, and for any
given x, [x]+1 = 1{x > 0} and [x]−1 = 1{x ≤ 0}. One can think of Bm (x) as a
0-1 function that is 1 only if x sits inside the hyper-rectangular region defined
by xl(m) and the splits cl,m .
As mentioned in the Introduction, recursive binary partitioning ensures that
the resulting decomposition of X induced by the basis functions is a non-
overlapping partition. Hence, an important property possessed by Bm (x) is
orthogonality. Thus,
3. A surrogate VIMP
We define the VIMP for a variable xv as the difference between prediction error
when xv is “noised up” versus the prediction error otherwise. To noise up xv
we adopt the following convention. To assign a terminal value to a case x, drop
x down T and follow its path until either a terminal node is reached or a node
with a split depending upon xv is reached. In the latter case choose the right or
left daughter of the node with equal probability. Now continue down the tree,
randomly choosing right and left daughter nodes whenever a split is encountered
(whether the split depends upon xv or not) until reaching a terminal node.
Assign x the node membership of this terminal node.
The left-right random daughter assignment is designed to induce a random
tree that promotes poor terminal values for cases that pass through nodes that
split on xv . In fact, one can simply think of the noising up process equivalently
in terms of this random tree, which we shall denote by Tv . The prediction
performance of Tv is intimately tied to the VIMP of xv , which is directly related
to the location of xv splits in T . The more informative xv is, the higher up xv
splits will appear in T , and consequently the more perturbed Tv becomes relative
to T . This results in a loss of prediction accuracy for Tv and a high VIMP for
xv .
While our noising up procedure is different than the current method employed
in random forests, our procedure is designed to approximately mimic what is
seen in practice. In random forests, VIMP is calculated by noising up a variable
by randomly permuting it. A given xv is randomly permuted in the out-of-bag
(OOB) data (the data not selected by bootstrapping) and the noised up OOB
data is dropped down the tree grown from the in-bag data (bootstrap data).
This is done for each tree in the forest and an out-of-bag estimate of prediction
error is computed from the resulting predictor. The difference between this
value and the out-of-bag error without random permutation is the VIMP of
xv . Large positive values indicate xv is predictive (since noising up xv increases
prediction error), whereas zero or negative importance values identify variables
not predictive [2].
Often one finds in random forests that variables that tend to split close to
the root have a strong effect on prediction accuracy [12]. When such variables
are randomly permuted, each tree is perturbed substantially, so when out-of-
bag data is dropped down the tree, the resulting predictor has poor prediction.
Averaging over such predictors results in an ensemble with high error rate and a
large VIMP for the variable. Variables tending to split lower down in a tree, on
the other hand, have much less impact, because as one travels through a binary
tree the amount of data drops off exponentially fast. Consequently there are
much fewer observations affected and the noised up predictor does not suffer so
much.
H. Ishwaran/Variable importance in trees and forests 523
Our left-right noising up process shares with random forests the key feature
that VIMP is directly impacted by location of the primary split relative to the
root node. For this reason we believe our approach can provide insight into
VIMP for random forests.
However, the analogy is by no means perfect. A nice example of this, sug-
gested by one of the reviewers of the paper, is as follows. Recall that at each
node, while growing a tree, random forests randomly selects a subset of mtry
variables to split on. In a problem with a large number of non-informative vari-
ables, it is possible that early on in the tree growing process the mtry variables
for a node are all non-informative. Hence, the eventual split for the node is on
a non-informative variable, high up in the tree. Trees are, however, highly data
adaptive and can recover from scenarios like this. Splits further down the tree
can be on informative variables and the tree can still be predictive. Now consider
what happens when we drop a case down the tree using the left-right noising up
process. If the case passes through the node split on the non-informative vari-
able xv , then it follows a random left-right path through potentially predictive
splits. Since the node is high in the tree, the resulting random tree Tv suffers in
terms of prediction and xv can appear informative.
This type of scenario shows that a non-informative variable can appear in-
formative over a single tree under our noising up process. However, over a forest
such an effect will most likely be washed out. With many non-informative vari-
ables, the probability that xv splits high in a tree is small. Since the event has
low probability, the forest will contain few trees with high xv splits. Averaging
over trees pushes the VIMP of xv to zero. Moreover, for a single tree, this kind
of problem can be resolved by slightly modifying the noising up process. Rather
than using random left-right assignments on all nodes beneath xv , use random
assignments for only those nodes that split on xv . This will impact prediction
only when xv is informative and not affect prediction for non-informative vari-
ables. This modified process, in fact, was used in early attempts to study this
problem. Unfortunately, though, while it was our feeling this type of VIMP
could compete against the one used by random forests, it was still too complex
for the kind of detailed theoretical development we were after.
It is important to emphasize that while deficiencies like those mentioned
suggest our noising up process may not be practicable in all data examples, this
does not detract from theory developed under this mechanism. As we define
shortly, a key theoretical concept in our framework is the notion of a maximal
subtree. We believe this is key to understanding VIMP, even complicated values
like those used by random forests.
3.2. Subtrees
To discuss maximal subtrees, first note the predictor associated with the original
tree T can easily be written in closed form. Denote the predictor by µ̂(x). It
H. Ishwaran/Variable importance in trees and forests 524
follows that
M
X
µ̂(x) = am Bm (x).
m=1
Definition 2. Call T̃v a v-subtree of T if the root node of T̃v has daughters
that depend upon an xv split. Call T̃v a maximal v-subtree if T̃v is not a subtree
of a larger v-subtree (see Figure 1). For example, the root node tree T is always
a maximal v-subtree for some v.
root
x1 ≤ c 1 x1 > c 1
x9 ≤ c 2 x9 > c 2
1 2
a1 a2 x5 > c 3
x5 ≤ c 3
8
a8
x3 ≤ c 4
x3 > c 4
x5 ≤ c 5 x5 > c 5
x3 ≤ c 6 x3 > c 6
3 4
a3 a4
x8 ≤ c 7 x8 > c 7 7
a7
5 6
a5 a6
Fig 1. Large box contains a maximal v-subtree, for v = 3. Smaller box containing tree with
nodes 5, 6 and 7 is also a v-subtree, but it is not maximal.
H. Ishwaran/Variable importance in trees and forests 525
X Kv
X
µ̂v (x) = am Bm (x) + ãk,v I{T (x) ∈ Mk,v }, (3.1)
m∈M
/ v k=1
where ãk,v is the random terminal value assigned by T̃k,v under a random left-
right path through T̃k,v . Write P̃k,v for the distribution of ãk,v . Note that P̃k,v
is conditional on the original tree T .
The expectation on the right-hand side is joint over the original learning data
L as well as the independent test data (Y, x). It is assumed that Y can be
written as a regression model
Y = µ(x) + ε, (3.2)
where ε has zero mean and variance σ 2 > 0 and ε is assumed independent of
µ(x). In a similar way one can define PE for µ̂v :
P(µ̂v ) = E g(Y, µ̂v (x)) .
Note the expection on the right-hand side involves an extra level of integration
over ãk,v via its distribution P̃k,v .
We now define the VIMP of xv . In this paper we confine attention to the
case when g(Y, µ̂) = (Y − µ̂)2 , corresponding to L2 -loss. The VIMP value for xv
under L2 -loss is defined as
∆v = P(µ̂v ) − P(µ̂).
Using a similar expansion for P(µ̂v ) as (3.3), deduce that the VIMP for xv is
n o
∆v = E Rv (x)2 − 2E Rv (x) µ(x) − µ̂(x)
where
Kv
X X
Rv (x) = (ãk,v − am )Bm (x).
k=1 m∈Mk,v
Clearly ∆v depends heavily upon the relationship between am and ãk,v . For
example, ∆v = 0 if ãk,v = am for each m ∈ Mk,v and each k. In this case, all
terminal values within a maximal subtree are equal and randomly assigning a
terminal value cannot perturb matters. Hence, xv is non-informative as further
clustering of the data is ineffective. In general, however, noising up xv will
perturb ãm,v relative to am and result in a positive or negative value for ∆v .
Trying to detangle the resulting properties of ∆v in such settings requires a
theoretical framework which we discuss next.
where am,0 are the true, but unknown, terminal values. Observe that µ(x) is
random because Bm (x) are basis functions determined from the learning data.
Nevertheless, we shall still think of µ(x) as our “population parameter” because
the terminal values for µ(x) are fixed and do not depend upon the data. Note
importantly that (3.3) still holds even though µ(x) depends upon the data.
Remark 1. Assumption (4.1) might seem simplistic at first glance, but we’ll
show later in Section 5 that it is quite realistic when put in the context of forests.
By exploiting the orthogonality property (2.1) of the tree basis functions, we
can greatly simplify the expression for ∆v under assumption (4.1). Consider the
following theorem.
H. Ishwaran/Variable importance in trees and forests 527
Theorem 1. Assume the true model is (3.2) and that (4.1) holds (for any given
L ). Then,
(K )
v h
X X
2
i
∆v = E πm sk,v + ak,v − am ak,v + am − 2am,0 , (4.2)
k=1 m∈Mk,v
where πm = P{Bm (x) = 1|L } and s2k,v and ak,v denote the variance and mean
for ãk,v under P̃k,v .
Proof. Using orthogonality (2.1) and assumption (4.1), collect together com-
mon expressions to deduce that
(K )
X v X
2
∆v = E rk,m − 2rk,m rm Bm (x) ,
k=1 m∈Mk,v
Continuing to keep L fixed, bring the expectation over x inside the sum. Only
the term Bm (x) depends upon x, from which (4.2) follows.
the node mean squared error for the subtree T̃k,v . An immediate corollary to
Theorem 1 is that the closer µ̂(x) is to the true signal, the greater the effect
node mean squared error has on VIMP.
Corollary 1. If am → am,0 , as one might hope for if sample sizes converge to
∞, (K )
X v
∆v → E θ0 (k, v) .
k=1
Note because θ0 (k, v) ≥ 0, all VIMP values are non-negative in the limiting case.
H. Ishwaran/Variable importance in trees and forests 528
Corollary 1 shows in the limit that each maximal v-subtree contributes equally
to ∆v through its node mean squared error. Furthermore, the node mean squared
error for a subtree T̃k,v is a weighted average with some nodes contributing sig-
nificantly more to overall mean squared error than others. From (4.3), we know
that θ0 (k, v) is a weighted average involving weights πm . The value πm is the
probability that a fresh test x has node membership m under T . If many splits
are needed to reach m, or equivalently if Bm (x) involves a large number of basis
functions, then πm is likely to be small. Hence, nodes far down T̃k,v contribute
less to θ0 (k, v) then nodes closer to its root node. Moreover, this effect is further
amplified due to P̃k,v . Under random splitting, if Lm splits are needed to reach
a node m from the root of T̃k,v , then P̃k,v assigns mass 2−Lm to am . So again,
the closer a node is to the root of the subtree, the bigger its impact on node
mean squared error and ∆v . This is consistent with what was said earlier in
Section 3 regarding impact node position has on VIMP.
5. Forests
In this section we extend our theory to forests. We begin with a formal definition.
Note unlike trees, forests do not assign node membership values. Forests
are simply predictors, and they are much better at predicting than individual
trees. In fact, under mild conditions, one can show that the closure of linear
combination of binary tree predictors form a complete space. This leads to the
following theorem (c.f. Proposition 2 of [13]).
Theorem 2. Assume x has distribution P0 that is absolutely continuous with
respect to Lebesgue measure with support Sd on a finite closed rectangle in Rd .
Then for each real valued function µ ∈ L2 (P0 ), and each δ > 0, there exists a
forest µ̂F constructed from binary trees with M > d terminal nodes such that
Z
(µ̂F (x) − µ(x))2 P0 (dx) ≤ δ.
Sd
binary tree T with M > d nodes, the result follows because the closure of finite
linear combinations of such predictors is L2 (P0 ). It suffices to consider indicator
functions of the form
I(x) = I{x1 ≤ c1 , x2 ≤ c2 , . . . , xd ≤ cd }.
To construct T , first split x1 at c1 . This yields a left daughter node x1 ≤ c1 .
Now split the left daughter at x2 ≤ c2 and it’s left daughter at x3 ≤ c3 and
so forth up to xd ≤ cd . Label the resulting terminal node m = 1 and denote
its basis function by B1 (x). Observe that d splits are needed to create B1 (x)
and therefore T has M = d + 1 terminal nodes. Let a1 = 1 and set am = 0 for
m > 1. Then µ̂(x) = I(x).
Suppose that T (x; b), b = 1, . . . , B, are distinct binary trees grown from the
learning data L . For example, each tree could be grown as in random forests
but without bootstrapping the data. Extending our previous notation in an
obvious way, the forest predictor for the collection of trees is defined to be
(b)
B M
1 X X (b)
µ̂F (x) = am Bm (x; b).
B m=1b=1
Now using a similar argument as in Section 3.3, the VIMP for xv is:
n o
∆Fv = E RFv (x)2 − 2E RFv (x) µ(x) − µ̂F (x) ,
where
(b)
B Kv
1 XX X (b)
RFv (x) = ãk,v − a(b)
m Bm (x; b).
B
b=1 k=1 m∈M (b)
k,v
(b)
Theorem 3. Let RFv ,0 (x) be the function RFv (x) in which am is replaced with
(b) (b) (b)
the value am,0 . Assume (3.2) and (5.1) holds. If am → am,0 for each m and b,
then
(b)
B K
( )
v
1 X X
∆Fv → E RFv ,0 (x)2 ≤ E
θ0 (k, v, b) , (5.2)
B
b=1 k=1
where 2
(b) (b) (b)
X
θ0 (k, v, b) = πm,b P̃k,v,0 ãk,v,0 − am,0
(b)
m∈Mk,v
is the node mean squared error for the kth maximal v-subtree of T (x; b) and
πm,b = P{Bm (x; b) = 1|L }.
Proof. The limit follows automatically. The bound on the right-hand side
of (5.2) is due to Jensen’s inequality and orthogonality (2.1).
The bound in (5.2) becomes tighter as the trees in the forest become orthogonal
to one another. Moreover, a key assumption in Theorem (3) is that (5.1) holds,
which by Theorem 2 is reasonable so as long as the individual trees in the forest
are rich enough. Thus, a prescription for VIMP to be positive and fully charac-
terized by subtree node mean squared error is that the forest should be derived
from trees rich enough to approximate µ(x) while simultaneously being nearly
orthogonal to one another. This is quite interesting, because this is the same
prescription needed to achieve low generalization errors in random forests (see
Theorem 11.2 of [2]). So the right-hand side of (5.2) is a very reasonable starting
point for understanding VIMP in random forests. In fact, in Section 6.1, we will
show that node mean squared error terms like those appearing in (5.2) play a
crucial role in understanding paired associations between variables in forests.
For the moment let us return to a study of single trees. We consider the joint
VIMP of a pair of variables (xv , xw ) in a binary tree T . Let t = (v, w) and
write xt to indicate (xv , xw ). The paired VIMP for xt is the difference between
prediction error when xt is noised up versus the prediction error otherwise.
Noising up a pair of variables is defined as follows. For a case x, drop x down
T , and if x enters the root node of a maximal v-subtree T̃k,v , or a maximal
w-subtree T̃k,w , then initiate a random left-right path until a terminal node is
reached. This will assign x a random terminal value ãk,t with distribution P̃k,t .
Similar to our perturbation for a single variable, one can think of the above
randomization process as defining a random tree. We refer to this tree as Tt .
Let
T̃1,t , . . . , T̃Kt ,t
H. Ishwaran/Variable importance in trees and forests 531
Here K (v) and K (w) are the set of indices {ik : k = 1, . . . , Kv∗ } and {il : l =
∗
1, . . . , Kw }, and Mt is the set of all terminal nodes for (6.1). Note importantly
that ãk,v and ãl,v have distribution P̃k,v and P̃l,w .
Define the paired VIMP for xt as
∆t = P(µ̂t ) − P(µ̂).
We compare ∆t to a composite additive effect of the individual variables v and
w. This will allow us to identify pairs of variables whose “joint effect” differs
from the sum of their individual components. Thus allowing identification of in-
teresting paired assocations (for another interesting way to identify interactions
see [14]).
The marginal importance effect for v and w under an additive effect paradigm
is defined to be
∆v+w = ∆v + ∆w .
The difference between the two sets of importance values is what we call At ,
the measure of association between xv and xw :
At = ∆t − ∆v+w = ∆t − (∆v + ∆w ). (6.2)
Using similar arguments as in Theorem 1, we obtain the following characteriza-
tion of At .
Theorem 4. If am → am,0 , then under the conditions of Theorem 1:
( )
X X
At → − E θ0 (k, v) + θ0 (l, w) .
k∈K
/ (v) l∈K
/ (w)
H. Ishwaran/Variable importance in trees and forests 532
Theorem 4 shows in the limit that At can never be positive. The reason for
this is that ∆v and ∆w overcount maximal subtrees since a maximal v-subtree
can be a subset of a maximal w-subtree and vice-versa. Consequently, ∆v + ∆w
can never be smaller than ∆t . If v and w are orthogonal, then there is no
overlap between maximal v and w-subtrees and no overcounting occurs. Hence,
the above sums are null and At = 0 in the limit. On the other hand, if v and w
are not orthogonal, then At has a negative limit. The more negative, the more
overlap there is between v and w-subtrees, and the stronger the indication of
an association between the two variables.
where
B
(
1 X X X (b)
RFt (x) = ãk,v − a(b)
m Bm (x; b)
B
b=1 k∈K (v,b) m∈M (b)
k,v
)
X X (b)
+ ãl,w − a(b)
m Bm (x; b) .
l∈K (w,b) (b)
m∈Ml,w
Theorem 5 implies a delicate relationship for At . This involves the way v and
w interact within a given tree as well as how the variables interact across trees
in the forest. Consider, first, the extreme case when all basis functions in RFt
are mutually orthogonal. This implies a forest where v and w-maximal subtrees
are disjoint across trees. Then,
1 X B X X
At → − E θ0 (k, v, b) + θ0 (l, w, b) ,
B
b=1 k∈K
/ (v,b) l∈K
/ (w,b)
where θ0 (k, v, b) and θ0 (l, w, b) are node mean squared errors as defined in Theo-
rem 3. Note that the limit on the right-hand side is similar to that in Theorem 4.
The only difference here is that we are averaging node mean squared error terms
over trees. Just as in Theorem 4 the limit of At cannot be positive. Also, the size
of At depends upon the behavior of v and w within a tree. For example, if v and
w are orthogonal, such that for any given tree all maximal v and w-subtrees are
disjoint, then At = 0 in the limit. In contrast, the more overlap between v and
w-subtrees within a given tree, the more maximal subtrees are overcounted, and
the more negative At becomes. So just like single trees, large negative values
indicate a strong association.
On the other hand, if trees are correlated, then (6.3) can be positive. One
example is as follows. Suppose that v and w are highly correlated and both
are influential. Then for any given tree both variables are strong competitors to
grow the tree. If one variable is selected early on in the tree growing process,
this might preclude the other variable from being selected. Things might be so
bad that we could have 50% of the trees grown using v and 50% grown using w.
So noising up v or w individually will only affect 50% of the trees. However, a
joint noising up of variables perturbs all trees. Consequently, prediction will be
substantially worse and we would expect At to be positive. The next example
is a good illustration of this.
7. Air pollution
For our first illustration we use the well known “air pollution” data set [15]. The
variables are daily readings of various air quality values measured from May 1,
1973 to September 30, 1973 in the New York metropolitan area. Solar radiation
in Langleys (Solar), average wind speed in miles per hour (Wind), maximum
daily temperature in Fahrenheit (Temp), month of the year (Month) and day of
month (Day) were all recorded. The outcome is the mean ozone value (Ozone),
which was transformed by taking its cube-root.
We analyzed the data using random forests. Computations were carried out
using the “randomForest” R-package [16]. In regression settings, randomForest
computes VIMP by subtracting an out-of-bag mean squared error rate under
a random permutation of a variable xv from an out-of-bag mean squared error
rate under the original xv . As we have discussed already, these values resemble
∆v , but are not exactly the same.
H. Ishwaran/Variable importance in trees and forests 534
Table 1
Paired association values using random forests for air pollution data. Values reported are
averaged over 1000 independent replicates with each forest based on 1000 trees.
Paired Additive Association Assoc/MSE
Temp:Wind 0.706 0.600 0.106 11.351
Temp:Solar 0.590 0.529 0.061 6.532
Temp:Month 0.464 0.455 0.008 0.904
Temp:Day 0.465 0.462 0.003 0.298
Wind:Solar 0.232 0.214 0.017 2.465
Wind:Month 0.142 0.141 0.001 0.137
Wind:Day 0.150 0.148 0.002 0.328
Solar:Month 0.074 0.073 0.000 0.041
Solar:Day 0.084 0.082 0.002 0.486
Month:Day 0.006 0.006 0.000 0.046
Table 1 shows that the largest association value is between the variables Temp
and Wind. Also interesting, is the second largest association value occurring for
Solar and Temp. Solar radiation and temperature are known to be related,
and the fairly large association value in Table 1 confirms this relationship. The
next largest association value is between Wind and Solar, but this value is
substantially smaller (as can be seen by the standardized value in the 4th column
of the table). After this, association values drop off dramatically.
These results are based on random forests, whereas Theorem 5 applies to the
H. Ishwaran/Variable importance in trees and forests 535
Given : Temp
60 70 80 90
20
5
●
● ● ●
● ●
● ●● ●
● ●● ● ● ●● ●●
● ● ● ● ● ● ● ● ● ● ●
● ● ●
3
● ● ● ● ● ●
● ● ● ●●●●●
● ● ● ● ● ●●
● ● ●● ●
● ●
●● ●● ●● ● ● ● ● ● ●● ●● ● ● ● ●
●●● ● ● ●●● ● ● ● ● ●
●
●
1
5
15
●
●
● ● ● ● ● ● ●
Given : Wind
● ●
● ● ●● ●
● ● ● ●
●●
● ●● ● ●● ● ●
●● ● ● ● ● ●
3
● ● ●
● ● ●●● ● ●
● ● ●●●●●
● ● ● ●
● ●● ● ●● ● ● ●● ● ●● ● ●
Ozone
●●●
● ● ●
●
1
5
● ●●
●
10
● ● ●● ● ●
● ● ● ●
● ●
● ●●
● ● ● ● ● ● ● ● ● ●●● ● ●
● ●● ●
● ● ●● ● ● ● ● ●● ● ● ●
3
● ● ● ●
● ● ● ●● ● ● ● ● ●● ● ●● ● ● ● ● ●
●● ● ● ●● ● ● ● ● ● ● ●
● ● ● ● ● ●
● ●
●
1
●
● ● ●
● ●
5
● ●
● ●●● ●
● ● ●●
●●
● ● ●●
● ● ●● ●
●
● ●
● ● ● ●
● ●
5
● ● ●
● ● ● ●●● ● ● ●● ●
● ● ● ● ● ● ● ●
● ● ●
3
● ● ● ●
● ● ● ● ● ● ● ● ● ● ●
● ● ● ● ● ● ● ● ●
●
1
Fig 2. Coplot of ozone values against solar radiation given wind and temperature.
VIMP under a special type of noising up process for forests. Thus, it is inter-
esting to see how well the theory extrapolates. To consider this, we generated a
conditional plot (coplot) of ozone values against solar radiation, conditioning on
wind and temperature. See Figure 2. As seen, ozone increase with solar radia-
tion, and the slope of the curves increase as temperature increases for any fixed
range of wind speeds. This clearly demonstrates an association between solar
radiation and temperature. Also, looking at ozone and radiation with respect
to wind, the slope increases as wind increases at high temperatures, but this
pattern does not apply at lower temperature values. This is clear evidence of a
Temp:Wind association. These results are consistent with Table 1.
8. Simulation
A sample size of n = 100 was drawn. The data was then analyzed using the
randomForest package exactly as outlined in the previous section. All settings
were kept the same. The simulation was repeated 100 times independently. The
averaged values over the 100 simulations are reported in Table 2. Variables 1-6
are coded as a-f respectively in Table 2.
Table 2
Paired association values using random forests for simulated data. Analysis carried out in
the same manner as Table 1.
Paired Additive Association Assoc/MSE
a:b 185.216 192.870 −7.654 −4.231
a:c 132.347 132.426 −0.079 −0.044
a:d 140.472 141.906 −1.434 −0.849
a:e 133.518 133.530 −0.011 −0.003
a:f 131.552 131.639 −0.088 −0.055
b:c 61.692 61.736 −0.044 −0.033
b:d 71.350 71.200 0.150 0.110
b:e 62.822 62.857 −0.035 −0.023
b:f 60.990 60.965 0.025 0.021
c:d 10.888 10.870 0.018 0.014
c:e 2.500 2.474 0.026 0.026
c:f 0.599 0.589 0.009 0.007
d:e 11.948 11.934 0.015 0.019
d:f 10.022 10.040 −0.019 −0.022
e:f 1.713 1.715 −0.002 −0.011
By far, the two largest standarized association values are a:b and a:d, rep-
resening the interactions between variable x1 and x2 and x1 and x4 (see the
4th column in Table 2). Note how these values are negative, unlike in our pre-
vious analysis where all significant associations were positive. Large negative
values like these, as our theory suggests, is due to high overlap between v and
w-subtrees. For example, for x1 and x2 , their interaction must be such that
whenever x1 splits a node in the tree, there is a high likelihood that x2 is split
underneath this node, and vice-versa.
References
[1] Breiman L., Friedman J.H., Olshen R.A., and Stone C.J. Classification and
Regression Trees. Wadsworth, Belmont, California, 1984. MR0726392
[2] Breiman L. Random forests. Machine Learning, 45:5–32, 2001.
[3] Friedman J. Greedy function approximation: a gradient boosting machine.
Ann. Statist., 29, 1189–1232, 2001. MR1873328
[4] Liaw A. and Wiener M. Classification and regression by randomForest. R
News, 2:18–22, 2002.
[5] Ishwaran H. and Kogalur U.B. Random survival forests for R. To appear
in R News, 2007.
[6] Ishwaran H., Kogalur U.B, Blackstone E.H. and Lauer, M.S. Random
survival forests. Cleveland Clinic Technical Report, 2007.
H. Ishwaran/Variable importance in trees and forests 537
[7] Breiman L. Statistical modeling: the two cultures. Stat. Science, 16:199–
231, 2001. MR1874152
[8] Lunetta K.L., Hayward L.B., Segal J. and Eerdewegh P.V. Screening large-
scale association study data: exploiting interactions using random forests.
BMC Genetics, 5:32, 2004.
[9] Bureau A., Dupuis J., Falls K., Lunetta K.L., Hayward B., Keith T.P.,
Eerdewegh P.V. Identifying SNPs predictive of phenotype using random
forests. Genetic Epidemiology, 28:171-182, 2005.
[10] Diaz-Uriarte R. and Alvarez de Andres S. Gene selection and classification
of microarray data using random forest. BMC Bioinformatics, 7:3, 2006.
[11] van der Laan M. Statistical inference for variable importance. International
J. Biosatist., 2:1008, 2006. MR2275897
[12] Strobl C., Boulesteix A.L., Zeileis A., and Hothorn T. Bias in random
forest variable importance measures: Illustrations, sources and a solution.
BMC Bioinformatics, 8:25, 2007.
[13] Breiman L. Population theory for boosting ensembles. Ann. Stat., 32:1–11,
2004. MR2050998
[14] Ng V.W. and Breiman L. Bivariate variable selection for classification
problem. Technical report 692: Berkeley Statistics Department, 2005.
[15] Chambers J. M., Cleveland W. S., Kleiner B., and Tukey P. A. Graphical
Methods for Data Analysis. Wadsworth, Belmont, California, 1983.
[16] Liaw A. and Wiener M. randomForest 4.5-18.
[Link] 2007.