0% found this document useful (0 votes)

2 views19 pages

Variable Importance in Binary Trees

The document discusses variable importance (VIMP) in binary regression trees and forests, highlighting its theoretical foundations and practical applications, particularly in bioinformatics for genomic data analysis. It introduces a novel method for defining VIMP based on prediction error differences when variables are 'noised up,' comparing this approach to existing methods used in random forests. The paper aims to provide a systematic theoretical framework for understanding VIMP and its implications for variable selection and dimensionality reduction in statistical modeling.

Uploaded by

Joaquim Ferreira

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

2 views19 pages

Variable Importance in Binary Trees

Uploaded by

Joaquim Ferreira

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

Electronic Journal of Statistics

Vol. 1 (2007) 519–537

ISSN: 1935-7524
DOI: 10.1214/07-EJS039

Variable importance in binary

regression trees and forests
arXiv:0711.2434v1 [[Link]] 15 Nov 2007

Hemant Ishwaran∗
Department of Quantitative Health Sciences
Cleveland Clinic, 9500 Euclid Avenue, Cleveland, OH 44195
e-mail: [Link]@[Link]

Abstract: We characterize and study variable importance (VIMP) and

pairwise variable associations in binary regression trees. A key component
involves the node mean squared error for a quantity we refer to as a max-
imal subtree. The theory naturally extends from single trees to ensembles
of trees and applies to methods like random forests. This is useful because
while importance values from random forests are used to screen variables,
for example they are used to filter high throughput genomic data in Bioin-
formatics, very little theory exists about their properties.

Keywords and phrases: CART, random forests, maximal subtree.

Received April 2007.

Contents

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 519
2 Binary regression trees . . . . . . . . . . . . . . . . . . . . . . . . . . . 521
3 A surrogate VIMP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 522
3.1 Comparison to random forests . . . . . . . . . . . . . . . . . . . . 522
3.2 Subtrees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 523
3.3 Prediction error . . . . . . . . . . . . . . . . . . . . . . . . . . . . 525
4 A theoretical framework for VIMP . . . . . . . . . . . . . . . . . . . . 526
5 Forests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 528
5.1 VIMP for forests . . . . . . . . . . . . . . . . . . . . . . . . . . . 529
5.2 Implications for random forests . . . . . . . . . . . . . . . . . . . 530
6 Paired importance values and associations . . . . . . . . . . . . . . . . 530
6.1 Paired association values for forests . . . . . . . . . . . . . . . . . 532
7 Air pollution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 533
8 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 535
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 536

1. Introduction

A binary recursively grown tree, T , can be thought of as a map that uniquely

identifies a multivariable covariate x ∈ X with tree node membership. Here x is
∗ Research partially supported by National Institutes of Health RO1 grant HL-072771.
519
H. Ishwaran/Variable importance in trees and forests 520

a d-dimensional vector x = (x1 , . . . , xv ) ∈ Rd , where xv are real valued variables.

If T has M terminal nodes, then T is the function T : X → {1, . . . , M } defined
via
XM
T (x) = mBm (x). (1.1)
m=1

Here Bm (x) are 0-1 basis functions that, because of the recursive nature of T ,
partition X .
An important class of binary trees are those associated with Classification
and Regression Trees (CART), a widely used method in regression and multi-
class data settings [1]. In CART, a binary tree is grown using recursive splitting
based on node impurity. The best split for a node is found by searching over
all possible binary splits for each variable, and identifying the split for a vari-
able leading to the greatest reduction in tree node impurity. Binary trees are
also implicitly used in Random Forests [2], a popular ensemble technique used
for prediction. Rather than using a single tree, random forests constructs an
ensemble predictor by averaging over a collection of binary trees. Each tree is
grown from an independent bootstrap sample of the data, where at each node
of the tree a randomly selected subset of variables of size mtry are chosen as
candidate variables to split on. Averaging over trees, in combination with the
randomization used in growing a tree, enables random forests to approximate a
rich class of functions while maintaining low generalization error. This enables
random forests to adapt to the data, automatically fitting higher order interac-
tions and non-linear effects, while at the same time keeping overfitting in check.
This has led to a great interest in the method and applications in many fields.
While CART and random forests are often used for exploratory data analy-
sis, they can also be used to select variables and reduce dimensionality. This is
done by ranking variables by some measure of importance and removing those
variables with low rank. Variable importance (VIMP) was originally defined in
CART using a measure involving surrogate variables (see Chapter 5 of [1]). Def-
initions in terms of mean overall improvement in node impurity for a tree have
also been proposed. In regression trees, node impurity is measured by mean
squared error, whereas in classification problems, the Gini index is used [3]. The
most popular VIMP method to date, however, adopts a prediction error ap-
proach involving “noising-up” a variable. In random forests, for example, VIMP
for a variable xv is the difference between prediction error when xv is noised
up by permuting its value randomly, compared to prediction error under the
original predictor [2, 4, 5, 6]. This type of definition has become quite popu-
lar. For example, there is a growing trend in Bioinformatics where importance
values from forests are used as a filtering method to reduce dimensionality in
high-throughput genomic data [7, 8, 9, 10].
Although VIMP has been studied in general settings [11], there is still a
great need for a systematic theoretical study focusing on trees. In this paper we
provide such a treatment for the class of binary regression trees and their forests.
This applies to data settings where the outcome Y is continuous. We define
rigorously VIMP as well as a measure of association between pairs of variables,
H. Ishwaran/Variable importance in trees and forests 521

and derive theoretical properties for such values. We find that something we call
a maximal subtree and its node mean squared error play a fundamental role in
our results.
It should be noted carefully that the definition for VIMP adopted here differs
from that used in random forests. The permutation method used by random
forests is complex and proved too difficult to analyze theoretically in any detail.
This motivated us to look for an approach simpler to study, but with similar key
features. Hence, we have adopted a simpler definition whereby VIMP is defined
as the difference between prediction error under noising and without noising,
where noising up a variable corresponds to what can be thought of as a random
left-right daughter node assignment. Section 3 provides a rigorous definition of
this process and discusses in detail how this method compares with random
forests.

2. Binary regression trees

We begin by first defining a binary regression tree.

Definition 1. Let L = {(Yi , xi ) : i = 1, . . . , n} denote the learning data. A

binary tree T grown from L is a tree grown using successive recursive binary
splits on variables of the form xv ≤ c and xv > c where split values c are chosen
from the observed values of xi . The terminal values am for a terminal node of T
is the P
average of those Yi withPxni having node membership m. In other words,
n
am = i=1 I{T (xi ) ∈ m}Yi / i=1 I{T (xi ) ∈ m}.

A binary regression tree (hereafter simply refered to as a binary tree) must

be of the form (1.1). Moreover, because of the nature of recursive partitioning,
the basis functions Bm (x) in T are product splines of the form:
Lm
Y
Bm (x) = xl(m) − cl,m sl,m
.
l=1

Here Lm are the number of splits used to define Bm (x). Each split l involves a
variable, denoted by xl(m) for the l(m)-th coordinate of x, and a splitting value
for this variable denoted by cl,m . The sl,m are ±1 binary values, and for any
given x, [x]+1 = 1{x > 0} and [x]−1 = 1{x ≤ 0}. One can think of Bm (x) as a
0-1 function that is 1 only if x sits inside the hyper-rectangular region defined
by xl(m) and the splits cl,m .
As mentioned in the Introduction, recursive binary partitioning ensures that
the resulting decomposition of X induced by the basis functions is a non-
overlapping partition. Hence, an important property possessed by Bm (x) is
orthogonality. Thus,

Bm (x)Bm′ (x) = 0 if m 6= m′ . (2.1)

This property will play a key role in our theoretical development.

H. Ishwaran/Variable importance in trees and forests 522

3. A surrogate VIMP

We define the VIMP for a variable xv as the difference between prediction error
when xv is “noised up” versus the prediction error otherwise. To noise up xv
we adopt the following convention. To assign a terminal value to a case x, drop
x down T and follow its path until either a terminal node is reached or a node
with a split depending upon xv is reached. In the latter case choose the right or
left daughter of the node with equal probability. Now continue down the tree,
randomly choosing right and left daughter nodes whenever a split is encountered
(whether the split depends upon xv or not) until reaching a terminal node.
Assign x the node membership of this terminal node.
The left-right random daughter assignment is designed to induce a random
tree that promotes poor terminal values for cases that pass through nodes that
split on xv . In fact, one can simply think of the noising up process equivalently
in terms of this random tree, which we shall denote by Tv . The prediction
performance of Tv is intimately tied to the VIMP of xv , which is directly related
to the location of xv splits in T . The more informative xv is, the higher up xv
splits will appear in T , and consequently the more perturbed Tv becomes relative
to T . This results in a loss of prediction accuracy for Tv and a high VIMP for
xv .

3.1. Comparison to random forests

While our noising up procedure is different than the current method employed
in random forests, our procedure is designed to approximately mimic what is
seen in practice. In random forests, VIMP is calculated by noising up a variable
by randomly permuting it. A given xv is randomly permuted in the out-of-bag
(OOB) data (the data not selected by bootstrapping) and the noised up OOB
data is dropped down the tree grown from the in-bag data (bootstrap data).
This is done for each tree in the forest and an out-of-bag estimate of prediction
error is computed from the resulting predictor. The difference between this
value and the out-of-bag error without random permutation is the VIMP of
xv . Large positive values indicate xv is predictive (since noising up xv increases
prediction error), whereas zero or negative importance values identify variables
not predictive [2].
Often one finds in random forests that variables that tend to split close to
the root have a strong effect on prediction accuracy [12]. When such variables
are randomly permuted, each tree is perturbed substantially, so when out-of-
bag data is dropped down the tree, the resulting predictor has poor prediction.
Averaging over such predictors results in an ensemble with high error rate and a
large VIMP for the variable. Variables tending to split lower down in a tree, on
the other hand, have much less impact, because as one travels through a binary
tree the amount of data drops off exponentially fast. Consequently there are
much fewer observations affected and the noised up predictor does not suffer so
much.
H. Ishwaran/Variable importance in trees and forests 523

Our left-right noising up process shares with random forests the key feature
that VIMP is directly impacted by location of the primary split relative to the
root node. For this reason we believe our approach can provide insight into
VIMP for random forests.
However, the analogy is by no means perfect. A nice example of this, sug-
gested by one of the reviewers of the paper, is as follows. Recall that at each
node, while growing a tree, random forests randomly selects a subset of mtry
variables to split on. In a problem with a large number of non-informative vari-
ables, it is possible that early on in the tree growing process the mtry variables
for a node are all non-informative. Hence, the eventual split for the node is on
a non-informative variable, high up in the tree. Trees are, however, highly data
adaptive and can recover from scenarios like this. Splits further down the tree
can be on informative variables and the tree can still be predictive. Now consider
what happens when we drop a case down the tree using the left-right noising up
process. If the case passes through the node split on the non-informative vari-
able xv , then it follows a random left-right path through potentially predictive
splits. Since the node is high in the tree, the resulting random tree Tv suffers in
terms of prediction and xv can appear informative.
This type of scenario shows that a non-informative variable can appear in-
formative over a single tree under our noising up process. However, over a forest
such an effect will most likely be washed out. With many non-informative vari-
ables, the probability that xv splits high in a tree is small. Since the event has
low probability, the forest will contain few trees with high xv splits. Averaging
over trees pushes the VIMP of xv to zero. Moreover, for a single tree, this kind
of problem can be resolved by slightly modifying the noising up process. Rather
than using random left-right assignments on all nodes beneath xv , use random
assignments for only those nodes that split on xv . This will impact prediction
only when xv is informative and not affect prediction for non-informative vari-
ables. This modified process, in fact, was used in early attempts to study this
problem. Unfortunately, though, while it was our feeling this type of VIMP
could compete against the one used by random forests, it was still too complex
for the kind of detailed theoretical development we were after.
It is important to emphasize that while deficiencies like those mentioned
suggest our noising up process may not be practicable in all data examples, this
does not detract from theory developed under this mechanism. As we define
shortly, a key theoretical concept in our framework is the notion of a maximal
subtree. We believe this is key to understanding VIMP, even complicated values
like those used by random forests.

3.2. Subtrees

To discuss maximal subtrees, first note the predictor associated with the original
tree T can easily be written in closed form. Denote the predictor by µ̂(x). It
H. Ishwaran/Variable importance in trees and forests 524

follows that
M
X
µ̂(x) = am Bm (x).
m=1

Surprisingly, it is also possible to give an explicit representation for the predictor

of Tv . To do so, we introduce the notion of a subtree and a maximal subtree.

Definition 2. Call T̃v a v-subtree of T if the root node of T̃v has daughters
that depend upon an xv split. Call T̃v a maximal v-subtree if T̃v is not a subtree
of a larger v-subtree (see Figure 1). For example, the root node tree T is always
a maximal v-subtree for some v.

Suppose that T contains Kv distinct maximal v-subtrees, T̃1,v , . . . , T̃Kv ,v .

Let Mk,v be the set of terminal nodes for the subtree T̃k,v and define Mv =
SKv
k=1 Mk,v . Observe that Mk,v are disjoint sets and that Mv is the set of terminal
nodes of T that can only be reached by going through at least one node involving
a split on xv . The following key Lemma shows that the predictor for Tv can be
written as a deterministic component (conditioned on T ) depending on terminal
nodes not in Mv and a random component involving terminal nodes in Mv .

root

x1 ≤ c 1 x1 > c 1

x9 ≤ c 2 x9 > c 2

1 2
a1 a2 x5 > c 3
x5 ≤ c 3
8
a8

x3 ≤ c 4
x3 > c 4
x5 ≤ c 5 x5 > c 5
x3 ≤ c 6 x3 > c 6
3 4
a3 a4
x8 ≤ c 7 x8 > c 7 7
a7
5 6
a5 a6

Fig 1. Large box contains a maximal v-subtree, for v = 3. Smaller box containing tree with
nodes 5, 6 and 7 is also a v-subtree, but it is not maximal.
H. Ishwaran/Variable importance in trees and forests 525

Lemma 1. Let µ̂v (x) denote the predictor for Tv . Then,

X Kv
X
µ̂v (x) = am Bm (x) + ãk,v I{T (x) ∈ Mk,v }, (3.1)
m∈M
/ v k=1

where ãk,v is the random terminal value assigned by T̃k,v under a random left-
right path through T̃k,v . Write P̃k,v for the distribution of ãk,v . Note that P̃k,v
is conditional on the original tree T .

Proof. If T (x) ∈ / Mv , then x is assigned the same terminal node as in

T (x). This explains the first sum in the expression (3.1). On the other hand
if T (x) ∈ Mv , then x passes through a root node of a maximal subtree, say
T̃k,v . Upon entering this node, x follows a random path through T̃k,v until even-
tually terminating at a node m ∈ Mk,v . Thus, the terminal value for x is the
random value ãk,v assigned by a random left-right path through T̃k,v .

3.3. Prediction error

To formally discuss VIMP it is necessary to first define what we mean by pre-

diction error (PE). Let g be some loss function. The PE for µ̂ is defined as

P(µ̂) = E g(Y, µ̂(x)) .

The expectation on the right-hand side is joint over the original learning data
L as well as the independent test data (Y, x). It is assumed that Y can be
written as a regression model

Y = µ(x) + ε, (3.2)

where ε has zero mean and variance σ 2 > 0 and ε is assumed independent of
µ(x). In a similar way one can define PE for µ̂v :

P(µ̂v ) = E g(Y, µ̂v (x)) .

Note the expection on the right-hand side involves an extra level of integration
over ãk,v via its distribution P̃k,v .
We now define the VIMP of xv . In this paper we confine attention to the
case when g(Y, µ̂) = (Y − µ̂)2 , corresponding to L2 -loss. The VIMP value for xv
under L2 -loss is defined as

∆v = P(µ̂v ) − P(µ̂).

Using standard arguments it is easy to show that under (3.2),

P(µ̂) = σ 2 + E(µ(x) − µ̂(x))2 . (3.3)

H. Ishwaran/Variable importance in trees and forests 526

Now observe by Lemma 1 that µ̂v (x) can be rewritten as

Kv
X X
µ̂v (x) = µ̂(x) + (ãk,v − am )Bm (x).
k=1 m∈Mk,v

Using a similar expansion for P(µ̂v ) as (3.3), deduce that the VIMP for xv is
n o
∆v = E Rv (x)2 − 2E Rv (x) µ(x) − µ̂(x)

where
Kv
X X
Rv (x) = (ãk,v − am )Bm (x).
k=1 m∈Mk,v

Clearly ∆v depends heavily upon the relationship between am and ãk,v . For
example, ∆v = 0 if ãk,v = am for each m ∈ Mk,v and each k. In this case, all
terminal values within a maximal subtree are equal and randomly assigning a
terminal value cannot perturb matters. Hence, xv is non-informative as further
clustering of the data is ineffective. In general, however, noising up xv will
perturb ãm,v relative to am and result in a positive or negative value for ∆v .
Trying to detangle the resulting properties of ∆v in such settings requires a
theoretical framework which we discuss next.

4. A theoretical framework for VIMP

In order to study ∆v systematically, it will be necessary to make a simplifying

assumption about the true signal µ(x). In particular, it will be helpful if we
assume that µ̂(x) is a good approximation to µ(x). We shall assume the true
signal has a similar topology to T :
M
X
µ(x) = am,0 Bm (x), (4.1)
m=1

where am,0 are the true, but unknown, terminal values. Observe that µ(x) is
random because Bm (x) are basis functions determined from the learning data.
Nevertheless, we shall still think of µ(x) as our “population parameter” because
the terminal values for µ(x) are fixed and do not depend upon the data. Note
importantly that (3.3) still holds even though µ(x) depends upon the data.
Remark 1. Assumption (4.1) might seem simplistic at first glance, but we’ll
show later in Section 5 that it is quite realistic when put in the context of forests.
By exploiting the orthogonality property (2.1) of the tree basis functions, we
can greatly simplify the expression for ∆v under assumption (4.1). Consider the
following theorem.
H. Ishwaran/Variable importance in trees and forests 527

Theorem 1. Assume the true model is (3.2) and that (4.1) holds (for any given
L ). Then,
(K )
v h
X X
2
i
∆v = E πm sk,v + ak,v − am ak,v + am − 2am,0 , (4.2)
k=1 m∈Mk,v

where πm = P{Bm (x) = 1|L } and s2k,v and ak,v denote the variance and mean
for ãk,v under P̃k,v .

Proof. Using orthogonality (2.1) and assumption (4.1), collect together com-
mon expressions to deduce that
(K )
X v X
2

∆v = E rk,m − 2rk,m rm Bm (x) ,
k=1 m∈Mk,v

where rk,m = ãk,v − am and rm = am,0 − am . The expectation sitting outside

the sum can be partially simplified by integrating over P̃k,v while conditioning
on L and the test data x. A little bit of re-arrangement yields
(K )
v X h
X
2
i
∆v = E sk,v + ak,v − am ak,v + am − 2am,0 Bm (x) .
k=1 m∈Mk,v

Continuing to keep L fixed, bring the expectation over x inside the sum. Only
the term Bm (x) depends upon x, from which (4.2) follows.

Theorem 1 has an interesting implication. Let P̃k,v,0 be the distribution for

ãk,v,0 : the random terminal node assigned from a random path through the
maximal subtree T̃k,v with terminal nodes replaced by {am,0 : m ∈ Mk,v }. Call
the quantity
X
θ0 (k, v) = πm P̃k,v,0 (ãk,v,0 − am,0 )2 , for k = 1, . . . , Kv , (4.3)
m∈Mk,v

the node mean squared error for the subtree T̃k,v . An immediate corollary to
Theorem 1 is that the closer µ̂(x) is to the true signal, the greater the effect
node mean squared error has on VIMP.
Corollary 1. If am → am,0 , as one might hope for if sample sizes converge to
∞, (K )
X v

∆v → E θ0 (k, v) .
k=1

Note because θ0 (k, v) ≥ 0, all VIMP values are non-negative in the limiting case.
H. Ishwaran/Variable importance in trees and forests 528

Corollary 1 shows in the limit that each maximal v-subtree contributes equally
to ∆v through its node mean squared error. Furthermore, the node mean squared
error for a subtree T̃k,v is a weighted average with some nodes contributing sig-
nificantly more to overall mean squared error than others. From (4.3), we know
that θ0 (k, v) is a weighted average involving weights πm . The value πm is the
probability that a fresh test x has node membership m under T . If many splits
are needed to reach m, or equivalently if Bm (x) involves a large number of basis
functions, then πm is likely to be small. Hence, nodes far down T̃k,v contribute
less to θ0 (k, v) then nodes closer to its root node. Moreover, this effect is further
amplified due to P̃k,v . Under random splitting, if Lm splits are needed to reach
a node m from the root of T̃k,v , then P̃k,v assigns mass 2−Lm to am . So again,
the closer a node is to the root of the subtree, the bigger its impact on node
mean squared error and ∆v . This is consistent with what was said earlier in
Section 3 regarding impact node position has on VIMP.

5. Forests

In this section we extend our theory to forests. We begin with a formal definition.

Definition 3. Let C = {T (x; b) : b = 1, . . . , B} be a collection of 1 ≤ B < ∞

binary trees grown from L . The forest µ̂F , for the collection of trees C , is the
ensemble predictor defined by
B
1 X
µ̂F (x) = µ̂(x; b),
B
b=1

where µ̂(x; b) denotes the predictor for T (x; b).

Note unlike trees, forests do not assign node membership values. Forests
are simply predictors, and they are much better at predicting than individual
trees. In fact, under mild conditions, one can show that the closure of linear
combination of binary tree predictors form a complete space. This leads to the
following theorem (c.f. Proposition 2 of [13]).
Theorem 2. Assume x has distribution P0 that is absolutely continuous with
respect to Lebesgue measure with support Sd on a finite closed rectangle in Rd .
Then for each real valued function µ ∈ L2 (P0 ), and each δ > 0, there exists a
forest µ̂F constructed from binary trees with M > d terminal nodes such that
Z
(µ̂F (x) − µ(x))2 P0 (dx) ≤ δ.
Sd

Proof. The L2 (P0 )-closure of finite linear combinations of indicator functions

of rectangles over Sd equals L2 (P0 ). Therefore, if we can show that any indica-
tor function for a rectangle on Sd can be described by a predictor µ̂(x) for a
H. Ishwaran/Variable importance in trees and forests 529

binary tree T with M > d nodes, the result follows because the closure of finite
linear combinations of such predictors is L2 (P0 ). It suffices to consider indicator
functions of the form
I(x) = I{x1 ≤ c1 , x2 ≤ c2 , . . . , xd ≤ cd }.
To construct T , first split x1 at c1 . This yields a left daughter node x1 ≤ c1 .
Now split the left daughter at x2 ≤ c2 and it’s left daughter at x3 ≤ c3 and
so forth up to xd ≤ cd . Label the resulting terminal node m = 1 and denote
its basis function by B1 (x). Observe that d splits are needed to create B1 (x)
and therefore T has M = d + 1 terminal nodes. Let a1 = 1 and set am = 0 for
m > 1. Then µ̂(x) = I(x).

5.1. VIMP for forests

Suppose that T (x; b), b = 1, . . . , B, are distinct binary trees grown from the
learning data L . For example, each tree could be grown as in random forests
but without bootstrapping the data. Extending our previous notation in an
obvious way, the forest predictor for the collection of trees is defined to be
(b)
B M
1 X X (b)
µ̂F (x) = am Bm (x; b).
B m=1b=1

In light of Theorem 2, it seems quite reasonable to assume that µ̂F (x) is a

good approximate to the true signal so long as M (b) > d. Thus if each tree in
the forest is deep enough that they have at least d + 1 terminal nodes, we can
assume, analogous to (4.1), that
(b)
B M
1 X X (b)
µ(x) = am,0 Bm (x; b). (5.1)
B m=1
b=1

Noising up a variable xv in a forest is straightforward. One simply noises

up xv for each tree as before. Then one computes the forest predictor, µ̂Fv (x),
by averaging over the corresponding perturbed predictors µ̂v (x; b). Extending
notation in an obvious manner, by Lemma 1 we have
B
( Kv(b) )
1 X X
(b)
X (b) n (b)
o
µ̂Fv (x) = am Bm (x; b) + ãk,v I T (x; b) ∈ Mk,v .
B
b=1 m∈M
/ v
(b) k=1

Now using a similar argument as in Section 3.3, the VIMP for xv is:
n o
∆Fv = E RFv (x)2 − 2E RFv (x) µ(x) − µ̂F (x) ,

where
(b)
B Kv
1 XX X (b)
RFv (x) = ãk,v − a(b)
m Bm (x; b).
B
b=1 k=1 m∈M (b)
k,v

From this we arrive at the following result.

H. Ishwaran/Variable importance in trees and forests 530

(b)
Theorem 3. Let RFv ,0 (x) be the function RFv (x) in which am is replaced with
(b) (b) (b)
the value am,0 . Assume (3.2) and (5.1) holds. If am → am,0 for each m and b,
then
(b)
B K
( )
v
1 X X
∆Fv → E RFv ,0 (x)2 ≤ E

θ0 (k, v, b) , (5.2)
B
b=1 k=1
where 2
(b) (b) (b)
X
θ0 (k, v, b) = πm,b P̃k,v,0 ãk,v,0 − am,0
(b)
m∈Mk,v

is the node mean squared error for the kth maximal v-subtree of T (x; b) and
πm,b = P{Bm (x; b) = 1|L }.

Proof. The limit follows automatically. The bound on the right-hand side
of (5.2) is due to Jensen’s inequality and orthogonality (2.1).

5.2. Implications for random forests

The bound in (5.2) becomes tighter as the trees in the forest become orthogonal
to one another. Moreover, a key assumption in Theorem (3) is that (5.1) holds,
which by Theorem 2 is reasonable so as long as the individual trees in the forest
are rich enough. Thus, a prescription for VIMP to be positive and fully charac-
terized by subtree node mean squared error is that the forest should be derived
from trees rich enough to approximate µ(x) while simultaneously being nearly
orthogonal to one another. This is quite interesting, because this is the same
prescription needed to achieve low generalization errors in random forests (see
Theorem 11.2 of [2]). So the right-hand side of (5.2) is a very reasonable starting
point for understanding VIMP in random forests. In fact, in Section 6.1, we will
show that node mean squared error terms like those appearing in (5.2) play a
crucial role in understanding paired associations between variables in forests.

6. Paired importance values and associations

For the moment let us return to a study of single trees. We consider the joint
VIMP of a pair of variables (xv , xw ) in a binary tree T . Let t = (v, w) and
write xt to indicate (xv , xw ). The paired VIMP for xt is the difference between
prediction error when xt is noised up versus the prediction error otherwise.
Noising up a pair of variables is defined as follows. For a case x, drop x down
T , and if x enters the root node of a maximal v-subtree T̃k,v , or a maximal
w-subtree T̃k,w , then initiate a random left-right path until a terminal node is
reached. This will assign x a random terminal value ãk,t with distribution P̃k,t .
Similar to our perturbation for a single variable, one can think of the above
randomization process as defining a random tree. We refer to this tree as Tt .
Let
T̃1,t , . . . , T̃Kt ,t
H. Ishwaran/Variable importance in trees and forests 531

be the set of maximal v and w-subtrees of T . Observe that in some cases a

maximal v-subtree can be a w-subtree and a maximal w-subtree can be a v-
subtree. Thus, {T̃k,t } are a collection of not necessarily disjoint trees. To avoid
overlap, we let
{T̃ik ,v , k = 1, . . . , Kv∗ } and {T̃il ,w , l = 1, . . . , Kw
∗
} (6.1)
denote the set of distinct maximal v and w-subtrees of T . The list (6.1) is
constructed such that in the case where a maximal v-subtree and w-subtree
intersect, the larger of the two subtrees is used. In particular, note that Kv∗ ≤ Kv
∗
and Kw ≤ Kw where Kv and Kw are the number of distinct maximal v-subtrees
and w-subtrees, respectively. When Kv∗ = Kv and Kw ∗
= Kw , then the set of
v and w-maximal subtrees are disjoint and v and w are orthogonal variables.
Otherwise, v and w are associated. We will say more about this shortly.
Similar to Lemma 1, one can show that µ̂t , the predictor for Tt , must be of
the form:
X X
µ̂t (x) = am Bm (x) + ãk,v I{T (x) ∈ Mk,v }
m∈M
/ t k∈K (v)
X
+ ãl,w I{T (x) ∈ Ml,w }.
l∈K (w)

Here K (v) and K (w) are the set of indices {ik : k = 1, . . . , Kv∗ } and {il : l =
∗
1, . . . , Kw }, and Mt is the set of all terminal nodes for (6.1). Note importantly
that ãk,v and ãl,v have distribution P̃k,v and P̃l,w .
Define the paired VIMP for xt as
∆t = P(µ̂t ) − P(µ̂).
We compare ∆t to a composite additive effect of the individual variables v and
w. This will allow us to identify pairs of variables whose “joint effect” differs
from the sum of their individual components. Thus allowing identification of in-
teresting paired assocations (for another interesting way to identify interactions
see [14]).
The marginal importance effect for v and w under an additive effect paradigm
is defined to be
∆v+w = ∆v + ∆w .
The difference between the two sets of importance values is what we call At ,
the measure of association between xv and xw :
At = ∆t − ∆v+w = ∆t − (∆v + ∆w ). (6.2)
Using similar arguments as in Theorem 1, we obtain the following characteriza-
tion of At .
Theorem 4. If am → am,0 , then under the conditions of Theorem 1:
( )
X X
At → − E θ0 (k, v) + θ0 (l, w) .
k∈K
/ (v) l∈K
/ (w)
H. Ishwaran/Variable importance in trees and forests 532

Theorem 4 shows in the limit that At can never be positive. The reason for
this is that ∆v and ∆w overcount maximal subtrees since a maximal v-subtree
can be a subset of a maximal w-subtree and vice-versa. Consequently, ∆v + ∆w
can never be smaller than ∆t . If v and w are orthogonal, then there is no
overlap between maximal v and w-subtrees and no overcounting occurs. Hence,
the above sums are null and At = 0 in the limit. On the other hand, if v and w
are not orthogonal, then At has a negative limit. The more negative, the more
overlap there is between v and w-subtrees, and the stronger the indication of
an association between the two variables.

6.1. Paired association values for forests

Association between variables is a little more delicate when it comes to forests.

We show that association values can not only be negative, but they can also be
positive in forests.
Extending notation in an obvious manner, the forest predictor under a noised
up xt is defined as
B
(
1 X X X (b)
n
(b)
o
µ̂Ft (x) = a(b)
m Bm (x; b) + ã k,v I T (x; b) ∈ M k,v
B
b=1 (b)
m∈M
/ t k∈K (v,b)
)
n o
(b) (b)
X
+ ãl,w I T (x; b) ∈ Ml,v .
l∈K (w,b)

Moreover, it follows straightforwardly from our previous arguments that the

VIMP for xt is
n o
∆Ft = E RFt (x)2 − 2E RFt (x) µ(x) − µ̂F (x) ,

where
B
(
1 X X X (b)

RFt (x) = ãk,v − a(b)
m Bm (x; b)
B
b=1 k∈K (v,b) m∈M (b)
k,v
)
X X (b)

+ ãl,w − a(b)
m Bm (x; b) .
l∈K (w,b) (b)
m∈Ml,w

From this we obtain the following characterization.

(b)
Theorem 5. Let RFt ,0 (x) be the function RFt (x) in which am is replaced with
(b)
the value am,0 . Likewise, let RFv ,0 (x) and RFw ,0 (x) be defined as in Theorem 3.
(b) (b)
Assume (3.2) and (5.1) holds. If am → am,0 for each m and b, then
At = ∆Ft − (∆Fv + ∆Fw )

→ E RFt ,0 (x)2 − E RFv ,0 (x)2 + E RFw ,0 (x)2 .

(6.3)
H. Ishwaran/Variable importance in trees and forests 533

Theorem 5 implies a delicate relationship for At . This involves the way v and
w interact within a given tree as well as how the variables interact across trees
in the forest. Consider, first, the extreme case when all basis functions in RFt
are mutually orthogonal. This implies a forest where v and w-maximal subtrees
are disjoint across trees. Then,
  
1 X B X X 
At → − E  θ0 (k, v, b) + θ0 (l, w, b) ,
B 
b=1 k∈K
/ (v,b) l∈K
/ (w,b)

where θ0 (k, v, b) and θ0 (l, w, b) are node mean squared errors as defined in Theo-
rem 3. Note that the limit on the right-hand side is similar to that in Theorem 4.
The only difference here is that we are averaging node mean squared error terms
over trees. Just as in Theorem 4 the limit of At cannot be positive. Also, the size
of At depends upon the behavior of v and w within a tree. For example, if v and
w are orthogonal, such that for any given tree all maximal v and w-subtrees are
disjoint, then At = 0 in the limit. In contrast, the more overlap between v and
w-subtrees within a given tree, the more maximal subtrees are overcounted, and
the more negative At becomes. So just like single trees, large negative values
indicate a strong association.
On the other hand, if trees are correlated, then (6.3) can be positive. One
example is as follows. Suppose that v and w are highly correlated and both
are influential. Then for any given tree both variables are strong competitors to
grow the tree. If one variable is selected early on in the tree growing process,
this might preclude the other variable from being selected. Things might be so
bad that we could have 50% of the trees grown using v and 50% grown using w.
So noising up v or w individually will only affect 50% of the trees. However, a
joint noising up of variables perturbs all trees. Consequently, prediction will be
substantially worse and we would expect At to be positive. The next example
is a good illustration of this.

7. Air pollution

For our first illustration we use the well known “air pollution” data set [15]. The
variables are daily readings of various air quality values measured from May 1,
1973 to September 30, 1973 in the New York metropolitan area. Solar radiation
in Langleys (Solar), average wind speed in miles per hour (Wind), maximum
daily temperature in Fahrenheit (Temp), month of the year (Month) and day of
month (Day) were all recorded. The outcome is the mean ozone value (Ozone),
which was transformed by taking its cube-root.
We analyzed the data using random forests. Computations were carried out
using the “randomForest” R-package [16]. In regression settings, randomForest
computes VIMP by subtracting an out-of-bag mean squared error rate under
a random permutation of a variable xv from an out-of-bag mean squared error
rate under the original xv . As we have discussed already, these values resemble
∆v , but are not exactly the same.
H. Ishwaran/Variable importance in trees and forests 534

Since randomForest does not compute paired importance values we adopted

the following strategy for (6.2). To compute an analog of ∆t , we randomly
selected 63% of the data to act as training data and set the remaining 37%
aside as test data (these values represent the average in-bag and out-of-bag
sample size). A forest with 1000 trees was grown over the training data using an
mtry value of 3 (see [16] for details about options for randomForest). Then over
the test data we randomly permuted xv and xw and created a noised up test
data containing the permuted xv and xw variables. Using the forest predictor
from the training data we computed the mean squared error over the modified
test data and then we computed the mean squared error over the unmodified
(original) test data. We then took the difference between these two values. This
was repeated 1000 times independently. The average over the 1000 replicates is
our proxy for ∆t . Finally, in order to be consistent, we computed proxies for
∆v and ∆w in exactly the same fashion. That is, we created a test data with a
permuted xv and compared its test mean squared error to the test mean squared
error for the original test data. The same was done for xw .
Table 1 presents our results. All results are averaged over replicates. The first
column, “Paired”, records the paired VIMP for each possible pairwise compari-
son; the second column lists the additive effects for a given pair of variables (the
sum of their individual importance values); the third column lists the associa-
tion value (the difference between the paired VIMP and the additive effects);
and the last column is the association value divided by the out-of-bag mean
squared error for the training data multiplied by 100. This is a standardized
value for association expressed in percentages.

Table 1
Paired association values using random forests for air pollution data. Values reported are
averaged over 1000 independent replicates with each forest based on 1000 trees.
Paired Additive Association Assoc/MSE
Temp:Wind 0.706 0.600 0.106 11.351
Temp:Solar 0.590 0.529 0.061 6.532
Temp:Month 0.464 0.455 0.008 0.904
Temp:Day 0.465 0.462 0.003 0.298
Wind:Solar 0.232 0.214 0.017 2.465
Wind:Month 0.142 0.141 0.001 0.137
Wind:Day 0.150 0.148 0.002 0.328
Solar:Month 0.074 0.073 0.000 0.041
Solar:Day 0.084 0.082 0.002 0.486
Month:Day 0.006 0.006 0.000 0.046

Table 1 shows that the largest association value is between the variables Temp
and Wind. Also interesting, is the second largest association value occurring for
Solar and Temp. Solar radiation and temperature are known to be related,
and the fairly large association value in Table 1 confirms this relationship. The
next largest association value is between Wind and Solar, but this value is
substantially smaller (as can be seen by the standardized value in the 4th column
of the table). After this, association values drop off dramatically.
These results are based on random forests, whereas Theorem 5 applies to the
H. Ishwaran/Variable importance in trees and forests 535

Given : Temp
60 70 80 90

0 100 250 0 100 250

20
5
●
● ● ●
● ●
● ●● ●
● ●● ● ● ●● ●●
● ● ● ● ● ● ● ● ● ● ●
● ● ●

3
● ● ● ● ● ●
● ● ● ●●●●●
● ● ● ● ● ●●
● ● ●● ●
● ●
●● ●● ●● ● ● ● ● ● ●● ●● ● ● ● ●
●●● ● ● ●●● ● ● ● ● ●
●
●

1
5

15
●
●
● ● ● ● ● ● ●

Given : Wind
● ●
● ● ●● ●
● ● ● ●
●●
● ●● ● ●● ● ●
●● ● ● ● ● ●
3

● ● ●
● ● ●●● ● ●
● ● ●●●●●
● ● ● ●
● ●● ● ●● ● ● ●● ● ●● ● ●
Ozone

●●●
● ● ●
●
1

5
● ●●
●

10
● ● ●● ● ●
● ● ● ●
● ●
● ●●
● ● ● ● ● ● ● ● ● ●●● ● ●
● ●● ●
● ● ●● ● ● ● ● ●● ● ● ●

3
● ● ● ●
● ● ● ●● ● ● ● ● ●● ● ●● ● ● ● ● ●
●● ● ● ●● ● ● ● ● ● ● ●
● ● ● ● ● ●
● ●
●

1
●
● ● ●
● ●
5

● ●
● ●●● ●
● ● ●●
●●
● ● ●●
● ● ●● ●
●
● ●
● ● ● ●
● ●

5
● ● ●
● ● ● ●●● ● ● ●● ●
● ● ● ● ● ● ● ●
● ● ●
3

● ● ● ●
● ● ● ● ● ● ● ● ● ● ●
● ● ● ● ● ● ● ● ●
●
1

0 100 250 0 100 250

Solar

Fig 2. Coplot of ozone values against solar radiation given wind and temperature.

VIMP under a special type of noising up process for forests. Thus, it is inter-
esting to see how well the theory extrapolates. To consider this, we generated a
conditional plot (coplot) of ozone values against solar radiation, conditioning on
wind and temperature. See Figure 2. As seen, ozone increase with solar radia-
tion, and the slope of the curves increase as temperature increases for any fixed
range of wind speeds. This clearly demonstrates an association between solar
radiation and temperature. Also, looking at ozone and radiation with respect
to wind, the slope increases as wind increases at high temperatures, but this
pattern does not apply at lower temperature values. This is clear evidence of a
Temp:Wind association. These results are consistent with Table 1.

8. Simulation

A more direct confirmation of the theory can be illustrated using simulated

data. For this example, data was simulated from the model

Y = 30 sin(πx1 x2 ) + 20(x3 − 0.5)2 + 20x1 x4 + 5x5 + ε,

where ε was distributed independently from a standard normal distribution. All

variables xv were drawn from a uniform distribution on [0, 1]. This included a
6th variable, x6 , representing noise, which was also part of the x-design matrix.
H. Ishwaran/Variable importance in trees and forests 536

A sample size of n = 100 was drawn. The data was then analyzed using the
randomForest package exactly as outlined in the previous section. All settings
were kept the same. The simulation was repeated 100 times independently. The
averaged values over the 100 simulations are reported in Table 2. Variables 1-6
are coded as a-f respectively in Table 2.
Table 2
Paired association values using random forests for simulated data. Analysis carried out in
the same manner as Table 1.
Paired Additive Association Assoc/MSE
a:b 185.216 192.870 −7.654 −4.231
a:c 132.347 132.426 −0.079 −0.044
a:d 140.472 141.906 −1.434 −0.849
a:e 133.518 133.530 −0.011 −0.003
a:f 131.552 131.639 −0.088 −0.055
b:c 61.692 61.736 −0.044 −0.033
b:d 71.350 71.200 0.150 0.110
b:e 62.822 62.857 −0.035 −0.023
b:f 60.990 60.965 0.025 0.021
c:d 10.888 10.870 0.018 0.014
c:e 2.500 2.474 0.026 0.026
c:f 0.599 0.589 0.009 0.007
d:e 11.948 11.934 0.015 0.019
d:f 10.022 10.040 −0.019 −0.022
e:f 1.713 1.715 −0.002 −0.011

By far, the two largest standarized association values are a:b and a:d, rep-
resening the interactions between variable x1 and x2 and x1 and x4 (see the
4th column in Table 2). Note how these values are negative, unlike in our pre-
vious analysis where all significant associations were positive. Large negative
values like these, as our theory suggests, is due to high overlap between v and
w-subtrees. For example, for x1 and x2 , their interaction must be such that
whenever x1 splits a node in the tree, there is a high likelihood that x2 is split
underneath this node, and vice-versa.

References

[1] Breiman L., Friedman J.H., Olshen R.A., and Stone C.J. Classification and
Regression Trees. Wadsworth, Belmont, California, 1984. MR0726392
[2] Breiman L. Random forests. Machine Learning, 45:5–32, 2001.
[3] Friedman J. Greedy function approximation: a gradient boosting machine.
Ann. Statist., 29, 1189–1232, 2001. MR1873328
[4] Liaw A. and Wiener M. Classification and regression by randomForest. R
News, 2:18–22, 2002.
[5] Ishwaran H. and Kogalur U.B. Random survival forests for R. To appear
in R News, 2007.
[6] Ishwaran H., Kogalur U.B, Blackstone E.H. and Lauer, M.S. Random
survival forests. Cleveland Clinic Technical Report, 2007.
H. Ishwaran/Variable importance in trees and forests 537

[7] Breiman L. Statistical modeling: the two cultures. Stat. Science, 16:199–
231, 2001. MR1874152
[8] Lunetta K.L., Hayward L.B., Segal J. and Eerdewegh P.V. Screening large-
scale association study data: exploiting interactions using random forests.
BMC Genetics, 5:32, 2004.
[9] Bureau A., Dupuis J., Falls K., Lunetta K.L., Hayward B., Keith T.P.,
Eerdewegh P.V. Identifying SNPs predictive of phenotype using random
forests. Genetic Epidemiology, 28:171-182, 2005.
[10] Diaz-Uriarte R. and Alvarez de Andres S. Gene selection and classification
of microarray data using random forest. BMC Bioinformatics, 7:3, 2006.
[11] van der Laan M. Statistical inference for variable importance. International
J. Biosatist., 2:1008, 2006. MR2275897
[12] Strobl C., Boulesteix A.L., Zeileis A., and Hothorn T. Bias in random
forest variable importance measures: Illustrations, sources and a solution.
BMC Bioinformatics, 8:25, 2007.
[13] Breiman L. Population theory for boosting ensembles. Ann. Stat., 32:1–11,
2004. MR2050998
[14] Ng V.W. and Breiman L. Bivariate variable selection for classification
problem. Technical report 692: Berkeley Statistics Department, 2005.
[15] Chambers J. M., Cleveland W. S., Kleiner B., and Tukey P. A. Graphical
Methods for Data Analysis. Wadsworth, Belmont, California, 1983.
[16] Liaw A. and Wiener M. randomForest 4.5-18.
[Link] 2007.

Understanding Random Forest Techniques
No ratings yet
Understanding Random Forest Techniques
39 pages
Understanding Random Forests in Machine Learning
No ratings yet
Understanding Random Forests in Machine Learning
12 pages
Understanding Random Forests in Machine Learning
No ratings yet
Understanding Random Forests in Machine Learning
12 pages
A Complete View of Decision Trees and SVM in Machine Learning - by Hailey Huong Nguyen - TDS Archive - Medium
No ratings yet
A Complete View of Decision Trees and SVM in Machine Learning - by Hailey Huong Nguyen - TDS Archive - Medium
10 pages
WIREs Data Min Knowl - 2011 - Loh - Classification and Regression Trees
No ratings yet
WIREs Data Min Knowl - 2011 - Loh - Classification and Regression Trees
10 pages
Overview of Classification and Regression Trees
No ratings yet
Overview of Classification and Regression Trees
17 pages
CART and Ensemble Methods in Statistics
No ratings yet
CART and Ensemble Methods in Statistics
50 pages
CART Methodology: Bias and Adaptivity Analysis
No ratings yet
CART Methodology: Bias and Adaptivity Analysis
58 pages
Understanding Random Forest Techniques
100% (2)
Understanding Random Forest Techniques
83 pages
Ensemble Trees: Boosting & Random Forests
No ratings yet
Ensemble Trees: Boosting & Random Forests
1 page
Conditional Inference Trees in R
No ratings yet
Conditional Inference Trees in R
34 pages
Data Science - Decision Tree - Random Forest
No ratings yet
Data Science - Decision Tree - Random Forest
15 pages
Classification and Regression Tree Methods
No ratings yet
Classification and Regression Tree Methods
13 pages
Understanding Decision Trees and CHAID
100% (1)
Understanding Decision Trees and CHAID
19 pages
Tree-Based Methods for Regression & Classification
No ratings yet
Tree-Based Methods for Regression & Classification
36 pages
Tree-Based Machine Learning Overview
No ratings yet
Tree-Based Machine Learning Overview
28 pages
Yield Modeling Rule Ensembles PDF ASMC2007
No ratings yet
Yield Modeling Rule Ensembles PDF ASMC2007
6 pages
Decision Trees for Predictive Modeling
100% (1)
Decision Trees for Predictive Modeling
24 pages
Decision Trees and Ensemble Methods
No ratings yet
Decision Trees and Ensemble Methods
69 pages
Decision Trees and Random Forests Explained
No ratings yet
Decision Trees and Random Forests Explained
60 pages
Classification and Regression Trees Explained
No ratings yet
Classification and Regression Trees Explained
42 pages
Tree-Based Methods for Regression & Classification
No ratings yet
Tree-Based Methods for Regression & Classification
33 pages
Tree-Based Machine Learning Methods
No ratings yet
Tree-Based Machine Learning Methods
64 pages
Practical Guide to Conformal Prediction
No ratings yet
Practical Guide to Conformal Prediction
27 pages
Decision Trees and Random Forests Explained
No ratings yet
Decision Trees and Random Forests Explained
34 pages
Introduction to CART Algorithm
No ratings yet
Introduction to CART Algorithm
65 pages
Tree-Based Methods in Machine Learning
No ratings yet
Tree-Based Methods in Machine Learning
21 pages
Importance Metrics in Machine Learning
No ratings yet
Importance Metrics in Machine Learning
35 pages
Understanding Decision Trees in Classification
100% (2)
Understanding Decision Trees in Classification
16 pages
Machine Learning: Practical Tutorial On Random Forest and Parameter Tuning in R
No ratings yet
Machine Learning: Practical Tutorial On Random Forest and Parameter Tuning in R
11 pages
Decision Tree Analysis and Algorithms
No ratings yet
Decision Tree Analysis and Algorithms
21 pages
Born Again Trees: Enhanced Predictors
No ratings yet
Born Again Trees: Enhanced Predictors
13 pages
Random Forest for Imbalanced Datasets
No ratings yet
Random Forest for Imbalanced Datasets
5 pages
Understanding Decision Tree Learning
No ratings yet
Understanding Decision Tree Learning
15 pages
Decision Tree Algorithms Overview
No ratings yet
Decision Tree Algorithms Overview
106 pages
Decision Tree
No ratings yet
Decision Tree
10 pages
Support Vector Regression Explained
No ratings yet
Support Vector Regression Explained
8 pages
Decision Trees in Machine Learning Guide
No ratings yet
Decision Trees in Machine Learning Guide
8 pages
Class 6 Decision Trees Slides
No ratings yet
Class 6 Decision Trees Slides
41 pages
Understanding Decision Tree Algorithm
No ratings yet
Understanding Decision Tree Algorithm
9 pages
Understanding Random Forests in ML
No ratings yet
Understanding Random Forests in ML
10 pages
Understanding Decision Trees in ML
No ratings yet
Understanding Decision Trees in ML
82 pages
Introduction to Random Forests in R
No ratings yet
Introduction to Random Forests in R
5 pages
Please Stop Permuting Features An Explanation and Alternatives
No ratings yet
Please Stop Permuting Features An Explanation and Alternatives
15 pages
Understanding Decision Trees in Analytics
No ratings yet
Understanding Decision Trees in Analytics
9 pages
Explainable Bayesian Decision Trees
No ratings yet
Explainable Bayesian Decision Trees
15 pages
Logworth in Decision Tree Splits
100% (1)
Logworth in Decision Tree Splits
26 pages
Lecture 7
No ratings yet
Lecture 7
44 pages
Algoritmos de Árboles de Regresión
No ratings yet
Algoritmos de Árboles de Regresión
34 pages
Unit 4
No ratings yet
Unit 4
8 pages
Understanding Decision Trees in AI
No ratings yet
Understanding Decision Trees in AI
37 pages
Overview of Tree-Based Models
No ratings yet
Overview of Tree-Based Models
21 pages
Random Survival Forests Overview
No ratings yet
Random Survival Forests Overview
18 pages
Mod 4 Notes
No ratings yet
Mod 4 Notes
17 pages
New Variable Importance for Random Forests
No ratings yet
New Variable Importance for Random Forests
14 pages
Random Forest Overview and Insights
No ratings yet
Random Forest Overview and Insights
4 pages
Decision Trees and Ensemble Methods
No ratings yet
Decision Trees and Ensemble Methods
2 pages
Pricing Strategies Impact on MSMEs Performance
No ratings yet
Pricing Strategies Impact on MSMEs Performance
17 pages
Jurnal - Estimation of C Faktor For Soil Erotion Modelling Using NDVI in Buyukcekmecee Watershed
No ratings yet
Jurnal - Estimation of C Faktor For Soil Erotion Modelling Using NDVI in Buyukcekmecee Watershed
9 pages
Simple Linear Regression Analysis
100% (5)
Simple Linear Regression Analysis
62 pages
Advances in Regression Discontinuity Design
No ratings yet
Advances in Regression Discontinuity Design
56 pages
Proxy Means Test Methodology Exposition
No ratings yet
Proxy Means Test Methodology Exposition
14 pages
EC507 Statistics for Economics Overview
No ratings yet
EC507 Statistics for Economics Overview
3 pages
Business Statistics Exam Questions
No ratings yet
Business Statistics Exam Questions
4 pages
Stock Watson 2010 Vector Autoregressions
No ratings yet
Stock Watson 2010 Vector Autoregressions
49 pages
AI, ML, and Data Science Explained
No ratings yet
AI, ML, and Data Science Explained
17 pages
Data Science Regression Techniques Guide
No ratings yet
Data Science Regression Techniques Guide
17 pages
Permeability Determination From Well Log Data
No ratings yet
Permeability Determination From Well Log Data
10 pages
U.S. Business Sales and Inventories Analysis
No ratings yet
U.S. Business Sales and Inventories Analysis
23 pages
Class 12 Report
No ratings yet
Class 12 Report
10 pages
Analyzing Residuals in Regression
No ratings yet
Analyzing Residuals in Regression
59 pages
Regulating Cable Monopoly Pricing
100% (1)
Regulating Cable Monopoly Pricing
178 pages
A Comparison of Forecasting Methods For Hotel Revenue Management
No ratings yet
A Comparison of Forecasting Methods For Hotel Revenue Management
20 pages
ECO 422: Applied Econometrics Outline
No ratings yet
ECO 422: Applied Econometrics Outline
9 pages
Understanding Inferential Statistics
No ratings yet
Understanding Inferential Statistics
22 pages
P/B-ROE Valuation Model Analysis
No ratings yet
P/B-ROE Valuation Model Analysis
30 pages
Consumer Promotions in Bangladesh's Detergent Market
83% (6)
Consumer Promotions in Bangladesh's Detergent Market
60 pages
Exchange Rate Forecasting Techniques
No ratings yet
Exchange Rate Forecasting Techniques
20 pages
Reporting Quality of MPH Dissertations in China
No ratings yet
Reporting Quality of MPH Dissertations in China
12 pages
Understanding Logistic Regression
No ratings yet
Understanding Logistic Regression
32 pages
INN-Hotel Business Report
No ratings yet
INN-Hotel Business Report
63 pages
Efektivitas Sistem Informasi Akuntansi
No ratings yet
Efektivitas Sistem Informasi Akuntansi
10 pages
Financial Literacy and Savings in Malaysia
No ratings yet
Financial Literacy and Savings in Malaysia
9 pages
ZIMSEC A Level Statistics Syllabus 2024-2030
No ratings yet
ZIMSEC A Level Statistics Syllabus 2024-2030
32 pages
Understanding Measures of Correlation
No ratings yet
Understanding Measures of Correlation
24 pages
Logistic Regression Explained: Key Concepts
No ratings yet
Logistic Regression Explained: Key Concepts
33 pages
Deep Learning for Pandemic Detection
No ratings yet
Deep Learning for Pandemic Detection
35 pages

Variable Importance in Binary Trees

Uploaded by

Variable Importance in Binary Trees

Uploaded by

Electronic Journal of Statistics

Vol. 1 (2007) 519–537

Variable importance in binary

Abstract: We characterize and study variable importance (VIMP) and

Keywords and phrases: CART, random forests, maximal subtree.

Received April 2007.

A binary recursively grown tree, T , can be thought of as a map that uniquely

a d-dimensional vector x = (x1 , . . . , xv ) ∈ Rd , where xv are real valued variables.

2. Binary regression trees

We begin by first defining a binary regression tree.

Definition 1. Let L = {(Yi , xi ) : i = 1, . . . , n} denote the learning data. A

A binary regression tree (hereafter simply refered to as a binary tree) must

Bm (x)Bm′ (x) = 0 if m 6= m′ . (2.1)

This property will play a key role in our theoretical development.

3.1. Comparison to random forests

Surprisingly, it is also possible to give an explicit representation for the predictor

Suppose that T contains Kv distinct maximal v-subtrees, T̃1,v , . . . , T̃Kv ,v .

Lemma 1. Let µ̂v (x) denote the predictor for Tv . Then,

Proof. If T (x) ∈ / Mv , then x is assigned the same terminal node as in

3.3. Prediction error

To formally discuss VIMP it is necessary to first define what we mean by pre-

Using standard arguments it is easy to show that under (3.2),

P(µ̂) = σ 2 + E(µ(x) − µ̂(x))2 . (3.3)

Now observe by Lemma 1 that µ̂v (x) can be rewritten as

4. A theoretical framework for VIMP

In order to study ∆v systematically, it will be necessary to make a simplifying

where rk,m = ãk,v − am and rm = am,0 − am . The expectation sitting outside

Theorem 1 has an interesting implication. Let P̃k,v,0 be the distribution for

Definition 3. Let C = {T (x; b) : b = 1, . . . , B} be a collection of 1 ≤ B < ∞

where µ̂(x; b) denotes the predictor for T (x; b).

Proof. The L2 (P0 )-closure of finite linear combinations of indicator functions

5.1. VIMP for forests

In light of Theorem 2, it seems quite reasonable to assume that µ̂F (x) is a

Noising up a variable xv in a forest is straightforward. One simply noises

From this we arrive at the following result.

5.2. Implications for random forests

6. Paired importance values and associations

be the set of maximal v and w-subtrees of T . Observe that in some cases a

6.1. Paired association values for forests

Association between variables is a little more delicate when it comes to forests.

Moreover, it follows straightforwardly from our previous arguments that the

From this we obtain the following characterization.

Since randomForest does not compute paired importance values we adopted

0 100 250 0 100 250

0 100 250 0 100 250

A more direct confirmation of the theory can be illustrated using simulated

Y = 30 sin(πx1 x2 ) + 20(x3 − 0.5)2 + 20x1 x4 + 5x5 + ε,

where ε was distributed independently from a standard normal distribution. All

You might also like