Tree-Based Methods & SVM Overview
Tree-Based Methods & SVM Overview
Unit –4
Prabhu K, AP(SS)/CSE
Unit IV
R1 ={X | Years<4.5}
R2 ={X | Years>=4.5Hits<117.5}
R3 ={X | Years>=4.5, Hits>=117.5}.
Regression Tree
R2,...,RJ .
where ˆyR1 is the mean response for the training observations in R1(j, s),
and ˆyR2 is the mean response for the training observations in R2(j, s)
Regression Tree
• Repeat the process, looking for the best predictor and best
cutpoint in order to split the data further so as to minimize
the RSS within each of the resulting regions.
• The process continues until a stopping criterion is reached i.e
continue until no region contains more than five
observations.
• Once the regions R1, . . . , RJ have been created, predict the
response for a given test observation using the mean of the
training observations in the region to which that test
observation belongs
Tree Pruning
• grow a very large tree T0, and then prune it back in order to
obtain a subtree.
• goal is to select a subtree that subtree leads to the lowest test
error rate.
• Cost complexity pruning—also known as weakest link pruning
• consider a sequence of trees indexed by a nonnegative tuning
parameter α.
• For each value of α there corresponds a subtree
such that
is as small as possible
Tree Pruning
• yRm is the predicted response associated with Rm-that is, the mean of the
training observations in Rm.
• The tuning parameter α controls a trade-off between the subtree’s
complexity and its fit to the training data.
• When α = 0, then the subtree T simply equal T0
• as α increases, there is a price to pay for having a tree with many terminal
nodes.
• select a value of α using a validation set or using cross-validation.
• then return to the full data set and obtain the subtree corresponding to α
Regression tree analysis for the Hitters data
Algorithm: Building a Regression Tree
1. Use recursive binary splitting to grow a large tree on the training data,
stopping only when each terminal node has fewer than some minimum number
of observations.
2. Apply cost complexity pruning to the large tree in order to obtain a sequence
of best subtrees, as a function of α.
3. Use K-fold cross-validation to choose α. That is, divide the training
observations into K folds. For each k = 1,...,K:
(a) Repeat Steps 1 and 2 on all but the kth fold of the training data.
(b) Evaluate the mean squared prediction error on the data in the left-out kth
fold, as a function of α.
Average the results for each value of α, and pick α to minimize the average error.
4. Return the subtree from Step 2 that corresponds to the chosen value of α.
Classification Tree
• If all the samples belong to the same class the node becomes a leaf
labeled with the class else choose attribute for partitioning
• Attributes are categorical (if continuous-valued, they are discretized
in advance)
• Examples are partitioned recursively based on selected attributes
• Splitting Criteria
• determines the best way to split
• Measures
• Information Gain
• Gain Ratio
• Gini Index
Partitioning Scenarios
• Attribute:
a. Discrete Valued
• A1, A2, A3.?
b. Continuous Valued
• A <= Split point; A > Split
point
c. Discrete Valued and Binary
tree must be produced
• A SA
Partitioning scenarios Examples
a)
A? color?
green
a1 a2 …. ap
orange
purpl
red
blue
e
b) A? income?
c)
AϵSA? colorϵ{red,green}
yes no yes no
Attribute Selection Measures
• Heuristic for selecting splitting criterion
• Notations
• D – Data Partition
youth 2 3 Info(D) = -
middle-aged 4 0
senior 3 2
=0.694bits
Excellent 3 3 2 Excellent No
3 Fair Yes
4 Fair Yes
Info creditrating(D)= (- log2 - log2 ) 5 Fair Yes
6 Excellent No
+ (- log2 - log2 )
7 Excellent Yes
=0.892 8 Fair No
9 Fair Yes
10 Fair Yes
Gain(credit rating) = Info(D) – Infocreditraing(D) 11 Excellent Yes
12 Excellent Yes
=0.940-0.892
13 Fair Yes
=0.048 bits
14 Excellent No
RID Age Income Student Credit Buys_compu
rating ter
1 Youth High No Fair No
youth senior
Middle_aged
Gain(age) = 0.246
Income Student Credit_ratin Class
Gain(income) = 0.029 g
• The point with the minimum expected information requirement for A is selected
as the split-point for A
• Calculate InfoA(D) for each possible split point and choose minimum one
• Split:
- -
= 0.926
Income D1i D2i
High 2 2
Medium 4 2
low 3 1
Gain(Income) = 0.029
GainRatio(Income) = 0.029/0.926 = 0.031
Gini Index
• Measures the impurity of a data partition
n 2
gini( D) 1 p j
j 1
• pj is probability of a tuple belonging to class Cj
• Considers a binary split for each value
• To determine best binary split on A, all possible subsets that can be formed
are considered
• 2v – 2 possible ways to form two partitions
• For binary splits |D | |D |
gini A ( D) 1 gini( D1) 2 gini( D 2)
|D| |D|
10 4
giniincome{low,medium} ( D) Gini( D1 ) Gini( D1 )
14 14
+
=0.450
Income D1i D2i
=Gini income𝞊 {high} (D)
High 2 2
Medium 4 2
low 3 1
but gini{medium,high} is 0.30 and thus the best since it is the lowest
Attribute Selection Measures
• Advantages
• Disadvantage
• Trees generally do not have the same level of predictive accuracy as some of
the other regression and classification approaches
• Trees can be very non-robust ie.a small change in the data can cause a large
change in the final estimated tree.
Random Forest
Random Forest
• It is a supervised machine learning technique
• Definition
• Instead of relying on one decision tree, the random forest takes the prediction from each
tree and based on the majority votes of predictions, and it predicts the final output.
• The greater number of trees in the forest leads to higher accuracy and prevents the
problem of overfitting.
• Each time a split is to be performed, the search for the split variable is limited to a random
subset of m of the p attributes.
• m = √p
• on average (p − m)/p of the splits will not even consider the strong predictor, and so
other predictors will have more of a chance. This process is called as decorrelating
the trees, that makes the average of the resulting trees less variable and hence more
reliable.
• Random Forests produce many unique trees.
Random forest classifier
• Random forest introduces randomness into the rows and columns of the data
• provides a more diverse set of trees that almost always lowers our prediction
error.
Random Forest Classifier
Training Data
M features
N examples
Random Forest Classifier
Create bootstrap samples
from the training data
M features
N examples
....…
Random Forest Classifier
Construct a decision tree
M features
N examples
....…
Random Forest Classifier
At each node in choosing the split feature
choose only among m<M features
M features
N examples
....…
Random Forest Classifier
Create decision tree
from each bootstrap sample
M features
N examples
....…
....…
Random Forest Classifier
M features
N examples
Take the
majority
vote
....…
....…
Random Forest algorithm
• Step-2: Build the decision trees associated with the selected data points
(Subsets).
• Step-3: Choose the number N for decision trees that you want to build.
• Step-5: For new data points, find the predictions of each decision tree,
and assign the new data points to the category that wins the majority
votes.
Advantages
Disadvantages
• β0 + β1X1 + β2X2 = 0
• The blue region is the set of points for which 1+2X1 + 3X2 > 0
• The purple region is the set of points for which 1+2X1 + 3X2 < 0.
Classification Using a Separating Hyperplane
• f(x∗) = β0+β1x∗1+β2x∗2+...+βpx∗
• If f(x∗) is
• positive, assign the test observation to class 1
• negative assign it to class −1
The Maximal Margin Classifier
• Maximal margin hyperplane also known as the optimal separating hyperplane is
the separating hyperplane that is farthest from the training observations.
• Compute the distance from each training observation to a given separating
hyperplane
• The smallest distance is the minimal distance from the observations to the
hyperplane, and is known as the margin
• The maximal margin hyperplane is the separating hyperplane for which the
margin is largest
• It is the hyperplane that has the farthest minimum distance to the training
observations
• Then classify a test observation based on which side of the maximal margin
hyperplane it lies
• This is known as the maximal margin classifier
• three training observations are equidistant from the maximal margin hyperplane and
lie along the dashed lines indicating the width of the margin.
• These three observations are known as support vectors, since they are vectors in p-
dimensional space
• they “support” the maximal margin hyperplane in the sense that if these points were
moved slightly then the maximal margin hyperplane would move as well
• the maximal margin hyperplane depends directly on the support vectors
• the maximal margin hyperplane depends directly on only a small subset of the
observations is an important property
Construction of the Maximal Margin Classifier
• Given
• subject to =1
• When there is no separating hyperplane an observation can be not only on the wrong
side of the margin, but also on the wrong side of the hyperplane.
• Observations on the wrong side of the hyperplane correspond to training observations
that are misclassified by the support vector classifier
Maximize M
β0,β1,...,βp,𝞊1,…….. 𝞊n ---------------------------------------(1)
subject to =1--------------------------------------(2)
yi(β0 + β1xi1 + β2xi2 + ...+ βpxip) ≥ M (1- 𝞊i),-------------(3)
𝞊i>=0, ------------------------------------------(4)
where C is a nonnegative tuning parameter
𝞊1,..., 𝞊n are slack variables that allow individual observations to be on
the wrong side of the margin or the hyperplane.
𝞊i tells us where the ith observation is located, relative to the
hyperplane and relative to the margin
• classify the test observation based on the sign of
• f(x∗) = β0 + β1x∗ 1 + ... + βpx∗p.
• If 𝞊i > 0 then the ith observation is on the wrong side of the margin
The observations fall into two classes, with a The support vector classifier seeks a
non-linear boundary between them. linear boundary, and consequently
performs very poorly.
• The support vector classifier, address the problem of non-linear boundaries
between classes , by enlarging the feature space using quadratic, cubic, and
even higher-order polynomial functions of the predictors.
• For instance,rather than fitting a support vector classifier using p features
• X1, X2...,Xp,
•
The Support Vector Machine
• The support vector machine (SVM) is an extension of the support vector
classifier that results from enlarging the feature space in a specific way, using
kernels.
⟨ ⟩
𝑝
𝑥𝑖,𝑥 𝑖 =∑ 𝑥𝑖𝑗 𝑥𝑖 𝑗
′ ′
𝑗=1
• where there are n parameters αi, i = 1,...,n, one per training observation
• To estimate the parameters α1,...,αn and β0, need n(n-1)/ 2 inner products
between all pairs of training observations.
• 𝞪i=nonzero only for the support vectors in the solution
• f(x)=𝞫0+
• K(xi,xi’)=.
• This is known as a linear kernel because the support vector classifier is
linear in the features.
• K(xi,xi’)=(1+)d
• This is known as a polynomial kernel of degree d, where d is a positive
polynomial integer . This is an example non-linear kernel
• Radial kernel, takes the form
• If K > 2 classes
• one-versus-one or all-pairs approach constructs
• SVMs, each of which compares a pair of classes.
• SVM might compare the k th class, coded as +1, to the k’th class, coded as −1
• classify a test observation using each of the classifiers,
• tally the number of times that the test observation is assigned to each of the
K classes
• The classification is performed by assigning the test observation to the
class to which it was most frequently assigned in these pairwise
classifications.
One-Versus-All Classification
• fit K SVMs, each time comparing one of the K classes to the remaining
K − 1 classes
• Let β0k, β1k,...,βpk denote the parameters that result from fitting an SVM
comparing the kth class (coded as +1) to the others (coded as −1)
• assign the observation to the class for which β0k +β1kx∗ 1 +β2kx∗ 2 +...+ βpkx∗
p is largest
Reference(s):