0% found this document useful (0 votes)
6 views79 pages

Tree-Based Methods & SVM Overview

Unit IV of the document covers Tree Based Methods and Support Vector Machines, focusing on decision trees and regression trees. It details the process of decision tree induction, including algorithms for building and pruning trees, as well as attribute selection measures like Information Gain. The unit also discusses the application of these methods in both classification and regression problems.

Uploaded by

ersenthilprabhu
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
6 views79 pages

Tree-Based Methods & SVM Overview

Unit IV of the document covers Tree Based Methods and Support Vector Machines, focusing on decision trees and regression trees. It details the process of decision tree induction, including algorithms for building and pruning trees, as well as attribute selection measures like Information Gain. The unit also discusses the application of these methods in both classification and regression problems.

Uploaded by

ersenthilprabhu
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd

19CSCN1602 – Machine Intelligence

Unit –4
Prabhu K, AP(SS)/CSE
Unit IV

Tree Based Methods and Support Vector


Machines

Decision Tree – Random Forest – Maximal margin classifier – Support vector


classifiers –Support Vector Machines (SVM) – SVMs with more than two
classes – Relationship to logistic regression.
Decision Tree
Decision Tree Induction
• Decision tree
• Flow chart like tree
structure age?
• Internal nodes - test <=30 >40
on an attribute 31..40
• Branch - outcome of
the test
student? yes credit rating?
• leaf node (or terminal
no
node) holds a class excellent
fair
label
no yes no yes
• To classify sample
trace path from root
Decision trees can be applied
to both regression and
A decision tree for the concept buys computer
classification problems
Regression Tree
Example: Predicting Baseball Players’ Salaries Using Regression
Trees
Hitters data set:to predict a baseball player’s Salary based on Years
and Hits

R1 ={X | Years<4.5}
R2 ={X | Years>=4.5Hits<117.5}
R3 ={X | Years>=4.5, Hits>=117.5}.
Regression Tree

Prediction via Stratification of the Feature Space

1. Divide the predictor space—that is, the set of possible values


for X1, X2,...,Xp—into J distinct and non-overlapping regions, R1,

R2,...,RJ .

2. For every observation that falls into the region Rj , we make


the same prediction, which is simply the mean of the response
values for the training observations in Rj
Regression Tree

• Find boxes R1,...,RJ that minimize the RSS, given by

Where is the mean response for the training observations


within the jth box
Regression Tree
• Top-down, greedy approach is known as recursive binary
splitting
• first select the predictor Xj and the cutpoint s such that
splitting the predictor space into the regions {X|Xj < s} and {X|
Xj ≥ s} leads to the greatest possible reduction in RSS.
• R1(j, s) = {X|Xj < s} and R2(j, s) = {X|Xj ≥ s}
• seek the value of j and s that minimize the equation

where ˆyR1 is the mean response for the training observations in R1(j, s),

and ˆyR2 is the mean response for the training observations in R2(j, s)
Regression Tree

• Repeat the process, looking for the best predictor and best
cutpoint in order to split the data further so as to minimize
the RSS within each of the resulting regions.
• The process continues until a stopping criterion is reached i.e
continue until no region contains more than five
observations.
• Once the regions R1, . . . , RJ have been created, predict the
response for a given test observation using the mean of the
training observations in the region to which that test
observation belongs
Tree Pruning
• grow a very large tree T0, and then prune it back in order to
obtain a subtree.
• goal is to select a subtree that subtree leads to the lowest test
error rate.
• Cost complexity pruning—also known as weakest link pruning
• consider a sequence of trees indexed by a nonnegative tuning
parameter α.
• For each value of α there corresponds a subtree
such that

is as small as possible
Tree Pruning

• |T |-number of terminal nodes of the tree T


• Rm is the rectangle corresponding to the mth terminal node

• yRm is the predicted response associated with Rm-that is, the mean of the
training observations in Rm.
• The tuning parameter α controls a trade-off between the subtree’s
complexity and its fit to the training data.
• When α = 0, then the subtree T simply equal T0
• as α increases, there is a price to pay for having a tree with many terminal
nodes.
• select a value of α using a validation set or using cross-validation.
• then return to the full data set and obtain the subtree corresponding to α
Regression tree analysis for the Hitters data
Algorithm: Building a Regression Tree

1. Use recursive binary splitting to grow a large tree on the training data,
stopping only when each terminal node has fewer than some minimum number
of observations.
2. Apply cost complexity pruning to the large tree in order to obtain a sequence
of best subtrees, as a function of α.
3. Use K-fold cross-validation to choose α. That is, divide the training
observations into K folds. For each k = 1,...,K:
(a) Repeat Steps 1 and 2 on all but the kth fold of the training data.
(b) Evaluate the mean squared prediction error on the data in the left-out kth
fold, as a function of α.
Average the results for each value of α, and pick α to minimize the average error.
4. Return the subtree from Step 2 that corresponds to the chosen value of α.
Classification Tree

• A classification tree is very similar to a regression


tree, except that it is used to predict a qualitative
response rather than a quantitative one
Decision Tree Induction
• Basic algorithm (a greedy algorithm)

• Tree is constructed in a top-down recursive divide-and-conquer


manner
• At start, all the training examples are at the root

• If all the samples belong to the same class the node becomes a leaf
labeled with the class else choose attribute for partitioning
• Attributes are categorical (if continuous-valued, they are discretized
in advance)
• Examples are partitioned recursively based on selected attributes

• Test attributes are selected on the basis of a heuristic or statistical


measure (e.g., information gain)
Algorithm: Generate decision tree. Generate a decision tree from the training tuples of data partition, D.
Input:
Data partition, D, which is a set of training tuples and their associated class labels;
attribute list, the set of candidate attributes;
Attribute selection method, a procedure to determine the splitting criterion that “best” partitions the data
tuples into individual classes. This criterion consists of a splitting attribute and, possibly, either a split-point or splitting subset.
Output: A decision tree. Method:
(1) create a node N;
(2) if tuples in D are all of the same class, C, then
(3) return N as a leaf node labeled with the class C;
(4) if attribute list is empty then
(5) return N as a leaf node labeled with the majority class in D; // majority voting
(6) apply Attribute selection method(D, attribute list) to find the “best” splitting criterion;
(7) label node N with splitting criterion;
(8) if splitting attribute is discrete-valued and multiway splits allowed then // not restricted to binary trees
(9) attribute list ←attribute list −splitting attribute; // remove splitting attribute
(10) for each outcome j of splitting criterion // partition the tuples and grow subtrees for each partition
(11) let Dj be the set of data tuples in D satisfying outcome j; // a partition
(12) if Dj is empty then
(13) attach a leaf labeled with the majority class in D to node N;
(14) else attach the node returned by Generate decision tree(Dj, attribute list) to node N; end for
(15) return N;
Attribute Selection

• Splitting Criteria
• determines the best way to split

• Indicates the splitting attribute and split point

• Measures
• Information Gain
• Gain Ratio
• Gini Index
Partitioning Scenarios
• Attribute:
a. Discrete Valued
• A1, A2, A3.?
b. Continuous Valued
• A <= Split point; A > Split
point
c. Discrete Valued and Binary
tree must be produced
• A  SA
Partitioning scenarios Examples
a)

A? color?

green
a1 a2 …. ap

orange
purpl
red

blue

e
b) A? income?

A<=split_point A>split_point <=42000 >42000

c)
AϵSA? colorϵ{red,green}

yes no yes no
Attribute Selection Measures
• Heuristic for selecting splitting criterion

• Also termed as Splitting rules

• Ranks the attributes

• If selected attribute is continuous valued or discrete with binary split, then


split point or split subset must also be chosen

• Common measures – Information Gain, Gain ratio, Gini Index

• Notations

• D – Data Partition

• Class label attribute has m distinct values – Ci (for i=1….m)

• Ci,D – Set of tuples of class i in D


Attribute Selection Measure: Information
Gain
 Proposed by Claude Shannon
 Select the attribute with the highest information gain
 Minimizes the information needed to classify tuples in the resulting partition
 Minimizes the expected number of tests needed to classify a tuple
 Expected information needed to classify a tuple in D
Info(D) = -
where pi = which is the probability that an arbitrary sample belongs to
class Ci
 Info(D) - Average information required to classify a tuple in D
 Info(D) is also known as entropy of D
Information Gain
• Attribute A with v distinct values {a1, a2, …av}

• If A is discrete valued partition D into v subsets {D1,D2,…DV}


• Each partition is expected to be pure
• Additional information required to classify the samples is:
InfoA(D) =
• - Weight of partition
• InfoA(D) – Expected information required to classify a tuple from D
based on partitioning by A
• Smaller the expected information greater the purity of partitions
• The Information Gain is given by:
Gain(A) = Info(D) – InfoA(D)
• Expected reduction in Information requirement by choosing A
Example
RID Age Income Stude Credit Buys_com
nt rating puter
1 Youth High No Fair No

2 Youth High No Excellent No

3 Middle_aged High No Fair Yes


Based on the
attribute
4 senior Medium No Fair Yes ‘buys_computer’
5 senior Low Yes Fair Yes
the number of
6 senior Low Yes Excellent No classes m = 2
7 Middle_aged Low Yes Excellent Yes C1 – ‘Yes’
8 Youth Medium No Fair No C2 – ‘No’
9 Youth Low Yes Fair Yes

10 senior Medium Yes Fair Yes

11 youth Medium Yes Excellent Yes

12 Middle_aged Medium No Excellent Yes

13 Middle_aged High Yes Fair Yes

14 senior Medium no Excellent No


Example
• Expected information needed to classify the sample
- =0.940 bits

• Expected Information requirement for age Formula:


Age D1i D2i InfoA(D) =

youth 2 3 Info(D) = -

middle-aged 4 0
senior 3 2

=0.694bits

Gain(age) = Info(D) – Infoage(D)


=0.940-0.694
= 0.246 bits
Formula:
InfoA(D) = RID Income Class
Info(D) = - 1 High No
2 High No
Income D1i D2i
3 High Yes
High 2 2
4 Medium Yes
Medium 4 2
5 Low Yes
low 3 1
6 Low No
Info income(D)= ( - log2 - log2 )
7 Low Yes
+ (- log2 - log2 ) 8 Medium No
9 Low Yes
+ (- log2 - log2 )
10 Medium Yes
=0.911 11 Medium Yes
12 Medium Yes
Gain(income) = Info(D) – Infoincome(D) 13 High Yes
=0.940-0.911 14 Medium No
=0.029 bits
Formula:
InfoA(D) =
Info(D) = -
RID student Class
1 No No
Student D1i D2i 2 No No
No 4 3 3 No Yes
yes 6 1 4 No Yes
5 Yes Yes
Info student(D)= (- log2 - log2 )
6 Yes No
+ (- log2 - log2 ) 7 Yes Yes
=0.789 8 No No
9 Yes Yes
10 Yes Yes
Gain(student) = Info(D) – Infostudent(D) 11 Yes Yes
=0.940-0.789 12 No Yes
=0.151 bits 13 Yes Yes
14 no No
Credit D1i D2i RID Credit Class
rating rating
Fair 6 2 1 Fair No

Excellent 3 3 2 Excellent No
3 Fair Yes
4 Fair Yes
Info creditrating(D)= (- log2 - log2 ) 5 Fair Yes
6 Excellent No
+ (- log2 - log2 )
7 Excellent Yes
=0.892 8 Fair No
9 Fair Yes
10 Fair Yes
Gain(credit rating) = Info(D) – Infocreditraing(D) 11 Excellent Yes
12 Excellent Yes
=0.940-0.892
13 Fair Yes
=0.048 bits
14 Excellent No
RID Age Income Student Credit Buys_compu
rating ter
1 Youth High No Fair No

2 Youth High No Excellent No

3 Middle_aged High No Fair Yes

4 senior Medium No Fair Yes

5 senior Low Yes Fair Yes

6 senior Low Yes Excellent No

7 Middle_aged Low Yes Excellent Yes


8 Youth Medium No Fair No

9 Youth Low Yes Fair Yes

10 senior Medium Yes Fair Yes

11 youth Medium Yes Excellent Yes

12 Middle_aged Medium No Excellent Yes

13 Middle_aged High Yes Fair Yes

14 senior Medium no Excellent No


Example
Age?

youth senior
Middle_aged

Income Student Credit_rating Class Income Student Credit_ratin Class


g
High No Fair No
Medium No Fair yes
High No Excellent no
Low Yes Fair yes
Medium No Fair no
Low Yes Excellent no
Low Yes Fair yes
Medium Yes Fair yes
Medium Yes Excellent yes
Medium No Excellent no

Gain(age) = 0.246
Income Student Credit_ratin Class
Gain(income) = 0.029 g

Gain(student) = 0.151 High No Fair yes

Gain(credit_rating) = 0.048 Low Yes Excellent yes


Medium no Excellent yes
Age has highest gain
High Yes Fair yes
Age?
youth senior
Middle_aged
Information-Gain for Continuous-Value
Attributes
• Must determine the best split point for A

• Sort the value A in increasing order

• Typically, the midpoint between each pair of adjacent values is considered as a


possible split point

• (ai+ai+1)/2 is the midpoint between the values of ai and ai+1

• The point with the minimum expected information requirement for A is selected
as the split-point for A

• Calculate InfoA(D) for each possible split point and choose minimum one

• Split:

• D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of tuples


in D satisfying A > split-point
Gain Ratio for Attribute Selection
• Information gain measure is biased towards attributes with a large number
of values

• Results in more number of partitions - pure

• C4.5 (a successor of ID3) uses gain ratio to overcome the problem


(normalization to information gain)
• Split information value is used:

• Potential information generated by splitting the training data set D into


v partitions – Considers the number of tuples wrt total tuples
• The attribute with the maximum gain ratio is selected as the splitting
attribute GainRatio(A)=
Gain Ratio - Example Formula:
GainRatio(A)=

• Gain ratio for Income attribute

- -

= 0.926
Income D1i D2i
High 2 2
Medium 4 2
low 3 1

Gain(Income) = 0.029
GainRatio(Income) = 0.029/0.926 = 0.031
Gini Index
• Measures the impurity of a data partition
n 2
gini( D) 1  p j
j 1
• pj is probability of a tuple belonging to class Cj
• Considers a binary split for each value
• To determine best binary split on A, all possible subsets that can be formed
are considered
• 2v – 2 possible ways to form two partitions
• For binary splits |D | |D |
gini A ( D)  1 gini( D1)  2 gini( D 2)
|D| |D|

• Reduction in impurity gini( A) gini( D)  giniA (D)


• Attribute that maximizes reduction in impurity or one with minimum
Gini index is chosen
n 2
Gini index
gini( D) 1  p j
j 1

• Ex. D has 9 tuples in buys_computer = “yes” and 5 in “no”


2 2
 9  5
gini ( D ) 1       0.459
 14   14 
• attribute income partitions D into 10 in D1: {low, medium} and 4 in D2

 10   4
giniincome{low,medium} ( D)   Gini( D1 )    Gini( D1 )
 14   14 

+
=0.450
Income D1i D2i
=Gini income𝞊 {high} (D)
High 2 2
Medium 4 2
low 3 1

but gini{medium,high} is 0.30 and thus the best since it is the lowest
Attribute Selection Measures

• The three measures, in general, return good results but


• Information gain:
• biased towards multivalued attributes
• Gain ratio:
• tends to prefer unbalanced splits in which one partition is much
smaller than the others
• Gini index:
• biased to multivalued attributes
• has difficulty when # of classes is large
• tends to favor tests that result in equal-sized partitions and purity
in both partitions
Advantages and Disadvantages of Trees

• Advantages

• Trees are very easy to explain to people

• Trees can be displayed graphically, and are easily interpreted even by a


non-expert
• Trees can easily handle qualitative predictors without the need to create
dummy variables.

• Disadvantage

• Trees generally do not have the same level of predictive accuracy as some of
the other regression and classification approaches
• Trees can be very non-robust ie.a small change in the data can cause a large
change in the final estimated tree.
Random Forest
Random Forest
• It is a supervised machine learning technique

• Can be used for both Classification and Regression problems

• It is based on the concept of ensemble learning, which is a process


of combining multiple classifiers to solve a complex problem and to
improve the performance of the model.

• Definition

• Random Forest is a classifier that contains a number of decision trees


on various subsets of the given dataset and takes the average to
improve the predictive accuracy of that dataset.
Random forest classifier

• Instead of relying on one decision tree, the random forest takes the prediction from each
tree and based on the majority votes of predictions, and it predicts the final output.
• The greater number of trees in the forest leads to higher accuracy and prevents the
problem of overfitting.
• Each time a split is to be performed, the search for the split variable is limited to a random
subset of m of the p attributes.
• m = √p
• on average (p − m)/p of the splits will not even consider the strong predictor, and so
other predictors will have more of a chance. This process is called as decorrelating
the trees, that makes the average of the resulting trees less variable and hence more
reliable.
• Random Forests produce many unique trees.
Random forest classifier
• Random forest introduces randomness into the rows and columns of the data
• provides a more diverse set of trees that almost always lowers our prediction
error.
Random Forest Classifier

Training Data

M features
N examples
Random Forest Classifier
Create bootstrap samples
from the training data

M features
N examples

....…
Random Forest Classifier
Construct a decision tree

M features
N examples

....…
Random Forest Classifier
At each node in choosing the split feature
choose only among m<M features

M features
N examples

....…
Random Forest Classifier
Create decision tree
from each bootstrap sample

M features
N examples

....…
....…
Random Forest Classifier

M features
N examples

Take the
majority
vote

....…
....…
Random Forest algorithm

• Step-1: Select random K data points from the training set.

• Step-2: Build the decision trees associated with the selected data points
(Subsets).

• Step-3: Choose the number N for decision trees that you want to build.

• Step-4: Repeat Step 1 & 2

• Step-5: For new data points, find the predictions of each decision tree,
and assign the new data points to the category that wins the majority
votes.
Advantages

• It takes less training time as compared to other algorithms.


• It is capable of performing both Classification and Regression tasks.
• It is capable of handling large datasets with high dimensionality.
• It enhances the accuracy of the model and prevents the overfitting issue.

Disadvantages

• Although random forest can be used for both classification and


regression tasks, it is not more suitable for Regression tasks.
Maximal margin classifier
What Is a Hyperplane?
• In a p-dimensional space, a hyperplane is a flat affine subspace of dimension
p−1
• For instance ,
• In two dimensions, a hyperplane is a flat one-dimensional subspace—
that is a line
• In three dimensions, a hyperplane is a flat two-dimensional subspace—
that is, a plane
• In two dimensions, a hyperplane is defined by the equation

• β0 + β1X1 + β2X2 = 0

• for parameters β0, β1, and β2.


• In p-dimensional setting
• β0 + β1X1 + β2X2 + ... + βpXp = 0
What Is a Hyperplane?

• if a point X = (X1, X2,...,Xp)T in p-dimensional space satisfies


• β0 + β1X1 + β2X2 + ... + βpXp = 0
• then X lies on the hyperplane.
• If β0 + β1X1 + β2X2 + ... + βpXp > 0.
• X lies to one side of the hyperplane
• If β0 + β1X1 + β2X2 + ... + βpXp < 0
• X lies on the other side of the hyperplane
• Hyperplane divides p-dimensional space into two halves
• The hyperplane 1+2X1 + 3X2 = 0

• The blue region is the set of points for which 1+2X1 + 3X2 > 0

• The purple region is the set of points for which 1+2X1 + 3X2 < 0.
Classification Using a Separating Hyperplane

• n×p data matrix X consists of n training observations in p-dimensional


space

• these observations fall into two classes—that is, y1,...,yn ∈{−1, 1}


• construct a hyperplane that separates the training observations perfectly
according to their class labels

Property of separating hyperplane


β0 + β1xi1 + β2xi2 + ... + βpxip > 0 if yi = 1,

β0 + β1xi1 + β2xi2 + ... + βpxip < 0 if yi = −1.


• classify the test observation x∗ based on the sign of

• f(x∗) = β0+β1x∗1+β2x∗2+...+βpx∗

• If f(x∗) is
• positive, assign the test observation to class 1
• negative assign it to class −1
The Maximal Margin Classifier
• Maximal margin hyperplane also known as the optimal separating hyperplane is
the separating hyperplane that is farthest from the training observations.
• Compute the distance from each training observation to a given separating
hyperplane
• The smallest distance is the minimal distance from the observations to the
hyperplane, and is known as the margin
• The maximal margin hyperplane is the separating hyperplane for which the
margin is largest
• It is the hyperplane that has the farthest minimum distance to the training
observations
• Then classify a test observation based on which side of the maximal margin
hyperplane it lies
• This is known as the maximal margin classifier
• three training observations are equidistant from the maximal margin hyperplane and
lie along the dashed lines indicating the width of the margin.
• These three observations are known as support vectors, since they are vectors in p-
dimensional space
• they “support” the maximal margin hyperplane in the sense that if these points were
moved slightly then the maximal margin hyperplane would move as well
• the maximal margin hyperplane depends directly on the support vectors
• the maximal margin hyperplane depends directly on only a small subset of the
observations is an important property
Construction of the Maximal Margin Classifier

• Given

• n training observations x1,...,xn ∈ Rp

• class labels y1,...,yn ∈ {− 1,1}

• maximal margin hyperplane is the solution to the optimization problem


• maximize M
β0,β1,...,βp

• subject to =1

• yi(β0 + β1xi1 + β2xi2 + ...+ βpxip) ≥ M ∀ i =1,...,n.

• M represents the margin of hyperplane, and the optimization problem


chooses β0,β1,...,βp to maximize M
The Non-separable Case

• no separating hyperplane exists


• the optimization problem has no solution with M>0.
• cannot exactly separate the two classes
Support Vector Classifiers
Support Vector Classifiers
• The generalization of the maximal margin classifier to the non-separable
case is known as the support vector classifier.
• consider a classifier based on a hyperplane that does not perfectly
separate the two classes , in the interest of
• Greater robustness to individual observations, and
• Better classification of most of the training observations
• it could be worthwhile to misclassify a few training observations in order
to do a better job in classifying the remaining observations.
• The support vector classifier is also called as soft margin classifier
• Rather than seeking the largest possible margin so that every observation is not only
on the correct side of the hyperplane but also on the correct side of the margin,
instead allow some observations to be on the incorrect side of the margin or even the
incorrect side of the hyperplane.
• The margin is soft because it can be violated by some of the training observations

• When there is no separating hyperplane an observation can be not only on the wrong
side of the margin, but also on the wrong side of the hyperplane.
• Observations on the wrong side of the hyperplane correspond to training observations
that are misclassified by the support vector classifier
Maximize M
β0,β1,...,βp,𝞊1,…….. 𝞊n ---------------------------------------(1)
subject to =1--------------------------------------(2)
yi(β0 + β1xi1 + β2xi2 + ...+ βpxip) ≥ M (1- 𝞊i),-------------(3)

𝞊i>=0, ------------------------------------------(4)
where C is a nonnegative tuning parameter
𝞊1,..., 𝞊n are slack variables that allow individual observations to be on
the wrong side of the margin or the hyperplane.
𝞊i tells us where the ith observation is located, relative to the
hyperplane and relative to the margin
• classify the test observation based on the sign of
• f(x∗) = β0 + β1x∗ 1 + ... + βpx∗p.

• If 𝞊i = 0 then the ith observation is on the correct side of the margin

• If 𝞊i > 0 then the ith observation is on the wrong side of the margin

• If 𝞊i > 1 then it is on the wrong side of the hyperplane

• C is treated as a tuning parameter. It controls the bias-variance trade-off


• C is small
• seek narrow margins that are rarely violated
• this amounts to a classifier that is highly fit to the data
• which may have low bias but high variance
• C is larger
• the margin is wider and allow more violations to it
• this amounts to fitting the data less hard
• obtaining a classifier with high biased and low variance.
• an observation that lies strictly on the correct side of the margin
does not affect the support vector classifier.
• Changing the position of that observation would not change the
classifier at all, provided that its position remains on the correct side
of the margin.
• Observations that lie directly on the margin, or on the wrong side of
the margin for their class, are known as support vectors.
• These observations do affect the support vector classifier
Support Vector Machines (SVM)
Classification with Non-linear Decision
Boundaries

The observations fall into two classes, with a The support vector classifier seeks a
non-linear boundary between them. linear boundary, and consequently
performs very poorly.
• The support vector classifier, address the problem of non-linear boundaries
between classes , by enlarging the feature space using quadratic, cubic, and
even higher-order polynomial functions of the predictors.
• For instance,rather than fitting a support vector classifier using p features

• X1, X2...,Xp,

• instead fit a support vector classifier using 2p features

• X1, X2 1 , X2, X2 2 ,...,Xp, X2 p .


• Maximize M

• β0,β11, β12...,βp1, βp2, 𝞊1,…….. 𝞊n

• subject to yi(β0 + ) ≥ M (1- 𝞊i)


The Support Vector Machine
• The support vector machine (SVM) is an extension of the support vector
classifier that results from enlarging the feature space in a specific way, using
kernels.

• the inner product of two observations xi, xi’ is given by

⟨ ⟩
𝑝
𝑥𝑖,𝑥 𝑖 =∑ 𝑥𝑖𝑗 𝑥𝑖 𝑗
′ ′

𝑗=1

• The linear support vector classifier can be represented as


• f(x)=𝞫0+

• where there are n parameters αi, i = 1,...,n, one per training observation

• To estimate the parameters α1,...,αn and β0, need n(n-1)/ 2 inner products
between all pairs of training observations.
• 𝞪i=nonzero only for the support vectors in the solution

• 𝞪i=zero if a training observation is not a support vector

• If S is the collection of indices of these support points

• f(x)=𝞫0+

• generalization of the inner product of the form


• K(xi, xi’ ), where K is some function referred as a kernel.

• A kernel is a function that quantifies the similarity of two observations

• K(xi,xi’)=.
• This is known as a linear kernel because the support vector classifier is
linear in the features.

• K(xi,xi’)=(1+)d
• This is known as a polynomial kernel of degree d, where d is a positive
polynomial integer . This is an example non-linear kernel
• Radial kernel, takes the form

Where γ is a positive constant

• The advantage of using a kernel is


• need to compute only K(xi, x’i) for all distinct pairs i, i’ .
An SVM with a polynomial kernel of degree 3 An SVM with a radial kernel is applied
is applied to the non-linear data resulting in
a far more appropriate decision rule.
SVMs with more than two classes
One-Versus-One Classification

• If K > 2 classes
• one-versus-one or all-pairs approach constructs
• SVMs, each of which compares a pair of classes.
• SVM might compare the k th class, coded as +1, to the k’th class, coded as −1
• classify a test observation using each of the classifiers,
• tally the number of times that the test observation is assigned to each of the
K classes
• The classification is performed by assigning the test observation to the
class to which it was most frequently assigned in these pairwise
classifications.
One-Versus-All Classification

• The one-versus-all approach is an alternative procedure for applying SVM


in the case of K > 2 classes.

• fit K SVMs, each time comparing one of the K classes to the remaining

K − 1 classes

• Let β0k, β1k,...,βpk denote the parameters that result from fitting an SVM
comparing the kth class (coded as +1) to the others (coded as −1)

• Let x∗ denote a test observation.

• assign the observation to the class for which β0k +β1kx∗ 1 +β2kx∗ 2 +...+ βpkx∗

p is largest
Reference(s):

• James G, Witten D, Hastie T and Tibshirani R, “An


Introduction to Statistical Learning with
Applications in R”, Springer,2013

You might also like