Kernel Methods
Prof. Navneet Goyal,
CS Dept.
BITS-Pilani, Pilani Campus
INDIA
Figure source:
[Link]
[Link]/thbio/group/neuralnet
/index_p.html
Kernel Methods
• In computer science, kernel methods (KMs) are a
class of algorithms for pattern analysis, whose best
known element is the Support Vector Machine
(SVM) (Wikipedia)
• Transformations
• Feature Spaces
• Kernel Functions
• Kernel Tricks
• Inner Products
Kernel Methods
Algorithms capable of operating with kernels include:
• Support vector machine (SVM)
• Gaussian processes
• Fisher's linear discriminant analysis (LDA)
• Principal components analysis (PCA) (Kernel PCA)
• Canonical correlation analysis
• Ridge regression
• Spectral clustering
• Linear adaptive filters
•…
Kernel Methods
• Kernels are non-linear generalizations of
inner products
Kernel Methods
• Any kernel-based method comprises of
two modules:
• Mapping into embedding or feature space
• Learning algorithm designed to discover
linear patterns in that space
Kernel Methods
• Why this approach works?
• Detecting linear patterns has been the focus of
much research in statistics and machine learning
• Resulting algorithms are well understood and
efficient
• Computational shortcut: makes it possible to
represent linear patterns efficiently in high
dimensional space to ensure adequate
representational power
• The shortcut is nothing but the KERNEL FUNCTION
Kernel Methods
• Kernel methods allow us to extend algorithms such as
SVMs to define non-linear decision boundaries
• Other algorithms that only depend on inner products
between data points can be extended similarly
• Kernel functions which are symmetric and positive
definite allows us to implicitly define inner products in
high dimensional
• Replacing inner products in input space with positive
definite kernels immediately extends algorithms like SVM
to
• Linear separation in high dimensional space
• Or equivalently to a non-linear separation in input space
Kernel Methods
• Input space, χ
• High dimensional space, ℍ
• ℍ can be really large!!
• Document classification
• Trigrams
• Vocabulary of 100000 words
• Dimension of feature space reaches 1015
Kernel Functions
• A function K: 𝟀 x 𝟀 → ℝ is called a kernel over 𝟀
if for any two points x, x’ ∈ 𝟀,
K(x,x’) = 〈 ϕ (x), ϕ (x’)〉
For some mapping ϕ : 𝟀 → ℍ to a Hilbert space ℍ,
called a feature space
• K is efficient!
• K(x,x’) is O(N)
• 〈 ϕ (x), ϕ (x’)〉 is O(dim ℍ) with dim ℍ ≫ N
• K is flexible!
• No need to explicitly define or compute ϕ
• Kernel K can be arbitrarily chosen so long as the existense of
ϕ is guaranteed, i.e. K satisfies Mercer’s condition
Kernel Trick: Computational saving of the kernel trick
Example Quadratic Basis function (Andrew Moore Tut.)
The cost of
computation is:
O(m 2 )
(m is the dimension of input)
Whereas the corresponding Kernel is :
K (a, b) = (a × b + 1) 2
The cost of computation is: O (m)
To believe that it is
really the real Kernel :
Mercer’s Theorem
A function K ( xi , x j ) is a kernel (there exists a f ( x) such
that K (xi , x j ) º f (xi )T f (x j )) The Kernel matrix is
Symmetric Positive Semi-definite.
Another version of Mercer’s theorem that isn’t
related to the kernel matrix is: K ( x , x ) function
i j
is a kernel for any g (u ) such that
ò
2
g (u ) du is finite then
ò K (u, v) g (u ) g (v)dudv ³ 0
This version of Mercer’s condition does not require us to
know ϕ (x)!!
Kernels satisfying Mercer’s condition are called Positive
Definite Kernel Functions!
Examples of Kernels
-Some common choices (the first two always satisfying
Mercer’s condition):
-Polynomial kernel K ( xi , x j ) = ( xi t x j + 1) p
-Gaussian* Kernel (data is lifted to infinite dimension):
1 2
K ( xi , x j ) = exp( - xi - x j )
2s 2
*The Gaussian kernel is also known as the squared exponential kernel – SE kernel – or radial basis
function (RBF) kernel
K ( xi , x j ) = tanh(kxi × x j - d )
-Sigmoidal :
(it is not a kernel for every k and 𝞭 ).
-In fact, SVM model using a sigmoid kernel function is
equivalent to a two-layer, feed-forward neural network!!
Some Valid Kernels
Boser, Guyon and Vapnik, 1992
• Polynomial (Splines etc.)
K (x i , x) = (1 + x i × x)n
• Gaussian (Radial Basis Function Networks)
2
xi - x
K (x i , x) = exp(- 2s 2 )
• Sigmoid (Two-Layer Perceptron)
K (x i , x) = tanh( L + x i × x) only for certain L
x1 k a1 y1
sign
x y
x2
k
a 2 y2
Feature Spaces
F : x ® F( x), R ® F d
example: ( x, y ) ® ( x , y , 2 xy )
2 2
14
Kernel Trick
Kernel Trick: Examples
Consider the mapping:
The points in feature space corresponding to
x1 and x2:
Find:
Kernel Trick: Examples
Kernel Functions: Example
§ Given x = (x1, x2, x3); y = (y1, y2, y3). Find the
kernel for the function:
f(x)=(x1x1, x1x2, x1x3, x2x1, x2x2, x2x3, x3x1, x3x2, x3x3)
Also find f(x).f(y) without finding f(x) & f(y).
Verify your result using x = (1, 2, 3) & y = (4, 5, 6).
Kernel Functions: Example
§ Consider a two-dimensional input space 𝑋 ⊆ ℝE
together with the feature map:
∅: 𝒙 = 𝑥F , 𝑥E ⟼ ∅ 𝒙 = 𝑥FE , 𝑥EE , 2 𝑥F 𝑥E ∈ 𝐹
= ℝG
§ Find the Kernel corresponding to ∅ and 𝐹
§ Also find the Kernel function corresponding to:
∅: 𝒙 = 𝑥F , 𝑥E ⟼ ∅ 𝒙 = 𝑥FE , 𝑥EE , 𝑥F 𝑥E , 𝑥E 𝑥F ∈ 𝐹
= ℝH
§ What conclusions you can draw?
Kernel Functions: Example
§ Find a mapping f from ℝ𝟐 → ℝ𝟔 corresponding to
the kernel 𝐾 𝑿, 𝒀 = [2 𝑿. 𝒀 + 3]E
§ For a given kernel, is the mapping unique?
Kernel Functions
§ How to choose the mapping ϕ ?
ú Several common non-linear mappings can be used
ú In fact we do not even need to know what the mapping
is J
§ How to avoid CoD?
ú Dot product in feature space can be represented as a
kernel function of input points:
K(x,x’) = ϕ (x).ϕ (x’)
ú SVMs do not suffer from CoD
Complexity of SVMs depend on the number of Support
Vectors (SVs)
Kernel Functions
§ Which Kernel to choose?
ú Properties of the kinds of kernels that can be used
to replace the dot product have been studiedJ
ú Admissible Kernels functions are:
Polynomial Kernel
Gaussian Radial Basis Function Kernel
Sigmoid Kernel
ú Each of these Kernels result in a different nonlinear
classifier in the input space
Kernel Functions
§ Which Kernel to choose?
ú No golden rules for determining which admissible
kernel will result in the most accurate SVM L
ú Kernel chosen does not make a large difference in
resulting accuracy
ú SVM training always finds a global solution
§ Challenges in SVM
ú Scalability
Need to improve the speed of training/testing so that SVM
scales to very large data sets (millions of SVs)
ú Determining best Kernel for a given data set
ú More efficient implementations of SVM for multiclass
problems
Kernel Functions: Examples
§ Decision boundary a circle cantered at
(0.5,0.5) with r=0.2
§ Points on circles with centre at origin and radii
1 & 2 respectively.
Kernel Trick
• Simple trick to kernelize existing algorithms that are
based on inner products
• Replace inner products with Kernels
• Explicit feature representation not required
25
Making New Kernels
Important Kernel Issues
How to know which Kernel to use?
- This is a good question and actually still an open question,
many researches have been working to deal with this issue
but still we don’t have a firm answer. It is one of the weakness
of SVM. We will see an approach to this issue latter.
- How to verify that rising to higher dimension using a specific
kernel will map the data to a space in which they are linearly
separable?
- For most of the kernel function we don’t know the
corresponding mapping function ϕ (x) so we don’t know to
which dimension we rose the data. So even though rising to
higher dimension increases the likelihood that they will be
separable we can’t guarantee that.
Important Kernel Issues
Gaussian Radial Basis Kernel lifts the data to infinite dimension
so our data is always separable in this space so why don’t we
always use this kernel?
First of all we should decide which s to use in this kernel
1 2
( exp(- xi - x j )).
2s 2
Secondly, a strong kernel ,which lifts the data to infinite
dimension, sometimes may lead us the severe problem of
Overfitting
Symptoms of overfitting:
1. Low margin à poor classification performance
2. Large number of support vectorsà Slows down the
computation