Module 01 2
Module 01 2
Nicola Gnecco
Module 3 :
Clustering
Unsupervised learning
supervised learning
>
-
ML
-
> unsupervised learning
Unsupervised learning
F
nx
P
PCA: some intuition
>
-
Xi =
(ineR
I Let xi œ Rp , i = 1, . . . , n be samples of some centered p-dimensional random vector X œ Rp .
I Principal component analysis or PCA aims at finding q Æ p directions w1ú , . . . , wqú œ Rp , called
principal components, such that the data can be well represented in the lower-dimensional subspace
spanned by these directions: particular each observation
for
↑ observations
for all
wi ..., wat common
Example
R [p
:
200
million
= 55200 =
p
; a
/200x5
+
+
1 t 5
zos
compression
I
2
I I
-
-
(
200
= =
-
I ↑
Ix I t 5
W
-
225 40
·
WI + 2220 We +... + =
=
200X2ZCI
⑨
I I I
:
=> -
-
: I I
I *
*+ Inzow2+... + Zag Wa & 5
100 +
Xn
In Wi I I
-
I
M = 5M
200 X /M = 200 200 x5 + 5x/M =
5M + 1 , 000
=
PXn pXq + gxn qXu
Application 1: Data compression
400
92X114
p
=
92
=
10 , 304
112
↓
↓
30m
304
10
,
10
,
1
x - , XzE
,
10, 304
I Each image is represented as a 02X12
20 ◊ 20 matrix of pixels, which can be unfolded into a 400 ◊ 1 vector.
-
I With PCA, we can compress each image and represent it as a linear combination of only q = 16 basis
images (also known as principal components).
I
Xi 2:
1
:
w , + zi2 .
I
We + .. . + zi
who
I I I
I Y-
10 , 304
Application 1: Data compression
-
Wis
Application 1: Data compression
I Here are the resulting compressed (i.e., approximated) pictures for different number of basis images
used, q = 2, 5, 10, 30, 100, 200.
Application 1: Data compression
# coordinates
2X6
5X6
10 X6
200 X G
Application 2: Data interpretation
I Consider another problem. We take the US interest rate curves (also known as the yield curves) from
September 2009 to May 2015, on a daily basis.
Application 2: Data interpretation
I Consider another problem. We take the US interest rate curves (also known as the yield curves) from
September 2009 to May 2015, on a daily basis.
I On each of the 1480 days, we observe the snapshot of the yield curve on that day, with interest rates
that span from 3 months to 30 years maturity.
Application 2: Data interpretation
I Consider another problem. We take the US interest rate curves (also known as the yield curves) from
September 2009 to May 2015, on a daily basis.
I On each of the 1480 days, we observe the snapshot of the yield curve on that day, with interest rates
that span from 3 months to 30 years maturity.
I Each observation is a curve on a given day, made of 16 maturities, i.e., a 16 ◊ 1 vector.
I To understand the dynamics of the intest rates we plot all the curves (one per day) on the same graph.
Application 2: Data interpretation
·
I
* it Zi , :
Wi
I
+ Ziz W2
:
+ 213
·
Wi
Application 2: Data interpretation
I Interestingly, we can explain the curve moves as a linear combination of the first three principal
components (PC).
I In this case, these principal components have a nice interpretation.
I The first PC represents the level of curve, i.e., all the maturities move up (or down) in parallel.
I The second PC represents the slope of the curve, i.e., long-term maturities move more (or less)
compared to the short-term maturities.
I The third PC represents the curvature of the yield curve, i.e., mid-term maturities move more (or less)
compared to the long-term and short-term maturities.
Application 2: Data interpretation
average slope curvature
-
>
Wi ↑ Ziz
:
W2 + Ziz
·
Wa
Xi
&
F *
= w/
=
w2
W3
=
Application 2: Data interpretation
I The approximated curves, look very similar to the original ones.
Principal component analysis (PCA): Motivation
10/ >
-
100
Principal component analysis (PCA): Motivation
=
-
-
x2
X
• • •
• •
x1
-
•
•
•
•
Principal component analysis (PCA): Motivation
x2 Xin Zi1 ·
We
Wh
• • •
•
.
•
• •
x1
•
•
Principal component analysis (PCA): Motivation
I In this situation, we can describe the dataset quite well by just looking at their coefficient along the gray
line.
Principal component analysis (PCA): Motivation
x2
• • •
• •
x1
•
•
•
•
Principal component analysis (PCA): Motivation
x2
•
•
⑭ •
•
•
x1
•
•
•
•
Principal component analysis (PCA): Goal
The goal is to
Principal component analysis (PCA): Goal
The goal is to
I Find the directions (vectors) that ‘approximate’ best the data. W,
Principal component analysis (PCA): Goal
The goal is to
I Find the directions (vectors) that ‘approximate’ best the data.
I Find the coordinates of each data point with respect to these directions. 2 i ,
i = 1,
. . .,
n .
Principal component analysis (PCA): Goal
The goal is to
I Find the directions (vectors) that ‘approximate’ best the data.
I Find the coordinates of each data point with respect to these directions.
Obviously, some directions are better than others to approximate the dataset.
Principal component analysis (PCA): Goal
The goal is to
I Find the directions (vectors) that ‘approximate’ best the data.
I Find the coordinates of each data point with respect to these directions.
Obviously, some directions are better than others to approximate the dataset. Consider the two alternatives.
Principal component analysis (PCA): Goal
The goal is to
I Find the directions (vectors) that ‘approximate’ best the data.
I Find the coordinates of each data point with respect to these directions.
Obviously, some directions are better than others to approximate the dataset. Consider the two alternatives.
x2
• • •
• Tw
•
,
x1
•
•
•
•
Principal component analysis (PCA): Goal
The goal is to
I Find the directions (vectors) that ‘approximate’ best the data.
I Find the coordinates of each data point with respect to these directions.
Obviously, some directions are better than others to approximate the dataset. Consider the two alternatives.
x2
?
•
• • •
• •
h x1
:
•
•
•
Principal component analysis (PCA): Goal
The goal is to
I Find the directions (vectors) that ‘approximate’ best the data.
I Find the coordinates of each data point with respect to these directions.
Obviously, some directions are better than others to approximate the dataset. Consider the two alternatives.
x2
• •
"
•
• •
x1
•
•
•
•
Principal component analysis (PCA): Goal
The goal is to
I Find the directions (vectors) that ‘approximate’ best the data. > w -
I Find the coordinates of each data point with respect to these directions. > zi i -
,
n = 1
, ...,
.
Obviously, some directions are better than others to approximate the dataset. Consider the two alternatives.
x2
w,
sir
• • •
• •
x1
•
•
•
•
How to approximate a data point
x= zw
¸˚˙˝ + (x ≠ zw ),
¸ ˚˙ ˝
approximation error
where z œ R. The goal is to choose z to make the approximation as good as possible, or, equivalently,
to make the error as small as possible.
How to approximate a data point
x2
x1
How to approximate a data point
x2
x1
How to approximate a data point
x2
w
x
x1
How to approximate a data point
x2
zw + (X -
zw)
X =
-n
error
perior
x
zw
x1
How to approximate a data point
x2
w
x error
zw - Y
x1
How to approximate a data point
x2
w
error
x
zw =
x1
How to approximate a data point
w T (x ≠ zw ) = 0,
= un
w T (x ≠ zw ) = 0,
w T x ≠ zw T w = 0 ≈∆ z = (w T w )≠1 w T x .
How to approximate a data point
w T (x ≠ zw ) = 0,
I To ensure uniqueness of the solution, we impose that w has unit length, i.e., w T w = 1.
How to approximate a data point
w T (x ≠ zw ) = 0,
w T x ≠ zw T w = 0 ≈∆ z = (w T w )≠1 w T x .
I To ensure uniqueness of the solution, we impose that w has unit length, i.e., w T w = 1.
I Therefore, for any w , the optimal z has form
I z = wT x.
The PCA formulation
I Consider now a dataset x1 , . . . , xn , where x œ Rp and where we assume that each observation xi is
mean-centered, i = 1, . . . , n.
The PCA formulation
I Consider now a dataset x1 , . . . , xn , where x œ Rp and where we assume that each observation xi is
mean-centered, i = 1, . . . , n.
I The goal is to find the vector w and the coordinates zi = w T xi that minimize the average squared
length of the errors (xi ≠ zi w ), i = 1, . . . , n, under the condition w T w = 1.
-
W
2
min i-ziwitin)
min xi-ziw =
w
W
WTW = I St win = 1
S t
.
.
.
The PCA formulation
I Consider now a dataset x1 , . . . , xn , where x œ Rp and where we assume that each observation xi is
mean-centered, i = 1, . . . , n.
I The goal is to find the vector w and the coordinates zi = w T xi that minimize the average squared
length of the errors (xi ≠ zi w ), i = 1, . . . , n, under the condition w T w = 1.
I Formally, this can be written as ared length error
1ÿ
· n
approximation n
1ÿ
= arg min (xi ≠ ww T xi )T (xi ≠ ww T xi )
error w n
i=1
A n
B
T 1ÿ
= arg max w xi xiT w
w n
i=1
maximining - = arg max w T Sw ,
variance w
where S is the sample covariance matrix when the observations xi are centered.
The PCA formulation zi wi xi
.
>
=
- i
2 . w = w . zi = wwixi
-Zin
(xi-ZimT(X :
any min
(xi-wwixi wwixi
arymin t
-
1
=
=
argmin xii-xiwwixi-xiwww -
EIR
=
arymin-vitwixi arymax wit =
-
↑
=
argmaxwixix =
amax wiXT) --
w
S
Intermezzo: Eigenvalue decomposition of a symmetric matrix S
E
wiS w
I Consider the eigenvalue decomposition of S = V V T . max
any w
I is the diagonal matrix containing the eigenvalues ⁄1 , . . . , ⁄p of S.
. t WTW
= 1
S
I V is the orthonormal matrix containing the eigenvectors v1 , . . . , vp of S.
.
S T
⁄1 ... 0 C
| |
D
W
=U .. X
V = v1 vp .
. V, ...
0 ... ⁄p | |
=
in the basis of V , up
>
- w = V b .
...,
Viw V= V . b = b
b
=
Y Il
VTw
= V"w =
b
The PCA solution
...t
bixb
max
=
wiUNVTw c
max max
S t
.
.
w
+W = 1 I St . 1 = wiw =
birTVb
-
JP
= bb
The PCA solution
w T w = 1 ≈∆ b T V T Vb = b T b = 1.
The PCA solution
w T w = 1 ≈∆ b T V T Vb = b T b = 1.
I The problem to solve is 1 , =12 ...
=Xp70 .
p
ÿ
max b T b = max bj2 ⁄j , such that b T b = 1.
b b
j=1
↑
b, +
2
+ b2 +z +... + b +p
11 If
Il
O O
I
The PCA solution
w T w = 1 ≈∆ b T V T Vb = b T b = 1.
I The problem to solve is
p
ÿ
max b T b = max bj2 ⁄j , such that b T b = 1.
b b
j=1
T
I Since b b = 1 each entry of b must satisfy bj Æ 1, j = 1, . . . , p. Recalling that ⁄1 Ø · · · Ø ⁄p , the best
strategy is to set the entry
;
1 if j = 1,
bj =
0 else.
The PCA solution
I The direction that approximates best the dataset, known as the first principal component, is the
eigenvector of S corresponding to the greatest eigenvalue.
The PCA solution
I The direction that approximates best the dataset, known as the first principal component, is the
eigenvector of S corresponding to the greatest eigenvalue.
I The optimal coordinate, also known as ‘score’, of each data point with respect to the first principal
component is
zi = w1úT xi = v1T xi , i = 1, . . . , n.
Discussion
where i = 1, . . . , n.
I In principle, we could make the approximation error even smaller by considering more than one direction.
For example, we could find q < p directions so that we could write each observation as
I Il
100 400 x = z1 w1ú + · · · + zq wqú + error.
¸ ˚˙ ˝
-
approximation
General solution
-
such that w2T w2 = 1 and w1úT w2 = 0.
General solution
wz = V :
b >
=
b = u Twe
General solution
I
v1T w2 = 0 ≈∆ v1T Vb = v1T (v1 b1 + · · · + vp bp ) = 0 ≈∆ b1 = 0.
I
I I
E
( =) v ,Tv , b 1 + v , vz b2 +.. + V
Tp bp
,
= 0
~ mu Lu
I O O
General solution
p
ÿ
max b T b = max bj2 ⁄j , such that b T b = 1 and b1 = 0.
b b
j=1
"
& b2 +
⑮ + 2 +... +
by
II
↓ I
General solution
]
;
1 if j = 2,
bj =
0 else.
>
-
b =
I The remaining principal components w3ú , . . . , wqú can then be computed in an analogous way.
General solution
I The remaining principal components w3ú , . . . , wqú can then be computed in an analogous way.
I We have approximated each point xi œ Rp , i = 1, . . . , n, using a q-dimensional subspace, i.e.,
w1ú , . . . , wqú . In particular
General solution
I The remaining principal components w3ú , . . . , wqú can then be computed in an analogous way.
I We have approximated each point xi œ Rp , i = 1, . . . , n, using a q-dimensional subspace, i.e.,
w1ú , . . . , wqú . In particular
= v1 v1T xi + · · · + vq vqT xi ,
O particulartoeservation
W
-
-
&
zig 5
-Of
where i = 1, . . . , n.
wa
*
wi
↑ Zil
to all observations
evectorsoft
common
P V
Vo
:
, ...,
TX:
Coordinate
-
:
Zij
=
Xitvj = V
;
PCA via Singular value decomposition (SVD)
X ER**P assume has columns with tero mean
.
,
I We can do PCA using an alternative technique called SVD.
matrix
eigen decomposition of
with covariance .
PCA
*
X
Compute covarious S = nX .
[Y ...]
·
S v
of
=
evectors
Compute
·
,
p(1
· = V
,...., PCq =
vq
i = 1,
Xivq ..., n .
· Coordinates for Xi :
zis = Xiv , ...,
Zig
= ,
PCA via Singular value decomposition (SVD)
S T
| | SD11 ...
T
0 S T
| XW . ≠ v1T ≠
W| .. .. X
W X .. . . XW .. X
X = Wu1 ... un X W
U VU . V.
U| |V 0 ... Dpp
≠ vpT ≠
| | 0
UœRn◊n DœRn◊p V T œRp◊p
PCA via Singular value decomposition (SVD)
S T
| | SD11 ...
T
0 S T
| XW . ≠ v1T ≠
W| .. .. X
W X .. . . XW .. X
X = Wu1 ... un X W
U VU . V.
U| |V 0 ... Dpp
≠ vpT ≠
| | 0
UœRn◊n DœRn◊p V T œRp◊p
S T
| | SD11 ...
T
0 S T
| XW . ≠ v1T ≠
W| .. .. X
W X .. . . XW .. X
X = Wu1 ... un X W
U VU . V.
U| |V 0 ... Dpp
vpT
nXp | | 0
≠ ≠
S TS - -
Ap U
first e-vector of
S
T ↓
↓0 S
| | D11 ... T
| XW . ≠ v1T ≠
W| .. .. X
W X .. . . XW .. X
X = Wu1 ... un X W
U VU . V.
U| |V 0 ... Dpp
≠ vpT ≠ - pith e-vector
| | 0
of S
UœRn◊n DœRn◊p V T œRp◊p
xi ¥ zi1 v1 + · · · + ziq vq , i = 1, . . . , n.
- - -
PCA via Singular value decomposition (SVD)
xi ¥ zi1 v1 + · · · + ziq vq , i = 1, . . . , n.
I We can write this in matrix notation.
PCA via Singular value decomposition (SVD) > -
X = UDVF
xi ¥ zi1 v1 + · · · + ziq vq , i = 1, . . . , n.
I We can write this in matrix notation. ~ ~
A H
S *T
·
T TS TS T
≠ x1 ≠
- - z11 ... z1q ≠ v1T ≠
W
U
..
.
X W
V¥ U
..
.
XW
VU
..
.
X
V =
≠ xnT ≠ zn1 ... znq ≠ vqT ≠
~ -
X œRn◊p AœRn◊q HœRq◊p
X,
z
,1.
viT + 2 ,2 Vat +... + 2, g-vaTI
:
xi ¥ zi1 v1 + · · · + ziq vq , i = 1, . . . , n.
I We can write this in matrix notation.
S T S TS T
≠ x1T ≠ z11 ... z1q ≠ v1T ≠
W .. X W .. XW .. X
U . V¥ U . VU . V
≠ xnT ≠ zn1 ... znq ≠ vqT ≠
.....
X œRn◊p AœRn◊q HœRq◊p
I Entries in A are often called scores. Each of the n rows of A can be thought of as containing the
“activity-coefficients” of the basis vectors in H.
PCA via Singular value decomposition (SVD)
xi ¥ zi1 v1 + · · · + ziq vq , i = 1, . . . , n.
I We can write this in matrix notation.
S T S TS T
≠ x1T ≠ z11 ... z1q ≠ v1T ≠
W .. X W .. XW .. X
U . V¥ U . VU . V
≠ xnT ≠ zn1 ... znq ≠ vqT ≠
~ N
X œRn◊p AœRn◊q HœRq◊p
~
I Entries in A are often called scores. Each of the n rows of A can be thought of as containing the
“activity-coefficients” of the basis vectors in H.
~
- -
I Entries in H are often called loadings. Each of the q rows of H contains the corresponding eigenvector
v1 , . . . , vq of S.
PCA via Singular value decomposition (SVD)
- -
X = UDV T .
PCA via Singular value decomposition (SVD)
"I P
- vi
...
D O
· !
T
:
"Dpp T
Y/ ........ Un N up-
o .. O -
I :... ·
X e D vi
PCA via Singular value decomposition (SVD) Recall : Can write PCA
E
2
X = A .
da
I We can find the matrices A and H as follows. I I
I Compute the SVD of X , to obtain nX9 qXP
X = UDV T .
15
·i
vi
/
1
/
-
/X
:
/1 -
: vo
#
- -
P
p
- -
- /Il
LL
- I
- -
=
I -
up
.... N - -
-
O
-~
....
I -· - ---
N ~
X e D
-
PCA via Singular value decomposition (SVD)
X = UDV T .
X = UDV T .
VT
Alternative UD, H
F
Y H
=
: ·
Set A =
AERPER,
,
·
PCA via Singular value decomposition (SVD)
X = UDV T .
·
(
scores loadings
How good is the PCA approximation?
xi = zi1 v1 + · · · + ziq vq ,
 i = 1, . . . , n.
How good is the PCA approximation?
xi = zi1 v1 + · · · + ziq vq ,
 i = 1, . . . , n.
xi = zi1 v1 + · · · + ziq vq ,
 i = 1, . . . , n.
xi = Â
xi + error = zi1 v1 + · · · + zip vp , i = 1, . . . , n.
How good is the PCA approximation?
xi = zi1 v1 + · · · + ziq vq ,
 i = 1, . . . , n.
xi = Â
xi + error = zi1 v1 + · · · + zip vp , i = 1, . . . , n.
I We can compute the quality of the PCA as the ratio between the average (squared) length of the
approximated observations and the average (squared) length of the non-approximated observations. In
formulas,
I
· -
· ↓(x Y
-
,
n (x ,
, +...
X, +... +
+ knYn)
XnTXu)
Eto 1]
,
How good is the PCA approximation?
xi = zi1 v1 + · · · + ziq vq ,
 i = 1, . . . , n.
xi = Â
xi + error = zi1 v1 + · · · + zip vp , i = 1, . . . , n.
I We can compute the quality of the PCA as the ratio between the average (squared) length of the
approximated observations and the average (squared) length of the non-approximated observations. In
formulas,
of
is
Sa =
1/n
qn T
1/n i=1 Â
x Â
xi
qn iT = 1
i=1
⁄ + · · · + ⁄q
xi xi
.
⁄1 + · · · + ⁄p
zeigenvalues
x +... + +q
-
-
=
Xp
Sa
x+ .. +
& &
- ...............
I
-
0 8
!
..........
=>
.
elbow
!
_
" O
⑧
·I
!#
-----
⑧
* PC ↑ A I
# PC
8
9 3 22 34
I As a consequence, PCA applied to the covariance matrix is different from PCA applied to the correlation
matrix.