0% found this document useful (0 votes)
12 views114 pages

Module 01 2

The document outlines a training module on Unsupervised Learning, focusing on Principal Component Analysis (PCA) and its applications such as data compression and interpretation. It explains the differences between supervised and unsupervised learning, emphasizing exploratory data analysis and dimensionality reduction. The document also discusses the motivation behind PCA and its goal to find the best directions for approximating data points.

Uploaded by

Tesfaye Abera
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
12 views114 pages

Module 01 2

The document outlines a training module on Unsupervised Learning, focusing on Principal Component Analysis (PCA) and its applications such as data compression and interpretation. It explains the differences between supervised and unsupervised learning, emphasizing exploratory data analysis and dimensionality reduction. The document also discusses the motivation behind PCA and its goal to find the best directions for approximating data points.

Uploaded by

Tesfaye Abera
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Unsupervised Learning Training

Module 1: Principal Component Analysis

Nicola Gnecco

18–20 June, 2024

Ethiopian Statistical Association


Module 2 :
Principal component analysis
Autoencoders and Variational Autoencoders
Module 2 :

Module 3 :
Clustering
Unsupervised learning

I In supervised learning we are given features/predictors X1 , . . . , Xp , and a response Y . The goal is to


predict Y using X1 , . . . , Xp
I Example: Linear regression, logistic regression, etc.

supervised learning
>
-
ML
-
> unsupervised learning
Unsupervised learning

I In supervised learning we are given features/predictors X1 , . . . , Xp , and a response Y . The goal is to


predict Y using X1 , . . . , Xp
I Example: Linear regression, logistic regression, etc.
I In unsupervised learning we develop a set of statistical methods for the setting in which we have
only a set of features X1 , . . . , Xp , but no response.
I The goal is to do exploratory data analysis and to discover interesting patterns about the measurements
on X1 , . . . , Xp .
I Is there an informative way to visualize the data?
> Clustering
I Can we discover subgroups among the variables or among the observations? -

I Can we reduce the dimensionality?


- PCA
Variational Auto encoders
Auto encoders ,
Unsupervised learning

I In supervised learning we are given features/predictors X1 , . . . , Xp , and a response Y . The goal is to


predict Y using X1 , . . . , Xp
I Example: Linear regression, logistic regression, etc.
I In unsupervised learning we develop a set of statistical methods for the setting in which we have
only a set of features X1 , . . . , Xp , but no response.
I The goal is to do exploratory data analysis and to discover interesting patterns about the measurements
on X1 , . . . , Xp .
I Is there an informative way to visualize the data?
I Can we discover subgroups among the variables or among the observations?
I Can we reduce the dimensionality?
I In this lecture we will discuss one of the most important tools for dimension reduction: principal
component analysis.
PCA: some intuition

I Let xi œ Rp , i = 1, . . . , n be samples of some centered p-dimensional random vector X œ Rp .

F
nx
P
PCA: some intuition

>
-
Xi =

(ineR
I Let xi œ Rp , i = 1, . . . , n be samples of some centered p-dimensional random vector X œ Rp .
I Principal component analysis or PCA aims at finding q Æ p directions w1ú , . . . , wqú œ Rp , called
principal components, such that the data can be well represented in the lower-dimensional subspace
spanned by these directions: particular each observation
for

for some coefficients zi1 , . . . , ziq œ R.


1
xi ¥ zi1 w1ú + · · · + ziq wqú
I
-

↑ observations
for all
wi ..., wat common
Example

R [p
:

200
million
= 55200 =
p
; a

/200x5
+
+
1 t 5
zos
compression

I
2

I I
-
-

(
200
= =
-

I ↑
Ix I t 5
W
-

225 40
·

WI + 2220 We +... + =
=

200X2ZCI

I I I
:
=> -
-

: I I
I *
*+ Inzow2+... + Zag Wa & 5
100 +
Xn
In Wi I I
-

I
M = 5M
200 X /M = 200 200 x5 + 5x/M =
5M + 1 , 000

=
PXn pXq + gxn qXu
Application 1: Data compression

I One common application of PCA is image compression.


Application 1: Data compression

I One common application of PCA is image compression.


I We can try to reduce the dimensionality of complex images.
Application 1: Data compression

I One common application of PCA is image compression.


I We can try to reduce the dimensionality of complex images. N
Il
I Consider the following pictures taken from a dataset of more than 10000
- faces.

400

92X114
p
=
92
=
10 , 304

112



30m
304
10
,

10
,
1
x - , XzE
,

400 X 10, 304


Original data : need to save nx
p =
Application 1: Data compression

92x112 10, 304


I Each image is represented as a 20
- ◊ 20 matrix of pixels, which can be unfolded into a S
400 ◊ 1 vector.
Application 1: Data compression

10, 304
I Each image is represented as a 02X12
20 ◊ 20 matrix of pixels, which can be unfolded into a 400 ◊ 1 vector.
-

I With PCA, we can compress each image and represent it as a linear combination of only q = 16 basis
images (also known as principal components).

I
Xi 2:
1
:
w , + zi2 .
I
We + .. . + zi
who
I I I
I Y-
10 , 304
Application 1: Data compression

92x/12 = 10, 304 10 ,304


I Each image is represented as a -
20 ◊ -
20 matrix of pixels, which can be unfolded into a 400
- ◊ 1 vector.
I With PCA, we can compress each image and represent it as a linear combination of only q = 16 basis
images (also known as principal components).
-We
W,
x

-
Wis
Application 1: Data compression

I Here are the resulting compressed (i.e., approximated) pictures for different number of basis images
used, q = 2, 5, 10, 30, 100, 200.
Application 1: Data compression
# coordinates

2X6

5X6

10 X6

200 X G
Application 2: Data interpretation

I Consider another problem. We take the US interest rate curves (also known as the yield curves) from
September 2009 to May 2015, on a daily basis.
Application 2: Data interpretation

I Consider another problem. We take the US interest rate curves (also known as the yield curves) from
September 2009 to May 2015, on a daily basis.
I On each of the 1480 days, we observe the snapshot of the yield curve on that day, with interest rates
that span from 3 months to 30 years maturity.
Application 2: Data interpretation

I Consider another problem. We take the US interest rate curves (also known as the yield curves) from
September 2009 to May 2015, on a daily basis.
I On each of the 1480 days, we observe the snapshot of the yield curve on that day, with interest rates
that span from 3 months to 30 years maturity.
I Each observation is a curve on a given day, made of 16 maturities, i.e., a 16 ◊ 1 vector.
I To understand the dynamics of the intest rates we plot all the curves (one per day) on the same graph.
Application 2: Data interpretation

·
I
* it Zi , :
Wi
I
+ Ziz W2
:

+ 213
·
Wi
Application 2: Data interpretation

I Interestingly, we can explain the curve moves as a linear combination of the first three principal
components (PC).
I In this case, these principal components have a nice interpretation.
I The first PC represents the level of curve, i.e., all the maturities move up (or down) in parallel.
I The second PC represents the slope of the curve, i.e., long-term maturities move more (or less)
compared to the short-term maturities.
I The third PC represents the curvature of the yield curve, i.e., mid-term maturities move more (or less)
compared to the long-term and short-term maturities.
Application 2: Data interpretation
average slope curvature
-
>
Wi ↑ Ziz
:
W2 + Ziz
·
Wa
Xi
&

F *

= w/
=
w2
W3
=
Application 2: Data interpretation
I The approximated curves, look very similar to the original ones.
Principal component analysis (PCA): Motivation

I Consider a dataset x1 , . . . , xn of n iid copies of X œ Rp , where we assume that EX = 0.


Principal component analysis (PCA): Motivation

I Consider a dataset x1 , . . . , xn of n iid copies of X œ Rp , where we assume that EX = 0.


I Usually, the data does not ‘fill up’ the whole p-dimensional space.

10/ >
-

images live in lower dimensional space

100
Principal component analysis (PCA): Motivation

I Consider a dataset x1 , . . . , xn of n iid copies of X œ Rp , where we assume that EX = 0.


I Usually, the data does not ‘fill up’ the whole p-dimensional space.
Example: Consider a two-dimensional dataset, xi œ R2 , i = 1, . . . , n, where most of the observations lie close
to a line (a one-dimensional subspace of R2 ).
Principal component analysis (PCA): Motivation

=
-

-
x2
X

• • •

• •
x1
-





Principal component analysis (PCA): Motivation

x2 Xin Zi1 ·
We

Wh

• • •


.

• •
x1



Principal component analysis (PCA): Motivation

I In this situation, we can describe the dataset quite well by just looking at their coefficient along the gray
line.
Principal component analysis (PCA): Motivation

x2

• • •

• •
x1





Principal component analysis (PCA): Motivation

x2


⑭ •


x1





Principal component analysis (PCA): Goal
The goal is to
Principal component analysis (PCA): Goal
The goal is to
I Find the directions (vectors) that ‘approximate’ best the data. W,
Principal component analysis (PCA): Goal
The goal is to
I Find the directions (vectors) that ‘approximate’ best the data.
I Find the coordinates of each data point with respect to these directions. 2 i ,
i = 1,
. . .,
n .
Principal component analysis (PCA): Goal
The goal is to
I Find the directions (vectors) that ‘approximate’ best the data.
I Find the coordinates of each data point with respect to these directions.
Obviously, some directions are better than others to approximate the dataset.
Principal component analysis (PCA): Goal
The goal is to
I Find the directions (vectors) that ‘approximate’ best the data.
I Find the coordinates of each data point with respect to these directions.
Obviously, some directions are better than others to approximate the dataset. Consider the two alternatives.
Principal component analysis (PCA): Goal
The goal is to
I Find the directions (vectors) that ‘approximate’ best the data.
I Find the coordinates of each data point with respect to these directions.
Obviously, some directions are better than others to approximate the dataset. Consider the two alternatives.

x2

• • •

• Tw

,

x1





Principal component analysis (PCA): Goal
The goal is to
I Find the directions (vectors) that ‘approximate’ best the data.
I Find the coordinates of each data point with respect to these directions.
Obviously, some directions are better than others to approximate the dataset. Consider the two alternatives.

x2

?

• • •

• •
h x1

:


Principal component analysis (PCA): Goal
The goal is to
I Find the directions (vectors) that ‘approximate’ best the data.
I Find the coordinates of each data point with respect to these directions.
Obviously, some directions are better than others to approximate the dataset. Consider the two alternatives.

x2

• •

"

• •
x1





Principal component analysis (PCA): Goal
The goal is to
I Find the directions (vectors) that ‘approximate’ best the data. > w -

I Find the coordinates of each data point with respect to these directions. > zi i -
,
n = 1
, ...,
.

Obviously, some directions are better than others to approximate the dataset. Consider the two alternatives.

x2

w,

sir
• • •

• •
x1





How to approximate a data point

I Consider a data point (observation) x . Given a vector w we can always write

x= zw
¸˚˙˝ + (x ≠ zw ),
¸ ˚˙ ˝
approximation error

where z œ R. The goal is to choose z to make the approximation as good as possible, or, equivalently,
to make the error as small as possible.
How to approximate a data point

x2

x1
How to approximate a data point

x2

x1
How to approximate a data point

x2

w
x

x1
How to approximate a data point

x2
zw + (X -
zw)
X =

-n
error

perior
x

zw

x1
How to approximate a data point

x2

w
x error

zw - Y

x1
How to approximate a data point

x2

w
error
x

zw =

x1
How to approximate a data point

I The best approximation is achieved when the error is perpendicular to w . We impose

w T (x ≠ zw ) = 0,
= un

and solve for z.


error
How to approximate a data point

I The best approximation is achieved when the error is perpendicular to w . We impose

w T (x ≠ zw ) = 0,

and solve for z.


I The optimal z is given by

w T x ≠ zw T w = 0 ≈∆ z = (w T w )≠1 w T x .
How to approximate a data point

I The best approximation is achieved when the error is perpendicular to w . We impose

w T (x ≠ zw ) = 0,

and solve for z.


I The optimal z is given by 2
M
w x ≠ zw w = 0 ≈∆ z = (w T w )≠1 w T x .
T T

I To ensure uniqueness of the solution, we impose that w has unit length, i.e., w T w = 1.
How to approximate a data point

I The best approximation is achieved when the error is perpendicular to w . We impose

w T (x ≠ zw ) = 0,

and solve for z.


I The optimal z is given by

w T x ≠ zw T w = 0 ≈∆ z = (w T w )≠1 w T x .
I To ensure uniqueness of the solution, we impose that w has unit length, i.e., w T w = 1.
I Therefore, for any w , the optimal z has form

I z = wT x.
The PCA formulation

I Consider now a dataset x1 , . . . , xn , where x œ Rp and where we assume that each observation xi is
mean-centered, i = 1, . . . , n.
The PCA formulation

I Consider now a dataset x1 , . . . , xn , where x œ Rp and where we assume that each observation xi is
mean-centered, i = 1, . . . , n.
I The goal is to find the vector w and the coordinates zi = w T xi that minimize the average squared
length of the errors (xi ≠ zi w ), i = 1, . . . , n, under the condition w T w = 1.
-

W
2

min i-ziwitin)
min xi-ziw =

w
W

WTW = I St win = 1
S t
.
.
.
The PCA formulation

I Consider now a dataset x1 , . . . , xn , where x œ Rp and where we assume that each observation xi is
mean-centered, i = 1, . . . , n.
I The goal is to find the vector w and the coordinates zi = w T xi that minimize the average squared
length of the errors (xi ≠ zi w ), i = 1, . . . , n, under the condition w T w = 1.
I Formally, this can be written as ared length error
1ÿ
· n

arg min (xi ≠ zi w )T (xi ≠ zi w )


minimizing >
-
w n
i=1

approximation n
1ÿ
= arg min (xi ≠ ww T xi )T (xi ≠ ww T xi )
error w n
i=1
A n
B
T 1ÿ
= arg max w xi xiT w
w n
i=1
maximining - = arg max w T Sw ,
variance w

where S is the sample covariance matrix when the observations xi are centered.
The PCA formulation zi wi xi
.

>
=

- i
2 . w = w . zi = wwixi
-Zin
(xi-ZimT(X :
any min
(xi-wwixi wwixi
arymin t
-

1
=

=
argmin xii-xiwwixi-xiwww -
EIR

=
arymin-vitwixi arymax wit =

-

=
argmaxwixix =

amax wiXT) --
w

S
Intermezzo: Eigenvalue decomposition of a symmetric matrix S

I Consider the eigenvalue decomposition of S = V V T .


Intermezzo: Eigenvalue decomposition of a symmetric matrix S

I Consider the eigenvalue decomposition of S = V V T .


I is the diagonal matrix containing the eigenvalues ⁄1 , . . . , ⁄p of S.
Intermezzo: Eigenvalue decomposition of a symmetric matrix S

I Consider the eigenvalue decomposition of S = V V T .


I is the diagonal matrix containing the eigenvalues ⁄1 , . . . , ⁄p of S.
I V is the orthonormal matrix containing the eigenvectors v1 , . . . , vp of S.
Intermezzo: Eigenvalue decomposition of a symmetric matrix S

I Consider the eigenvalue decomposition of S = V V T .


I is the diagonal matrix containing the eigenvalues ⁄1 , . . . , ⁄p of S.
I V is the orthonormal matrix containing the eigenvectors v1 , . . . , vp of S.
S T
⁄1 ... 0 C
| |
D
W
=U .. X
V = v1 vp .
. V, ...
0 ... ⁄p | |
Intermezzo: Eigenvalue decomposition of a symmetric matrix S

I Consider the eigenvalue decomposition of S = V V T .


I is the diagonal matrix containing the eigenvalues ⁄1 , . . . , ⁄p of S.
I V is the orthonormal matrix containing the eigenvectors v1 , . . . , vp of S.
S T
⁄1 ... 0 C
| |
D
W
=U .. X
V = v1 vp .
. V, ...
0 ... ⁄p | |

Remark 1: Since S is positive semi-definite, ⁄1 , . . . , ⁄p Ø 0.


Intermezzo: Eigenvalue decomposition of a symmetric matrix S

I Consider the eigenvalue decomposition of S = V V T .


I is the diagonal matrix containing the eigenvalues ⁄1 , . . . , ⁄p of S.
I V is the orthonormal matrix containing the eigenvectors v1 , . . . , vp of S.
S T
⁄1 ... 0 C
| |
D
W
=U .. X
V = v1 vp .
. V, ...
0 ... ⁄p | |

Remark 1: Since S is positive semi-definite, ⁄1 , . . . , ⁄p Ø 0.


Remark 2: Since V is orthonormal, viT vi = 1 and viT vj = 0, i, j = 1, . . . , p and i ”= j. From these properties,
it follows that V ≠1 = V T .
Intermezzo: Eigenvalue decomposition of a symmetric matrix S

E
wiS w
I Consider the eigenvalue decomposition of S = V V T . max
any w
I is the diagonal matrix containing the eigenvalues ⁄1 , . . . , ⁄p of S.
. t WTW
= 1
S
I V is the orthonormal matrix containing the eigenvectors v1 , . . . , vp of S.
.

S T
⁄1 ... 0 C
| |
D
W
=U .. X
V = v1 vp .
. V, ...
0 ... ⁄p | |

Remark 1: Since S is positive semi-definite, ⁄1 , . . . , ⁄p Ø 0.


Remark 2: Since V is orthonormal, viT vi = 1 and viT vj = 0, i, j = 1, . . . , p and i ”= j. From these properties,
it follows that V ≠1 = V T .
Remark 3: We assume, without loss of generality, that ⁄1 Ø · · · Ø ⁄p .
The PCA solution

I We want to maximize w T Sw = w T V V T w under the constraint w T w = 1.


*
coordinates of w
-

=
in the basis of V , up
>
- w = V b .
...,

Viw V= V . b = b
b
=

Y Il

VTw

= V"w =
b
The PCA solution

I We want to maximize w T Sw = w T V V T w under the constraint w T w = 1.


I We can define b = V T w , which is the representation of vector w in the basis v1 , . . . , vp .
V b
butt b
w =
.

...t
bixb
max
=

wiUNVTw c
max max

S t
.
.
w
+W = 1 I St . 1 = wiw =
birTVb
-

JP
= bb
The PCA solution

I We want to maximize w T Sw = w T V V T w under the constraint w T w = 1.


I We can define b = V T w , which is the representation of vector w in the basis v1 , . . . , vp .
I By writing w = Vb, the constraint becomes

w T w = 1 ≈∆ b T V T Vb = b T b = 1.
The PCA solution

I We want to maximize w T Sw = w T V V T w under the constraint w T w = 1.


I We can define b = V T w , which is the representation of vector w in the basis v1 , . . . , vp .
I By writing w = Vb, the constraint becomes

w T w = 1 ≈∆ b T V T Vb = b T b = 1.
I The problem to solve is 1 , =12 ...
=Xp70 .

p
ÿ
max b T b = max bj2 ⁄j , such that b T b = 1.
b b
j=1


b, +
2
+ b2 +z +... + b +p
11 If
Il

O O
I
The PCA solution

I We want to maximize w T Sw = w T V V T w under the constraint w T w = 1.


I We can define b = V T w , which is the representation of vector w in the basis v1 , . . . , vp .
I By writing w = Vb, the constraint becomes

w T w = 1 ≈∆ b T V T Vb = b T b = 1.
I The problem to solve is
p
ÿ
max b T b = max bj2 ⁄j , such that b T b = 1.
b b
j=1

T
I Since b b = 1 each entry of b must satisfy bj Æ 1, j = 1, . . . , p. Recalling that ⁄1 Ø · · · Ø ⁄p , the best
strategy is to set the entry
;
1 if j = 1,
bj =
0 else.
The PCA solution

I Therefore, the optimal b is


S T
1
W.X
b1ú = U .. V .
0
The PCA solution

I Therefore, the optimal b is


S T
1
W.X
b1ú = U .. V .
0

I Recalling that b = V T w , we have that w = Vb and so


S T
C D 1
| |
vp U ... V = v1 .
W X
w1ú = Vb1ú = v1 ...
| | 0
The PCA solution

I Therefore, the optimal b is


S T
1
W.X
b1ú = U .. V .
0

I Recalling that b = V T w , we have that w = Vb and so


S T
C D 1
| |
vp U ... V = v1 .
W X
w1ú = Vb1ú = v1 ...
| | 0

I The direction that approximates best the dataset, known as the first principal component, is the
eigenvector of S corresponding to the greatest eigenvalue.
The PCA solution

I Therefore, the optimal b is


S T
1
W.X
b1ú = U .. V .
0

I Recalling that b = V T w , we have that w = Vb and so


S T
C D 1
| |
vp U ... V = v1 .
W X
w1ú = Vb1ú = v1 ...
| | 0

I The direction that approximates best the dataset, known as the first principal component, is the
eigenvector of S corresponding to the greatest eigenvalue.
I The optimal coordinate, also known as ‘score’, of each data point with respect to the first principal
component is
zi = w1úT xi = v1T xi , i = 1, . . . , n.
Discussion

I So far, we have approximated each point xi œ Rp , i = 1, . . . , n, using only a one-dimensional subspace,


i.e., a single vector w . In particular, we have decomposed each observation as
w
* +X ;
xi = zi w1
O
ú
+ (xi ≠ zi w1ú ) = v1 v1T xi + (xi ≠ v1 v1T xi ),
¸˚˙˝ ¸ ˚˙ ˝
approximation
~
error wi
Xi
where i = 1, . . . , n.
Discussion

I So far, we have approximated each point xi œ Rp , i = 1, . . . , n, using only a one-dimensional subspace,


i.e., a single vector w . In particular, we have decomposed each observation as

xi = zi w1ú + (xi ≠ zi w1ú ) = v1 v1T xi + (xi ≠ v1 v1T xi ),


¸˚˙˝ ¸ ˚˙ ˝
approximation error

where i = 1, . . . , n.
I In principle, we could make the approximation error even smaller by considering more than one direction.
For example, we could find q < p directions so that we could write each observation as
I Il
100 400 x = z1 w1ú + · · · + zq wqú + error.
¸ ˚˙ ˝
-

approximation
General solution

I The first principal component remains w1ú = v1 .


General solution

I The first principal component remains w1ú


= v1 .
:
I The second principal component is the direction that minimizes the approximation error and is
orthogonal to the first principal component w1ú :

w2ú = arg max w2T Sw2 ,


w2

-
such that w2T w2 = 1 and w1úT w2 = 0.
General solution

I The first principal component remains w1ú = v1 .


I The second principal component is the direction that minimizes the approximation error and is
orthogonal to the first principal component w1ú :

w2ú = arg max w2T Sw2 ,


w2

such that w2T w2 = 1 and w1úT w2 = 0.


I As before, we can define, b = V T w2 , which is the representation of vector w2 into basis v1 , . . . , vp .

wz = V :
b >
=
b = u Twe
General solution

I The first principal component remains w1ú = v1 .


I The second principal component is the direction that minimizes the approximation error and is
orthogonal to the first principal component w1ú :

w2ú = arg max w2T Sw2 ,


w2

such that w2T w2 = 1 and w1úT w2 = 0.


I As before, we can define, b = V T w2 , which is the representation of vector w2 into basis v1 , . . . , vp .
I By writing w2 = Vb, the first constraint becomes b T b = 1. The second constraint, becomes
-

I
v1T w2 = 0 ≈∆ v1T Vb = v1T (v1 b1 + · · · + vp bp ) = 0 ≈∆ b1 = 0.
I

I I
E
( =) v ,Tv , b 1 + v , vz b2 +.. + V
Tp bp
,
= 0
~ mu Lu

I O O
General solution

I The problem to solve is


General solution

I The problem to solve is + 17 + 2 . . .


- +p. 0 .

p
ÿ
max b T b = max bj2 ⁄j , such that b T b = 1 and b1 = 0.
b b
j=1

"

& b2 +
⑮ + 2 +... +
by
II
↓ I
General solution

I The problem to solve is


p
ÿ
max b T b = max bj2 ⁄j , such that b T b = 1 and b1 = 0.
b b
j=1

I Recalling that ⁄1 Ø · · · Ø ⁄p , the best strategy is to set the entry


;
1 if j = 2,
bj =
0 else.
General solution

I The problem to solve is


p
ÿ
max b T b = max bj2 ⁄j , such that b T b = 1 and b1 = 0.
b b
j=1

I Recalling that ⁄1 Ø · · · Ø ⁄p , the best strategy is to set the entry

]
;
1 if j = 2,
bj =
0 else.
>
-
b =

I Therefore, the second principal component is w2ú = Vb2ú = v2 .


General solution

I The problem to solve is


p
ÿ
max b T b = max bj2 ⁄j , such that b T b = 1 and b1 = 0.
b b
j=1

I Recalling that ⁄1 Ø · · · Ø ⁄p , the best strategy is to set the entry


;
1 if j = 2,
bj =
0 else.

I Therefore, the second principal component is w2ú = Vb2ú = v2 .


I The optimal coordinate, also known as ‘score’, of each data point with respect to the second principal
component w2ú is
zi2 = w2úT xi = v2T xi ,
for i = 1, . . . , n.
General solution

I The remaining principal components w3ú , . . . , wqú can then be computed in an analogous way.
General solution

I The remaining principal components w3ú , . . . , wqú can then be computed in an analogous way.
I We have approximated each point xi œ Rp , i = 1, . . . , n, using a q-dimensional subspace, i.e.,
w1ú , . . . , wqú . In particular
General solution

I The remaining principal components w3ú , . . . , wqú can then be computed in an analogous way.
I We have approximated each point xi œ Rp , i = 1, . . . , n, using a q-dimensional subspace, i.e.,
w1ú , . . . , wqú . In particular

xi ¥ zi1 w1ú + · · · + ziq wqú


S

= v1 v1T xi + · · · + vq vqT xi ,
O particulartoeservation
W
-

-
&
zig 5
-Of
where i = 1, . . . , n.
wa
*
wi

↑ Zil

to all observations
evectorsoft
common
P V
Vo
:
, ...,

TX:
Coordinate
-
:
Zij
=
Xitvj = V
;
PCA via Singular value decomposition (SVD)
X ER**P assume has columns with tero mean
.
,
I We can do PCA using an alternative technique called SVD.

matrix
eigen decomposition of
with covariance .
PCA
*
X
Compute covarious S = nX .

[Y ...]
·

S v
of
=

evectors
Compute
·
,

p(1
· = V
,...., PCq =

vq
i = 1,
Xivq ..., n .

· Coordinates for Xi :
zis = Xiv , ...,
Zig
= ,
PCA via Singular value decomposition (SVD)

I We can do PCA using an alternative technique called SVD.


I Consider the dataset X œ Rn◊p .
PCA via Singular value decomposition (SVD)

I We can do PCA using an alternative technique called SVD.


I Consider the dataset X œ Rn◊p .
I It is always possible to decompose X as follows,

S T
| | SD11 ...
T
0 S T
| XW . ≠ v1T ≠
W| .. .. X
W X .. . . XW .. X
X = Wu1 ... un X W
U VU . V.
U| |V 0 ... Dpp
≠ vpT ≠
| | 0
UœRn◊n DœRn◊p V T œRp◊p
PCA via Singular value decomposition (SVD)

I We can do PCA using an alternative technique called SVD.


I Consider the dataset X œ Rn◊p .
I It is always possible to decompose X as follows,

S T
| | SD11 ...
T
0 S T
| XW . ≠ v1T ≠
W| .. .. X
W X .. . . XW .. X
X = Wu1 ... un X W
U VU . V.
U| |V 0 ... Dpp
≠ vpT ≠
| | 0
UœRn◊n DœRn◊p V T œRp◊p

Remark 1: U is an orthonormal matrix, i.e., U T U = In .


PCA via Singular value decomposition (SVD)

I We can do PCA using an alternative technique called SVD.


I Consider the dataset X œ Rn◊p .
I It is always possible to decompose X as follows,

S T
| | SD11 ...
T
0 S T
| XW . ≠ v1T ≠
W| .. .. X
W X .. . . XW .. X
X = Wu1 ... un X W
U VU . V.
U| |V 0 ... Dpp
vpT
nXp | | 0
≠ ≠

UœRn◊n DœRn◊p V T œRp◊p

Remark 1: U is an orthonormal matrix, i.e., U T U = In .


Remark 2: The diagonal
 entries of D are the square roots of the eigenvalues of the covariance matrix S. In
other words, Djj = n⁄jj , for j = 1, . . . , p.
-
-
PCA via Singular value decomposition (SVD)

I We can do PCA using an alternative technique called SVD.


I Consider the dataset X œ Rn◊p .
I It is always possible to decompose X as follows, iI - · U

S TS - -
Ap U
first e-vector of
S
T ↓
↓0 S
| | D11 ... T
| XW . ≠ v1T ≠
W| .. .. X
W X .. . . XW .. X
X = Wu1 ... un X W
U VU . V.
U| |V 0 ... Dpp
≠ vpT ≠ - pith e-vector
| | 0
of S
UœRn◊n DœRn◊p V T œRp◊p

Remark 1: U is an orthonormal matrix, i.e., U T U = In .


Remark 2: The diagonal
 entries of D are the square roots of the eigenvalues of the covariance matrix S. In
other words, Djj = n⁄jj⑪, for j = 1, . . . , p.
Remark 3: The rows of V T contain the eigenvectors of S. Recall that V T V = Ip .
PCA via Singular value decomposition (SVD)

I PCA approximates each observation, a p ◊ 1 vector, as follows.


PCA via Singular value decomposition (SVD)

I PCA approximates each observation, a p ◊ 1 vector, as follows.

xi ¥ zi1 v1 + · · · + ziq vq , i = 1, . . . , n.
- - -
PCA via Singular value decomposition (SVD)

I PCA approximates each observation, a p ◊ 1 vector, as follows.

xi ¥ zi1 v1 + · · · + ziq vq , i = 1, . . . , n.
I We can write this in matrix notation.
PCA via Singular value decomposition (SVD) > -
X = UDVF

I PCA approximates each observation, a p ◊ 1 vector, as follows.

xi ¥ zi1 v1 + · · · + ziq vq , i = 1, . . . , n.
I We can write this in matrix notation. ~ ~

A H
S *T
·

T TS TS T
≠ x1 ≠
- - z11 ... z1q ≠ v1T ≠
W
U
..
.
X W
V¥ U
..
.
XW
VU
..
.
X
V =
≠ xnT ≠ zn1 ... znq ≠ vqT ≠
~ -
X œRn◊p AœRn◊q HœRq◊p

X,
z
,1.
viT + 2 ,2 Vat +... + 2, g-vaTI
:

for all observations


PCA via Singular value decomposition (SVD)

I PCA approximates each observation, a p ◊ 1 vector, as follows.

xi ¥ zi1 v1 + · · · + ziq vq , i = 1, . . . , n.
I We can write this in matrix notation.
S T S TS T
≠ x1T ≠ z11 ... z1q ≠ v1T ≠
W .. X W .. XW .. X
U . V¥ U . VU . V
≠ xnT ≠ zn1 ... znq ≠ vqT ≠

.....
X œRn◊p AœRn◊q HœRq◊p

I Entries in A are often called scores. Each of the n rows of A can be thought of as containing the
“activity-coefficients” of the basis vectors in H.
PCA via Singular value decomposition (SVD)

I PCA approximates each observation, a p ◊ 1 vector, as follows.

xi ¥ zi1 v1 + · · · + ziq vq , i = 1, . . . , n.
I We can write this in matrix notation.
S T S TS T
≠ x1T ≠ z11 ... z1q ≠ v1T ≠
W .. X W .. XW .. X
U . V¥ U . VU . V
≠ xnT ≠ zn1 ... znq ≠ vqT ≠
~ N
X œRn◊p AœRn◊q HœRq◊p
~

I Entries in A are often called scores. Each of the n rows of A can be thought of as containing the
“activity-coefficients” of the basis vectors in H.
~

- -

I Entries in H are often called loadings. Each of the q rows of H contains the corresponding eigenvector
v1 , . . . , vq of S.
PCA via Singular value decomposition (SVD)

- -

I We can find the matrices A and H as follows.


PCA via Singular value decomposition (SVD)

I We can find the matrices A and H as follows.


I Compute the SVD of X , to obtain

X = UDV T .
PCA via Singular value decomposition (SVD)

I We can find the matrices A and H as follows.


I Compute the SVD of X , to obtain
A
X = UDV T .

I Truncate the matrices U, D and V T to retain only q < p dimensions.


P

"I P
- vi
...
D O

· !
T
:
"Dpp T

Y/ ........ Un N up-
o .. O -

I :... ·
X e D vi
PCA via Singular value decomposition (SVD) Recall : Can write PCA

E
2

X = A .

da
I We can find the matrices A and H as follows. I I
I Compute the SVD of X , to obtain nX9 qXP

X = UDV T .

I Truncate the matrices U, D and V T to retain only q < p dimensions.


P

15
·i
vi
/

1
/
-

/X
:

/1 -
: vo
#
- -

P
p
- -

- /Il
LL
- I
- -
=

I -
up
.... N - -

-
O

-~
....

I -· - ---
N ~
X e D
-
PCA via Singular value decomposition (SVD)

I We can find the matrices A and H as follows.


I Compute the SVD of X , to obtain

X = UDV T .

I Truncate the matrices U, D and V T to retain only q < p dimensions.


truncate
U œ Rn◊n ≠≠≠≠æ U
 œ Rn◊q
truncate
D œ Rn◊p ≠≠≠≠æ D
 œ Rq◊q
truncate
V T œ Rp◊p ≠≠≠≠æ V
 T œ Rq◊p
PCA via Singular value decomposition (SVD)

I We can find the matrices A and H as follows.


I Compute the SVD of X , to obtain

X = UDV T .

I Truncate the matrices U, D and V T to retain only q < p dimensions.


truncate
U œ Rn◊n ≠≠≠≠æ U
 œ Rn◊q
truncate
D œ Rn◊p ≠≠≠≠æ D
 œ Rq◊q
truncate
V T œ Rp◊p ≠≠≠≠æ V
 T œ Rq◊p
~ -
I Set A = U
ÂD ÂT.
 and H = V

VT
Alternative UD, H
F
Y H
=
: ·
Set A =

AERPER,
,

·
PCA via Singular value decomposition (SVD)

I We can find the matrices A and H as follows.


I Compute the SVD of X , to obtain

X = UDV T .

I Truncate the matrices U, D and V T to retain only q < p dimensions.


truncate
U œ Rn◊n ≠≠≠≠æ U
 œ Rn◊q
truncate
D œ Rn◊p ≠≠≠≠æ D
 œ Rq◊q
truncate
* P V T œ Rp◊p ≠≠≠≠æ V
 T œ Rq◊p
~ nxq -
I Set A = U
ÂD ÂT.
 and H = V
Take away: PCA is equivalent to compute the SVD of the dataset X , truncate it to q < p dimensions and
obtain the score and loading matrices A and H.
~

·
(
scores loadings
How good is the PCA approximation?

I PCA approximates each observation as


How good is the PCA approximation?

I PCA approximates each observation as

xi = zi1 v1 + · · · + ziq vq ,
 i = 1, . . . , n.
How good is the PCA approximation?

I PCA approximates each observation as

xi = zi1 v1 + · · · + ziq vq ,
 i = 1, . . . , n.

I On the other hand, each non-approximated observation can be written as


How good is the PCA approximation?

I PCA approximates each observation as

xi = zi1 v1 + · · · + ziq vq ,
 i = 1, . . . , n.

I On the other hand, each non-approximated observation can be written as

xi = Â
xi + error = zi1 v1 + · · · + zip vp , i = 1, . . . , n.
How good is the PCA approximation?

I PCA approximates each observation as

xi = zi1 v1 + · · · + ziq vq ,
 i = 1, . . . , n.

I On the other hand, each non-approximated observation can be written as

xi = Â
xi + error = zi1 v1 + · · · + zip vp , i = 1, . . . , n.

I We can compute the quality of the PCA as the ratio between the average (squared) length of the
approximated observations and the average (squared) length of the non-approximated observations. In
formulas,

I
· -
· ↓(x Y
-
,

n (x ,
, +...

X, +... +
+ knYn)

XnTXu)
Eto 1]
,
How good is the PCA approximation?

I PCA approximates each observation as

xi = zi1 v1 + · · · + ziq vq ,
 i = 1, . . . , n.

I On the other hand, each non-approximated observation can be written as

xi = Â
xi + error = zi1 v1 + · · · + zip vp , i = 1, . . . , n.

I We can compute the quality of the PCA as the ratio between the average (squared) length of the
approximated observations and the average (squared) length of the non-approximated observations. In
formulas,
of
is

Sa =

1/n
qn T
1/n i=1 Â
x Â
xi
qn iT = 1
i=1
⁄ + · · · + ⁄q
xi xi
.
⁄1 + · · · + ⁄p
zeigenvalues
x +... + +q

-
-

=
Xp
Sa
x+ .. +

& &
- ...............
I

-
0 8

!
..........
=>
.

elbow
!
_
" O

·I
!#
-----

* PC ↑ A I
# PC
8
9 3 22 34

elbow plot/scree plot


Some comments

I The PCA solution/approximation is not invariant under scaling of the data,


scale
X ≠≠æ cX , for some c œ R.
Some comments

I The PCA solution/approximation is not invariant under scaling of the data,


scale
X ≠≠æ cX , for some c œ R.

I As a consequence, PCA applied to the covariance matrix is different from PCA applied to the correlation
matrix.

You might also like