Principal component analysis (PCA)
lect5
Principal component analysis (PCA) reduces the number of dimensions in
large datasets to principal components that retain most of the original
information. It does this by transforming potentially correlated variables into
a smaller set of variables, called principal components.
Karl Pearson is credited with the development of PCA in 1901, but it gained
popularity with the increased availability of computers, which allowed for
multivariate statistical computations at scale. PCA is very effective for
visualizing and exploring high-dimensional datasets, or data with many
features, as it can easily identify trends, patterns, or outliers.
PCA is commonly used for data preprocessing for use with machine learning
algorithms. It can extract the most informative features from large datasets while
preserving the most relevant information from the initial dataset. This reduces
model complexity, which matters because adding features makes the data
increasingly sparse in the feature space and can degrade model performance, a
problem commonly referred to as the “curse of dimensionality.” By projecting a
high-dimensional dataset into a smaller feature
space, PCA also minimizes, or altogether eliminates, common issues such
as multicollinearity and overfitting. Multicollinearity occurs when two or more
independent variables are highly correlated with one another, which can be
problematic for causal modeling. Overfit models will generalize poorly to new
data, diminishing their value altogether. PCA is a commonly used approach
within regression analysis but it is also leveraged for a variety of use cases, such
as pattern recognition, signal processing, image processing, and more.
How principal component analysis works
• PCA summarizes the information content of large datasets into a smaller set of
uncorrelated variables known as principal components. These principal components
are linear combinations of the original variables that have the maximum variance
compared to other linear combinations. These components capture as much
information from the original dataset as possible.
• This statistical technique involves both linear algebra and matrix operations, and it
transforms the original dataset into a new coordinate system that is structured by the
principal components. The eigenvectors and eigenvalues from the covariance matrix
that underpin the principal components allow for the analysis of these linear
transformations.
• Imagine you have mapped out a dataset with multiple features, resulting in a
multidimensional scatterplot. Eigenvectors give the directions of variance in the
scatterplot. Each eigenvalue is the scalar associated with an eigenvector; it gives the
amount of variance along that direction and so denotes its importance. Therefore, a
high eigenvalue means that the corresponding eigenvector is more critical. Since
principal components represent the directions of maximum variance in the data, they
are also the eigenvectors of the covariance matrix.
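The relationship between eigenvectors, eigenvalues, and variance can be checked numerically. The following is a minimal NumPy sketch on synthetic data (the array `X`, the seed, and the mixing matrix are illustrative assumptions, not part of the lecture). It shows that the variance of the data projected onto the top eigenvector equals the largest eigenvalue, and that random unit directions do not exceed it.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability only
# Synthetic, correlated 2-D data (illustrative, not from the lecture).
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 1.0], [1.0, 0.5]])
Xc = X - X.mean(axis=0)                 # center each column

C = np.cov(Xc, rowvar=False)            # 2 x 2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)    # symmetric solver, eigenvalues ascending
top_val, top_vec = eigvals[-1], eigvecs[:, -1]

# Variance of the data projected onto the top eigenvector.
var_along_top = np.var(Xc @ top_vec, ddof=1)

# Variance along many random unit directions, for comparison.
angles = rng.uniform(0.0, np.pi, size=1000)
dirs = np.column_stack([np.cos(angles), np.sin(angles)])
var_along_random = np.var(Xc @ dirs.T, axis=0, ddof=1)

print(var_along_top, top_val, var_along_random.max())
```

The first two printed numbers should agree, and the third should not be larger than them.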
How to calculate PCA
Principal components are calculated in order: the first principal component (PC1), the
second principal component (PC2), then PC3, PC4, and so on.
• First principal component PC1
The first principal component (PC1) is the direction in space along which the data
points have the highest or most variance. It is the line that best represents the shape
of the projected points. The larger the variability captured in the first component, the
larger the information retained from the original dataset. No other principal
component can have a higher variability.
• Second principal component PC2
We calculate the second principal component (PC2) in the same way as PC1. PC2
accounts for the next highest variance in the dataset and must be uncorrelated with
PC1. That is, PC2 must be orthogonal (perpendicular) to PC1. Equivalently, the
correlation between PC1 and PC2 is zero.
A scatterplot is typically used to show the relationship between PC1 and PC2 when
PCA is applied to a dataset. The PC1 and PC2 axes are perpendicular to each other.
Calculation steps of principal components analysis
The PCA computation process is summarized in the steps below, showing
how the principal components are calculated and how they relate to the original
data.
• Standardize the range of continuous initial variables
Since PCA is sensitive to the scale of the variables, features with larger ranges
can dominate the result, so it is important to evaluate whether standardization
of the data is needed. After standardization, each variable has a mean of zero
and a standard deviation of one.
In this step, the mean of each variable is calculated and subtracted from
the original dataset so that the data are centered.
Each value is then divided by the standard deviation of its variable so that all
variables use the same scale and contribute equally to the analysis.
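The standardization step can be sketched as follows. The data values are hypothetical and chosen only to show two features on very different scales.

```python
import numpy as np

# Hypothetical data: two features on very different scales.
X = np.array([[170.0, 65000.0],
              [160.0, 48000.0],
              [180.0, 72000.0],
              [175.0, 51000.0]])

mean = X.mean(axis=0)
std = X.std(axis=0, ddof=1)     # sample standard deviation per column
Z = (X - mean) / std            # z-scores

print(Z.mean(axis=0))           # approximately 0 for each column
print(Z.std(axis=0, ddof=1))    # 1 for each column, up to round-off
```

Whether to divide by the standard deviation at all is a modelling choice: if the variables already share a unit and scale, centering alone may be enough.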
• Compute the covariance matrix to identify correlations
Covariance (cov) measures how two variables vary together. The covariance
matrix summarizes the covariances associated with all pair
combinations of the initial variables in the dataset. Computing the covariance
matrix helps identify the relationships between the variables, that is, how the
variables vary from the mean with respect to each other. The covariance matrix
is a symmetric d × d matrix, where d is the number of dimensions. For example,
for a 3-dimensional dataset, there would be 3 × 3 or 9 entries in the covariance
matrix.
The sign of each covariance in the matrix tells us how the two variables are
related:
• Positive (the variables are positively correlated: they increase or decrease together)
• Negative (the variables are inversely correlated: one decreases while the other
increases)
• Zero (the variables have no linear relationship)
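A small sketch of the three sign cases, using made-up vectors (`up`, `down`, and `flat` are hypothetical names and values chosen so that each case appears once):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
up = 2.0 * x + 1.0        # rises with x
down = 10.0 - 3.0 * x     # falls as x rises
flat = np.array([1.0, -1.0, 0.0, -1.0, 1.0])  # chosen so that cov(x, flat) is 0

C = np.cov(np.vstack([x, up, down, flat]))    # rows are variables in this call
print(C.shape)            # (4, 4)
print(np.sign(C[0, 1:]))  # signs of cov(x, up), cov(x, down), cov(x, flat)
```

The matrix is symmetric, and its diagonal holds the variance of each variable.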
• Compute the eigenvectors and eigenvalues of the covariance matrix
Here, we calculate the eigenvectors (principal components) and eigenvalues of
the covariance matrix. As eigenvectors, the principal components represent the
directions of maximum variance in the data. The eigenvalues represent the
amount of variance in each component. Ranking the eigenvectors by eigenvalue
identifies the order of principal components.
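Ranking eigenvectors by eigenvalue might look like the sketch below. The helper name `sorted_eig` and the 3 × 3 matrix are hypothetical, used only for illustration.

```python
import numpy as np

def sorted_eig(C):
    """Return eigenvalues (descending) and matching eigenvectors (as columns)
    of a symmetric matrix C."""
    vals, vecs = np.linalg.eigh(C)       # ascending order for symmetric input
    order = np.argsort(vals)[::-1]       # indices from largest to smallest
    return vals[order], vecs[:, order]

# Hypothetical 3 x 3 covariance matrix, for illustration only.
C = np.array([[4.0, 2.0, 0.6],
              [2.0, 3.0, 0.4],
              [0.6, 0.4, 1.0]])
vals, vecs = sorted_eig(C)
print(vals)          # variance along PC1, PC2, PC3
print(vecs[:, 0])    # direction of PC1 (its sign is arbitrary)
```

`np.linalg.eigh` is used because a covariance matrix is symmetric; it returns real eigenvalues and orthonormal eigenvectors.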
• Select the principal components
Here, we decide which components to keep and which to discard. Components
with low eigenvalues are typically less significant. Scree plots usually plot
the proportion of total variance explained and the cumulative proportion of
variance. These metrics help determine the optimal number of
components to retain. The point at which the plot of eigenvalues or total
variance explained forms an "elbow" generally indicates how many principal
components to include.
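One common alternative to reading the elbow by eye is to keep the smallest number of components whose cumulative explained variance reaches a chosen threshold. The sketch below assumes a 90% threshold and hypothetical eigenvalues; both are illustrative choices, not values from the lecture.

```python
import numpy as np

def n_components_for(eigvals, target=0.95):
    """Return (k, ratio, cumulative), where k is the smallest number of
    components whose cumulative explained variance reaches `target`."""
    vals = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    ratio = vals / vals.sum()            # proportion of variance per component
    cumulative = np.cumsum(ratio)
    # The small tolerance guards against round-off exactly at the threshold.
    k = int(np.searchsorted(cumulative, target - 1e-12) + 1)
    return k, ratio, cumulative

# Hypothetical eigenvalues, for illustration only.
k, ratio, cumulative = n_components_for([5.0, 2.5, 1.5, 0.7, 0.3], target=0.90)
print(k, ratio, cumulative)
```

Plotting `ratio` against the component index gives the scree plot; plotting `cumulative` gives the cumulative explained variance curve.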
• Transform the data into the new coordinate system
Finally, the data is transformed into the new coordinate system defined by
the principal components. That is, the feature vector created from the
eigenvectors of the covariance matrix projects the data onto the new axes
defined by the principal components. This creates new data, capturing most
of the information but with fewer dimensions than the original dataset.
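The final projection can be sketched as a single function. The name `pca_project` and the synthetic data are assumptions made for illustration; rows are treated as observations and columns as features.

```python
import numpy as np

def pca_project(X, k):
    """Project the rows of X onto the top-k principal components."""
    mean = X.mean(axis=0)
    Xc = X - mean                                  # center the data
    vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(vals)[::-1][:k]             # indices of the k largest eigenvalues
    W = vecs[:, order]                             # feature vector: d x k
    return Xc @ W, W, mean

rng = np.random.default_rng(1)                     # arbitrary seed
X = rng.normal(size=(200, 5))                      # synthetic data: 200 samples, 5 features
scores, W, mean = pca_project(X, k=2)
print(scores.shape)                                # (200, 2)
```

The columns of `scores` are the new variables; they are uncorrelated with each other because the columns of `W` are orthonormal eigenvectors of the covariance matrix.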
Interpreting PCA results
• A PCA plot is a scatter plot created by using the first two principal components as
axes. The first principal component (PC1) is the x-axis, and the second principal
component (PC2) is the y-axis. The scatter plot shows the relationships between
observations (data points) and the new variables (the principal components). The
position of each point shows the values of PC1 and PC2 for that observation.
• The direction and length of the plot arrows indicate the loadings of the variables,
that is, how each variable contributes to the principal components. If a variable has
a high loading for a particular component, it is strongly correlated with that
component. This can highlight which variables have a significant impact on data
variations.
• The number of principal components that remain after applying PCA can help you
interpret the data output. The first principal component explains the most data
variance, and each later component accounts for less variance. Thus, the number of
components can indicate the amount of information retained from the original
dataset. Fewer components after applying PCA could mean that you didn’t capture
much data variation. More components indicate more data variation, but the results
may be harder to interpret. You can decide the optimal number of components to
retain using either a scree plot or the cumulative explained variance.
Applications of principal component analysis
Applying PCA can help preprocess or extract the most informative features from
datasets with many variables. Preprocessing reduces complexity while preserving
relevant information. Common scenarios that use PCA include:
• Image compression
PCA reduces image dimensionality while retaining essential information. It helps
create compact representations of images, making them easier to store and transmit.
• Data visualization
PCA helps to visualize high-dimensional data by projecting it into a lower-dimensional
space, such as a 2D or 3D plot. This simplifies data interpretation and exploration.
• Noise filtering
PCA can remove noise or redundant information from data by focusing on the
principal components that capture the underlying patterns.
When to use principal component analysis
There are many other dimensionality reduction techniques available, including linear
discriminant analysis, random forest, uniform manifold approximation and projection (UMAP),
and t-distributed stochastic neighbor embedding (t-SNE). Consider the following factors to decide if PCA is
the right approach for your analysis:
• Linearity: PCA is a linear technique, while other techniques such as t-SNE and UMAP are
non-linear. This means that PCA is better suited for datasets with linear relationships
between variables. Non-linear techniques are better suited for datasets with non-linear or
more complex relationships between variables.
• Computation: PCA uses matrix operations for computation to efficiently manage large
datasets. Other techniques, such as t-SNE and UMAP, are expensive and may not be suitable
for large datasets.
• Information preservation: PCA preserves the maximum amount of variance in the data.
t-SNE and UMAP focus on preserving the local structure of the data. PCA is, therefore, better
suited for identifying the most important data variables. Non-linear techniques are better
suited for visualizing the data in lower dimensions.
• Feature extraction: PCA is a feature extraction technique. It produces new variables that are
linear combinations of the original variables. Other techniques (such as UMAP and t-SNE) do
not express their output as explicit combinations of the original variables. This means PCA
can identify the most important variables in the data. Non-linear techniques are better suited
for visualizing the data in lower dimensions.
• An exploratory technique used to reduce the dimensionality of a dataset to two or three dimensions
• It can be used to:
– Reduce the number of dimensions in the data
– Find patterns in high-dimensional data
– Visualize high-dimensional data
• Example applications:
– Face recognition
– Image compression
– Gene expression analysis
Principal Component Analysis
See online tutorials such as
[Link]
pal_components.pdf
[Figure: scatter plot of data points in the (X1, X2) plane, with the eigenvector axes Y1 and Y2 drawn through the data.]
Note: Y1 is the first eigenvector, Y2 is the second. Y2 is ignorable.
Key observation: the variance is largest along Y1.
Principal component analysis: one attribute first
• Question: how much does the data spread along the axis? (distance to the mean)
• Variance = (standard deviation)^2

\[ s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} \]

Temperature: 42, 40, 24, 30, 15, 18, 15, 30, 15, 30, 35, 30, 40, 30
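As a quick check of the sample-variance formula on this slide, the sketch below computes it by hand and with NumPy. The Temperature values are transcribed from the slide's column, and that transcription is an assumption.

```python
import numpy as np

# Temperature column as read from the slide (the transcription is an assumption).
temperature = np.array([42, 40, 24, 30, 15, 18, 15, 30, 15, 30, 35, 30, 40, 30],
                       dtype=float)

n = temperature.size
mean = temperature.mean()
s2_manual = ((temperature - mean) ** 2).sum() / (n - 1)   # sum of squared deviations / (n - 1)
s2_numpy = temperature.var(ddof=1)                        # ddof=1 selects the n - 1 divisor

print(mean, s2_manual, s2_numpy)
```

The square root of either value is the sample standard deviation.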
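In the case of two dimensions (two attributes)
X = Temperature, Y = Humidity
• cov(X,Y) = 0: uncorrelated (no linear relationship)
• cov(X,Y) > 0: move in the same direction
• cov(X,Y) < 0: move in opposite directions

\[ \operatorname{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1} \]

Temperature   Humidity
40            90
40            90
40            90
30            90
15            70
15            70
15            70
30            90
15            70
30            70
30            70
30            90
40            70
30            90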
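The covariance of the two attributes can be computed directly from its definition. The (Temperature, Humidity) pairs below are transcribed from the slide, and that transcription is an assumption.

```python
import numpy as np

# (Temperature, Humidity) pairs as read from the slide (transcription is an assumption).
data = np.array([[40, 90], [40, 90], [40, 90], [30, 90], [15, 70], [15, 70], [15, 70],
                 [30, 90], [15, 70], [30, 70], [30, 70], [30, 90], [40, 70], [30, 90]],
                dtype=float)
x, y = data[:, 0], data[:, 1]

n = len(x)
cov_manual = ((x - x.mean()) * (y - y.mean())).sum() / (n - 1)
cov_numpy = np.cov(x, y)[0, 1]          # off-diagonal entry of the 2 x 2 covariance matrix

print(cov_manual, cov_numpy)            # positive: the two attributes tend to move together
```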
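More than two attributes: covariance matrix

\[ C^{n \times n} = \left( c_{ij} \mid c_{ij} = \operatorname{cov}(Dim_i, Dim_j) \right) \]

• Example for three attributes (x, y, z):

\[ C = \begin{pmatrix}
\operatorname{cov}(x, x) & \operatorname{cov}(x, y) & \operatorname{cov}(x, z) \\
\operatorname{cov}(y, x) & \operatorname{cov}(y, y) & \operatorname{cov}(y, z) \\
\operatorname{cov}(z, x) & \operatorname{cov}(z, y) & \operatorname{cov}(z, z)
\end{pmatrix} \]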
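Eigenvalues & eigenvectors
• Vectors x having the same direction as Ax are called
eigenvectors of A (A is an n by n matrix).
• In the equation Ax = λx, λ is called an eigenvalue of A.

\[ \begin{pmatrix} 2 & 3 \\ 2 & 1 \end{pmatrix}
\begin{pmatrix} 3 \\ 2 \end{pmatrix}
= \begin{pmatrix} 12 \\ 8 \end{pmatrix}
= 4 \begin{pmatrix} 3 \\ 2 \end{pmatrix} \]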
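The example can be verified numerically. This short sketch uses the 2 × 2 matrix and the vector (3, 2) from the example, and also asks NumPy for all eigenvalues of the matrix.

```python
import numpy as np

A = np.array([[2.0, 3.0],
              [2.0, 1.0]])
x = np.array([3.0, 2.0])

print(A @ x)                      # equals 4 * x, i.e. (12, 8)
vals, vecs = np.linalg.eig(A)     # general solver, since A is not symmetric
print(vals)                       # both eigenvalues of A (order not guaranteed)
```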
Eigenvalues & eigenvectors
• Ax = λx ⇔ (A − λI)x = 0
• How to calculate x and λ:
– Calculate det(A − λI), which yields a polynomial in λ (degree n)
– Determine the roots of det(A − λI) = 0; the roots are the eigenvalues λ
– Solve (A − λI)x = 0 for each λ to obtain the eigenvectors x
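As a worked instance of this procedure, take the 2 × 2 matrix from the preceding example, with rows (2, 3) and (2, 1). The derivation below is a sketch added for illustration; the eigenvectors are determined only up to a nonzero scale factor.

```latex
\det(A - \lambda I)
  = \det \begin{pmatrix} 2 - \lambda & 3 \\ 2 & 1 - \lambda \end{pmatrix}
  = (2 - \lambda)(1 - \lambda) - 6
  = \lambda^2 - 3\lambda - 4
  = (\lambda - 4)(\lambda + 1)

\lambda_1 = 4: \quad
  \begin{pmatrix} -2 & 3 \\ 2 & -3 \end{pmatrix} x = 0
  \;\Rightarrow\; x \propto \begin{pmatrix} 3 \\ 2 \end{pmatrix}

\lambda_2 = -1: \quad
  \begin{pmatrix} 3 & 3 \\ 2 & 2 \end{pmatrix} x = 0
  \;\Rightarrow\; x \propto \begin{pmatrix} 1 \\ -1 \end{pmatrix}
```

The first pair reproduces the example: multiplying the matrix by (3, 2) gives 4 times (3, 2).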
Principal components
First principal component (PC1)
• The eigenvalue with the largest absolute value indicates that the data have the
largest variance along its eigenvector, the direction along which there is greatest
variation
Second principal component (PC2)
• the direction with maximum variation left in the data, orthogonal to PC1
In general, only a few directions manage to capture most of the variability in
the data.
Steps of PCA
• Let X̄ be the mean vector (taking the mean of all rows)
• Adjust the original data by the mean: X' = X − X̄
• Compute the covariance matrix C of the adjusted X
• Find the eigenvectors and eigenvalues of C:
– For matrix C, vectors e (= column vectors) having the same direction as Ce are the eigenvectors of C, that is, e such that Ce = λe
– λ is called an eigenvalue of C
– Ce = λe ⇔ (C − λI)e = 0
– Most data mining packages do this for you.
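Eigenvalues
• Calculate the eigenvalues λ and eigenvectors x of the
covariance matrix:
– The eigenvalues λ_j are used to calculate the percentage of total
variance (V_j) for each component j:

\[ V_j = 100 \cdot \frac{\lambda_j}{\sum_{x=1}^{n} \lambda_x},
\qquad \sum_{x=1}^{n} \lambda_x = n \]

(The second identity holds when the variables are standardized, so that each has unit variance.)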
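A short sketch of the percentage-of-variance calculation, on synthetic standardized data (the data, the seed, and the column scales are illustrative assumptions). With standardized variables, the eigenvalues sum to the number of variables.

```python
import numpy as np

rng = np.random.default_rng(2)                          # arbitrary seed
X = rng.normal(size=(100, 4)) * [1.0, 5.0, 0.2, 30.0]   # synthetic data, 4 variables
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)        # standardized variables

C = np.cov(Z, rowvar=False)           # equals the correlation matrix of X
lam = np.linalg.eigvalsh(C)[::-1]     # eigenvalues, largest first
V = 100.0 * lam / lam.sum()           # percent of total variance per component

print(lam.sum())   # equals the number of variables (4) up to round-off
print(V)
```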
Principal components - Variance
[Figure: bar chart of the variance (%) explained by each of PC1 through PC10.]
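Transformed Data
• Eigenvalue λ_j corresponds to the variance on each
component j
• Thus, sort by λ_j
• Take the first p eigenvectors e_i, where p is the number of
top eigenvalues kept
• These are the directions with the largest variances

\[ \begin{pmatrix} y_{i1} \\ y_{i2} \\ \vdots \\ y_{ip} \end{pmatrix}
= \begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_p \end{pmatrix}
\begin{pmatrix} x_{i1} - \bar{x}_1 \\ x_{i2} - \bar{x}_2 \\ \vdots \\ x_{in} - \bar{x}_n \end{pmatrix} \]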
An Example
Mean1 = 24.1, Mean2 = 53.8

X1    X2    X1'     X2'
19    63    -5.1      9.25
39    74    14.9     20.25
30    87     5.9     33.25
30    23     5.9    -30.75
15    35    -9.1    -18.75
15    43    -9.1    -10.75
15    32    -9.1    -21.75
30    73     5.9     19.25

[Figures: scatter plot of the original data (X1, X2) and scatter plot of the mean-adjusted data (X1', X2').]
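The means and the mean-adjusted columns of the table can be reproduced as below. The sketch also computes the covariance matrix with both divisors; the matrix quoted in this example (75, 106; 106, 482) appears to correspond to the divisor n rather than n − 1, which is an observation from recomputing and not something the slides state.

```python
import numpy as np

# (X1, X2) rows of the example table.
X = np.array([[19, 63], [39, 74], [30, 87], [30, 23],
              [15, 35], [15, 43], [15, 32], [30, 73]], dtype=float)

mean = X.mean(axis=0)                 # about (24.1, 53.8)
X_adj = X - mean                      # the X1', X2' columns

C_n = np.cov(X_adj, rowvar=False, bias=True)   # divisor n
C_n1 = np.cov(X_adj, rowvar=False)             # divisor n - 1

print(np.round(mean, 1))
print(np.round(C_n))                  # compare with the covariance matrix quoted in the example
print(np.round(C_n1))
```

Either divisor gives the same eigenvectors, since the two matrices differ only by a constant factor; only the eigenvalues are rescaled.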
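Covariance Matrix
• Covariance matrix of the example data:

\[ C = \begin{pmatrix} 75 & 106 \\ 106 & 482 \end{pmatrix} \]

• Using MATLAB, we find:
– Eigenvectors and eigenvalues:
– e1 = (-0.98, -0.21), λ1 = 51.8
– e2 = (0.21, -0.98), λ2 = 560.2
– Thus the second eigenvector is more important!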
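The eigen-decomposition can be rerun in NumPy on the matrix as printed. Recomputing from the rounded entries gives eigenvalues of roughly 49 and 508, with a leading direction of roughly (0.24, 0.97) up to sign, so the quoted values may come from different or unrounded inputs and are best treated as approximate. The conclusion is unchanged: the direction dominated by X2 carries most of the variance.

```python
import numpy as np

C = np.array([[75.0, 106.0],
              [106.0, 482.0]])           # covariance matrix as printed in the example

vals, vecs = np.linalg.eigh(C)           # ascending eigenvalues, eigenvectors as columns
print(np.round(vals, 1))                 # compare with the eigenvalues quoted in the example
print(np.round(vecs, 2))                 # columns are unit eigenvectors (sign is arbitrary)
```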
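If we only keep one dimension: e2
• We keep the dimension of e2 = (0.21, -0.98)
• We can obtain the final data as

\[ y_i = \begin{pmatrix} 0.21 & -0.98 \end{pmatrix}
\begin{pmatrix} x_{i1} \\ x_{i2} \end{pmatrix}
= 0.21 \, x_{i1} - 0.98 \, x_{i2} \]

where x_{i1} and x_{i2} are the mean-adjusted values X1' and X2' of row i.

y_i
-10.14
-16.72
-31.35
31.374
16.464
8.624
19.404
-17.63

[Figure: the projected values y_i plotted along a single axis.]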
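The projected values can be reproduced with a few lines of NumPy, assuming the direction e2 = (0.21, -0.98) quoted in the example and the mean-adjusted data from the example table. Small differences in the last digits come from rounding the means.

```python
import numpy as np

X = np.array([[19, 63], [39, 74], [30, 87], [30, 23],
              [15, 35], [15, 43], [15, 32], [30, 73]], dtype=float)
X_adj = X - X.mean(axis=0)            # mean-adjusted data (X1', X2')

e2 = np.array([0.21, -0.98])          # kept direction, as quoted in the example
y = X_adj @ e2                        # one number per row: the 1-D representation

print(np.round(y, 2))
```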
PCA –> Original Data
• Retrieving old data (e.g. in data compression)
– RetrievedRowData = (RowFeatureVector^T × FinalData) + OriginalMean
– Yields original data using the chosen components
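A minimal sketch of the retrieval step on synthetic data (the data, the seed, and the helper name `reconstruct` are assumptions). The code stores observations as rows, so the product appears transposed relative to the formula above. Keeping all components recovers the data up to round-off; dropping components gives an approximation.

```python
import numpy as np

rng = np.random.default_rng(3)                            # arbitrary seed
X = rng.normal(size=(50, 3)) @ rng.normal(size=(3, 3))    # synthetic data
mean = X.mean(axis=0)
Xc = X - mean

vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
vecs = vecs[:, ::-1]                  # columns ordered by decreasing eigenvalue

def reconstruct(k):
    """Rebuild the data from its first k principal components."""
    W = vecs[:, :k]                   # d x k: kept eigenvectors as columns
    final_data = Xc @ W               # n x k: transformed data
    return final_data @ W.T + mean    # back in the original coordinates

err_all = np.abs(reconstruct(3) - X).max()   # all components kept
err_two = np.abs(reconstruct(2) - X).max()   # one component dropped
print(err_all, err_two)
```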
Principal components
• General about principal components
– summary variables
– linear combinations of the original variables
– uncorrelated with each other
– capture as much of the original variance as possible