Understanding Principal Component Analysis

Uploaded by

jffuriri

Principal component analysis (PCA)
lect5
Principal component analysis (PCA) reduces the number of dimensions in
large datasets to principal components that retain most of the original
information. It does this by transforming potentially correlated variables into
a smaller set of variables, called principal components.
Karl Pearson is credited with the development of PCA in 1901, but it gained
popularity with the increased availability of computers, which allowed for
multivariate statistical computations at scale. PCA is very effective for
visualizing and exploring high-dimensional datasets, or data with many
features, because it can help reveal trends, patterns, or outliers.
PCA is commonly used to preprocess data for machine learning algorithms. It
derives a small number of informative features from a large dataset while
preserving most of the variance of the original data. This reduces model
complexity, which matters because, for a fixed number of samples, each added
feature makes the data sparser and can degrade model performance, a problem
commonly referred to as the “curse of dimensionality.” By projecting a
high-dimensional dataset into a smaller feature space, PCA also mitigates common
issues such as multicollinearity and overfitting. Multicollinearity occurs when two
or more independent variables are highly correlated with one another, which can be
problematic for causal modeling; because principal components are uncorrelated
with each other, it does not arise among them. Overfit models generalize poorly to
new data, which diminishes their value. PCA is commonly used within regression
analysis, but it is also applied to a variety of other use cases, such as pattern
recognition, signal processing, and image processing.
How principal component analysis works
• PCA summarizes the information content of large datasets into a smaller set of
uncorrelated variables known as principal components. These principal components
are linear combinations of the original variables that have the maximum variance
compared to other linear combinations. These components capture as much
information from the original dataset as possible.
• This statistical technique involves both linear algebra and matrix operations, and it
transforms the original dataset into a new coordinate system that is structured by the
principal components. The eigenvectors and eigenvalues from the covariance matrix
that underpin the principal components allow for the analysis of these linear
transformations.
• Imagine you have mapped out a dataset with multiple features, resulting in a
multi-dimensional scatterplot. The eigenvectors of the covariance matrix give the
directions along which the data vary in the scatterplot. Each eigenvalue is the scalar
paired with an eigenvector; it equals the variance of the data along that direction
and so denotes the importance of that direction. Therefore, a high eigenvalue means
that the corresponding eigenvector is more important. Since principal components
represent the directions of maximum variance in the data, they are the eigenvectors
of the covariance matrix.
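The link between "direction of maximum variance" and "eigenvector of the covariance matrix" can be sketched in a few lines. The notation is an assumption made for illustration and does not come from the slides: $C$ is the covariance matrix of the mean-centered data $x$, $w$ is a unit-length direction, and $\lambda$ is a Lagrange multiplier.

```latex
\begin{aligned}
\operatorname{Var}(w^{\top}x) &= w^{\top} C\, w, \qquad \text{subject to } w^{\top}w = 1 \\
\mathcal{L}(w,\lambda) &= w^{\top} C\, w - \lambda\,(w^{\top}w - 1) \\
\frac{\partial \mathcal{L}}{\partial w} &= 2Cw - 2\lambda w = 0 \;\Longrightarrow\; Cw = \lambda w \\
\operatorname{Var}(w^{\top}x) &= w^{\top}(\lambda w) = \lambda
\end{aligned}
```

Under these assumptions every stationary direction is an eigenvector of $C$, the variance along it equals its eigenvalue, and the largest eigenvalue therefore corresponds to PC1.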
How to calculate PCA
PCA produces up to as many principal components as there are original variables.
The first two, the first principal component (PC1) and the second principal
component (PC2), usually receive the most attention; PC3, PC4, and so on are
defined in the same way.
• First principal component PC1
The first principal component (PC1) is the direction in space along which the data
points have the highest variance. It is the line that best represents the shape
of the projected points. The larger the variability captured in the first component, the
more information is retained from the original dataset. No other principal
component can have a higher variability.
• Second principal component PC2
We calculate the second principal component (PC2) in the same way as PC1. PC2
accounts for the next highest variance in the dataset and must be uncorrelated with
PC1. That is, PC2 must be orthogonal (perpendicular) to PC1. This relationship
can also be expressed as the correlation between PC1 and PC2 being equal to zero.
A scatterplot is typically used to show the relationship between PC1 and PC2 when
PCA is applied to a dataset. The PC1 and PC2 axes are perpendicular to each other.
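The orthogonality of PC1 and PC2 can be checked with a dot product. The sketch below uses the two eigenvectors quoted in the worked example later in these notes; the helper names `dot` and `norm` are illustrative assumptions, not part of the lecture material.

```python
import math

# Eigenvectors quoted in the worked example of these notes (rounded to 2 decimals).
e1 = (-0.98, -0.21)
e2 = (0.21, -0.98)

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    """Euclidean length of a vector."""
    return math.sqrt(dot(u, u))

# Orthogonal (perpendicular) directions have a dot product of zero.
print("e1 . e2 =", dot(e1, e2))
# Unit-length directions have norm 1; only approximately here because of rounding.
print("|e1| =", round(norm(e1), 3), "|e2| =", round(norm(e2), 3))
```

A zero dot product between the directions is what makes the projected coordinates (the PC1 and PC2 scores) uncorrelated.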
Calculation steps of principal components analysis
The PCA computation process is summarized in the steps below, which show
how the principal components are calculated and how they relate to the original
data.
• Standardize the range of continuous initial variables
Since PCA is sensitive to the scale of the variables and can be biased towards
features with large ranges, it is important to evaluate whether standardization of
the data is needed. After standardization, each variable has a mean of zero and a
standard deviation of one.
In this step, the mean of each variable is calculated and subtracted from that
variable's values so that the data are centered. Each centered value is then divided
by the variable's standard deviation so that all variables use the same scale and
contribute equally to the analysis.
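A minimal sketch of the standardization step, using only the Python standard library. The function name `standardize` is an illustrative assumption, the data are the temperature and humidity readings from the example table later in these notes, and the sample standard deviation (n - 1 in the denominator) is assumed.

```python
import statistics

def standardize(values):
    """Return z-scores: subtract the mean, then divide by the sample standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1 in the denominator)
    return [(v - mean) / std for v in values]

# Temperature and humidity readings from the example table later in these notes.
temperature = [40, 40, 40, 30, 15, 15, 15, 30, 15, 30, 30, 30, 40, 30]
humidity = [90, 90, 90, 90, 70, 70, 70, 90, 70, 70, 70, 90, 70, 90]

z_temperature = standardize(temperature)
z_humidity = standardize(humidity)

# After standardization each column has mean ~0 and standard deviation ~1.
print(round(statistics.mean(z_temperature), 6), round(statistics.stdev(z_temperature), 6))
print(round(statistics.mean(z_humidity), 6), round(statistics.stdev(z_humidity), 6))
```

Both columns end up on the same scale, so neither dominates the covariance matrix simply because of its units.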
• Compute the covariance matrix to identify correlations
Covariance (cov) measures how two variables vary together. The covariance
matrix summarizes the covariances associated with all pair combinations of the
initial variables in the dataset. Computing the covariance matrix helps identify
the relationships between the variables, that is, how the variables vary from the
mean with respect to each other. The covariance matrix is a symmetric d × d
matrix, where d is the number of dimensions. For example, for a 3-dimensional
dataset, there would be 3 × 3 or 9 entries in the covariance matrix.
The sign of each entry in the matrix tells us how the two variables are
related:
• Positive (the variables are positively correlated and increase or decrease at the same time)
• Negative (the variables are inversely correlated, meaning that one decreases while the other
increases)
• Zero (the variables have no linear relationship)
• Compute the eigenvectors and eigenvalues of the covariance matrix
Here, we calculate the eigenvectors (principal components) and eigenvalues of
the covariance matrix. As eigenvectors, the principal components represent the
directions of maximum variance in the data. The eigenvalues represent the
amount of variance along each component. Ranking the eigenvectors by eigenvalue,
from largest to smallest, identifies the order of the principal components.
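The ranking step can be sketched as a sort over (eigenvalue, eigenvector) pairs. The pairs below are the values quoted in the worked example later in these notes; the variable names are illustrative assumptions.

```python
# Eigenpairs quoted in the worked example of these notes: (eigenvalue, eigenvector).
eigenpairs = [
    (51.8, (-0.98, -0.21)),
    (560.2, (0.21, -0.98)),
]

# Sort by eigenvalue, largest first: position 0 becomes PC1, position 1 becomes PC2.
ranked = sorted(eigenpairs, key=lambda pair: pair[0], reverse=True)

for rank, (eigenvalue, eigenvector) in enumerate(ranked, start=1):
    print(f"PC{rank}: eigenvalue={eigenvalue}, direction={eigenvector}")
```

In practice a library routine would return the eigenpairs, and the same sort would then put them in principal-component order.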
• Select the principal components
Here, we decide which components to keep and which to discard. Components
with low eigenvalues are typically less significant. A scree plot shows the
eigenvalue, or the proportion of total variance explained, for each component,
often together with the cumulative proportion of variance. These metrics help
determine the number of components to retain. The point at which the curve of
eigenvalues or variance explained forms an "elbow" generally indicates how many
principal components to include.
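The numbers behind a scree plot can be sketched as follows. The function names and the 90% threshold are hypothetical choices for illustration; the eigenvalues are the two quoted in the worked example later in these notes.

```python
def explained_variance(eigenvalues):
    """Return (ratios, cumulative) for eigenvalues sorted from largest to smallest."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    ratios = [value / total for value in ordered]
    cumulative = []
    running = 0.0
    for ratio in ratios:
        running += ratio
        cumulative.append(running)
    return ratios, cumulative

def components_needed(eigenvalues, threshold):
    """Smallest number of components whose cumulative ratio reaches the threshold."""
    _, cumulative = explained_variance(eigenvalues)
    for count, value in enumerate(cumulative, start=1):
        if value >= threshold - 1e-12:  # small tolerance for floating-point rounding
            return count
    return len(cumulative)

ratios, cumulative = explained_variance([51.8, 560.2])
print([round(r, 3) for r in ratios])             # share of variance per component
print(components_needed([51.8, 560.2], 0.90))    # hypothetical 90% threshold
```

A cumulative threshold is only one possible selection rule; the elbow of the scree plot is a visual alternative that needs no preset number.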
• Transform the data into the new coordinate system
Finally, the data is transformed into the new coordinate system defined by
the principal components. That is, the feature vector created from the
eigenvectors of the covariance matrix projects the data onto the new axes
defined by the principal components. This creates new data, capturing most
of the information but with fewer dimensions than the original dataset.
Interpreting PCA results
• A PCA plot is a scatter plot created by using the first two principal components as
axes. The first principal component (PC1) is the x-axis, and the second principal
component (PC2) is the y-axis. The scatter plot shows the relationships between
observations (data points) and the new variables (the principal components). The
position of each point shows the values of PC1 and PC2 for that observation.
• In a biplot, the direction and length of the arrows drawn for the variables indicate
their loadings, that is, how each variable contributes to the principal components. If a
variable has a high loading for a particular component, it is strongly correlated with
that component. This can highlight which variables have a significant impact on data
variations.
• The number of principal components retained after applying PCA affects how you
interpret the output. The first principal component explains the most data
variance, and each later component accounts for less variance. Thus, the number of
components indicates the amount of information retained from the original
dataset. Keeping fewer components means capturing less of the data variation, while
keeping more components captures more variation but may make the results
harder to interpret. You can decide how many components to
retain using either a scree plot or the cumulative explained variance.
Applications of principal component analysis
Applying PCA can help preprocess or extract the most informative features from
datasets with many variables. Preprocessing reduces complexity while preserving
relevant information. Common scenarios that use PCA include:
• Image compression
PCA reduces image dimensionality while retaining essential information. It helps
create compact representations of images, making them easier to store and transmit.
• Data visualization
PCA helps to visualize high-dimensional data by projecting it into a lower-dimensional
space, such as a 2D or 3D plot. This simplifies data interpretation and exploration.
• Noise filtering
PCA can remove noise or redundant information from data by focusing on the
principal components that capture the underlying patterns.
When to use principal component analysis
There are many other dimensionality reduction techniques available, including linear
discriminant analysis, random forest (used for feature selection), uniform manifold
approximation and projection (UMAP), and t-distributed stochastic neighbor embedding
(t-SNE). Consider the following factors to decide if PCA is the right approach for your analysis:
• Linearity: PCA is a linear technique, while other techniques such as t-SNE and UMAP are
non-linear. This means that PCA is better suited for datasets with linear relationships
between variables. Non-linear techniques are better suited for datasets with non-linear or
more complex relationships between variables.
• Computation: PCA uses matrix operations for computation to efficiently manage large
datasets. Other techniques, such as t-SNE and UMAP, are computationally more expensive
and may be less suitable for large datasets.
• Information preservation: PCA preserves the maximum amount of variance in the data.
t-SNE and UMAP focus on preserving the local structure of the data. PCA is, therefore, better
suited for identifying the directions that account for most of the variation. Non-linear
techniques are better suited for visualizing the data in lower dimensions.
• Feature extraction: PCA is a feature extraction technique. It produces new variables that are
linear combinations of the original variables. Other techniques (such as UMAP and t-SNE)
produce embedding coordinates that are not explicit combinations of the original variables.
This means the PCA loadings can indicate which original variables contribute most to the
variation in the data.
• An exploratory technique used to reduce the dimensionality of a dataset to two or three dimensions
• It can be used to:
• Reduce the number of dimensions in the data
• Find patterns in high-dimensional data
• Visualize high-dimensional data
• Example applications:
• Face recognition
• Image compression
• Gene expression analysis
Principal Component Analysis
See online tutorials such as
[Link]
pal_components.pdf
[Figure: scatter plot of data points (x) in the X1-X2 plane, with rotated axes Y1 and Y2.]
Note: Y1 is the first eigenvector, Y2 is the second. Y2 is ignorable.
Key observation: the variance along Y1 is the largest.

Principal component analysis: one attribute first
Temperature: 42, 40, 24, 30, 15, 18, 15, 30, 15, 30, 35, 30, 40, 30
• Question: how much spread is there in the data along the axis? (distance to the mean)
• Variance = (standard deviation)^2

$$ s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} $$

The case of a data matrix with two dimensions
• cov(X,Y) = 0: uncorrelated (no linear relationship)
• cov(X,Y) > 0: X and Y move in the same direction
• cov(X,Y) < 0: X and Y move in opposite directions

$$ \operatorname{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1} $$

X = Temperature   Y = Humidity
40                90
40                90
40                90
30                90
15                70
15                70
15                70
30                90
15                70
30                70
30                70
30                90
40                70
30                90
More than two attributes: covariance matrix
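$$ C^{n \times n} = (c_{ij} \mid c_{ij} = \operatorname{cov}(Dim_i, Dim_j)) $$

• Example for three attributes (x, y, z):

$$ C = \begin{pmatrix}
\operatorname{cov}(x,x) & \operatorname{cov}(x,y) & \operatorname{cov}(x,z) \\
\operatorname{cov}(y,x) & \operatorname{cov}(y,y) & \operatorname{cov}(y,z) \\
\operatorname{cov}(z,x) & \operatorname{cov}(z,y) & \operatorname{cov}(z,z)
\end{pmatrix} $$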
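As a sketch of how such a covariance matrix could be built, the code below applies the sample covariance formula (n - 1 in the denominator) to the temperature and humidity columns from the two-attribute slide. The function names `covariance` and `covariance_matrix` are illustrative assumptions; with only two attributes the result is a 2 × 2 matrix, but the same function accepts any number of columns.

```python
def covariance(x, y):
    """Sample covariance of two equal-length sequences (n - 1 in the denominator)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def covariance_matrix(columns):
    """Matrix whose (i, j) entry is cov(columns[i], columns[j])."""
    return [[covariance(ci, cj) for cj in columns] for ci in columns]

# Temperature and humidity columns from the two-attribute slide.
temperature = [40, 40, 40, 30, 15, 15, 15, 30, 15, 30, 30, 30, 40, 30]
humidity = [90, 90, 90, 90, 70, 70, 70, 90, 70, 70, 70, 90, 70, 90]

C = covariance_matrix([temperature, humidity])
for row in C:
    print([round(entry, 2) for entry in row])
```

The diagonal entries are the variances of each attribute, and the matrix is symmetric because cov(x, y) = cov(y, x).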
Eigenvalues & eigenvectors
• Vectors x having the same direction as Ax are called
eigenvectors of A (A is an n by n matrix).
• In the equation Ax = λx, λ is called an eigenvalue of A.

$$ \begin{pmatrix} 2 & 3 \\ 2 & 1 \end{pmatrix} \begin{pmatrix} 3 \\ 2 \end{pmatrix}
 = \begin{pmatrix} 12 \\ 8 \end{pmatrix} = 4 \begin{pmatrix} 3 \\ 2 \end{pmatrix} $$
Eigenvalues & eigenvectors

• Ax = λx ⇔ (A − λI)x = 0
• How to calculate x and λ:
• Calculate det(A − λI), which yields a polynomial of degree n
• Determine the roots of det(A − λI) = 0; the roots are the eigenvalues λ
• Solve (A − λI)x = 0 for each λ to obtain the eigenvectors x
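For a 2 × 2 matrix the characteristic polynomial is λ² − trace(A)·λ + det(A), so the procedure above can be sketched with the quadratic formula. The function names are illustrative assumptions, real roots are assumed, and the matrix A and vector x are the ones from the example slide.

```python
import math

def eigenvalues_2x2(matrix):
    """Roots of det(A - lambda*I) = lambda^2 - trace*lambda + det for a 2x2 matrix.

    Assumes the roots are real, as they are for the example below.
    """
    (a, b), (c, d) = matrix
    trace = a + d
    det = a * d - b * c
    root = math.sqrt(trace * trace - 4 * det)
    return (trace + root) / 2, (trace - root) / 2

def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

A = [[2, 3], [2, 1]]
x = [3, 2]

print(eigenvalues_2x2(A))   # → (4.0, -1.0)
print(mat_vec(A, x))        # → [12, 8], which is 4 * x
```

The second line of output confirms that x = (3, 2) is an eigenvector of A for the eigenvalue 4, matching the slide.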
Principal components
1. First principal component (PC1)
• The largest eigenvalue indicates that the data have the largest variance along its
eigenvector, the direction along which there is the greatest variation.

2. Second principal component (PC2)

• The direction with the maximum variation left in the data, orthogonal to PC1.
In general, only a few directions manage to capture most of the variability in
the data.
Steps of PCA
• Let $\bar{X}$ be the mean vector (taking the mean of all rows)
• Adjust the original data by the mean: X' = X − $\bar{X}$
• Compute the covariance matrix C of the adjusted X
• Find the eigenvectors and eigenvalues of C
• For matrix C, the eigenvectors are the vectors e (= column vectors) having the
same direction as Ce:
– e is an eigenvector of C if Ce = λe
– λ is called an eigenvalue of C
– Ce = λe ⇔ (C − λI)e = 0
• Most data mining packages do this for you.
Eigenvalues
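• Calculate eigenvalues λ and eigenvectors x for the
covariance matrix:
– Eigenvalues λ_j are used for calculation of [% of total
variance] (V_j) for each component j:

$$ V_j = 100 \cdot \frac{\lambda_j}{\sum_{x=1}^{n} \lambda_x}, \qquad \sum_{x=1}^{n} \lambda_x = n $$

The second relation holds when the variables have been standardized, so that each
contributes a variance of one.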
Principal components - Variance
[Figure: bar chart of Variance (%) for PC1 through PC10.]
Transformed Data
• Eigenvalue λ_j corresponds to the variance on component j
• Thus, sort by λ_j
• Take the first p eigenvectors e_i, where p is the number of
top eigenvalues
• These are the directions with the largest variances

$$ \begin{pmatrix} y_{i1} \\ y_{i2} \\ \vdots \\ y_{ip} \end{pmatrix}
 = \begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_p \end{pmatrix}
 \begin{pmatrix} x_{i1} - \bar{x}_1 \\ x_{i2} - \bar{x}_2 \\ \vdots \\ x_{in} - \bar{x}_n \end{pmatrix} $$
An Example
Mean1 = 24.1
Mean2 = 53.8

X1   X2   X1'    X2'
19   63   -5.1     9.25
39   74   14.9    20.25
30   87    5.9    33.25
30   23    5.9   -30.75
15   35   -9.1   -18.75
15   43   -9.1   -10.75
15   32   -9.1   -21.75
30   73    5.9    19.25

[Figure: two scatter plots (Series1), one of the original data (X1, X2) and one of the mean-adjusted data (X1', X2').]
Covariance Matrix
$$ C = \begin{pmatrix} 75 & 106 \\ 106 & 482 \end{pmatrix} $$

• Using MATLAB, we find out:

– Eigenvectors and eigenvalues:
– e1 = (-0.98, -0.21), λ1 = 51.8
– e2 = (0.21, -0.98), λ2 = 560.2
– Thus the second eigenvector is more important, because its eigenvalue is larger!
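As a cross-check of the eigen-decomposition step, the sketch below applies the closed-form solution for a symmetric 2 × 2 matrix to C exactly as printed above. The helper name `symmetric_eigen_2x2` is an illustrative assumption, and the closed form stands in for the MATLAB call, which the notes do not show. For the printed C this computation gives eigenvalues of roughly 49.0 and 508.0 rather than the quoted 51.8 and 560.2, so the quoted figures may have been obtained from unrounded or differently scaled inputs; either way, one direction carries most of the variance.

```python
import math

def symmetric_eigen_2x2(matrix):
    """Eigenpairs of a symmetric 2x2 matrix [[a, b], [b, d]] with b != 0.

    Returns [(eigenvalue, unit eigenvector), ...] sorted from largest eigenvalue.
    """
    (a, b), (_, d) = matrix
    center = (a + d) / 2
    radius = math.sqrt(((a - d) / 2) ** 2 + b ** 2)
    pairs = []
    for eigenvalue in (center + radius, center - radius):
        # (b, eigenvalue - a) solves (C - eigenvalue * I) v = 0 when b != 0.
        vx, vy = b, eigenvalue - a
        length = math.hypot(vx, vy)
        pairs.append((eigenvalue, (vx / length, vy / length)))
    return pairs

C = [[75, 106], [106, 482]]
pairs = symmetric_eigen_2x2(C)
for eigenvalue, vector in pairs:
    print(round(eigenvalue, 1), tuple(round(component, 2) for component in vector))
```

The sign of each eigenvector is arbitrary: multiplying a vector by -1 gives an equally valid direction.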
If we only keep one dimension: e2
• We keep the dimension of e2 = (0.21, -0.98)
• We can obtain the final data as

$$ y_i = \begin{pmatrix} 0.21 & -0.98 \end{pmatrix} \begin{pmatrix} x_{i1} \\ x_{i2} \end{pmatrix}
 = 0.21\, x_{i1} - 0.98\, x_{i2} $$

where x_{i1} and x_{i2} are the mean-adjusted values X1' and X2' of row i.

y_i
-10.14
-16.72
-31.35
31.374
16.464
8.624
19.404
-17.63

[Figure: plot of the projected values y_i along a single axis.]
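The projection onto e2 can be reproduced with a dot product per row. This is a minimal sketch using the mean-adjusted values (X1', X2') from the example table; the function name `project` is an illustrative assumption.

```python
# Mean-adjusted data (X1', X2') from the example table.
adjusted = [
    (-5.1, 9.25), (14.9, 20.25), (5.9, 33.25), (5.9, -30.75),
    (-9.1, -18.75), (-9.1, -10.75), (-9.1, -21.75), (5.9, 19.25),
]
e2 = (0.21, -0.98)  # the eigenvector kept in the example

def project(row, direction):
    """Coordinate of a mean-adjusted row along one direction (a dot product)."""
    return sum(value * weight for value, weight in zip(row, direction))

y = [project(row, e2) for row in adjusted]
print([round(value, 3) for value in y])
```

Up to rounding, the printed values agree with the y_i column of the slide, which confirms that the slide projects the mean-adjusted data rather than the raw data.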
PCA –> Original Data
• Retrieving old data (e.g. in data compression)
– RetrievedRowData = (RowFeatureVector^T x
FinalData) + OriginalMean
– Yields the original data exactly if all components are kept, and an
approximation of it if only the chosen components are used
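A sketch of the retrieval formula, using the eigenvectors and means quoted in the example. The function names `project` and `reconstruct` are illustrative assumptions. Because the quoted eigenvectors are rounded to two decimals, even the two-component result is only close to the original row.

```python
def project(row, directions):
    """Coordinates of a mean-adjusted row along each kept direction."""
    return [sum(v * w for v, w in zip(row, d)) for d in directions]

def reconstruct(scores, directions, mean):
    """Map scores back to the original axes and add the mean back."""
    restored = list(mean)
    for score, direction in zip(scores, directions):
        for k, weight in enumerate(direction):
            restored[k] += score * weight
    return restored

mean = (24.1, 53.8)            # Mean1, Mean2 from the example
e1, e2 = (-0.98, -0.21), (0.21, -0.98)
adjusted_row = (14.9, 20.25)   # second row of the example (original values 39, 74)

one_component = reconstruct(project(adjusted_row, [e2]), [e2], mean)
both_components = reconstruct(project(adjusted_row, [e1, e2]), [e1, e2], mean)
print([round(v, 2) for v in one_component])    # approximation from e2 only
print([round(v, 2) for v in both_components])  # close to the original (39, 74)
```

Keeping only e2 stores one number per row instead of two, at the cost of the detail that lay along e1.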
Principal components
• General about principal components
– summary variables
– linear combinations of the original variables
– uncorrelated with each other
– capture as much of the original variance as possible
