Principal Component Analysis (PCA) Explained - Built in
A Step-by-Step Explanation of
Principal Component Analysis
(PCA)
Learn how to use a PCA when working with large data sets.
UPDATED BY
Brennan Whitfield | Feb 23, 2024
3/23/24, 6:56 PM Principal Component Analysis (PCA) Explained | Built In
into too much detail. Below we cover how principal component analysis works in a
simple step-by-step way, so everyone can understand it and make use of it — even
those without a strong mathematical background.
Principal component analysis can be broken down into five steps. I’ll go through each
step, providing logical explanations of what PCA is doing and simplifying
mathematical concepts such as standardization, covariance, eigenvectors and
eigenvalues without focusing on how to compute them.
of variables into a smaller one that still contains most of the information in the large
set.
Reducing the number of variables of a data set naturally comes at the expense of
accuracy, but the trick in dimensionality reduction is to trade a little accuracy for
simplicity. Smaller data sets are easier to explore and visualize, and machine learning
algorithms can analyze them faster without extraneous variables to process.
So, to sum up, the idea of PCA is simple: reduce the number of variables of a
data set, while preserving as much information as possible.
Organizing information in principal components this way allows you to reduce
dimensionality without losing much information: you discard the components with
low information and treat the remaining components as your new variables.
An important thing to realize here is that the principal components are less
interpretable and don’t have any real meaning since they are constructed as linear
combinations of the initial variables.
Geometrically speaking, principal components represent the directions of the data that
explain a maximal amount of variance, that is to say, the lines that capture the most
information in the data. The relationship between variance and information here is
that the larger the variance carried by a line, the larger the dispersion of the data
points along it, and the larger the dispersion along a line, the more information it has.
To put all this simply, just think of principal components as new axes that provide the
best angle to see and evaluate the data, so that the differences between the
observations are better visible.
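To make the link between direction and variance concrete, here is a minimal sketch in Python with NumPy. The data, the variable names and the one-degree search grid are assumptions made for illustration. It projects a centered 2-D data set onto many candidate directions and measures the variance along each one.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 2-D data: y is strongly related to x, so the cloud is elongated.
x = rng.normal(size=500)
y = 0.8 * x + 0.3 * rng.normal(size=500)
data = np.column_stack([x, y])
data = data - data.mean(axis=0)  # center each variable on zero

def variance_along(angle_rad):
    """Variance of the data projected onto the unit vector at the given angle."""
    direction = np.array([np.cos(angle_rad), np.sin(angle_rad)])
    return np.var(data @ direction, ddof=1)

angles = np.linspace(0.0, np.pi, 181)  # candidate directions, 1 degree apart
variances = np.array([variance_along(a) for a in angles])
best_angle = angles[np.argmax(variances)]

# The largest eigenvalue of the covariance matrix is the maximum attainable variance.
largest_eigenvalue = np.linalg.eigvalsh(np.cov(data, rowvar=False))[-1]

print(f"best direction: {np.degrees(best_angle):.0f} degrees")
print(f"variance along it: {variances.max():.3f}")
print(f"largest eigenvalue: {largest_eigenvalue:.3f}")
```

The best direction found by this brute-force search carries almost exactly the variance given by the largest eigenvalue of the covariance matrix, which is the quantity PCA computes directly instead of searching.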
The second principal component is calculated in the same way, with the condition that
it is uncorrelated with (i.e., perpendicular to) the first principal component and that it
accounts for the next highest variance.
This continues until a total of p principal components have been calculated, equal to
the original number of variables.
STEP 1: STANDARDIZATION
The aim of this step is to standardize the range of the continuous initial variables so
that each one of them contributes equally to the analysis.
More specifically, the reason why it is critical to perform standardization prior to PCA
is that the latter is quite sensitive regarding the variances of the initial variables. That
is, if there are large differences between the ranges of initial variables, those variables
with larger ranges will dominate over those with small ranges (for example, a variable
that ranges between 0 and 100 will dominate over a variable that ranges between 0
and 1), which will lead to biased results. So, transforming the data to comparable
scales can prevent this problem.
Mathematically, this can be done by subtracting the mean and dividing by the
standard deviation for each value of each variable.
Once the standardization is done, all the variables will be transformed to the same
scale.
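As a minimal sketch of this step in Python with NumPy: the data values below are made up, and the use of the sample standard deviation (`ddof=1`) is an assumption, since the text does not say which variant to use.

```python
import numpy as np

# Hypothetical data set: 5 observations (rows) of 3 variables (columns)
# measured on very different scales.
X = np.array([
    [170.0, 65.0, 0.30],
    [182.0, 80.0, 0.55],
    [165.0, 58.0, 0.20],
    [175.0, 72.0, 0.45],
    [160.0, 54.0, 0.10],
])

mean = X.mean(axis=0)          # per-variable mean
std = X.std(axis=0, ddof=1)    # per-variable sample standard deviation (assumption)
Z = (X - mean) / std           # z = (value - mean) / standard deviation

print(Z.mean(axis=0))          # approximately 0 for every variable
print(Z.std(axis=0, ddof=1))   # 1 for every variable
```

After this transformation every column has mean 0 and standard deviation 1, so no variable dominates simply because of its units.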
STEP 2: COVARIANCE MATRIX COMPUTATION
The aim of this step is to understand how the variables of the input data set vary
from the mean with respect to each other, or in other words, to see if there is any
relationship between them. Sometimes variables are so highly correlated that they
contain redundant information. To identify these correlations, we compute the
covariance matrix.
What do the covariances that we have as entries of the matrix tell us about
the correlations between the variables?
Now that we know that the covariance matrix is no more than a table that summarizes
the correlations between all the possible pairs of variables, let’s move to the next step.
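A short sketch of the covariance matrix in Python with NumPy may help here. The three variables are synthetic and invented for illustration. It relies on the standard reading of the sign: a positive covariance means two variables tend to increase or decrease together, and a negative one means one tends to increase when the other decreases.

```python
import numpy as np

# Hypothetical data: rows are observations, columns are the variables x, y, z.
rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = x + 0.5 * rng.normal(size=200)     # moves with x    -> positive covariance
z = -x + 0.5 * rng.normal(size=200)    # moves against x -> negative covariance
data = np.column_stack([x, y, z])

# rowvar=False tells NumPy that each column (not each row) is a variable.
cov = np.cov(data, rowvar=False)       # symmetric 3 x 3 matrix

print(np.round(cov, 2))
```

The diagonal holds the variance of each variable, and the matrix is symmetric because the covariance of x with y equals the covariance of y with x.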
STEP 3: COMPUTE THE EIGENVECTORS AND EIGENVALUES OF THE COVARIANCE MATRIX
Eigenvectors and eigenvalues are the linear algebra concepts that we need to compute
from the covariance matrix in order to determine the principal components of the
data.
What you first need to know about eigenvectors and eigenvalues is that they always
come in pairs, so that every eigenvector has an eigenvalue. Also, their number is equal
to the number of dimensions of the data. For example, for a 3-dimensional data set,
there are 3 variables, therefore there are 3 eigenvectors with 3 corresponding
eigenvalues.
Eigenvectors and eigenvalues are behind all the magic of principal components: the
eigenvectors of the covariance matrix are actually the directions of the axes where
there is the most variance (most information), and these are what we call principal
components. Eigenvalues are simply the coefficients attached to eigenvectors, and
they give the amount of variance carried by each principal component.
By ranking your eigenvectors in order of their eigenvalues, highest to lowest, you get
the principal components in order of significance.
Let’s suppose that our data set is 2-dimensional with 2 variables x,y and that the
eigenvectors and eigenvalues of the covariance matrix are as follows:
If we rank the eigenvalues in descending order, we get λ1>λ2, which means that the
eigenvector that corresponds to the first principal component (PC1) is v1 and the one
that corresponds to the second principal component (PC2) is v2.
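The ranking can be sketched in Python with NumPy as follows. The 2 x 2 covariance matrix below is made up for illustration and is not the one from the article's example.

```python
import numpy as np

# Hypothetical covariance matrix for two variables x and y (made-up values).
cov = np.array([
    [1.0, 0.8],
    [0.8, 1.0],
])

# eigh is intended for symmetric matrices; it returns eigenvalues in ascending
# order and the matching unit-length eigenvectors as columns.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

order = np.argsort(eigenvalues)[::-1]   # indices from largest to smallest eigenvalue
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]   # column i now pairs with eigenvalues[i]

v1, v2 = eigenvectors[:, 0], eigenvectors[:, 1]
print("lambda1, lambda2:", eigenvalues)
print("v1 (PC1 direction):", v1)
print("v2 (PC2 direction):", v2)
print("share of variance:", eigenvalues / eigenvalues.sum())
```

For this made-up matrix the eigenvalues are 1.8 and 0.2, so PC1 carries 90 percent of the total variance and PC2 the remaining 10 percent. The sign of each eigenvector is arbitrary: v1 and -v1 describe the same axis.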
STEP 4: CREATE A FEATURE VECTOR
As we saw in the previous step, computing the eigenvectors and ordering them by their
eigenvalues in descending order allows us to find the principal components in order of
significance. In this step, we choose whether to keep all these components or discard
those of lesser significance (those with low eigenvalues), and form with the remaining
ones a matrix of vectors that we call the feature vector.
So, the feature vector is simply a matrix that has as columns the eigenvectors of the
components that we decide to keep. This makes it the first step towards dimensionality
reduction, because if we choose to keep only p eigenvectors (components) out of n, the
final data set will have only p dimensions.
Continuing with the example from the previous step, we can either form a feature
vector with both of the eigenvectors v1 and v2:
Or discard the eigenvector v2, which is the one of lesser significance, and form a
feature vector with v1 only:
So, as we saw in the example, it’s up to you to choose whether to keep all the
components or discard the ones of lesser significance, depending on what you are
looking for. If you just want to describe your data in terms of new variables
(principal components) that are uncorrelated, without seeking to reduce
dimensionality, leaving out less significant components is not needed.
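In code, forming the feature vector amounts to selecting columns. The sketch below uses Python with NumPy. The eigenvector values and the helper name `feature_vector` are hypothetical.

```python
import numpy as np

# Hypothetical eigenvectors, already sorted by eigenvalue (largest first);
# each column is one eigenvector: column 0 is v1, column 1 is v2.
eigenvectors = np.array([
    [0.70710678, -0.70710678],
    [0.70710678,  0.70710678],
])

def feature_vector(vectors, k):
    """Keep the first k columns, i.e. the k most significant components."""
    return vectors[:, :k]

keep_all = feature_vector(eigenvectors, 2)   # both v1 and v2, shape (2, 2)
keep_one = feature_vector(eigenvectors, 1)   # v1 only, shape (2, 1)

print(keep_all.shape, keep_one.shape)
```

Keeping both columns only re-expresses the data on new axes, while keeping a single column is what actually reduces the number of dimensions.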
STEP 5: RECAST THE DATA ALONG THE PRINCIPAL COMPONENT AXES
In the previous steps, apart from standardization, you do not make any changes to the
data. You just select the principal components and form the feature vector, but the
input data set always remains in terms of the original axes (i.e., in terms of the initial
variables).
In this step, which is the last one, the aim is to use the feature vector formed from the
eigenvectors of the covariance matrix to reorient the data from the original axes to the
ones represented by the principal components (hence the name Principal Component
Analysis). This can be done by multiplying the transpose of the feature vector by the
transpose of the standardized data set.
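The reorientation can be sketched as a single matrix product, here in Python with NumPy, with the transposed feature vector on the left and the transposed standardized data on the right. The data is synthetic, and the earlier steps are condensed into a few lines so the block runs on its own.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical data: 100 observations (rows) of 2 correlated variables (columns).
x = rng.normal(size=100)
raw = np.column_stack([x, 2.0 * x + rng.normal(size=100)])

# Steps 1 to 4 condensed: standardize, covariance, sorted eigenvectors, keep both.
Z = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(Z, rowvar=False))
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
feature_vector = eigenvectors[:, order]

# Step 5: (feature vector)^T times (standardized data)^T.
# The result has one row per component and one column per observation.
final = feature_vector.T @ Z.T
scores = final.T    # one row per observation; equal to Z @ feature_vector

# The new variables are uncorrelated and their variances are the eigenvalues.
print(np.round(np.cov(scores, rowvar=False), 6))
```

The covariance matrix of the recast data is diagonal, which confirms that the new variables are uncorrelated. Passing only the first column of `feature_vector` would yield a one-dimensional data set instead.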