Pca d1

Principal Component Analysis (PCA) is a technique used to reduce the dimensionality of complex data by identifying a new set of orthogonal dimensions ranked by variance. It involves calculating the covariance matrix, eigenvectors, and eigenvalues to transform the original data into a lower-dimensional space while preserving important patterns. The goal of PCA is to find linearly independent dimensions that can effectively represent the data without losing significant information.

Uploaded by

fly.high.1262
Copyright
© © All Rights Reserved

1 Principal Component Analysis

In real-world data analysis tasks we often work with complex, multi-dimensional data. We plot the
data to find patterns in it, or use it to train machine learning models. One way to think about
dimensions: if we treat a data point x as a physical object, then the dimensions are simply the
viewpoints from which we observe it, such as where the point lies along the horizontal axis or
along the vertical axis.
As the number of dimensions increases, the data becomes harder to visualize and more expensive
to compute with. So, how do we reduce the dimensions of a data set?
• Remove the redundant dimensions
• Only keep the most important dimensions
Let us first define some terms.
Variance: a measure of variability, i.e. how spread out the data set is. Mathematically, it is the
average squared deviation from the mean. We use the following formula to compute the variance
var(x):
\[ \mathrm{var}(x) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N} \]
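
As a small illustration of the variance formula, here is a minimal sketch in plain Python. The function name `variance` and the sample list `heights` are made up for this example; the function divides by N, as in the definition above.

```python
def variance(values):
    """Population variance: the average squared deviation from the mean."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


# Hypothetical sample: the mean is 5.0 and the squared deviations sum to 32.
heights = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(variance(heights))  # 4.0
```

A data set whose values are all equal has variance 0, since no value deviates from the mean.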

Covariance: a measure of the extent to which corresponding elements from two sets of ordered
data move in the same direction. The covariance of x and y, denoted cov(x, y), is:
\[ \mathrm{cov}(x, y) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N} \]
• Here, x_i and y_i are the i-th values of x and y, and N is the number of values.
• x̄ and ȳ denote the corresponding mean values.
• One way to read the covariance is as a measure of how interrelated two data sets are.

Positive covariance means X and Y are positively related, i.e. as X increases Y also tends to
increase. Negative covariance indicates the opposite relation. Zero covariance means X and Y have
no linear relationship.
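
The sign of the covariance can be checked on a few toy series. The sketch below is only an illustration in plain Python; the function name `covariance` and all the data are hypothetical. The last series depends on x exactly (it is (x - 3)^2) and still has zero covariance, which shows that zero covariance only rules out a linear relationship.

```python
def covariance(xs, ys):
    """Population covariance: the average product of paired deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


x = [1.0, 2.0, 3.0, 4.0, 5.0]
rising = [2.0, 4.0, 6.0, 8.0, 10.0]     # grows with x
falling = [10.0, 8.0, 6.0, 4.0, 2.0]    # shrinks as x grows
symmetric = [4.0, 1.0, 0.0, 1.0, 4.0]   # (x - 3)^2: dependent on x, but not linearly

print(covariance(x, rising))     # 4.0
print(covariance(x, falling))    # -4.0
print(covariance(x, symmetric))  # 0.0
```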
Now let's think about what data analysis requires.
Since we are trying to find patterns in the data, we want the data to be spread out across each
dimension. We also want the dimensions to be independent of one another. So if the data has high
covariance among some n of its dimensions, we replace those dimensions with linear combinations
of them. The data is then described by those linear combinations of the related dimensions rather
than by the original ones. (related = have high covariance)

So, what does Principal Component Analysis (PCA) do?
PCA finds a new set of dimensions (or a new set of viewpoints) such that all the dimensions are
orthogonal (and hence linearly independent) and ranked according to the variance of the data along
them. This means the more important principal axes come first. (more important = more variance/more
spread-out data)

How does PCA work?

• Calculate the covariance matrix of the data points X.
• Calculate its eigenvectors and the corresponding eigenvalues.
• Sort the eigenvectors by their eigenvalues in decreasing order.
• Choose the first k eigenvectors; these will be the new k dimensions.
• Transform the original n-dimensional data points into k dimensions.
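
The steps above can be sketched with NumPy as follows. This is a minimal sketch under stated assumptions, not a reference implementation: the function name `pca`, the mixing matrix `A`, and the random data are invented for illustration, and the data matrix is assumed to hold one data point per column (rows are dimensions).

```python
import numpy as np


def pca(X, k):
    """Project the columns of X (one data point per column) onto the top-k principal axes."""
    X_centered = X - X.mean(axis=1, keepdims=True)        # subtract the mean of each dimension
    C = X_centered @ X_centered.T / (X.shape[1] - 1)       # covariance matrix
    eigenvalues, eigenvectors = np.linalg.eigh(C)          # eigh: for symmetric matrices
    order = np.argsort(eigenvalues)[::-1]                  # sort by decreasing eigenvalue
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    W = eigenvectors[:, :k]                                # first k eigenvectors as columns
    Y = W.T @ X_centered                                   # transformed data, shape (k, n_points)
    return Y, W, eigenvalues


# Hypothetical data: 200 points in 3 dimensions with correlated coordinates.
rng = np.random.default_rng(0)
A = np.array([[3.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 0.1]])
X = A @ rng.normal(size=(3, 200))

Y, W, eigenvalues = pca(X, k=2)
print(Y.shape)  # (2, 200)
```

`np.linalg.eigh` is used here because the covariance matrix is symmetric; it returns eigenvalues in ascending order, hence the explicit re-sorting.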

To understand the detailed working of PCA, we need to know about eigenvalues and eigenvectors.

Eigenvectors: the directions along which our data are dispersed.
Eigenvalues: the relative importance of these directions, i.e. the variance of the data along each one.

[Covariance matrix] · [Eigenvector] = [Eigenvalue] · [Eigenvector]

Writing C for the covariance matrix, v for an eigenvector and λ for its eigenvalue:
\[ C v = \lambda v \]
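
The eigenvalue relation can be checked numerically. The sketch below assumes NumPy and uses a small hand-picked symmetric matrix `C` as a stand-in for a covariance matrix (the values are hypothetical); its eigenvalues are 1 and 3.

```python
import numpy as np

# Hypothetical 2 x 2 covariance matrix: variance 2 on each axis, covariance 1 between them.
C = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigenvalues, eigenvectors = np.linalg.eigh(C)   # eigenvectors are the columns

for i in range(len(eigenvalues)):
    v = eigenvectors[:, i]
    lam = eigenvalues[i]
    # Multiplying v by C only rescales it by lam; the direction is unchanged.
    print(np.allclose(C @ v, lam * v))
```

The eigenvector belonging to the larger eigenvalue points along the diagonal (1, 1), the direction in which data with this covariance would be most spread out.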


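Let's look at what a covariance matrix is.
The covariance matrix of a data set in 5 dimensions a, b, c, d, e:
\[
\begin{bmatrix}
V_a & C_{a,b} & C_{a,c} & C_{a,d} & C_{a,e} \\
C_{a,b} & V_b & C_{b,c} & C_{b,d} & C_{b,e} \\
C_{a,c} & C_{b,c} & V_c & C_{c,d} & C_{c,e} \\
C_{a,d} & C_{b,d} & C_{c,d} & V_d & C_{d,e} \\
C_{a,e} & C_{b,e} & C_{c,e} & C_{d,e} & V_e
\end{bmatrix}
\]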

Va: variance along dimension a
Ca,b: covariance between dimensions a and b
If we have a matrix X of size m × n that holds n data points of m dimensions, then the
covariance matrix can be calculated as
\[ C_x = \frac{1}{n-1} (X - \bar{X})(X - \bar{X})^T \]
where X̄ holds the mean of each dimension (each row of X), repeated across the n columns.
It is important to note that the covariance matrix contains:
• the variances of the dimensions as the main diagonal elements.
• the covariances between dimensions as the off-diagonal elements.

Also, the covariance matrix is symmetric (as can be seen in the matrix above).

As discussed earlier, we want the data to be spread out, i.e. it should have high variance along
each dimension. We also want to remove correlated dimensions, i.e. the covariance between
dimensions should be zero (they should be uncorrelated).
Therefore, our covariance matrix should have:
• large numbers as the main diagonal elements.
• zero values as the off-diagonal elements.

Such a matrix is called a diagonal matrix. So, we have to transform the original data points such
that their covariance matrix is diagonal.
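
A short derivation shows why projecting onto the eigenvectors achieves this. The symbols W, Λ, Y and C_y are introduced here for illustration: W has the orthonormal eigenvectors of C_x as its columns, Λ is the diagonal matrix of the corresponding eigenvalues, and Y is the transformed data.

```latex
\begin{aligned}
C_x W &= W \Lambda, \qquad W^T W = I \\
Y &= W^T (X - \bar{X}) \\
C_y &= \frac{1}{n-1} Y Y^T
     = W^T \left[ \frac{1}{n-1} (X - \bar{X})(X - \bar{X})^T \right] W
     = W^T C_x W
     = W^T W \Lambda
     = \Lambda
\end{aligned}
```

So in the new coordinates the off-diagonal covariances vanish, and the variance along each new dimension is the corresponding eigenvalue.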
Always normalize your data before doing PCA: if the features are on different scales, we get
misleading components. When features are on different scales, we can also simply use the
correlation matrix instead of the covariance matrix.
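
The effect of scale can be seen in a small NumPy sketch. The features `height_m` and `weight_g` and their distributions are invented for illustration. Without normalization the first principal direction points almost entirely along the feature with the larger numbers; after standardizing each feature, the covariance matrix equals the correlation matrix and both features contribute.

```python
import numpy as np

# Hypothetical features on very different scales: height in metres, weight in grams.
rng = np.random.default_rng(2)
height_m = rng.normal(1.7, 0.1, size=100)
weight_g = 40000 * height_m + rng.normal(0, 5000, size=100)
X = np.vstack([height_m, weight_g])               # shape (2, 100): rows are features

# First principal direction of the raw data (eigenvector of the largest eigenvalue).
raw_direction = np.linalg.eigh(np.cov(X))[1][:, -1]

# Standardize each feature: zero mean and unit (sample) standard deviation.
Z = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, ddof=1, keepdims=True)
std_direction = np.linalg.eigh(np.cov(Z))[1][:, -1]

print(np.abs(raw_direction))                      # almost all weight on the second feature
print(np.abs(std_direction))                      # roughly equal weight on both features
print(np.allclose(np.cov(Z), np.corrcoef(X)))     # True
```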
This defines the goal of PCA:-
1. Find linearly independent dimensions which can losslessly represent the data points.
2. Those newly found dimensions should allow us to predict/reconstruct the original dimensions.
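
The reconstruction goal can be illustrated with a NumPy sketch. The data, the mixing matrix `A`, and the helper name `reconstruction_error` are all hypothetical. Keeping every eigenvector reproduces the original points up to rounding error; keeping only the first k gives an approximation whose error shrinks as k grows.

```python
import numpy as np

# Hypothetical data: 100 points in 3 dimensions, one point per column.
rng = np.random.default_rng(3)
A = np.array([[2.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.5, 0.5, 0.1]])
X = A @ rng.normal(size=(3, 100))
X_bar = X.mean(axis=1, keepdims=True)

eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X))
order = np.argsort(eigenvalues)[::-1]             # decreasing eigenvalue
eigenvectors = eigenvectors[:, order]


def reconstruction_error(k):
    """Mean squared error after projecting onto k components and mapping back."""
    W = eigenvectors[:, :k]
    Y = W.T @ (X - X_bar)                         # reduced representation, shape (k, 100)
    X_hat = W @ Y + X_bar                         # back in the original 3 dimensions
    return float(np.mean((X - X_hat) ** 2))


errors = [reconstruction_error(k) for k in (1, 2, 3)]
print(errors)  # decreasing; the last entry is zero up to rounding
```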
