0% found this document useful (0 votes)
12 views5 pages

Understanding Principal Component Analysis

Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
12 views5 pages

Understanding Principal Component Analysis

Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

PCA

Understanding Principal Component Analysis (PCA)

Introduction: Principal Component Analysis (PCA) is a powerful statistical technique


widely used in data analysis and dimensionality reduction. It helps to identify
patterns in data, uncover underlying structures, and simplify complex datasets.

Key Concepts:

1. Dimensionality Reduction: PCA aims to reduce the number of variables in a dataset


while preserving the most important information. It accomplishes this by
transforming the original variables into a new set of variables, called principal
components, which are linear combinations of the original variables.
2. Principal Components: These components are orthogonal to each other, meaning
they are uncorrelated. The first principal component captures the maximum variance
in the data, the second principal component captures the maximum remaining
variance orthogonal to the first, and so on.
3. Variance Explained: PCA provides a way to quantify the amount of variance
explained by each principal component. This information is crucial for understanding
how much information is retained after dimensionality reduction.
4. Eigenvalues and Eigenvectors: PCA involves calculating the eigenvalues and
eigenvectors of the covariance matrix of the original data. Eigenvalues represent the
amount of variance explained by each principal component, while eigenvectors
represent the direction of the principal components.

Steps in PCA:

1. Standardization: It is essential to standardize the variables before performing PCA


to ensure that each variable contributes equally to the analysis.
2. Covariance Matrix: Calculate the covariance matrix of the standardized data.
3. Eigenvalue Decomposition: Compute the eigenvalues and eigenvectors of the
covariance matrix.
4. Selection of Principal Components: Decide on the number of principal
components to retain based on the explained variance and the application
requirements.
5. Projection: Transform the original data onto the new coordinate system defined by
the selected principal components.

Applications:

1. Data Compression: PCA is used to reduce the dimensionality of large datasets while
retaining most of the important information, which is beneficial for efficient storage
and processing.
2. Pattern Recognition: PCA is applied in fields such as image processing and
computer vision to identify patterns and extract features from high-dimensional data.
3. Exploratory Data Analysis: PCA helps in visualizing and exploring the underlying
structure of data, making it easier to interpret complex datasets.

Conclusion: Principal Component Analysis is a valuable tool for exploratory data


analysis, dimensionality reduction, and pattern recognition. By transforming high-
dimensional data into a lower-dimensional space, PCA enables better understanding
and visualization of complex datasets, making it an indispensable technique in
various fields of science and engineering.

Let's consider an example of using PCA in the context of a dataset containing information
about different types of fruits based on various attributes such as weight, colour, diameter,
and sweetness level.

Suppose we have a dataset with the following attributes for each fruit:

1. Weight (in grams)


2. Colour (RGB values)
3. Diameter (in centimetres)
4. Sweetness level (measured on a scale from 1 to 10)

We want to perform PCA to reduce the dimensionality of this dataset and identify the most
important factors that contribute to the variability among the fruits.

Step 1: Standardization First, we standardize the attributes to ensure that each variable has
a mean of 0 and a standard deviation of 1. This step is crucial for PCA.

Step 2: Covariance Matrix Next, we calculate the covariance matrix of the standardized
data. The covariance matrix represents the relationships between the different attributes.

Step 3: Eigenvalue Decomposition We compute the eigenvalues and eigenvectors of the


covariance matrix. The eigenvalues represent the amount of variance explained by each
principal component, and the eigenvectors represent the direction of the principal
components.

Step 4: Selection of Principal Components Based on the eigenvalues, we decide on the


number of principal components to retain. We may choose to retain only the principal
components that explain a significant amount of variance in the data.

Step 5: Projection Finally, we transform the original data onto the new coordinate system
defined by the selected principal components. This gives us a lower-dimensional
representation of the dataset.

For example, after performing PCA, we might find that the first principal component is
primarily influenced by attributes related to size (weight and diameter), while the second
principal component is influenced by attributes related to colour and sweetness level.
This reduced representation allows us to analyse and visualize the dataset more effectively,
identifying patterns and similarities among different types of fruits based on the most
important factors.

Let's create a simplified numerical example to illustrate PCA with a small dataset containing
information about three types of fruits: apples, oranges, and bananas. We'll consider two
attributes for each fruit: weight (in grams) and diameter (in centimetres).

Our dataset looks like this:

Fruit Weight Diameter

(gm) (cm)

Apple 100 5

Apple 120 6

Orange 150 7

Orange 140 6.5

Banana 90 4

Banana 110 4.5

Step 1: Standardization

We standardize the data by subtracting the mean and dividing by the s.d. for each attribute.

Standardized Weight (gm) = (weight – Mean(Weight)) / S.D.(Weight)

Standardized Diameter (cm) = (Diameter – Mean(Diameter) / S.D.(Diameter)

Let's assume:
 Mean(Weight) = 120 gm
 S.D.(Weight) = 20 gm
 Mean(Diameter) = 5.5 cm
 S.D.(Diameter) = 1 cm
After standardization, our dataset becomes:

Fruit Standardized Standardized


Weight Diameter

Apple -1.0 -0.5

Apple 0.0 0.5

Orange 1.0 1.5

Orange 0.5 1.0

Banana -1.0 -1.5

Banana 0.0 -1.0

Step 2: Covariance Matrix

Next, we calculate the covariance matrix of the standardized data:

Covariance Weight Diameter


Weight 1.33 1.26
Diameter 1.26 1.33

Covariance Matrix: Weight Diameter ----------------------------------------- Weight 1.33 1.26 Diameter 1.26 1.33
Step 3: Eigenvalue Decomposition

We find the eigenvalues and eigenvectors of the covariance matrix:

 Eigenvalues: λ₁ ≈ 2.59, λ₂ ≈ 0.07


 Eigenvectors: v₁ ≈ [0.71, 0.71], v₂ ≈ [-0.71, 0.71]

Step 4: Selection of Principal Components

Since the first eigenvalue (λ₁) is much larger than the second eigenvalue (λ₂), we retain the
first principal component.

Step 5: Projection
We project the standardized data onto the first principal component:

Projected Data 1D):

Fruit Projected Value

Apple -1.0

Apple 0.35

Orange 1.42

Orange 0.71

Banana -1.42

Banana 0.35

These projected values represent the fruits' positions along the first principal component axis,
effectively reducing the dataset's dimensionality while preserving the most important
information.

You might also like