Dimensionality Reduction
Principal Component Analysis
Dimensionality Reduction
• The complexity of any classifier or regressor depends on the number of
inputs.
• This determines both the time and space complexity and the necessary
number of training examples to train such a classifier or regressor.
• In many learning problems, the datasets have a large number of variables.
Sometimes, the number of variables is larger than the number of
observations.
• For example, such situations arise in many scientific fields, such as
image processing.
• Statistical and machine learning methods have some difficulty when
dealing with such high-dimensional data.
• Normally the number of input variables is reduced before the machine
learning algorithms can be successfully applied.
• In statistical and machine learning, dimensionality reduction
or dimension reduction is the process of reducing the number
of variables under consideration by obtaining a smaller set of
principal variables.
• Dimensionality reduction may be implemented in two ways.
• Feature selection:
– In feature selection, we are interested in finding k of the total of n
features that give us the most information and we discard the other
(n−k) dimensions.
• Feature extraction:
– In feature extraction, we are interested in finding a new set of k features
that are combinations of the original n features.
– These methods may be supervised or unsupervised depending on
whether or not they use the output information.
– The best known and most widely used feature extraction methods are
Principal Component Analysis (PCA) and Linear Discriminant Analysis
(LDA). Both are linear projection methods; PCA is unsupervised and LDA
is supervised.
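The difference between the two approaches can be sketched in a few lines of NumPy. This is only an illustration on synthetic data: the selection criterion (keep the k columns with the largest variance) and the random projection matrix W are assumptions made for the example, not methods prescribed by these slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 5, 2  # n original features, k features to keep

# Synthetic data: 100 observations, columns with very different spreads.
X = rng.normal(size=(100, n)) * np.array([1.0, 5.0, 0.5, 3.0, 0.1])

# Feature selection: keep k of the original columns and discard the other n - k.
# Assumed criterion for this sketch: the k columns with the largest variance.
keep = np.sort(np.argsort(X.var(axis=0))[-k:])
X_selected = X[:, keep]

# Feature extraction: each new feature is a linear combination of all n
# original features. W is an arbitrary matrix here; PCA and LDA are ways of
# choosing W well.
W = rng.normal(size=(n, k))
X_extracted = X @ W

print(keep)               # indices of the original columns that were kept
print(X_selected.shape)   # (100, 2)
print(X_extracted.shape)  # (100, 2)
```

Both results have k columns, but only the selected features are original variables; the extracted features mix all of them.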
Measures of error
• In regression problems, we may use the Mean Squared Error (MSE) or the
Root Mean Squared Error (RMSE) as the measure of error.
• MSE is the sum, over all the data points, of the square of the difference
between the predicted and actual target values, divided by the number
of data points. RMSE is the square root of the MSE.
• In classification problems, we may use the misclassification rate as a
measure of the error. This is defined as follows:
• misclassification rate = (no. of misclassified examples) / (total no. of examples)
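As a minimal sketch, the three error measures can be computed with NumPy as follows. The function names and the small arrays are made up for illustration.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: average of the squared prediction errors."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_pred - y_true) ** 2))

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the MSE."""
    return float(np.sqrt(mse(y_true, y_pred)))

def misclassification_rate(labels_true, labels_pred):
    """Fraction of examples whose predicted label differs from the true label."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    return float(np.mean(labels_true != labels_pred))

# Regression example: squared errors are 1, 0, 4, 0, so MSE = 5 / 4 = 1.25.
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 7.0]
print(mse(y_true, y_pred))
print(rmse(y_true, y_pred))

# Classification example: 2 of the 5 labels are wrong, so the rate is 0.4.
labels_true = [0, 1, 1, 0, 1]
labels_pred = [0, 1, 0, 0, 0]
print(misclassification_rate(labels_true, labels_pred))
```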
Why is dimensionality reduction useful?
• In most learning algorithms, the complexity depends on the number of
input dimensions, d, as well as on the size of the data sample, N. To
reduce memory and computation, we are interested in reducing the
dimensionality of the problem.
• Decreasing d also decreases the complexity of the inference algorithm
during testing.
• When we decide that an input is unnecessary, we save the cost of
extracting it.
• Simpler models are more robust on small datasets. Simpler models have
less variance, that is, they vary less depending on the particulars of a
sample, including noise and outliers.
• When data can be explained with fewer features, we get a better idea
about the process that underlies the data, which allows knowledge
extraction.
• When data can be represented in a few dimensions without loss of
information, it can be plotted and analyzed visually for structure and
outliers.
• The curse of dimensionality refers to the exponential growth in the
volume of the input space, and hence in the amount of data needed to
cover it, as the number of dimensions increases. As the number of
dimensions of a dataset increases, it becomes more and more difficult to
process it. Dimensionality reduction is one way to mitigate the curse of
dimensionality.
Subset selection
• In machine learning, subset selection (also called feature selection,
variable selection, or attribute selection) is the process of
selecting a subset of relevant features (variables, predictors) for use in
model construction.
• Feature selection techniques are used for four reasons:
– simpler models that are easier for researchers and users to interpret
– shorter training times
– avoiding the curse of dimensionality
– better generalization by reducing overfitting
Principal component analysis
• Principal Component Analysis, or PCA, is a dimensionality-reduction
method that is often used to reduce the dimensionality of large data sets,
by transforming a large set of variables into a smaller one that still
contains most of the information in the large set.
• Smaller data sets are easier to explore and visualize, and machine
learning algorithms can analyze them more easily and quickly when there
are no extraneous variables to process.
• To sum up, the idea of PCA is simple: reduce the number of variables
of a data set while preserving as much information as possible.
STEP BY STEP EXPLANATION OF PCA
• STEP 1: STANDARDIZATION
• The aim of this step is to standardize the range of the continuous initial
variables so that each one of them contributes equally to the analysis.
• If there are large differences between the ranges of the initial variables,
the variables with larger ranges will dominate those with smaller ranges
(for example, a variable that ranges between 0 and 100 will dominate
a variable that ranges between 0 and 1), which will lead to biased
results. Transforming the data to comparable scales prevents this
problem.
• Standardization subtracts the mean of each variable from its values and
divides by its standard deviation. Once it is done, all the variables are on
the same scale.
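A minimal sketch of the standardization step, assuming the usual z-score (subtract the column mean, divide by the column standard deviation). The function name and the toy data are illustrative only.

```python
import numpy as np

def standardize(X):
    """Rescale each column to zero mean and unit standard deviation."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)  # population standard deviation (ddof=0)
    # Note: a constant column has std == 0 and would need special handling.
    return (X - mean) / std

# Two variables on very different scales: roughly 0..100 and 0..1.
X = np.array([[10.0, 0.1],
              [50.0, 0.5],
              [90.0, 0.9],
              [30.0, 0.3]])

Z = standardize(X)
print(Z.mean(axis=0))  # approximately 0 for every column
print(Z.std(axis=0))   # approximately 1 for every column
```

After this step the two columns are directly comparable, so neither dominates the covariance computation merely because of its units.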
• STEP 2: COVARIANCE MATRIX COMPUTATION
• The aim of this step is to understand how the variables of the input data
set vary from the mean with respect to each other, or in other
words, to see if there is any relationship between them.
• Sometimes variables are so highly correlated that they
contain redundant information.
• To identify these correlations, we compute the covariance
matrix.
• What do the covariances that we have as entries of the matrix tell us
about the correlations between the variables?
• The sign of the covariance gives the direction of the relationship:
• if positive, the two variables increase or decrease together
(correlated)
• if negative, one increases when the other decreases (inversely
correlated)
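The covariance matrix can be sketched as follows. The data is a made-up example with three variables, where the second rises with the first and the third falls; the sample covariance (dividing by the number of observations minus one) is an assumption of this sketch.

```python
import numpy as np

# Rows are observations, columns are variables.
X = np.array([[1.0, 2.0, 9.0],
              [2.0, 4.1, 7.0],
              [3.0, 5.9, 5.5],
              [4.0, 8.2, 3.0],
              [5.0, 9.8, 1.0]])

# Center each variable on its mean, then average the products of deviations.
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / (X.shape[0] - 1)

# NumPy's built-in gives the same matrix when columns are the variables.
assert np.allclose(cov, np.cov(X, rowvar=False))

print(cov)
print(cov[0, 1] > 0)  # True: variables 0 and 1 move together
print(cov[0, 2] < 0)  # True: variable 2 decreases as variable 0 increases
```

The matrix is symmetric, with the variance of each variable on the diagonal and the pairwise covariances off the diagonal.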
• STEP 3: COMPUTE THE EIGENVECTORS AND EIGENVALUES OF THE
COVARIANCE MATRIX TO IDENTIFY THE PRINCIPAL COMPONENTS
• Eigenvectors and eigenvalues are the linear algebra concepts that we need
to compute from the covariance matrix in order to determine
the principal components of the data.
• Principal components are new variables that are constructed as linear
combinations or mixtures of the initial variables.
• These combinations are done in such a way that the new variables (i.e.,
principal components) are uncorrelated and most of the information
within the initial variables is squeezed or compressed into the first
components.
• So, the idea is that 10-dimensional data gives you 10 principal components, but
PCA tries to put the maximum possible information in the first component,
then the maximum remaining information in the second, and so on, until
you have something like the scree plot below.
[Figure: scree plot of the variance carried by each principal component]
• Organizing information in principal components this way allows you to
reduce dimensionality without losing much information, by
discarding the components with low information and treating the
remaining components as your new variables.
• Geometrically speaking, principal components represent the directions of
the data that explain a maximal amount of variance, that is to say, the
lines that capture the most information in the data.
• The relationship between variance and information here is that the larger
the variance carried by a line, the larger the dispersion of the data points
along it, and the larger the dispersion along a line, the more
information it carries.
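Step 3 can be sketched with NumPy's symmetric eigensolver. The synthetic data, the choice of k = 2, and all variable names are assumptions made for this illustration; the eigenvalues play the role of the variance carried by each component, which is what a scree plot displays.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 3-variable data: three hidden sources with very different
# spreads, mixed together so that the observed variables are correlated.
latent = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.2])
mixing = rng.normal(size=(3, 3))
X = latent @ mixing

# Steps 1 and 2: standardize, then compute the covariance matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
cov = np.cov(Z, rowvar=False)

# Step 3: eigendecomposition. eigh is meant for symmetric matrices and
# returns eigenvalues in ascending order, so reorder them to descending.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]  # column i is the i-th principal direction

# Share of the total variance carried by each component.
explained_ratio = eigenvalues / eigenvalues.sum()
print(explained_ratio)

# Keep the first k components and discard the rest.
k = 2
scores = Z @ eigenvectors[:, :k]
print(scores.shape)  # (200, 2)
```

The columns of `scores` are uncorrelated with each other, and their variances equal the first k eigenvalues.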
HOW PCA CONSTRUCTS THE PRINCIPAL COMPONENTS
• There are as many principal components as there are variables in the
data. They are constructed in such a manner that the first
principal component accounts for the largest possible variance in the data
set.
• Let's see how PCA achieves this through an animation.
• Each blue dot on the plot represents a data point given by its x and y
coordinates.
• A line p (red line) is drawn through the center of the dataset, i.e. through
the mean of x and y.
• Every point on the graph is projected onto this line; the projections are
shown as two sets of points, red and green.
• The spread, or variance, of the data along line p is indicated by the
distance between the two big red points.
• As line p rotates, the distance between the two red points
changes according to the angle between line p and the x-axis.
• The purple lines which join a point and its projection
represent the error which arises when we approximate a
point by its projection.
• PCA creates new variables from old ones.
• If the new variables closely approximate the old variables,
then the approximation error should be small.
• The sum of the squared lengths of all purple lines gives the
total approximation error.
• The angle that minimizes the sum of squared errors also
maximizes the distance between the red points.
• The direction of maximum spread is called the principal axis.
Once we know a principal axis, we subtract the variance along
this principal axis to obtain the remaining variance.
• We apply the same procedure to find the next principal axis
from the residual variance. Apart from being the direction of
maximum remaining variance, the next principal axis must be orthogonal to
the other principal axes.
• Once we have all the principal axes, the dataset is projected
onto these axes. The columns of the projected, or transformed,
dataset are called principal components.
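The rotating-line picture can be reproduced numerically. The sketch below, on made-up 2-D data, sweeps a line through the mean over a grid of angles and records both the spread of the projections and the total squared length of the "purple lines". The half-degree grid and all names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up correlated 2-D data (the "blue dots").
x = rng.normal(size=300)
y = 0.6 * x + 0.3 * rng.normal(size=300)
points = np.column_stack([x, y])
centered = points - points.mean(axis=0)  # line p passes through the mean

angles = np.deg2rad(np.arange(0.0, 180.0, 0.5))
spread = np.empty(len(angles))  # sum of squared positions along line p
error = np.empty(len(angles))   # sum of squared point-to-projection distances

for i, theta in enumerate(angles):
    direction = np.array([np.cos(theta), np.sin(theta)])  # unit vector along p
    coords = centered @ direction             # position of each projection on p
    projections = np.outer(coords, direction)
    residuals = centered - projections        # the "purple lines"
    spread[i] = np.sum(coords ** 2)           # proportional to variance along p
    error[i] = np.sum(residuals ** 2)

best_by_spread = int(np.argmax(spread))
best_by_error = int(np.argmin(error))
best_angle_deg = float(np.rad2deg(angles[best_by_spread]))
print(best_by_spread == best_by_error)  # the same angle wins on both criteria

# First principal axis, and the only direction orthogonal to it in 2-D.
theta = angles[best_by_spread]
first_axis = np.array([np.cos(theta), np.sin(theta)])
second_axis = np.array([-np.sin(theta), np.cos(theta)])

# Projecting onto the axes gives the principal components as columns.
components = centered @ np.column_stack([first_axis, second_axis])
```

For every angle, spread plus error equals the total squared distance of the points from the mean (Pythagoras), which is why maximizing one is the same as minimizing the other.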