0% found this document useful (0 votes)
3 views1 page

Varimax Rotation

The document provides a tutorial on performing factor analysis using Python's scikit-learn library, specifically applied to a food-texture dataset. It explains the process of standardizing data, fitting a factor analysis model, and interpreting the results, including communality and uniqueness of variables. The tutorial emphasizes the importance of understanding the underlying factors and their relationships to effectively interpret the data.

Uploaded by

Sandeepan
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
3 views1 page

Varimax Rotation

The document provides a tutorial on performing factor analysis using Python's scikit-learn library, specifically applied to a food-texture dataset. It explains the process of standardizing data, fitting a factor analysis model, and interpreting the results, including communality and uniqueness of variables. The tutorial emphasizes the importance of understanding the underlying factors and their relationships to effectively interpret the data.

Uploaded by

Sandeepan
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Homepage SOGA Startpage SOGA-R Privacy Policy Accessibility Statement

Search with Google™ ... 

Department of Earth Sciences /

STATISTICS AND GEODATA ANALYSIS USING PYTHON (SOGA-PY)


INTRODUCTION TO PYTHON BASICS OF STATISTICS ADVANCED STATISTICS MACHINE LEARNING

Homepage  SOGA-Py  Advanced statistics  Multivariate approaches  Factor Analysis  A simple example of Factor Analysis in Python

 39 / 100 

A simple example of Factor Analysis in Python

A simple example of factor analysis in Python


In this example we compute a factor analysis, employing the scikit-learn library.

We assume that our data was generated by a linear transformation of a lower dimensional data set, with an overlay of white noise. The factor analysis allows us to retrieve these
underlying factors and thus to lower the dimensionality of our data.

Let's import the needed tools and get going. We will give further explanation along the way.

In [1]: from pandas import read_csv, Series, DataFrame


from [Link] import FactorAnalysis
from [Link] import StandardScaler
import [Link] as plt
import numpy as np

The data set


Let us get our hands dirty and apply a factor analysis on a the food-texture data set. We already discussed the data set in the section on principal component analysis, so you are
probably familiar with the data set.

This open source data set is available here and describes texture measurements of a pastry-type food.

In [2]: food = read_csv("[Link] index_col=0)


[Link]()

Out[2]: Oil Density Crispy Fracture Hardness

B110 16.5 2955 10 23 97

B136 17.7 2660 14 9 139

B171 16.2 2870 12 17 143

B192 16.7 2920 10 31 95

B225 16.3 2975 11 26 143

The data set consists of 50 rows (observations) and 5 columns (features/variables). The features are:

Oil: percentage oil in the pastry


Density: the product’s density (the higher the number, the more dense the product)
Crispy: a crispiness measurement, on a scale from 7 to 15 , with 15 being more crispy.
Fracture: the angle, in degrees, through which the pasty can be slowly bent before it fractures.
Hardness: a sharp point is used to measure the amount of force required before breakage occurs.

The FactorAnalysis class of scikit-learn


The class FactorAnalysis of the scikit-learn package, enables many methods around Factor analysis. When instantiating the class, we can pass it the desired number of factors.

FactorAnalysis(n_components = <factors>)
Replace <factors> with the amount of desired factors.

With this class we can perform a maximum-likelihood factor analysis on a covariance matrix or data matrix, specifying the desired number of factors. By an additional argument
rotation the transformation of the factors may be specified to be either varimax or quartimax , two types of orthogonal rotation or None (default) for no rotation.

Getting a reasonable value for the amount of factors desired is a tricky aspect of factor analysis. If we already have some understanding of the system that created our data, we
may make an make an educated guess about the number of latent variables.

If we do not know much about out data other than that the number of variables is not too large, we may simply try several values to initialize the model. In most cases though and to
do our due dilligence, we should use a more sophisticated approach and perform a principal component analysis (PCA) and get a good initial estimate of the number of factors.

Since we have already explained the PCA, we will not repeat it here and just make a guess and set the number of factor to be factors = 2 . Furthermore, we try the analysis with
the rotation set to varimax and with the default value rotation = None .

Let's compute this and plot it using matplotlib.

In [3]: X = StandardScaler().fit_transform(food) # Standardize the data


factors = 2
# a list of 2 tuples containing titles for and instances of or class
fas = [
("FA no rotation", FactorAnalysis(n_components = factors)),
("FA varimax", FactorAnalysis(n_components = factors, rotation="varimax")),
]

# Let's prepare some plots on one canvas (subplots)


fig, axes = [Link](ncols=len(fas), figsize=(10, 8))

'''
And loop over the variants of our analysis `fas`, zipped with the
plot axes `axes`
'''
for ax, (title, fa) in zip(axes, fas):
# Fit the model to the standardized food data
fa = [Link](X)
# and transpose the component (loading) matrix
factor_matrix = fa.components_.T
# Plot the data as a heat map
im = [Link](factor_matrix, cmap="RdBu_r", vmax=1, vmin=-1)
# and add the corresponding value to the center of each cell
for (i,j), z in [Link](factor_matrix):
[Link](j, i, str([Link](2)), ha="center", va="center")
# Tell matplotlib about the metadata of the plot
ax.set_yticks([Link](len([Link])))
if ax.get_subplotspec().is_first_col():
ax.set_yticklabels([Link])
else:
ax.set_yticklabels([])
ax.set_title(title)
ax.set_xticks([0, 1])
ax.set_xticklabels(["Factor 1", "Factor 2"])
# and squeeze the axes tight, to save space
plt.tight_layout()

# and add a colorbar


cb = [Link](im, ax=axes, location='right', label="loadings")
# show us the plot
[Link]()

Interpretation of the results


Before we interpret the results of the factor analysis, let us recall the basic idea behind it. Factor analysis creates linear combinations of factors to abstract the variable’s
underlying communality. This reduces the amount of factors in the data set, while preserving most of the variance.

This allows us to aggregate a large number of observable variables in a model to represent an underlying concept, making it easier to comprehend the data.

The variability in a data set X, is given by Σ, and its estimate Σ


^ is composed of the variability explained by a linear combination of the factors, which we call communality, and

of the remaining variability that can not be explained by a linear combination of the factors, namley the uniqueness.
T
^ = ^^ ^
Σ ΛΛ + Ψ
 
communality uniqueness

Now let's have a look at our results:

^
In the plot above we can see the loadings, which range from −1 to 1 . This part is represented by the Λ in the equation above. The loadings are the contribution of each original
variable to the factor. Variables with loading values further away from 0 have a larger part of their variability factor.

Now we will check the uniqueness for of each variable. Note that we only use the varimax rotation method, going forward.

The next plot is generated straight from a PandasSeries , using the .plot() method. This provides easy access to quick plots.

In [4]: fa = FactorAnalysis(n_components = 2, rotation="varimax")


[Link](X)
uniqueness = Series(fa.noise_variance_, index=[Link])
[Link](
kind="bar",
ylabel="Uniqueness"
)

Out[4]: <AxesSubplot:ylabel='Uniqueness'>

The uniqueness, which ranges from 0 to 1 . It is, sometimes also referred to as (white) noise, corresponds to the proportion of variability that can not be explained by a linear
combination of the factors. This part is represented by the Ψ
^ in the equation above. A high uniqueness for a variable indicates that the factors do not account well for its

variance.

Opposing the uniquess, stands the communality:

In [5]: # Communality
communality = Series([Link](fa.components_.T).sum(axis=1), index=[Link])
[Link](
kind="bar",
ylabel="communality"
)

Out[5]: <AxesSubplot:ylabel='communality'>

By squaring the loading we can compute the fraction of the variable’s total variance explained by the factors (Teetor 2011{target="_blank"}). This proportion of the variability is
denoted as communality.

A way way to calculate the the uniqueness, when you already computed the communlity is to subtract it from 1 . An appropriate factor model results in low values for
uniqueness and high values for communality. So if we see bad results for our model, we could try a different number of underlying factors (latent variables).

In [6]: #and back to uniqueness


(1 - communality).plot(kind="bar", ylabel="uniqueness")

Out[6]: <AxesSubplot:ylabel='uniqueness'>

Recall the factor analysis model: Σ


^ ^Λ
= Λ ^ ^
+ Ψ

Using our factor model fa we may calculate Σ


^ and compare it to the observed correlation matrix, S , by simple matrix algebra.

We use numpy to perform fast and efficient math operations. Note: We can also use pandas, since it "wraps" around numpy, basically just forwarding our commands to the library.

In [8]: # the word 'lambda' is reserved for a python-operator, so we use the underscore at the end
lambda_ = fa.components_
psi = [Link](uniqueness)
s = [Link]([Link](X))
sigma = [Link](lambda_.T, lambda_) + psi
residuals = (s - sigma)

We subtracted the fitted correlation matrix Σ


^ ( sigma ) from the observed correlation matrix S . The resulting matrix is called the residual matrix. Numbers close to 0 indicate

that our factor model is a good representation of the underlying system. Now lets plot the results.

In [9]: ax = [Link]()
im = [Link](residuals, cmap="RdBu_r", vmin=-1, vmax=1)
ax.tick_params(axis="x", bottom=False, labelbottom=False, top=True, labeltop=True)
ax.set_xticks(range(5))
ax.set_xticklabels([Link])
ax.set_yticks(range(5))
ax.set_yticklabels([Link])
for (i,j), z in [Link](residuals):
[Link](j, i, str([Link](3)), ha="center", va="center")

[Link](im, ax=ax, location='right')


ax.set_title("FA residual matrix")
plt.tight_layout()

Interpretation of the factors


The purpose of a rotation is to produce more extreme loadings. The idea behind this is to give meaning to the factors. This can help with their interpretation. From a
mathematical viewpoint, there is no difference between a rotated and unrotated matrix. The fitted model is the same, the uniquenesses are the same, and the proportion of
variance explained is the same.

Here we fit three factor models, one with no rotation , one with varimax rotation, and one with quartimax rotation. We then make a scatter plot of the first and second
loadings.

In [10]: methods = [
("FA No rotation", FactorAnalysis(2,)),
("FA Varimax", FactorAnalysis(2, rotation="varimax")),
("FA Quartimax", FactorAnalysis(2, rotation="quartimax")),
]
fig, axes = [Link](ncols=3, figsize=(10, 8), sharex=True, sharey=True)

for ax, (method, fa) in zip(axes, methods):


fa = [Link](X)

components = fa.components_

vmax = [Link](components).max()
[Link](components[0,:], components[1, :])
[Link](0, -1, 1, color='k')
[Link](0, -1, 1, color='k')
for i,j, z in zip(components[0, :], components[1, :], [Link]):
[Link](i+.02, j+.02, str(z), ha="center")
ax.set_title(str(method))
if ax.get_subplotspec().is_first_col():
ax.set_ylabel("Factor 1")
ax.set_xlabel("Factor 2")

plt.tight_layout()
[Link]()

What does this mean?


How can I interpret these factors? If two variables have loadings further away from 0 for the same factor, we know they are related. We want to understand the data in order to
give meaningful names to the latent variables.

Taking a look at the plot in the middle FA Varimax above, it appears that Factor 1 describes a variable that makes pastry soft, less crisp. This description fits flaky pastry
rather well.

Whereas the loadings in Factor 2 show high density and little oil . This would fit the classification of hot water crust pastry

Citation

The E-Learning project SOGA-Py was developed at the Department of Earth Sciences by Annette Rudolph, Joachim Krois and Kai Hartmann. You can reach us via mail by
soga[at][Link].

You may use this project freely under the Creative Commons Attribution-ShareAlike 4.0 International License.
Please cite as follow: Rudolph, A., Krois, J., Hartmann, K. (2023): Statistics and Geodata Analysis using Python (SOGA-Py). Department of Earth Sciences, Freie Universitaet Berlin.

 

SOGA SOGA-R

SOGA Project Service Navigation This Page

 SOGA Startpage  Homepage  Print

 SOGA-R  SOGA Startpage  Subscribe RSS-Feed

 SOGA-R

 Privacy Policy

 Accessibility Statement

The Latin words veritas, justitia, and libertas, which frame the seal of Freie Universität Berlin, stand for the values that
have defined the academic ethos of Freie Universität since its founding in December 1948.

You might also like