Principal Component Analysis - storyboard


To show the basic idea behind PCA, a small data set is generated. It consists of five persons/samples (Hansen, Jensen, Petersen, Pedersen and Nielsen), and for each person three variables exist, describing the added workload, the added salary and the "distance to work".
   

Plotting these five samples requires a three-dimensional space in which the three variables are the axes.

In the figure to the right, the three variables for the first sample - Hansen - are marked on three separate axes. The locations of the three black spheres represent Hansen's variable values: 1.0, 0.1 and 1.2 for the added workload, the added salary and the "distance to work", respectively.

   

The three axes are then turned into orthogonal positions and thereby form the three-dimensional space.

The location of Hansen is noted to be:

  • 1.0 up along the green workload axis
  • 0.1 out along the red salary axis
  • and 1.2 out along the blue "distance to work" axis
   
Now all five samples are in place, each positioned according to its values of added workload, added salary and "distance to work" in the original data table.
   
It is observed that all samples seem to be located close to a flat two-dimensional plane in the three-dimensional box. This is an important fact - best seen in the movie.
   
This plane - seen as the transparent square - may be described using two new axes (white). Its exact position in space is determined by least squares: the plane is placed so that the sum of the squared residuals is minimized.
   

The frame to the right is a zoomed view from the back of the coordinate system. The distance from each sample to the transparent plane is the residual of that sample, shown as a cylinder whose length equals the size of the residual.

The sum of the squared residuals from all the samples is made as small as possible when placing the white axes.
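
The least squares placement of the plane can be sketched numerically. The example below is only an illustration under stated assumptions: Hansen's row is taken from the text, the other four rows are made-up placeholder values (the original data table is not reproduced here), the data are mean-centered so that the plane passes through the centroid of the samples, and a singular value decomposition from NumPy is used to find the plane.

```python
import numpy as np

# Rows: Hansen, Jensen, Petersen, Pedersen, Nielsen.
# Columns: added workload, added salary, "distance to work".
# Only Hansen's row comes from the text; the other rows are placeholders.
X = np.array([
    [1.0, 0.1, 1.20],
    [0.9, 0.2, 1.10],
    [0.1, 1.0, 1.20],
    [0.2, 0.9, 1.10],
    [1.0, 1.0, 1.15],
])

# Assumption: the plane passes through the centroid of the samples.
Xc = X - X.mean(axis=0)

# Rows of Vt are orthonormal directions, ordered by decreasing singular value.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
axes = Vt[:2]    # the two white axes spanning the plane
normal = Vt[2]   # direction perpendicular to the plane

# Signed distance from each sample to the plane (the length of its cylinder).
residuals = Xc @ normal
ssr = float(np.sum(residuals ** 2))

# For comparison: a different plane, spanned by the workload and distance axes,
# whose normal is the salary axis. Its sum of squared residuals is larger.
other_normal = np.array([0.0, 1.0, 0.0])
ssr_other = float(np.sum((Xc @ other_normal) ** 2))

print("sum of squared residuals, fitted plane:", ssr)
print("sum of squared residuals, other plane: ", ssr_other)
```

The sum of squared residuals of the fitted plane equals the square of the smallest singular value, which is why the first two singular directions give the least squares plane.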

   

The residuals are removed and the samples are projected orthogonally onto the new plane described by the two white axes. These axes are also known as principal components or latent factors.

The coordinates of the samples in the new flat two-dimensional space described by the white axes are known as scores. Each set of scores, together with its definition (the corresponding loading), is called a principal component.
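
The projection step can be written out as a small sketch. As before, this is an illustration and not the original calculation: only Hansen's row is from the text, the remaining rows are placeholder values, and mean-centering is an assumption. The names `P` (loadings), `T` (scores) and `E` (residuals) are chosen here for convenience.

```python
import numpy as np

# Rows: Hansen, Jensen, Petersen, Pedersen, Nielsen.
# Only Hansen's row comes from the text; the other rows are placeholders.
X = np.array([
    [1.0, 0.1, 1.20],
    [0.9, 0.2, 1.10],
    [0.1, 1.0, 1.20],
    [0.2, 0.9, 1.10],
    [1.0, 1.0, 1.15],
])
Xc = X - X.mean(axis=0)  # assumption: data are mean-centered

_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
P = Vt[:2].T        # loadings: 3 variables x 2 principal components
T = Xc @ P          # scores: coordinates of the 5 samples on the white axes
X_plane = T @ P.T   # projected samples, expressed in the original 3 axes
E = Xc - X_plane    # residuals removed by the projection

print("scores:\n", T)
```

Each row of `T` holds one person's two coordinates in the plane, and the centered data are recovered exactly as `T @ P.T + E`.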

   

Now three unit vectors are placed at the origin of the original three-dimensional space, along the green, red and blue axes, and a sphere is placed at the terminal point of each. These are also projected onto the transparent flat plane described by the principal components.

This leads to the definition of loadings: the positions of the projected spheres in the plane, which describe the relationship between the original variables.
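
That the projected unit vectors give the loadings can be checked with a short sketch. The data are again placeholders except for Hansen's row, mean-centering is an assumption, and the variable names are chosen for this illustration only.

```python
import numpy as np

# Rows: Hansen, Jensen, Petersen, Pedersen, Nielsen.
# Only Hansen's row comes from the text; the other rows are placeholders.
X = np.array([
    [1.0, 0.1, 1.20],
    [0.9, 0.2, 1.10],
    [0.1, 1.0, 1.20],
    [0.2, 0.9, 1.10],
    [1.0, 1.0, 1.15],
])
Xc = X - X.mean(axis=0)  # assumption: data are mean-centered

_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
P = Vt[:2].T  # loadings: one row per original variable

# One unit vector along each original axis (workload, salary, distance).
unit_vectors = np.eye(3)

# Orthogonal projection onto the plane, expressed on the two white axes.
projected_spheres = unit_vectors @ P

# Distance of each projected sphere from the origin of the plane.
sphere_lengths = np.linalg.norm(projected_spheres, axis=1)

print("projected spheres:\n", projected_spheres)
```

The projected spheres coincide with the rows of `P`, and because a projection cannot lengthen a vector, each sphere ends up at most one unit from the origin.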

   

When the original coordinate system (the red, blue and green axes) and the unit vectors are removed, what remains is the new (reduced) coordinate system represented by the principal components (white axes). Both samples and variables are located in this coordinate system - the plot is also called a bi-plot because it describes the relationship between both the samples/persons and the variables.

One may now conclude:

  • Jensen and Hansen are approximately similar because they are located relatively close to each other. Additionally, Hansen and Jensen have high values of workload because the green workload sphere is located close to these samples/persons.
  • Jensen and Hansen possess opposite properties compared to Petersen and Pedersen because they are located opposite each other.
  • The variable "distance to work" does not describe the samples in any way because the blue "distance to work" sphere is located at the origin.
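
The same kind of reading can be done numerically instead of by eye. The sketch below is only an illustration: Hansen's row is from the text, while the other rows are placeholder values chosen to resemble the pattern described above, so the printed numbers do not reproduce the figures. The helper `score_distance` is a hypothetical name introduced for this example.

```python
import numpy as np

names = ["Hansen", "Jensen", "Petersen", "Pedersen", "Nielsen"]
variables = ["workload", "salary", "distance to work"]

# Only Hansen's row comes from the text; the other rows are placeholders.
X = np.array([
    [1.0, 0.1, 1.20],
    [0.9, 0.2, 1.10],
    [0.1, 1.0, 1.20],
    [0.2, 0.9, 1.10],
    [1.0, 1.0, 1.15],
])
Xc = X - X.mean(axis=0)  # assumption: data are mean-centered

_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
P = Vt[:2].T  # loadings (positions of the variable spheres)
T = Xc @ P    # scores (positions of the persons)


def score_distance(a, b):
    """Distance between two persons in the score plot."""
    i, j = names.index(a), names.index(b)
    return float(np.linalg.norm(T[i] - T[j]))


close = score_distance("Hansen", "Jensen")
far = score_distance("Hansen", "Petersen")

# Length of each loading vector: a value near 0 means the variable's
# sphere sits at the origin of the bi-plot.
loading_length = dict(zip(variables, np.linalg.norm(P, axis=1)))

print("Hansen-Jensen distance:  ", close)
print("Hansen-Petersen distance:", far)
print("loading lengths:", loading_length)
```

With these placeholder values, Hansen and Jensen are close, Hansen and Petersen fall on opposite sides of the first principal component, and the "distance to work" loading has a length of practically zero.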

The above conclusions are drawn from visual exploration of the bi-plot only. One may correctly argue that similar conclusions could have been drawn from just inspecting the raw data - but when analyzing much larger data sets this is no longer an option. Similar conclusions may also be drawn using traditional statistics - but these do not, for example, automatically point to important variables. Using traditional statistics one must manually test all possible correlations - a cumbersome job with thousands of samples, each consisting of thousands of variables.

In conclusion, four types of information may be retrieved using PCA:

  • The relationship between the samples is described by the score values - closely located samples are similar;
  • The relationship between the variables is described by the loading plot - closely located variables are correlated;
  • The distances from the samples to the principal components are described by the residuals;
  • The relationship between samples and variables is depicted in a bi-plot.