| To show the basic idea behind PCA, a small data set
is generated. It consists of five persons/samples (Hansen, Jensen, Petersen, Pedersen and Nielsen), and
for each person
three variables exist, describing the added workload, the added
salary and the "distance to work". |
 |
| |
|
|
Plotting these five samples requires a three-dimensional space
in which the axes are the three variables.
In the figure to the right the three variables for the first sample - Hansen
- are marked on the three separate axes. The locations of the three black
spheres represent the variable values of Hansen: 1.0, 0.1 and 1.2 for
the added workload, the added salary and the "distance to work", respectively.
|
 |
|
|
|
|
The three axes are then turned so that they are mutually orthogonal and thereby
form the three-dimensional space.
The location of Hansen is noted to be:
- 1.0 up along the green workload axis
- 0.1 out along the red salary axis
- and 1.2 out along the blue "distance to work" axis
|
 |
| |
|
| Now all five samples are in place, each positioned according to
its values of added workload, added salary and "distance to work" from
the original data table. |
 |
| |
|
| It is observed that all samples seem to be located close to
a flat two-dimensional plane
in the three-dimensional box. This is an important fact - best seen in the
movie. |
 |
| |
|
| This plane - seen as the transparent square - may be described
using two new axes (white). Its exact spatial location is determined by
least squares, i.e. by minimizing the sum of squared residuals. |
 |
|
|
|
|
The frame to the right is a zoomed view from the back of the
coordinate system. The distance from each sample to the transparent plane is
the residual of that sample (the cylinder) - the length of the cylinder equals the size
of the residual.
The sum of the squared residuals over all the samples is made as small as possible when placing the white axes.
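
The least squares placement of the plane can be sketched numerically. The following is a minimal illustration, not the procedure used to make the figures. It assumes NumPy, it assumes the plane passes through the mean of the samples, and only Hansen's row (1.0, 0.1, 1.2) is taken from the text; the values for the other four persons are invented for the example. One common way to find the plane is the singular value decomposition (SVD) of the mean-centered data.

```python
import numpy as np

# Illustrative data: only Hansen's row (1.0, 0.1, 1.2) is taken from the text.
# The other four rows are hypothetical values invented for this sketch.
names = ["Hansen", "Jensen", "Petersen", "Pedersen", "Nielsen"]
X = np.array([
    [1.0, 0.1, 1.20],   # Hansen (from the text)
    [0.8, 0.4, 1.15],   # Jensen (hypothetical)
    [0.1, 0.7, 1.20],   # Petersen (hypothetical)
    [0.3, 1.0, 1.25],   # Pedersen (hypothetical)
    [0.6, 0.2, 1.20],   # Nielsen (hypothetical)
])

# Assumption: the plane passes through the mean of the samples.
Xc = X - X.mean(axis=0)

# The two leading right singular vectors span the least squares plane.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
P = Vt[:2].T                      # 3 x 2, the columns play the role of the white axes

# Residual of each sample: the part of it that sticks out of the plane.
in_plane = Xc @ P @ P.T
residual_vectors = Xc - in_plane
residual_lengths = np.linalg.norm(residual_vectors, axis=1)
sum_sq_residuals = float(np.sum(residual_lengths ** 2))

for name, r in zip(names, residual_lengths):
    print(f"{name:9s} residual length = {r:.3f}")
print(f"sum of squared residuals = {sum_sq_residuals:.4f}")
```

The residual lengths correspond to the lengths of the cylinders. With this choice of plane the sum of squared residuals equals the square of the smallest singular value, which is the smallest value any plane through the mean can achieve.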
|
 |
|
|
|
|
The residuals are removed and the samples are projected orthogonally onto the
new plane described by the two white axes. These axes are also known as principal
components or latent factors.
The new coordinates of the samples in the flat two-dimensional space
described by the white axes are known as scores. The definition of each axis
in terms of the original variables is the corresponding loading, and a score
vector together with its loading constitutes a principal component.
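
As a rough numerical counterpart to the projection, the scores can be computed as coordinates along the two new axes. This is a sketch under assumptions: NumPy is used, the data are mean-centered first, and all rows except Hansen's (1.0, 0.1, 1.2) are invented for illustration. The names `T`, `P` and `E` are chosen here for scores, loadings and residuals and do not come from the text.

```python
import numpy as np

# Illustrative data: only Hansen's row is taken from the text,
# the remaining rows are hypothetical.
names = ["Hansen", "Jensen", "Petersen", "Pedersen", "Nielsen"]
X = np.array([
    [1.0, 0.1, 1.20],   # Hansen (from the text)
    [0.8, 0.4, 1.15],   # Jensen (hypothetical)
    [0.1, 0.7, 1.20],   # Petersen (hypothetical)
    [0.3, 1.0, 1.25],   # Pedersen (hypothetical)
    [0.6, 0.2, 1.20],   # Nielsen (hypothetical)
])

Xc = X - X.mean(axis=0)           # assumption: mean-centering before PCA
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

P = Vt[:2].T                      # loadings: 3 variables x 2 components
T = Xc @ P                        # scores: 5 samples x 2 components
E = Xc - T @ P.T                  # residuals: what the plane does not describe

for name, (t1, t2) in zip(names, T):
    print(f"{name:9s} score on PC1 = {t1:+.3f}, score on PC2 = {t2:+.3f}")
```

Each row of `T` gives the position of one person in the new two-dimensional system, and `T @ P.T + E` reproduces the centered data exactly.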
|
 |
|
|
|
|
Now three unit vectors are placed at the origin of the original
three-dimensional space, one along each of
the green, red and blue axes, and a sphere is placed at the terminal point
of each. These are also projected onto the transparent flat plane described
by the principal components.
This leads to the definition of loadings, which describe the
relationship between the original variables.
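
The projection of the unit vectors can also be checked numerically. The sketch below assumes NumPy and mean-centered data, and it uses invented values for every person except Hansen (1.0, 0.1, 1.2). It shows that the in-plane coordinates of the three projected unit vectors are simply the rows of the loading matrix (called `P` here, a name chosen for this example).

```python
import numpy as np

# Illustrative data: only Hansen's row is taken from the text,
# the remaining rows are hypothetical.
variables = ["added workload", "added salary", "distance to work"]
X = np.array([
    [1.0, 0.1, 1.20],   # Hansen (from the text)
    [0.8, 0.4, 1.15],   # Jensen (hypothetical)
    [0.1, 0.7, 1.20],   # Petersen (hypothetical)
    [0.3, 1.0, 1.25],   # Pedersen (hypothetical)
    [0.6, 0.2, 1.20],   # Nielsen (hypothetical)
])

Xc = X - X.mean(axis=0)           # assumption: mean-centering before PCA
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
P = Vt[:2].T                      # loadings: 3 variables x 2 components

# One unit vector along each original axis (green, red, blue).
unit_vectors = np.eye(3)

# Coordinates of the projected unit vectors in the plane of the two components.
variable_coords = unit_vectors @ P

# Distance of each projected sphere from the origin of the plane.
lengths = np.linalg.norm(variable_coords, axis=1)
for var, (p1, p2), length in zip(variables, variable_coords, lengths):
    print(f"{var:16s} PC1 = {p1:+.3f}, PC2 = {p2:+.3f}, length = {length:.3f}")
```

In this made-up data set the "distance to work" column hardly varies, so its projected sphere ends up close to the origin of the plane, while the workload and salary spheres stay far from it.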
|
 |
|
|
|
|
When the original coordinate system (the red, blue and green
axes) and the unit vectors are removed, what remains is the new (reduced) coordinate system represented
by the principal components (white axes). Both samples and variables are
located in this coordinate system - the plot is therefore also called a bi-plot,
because it describes the relationship between both the samples/persons
and the variables.
One may now conclude:
- Jensen and Hansen are similar because they are located
relatively close to each other. Additionally, it can be seen that
Hansen and Jensen have high values of workload because the green
workload
sphere is located close to these samples/persons.
- Jensen and Hansen possess the opposite properties of Petersen and
Pedersen because the two pairs are located opposite each other.
- The variable "distance to work" does not contribute to the description of the samples because the
blue "distance to work" sphere is located at the origin.
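
The visual reading of the bi-plot can be mimicked with a few numbers. The sketch below assumes NumPy and mean-centered data. Only Hansen's row (1.0, 0.1, 1.2) is taken from the text; the other rows are invented so that they roughly follow the pattern described above, which means the output illustrates the reasoning and is not a result from the original data table. The helper `score_distance` is a name made up for this example.

```python
import numpy as np

# Illustrative data: only Hansen's row is taken from the text,
# the remaining rows are hypothetical.
names = ["Hansen", "Jensen", "Petersen", "Pedersen", "Nielsen"]
X = np.array([
    [1.0, 0.1, 1.20],   # Hansen (from the text)
    [0.8, 0.4, 1.15],   # Jensen (hypothetical)
    [0.1, 0.7, 1.20],   # Petersen (hypothetical)
    [0.3, 1.0, 1.25],   # Pedersen (hypothetical)
    [0.6, 0.2, 1.20],   # Nielsen (hypothetical)
])

Xc = X - X.mean(axis=0)           # assumption: mean-centering before PCA
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
P = Vt[:2].T                      # loadings
T = Xc @ P                        # scores


def score_distance(a, b):
    """Distance between two persons in the plane of the two components."""
    return float(np.linalg.norm(T[names.index(a)] - T[names.index(b)]))


# Persons that are close in the score plot are similar.
print("Hansen-Jensen:  ", round(score_distance("Hansen", "Jensen"), 3))
print("Hansen-Petersen:", round(score_distance("Hansen", "Petersen"), 3))

# Persons on opposite sides of the origin have scores of opposite sign.
hansen_pc1 = T[names.index("Hansen"), 0]
petersen_pc1 = T[names.index("Petersen"), 0]
pedersen_pc1 = T[names.index("Pedersen"), 0]
print("Hansen and Petersen on opposite sides of PC1:", hansen_pc1 * petersen_pc1 < 0)
print("Hansen and Pedersen on opposite sides of PC1:", hansen_pc1 * pedersen_pc1 < 0)
```

The overall sign of each component returned by the SVD is arbitrary, so only relative statements, such as two persons lying on the same or on opposite sides of the origin, are meaningful.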
|
 |
|
The above conclusions are drawn from visual exploration of the bi-plot
only.
One may correctly argue that similar conclusions could have been drawn
from just inspecting the raw data - but when analyzing much larger data
sets this is no longer an option. Similar conclusions may also be drawn
using traditional statistics - but these do not, for example, automatically point
to the important variables. Using traditional statistics one must manually
test all possible correlations - a cumbersome job with thousands
of samples, each consisting of thousands of variables.
In conclusion, four types of information may be retrieved using PCA:
- The relationship between the samples is described by the score
values - closely located samples are similar;
- The relationship between the variables is described by the loading
plot - closely located variables are correlated;
- The distances from the samples to the plane spanned by the principal components are described
by the residuals;
- The relationship between samples and variables is depicted in a bi-plot.
|