ESM 244: 4
Recall: Ordination methods
In PCA, the axes (PRINCIPAL COMPONENTS) are chosen along the directions of greatest variance in the data, so that we explain as much variance as possible using a reduced number of dimensions.
Cartesian Coordinate System
...but we can define it however we want to.
We can redefine our primary axes.
How do we describe our data in this new system?
Eigenvectors and eigenvalues are paired information:
*Remember this...
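A quick sketch of what that pairing means in matrix terms (standard linear algebra, not specific to these slides): for the data's covariance (or correlation) matrix S, each eigenvector and its paired eigenvalue satisfy

```latex
S\,\mathbf{v}_k = \lambda_k\,\mathbf{v}_k,
\qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p \ge 0
```

The eigenvector v_k gives the direction of a new axis; the paired eigenvalue λ_k gives the variance of the data along that direction.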
Example: Let’s say instead of having two variables (x, y), we originally have 3 (age, hours watching TV, hours studying). For conceptual understanding, we’re going to say that these observations miraculously fall in the general shape of an ellipsoid (pancake-ish). Each point indicates an observation for a single person.
PCA: New axes are created (linear combinations of the original variables) such that the first (PC1) points in the direction accounting for the most variance in the multivariate data, the second (PC2) accounts for the next most (after PC1 has been taken into account), and so on.
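Written out (standard PCA notation, not taken from the slides), each observation's PC1 score is a weighted sum of the centered original variables, with the weight vector chosen to maximize variance:

```latex
\text{PC1}_i = w_1 x_{i1} + w_2 x_{i2} + \dots + w_p x_{ip},
\qquad \mathbf{w} = \arg\max_{\lVert \mathbf{w} \rVert = 1} \operatorname{Var}(X\mathbf{w})
```

PC2 solves the same maximization subject to being orthogonal to PC1, and so on for the later components.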
If PC1 and PC2 explain most of the variance in the data (see eigenvalues), then we'd still be seeing most of the important things about our data if we just view it on PC1 and PC2...
We’ve gone from 3 dimensions to 2 dimensions that explain the greatest possible amount of variance. It doesn’t show us everything, but it does show us a lot about the data in just 2 dimensions…
What did we just do?
Dimensionality Reduction
Converting complex multidimensional data into fewer dimensions to explain as much about the data as simply as possible
OK, so that doesn’t seem that cool going from 3 → 2 dimensions...but what if we could go from 15 → 2 dimensions and still describe 80% of variance in the data? Then that becomes pretty cool.
Simplified data, loaded as .csv ‘Patients.csv’
Using these data, how many principal components will we get?
Sure you can do it by hand, but…
prcomp() function in R:
For dataset ‘Patients.csv’:
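A minimal sketch of the call (assuming 'Patients.csv' sits in the working directory; the object name patients_pca is just a placeholder):

```r
# Read in the simplified patient data
patients <- read.csv("Patients.csv")

# Run PCA; center each variable and scale it to unit variance first
patients_pca <- prcomp(patients, scale. = TRUE)

summary(patients_pca)
```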
What does this scaling term do?
Scaling data before PCA: You don't have to do it, but it's usually advisable.
WHEN? Especially when the variables are measured in different units or on very different numeric scales.
WHY? Without scaling, variables with large variances (simply because of their units) dominate the principal components.
What R gives us:
Standard deviations for new PCs. Higher SD = more variance explained in PC.
Remember how we said that the new components (PCs) are linear combinations of the original variables? That's what these give us – the coefficients for those linear combinations.
THESE ARE THE EIGENVECTORS! Also called "loadings" – they describe how strongly (and in what direction) each original variable contributes to each component
But how much of the variance do the PCs actually explain?
These tell us how much of the total variance is extracted by EACH PC (note order)
These tell us the cumulative proportion of variance in the data explained as you add components, from the first PC (PC1) through the final PC (here, PC5)
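A quick sketch of where those pieces live in the prcomp() output (continuing with the hypothetical patients_pca object from above):

```r
# Standard deviations of the PCs; squaring them gives the eigenvalues
patients_pca$sdev

# Loadings (eigenvectors): coefficients of the linear combinations defining each PC
patients_pca$rotation

# Proportion and cumulative proportion of total variance explained by each PC
prop_var <- patients_pca$sdev^2 / sum(patients_pca$sdev^2)
prop_var
cumsum(prop_var)

# summary() reports the same proportions in a single table
summary(patients_pca)
```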
So what do we choose for the ‘cut-off’ for the proportion of variance beyond which we say that additional components aren’t so helpful (i.e., how do we know how many PCs to retain)?
There aren’t really rules about where the cut-off should be. Some say 80% is pretty good, some say you need to look at the cumulative proportions, some say you need to look at the eigenvalues…
It’s really a judgment call.
Generally, the eigenvalues fall off quickly and the cumulative proportions increase quickly (especially useful for large numbers of initial variables):
[Table: eigenvalue and cumulative proportion of variance for each principal component, PC1 through PC5]
So let’s say we pick the first 3 components to stick with, since we’ve decided that they explain an acceptable amount of the total variance:
What can we learn based on our truncated “model”?
Yes, in this case you might say “This hardly seems worth it to decrease my dimensions from 5 to 3”…but in some cases you’ll have 50 variables and this can allow you to reduce it to just a few!
A scree-plot is useful for visualizing PC contributions
From: NYC Data Science Academy Higgs Boson Machine Learning Challenge https://nycdatascience.com/blog/student-works/secretepipeline-higgs-boson-machine-learning-challenge/
We can also visualize contributions of the different initial variables to the PCs
STHDA Factoextra R Package: Easy Multivariate Data Analyses and Elegant Visualization
http://www.sthda.com/english/wiki/factoextra-r-package-easy-multivariate-data-analyses-and-elegant-visualization#visualizing-dimension-reduction-analysis-outputs
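A hedged sketch of the sort of factoextra calls described on that page (again using the hypothetical patients_pca object):

```r
library(factoextra)

# Scree plot: percentage of variance explained by each PC
fviz_screeplot(patients_pca, addlabels = TRUE)

# Variables plotted as vectors, colored by their contribution to the PCs
fviz_pca_var(patients_pca, col.var = "contrib")
```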
If the point of ordination methods/dimensional reduction is to simplify our understanding of multivariate relationships, then there should also be a way to visualize that simplified information.
BIPLOTS: an approximation of the original multidimensional space, reduced to 2 dimensions, with information about variables (as vectors) and observations (as points)
BIPLOT EXAMPLE
A biplot for PCA shows two things: the observations plotted as points in the plane of PC1 and PC2, and the original variables plotted as vectors (arrows) in that same plane.
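A minimal sketch of producing one (again with the hypothetical patients_pca object):

```r
# Base-R biplot of a prcomp object: observations as points, variables as arrows
biplot(patients_pca)

# Or the factoextra version, plotting observations and variables on PC1 vs. PC2
factoextra::fviz_pca_biplot(patients_pca)
```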
Interpreting biplot outputs: lines (variables)
0° angle between two variable vectors ≈ correlation of 1
180° angle ≈ correlation of -1
90° (or 270°) angle ≈ correlation of 0
INTERPRETING BIPLOT OUTPUTS: POINTS (observations)
In PC1 direction: SBP, Height, Weight, Cholesterol vary similarly (as one increases, the others increase)
Height, weight and cholesterol are minimally correlated with Age.
No observable clusters (no grouping done)
[Biplot figure with arrows indicating the PC1 and PC2 directions]
Biplots (and dimensional reduction in general):
What else is there?
Mohammad Ali Zare Chahouki (2012) Classification and Ordination Methods as a Tool for Analyzing of Plant Communities, Intech Open (online).
PCA: No distinction between explanatory variables and outcome variables...it’s just variables
What if we have a scenario where we have explanatory variables and outcome variables?
Multivariate Approaches:
Cluster Analysis: find similar groups of values/families
Unconstrained Ordination (PCA, nMDS, etc.): find maximum-variance components for variables; distance-based methods
Constrained Ordination (RDA, CCA, etc.): find maximum-variance components for dependent variables, explained by predictor variables
Discrimination Methods (MANOVA, etc.): test for significant differences in groups
Redundancy Analysis (RDA): a constrained ordination method that finds the components of maximum variance in the dependent (outcome) variables that can be explained by the independent (predictor) variables.
An example of RDA: Exploring leaf litter decomposition rates
GENERAL DATA STRUCTURE: one row per site; the columns split into environmental (independent) variables and dependent (outcome) variables.

          Environmental (Independent) Variables | Dependent (Outcome) Variables
          N     Temp     C:N     etc.           | k     A     DLV     C % increase
Site 1
Site 2
Site 3
...
Site n
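For context, a hedged sketch of how data structured like this is commonly passed to an RDA in R with the vegan package (the data below are randomly generated placeholders, not the study's actual measurements):

```r
library(vegan)

set.seed(1)

# Synthetic placeholder data: one row per site (purely illustrative)
env <- data.frame(
  N    = runif(10, 0.5, 2),    # leaf litter N concentration
  Temp = runif(10, 3, 7),      # growing-season air temperature
  CN   = runif(10, 30, 45)     # leaf litter C:N ratio
)
resp <- data.frame(
  k = runif(10, 0.005, 0.02),  # decomposition rate constant
  A = runif(10, 0.7, 0.95)     # a second outcome variable
)

# Redundancy analysis: variance in the outcomes explained by the predictors
litter_rda <- rda(resp ~ ., data = env)

summary(litter_rda)
plot(litter_rda)   # triplot of sites, outcome variables, and environmental vectors
```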
“Redundancy analysis and Pearson correlations also revealed that leaf litter decomposition (k) varied across the sites according to climatic factors (Figure 7). Specifically, it was positively correlated to growing season length (GSL), degree-days (DD) and growing season average air temperature (Tair). Additionally, leaf litter decomposition was related to moisture (negatively) and temperature (positively) in the topsoil (Tsoil)…Similarly, willow leaf litter k and A were positively correlated to leaf litter N concentration (N) and negatively correlated to leaf litter C:N ratio (C:N)…”
Thanks to Sebastian Tapia for this example!
“We used a redundancy analysis to explore whether certain types of responses were related to the fishers’ socioeconomic characteristics. Fishers that would employ amplifying responses had greater economic wealth but lacked options. Fishers who would adopt dampening responses possessed characteristics associated with having livelihood options. Fishers who would adopt neither amplifying nor dampening responses were less likely to belong to community groups and sold the largest proportion of their catch.”