Principal Component Analysis
(Slides adapted from John DeNero)
UC Berkeley Data 100 Summer 2019
Sam Lau
Learning goals:
Announcements
Principal Component Analysis
Principal Component Analysis for EDA
Goal: Plot the observations in a high-dimensional data matrix (many attributes) in two dimensions by picking two linear combinations of attributes.
Related Goal: Determine whether this two-dimensional plot is really showing the variability in the data. (If not, be wary of conclusions drawn using PCA.)
Principal Component Analysis for EDA
PCA is appropriate when:
Principal Component Analysis Computation
PCA: A Procedural View
Step 0: Center the data matrix by subtracting the mean of each attribute column.
Step 1: Find a linear combination of attributes, represented by a unit vector v, that summarizes each row of the data matrix as a scalar.
PCA: A Procedural View
(Demo)
Steps 2+: To find k principal components, choose the vector for each one in the same manner as the first, but ensure each one is orthogonal to all previous ones.
Connection to SVD: In practice, you don’t carry out steps 1+, but instead use singular value decomposition to find all principal components efficiently.
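The procedural steps above can be sketched in numpy (data here is randomly generated for illustration):

```python
import numpy as np

# A sketch of PCA via SVD, assuming X is an (n x d) data matrix.
rng = np.random.default_rng(42)
X = rng.normal(size=(50, 3))

# Step 0: center each attribute column.
Xc = X - X.mean(axis=0)

# SVD finds all principal components at once: the rows of Vt
# are the orthonormal principal component directions.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Step 1's unit vector v is Vt[0]; summarizing each row as a
# scalar is the matrix-vector product Xc @ v.
pc1 = Xc @ Vt[0]
print(pc1.shape)  # one scalar summary per observation
```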
SVD & PCA
Singular value decomposition (SVD) describes a matrix decomposition:
SVD & PCA
Suppose X is a (50 x 3) matrix with rank 2.
What are the dimensions of U, Σ, and V?
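One way to check the answer is to build a (50 x 3) rank-2 matrix (the third column below is a sum of the first two, so the matrix has rank 2) and inspect the shapes that numpy's reduced SVD returns:

```python
import numpy as np

# A (50 x 3) matrix with rank 2: the third column is a linear
# combination of the first two.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 2))
X = np.column_stack([A, A[:, 0] + A[:, 1]])

# Reduced SVD: U is (50 x 3), the 3 singular values form the
# diagonal of a (3 x 3) Sigma, and Vt is (3 x 3).
U, S, Vt = np.linalg.svd(X, full_matrices=False)
print(U.shape, S.shape, Vt.shape)  # (50, 3) (3,) (3, 3)

# Rank 2 shows up as a numerically zero third singular value.
print(S)
```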
SVD & PCA
Principal component analysis (PCA) is a specific application of SVD:
PCA for Two-Dimensional Visualization
(Demo)
Computational recipe for creating a scatter plot using PCA:
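The recipe can be sketched as follows (random data stands in for the demo's dataset; the scatter call itself is left as a comment so the sketch runs without matplotlib):

```python
import numpy as np

# Recipe: center the data, take the SVD, and use the first two
# columns of U @ diag(S) as scatter-plot coordinates.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
Xc = X - X.mean(axis=0)

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pcs = U[:, :2] * S[:2]   # (100, 2): x = first PC, y = second PC

# With matplotlib available, the plot would be:
#   import matplotlib.pyplot as plt
#   plt.scatter(pcs[:, 0], pcs[:, 1])
print(pcs.shape)
```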
Variance
Maximizing Variance
Variance is the expected squared deviation from the mean. For a finite set of attribute values with mean of zero, the variance is the average squared value.
odds = np.array([-1, 3, 5, -7])  # Attribute values with zero mean
np.var(odds) == np.average(odds**2) == 1/len(odds) * odds @ odds
Maximizing Variance
The total variance of a data matrix is the sum of the variances of the attributes.
The sum of squared singular values in PCA is equal to the total variance of the original data matrix.
Each squared singular value (a component score) indicates how much of the total variance is accounted for by the corresponding principal component.
Dividing the centered data matrix by the square root of m (the number of rows) before taking the SVD is necessary to maintain this relationship.
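The variance relationship can be verified numerically on a randomly generated (m x d) matrix:

```python
import numpy as np

# Check: the sum of squared singular values of the rescaled,
# centered matrix equals the total variance of the data.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4))
m = X.shape[0]
Xc = X - X.mean(axis=0)

# Total variance = sum of the variances of the attribute columns.
total_var = np.var(X, axis=0).sum()

# Divide by sqrt(m) so squared singular values are component scores.
U, S, Vt = np.linalg.svd(Xc / np.sqrt(m), full_matrices=False)
print(np.isclose((S**2).sum(), total_var))  # True
```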
Scree Plots
(Demo)
A scree plot shows the squared diagonal values of Σ (the squared singular values), largest first.
If the first two singular values are large and all others are small, then two dimensions are enough to describe most of what distinguishes one observation from another.
If not, then a PCA scatter plot is omitting lots of information.
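A scree plot's contents can be computed directly from the singular values; here the scores are also shown as fractions of total variance, and the plotting call is left as a comment so the sketch runs without matplotlib:

```python
import numpy as np

# Scree plot data: squared singular values, largest first.
rng = np.random.default_rng(3)
X = rng.normal(size=(60, 4))
Xc = X - X.mean(axis=0)

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = S**2                       # SVD returns these sorted descending
fractions = scores / scores.sum()   # fraction of total variance per PC

# With matplotlib:
#   import matplotlib.pyplot as plt
#   plt.plot(np.arange(1, len(scores) + 1), scores, marker='o')
print(np.round(fractions, 3))
```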
PCA: A Declarative View
Step 0: Center the data matrix by subtracting the mean of each column.
Step 1: Find a k-dimensional projection of the centered matrix that retains as much of the total variance of the original matrix as possible.
Equivalently, find a k-dimensional projection that minimizes the projection error.
In other words: PCA describes our goals for a good projection. SVD is the algorithm used to conduct PCA.
Question: First Principal Component
What’s the relationship between the first singular value and the scale of the x-axis in a PCA scatter plot?
Answer:
The x-axis positions are the values in the first column of UΣ. Since the columns of U are orthonormal unit vectors, that column’s length is σ₁, the first singular value.
The sum of the squares of these values is the first singular value squared.
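Both claims can be checked numerically:

```python
import numpy as np

# Verify: the x-axis values are the first column of U @ diag(S),
# and their norm equals the first singular value.
rng = np.random.default_rng(4)
X = rng.normal(size=(80, 3))
Xc = X - X.mean(axis=0)

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
x_axis = U[:, 0] * S[0]   # first column of U, scaled by sigma_1

# Length of the column is sigma_1, since U[:, 0] is a unit vector.
print(np.isclose(np.linalg.norm(x_axis), S[0]))   # True
# Sum of squares is the first singular value squared.
print(np.isclose((x_axis**2).sum(), S[0]**2))     # True
```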
What does 0.05 measure?
Why is this point at (0.11, 0)?
Interpretation
Principal Component Directions (Axes)
(Demo)
A principal component direction is a linear combination of attributes.
Plotting the values of the first principal component direction can provide insight into how attributes are being combined.
Interpreting other principal components is challenging; the constraint that they are orthogonal to prior components strongly influences their directions.
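The first direction's attribute weights are the entries of the first row of Vᵀ; printing them shows how the attributes are combined (the attribute names below are illustrative):

```python
import numpy as np

# Inspect the first principal component direction: each entry of
# Vt[0] is the weight one attribute contributes to the combination.
rng = np.random.default_rng(5)
X = rng.normal(size=(40, 3))
Xc = X - X.mean(axis=0)

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
for name, weight in zip(["attr_a", "attr_b", "attr_c"], Vt[0]):
    print(f"{name}: {weight:+.3f}")
```

The constraint from the slide is visible in V itself: each row of Vᵀ is a unit vector, and each is orthogonal to the rows before it.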
Analyzing Attributes Instead of Observations
(Demo)
In some datasets, observations and attributes can be reversed.
How? Transpose the matrix and perform PCA.
FAQ
Wait, what’s PCA again?
FAQ
So PCA finds repeated features and discards them?
FAQ
What do you mean by “PCA summarizes the list of wines”?
FAQ
Why are those two goals equivalent?
Maximizing variance = spreading out red dots
Minimizing error = making red lines short
FAQ
Imagine that the black line is a stick, and the red lines are springs attached to the stick from the points.
The first PC is where the stick comes to rest.
SVD finds this for us.
Summary