1 of 28

Principal Component Analysis

(Slides adapted from John DeNero)

UC Berkeley Data 100 Summer 2019

Sam Lau

Learning goals:

  • Understand how to use SVD to conduct PCA.
  • Learn about the minimizing error and the maximizing variance views of PCA.
  • Gain ability to interpret 2D PCA plots and scree plots.

2 of 28

Announcements

  • HW3 due Friday (July 12)
  • Midterm next Tues!
    • 9:30am - 11am Tues, July 16 in 105 North Gate
    • Single hand-written two-sided cheat sheet allowed
    • Covers weeks 1-3
  • Midterm review session on Monday in place of lecture
    • Manana and Ishaan will run it
  • Small group tutoring this week: tinyurl.com/d100-tutor-week3

3 of 28

Principal Component Analysis

4 of 28

Principal Component Analysis for EDA

Goal: Plot the observations in a high-dimensional data matrix (many attributes) in two dimensions by picking two linear combinations of attributes.

Related Goal: Determine whether this two-dimensional plot is really showing the variability in the data. (If not, be wary of conclusions drawn using PCA.)

5 of 28

Principal Component Analysis for EDA

PCA is appropriate when:

  • You want to visually identify clusters of similar observations in high dimensions.
  • You are still exploring the data. (If you already know what to predict, you probably don’t need PCA.)
  • You have reason to believe that the data are inherently low rank: there are many attributes, but only a few (perhaps unobserved) attributes mostly determine the rest through a linear association.

6 of 28

Principal Component Analysis Computation

7 of 28

PCA: A Procedural View

Step 0: Center the data matrix by subtracting the mean of each attribute column.

Step 1: Find a linear combination of attributes, represented by a unit vector v, that summarizes each row of the data matrix as a scalar.

  • v gives a one-dimensional projection of the data, the first principal component.
  • Unit vector v is chosen to minimize the sum of squared distances between each point and its projection onto v (see the sketch below).
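
A minimal numpy sketch of this objective (the toy data matrix and variable names here are made up for illustration): the first right singular vector gives no larger a total squared projection error than an arbitrary unit vector.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # toy data matrix
X = X - X.mean(axis=0)                   # Step 0: center each column

def projection_error(X, v):
    # Sum of squared distances from each row to its projection onto unit vector v
    proj = np.outer(X @ v, v)
    return np.sum((X - proj) ** 2)

v_first = np.linalg.svd(X)[2][0]         # first right singular vector (unit length)
v_other = np.array([1.0, 0.0, 0.0])      # an arbitrary unit vector for comparison
projection_error(X, v_first) <= projection_error(X, v_other)   # True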

8 of 28

PCA: A Procedural View

(Demo)

Steps 2+: To find k principal components, choose the vector for each one in the same manner as the first, but ensure each one is orthogonal to all previous ones.

Connection to SVD: In practice, you don’t carry out steps 1+, but instead use singular value decomposition to find all principal components efficiently.

9 of 28

SVD & PCA

Singular value decomposition (SVD) describes a matrix decomposition:

  • X = UΣVᵀ (or, equivalently, XV = UΣ), where U and V are orthonormal and Σ is diagonal.
  • If X has rank r, then there will be r non-zero values on the diagonal of Σ.
  • The values in Σ, called singular values, are ordered from greatest to least.
  • The columns of U are the left singular vectors.
  • The columns of V are the right singular vectors.
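
A hedged numpy sketch of the decomposition (the example matrix is random, just to show the shapes and properties):

import numpy as np

X = np.random.default_rng(0).normal(size=(50, 3))   # example data matrix
U, s, Vt = np.linalg.svd(X, full_matrices=False)    # compact SVD

# s holds the singular values, sorted from greatest to least;
# the columns of U and of Vt.T are the left and right singular vectors.
np.allclose(X, U @ np.diag(s) @ Vt)                 # True
np.allclose(U.T @ U, np.eye(3))                     # True: orthonormal columns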

10 of 28

SVD & PCA

Suppose X is a (50 x 3) matrix with rank 2.

What are the dimensions of U, Σ, and V in the compact (rank-r) SVD?

  • U: (50 x 2)
  • Σ: (2 x 2)
  • V: (3 x 2)
    • Vᵀ: (2 x 3)
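
As a sketch, numpy's SVD returns larger matrices; truncating to the r = 2 non-zero singular values gives the compact shapes above (the rank-2 matrix below is constructed just for illustration):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2)) @ rng.normal(size=(2, 3))   # (50 x 3) matrix with rank 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
r = int(np.sum(s > 1e-10))                               # numerical rank: 2
U[:, :r].shape, np.diag(s[:r]).shape, Vt[:r].T.shape     # (50, 2), (2, 2), (3, 2)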

11 of 28

SVD & PCA

Principal component analysis (PCA) is a specific application of SVD:

  • X is a data matrix, centered at the mean of each column and divided by √m, where m is the number of observations.
  • The largest k singular values are kept, for some choice of k; all other singular values are set to zero. This is the dimensionality reduction.
  • The first k columns of V are directions for the k principal components.
  • The first k columns of XV or UΣ contain the k principal components of X.
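
A minimal sketch of these steps in numpy (the data matrix D and the choice of k are placeholders):

import numpy as np

D = np.random.default_rng(0).normal(size=(100, 5))   # placeholder data matrix, m = 100 observations
m = D.shape[0]
X = (D - D.mean(axis=0)) / np.sqrt(m)                # center each column, scale by sqrt(m)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2                                                # keep the largest k singular values
directions = Vt[:k].T                                # first k columns of V: principal component directions
pcs = X @ directions                                 # first k principal components (same as U[:, :k] * s[:k])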

12 of 28

PCA for Two-Dimensional Visualization

(Demo)

Computational recipe for creating a scatter plot using PCA:

  • For data matrix D with column means D̄, compute the centered, scaled matrix X = (D − D̄) / √m, where m is the number of observations.

  • Use SVD to find XV = UΣ; plot the first two columns of XV (or UΣ).
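
Putting the recipe together as a hedged sketch with matplotlib (the data matrix D is a placeholder):

import numpy as np
import matplotlib.pyplot as plt

D = np.random.default_rng(0).normal(size=(100, 4))   # placeholder data matrix
X = (D - D.mean(axis=0)) / np.sqrt(D.shape[0])       # centered, scaled matrix

U, s, Vt = np.linalg.svd(X, full_matrices=False)
pcs = U * s                                          # XV = UΣ; columns are the principal components
plt.scatter(pcs[:, 0], pcs[:, 1])                    # plot the first two columns
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()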

13 of 28

Break!

Fill out Attendance:

http://bit.ly/at-d100

14 of 28

Variance

15 of 28

Maximizing Variance

Variance is the expected squared deviation from the mean. For a finite set of attribute values with mean of zero, the variance is the average squared value.

import numpy as np

odds = np.array([-1, 3, 5, -7])  # Attribute values with zero mean
np.var(odds) == np.average(odds**2) == 1/len(odds) * odds @ odds  # True: all equal 21.0

16 of 28

Maximizing Variance

The total variance of a data matrix is the sum of the variances of the attributes.

The sum of squared singular values in PCA is equal to the total variance of the original data matrix.

Each squared singular value (a component score) indicates how much of the total variance is accounted for by the corresponding principal component.

Dividing the centered data by √m is necessary to maintain this relationship.
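
A quick numerical check of this relationship (the data matrix D here is random, just for illustration):

import numpy as np

D = np.random.default_rng(0).normal(size=(200, 4))   # placeholder data matrix
X = (D - D.mean(axis=0)) / np.sqrt(D.shape[0])       # center each column, divide by sqrt(m)

s = np.linalg.svd(X, compute_uv=False)               # singular values only
total_variance = D.var(axis=0).sum()                 # sum of the attribute variances
np.allclose(np.sum(s**2), total_variance)            # True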

17 of 28

Scree Plots

(Demo)

A scree plot shows the diagonal values of Σ² (the squared singular values), largest first.

If the first two singular values are large and all others are small, then two dimensions are enough to describe most of what distinguishes one observation from another.

If not, then a PCA scatter plot is omitting lots of information.
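
A hedged matplotlib sketch of a scree plot, given singular values from the SVD of the centered, scaled matrix (the values below are made up):

import numpy as np
import matplotlib.pyplot as plt

s = np.array([4.0, 2.5, 0.4, 0.2])                    # example singular values, largest first
plt.plot(np.arange(1, len(s) + 1), s**2, marker='o')  # plot the component scores
plt.xlabel('Principal component')
plt.ylabel('Squared singular value')
plt.show()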

18 of 28

PCA: A Declarative View

Step 0: Center the data matrix by subtracting the mean of each column.

Step 1: Find a k-dimensional projection of the centered matrix that retains as much of the total variance of the original matrix as possible.

Equivalently, find a k-dimensional projection that minimizes the projection error.

In other words: PCA describes our goals for a good projection. SVD is the algorithm used to conduct PCA.

19 of 28

Question: First Principal Component

What’s the relationship between the first singular value and the scale of the x-axis in a PCA scatter plot?

Answer:

The x-axis positions are the values of the first column of UΣ. Since U is orthonormal, that column's length (norm) is σ₁, the first singular value.

The sum of the squares of these values is the first singular value squared.
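
A small numpy check of this claim (the data matrix here is random, just for illustration):

import numpy as np

D = np.random.default_rng(0).normal(size=(100, 3))
X = (D - D.mean(axis=0)) / np.sqrt(D.shape[0])       # centered, scaled matrix
U, s, Vt = np.linalg.svd(X, full_matrices=False)

first_pc = U[:, 0] * s[0]                            # first column of UΣ: the x-axis positions
np.allclose(np.linalg.norm(first_pc), s[0])          # True: its length is the first singular value
np.allclose(np.sum(first_pc**2), s[0]**2)            # True: sum of squares is the first singular value squared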

(Annotations on the demo scatter plot: What does 0.05 measure? Why is this point at (0.11, 0)?)

20 of 28

Interpretation

21 of 28

Principal Component Directions (Axes)

(Demo)

A principal component direction is a linear combination of attributes.

Plotting the values of the first principal component direction can provide insight into how attributes are being combined.

  • High-variance attributes are typically included (but not always).
  • Many attributes are often included, even if only a few are really important.

Interpreting other principal components is challenging; the constraint that they are orthogonal to prior components strongly influences their directions.
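
One hedged way to inspect the first direction, assuming named attribute columns (both the names and the data below are made up):

import numpy as np
import matplotlib.pyplot as plt

attributes = ['age', 'acidity', 'color', 'odor']      # hypothetical attribute names
D = np.random.default_rng(0).normal(size=(60, 4))     # placeholder data matrix
X = (D - D.mean(axis=0)) / np.sqrt(D.shape[0])
U, s, Vt = np.linalg.svd(X, full_matrices=False)

plt.bar(attributes, Vt[0])                            # weight of each attribute in the first direction
plt.ylabel('Weight in PC1')
plt.show()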

22 of 28

Analyzing Attributes Instead of Observations

(Demo)

In some datasets, observations and attributes can be reversed.

How? Transpose the matrix and perform PCA.

  • This is SVD on a different matrix: center each row (instead of each column) and scale by the square root of the number of columns (a sketch follows this list).
  • However, the rank of a matrix is the same as the rank of its transpose.
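
A sketch of the transposed analysis (the data matrix D is a placeholder):

import numpy as np

D = np.random.default_rng(0).normal(size=(30, 8))     # placeholder: rows are observations, columns are attributes
Dt = D.T                                              # transpose: now each row is an attribute
X = (Dt - Dt.mean(axis=0)) / np.sqrt(Dt.shape[0])     # equivalently: center each row of D, scale by sqrt(# of columns)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
pcs = U[:, :2] * s[:2]                                # 2D coordinates, one point per attribute instead of per observation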

23 of 28

FAQ

Wait, what’s PCA again?

  • PCA is a method of summarizing data
  • Example: we have a bunch of wine bottles and we can measure many features about each wine
    • Color, Odor, Age, etc.
    • But some of these will measure related properties and will be redundant.
  • PCA lets us summarize these wines with fewer features.

24 of 28

FAQ

So PCA finds repeated features and discards them?

  • Not quite!
  • It creates new features by combining the old ones.
    • For example, PC1 could be 10 * age + 2 * acidity
  • It finds the best combinations out of all the possible combinations.

25 of 28

FAQ

What do you mean by “PCA summarizes the list of wines”?

  • Two perspectives:
    • A good summary captures features that have high variance because those features distinguish between different wines
    • A good summary lets you work backwards to find the original features (aka reconstruct them).
  • These two are equivalent!

26 of 28

FAQ

Why are those two goals equivalent?

Maximizing variance = spreading out red dots

Minimizing error = making red lines short

27 of 28

FAQ

Imagine that the black line is a stick, and the red lines are springs attached to the stick from the points.

The first PC is where the stick comes to rest.

SVD finds this for us.

28 of 28

Summary

  • PCA is a technique to summarize data
    • PCA has one goal stated two different ways:
      • Find directions that minimize projection error
      • Find directions that maximize captured variance
    • To conduct PCA, we use SVD (alternatives possible too)
  • If PCA is successful, the 2D plot will still preserve structure of original data
  • Scree plots tell us how much information is lost in PCA