1 of 34

PCA

Principal Component Analysis

An unsupervised learning technique for dimensionality reduction

2 of 34

3 of 34

Too much of anything is good for nothing!

What happens when the given data set has too many variables?

What if most of the variables are correlated?

What happens to the accuracy of the model if the variables are correlated?

How can we overcome the situation?

Solution: Dimensionality Reduction

The process of reducing the number of features (dimensions) in a dataset while preserving its essential information.

4 of 34

Basics of Dimensionality Reduction

Dimensionality → the number of features/columns/variables in a dataset. The objective is to minimize this number.

Why?

  • Redundancy: Correlated features add noise (e.g., "total purchase amount" vs. "number of items purchased").
  • Efficiency: Fewer features improve computational efficiency and model performance.
  • Curse of Dimensionality: Higher dimensions result in sparser data, complicating analysis.

Reducing dimensionality leads to clearer insights and better models. Dimensionality reduction methods are crucial for simplifying models and preventing overfitting.

5 of 34

According to the Hughes phenomenon, if the number of training samples is fixed and we keep increasing the number of dimensions, the predictive power of a machine learning model first increases, but after a certain point it tends to decrease.

This is known as "The curse of dimensionality".

6 of 34

Curse of Dimensionality

Imagine you have a dataset with just a few dimensions, like a list of people's ages and heights. It's easy to visualize and understand this data because it's simple and intuitive.

Now, let's say you add more dimensions to the dataset, like weights, incomes, and geographical locations. As the number of dimensions increases, the space in which the data points exist becomes larger and more spread out.

Data Sparsity: The available data becomes sparser, meaning there are fewer data points relative to the size of the space. This makes it harder to find meaningful patterns or relationships in the data. It also increases computational complexity and the risk of overfitting.

7 of 34

PRINCIPAL COMPONENT ANALYSIS (PCA)

A widely used unsupervised machine learning method for dimensionality reduction, used to simplify a large dataset into a smaller one while still maintaining significant patterns and trends.

8 of 34

What is PCA?

An unsupervised machine learning technique used to reduce the dimensionality of large datasets while preserving most of the variance.

Key Idea:

Transforms correlated variables into uncorrelated principal components (PCs).

Analogy:

"Viewing data from the most informative angle."

9 of 34

PCA (Principal Component Analysis)

Reduce the number of features (columns or dimensions) in your data while keeping as much useful information as possible.

Why Use PCA?

  • Helps visualize high-dimensional data (e.g., 100 features → 2 or 3 dimensions).
  • Speeds up machine learning models by removing unnecessary features.
  • Removes redundancy (correlated features).

10 of 34

Goal of PCA

PCA is a dimension reduction technique that projects the data onto K dimensions by maximizing the variance of the data.

Goal:

While the data in the higher-dimensional space is mapped to data in the lower-dimensional space, the variance of data in the lower-dimensional space should be maximized.

PCA converts a set of correlated features to a set of uncorrelated features.

11 of 34

Overview of PCA

A dimensionality reduction technique that:

  • Maximizes variance in the projected lower-dimensional space.
  • Performs feature extraction by transforming the original features into new, uncorrelated features (principal components).

Key Idea:

  • Projects data onto K orthogonal axes (principal components, PCs) where variance is maximized.
  • PCs preserve the most important structure while reducing noise/redundancy.

Mathematically

  • Input: Correlated features X.
  • Output: Uncorrelated components Z = XW (where the columns of W are the eigenvectors of XᵀX).

Why It Works:

  • Eigenvectors = Directions of max variance.
  • Eigenvalues = Amount of variance retained.

12 of 34

Key Concepts Behind PCA

Variance Matters: PCs capture directions of maximum variance.

Orthogonality: PCs are perpendicular (uncorrelated).

Eigenvalues: Indicate the importance of each PC.

13 of 34

Feature Extraction using PCA

Suppose we have N independent variables. In feature extraction, we create N "new" independent variables, where each "new" variable is a linear combination of the N "old" variables.

However, we create these new variables in a specific way and order them by how much of the data's variance each one captures (PCA is unsupervised and does not use the dependent variable).

14 of 34

Height (X1) | Weight (X2) | Age (X3) | Y (active/not active)
170 | 65 | 25 |
180 | 70 | 30 |
185 | 68 | 28 |

After applying the PCA technique, we create new coordinate axes PC1, PC2, PC3, where each PC is a linear combination of the original features and is oriented in the direction of maximum variance in the data.

PC1 = v11·X1 + v12·X2 + v13·X3

PC2 = v21·X1 + v22·X2 + v23·X3

PC3 = v31·X1 + v32·X2 + v33·X3

This transformation lets us represent the original data in a new coordinate system that emphasizes the directions with the most variability. This is useful for dimensionality reduction (i.e., keep PC1 and PC2, discard PC3, which captures the least variance, reducing the data from 3D to 2D).

  • PC1: the axis that captures the most variance.
  • PC2: orthogonal to PC1; captures the second most variance.
  • PC3: orthogonal to both PC1 and PC2; captures the third most variance.

Final goal: represent the data using PC1 and PC2 only.

15 of 34


PC1 = v11·X1 + v12·X2 + v13·X3

PC2 = v21·X1 + v22·X2 + v23·X3

PC3 = v31·X1 + v32·X2 + v33·X3

v11, v12, v21, v31, etc. are the components of the eigenvectors, which are also referred to as loadings.

These coefficients represent how much each original feature contributes to the corresponding principal component.

E.g. Suppose we calculate the eigenvectors v1,v2, v3 and get the following:

Eigenvector 1 (for PC1): [0.5,0.5,0.707]

Eigenvector 2 (for PC2): [−0.707,0,0.707]

Eigenvector 3 (for PC3): [−0.408,−0.816,0.408]

Where,

  • Eigenvector for PC1: v1 = [v11, v12, v13]ᵀ: direction of PC1
  • Eigenvector for PC2: v2 = [v21, v22, v23]ᵀ: direction of PC2
  • Eigenvector for PC3: v3 = [v31, v32, v33]ᵀ: direction of PC3

Each eigenvector vᵢ has an associated eigenvalue λᵢ that quantifies how much variance its principal component captures from the original data.

16 of 34

Step-by-Step: How PCA Works

  1. Standardize the Data (Mean = 0, Variance = 1).
  2. Compute Covariance Matrix (Relationships between features).
  3. Calculate Eigenvectors & Eigenvalues (Principal Components).
  4. Sort PCs by Eigenvalues (Highest = Most important).
  5. Project Data onto PCs (Transform into new space).
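The five steps above can be sketched end-to-end with NumPy; the toy data, shapes, and variable names below are illustrative, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # toy data: 100 samples, 3 features

# 1. Standardize the data (mean = 0, variance = 1 per feature)
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Compute the covariance matrix (features as columns)
cov = np.cov(Xs, rowvar=False)

# 3. Calculate eigenvectors & eigenvalues (eigh handles symmetric matrices)
eigvals, eigvecs = np.linalg.eigh(cov)

# 4. Sort PCs by eigenvalue, largest first
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 5. Project the data onto the top k principal components
k = 2
Z = Xs @ eigvecs[:, :k]
print(Z.shape)   # (100, 2)
```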

17 of 34

18 of 34

Step 1: Standardize the Data

PCA is affected by the scale of features (e.g., height in cm vs. weight in kg).

  • Apply “Standardization”: Convert all features to the same scale (mean = 0, standard deviation = 1).

Why? Ensures all features contribute equally.
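As a quick check, the three rows from the earlier example table (height, weight, age; Y omitted) can be standardized by hand with NumPy:

```python
import numpy as np

# rows from the example table: height (cm), weight (kg), age (years)
X = np.array([[170., 65., 25.],
              [180., 70., 30.],
              [185., 68., 28.]])

# standardize each column: subtract its mean, divide by its standard deviation
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

print(Xs.mean(axis=0))   # ≈ [0, 0, 0]
print(Xs.std(axis=0))    # [1, 1, 1]
```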

19 of 34

Step 2: Compute Covariance Matrix (Σ)

Covariance matrix Σ is a table showing how every feature relates to every other feature:

  • Diagonal entries: Variance of each feature (spread).
  • Off-diagonal entries: Covariance between feature pairs (relationships).
    • Positive → Features increase together
    • Negative → One increases while the other decreases
    • Zero → No linear relationship

Each entry cov(xi, xj) represents how features i and j vary together.

Once the data is standardized, we want to understand how features vary together.

Why? We want to find the directions in feature space where the data varies the most, and the covariance matrix tells us exactly that.
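A small illustration with made-up data: one feature pair is constructed to be correlated, another to be independent, and the covariance matrix reflects both:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 2 * x + rng.normal(scale=0.5, size=200)   # built to rise and fall with x
z = rng.normal(size=200)                      # independent of both

X = np.column_stack([x, y, z])
cov = np.cov(X, rowvar=False)   # 3x3: variances on the diagonal, covariances off it

print(cov[0, 1])   # clearly positive: x and y increase together
print(cov[0, 2])   # near zero: no linear relationship
```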

20 of 34

Step 3: Calculate the Eigenvectors and Eigenvalues of the Covariance Matrix

Next we decompose the covariance matrix into eigenvalues and eigenvectors.

  • Eigenvectors and eigenvalues are the linear algebra concepts that we need to compute from the covariance matrix to determine the principal components of the data.

Principal Components: By finding the eigenvector with the largest eigenvalue, we identify the direction of the largest spread in the data.

21 of 34

Eigenvectors and Eigenvalues

Covariance Matrix defines both the spread (variance) and the orientation (covariance) of our data.

  • Eigenvectors: Indicate the directions of maximum variance (most information) in the data (Principal Components).
    • Why? the eigenvectors of the covariance matrix are the directions where variance is maximized.
  • Eigenvalues: are simply the coefficients attached to eigenvectors, which give the amount of variance carried in each Principal Component.
    • Represent the magnitude of variance in each Principal Component.

22 of 34

Step 3: Find Eigenvectors & Eigenvalues

PCA uses Σ to find the directions (eigenvectors) of maximum variance and to determine the importance (eigenvalues) of each direction.

We solve: Σv = λv

  • Σ (Sigma) = Covariance matrix (shows how features relate to each other)
  • v = Eigenvector (direction (Principal Component))
  • λ (Lambda) = An eigenvalue (a number/scalar telling how important (variance) that direction is).
    • PCs are sorted by λ: biggest λ = most important

What do we get?

  • All the eigenvectors together form new axes/directions in feature space — these are the principal components.
    • 1st eigenvector = direction of max variance
    • 2nd eigenvector = next best orthogonal direction
  • We sort them by eigenvalues (largest first).

How to Compute Eigenvectors/Values for PCA:

  1. Solve det(Σ − λI) = 0 for λ
  2. For each λ, solve Σv = λv

NOT COVERED IN THIS COURSE
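Although the hand computation is not covered, NumPy performs the same decomposition; a sketch using a hypothetical 2-feature covariance matrix:

```python
import numpy as np

cov = np.array([[1.0, 0.8],
                [0.8, 1.0]])   # hypothetical covariance of two correlated features

eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: for symmetric matrices

# each pair satisfies the defining equation Σv = λv
for lam, v in zip(eigvals, eigvecs.T):
    assert np.allclose(cov @ v, lam * v)

print(eigvals)   # ≈ [0.2, 1.8], returned in ascending order
```

Note that `eigh` returns eigenvalues in ascending order, so the last column of `eigvecs` is the first principal component.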

23 of 34

Step 4 (Part A): Sort/Rank all Principal Components

Eigenvectors (v) = Best directions to project data.

Eigenvalues (λ) = How important each direction is.

PCA uses this to find the "most important" directions (principal components) and throw away low-variance directions to reduce dimensions.

Step 1: Sort Eigenvalues

  • Arrange the eigenvalues (λ) in descending order: λ1 > λ2 > … > λn (largest λ = direction of max variance).

Step 2: Choose Top k Eigenvectors

  • Select the first k eigenvectors (PCs) corresponding to the largest eigenvalues. These define the most important directions in the data.

NEXT: Step 3: Determine k (How Many PCs to Keep?)

24 of 34

Example: Rank the Eigenvectors

Once we get the eigenvectors and eigenvalues:

Rank the eigenvectors: Order eigenvectors by eigenvalues, highest to lowest.

25 of 34

Step 4 (Part B): Select Top Principal Components

Choose the top k eigenvectors (principal components) that capture most of the variance.

How many components (k) to keep?

  • We keep the PCs that collectively capture most of the variance and discard those with minimal contribution.
  • Use the explained variance ratio: the fraction of the total variance captured by each PC (λi divided by the sum of all eigenvalues).

To decide how many PCs to keep, look at each eigenvalue's contribution to the total variance.

Common strategies:

  • Retain PCs covering >95% cumulative variance (for accuracy).
  • Keep PCs with λ ≥ 1 (Kaiser’s rule).
  • Use the elbow method in a scree plot.
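The cumulative-variance rule can be sketched with hypothetical eigenvalues (already sorted descending):

```python
import numpy as np

eigvals = np.array([2.5, 1.2, 0.2, 0.1])   # hypothetical eigenvalues, descending

ratio = eigvals / eigvals.sum()    # explained variance ratio per PC
cumulative = np.cumsum(ratio)      # running total of explained variance

# keep the smallest k whose cumulative variance reaches 95%
k = int(np.searchsorted(cumulative, 0.95) + 1)
print(k)   # 3, since 0.625 + 0.300 + 0.050 = 0.975 >= 0.95
```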

26 of 34

How to decide top k PCs

  1. Elbow Method: Create a Scree plot
  2. For data visualization: select 2 or 3 PCs.
  3. Calculate the percentage of explained variance and analyze the contribution percentages:
    • Common approach: keep PCs that together explain a specific percentage of the total variance (e.g., 80% or 90%).

Each bar shows the explained variance percentage of individual components. The red line is a type of scree plot. By analyzing this plot, we decide to keep only the first two or three components because they capture almost all the variance in the dataset.

27 of 34

Example: Select Principal Components by Cumulative Explained Variance (Quantitative Method)

To determine the percentage of variance (information) explained by each PC, we divide the eigenvalue of that PC by the sum of all eigenvalues, and then multiply by 100% for percentage.
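For example, with hypothetical eigenvalues summing to 4, the percentages work out as follows:

```python
eigvals = [2.0, 1.0, 0.5, 0.5]   # hypothetical eigenvalues, sorted descending
total = sum(eigvals)             # 4.0

# each PC's share of the information: eigenvalue / total, times 100%
percent = [100 * lam / total for lam in eigvals]
print(percent)   # [50.0, 25.0, 12.5, 12.5]: PC1 alone explains 50% of the variance
```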

28 of 34

Step 5: Transform the Data: Project Onto New Axes

Finally, we project the original standardized data X_standardized onto the selected eigenvectors W_k, the matrix of the top k eigenvectors (each column is an eigenvector):

X_reduced = X_standardized · W_k

This gives us a new dataset with fewer dimensions (say, 2 instead of 5), where:

  • Each column is a principal component.
  • Most of the variance is retained.
  • Result: original data (e.g., 10 features) → reduced data (e.g., 2 features).
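A minimal NumPy sketch of this projection step, assuming toy standardized data and k = 2:

```python
import numpy as np

rng = np.random.default_rng(2)
X_standardized = rng.normal(size=(50, 5))   # toy standardized data: 50 samples, 5 features

cov = np.cov(X_standardized, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending order
Wk = eigvecs[:, ::-1][:, :2]                # top-2 eigenvectors as columns

X_reduced = X_standardized @ Wk             # (50, 5) @ (5, 2) -> (50, 2)
print(X_reduced.shape)
```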

29 of 34

Summary

PCA finds new uncorrelated directions (principal components) that maximize variance in the data.

How?

  1. Compute Covariance Matrix (Σ)
    • Measures how features vary together.
  2. Solve Σv = λv
    • Eigenvectors (v): Directions of max variance (PC axes).
    • Eigenvalues (λ): Variance along each PC (larger λ = more important).
  3. Sort PCs
    • Keep top K PCs with largest λ (reduces dimensions).

Why It Works

  • 1st PC: Direction of max variance.
  • Next PCs: Orthogonal to previous, capturing remaining variance.

Result

Transformed data with fewer features, preserving most information.

Data → Σ → Eigenvectors → Sorted PCs → Reduced Data

30 of 34

A Visualization using PCA

31 of 34

PCA in Python

(sklearn)
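In scikit-learn the whole pipeline is two classes; a minimal sketch on toy data (the feature count and n_components are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 4] = X[:, 0] + 0.05 * rng.normal(size=100)   # make one feature redundant

# Step 1: standardize, then let PCA do steps 2-5
X_std = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)            # or n_components=0.95 to keep 95% of variance
X_reduced = pca.fit_transform(X_std)

print(X_reduced.shape)                  # (100, 2)
print(pca.explained_variance_ratio_)    # variance share captured by each kept PC
```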

32 of 34

Key Takeaways

  • PCA = Find new axes (PCs) where data varies most.
  • Eigenvectors = directions, Eigenvalues = importance (how much variance is in that direction).
  • PCA chooses the directions that best explain the data.
  • Project data onto top PCs to reduce dimensions.
  • PCA is especially useful for preprocessing, noise reduction, and exploration.

33 of 34

Benefits

  • Dimension Reduction: PCA reduces the number of features while retaining essential information, making datasets more manageable.
  • Noise Reduction: eliminates noise in the data, improving feature quality and model performance.
  • Feature Transformation: transforms the original features into orthogonal components, potentially revealing hidden patterns.
  • Visualization: PCA simplifies data into 2D or 3D for easier visual understanding.
  • Feature Selection: PCA ranks principal components, helping indirectly with feature selection.

Downsides to PCA

  • Linearity: PCA assumes straight-line relationships in the data, which isn't always the case.
  • Interpretability: principal components can be hard to understand in terms of the original features.
  • Loss of Information: PCA prioritizes variance and might lose important data details.
  • No Feature Learning: PCA doesn't create new features; it works with what's already there.

34 of 34
