PCA
Principal Component Analysis
An unsupervised learning technique for dimensionality reduction
Too much of anything is good for nothing!
What happens when the given data set has too many variables?
What if most of the variables are correlated?
What happens to the accuracy of the model if the variables are correlated?
How can we overcome the situation?
Solution: Dimensionality Reduction
The process of reducing the number of features (dimensions) in a dataset while preserving its essential information.
Basics of Dimensionality Reduction
Dimensionality → the number of features/columns/variables in a dataset. The objective is to minimize this number.
Why?
Reducing dimensionality leads to clearer insights and better models. Dimensionality reduction methods are crucial for simplifying models and preventing overfitting.
According to the Hughes phenomenon, if the number of training samples is fixed and we keep increasing the number of dimensions, the predictive power of our machine learning model first increases, but after a certain point it tends to decrease.
This is known as "The curse of dimensionality".
Curse of Dimensionality
Imagine you have a dataset with just a few dimensions, like a list of people's ages and heights. It's easy to visualize and understand this data because it's simple and intuitive.
Now, let's say you add more dimensions to the dataset, like weights, incomes, and geographical locations. As the number of dimensions increases, the space in which the data points exist becomes larger and more spread out.
Data Sparsity: the available data becomes sparser, meaning there are fewer data points relative to the size of the space. This makes it harder to find meaningful patterns or relationships in the data. Higher dimensionality also increases computational complexity and the risk of overfitting.
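The sparsity effect can be sketched numerically (a hypothetical illustration with random points, not data from the slides): with a fixed number of samples, the fraction of occupied grid cells collapses as dimensions are added.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points, bins = 1000, 10  # assumed sample size and bins per axis

# With a fixed number of points, the fraction of occupied grid cells
# shrinks rapidly as dimensionality grows: bins**d cells vs. n_points samples.
for d in (1, 2, 3, 6):
    data = rng.random((n_points, d))              # points in the unit cube
    cells = {tuple((row * bins).astype(int)) for row in data}
    print(f"d={d}: {len(cells)} occupied of {bins**d} cells "
          f"({len(cells) / bins**d:.4%})")
```

At d=1 nearly every cell holds data; at d=6 fewer than 0.1% of cells do, which is the curse of dimensionality in miniature.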
PRINCIPAL COMPONENT ANALYSIS (PCA)
A widely used unsupervised machine learning method for dimensionality reduction that simplifies a large data set into a smaller one while still maintaining significant patterns and trends.
What is PCA?
An unsupervised machine learning technique used to reduce the dimensionality of large datasets while preserving most of the variance.
Key Idea:
Transforms correlated variables into uncorrelated principal components (PCs).
Analogy:
"Viewing data from the most informative angle."
PCA (Principal Component Analysis)
Reduce the number of features (columns or dimensions) in your data while keeping as much useful information as possible.
Why Use PCA?
Goal of PCA
PCA is a dimension reduction technique that projects the data onto K dimensions while maximizing the variance of the data.
Goal:
While the data in the higher-dimensional space is mapped to data in the lower-dimensional space, the variance of data in the lower-dimensional space should be maximized.
PCA converts a set of correlated features to a set of uncorrelated features.
Overview of PCA
A dimensionality reduction technique that:
Key Idea:
Mathematically
Why It Works:
Key Concepts Behind PCA
Variance Matters: PCs capture directions of maximum variance.
Orthogonality: PCs are perpendicular (uncorrelated).
Eigenvalues: Indicate the importance of each PC.
Feature Extraction using PCA
Suppose we have N independent variables. In feature extraction, we create N "new" independent variables, where each "new" independent variable is a combination of the N "old" independent variables.
However, we create these new independent variables in a specific way and order these new variables by how well they predict our dependent variable.
Height (X1) | Weight (X2) | Age (X3) | Y (active/not active) |
170 | 65 | 25 | … |
180 | 70 | 30 | … |
185 | 68 | 28 | |
After applying the PCA technique, we create new coordinate axes PC1, PC2, PC3 where each PC is a linear combination of the original features and is oriented in the direction of maximum variance in the data.
PC1=v11X1+v12X2+v13X3
PC2=v21X1+v22X2+v23X3
PC3=v31X1+v32X2+v33X3
This transformation allows you to represent the original data in a new coordinate system that emphasizes the directions with the most variability. This is useful for dimensionality reduction (e.g., keep PC1 and PC2, discard PC3 since it captures the least variance, and reduce the data from 3D to 2D).
PC1: the axis that captures the most variance.
PC2: orthogonal to PC1 and captures the second most variance.
PC3: orthogonal to both PC1 and PC2, capturing the third most variance.
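The PC equations above amount to one matrix product. A sketch with made-up standardized samples and the example loadings from these slides (not values from any real dataset):

```python
import numpy as np

# Hypothetical standardized samples: columns = Height, Weight, Age
X = np.array([[ 1.0, -0.5,  0.3],
              [-0.8,  1.2, -0.6],
              [ 0.2, -0.7,  0.9]])

# Hypothetical loading matrix: row i holds (v_i1, v_i2, v_i3) for PC_i
V = np.array([[ 0.5,    0.5,   0.707],
              [-0.707,  0.0,   0.707],
              [-0.408, -0.816, 0.408]])

# Each PC score is a linear combination of the original features:
# PC_i = v_i1*X1 + v_i2*X2 + v_i3*X3  ->  scores = X @ V.T
scores = X @ V.T
print(scores.shape)  # (3, 3): 3 samples, 3 PC scores each

# Keeping only PC1 and PC2 reduces the data from 3D to 2D
reduced = scores[:, :2]
print(reduced.shape)  # (3, 2)
```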
Final Goal: represent the data using PC1 and PC2 only.
After applying the PCA technique, we create new coordinate axes PC1, PC2, PC3 where each PC is a linear combination of the original features and is oriented in the direction of maximum variance in the data.
PC1=v11X1+v12X2+v13X3
PC2=v21X1+v22X2+v23X3
PC3=v31X1+v32X2+v33X3
v11, v12, v21, v31, etc. are the components of the eigenvectors, which are also referred to as loadings.
These coefficients represent how much each original feature contributes to the corresponding principal component.
E.g. Suppose we calculate the eigenvectors v1,v2, v3 and get the following:
Eigenvector 1 (for PC1): [0.5,0.5,0.707]
Eigenvector 2 (for PC2): [−0.707,0,0.707]
Eigenvector 3 (for PC3): [−0.408,−0.816,0.408]
Each eigenvector vᵢ has an associated eigenvalue λᵢ that quantifies how much variance its principal component captures from the original data.
Step-by-Step: How PCA Works
Step 1: Standardize the Data
PCA is affected by the scale of features (e.g., height in cm vs. weight in kg).
Why? Ensures all features contribute equally.
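A minimal standardization sketch, reusing the hypothetical Height/Weight/Age table from earlier:

```python
import numpy as np

# Hypothetical raw features: Height (cm), Weight (kg), Age (years)
X = np.array([[170.0, 65.0, 25.0],
              [180.0, 70.0, 30.0],
              [185.0, 68.0, 28.0]])

# Standardize: subtract each column's mean, divide by its std.
# Without this step, large-scale features (height) would dominate the variance.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_std.mean(axis=0).round(6))  # ~[0, 0, 0]
print(X_std.std(axis=0))            # ~[1, 1, 1]
```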
Step 2: Compute Covariance Matrix (Σ)
Once the data is standardized, we want to understand how features vary together.
The covariance matrix Σ is a table showing how every feature relates to every other feature: each entry cov(xi, xj) represents how features i and j vary together.
Why? We want to find directions in feature space where the data varies the most, and the covariance matrix tells us that.
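A quick sketch of computing Σ with NumPy (the random standardized data here is a placeholder):

```python
import numpy as np

rng = np.random.default_rng(1)
X_std = rng.standard_normal((100, 3))  # placeholder standardized data

# Covariance matrix: Sigma[i, j] = cov(x_i, x_j).
# rowvar=False tells NumPy that columns (not rows) are the variables.
Sigma = np.cov(X_std, rowvar=False)
print(Sigma.shape)  # (3, 3)

# Sigma is symmetric: cov(x_i, x_j) == cov(x_j, x_i)
print(np.allclose(Sigma, Sigma.T))  # True
```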
Step 3: Calculate the Eigenvectors and Eigenvalues of the Covariance Matrix
Next, we decompose the covariance matrix into eigenvalues and eigenvectors.
Principal Components: By finding the eigenvector with the largest eigenvalue, we identify the direction of the largest spread in the data.
Eigenvectors and Eigenvalues
Covariance Matrix defines both the spread (variance) and the orientation (covariance) of our data.
Step 3: Find Eigenvectors & Eigenvalues
PCA uses Σ to: Find directions (eigenvectors) of max variance and Determine importance (eigenvalues) of each direction.
We solve: Σv = λv
What do we get?
How to Compute Eigenvectors/Values for PCA:
NOT COVERED IN THIS COURSE
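Although the hand computation is not covered, the decomposition is a single NumPy call. A sketch with a made-up covariance matrix:

```python
import numpy as np

# Hypothetical 3x3 covariance matrix (symmetric)
Sigma = np.array([[1.0, 0.8, 0.3],
                  [0.8, 1.0, 0.4],
                  [0.3, 0.4, 1.0]])

# eigh is designed for symmetric matrices; it returns eigenvalues in
# ascending order together with orthonormal eigenvectors (as columns).
eigenvalues, eigenvectors = np.linalg.eigh(Sigma)

# Each pair satisfies the defining equation: Sigma @ v = lambda * v
for lam, v in zip(eigenvalues, eigenvectors.T):
    assert np.allclose(Sigma @ v, lam * v)

print(eigenvalues)
```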
Step 4 (Part A): Sort/Rank all Principal Components
Eigenvectors (v) = Best directions to project data.
Eigenvalues (λ) = How important each direction is.
PCA uses this to find the "most important" directions (principal components) and throw away low-variance directions to reduce dimensions.
Step 1: Sort Eigenvalues:
Step 2: Choose Top *k* Eigenvectors
NEXT: Step 3: Determine *k* (How Many PCs to Keep?)
Sort eigenvalues in descending order (highest to lowest).
Example: Rank the Eigenvectors
Once we get the eigenvectors and eigenvalues:
Rank the eigenvectors: Order eigenvectors by eigenvalues, highest to lowest.
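A sketch of the ranking step (np.linalg.eigh returns eigenvalues in ascending order, so PCA flips them to highest-first; the covariance matrix is made up):

```python
import numpy as np

Sigma = np.array([[1.0, 0.8, 0.3],   # hypothetical covariance matrix
                  [0.8, 1.0, 0.4],
                  [0.3, 0.4, 1.0]])
eigenvalues, eigenvectors = np.linalg.eigh(Sigma)

# eigh returns ascending eigenvalues; PCA wants them highest-first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]  # reorder columns to match

print(eigenvalues)  # now in descending order
```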
Step 4 (Part B): Select Top Principal Components
Choose the top k eigenvectors (principal components) that capture most of the variance.
How many components (k) to keep?
To decide how many PCs to keep, look at each eigenvalue's contribution to the total variance.
Common strategies:
How to decide top k PCs
Each bar shows the explained variance percentage of an individual component; the red line is a type of scree plot. By analyzing this plot, we decide to keep only the first two or three components, because they capture almost all the variance in the dataset.
Example: Select Principal Components by Cumulative Explained Variance ( Quantitative Method)
To determine the percentage of variance (information) explained by each PC, we divide the eigenvalue of that PC by the sum of all eigenvalues, and then multiply by 100% for percentage.
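A sketch of this calculation with made-up eigenvalues (chosen so the percentages come out exact):

```python
import numpy as np

# Hypothetical eigenvalues, already sorted in descending order
eigenvalues = np.array([3.0, 0.75, 0.25])

# Percentage of variance each PC explains: lambda_i / sum(lambda) * 100
explained = 100 * eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(explained)
print(explained)   # 75%, 18.75%, 6.25%
print(cumulative)  # 75%, 93.75%, 100%

# Keep enough PCs to reach a threshold, e.g. 90% of total variance
k = int(np.searchsorted(cumulative, 90.0) + 1)
print(k)  # 2
```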
Step 5: Transform the Data: Project Onto New Axes
Finally, we project our original standardized data Xstandardized onto the selected eigenvectors Wk:
X_reduced = X_standardized · W_k
This gives us a new dataset with fewer dimensions (say, 2 instead of 5), where W_k is the matrix of the top k eigenvectors (each column is an eigenvector).
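The projection step as a sketch, tying the earlier steps together on placeholder data:

```python
import numpy as np

rng = np.random.default_rng(2)
X_std = rng.standard_normal((100, 5))  # placeholder standardized data, 5 features

# Steps 2-4: covariance matrix, eigendecomposition, rank descending
Sigma = np.cov(X_std, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(Sigma)
order = np.argsort(eigenvalues)[::-1]

# W_k: matrix of the top-k eigenvectors (each column is an eigenvector)
k = 2
W_k = eigenvectors[:, order[:k]]

# Step 5: project onto the new axes: X_reduced = X_standardized @ W_k
X_reduced = X_std @ W_k
print(X_reduced.shape)  # (100, 2)
```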
Summary
PCA finds new uncorrelated directions (principal components) that maximize variance in the data.
How?
Why It Works
Result
Transformed data with fewer features, preserving most information.
Data → Σ → Eigenvectors → Sorted PCs → Reduced Data
A Visualization using PCA
PCA in Python
(sklearn)
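A minimal sklearn sketch on placeholder data (the random dataset and n_components=2 are assumptions for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.standard_normal((100, 5))  # placeholder: 100 samples, 5 features

# Step 1: standardize; steps 2-5 happen inside PCA.fit_transform
X_std = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)            # keep the top 2 principal components
X_reduced = pca.fit_transform(X_std)

print(X_reduced.shape)                # (100, 2)
print(pca.explained_variance_ratio_)  # fraction of variance per PC, descending
```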
Key Takeaways
Benefits:
- Dimension Reduction: PCA reduces the number of features while retaining essential information, making datasets more manageable.
- Noise Reduction: PCA can eliminate noise in data, improving feature quality and model performance.
- Feature Transformation: PCA transforms original features into orthogonal components, potentially revealing hidden patterns.
- Visualization: PCA simplifies data into 2D or 3D for easier visual understanding.
- Feature Selection: PCA ranks principal components, helping indirectly in feature selection.

Downsides to PCA:
- Linearity: PCA assumes straight-line relationships in data, which isn't always the case.
- Interpretability: Principal components can be hard to understand in terms of the original features.
- Loss of Information: PCA prioritizes variance and might lose important data details.
- No Feature Learning: PCA doesn't create new features; it works with what's already there.