Applied
Big Data Analytics
Zubair Nabi
Lecture 6
Segmenting E-commerce Users
Our Big Data Analytics Stack
Our Big Data Analytics Stack: Today’s Lecture
Use Case for Today
User Segmentation in E-commerce
What is User Segmentation in E-commerce?
The process of dividing a customer base into distinct groups based on shared characteristics or behaviors for targeted marketing and engagement
E-commerce businesses use segmentation to tailor product recommendations, personalized offers, and marketing strategies, leading to improved customer retention, cross-selling, and targeting based on behavior or demographics
Some Key Concepts in E-commerce
Item/Product: The goods or services available for purchase, often listed with details like price, description, and reviews
Cart: A virtual basket where consumers place items they intend to purchase before proceeding to checkout
Payment: The process by which consumers complete their purchase, using methods such as credit cards, digital wallets, or bank transfers
Transaction: The complete process of purchasing items, from adding them to the cart to making the payment
Rating/Review: Feedback from previous customers about a product or service, typically scored on a scale, which helps others decide whether to buy
Representative E-commerce Dataset
Transaction ID | Customer ID | Product Name | Category | Price | Quantity | Purchase Date | Payment Method | Shipping City |
T001 | C001 | Wireless Earbuds | Electronics | 49.99 | 1 | 2024-09-01 | Credit Card | New York |
T002 | C002 | Running Shoes | Sportswear | 79.99 | 2 | 2024-09-02 | PayPal | Los Angeles |
T003 | C001 | Laptop Stand | Office | 29.99 | 1 | 2024-09-03 | Credit Card | New York |
T004 | C003 | Smartwatch | Electronics | 199.99 | 1 | 2024-09-04 | Debit Card | Chicago |
T005 | C004 | Yoga Mat | Sportswear | 24.99 | 1 | 2024-09-05 | Credit Card | San Francisco |
T006 | C005 | LED Desk Lamp | Home & Decor | 19.99 | 3 | 2024-09-06 | PayPal | Boston |
T007 | C006 | Blender | Kitchen | 59.99 | 1 | 2024-09-06 | Credit Card | Seattle |
T008 | C007 | Smartphone Case | Electronics | 12.99 | 2 | 2024-09-07 | Debit Card | Austin |
T009 | C003 | Noise-cancelling Headphones | Electronics | 89.99 | 1 | 2024-09-08 | Credit Card | Chicago |
T010 | C008 | Water Bottle | Sportswear | 14.99 | 1 | 2024-09-09 | PayPal | Miami |
What would be the simplest way to segment users?
Answer:
Segment Users by Location
Transaction ID | Customer ID | Product Name | Category | Price | Quantity | Purchase Date | Payment Method | Shipping City |
T001 | C001 | Wireless Earbuds | Electronics | 49.99 | 1 | 2024-09-01 | Credit Card | New York |
T002 | C002 | Running Shoes | Sportswear | 79.99 | 2 | 2024-09-02 | PayPal | Los Angeles |
T003 | C001 | Laptop Stand | Office | 29.99 | 1 | 2024-09-03 | Credit Card | New York |
T004 | C003 | Smartwatch | Electronics | 199.99 | 1 | 2024-09-04 | Debit Card | Chicago |
T005 | C004 | Yoga Mat | Sportswear | 24.99 | 1 | 2024-09-05 | Credit Card | San Francisco |
SELECT Shipping_City, COUNT(DISTINCT Customer_ID) AS Customer_Count
FROM ecommerce_table
GROUP BY Shipping_City;
Segment Users by High-Spending
Transaction ID | Customer ID | Product Name | Category | Price | Quantity | Purchase Date | Payment Method | Shipping City |
T001 | C001 | Wireless Earbuds | Electronics | 49.99 | 1 | 2024-09-01 | Credit Card | New York |
T002 | C002 | Running Shoes | Sportswear | 79.99 | 2 | 2024-09-02 | PayPal | Los Angeles |
T003 | C001 | Laptop Stand | Office | 29.99 | 1 | 2024-09-03 | Credit Card | New York |
T004 | C003 | Smartwatch | Electronics | 199.99 | 1 | 2024-09-04 | Debit Card | Chicago |
T005 | C004 | Yoga Mat | Sportswear | 24.99 | 1 | 2024-09-05 | Credit Card | San Francisco |
SELECT Customer_ID, SUM(Price * Quantity) AS Total_Spend
FROM ecommerce_table
GROUP BY Customer_ID
HAVING SUM(Price * Quantity) > 200;
Segment Users by Product Category
Transaction ID | Customer ID | Product Name | Category | Price | Quantity | Purchase Date | Payment Method | Shipping City |
T001 | C001 | Wireless Earbuds | Electronics | 49.99 | 1 | 2024-09-01 | Credit Card | New York |
T002 | C002 | Running Shoes | Sportswear | 79.99 | 2 | 2024-09-02 | PayPal | Los Angeles |
T003 | C001 | Laptop Stand | Office | 29.99 | 1 | 2024-09-03 | Credit Card | New York |
T004 | C003 | Smartwatch | Electronics | 199.99 | 1 | 2024-09-04 | Debit Card | Chicago |
T005 | C004 | Yoga Mat | Sportswear | 24.99 | 1 | 2024-09-05 | Credit Card | San Francisco |
SELECT Customer_ID, Category, COUNT(Product_Name) AS Product_Purchase_Count
FROM ecommerce_table
GROUP BY Customer_ID, Category
ORDER BY Customer_ID, Product_Purchase_Count DESC;
But it is hard to scale segmentation using SQL…
Why?
Shortcomings of SQL for Segmentation
Limited to Explicit Criteria: SQL requires predefined conditions set by the user for segmentation. It cannot discover segments based on hidden patterns or relationships in the data, potentially missing insights
Static Analysis: SQL queries provide results based on the dataset at the time of execution. As user behaviors evolve, you must manually adjust queries, making it less responsive to changing data patterns
Challenges with High-Dimensional Data: SQL performs well with structured data but struggles with high-dimensional datasets involving multiple variables
What we need instead…
Uncover hidden patterns and insights that predefined criteria might miss, allowing for a more dynamic and comprehensive understanding of customer segments
Techniques that can analyze large datasets to automatically identify natural groupings based on user behavior and characteristics
Enter Unsupervised Learning
Unsupervised learning is a machine learning approach that analyzes data without predefined labels, enabling algorithms to discover hidden patterns and structures within the dataset
This technique is especially valuable for tasks such as clustering, where it groups similar data points based on their inherent characteristics
Supervised vs Unsupervised Learning
| | Supervised Learning | Unsupervised Learning |
| Definition | Uses labeled data to train models | Analyzes data without predefined labels to discover patterns |
| Data Requirements | Requires a dataset with input-output pairs (features and labels) | Uses only input data without labels |
| Goal | Predict outcomes or classify data based on learned patterns | Identify hidden structures or groupings within the data |
| Common Algorithms | Linear regression, logistic regression, decision trees, support vector machines | K-means clustering, hierarchical clustering, principal component analysis (PCA) |
| Use Cases | Customer Churn Prediction, Fraud Detection, Dynamic Pricing, Sentiment Analysis | Customer Segmentation, Market Basket Analysis, Anomaly Detection, Behavioural Analysis |
(Supervised learning is the focus of Lectures 7, 8, and 9)
Different Types of Unsupervised Learning Techniques
Clustering: Groups similar data points together based on their characteristics, enabling the discovery of inherent patterns in the data; examples include K-Means, hierarchical clustering, and DBSCAN
Association Rule Learning: Identifies interesting relationships and patterns among variables in large datasets, commonly used for market basket analysis to uncover product purchase correlations; examples include the Apriori algorithm and FP-Growth
Dimensionality Reduction: A process that reduces the number of features in a dataset while preserving essential information, facilitating easier analysis and visualization; examples include Principal Component Analysis (PCA)
Let’s start with clustering
In this course, we will be focusing on K-means, its variants, and Gaussian Mixture Model (GMM) because they tend to scale well for large datasets
In comparison, DBSCAN and most hierarchical clustering algorithms are generally not as scalable, making them less suited for big data use cases where efficient parallel and distributed processing is essential
Clustering Scalability Issues
DBSCAN: Doesn't scale well because it relies on expensive neighborhood queries and global connectivity checks, which often require O(n²) computations in the worst case
Hierarchical Clustering: Most hierarchical clustering algorithms scale poorly because they typically require computing and storing pairwise distances between data points, leading to quadratic time and memory complexity. Additionally, repeatedly merging or splitting clusters involves global operations that further increase computational costs, making these methods impractical for very large datasets
What is K-Means Clustering?
A partitional clustering algorithm that divides data into a specified number of clusters (K) based on distance to cluster centroids
It is efficient for large datasets but assumes spherical cluster shapes and can be sensitive to outliers
Its simplicity, speed, and broad applicability have made it a popular choice
K-Means Clustering Algorithm
1. Choose the number of clusters (k)
2. Initialize centroids by randomly selecting k data points
3. Assign each data point to the nearest centroid using a distance metric (e.g., Euclidean distance), forming k clusters
4. Update the centroids by calculating the mean of all points in each cluster
5. Repeat steps 3-4 until the centroids no longer change or a set number of iterations is reached
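To make these steps concrete, below is a minimal NumPy sketch of the algorithm; the feature array X (one row per customer, with transaction count and total spending) is an illustrative stand-in for the notebook's generated data, not the notebook's actual code.

import numpy as np

def kmeans(X, k, max_iters=100, seed=42):
    rng = np.random.default_rng(seed)
    # Steps 1-2: choose k and initialize centroids from k randomly chosen data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 3: assign each point to the nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the mean of the points assigned to it
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Illustrative data: 1000 customers described by (number of transactions, total spending)
X = np.column_stack([np.random.poisson(10, 1000), np.random.gamma(2.0, 150.0, 1000)])
labels, centroids = kmeans(X, k=3)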
K-Means Clustering Algorithm (2)
The notebook randomly generates the number of transactions and total spending for 1000 customers
Runs and visualizes the algorithm step by step
https://colab.research.google.com/github/zubair-nabi/applied-big-data/blob/main/Notebooks/Lecture6/KMeansClustering.ipynb
How do you Evaluate an Unsupervised Learning Technique like K-Means Clustering?
Measuring the performance of unsupervised learning algorithms, such as clustering, is inherently different from supervised learning since there are no true labels to compare against
Of course we can use domain-specific output metrics such as customer segmentation effectiveness in marketing (i.e., revenue increase after targeting) or anomaly detection accuracy
But there are other options too
K-Means Clustering Evaluation Methods
Elbow Method: Looks for a point where adding more clusters yields diminishing returns in reducing within-cluster variance
Silhouette Score: Measures how well-separated the clusters are
K-Means Clustering Evaluation
| | Elbow Method | Silhouette Score |
| Definition | Measures the sum of squared distances (inertia) between data points and their nearest cluster center for different values of k (number of clusters) | Combines two factors: intra-cluster cohesion (how close points are to others in the same cluster) and inter-cluster separation (how far points are from points in other clusters) |
| Formula | Within-Cluster Sum of Squares: WCSS = Σj Σxi∈Cj ‖xi − μj‖², where μj is the centroid of cluster Cj | s(i) = (b(i) − a(i)) / max(a(i), b(i)), where a(i) is the intra-cluster distance, i.e., the average distance between point i and all other points in the same cluster, and b(i) is the inter-cluster distance, i.e., the average distance between point i and points in the nearest (different) cluster |
| Insight | The optimal k is at the elbow point, where increasing k further results in diminishing returns in terms of reducing inertia | The optimal k is the one that maximizes the average silhouette score, indicating that the clusters are well-separated and compact |
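A hedged scikit-learn sketch of computing both criteria over a range of k; the feature array X is the same illustrative stand-in used in the earlier K-Means sketch.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Illustrative customer features: (number of transactions, total spending)
X = np.column_stack([np.random.poisson(10, 1000), np.random.gamma(2.0, 150.0, 1000)])

inertias, silhouettes = {}, {}
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = km.inertia_                         # within-cluster sum of squares (elbow)
    silhouettes[k] = silhouette_score(X, km.labels_)  # average silhouette score

# Elbow: look for the k where inertia stops dropping sharply
# Silhouette: pick the k with the highest average score
best_k = max(silhouettes, key=silhouettes.get)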
K-Means Clustering Evaluation (2)
| | Elbow Method | Silhouette Score |
| Advantages | Provides a good, intuitive indication of when adding more clusters no longer significantly improves the model | Provides a quantitative score that balances cohesion and separation |
| Shortcomings | The "elbow" is sometimes subjective and not always clearly visible | Computationally expensive for large datasets, as it requires calculating distances between all points in the dataset |
K-Means Clustering Evaluation (3)
The notebook randomly generates the number of transactions and total spending for 1000 customers
Visualizes both Elbow Score and Silhouette Plot
https://colab.research.google.com/github/zubair-nabi/applied-big-data/blob/main/Notebooks/Lecture6/KMeansClustering-Evaluation.ipynb
K-Means has a few variants that improve its performance
K-Means Variants
K-Means++: Enhances the initialization process by selecting initial centroids in a way that they are more spread out across the data space
Mini-Batch K-Means: Processes data in small, random batches instead of the entire dataset at once
Bisecting K-Means: Works by recursively splitting clusters into two until the desired number of clusters is reached. This top-down divisive approach often improves clustering quality and scalability
K-Means Variants (2)
K-Means++: Reduces the likelihood of poor initialization, which can lead to suboptimal clustering; faster convergence due to better starting points, often resulting in fewer iterations needed to reach optimal clustering
Mini-Batch K-Means: Significantly faster than standard K-Means, especially for large datasets, because it reduces the computation required for each iteration; provides a good trade-off between convergence speed and clustering accuracy
Bisecting K-Means: Constructs a hierarchical clustering structure by recursively splitting clusters into two, leading to more balanced partitions; efficiently targets high-variance clusters for splitting, making it scalable for distributed environments
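A small sketch of the K-Means++ seeding idea, followed by how the three variants are typically invoked in scikit-learn; parameter values are illustrative, and BisectingKMeans assumes scikit-learn 1.1 or newer.

import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans, BisectingKMeans

def kmeans_pp_init(X, k, seed=42):
    # K-Means++ seeding: pick each new centroid with probability proportional
    # to its squared distance from the nearest centroid chosen so far
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = ((X[:, None, :] - np.array(centroids)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)

X = np.random.rand(1000, 2)              # illustrative customer features
init_centroids = kmeans_pp_init(X, k=3)

# scikit-learn equivalents of the three variants (illustrative parameters)
km_pp  = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=42).fit(X)
km_mb  = MiniBatchKMeans(n_clusters=3, batch_size=256, n_init=10, random_state=42).fit(X)
km_bis = BisectingKMeans(n_clusters=3, random_state=42).fit(X)  # requires scikit-learn >= 1.1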
But what if the data does not fit into the memory of a single server?
We will have to use a cluster of machines
How can we run K-Means in a parallel and distributed fashion?
Parallel K-Means++ Clustering Algorithm
Parallel K-Means++ Clustering Algorithm (2)
Similar to vanilla K-Means, the notebook randomly generates the number of transactions and total spending for 1000 customers
It initializes centroids, assigns data points, and updates centroids in parallel
Runs and visualizes the algorithm step by step
https://colab.research.google.com/github/zubair-nabi/applied-big-data/blob/main/Notebooks/Lecture6/Parallel-KMeansPlusPlusClustering.ipynb
The example that we looked at is parallel but not distributed
How can we achieve distributed computation?
Spark Implementation of K-Means
We will be using the DataFrame-based Spark ML API (spark.ml) instead of the older RDD-based MLlib API (spark.mllib)
The former uses DataFrames, while the latter is lower level and built on RDDs
Spark Implementation of K-Means (2)
1. Initialization: Initial centroids are chosen using a variant of K-means++, which selects well-spread centroids in a scalable, distributed manner
2. Iterative Assignment and Update:
Assignment Step: Each data point is assigned to its nearest centroid in parallel
Update Step: Partial sums and counts are computed across partitions to calculate new centroids
These steps iterate until the centroids stabilize (or until a maximum iteration count is reached)
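A hedged PySpark sketch of this flow; the schema, sample rows, and parameter values are illustrative rather than taken from the notebook.

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("KMeansSegmentation").getOrCreate()

# Illustrative per-customer aggregates (customer_id, num_transactions, total_spending)
df = spark.createDataFrame(
    [(1, 5, 250.0), (2, 12, 980.0), (3, 2, 60.0),
     (4, 8, 540.0), (5, 1, 25.0), (6, 15, 1200.0)],
    ["customer_id", "num_transactions", "total_spending"],
)

features = VectorAssembler(
    inputCols=["num_transactions", "total_spending"], outputCol="features"
).transform(df)

# initMode="k-means||" is Spark's scalable variant of K-Means++ initialization
model = KMeans(k=3, initMode="k-means||", maxIter=20, seed=42).fit(features)
segmented = model.transform(features)   # adds a "prediction" column with the cluster index
print(model.clusterCenters())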
Spark Implementation of K-Means (3)
Similar to earlier examples, the notebook randomly generates the number of transactions and total spending for 1000 customers
It uses the Spark ML implementation of K-Means
Runs and visualizes the output
https://colab.research.google.com/github/zubair-nabi/applied-big-data/blob/main/Notebooks/Lecture6/Spark-KMeans.ipynb
While standard K-Means partitions data into flat clusters, many datasets benefit from a hierarchical approach that recursively refines groupings for more balanced, insightful clusters
Enter Bisecting K-Means
Bisecting K-Means is a hierarchical clustering algorithm that recursively splits data into two clusters
It starts with all data in one cluster and divides the most heterogeneous clusters until the desired number of clusters is reached
This approach often yields more balanced, scalable, and refined clusters for large datasets
Bisecting K-Means Clustering Algorithm
1. Initialization: Start with one cluster that contains all the data points
2. Select a Cluster to Split: Identify the cluster with the highest heterogeneity (e.g., the largest sum of squared distances) or choose one at random
3. Bisection: Run standard K-Means with k=2 on the selected cluster, possibly repeating several times to find the best split
4. Replace Cluster: Replace the chosen cluster with the two new clusters from the split
5. Repeat: Repeat steps 2, 3, and 4 until the desired number of clusters is reached
Bisecting K-Means Clustering Algorithm (2)
Similar to vanilla K-Means, the notebook randomly generates the number of transactions and total spending for 1000 customers
Runs and visualizes the Bisecting algorithm step by step
https://colab.research.google.com/github/zubair-nabi/applied-big-data/blob/main/Notebooks/Lecture6/BisectingKMeansClustering.ipynb
Spark Implementation of Bisecting K-Means
1. Select Eligible Clusters: Identify clusters at the current hierarchical level that are eligible for splitting (e.g., those with high SSE or above a minimum size)
2. Group Clusters for Splitting: Group all eligible clusters together so that they can be processed simultaneously
3. Parallel k=2 Splitting: Execute the K-Means k=2 subroutine on each grouped cluster in parallel using Spark’s distributed RDD framework
4. Aggregate and Update: Aggregate the results from the parallel splits, evaluate each split, and update the cluster hierarchy by replacing the split clusters with the new sub-clusters
5. Iterate Until Completion: Repeat the above steps until the desired number of leaf clusters is reached or no further eligible clusters remain
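A hedged sketch of invoking Spark's Bisecting K-Means, reusing the illustrative features DataFrame assembled in the earlier Spark K-Means sketch.

from pyspark.ml.clustering import BisectingKMeans

# minDivisibleClusterSize controls which clusters remain eligible for splitting
bkm = BisectingKMeans(k=3, maxIter=20, minDivisibleClusterSize=1.0, seed=42)
bkm_model = bkm.fit(features)

segmented = bkm_model.transform(features)   # adds a "prediction" column
print(bkm_model.clusterCenters())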
Spark Implementation of Bisecting K-Means (2)
Continuing with the same running example but using Bisecting K-Means
https://colab.research.google.com/github/zubair-nabi/applied-big-data/blob/main/Notebooks/Lecture6/Spark-BisectingKMeans.ipynb
K-Means faces two key limitations in real-world applications: it makes hard, all-or-nothing cluster assignments, and it assumes roughly spherical cluster shapes
Let’s look at two examples:
In e-commerce, a customer might frequently purchase premium products yet occasionally hunt for deals, so forcing them into a single cluster loses information
The purchasing behavior of seasonal shoppers is highly concentrated during holiday seasons with sporadic activity during the rest of the year. This creates a cluster that's elongated along the "seasonality" dimension rather than forming a neat, spherical group
Enter Gaussian Mixture Model (GMM)
Gaussian Mixture Models (GMMs) model data as a mixture of Gaussian distributions
They assign each point a probability of belonging to each cluster, providing soft assignments
GMMs can capture complex, non-spherical cluster shapes by estimating unique covariances for each cluster
Key Properties of Gaussian Mixture Models
EM Algorithm: GMMs are typically optimized using the Expectation-Maximization (EM) algorithm, which iteratively refines the estimates of the Gaussian parameters and the probability of each point belonging to each cluster
Covariance Types: GMMs allow for different covariance structures (e.g., full, diagonal, spherical), giving flexibility in modeling the shape and orientation of clusters
Sensitivity to Initialization: Like many iterative methods, the outcome of GMMs can depend on the initial parameter settings, so multiple runs or careful initialization methods are often needed
Gaussian Mixture Model Algorithm
1. Initialization: Start with initial guesses for the parameters—mixing coefficients, means, and covariance matrices of the Gaussian components
2. Expectation (E) Step: For each data point, calculate the probability that it belongs to each Gaussian component based on the current parameters
3. Maximization (M) Step: Update the parameters (mixing coefficients, means, and covariances) by maximizing the expected likelihood computed in the E-step
4. Iteration: Repeat the E and M steps until the parameters converge or the changes become negligible
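A minimal scikit-learn sketch of fitting a GMM with EM and obtaining both hard and soft assignments; the feature array X is the same illustrative stand-in used earlier.

import numpy as np
from sklearn.mixture import GaussianMixture

# Illustrative customer features: (number of transactions, total spending)
X = np.column_stack([np.random.poisson(10, 1000), np.random.gamma(2.0, 150.0, 1000)])

# covariance_type="full" lets each component learn its own shape and orientation
gmm = GaussianMixture(n_components=3, covariance_type="full",
                      n_init=5, random_state=42).fit(X)   # EM runs inside fit()

hard_labels = gmm.predict(X)        # most likely component per customer
soft_probs  = gmm.predict_proba(X)  # soft assignment: probability per component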
How to Evaluate a Gaussian Mixture Model?
Log-Likelihood (LL): Quantifies how probable the observed data is given the model's parameters. A higher log-likelihood means the model's parameters better explain the observed data. Use it as a baseline to assess overall model fit, especially when comparing models of equal complexity
Akaike Information Criterion (AIC): Balances model fit and complexity by penalizing the number of parameters to prevent overfitting. Rewards models that achieve high LL while penalizing those with excessive complexity; lower AIC indicates a better model. Ideal for model selection in scenarios where moderate dataset sizes are involved and slight complexity is acceptable if it significantly improves the fit
Bayesian Information Criterion (BIC): Similar to AIC but imposes a stricter penalty for model complexity, especially effective for large datasets. Strongly penalizes complex models; a lower BIC value indicates a model that balances fit and simplicity, reducing the risk of overfitting. Particularly useful when working with large datasets
Log Likelihood (LL)
LL = Σi log( Σj πj · N(xi∣μj, Σj) ), summed over all n data points i and all k components j
n: Number of data points
k: Number of Gaussian components
πj: Mixing coefficient for component j
N(xi∣μj,Σj): Probability density of data point xi under the Gaussian with mean μj and covariance Σj
Measures the overall probability of the observed data under the model, summing the log of weighted Gaussian probabilities for each data point
Akaike Information Criterion (AIC)
AIC = 2p − 2·LL, where p is the number of parameters estimated by the model (determined by k and the covariance type)
k: Number of Gaussian components
LL: Log-likelihood of the model
Balances model fit and complexity by penalizing the total number of parameters, helping to select models that achieve good likelihood with fewer parameters
Bayesian Information Criterion (BIC)
BIC = p·ln(n) − 2·LL, where p is the number of estimated parameters
n: Number of data points
k: Number of Gaussian components
LL: Log-likelihood of the model
Similar to AIC but with a stronger penalty based on the sample size, favoring simpler models to reduce the risk of overfitting, especially in large datasets
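A hedged sketch of using these criteria for model selection in scikit-learn, continuing with the illustrative feature array X.

import numpy as np
from sklearn.mixture import GaussianMixture

X = np.column_stack([np.random.poisson(10, 1000), np.random.gamma(2.0, 150.0, 1000)])

scores = {}
for k in range(1, 8):
    gmm = GaussianMixture(n_components=k, covariance_type="full",
                          n_init=3, random_state=42).fit(X)
    scores[k] = {
        "LL": gmm.score(X) * len(X),  # score() returns the average log-likelihood per sample
        "AIC": gmm.aic(X),            # lower is better
        "BIC": gmm.bic(X),            # lower is better, with a stronger complexity penalty
    }

best_k = min(scores, key=lambda k: scores[k]["BIC"])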
Gaussian Mixture Model Algorithm
Let’s now apply GMM to our use case
https://colab.research.google.com/github/zubair-nabi/applied-big-data/blob/main/Notebooks/Lecture6/GMM.ipynb
Spark Implementation of Gaussian Mixture
1. Data Partitioning & Broadcasting: The dataset is split across multiple partitions, and current model parameters are broadcast to all worker nodes
2. Local E-Step: Each worker computes the soft assignments and partial statistics (e.g., weighted sums, log-likelihood contributions) for its data
3. Global Aggregation & M-Step: Partial results are combined (reduced) across partitions to update the model parameters, which are then broadcast for the next iteration
4. Iteration: Repeat the E and M steps until the parameters converge or the changes become negligible
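A hedged sketch of Spark ML's GaussianMixture, reusing the illustrative features DataFrame from the Spark K-Means sketch; in practice the model would be fitted on the full customer dataset rather than a toy DataFrame.

from pyspark.ml.clustering import GaussianMixture

gmm = GaussianMixture(k=2, maxIter=100, seed=42)
gmm_model = gmm.fit(features)

clustered = gmm_model.transform(features)   # adds "prediction" and "probability" columns
gmm_model.gaussiansDF.show(truncate=False)  # per-component mean and covariance
print(gmm_model.weights)                    # mixing coefficients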
Spark Implementation of Gaussian Mixture (2)
Continuing with the same running example but using Spark’s implementation of GMM
https://colab.research.google.com/github/zubair-nabi/applied-big-data/blob/main/Notebooks/Lecture6/Spark-GMM.ipynb
Clustering can reveal hidden groups, but beware of the curse of high dimensionality!
The Curse of High Dimensionality
Curse of Dimensionality: In high-dimensional spaces, the distance between points becomes less meaningful, making it difficult for K-Means to accurately cluster similar data points. As dimensions increase, the data becomes sparser, and clusters may appear indistinguishable
Noise and Irrelevant Features: High-dimensional datasets often contain irrelevant or redundant features that can confuse the clustering process
Increased Computational Complexity: The computational cost of K-Means grows with the number of dimensions. High-dimensional datasets require more calculations for distance metrics and centroid updates, which can lead to longer processing times and resource consumption
Visualization and Interpretability: Visualizing clusters in high-dimensional spaces is inherently challenging, making it difficult to assess the clustering outcomes
The Curse of High Dimensionality (2)
The dataset contains 10 different features related to customer behavior (e.g., Total_Spending, Number_of_Transactions, etc.). As the dimensionality increases, the relationships between these features become more complex and harder to visualize
As dimensionality increases, the distances between points become less informative. This can lead to misclassifications where points that seem close in the selected dimensions (in the 3D plots) are actually distant when considering all dimensions. The clustering results might not accurately reflect the true data distribution, highlighting K-Means' limitations in high-dimensional contexts
https://colab.research.google.com/github/zubair-nabi/applied-big-data/blob/main/Notebooks/Lecture6/KMeans-HighDimensionality.ipynb
Enter Principal Component Analysis
A statistical technique used for dimensionality reduction in large datasets
Helps to simplify complex data while retaining its essential characteristics
PCA transforms high-dimensional data into a lower-dimensional space, enabling easier interpretation and insights
Principal Component Analysis
Definition: In the context of Principal Component Analysis (PCA), principal components refer to the new variables generated from the original dataset that capture the most significant patterns of variance in the data. Variance is a statistical measure of the dispersion or spread of a set of data points around their mean; it quantifies how much the data points differ from the mean value of the dataset
Dimensionality Reduction: By transforming the original features into principal components, PCA reduces the number of dimensions while retaining the most important information, allowing for easier visualization and analysis of the data. PCA prioritizes the components with the highest variance: by retaining only the principal components that explain a significant proportion of the total variance, the data can be represented with far fewer features while losing little information
Principal Component Analysis (2)
Variance Capture: The first principal component captures the maximum variance in the data. The second principal component captures the maximum variance orthogonal to the first, and so on. This means:
The first principal component accounts for the largest proportion of variance
The second principal component accounts for the second largest proportion of variance, and is uncorrelated with the first
This process continues for as many principal components as there are original dimensions (up to the minimum of the number of observations or features)
Interpretation: Each principal component can be interpreted as a new feature that summarizes the information in the original features. They help in understanding the structure and relationships in the data, even if the original features are high-dimensional or complex. High variance along a particular direction indicates that the data points are widely spread out along that direction, suggesting that this direction contains significant information about the structure of the data. Conversely, low variance indicates that the data points are closely clustered around the mean along that direction, suggesting that it may not be as informative
Principal Component Analysis (3)
Variance Ratio: Each principal component has an associated explained variance ratio, which indicates the proportion of the dataset's total variance that is captured by that component. This helps in understanding how many components are needed to adequately represent the original data
Number of Principal Components: When you perform PCA, you can choose to keep only a subset of the principal components based on their explained variance. For example, you might decide to keep only the first 2 or 3 principal components if they explain a significant portion of the total variance (e.g., 90% or 95%)
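A minimal scikit-learn sketch of choosing the number of components from the explained variance ratio; X10 is an illustrative stand-in for the 10 behavioral features mentioned later in this lecture's notebook.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X10 = np.random.rand(1000, 10)   # stand-in for features such as Total_Spending, Number_of_Transactions

X_std = StandardScaler().fit_transform(X10)
pca = PCA().fit(X_std)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Keep the smallest number of components that explains, say, 95% of the total variance
n_components = int(np.searchsorted(cumulative, 0.95) + 1)
X_reduced = PCA(n_components=n_components).fit_transform(X_std)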
Principal Component Analysis Algorithm
Input: Dataset X (n samples, m features)
Standardize the data: for each feature (column) in X, subtract its mean and divide by its standard deviation so that all features are on a comparable scale
Calculate the covariance matrix C from the standardized dataset:
C = (1 / (n - 1)) * (X^T * X)
(where X^T is the transpose of X)
A covariance matrix indicates how much the variables change together, revealing their relationships and variances
Calculate the eigenvalues (λ) and eigenvectors (v) of the covariance matrix C:
- Solve the equation: C * v = λ * v
Eigenvector: Arrows that point in the directions along which the data varies the most
Eigenvalue: How much variance is present in the data along the directions specified by the eigenvectors
Principal Component Analysis Algorithm (2)
Input: Dataset X (n samples, m features)
Sort the eigenvalues in descending order and rearrange the corresponding eigenvectors accordingly
This enables the prioritization of the components that capture the most variance in the data
Choose the first k eigenvectors corresponding to the top k eigenvalues. These eigenvectors are the principal components
Create a new dataset Y by projecting the standardized dataset X onto the selected principal components:
Y = X * V_k
(where V_k is the matrix of the top k eigenvectors)
This essentially "translates" the variance structure of the original dataset into a new coordinate system
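The steps above can be condensed into a short from-scratch NumPy sketch (an illustration of the algorithm, not the notebook's code).

import numpy as np

def pca(X, k):
    # Standardize each feature (zero mean, unit variance)
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    n = X_std.shape[0]
    # Covariance matrix C = (1 / (n - 1)) * X^T X
    C = (X_std.T @ X_std) / (n - 1)
    # Eigen-decomposition (eigh is appropriate because C is symmetric)
    eigenvalues, eigenvectors = np.linalg.eigh(C)
    # Sort eigenvalues and eigenvectors in descending order of eigenvalue
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    # Keep the top k eigenvectors (principal components) and project: Y = X * V_k
    V_k = eigenvectors[:, :k]
    Y = X_std @ V_k
    explained_ratio = eigenvalues[:k] / eigenvalues.sum()
    return Y, explained_ratio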
Principal Component Analysis Algorithm (3)
This code applies PCA to reduce a 10-dimensional e-commerce dataset to 6 principal components
The data is first standardized, then transformed into a lower-dimensional space
https://colab.research.google.com/github/zubair-nabi/applied-big-data/blob/main/Notebooks/Lecture6/PCA-HighDimensionality.ipynb
Spark Implementation of PCA
1. Divide the Dataset: Each partition gets processed in parallel
2. Standardize Each Partition: For each partition, compute the feature means and standard deviations to standardize the data. Then combine these statistics (e.g., using weighted averages) to get global standardization parameters
3. Compute Local Covariance Matrices: Calculate the covariance matrix for each partition, then aggregate them (typically via a weighted sum) to form the global covariance matrix
4. Compute Eigenvalues and Eigenvectors: Compute the eigen-decomposition of the global covariance matrix
5. Sort, Select, and Project: Sort the eigenvalues in descending order, select the top k eigenvectors, and project the standardized dataset onto these principal components to obtain the transformed dataset
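A hedged sketch of the corresponding Spark ML calls, assuming a DataFrame named features with an assembled vector column called "features" (as in the earlier Spark examples).

from pyspark.ml.feature import PCA as SparkPCA, StandardScaler

# Standardize the assembled feature vector first
scaler_model = StandardScaler(inputCol="features", outputCol="scaled",
                              withMean=True, withStd=True).fit(features)
scaled = scaler_model.transform(features)

pca_model = SparkPCA(k=2, inputCol="scaled", outputCol="pca_features").fit(scaled)
projected = pca_model.transform(scaled)
print(pca_model.explainedVariance)   # variance ratio captured by each principal component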
Spark Principal Component Analysis Algorithm
Let’s reimplement our PCA example using Spark
https://colab.research.google.com/github/zubair-nabi/applied-big-data/blob/main/Notebooks/Lecture6/PCA-Spark.ipynb
Other than customer segmentation, another major use case in E-commerce is identifying co-occurrences
That is, which items or events frequently occur together. This is crucial for understanding relationships within the data that may not be immediately apparent
Representative Dataset: Transactions
transaction_id | customer_id | transaction_date | item_id | quantity |
1 | 1001 | 2024-01-01 | 1 | 2 |
1 | 1001 | 2024-01-01 | 2 | 1 |
1 | 1001 | 2024-01-01 | 5 | 3 |
2 | 1002 | 2024-01-02 | 3 | 1 |
2 | 1002 | 2024-01-02 | 4 | 1 |
3 | 1001 | 2024-01-03 | 1 | 1 |
3 | 1001 | 2024-01-03 | 6 | 2 |
4 | 1003 | 2024-01-04 | 5 | 5 |
4 | 1003 | 2024-01-04 | 7 | 1 |
5 | 1002 | 2024-01-05 | 2 | 1 |
5 | 1002 | 2024-01-05 | 8 | 4 |
Representative Dataset: Items
item_id | item_name | category | price | quantity |
1 | Milk | Dairy | 1.5 | 2 |
2 | Bread | Bakery | 2 | 1 |
3 | Eggs | Dairy | 3 | 3 |
4 | Butter | Dairy | 2.5 | 1 |
5 | Apples | Produce | 0.75 | 1 |
6 | Chicken | Meat | 5 | 1 |
7 | Cereal | Grocery | 4 | 2 |
8 | Yogurt | Dairy | 1.25 | 5 |
Let’s try SQL again
SELECT
i1.item_name AS item1,
i2.item_name AS item2,
COUNT(*) AS frequency
FROM Transactions t
JOIN Items i1 ON t.item_id = i1.item_id
JOIN Transactions t2 ON t.transaction_id = t2.transaction_id
JOIN Items i2 ON t2.item_id = i2.item_id
WHERE i1.item_id < i2.item_id
GROUP BY item1, item2
HAVING COUNT(*) >= 10
ORDER BY frequency DESC;
We already know that SQL is not going to scale…
Enter Unsupervised Frequent Pattern Mining
Frequent itemset mining works by analyzing transactional datasets to identify itemsets that appear together in transactions more frequently than a specified threshold
It is widely applied in market basket analysis, helping businesses understand customer purchasing patterns and optimize product placement and promotions
Two Main Algorithms used for Frequent Pattern Mining
| | Apriori | FP-Growth |
| Methodology | Iteratively identifies frequent itemsets by scanning the dataset multiple times | Constructs a compact data structure called the FP-tree, which retains the essential information from the dataset |
| Performance | While effective for smaller datasets, Apriori's performance can degrade significantly with larger datasets | More efficient for large datasets as it requires fewer passes over the data |
| Memory Usage | Requires more memory as it stores all candidate itemsets explicitly | More memory-efficient due to the use of the FP-tree structure, which summarizes the dataset in a compact form |
| Use Cases | Effective for smaller, less complex datasets | Best suited for large, complex datasets where performance is critical |
FP-Growth Algorithm
FP-Tree Construction
Scan the Database: The algorithm scans the database to determine the frequency of each item. It creates a list of frequent items based on a user-defined minimum support threshold
Sort and Filter: The frequent items are sorted in descending order of their frequency
Build the FP-Tree: A compact data structure called the FP-tree is constructed. The FP-tree is a prefix tree that stores the frequent itemsets in a hierarchical manner. Each transaction is represented in the tree by adding its frequent items, creating paths that represent transactions
FP-Tree Mining
Conditional Pattern Base: For each item in the FP-tree, the algorithm constructs a conditional pattern base, which is a set of prefix paths in the tree that lead to that item
Build Conditional FP-Tree: From the conditional pattern base, a conditional FP-tree is built for each item. This tree represents all the transactions that contain the item and allows for mining the frequent itemsets that include that item
Recursion: The algorithm recursively mines the conditional FP-trees until no further frequent itemsets can be found
Let’s recap Prefix Tree from Lecture 3
Prefix Tree (Trie): Used to store a dynamic set of strings where common prefixes are shared, allowing for efficient retrieval, insertion, and prefix-based search operations
FP-Tree Construction Example
Minimum support threshold: 0.5
With 3 transactions, an item must appear in at least 2 transactions
Transaction ID | Items |
1 | A, B |
2 | B, C, D |
3 | A, C, D, E |
https://colab.research.google.com/github/zubair-nabi/applied-big-data/blob/main/Notebooks/Lecture6/FPGrowth-Steps.ipynb
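A tiny Python sketch of the first construction steps for this exact example: counting item supports, dropping infrequent items, and reordering transactions before they are inserted into the FP-tree.

from collections import Counter

transactions = [["A", "B"], ["B", "C", "D"], ["A", "C", "D", "E"]]
min_support = 0.5
min_count = min_support * len(transactions)   # 1.5, so an item must appear in at least 2 transactions

counts = Counter(item for t in transactions for item in set(t))
# Counter({'A': 2, 'B': 2, 'C': 2, 'D': 2, 'E': 1}) -> E is dropped as infrequent
frequent = {item: c for item, c in counts.items() if c >= min_count}

# Re-order each transaction by descending item frequency before FP-tree insertion
ordered = [sorted((i for i in t if i in frequent), key=lambda i: (-frequent[i], i))
           for t in transactions]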
FP-Growth Key Concepts
Antecedents: Items that are considered the "cause" or the condition under which the rule applies
Consequents: The items that are likely to be purchased as a result of the antecedents
Support: The proportion of transactions that contain the itemset (both antecedents and consequents)
Confidence: How often the consequent is purchased when the antecedent is purchased
Lift: How much more likely the consequent is purchased when the antecedent is purchased compared to when the antecedent is not purchased
FP-Growth Key Concepts (2)
When a customer buys Milk and Bread (antecedents), there is an 80% chance (confidence of 0.80) that they will also buy Butter
This rule applies to 5% of all transactions (support of 0.05)
The lift of 1.25 indicates that buying Milk and Bread increases the likelihood of buying Butter by 25% compared to the average purchase behavior
Antecedents | Consequents | Support | Confidence | Lift |
{Milk, Bread} | {Butter} | 0.05 | 0.8 | 1.25 |
FP-Growth Implementation
The code generates 1000 random transactions and encodes data using One-Hot Encoding
FP-Growth is used to generate association rules from the frequent itemsets with antecedents, consequents, support, confidence, and lift
https://colab.research.google.com/github/zubair-nabi/applied-big-data/blob/main/Notebooks/Lecture6/FPGrowth.ipynb
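A hedged sketch of this workflow using the mlxtend library; the notebook may use a different library, and the toy transactions below are illustrative.

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, association_rules

transactions = [["Milk", "Bread"], ["Milk", "Bread", "Butter"], ["Eggs", "Bread"]]

# One-hot encode the transactions into a boolean item matrix
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Mine frequent itemsets, then derive association rules with confidence and lift
itemsets = fpgrowth(onehot, min_support=0.3, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])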
Parallel FP-Growth Algorithm
FP-Tree Construction
Scan the Database in Parallel: The algorithm scans the database in chunks to determine the frequency of each item. After processing all chunks, aggregate the frequency counts to identify the global frequencies of each item across the entire dataset
Sort and Filter: The frequent items are sorted in descending order of their frequency
Build the FP-Tree in Parallel: Create a local FP-tree for each chunk and then aggregate them into a single FP-tree that represents all the frequent itemsets across the entire dataset
FP-Tree Mining
Conditional Pattern Base: The global FP-tree is traversed, and frequent itemsets are extracted in a parallel manner
Build Conditional FP-Tree: Subsets of the FP-tree are processed independently to create conditional FP-trees for further mining
Recursion: The algorithm recursively mines the conditional FP-trees until no further frequent itemsets can be found
Spark FP-Growth Implementation
The code generates 1000 random transactions and converts them to a Spark dataframe
FP-Growth in Spark is used to generate association rules from the frequent itemsets in a distributed fashion, with antecedents, consequents, support, confidence, and lift
https://colab.research.google.com/github/zubair-nabi/applied-big-data/blob/main/Notebooks/Lecture6/FPGrowth-Spark.ipynb
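A hedged sketch of the Spark ML FPGrowth API, assuming an existing spark session and an illustrative basket-style DataFrame (one row per transaction with a list of items).

from pyspark.ml.fpm import FPGrowth

basket_df = spark.createDataFrame(
    [(1, ["Milk", "Bread"]), (2, ["Milk", "Bread", "Butter"]), (3, ["Eggs", "Bread"])],
    ["transaction_id", "items"],
)

fp = FPGrowth(itemsCol="items", minSupport=0.3, minConfidence=0.6)
fp_model = fp.fit(basket_df)

fp_model.freqItemsets.show()       # frequent itemsets with their counts
fp_model.associationRules.show()   # antecedent, consequent, confidence, lift, support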
Some more use cases…
Graph clustering is used in GraphRAG (Lecture 5)
PCA is used for image compression and recognition by reducing the dimensionality of image data while preserving key features
Genomic datasets often contain thousands of genes, resulting in a high-dimensional space where each dimension corresponds to a gene. PCA helps reduce the dimensionality of gene expression data
Frequent itemset mining analyzes clinical data to identify common combinations of symptoms or diagnoses that occur together
Summary
Unsupervised learning identifies patterns in unlabeled data, allowing for the discovery of hidden structures and relationships
K-means clustering partitions data into K distinct clusters based on feature similarity, making it efficient for analyzing large datasets
PCA (Principal Component Analysis) is a dimensionality reduction technique that transforms high-dimensional data into a lower-dimensional space while preserving variance
FP-Growth is a frequent pattern mining algorithm that efficiently discovers frequent itemsets in large transaction datasets
All three algorithms are well suited to big data because their implementations are easily parallelized and distributed
Recommended Reading
Chapters 6, 7, and 11 of “Mining of Massive Datasets” by Jure Leskovec, Anand Rajaraman, and Jeff Ullman
Spark ML guide: https://spark.apache.org/docs/latest/ml-guide.html