1 of 91

Applied

Big Data Analytics

Zubair Nabi

2 of 91

Lecture 6

Segmenting E-commerce Users

3 of 91

Our Big Data Analytics Stack

4 of 91

Our Big Data Analytics Stack: Today’s Lecture

5 of 91

Use Case for Today

User Segmentation in E-commerce

6 of 91

What is User Segmentation in E-commerce?

The process of dividing a customer base into distinct groups based on shared characteristics or behaviors for targeted marketing and engagement

E-commerce businesses use segmentation to tailor product recommendations, personalized offers, and marketing strategies, leading to improved customer retention, cross-selling, and targeting based on behavior or demographics

7 of 91

Some Key Concepts in E-commerce

Item/Product: The goods or services available for purchase, often listed with details such as price, description, and reviews

Cart: A virtual basket where consumers place items they intend to purchase before proceeding to checkout

Payment: The process by which consumers complete their purchase, using methods such as credit cards, digital wallets, or bank transfers

Transaction: The complete process of purchasing items, from adding them to the cart to making the payment

Rating/Review: Feedback from previous customers about a product or service, typically scored on a scale, which helps others decide whether to buy

8 of 91

Representative E-commerce Dataset

Transaction ID | Customer ID | Product Name | Category | Price | Quantity | Purchase Date | Payment Method | Shipping City
T001 | C001 | Wireless Earbuds | Electronics | 49.99 | 1 | 2024-09-01 | Credit Card | New York
T002 | C002 | Running Shoes | Sportswear | 79.99 | 2 | 2024-09-02 | PayPal | Los Angeles
T003 | C001 | Laptop Stand | Office | 29.99 | 1 | 2024-09-03 | Credit Card | New York
T004 | C003 | Smartwatch | Electronics | 199.99 | 1 | 2024-09-04 | Debit Card | Chicago
T005 | C004 | Yoga Mat | Sportswear | 24.99 | 1 | 2024-09-05 | Credit Card | San Francisco
T006 | C005 | LED Desk Lamp | Home & Decor | 19.99 | 3 | 2024-09-06 | PayPal | Boston
T007 | C006 | Blender | Kitchen | 59.99 | 1 | 2024-09-06 | Credit Card | Seattle
T008 | C007 | Smartphone Case | Electronics | 12.99 | 2 | 2024-09-07 | Debit Card | Austin
T009 | C003 | Noise-cancelling Headphones | Electronics | 89.99 | 1 | 2024-09-08 | Credit Card | Chicago
T010 | C008 | Water Bottle | Sportswear | 14.99 | 1 | 2024-09-09 | PayPal | Miami

9 of 91

What would be the simplest way of segmentation?

10 of 91

Answer:

11 of 91

Segment Users by Location

Transaction ID | Customer ID | Product Name | Category | Price | Quantity | Purchase Date | Payment Method | Shipping City
T001 | C001 | Wireless Earbuds | Electronics | 49.99 | 1 | 2024-09-01 | Credit Card | New York
T002 | C002 | Running Shoes | Sportswear | 79.99 | 2 | 2024-09-02 | PayPal | Los Angeles
T003 | C001 | Laptop Stand | Office | 29.99 | 1 | 2024-09-03 | Credit Card | New York
T004 | C003 | Smartwatch | Electronics | 199.99 | 1 | 2024-09-04 | Debit Card | Chicago
T005 | C004 | Yoga Mat | Sportswear | 24.99 | 1 | 2024-09-05 | Credit Card | San Francisco

SELECT Shipping_City, COUNT(DISTINCT Customer_ID) AS Num_Customers
FROM ecommerce_table
GROUP BY Shipping_City;

12 of 91

Segment Users by High-Spending

Transaction ID | Customer ID | Product Name | Category | Price | Quantity | Purchase Date | Payment Method | Shipping City
T001 | C001 | Wireless Earbuds | Electronics | 49.99 | 1 | 2024-09-01 | Credit Card | New York
T002 | C002 | Running Shoes | Sportswear | 79.99 | 2 | 2024-09-02 | PayPal | Los Angeles
T003 | C001 | Laptop Stand | Office | 29.99 | 1 | 2024-09-03 | Credit Card | New York
T004 | C003 | Smartwatch | Electronics | 199.99 | 1 | 2024-09-04 | Debit Card | Chicago
T005 | C004 | Yoga Mat | Sportswear | 24.99 | 1 | 2024-09-05 | Credit Card | San Francisco

SELECT Customer_ID, SUM(Price * Quantity)
FROM ecommerce_table
GROUP BY Customer_ID
HAVING SUM(Price * Quantity) > 200;

13 of 91

Segment Users by Product Category

Transaction ID | Customer ID | Product Name | Category | Price | Quantity | Purchase Date | Payment Method | Shipping City
T001 | C001 | Wireless Earbuds | Electronics | 49.99 | 1 | 2024-09-01 | Credit Card | New York
T002 | C002 | Running Shoes | Sportswear | 79.99 | 2 | 2024-09-02 | PayPal | Los Angeles
T003 | C001 | Laptop Stand | Office | 29.99 | 1 | 2024-09-03 | Credit Card | New York
T004 | C003 | Smartwatch | Electronics | 199.99 | 1 | 2024-09-04 | Debit Card | Chicago
T005 | C004 | Yoga Mat | Sportswear | 24.99 | 1 | 2024-09-05 | Credit Card | San Francisco

SELECT Customer_ID, Category, COUNT(Product_Name) AS Product_Purchase_Count
FROM ecommerce_table
GROUP BY Customer_ID, Category
ORDER BY Customer_ID, Product_Purchase_Count DESC;

14 of 91

But it is hard to scale segmentation using SQL…

Why?

15 of 91

Shortcomings of SQL for Segmentation

Limited to Explicit Criteria: SQL requires predefined conditions set by the user for segmentation. It cannot discover segments based on hidden patterns or relationships in the data, potentially missing insights

Static Analysis: SQL queries provide results based on the dataset at the time of execution. As user behaviors evolve, you must manually adjust queries, making it less responsive to changing data patterns

Challenges with High-Dimensional Data: SQL performs well with structured data but struggles with high-dimensional datasets involving multiple variables

16 of 91

What we need instead…

Uncover hidden patterns and insights that predefined criteria might miss, allowing for a more dynamic and comprehensive understanding of customer segments

Techniques that can analyze large datasets to automatically identify natural groupings based on user behavior and characteristics

17 of 91

Enter Unsupervised Learning

Unsupervised learning is a machine learning approach that analyzes data without predefined labels, enabling algorithms to discover hidden patterns and structures within the dataset

This technique is especially valuable for tasks such as clustering, where it groups similar data points based on their inherent characteristics

18 of 91

Supervised vs Unsupervised Learning

Aspect | Supervised Learning | Unsupervised Learning
Definition | Uses labeled data to train models | Analyzes data without predefined labels to discover patterns
Data Requirements | Requires a dataset with input-output pairs (features and labels) | Uses only input data without labels
Goal | Predict outcomes or classify data based on learned patterns | Identify hidden structures or groupings within the data
Common Algorithms | Linear regression, logistic regression, decision trees, support vector machines | K-means clustering, hierarchical clustering, principal component analysis (PCA)
Use Cases | Customer Churn Prediction, Fraud Detection, Dynamic Pricing, Sentiment Analysis | Customer Segmentation, Market Basket Analysis, Anomaly Detection, Behavioural Analysis

Supervised learning is the focus of Lectures 7, 8, and 9

19 of 91

Different Types of Unsupervised Learning Techniques

Clustering: Groups similar data points together based on their characteristics, enabling the discovery of inherent patterns in the data; examples include K-Means, hierarchical clustering, and DBSCAN

Association Rule Learning: Identifies interesting relationships and patterns among variables in large datasets, commonly used for market basket analysis to uncover product purchase correlations; examples include the Apriori algorithm and FP-Growth

Dimensionality Reduction: Reduces the number of features in a dataset while preserving essential information, facilitating easier analysis and visualization; a prominent example is Principal Component Analysis (PCA)

20 of 91

Let’s start with clustering

21 of 91

In this course, we will be focusing on K-Means, its variants, and the Gaussian Mixture Model (GMM) because they tend to scale well for large datasets

In comparison, DBSCAN and most hierarchical clustering algorithms are generally not as scalable, making them less suited for big data use cases where efficient parallel and distributed processing is essential

22 of 91

Clustering Scalability Issues

DBSCAN: DBSCAN doesn't scale well because it relies on expensive neighborhood queries and global connectivity checks, which often require O(n²) computations in the worst case

Hierarchical Clustering: Most hierarchical clustering algorithms scale poorly because they typically require computing and storing pairwise distances between data points, leading to quadratic time and memory complexity. Additionally, repeatedly merging or splitting clusters involves global operations that further increase computational costs, making these methods impractical for very large datasets

23 of 91

What is K-Means Clustering?

A partitional clustering algorithm that divides data into a specified number of clusters (K) based on distance to cluster centroids

It is efficient for large datasets but assumes spherical cluster shapes and can be sensitive to outliers

Its simplicity, speed, and broad applicability have made it a popular choice

24 of 91

K-Means Clustering Algorithm

1. Choose the number of clusters (k)

2. Initialize centroids by randomly selecting k data points

3. Assign each data point to the nearest centroid using a distance metric (e.g., Euclidean distance), forming k clusters

4. Update the centroids by calculating the mean of all points in each cluster

5. Repeat steps 3-4 until the centroids no longer change or a set number of iterations is reached
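These steps map almost line for line onto NumPy. A minimal sketch (illustrative only; it assumes a 2-D feature array X, such as transactions vs. spending per customer, and does not handle empty clusters):

import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Steps 1-2: pick k random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 3: assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop once the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Example usage on synthetic data
labels, centroids = kmeans(np.random.rand(1000, 2), k=3)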

25 of 91

K-Means Clustering Algorithm (2)

The notebook randomly generates the number of transactions and total spending for 1000 customers

Runs and visualizes the algorithm step by step

https://colab.research.google.com/github/zubair-nabi/applied-big-data/blob/main/Notebooks/Lecture6/KMeansClustering.ipynb

26 of 91

How do you Evaluate an Unsupervised Learning Technique like K-Means Clustering?

Measuring the performance of unsupervised learning algorithms, such as clustering, is inherently different from supervised learning since there are no true labels to compare against

Of course, we can use domain-specific outcome metrics such as customer segmentation effectiveness in marketing (e.g., revenue increase after targeting) or anomaly detection accuracy

But there are other options too

27 of 91

K-Means Clustering Evaluation Methods

Elbow Method: Looks for a point where adding more clusters yields diminishing returns in reducing within-cluster variance

Silhouette Score: Measures how well-separated the clusters are

28 of 91

K-Means Clustering Evaluation

Elbow Method

Definition: It measures the sum of squared distances (inertia) between data points and their nearest cluster center for different values of k (number of clusters)

Formula (Within-Cluster Sum of Squares): $\mathrm{WCSS} = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2$, where $\mu_j$ is the centroid of cluster $C_j$

Insight: The optimal k is at the elbow point, where increasing k further results in diminishing returns in terms of reducing inertia

Silhouette Score

Definition: It combines two factors: intra-cluster cohesion (how close points are to others in the same cluster) and inter-cluster separation (how far points are from points in other clusters)

Formula: $s(i) = \dfrac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$, where a(i) is the intra-cluster distance, i.e., the average distance between point i and all other points in the same cluster, and b(i) is the inter-cluster distance, i.e., the average distance between point i and points in the nearest (different) cluster

Insight: The optimal k is the one that maximizes the average silhouette score, indicating that the clusters are well-separated and compact

29 of 91

K-Means Clustering Evaluation (2)

Elbow Method

Advantages: Provides a good, intuitive indication of when adding more clusters no longer significantly improves the model

Shortcomings: The "elbow" is sometimes subjective and not always clearly visible

Silhouette Score

Advantages: Provides a quantitative score that balances cohesion and separation

Shortcomings: Computationally expensive for large datasets, as it requires calculating distances between all points in the dataset
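Both methods are straightforward to compute with scikit-learn. A minimal sketch on synthetic customer features (the data, range of k, and random seeds are illustrative assumptions):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
X = rng.random((1000, 2))  # e.g., scaled number of transactions vs. total spending

for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss = km.inertia_                      # within-cluster sum of squares (elbow method)
    sil = silhouette_score(X, km.labels_)   # average silhouette score
    print(f"k={k}  WCSS={wcss:.1f}  silhouette={sil:.3f}")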

30 of 91

K-Means Clustering Evaluation (3)

The notebook randomly generates the number of transactions and total spending for 1000 customers

Visualizes both Elbow Score and Silhouette Plot

https://colab.research.google.com/github/zubair-nabi/applied-big-data/blob/main/Notebooks/Lecture6/KMeansClustering-Evaluation.ipynb

31 of 91

K-Means has a few variants that improve its performance

32 of 91

K-Means Variants

K-Means++: Enhances the initialization process by selecting initial centroids so that they are more spread out across the data space

Mini-Batch K-Means: Processes data in small, random batches instead of the entire dataset at once

Bisecting K-Means: Works by recursively splitting clusters into two until the desired number of clusters is reached. This top-down divisive approach often improves clustering quality and scalability

33 of 91

K-Means Variants (2)

K-Means++: Reduces the likelihood of poor initialization, which can lead to suboptimal clustering. Faster convergence due to better starting points, often resulting in fewer iterations needed to reach optimal clustering

Mini-Batch K-Means: Significantly faster than standard K-Means, especially for large datasets, because it reduces the computation required for each iteration. Provides a good trade-off between convergence speed and clustering accuracy

Bisecting K-Means: Constructs a hierarchical clustering structure by recursively splitting clusters into two, leading to more balanced partitions. Efficiently targets high-variance clusters for splitting, making it scalable for distributed environments
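As a concrete illustration of the K-Means++ idea, here is a minimal NumPy sketch of the seeding step (illustrative only; library implementations such as scikit-learn and Spark handle this internally):

import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # First centroid: a uniformly random data point
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from each point to its nearest already-chosen centroid
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        # Pick the next centroid with probability proportional to that squared distance
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)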

34 of 91

But what if the data does not fit into the memory of a single server?

35 of 91

We will have to use a cluster of machines

36 of 91

How can we run K-Means in a parallel and distributed fashion?

37 of 91

Parallel K-Means++ Clustering Algorithm

  1. Choose the number of clusters (k)
  2. Initialize centroids using K-Means++ in parallel:
    1. Randomly select the first centroid from the data points
    2. For each remaining centroid, calculate the distance from each data point to the nearest existing centroid in parallel:
      1. Use a distance metric (e.g., Euclidean distance) to compute distances
      2. Select the next centroid with a probability proportional to the square of that distance
  3. For each data point, compute the distance to each centroid in parallel and assign the data point to the closest centroid, forming k clusters
  4. Calculate the mean of all points in each cluster. This can be done in parallel by aggregating the sums of points in each cluster and counting the number of points assigned to each cluster
  5. Repeat steps 3-4 until the centroids no longer change or a set number of iterations is reached

38 of 91

Parallel K-Means++ Clustering Algorithm (2)

Similar to vanilla K-Means, the notebook randomly generates the number of transactions and total spending for 1000 customers

It initializes centroids, assigns data points, and updates centroids in parallel

Runs and visualizes the algorithm step by step

https://colab.research.google.com/github/zubair-nabi/applied-big-data/blob/main/Notebooks/Lecture6/Parallel-KMeansPlusPlusClustering.ipynb

39 of 91

The example that we looked at is parallel but not distributed

How can we achieve distributed computation?

40 of 91


41 of 91

Spark Implementation of K-Means

We will be using Spark ML instead of MLlib

The former uses DataFrames, while the latter is lower-level and uses RDDs


42 of 91

Spark Implementation of K-Means (2)

1. Initialization: Initial centroids are chosen using a variant of K-Means++ (k-means||), which selects well-spread centroids in a scalable, distributed manner

2. Iterative Assignment and Update:

Assignment Step: Each data point is assigned to its nearest centroid in parallel

Update Step: Partial sums and counts are computed across partitions to calculate new centroids

These steps iterate until the centroids stabilize (or until a maximum iteration count is reached)
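A minimal Spark ML sketch of this flow (the DataFrame, column names, and k value are illustrative assumptions; the linked notebook follows the same pattern):

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("KMeansSegmentation").getOrCreate()

# Hypothetical per-customer aggregates: number of transactions and total spending
df = spark.createDataFrame(
    [(1, 5, 250.0), (2, 1, 40.0), (3, 9, 800.0),
     (4, 2, 120.0), (5, 7, 560.0), (6, 3, 95.0)],
    ["customer_id", "num_transactions", "total_spending"],
)

# Pack the numeric columns into a single feature vector, as Spark ML expects
assembler = VectorAssembler(
    inputCols=["num_transactions", "total_spending"], outputCol="features")
features = assembler.transform(df)

# Distributed K-Means: scalable initialization, then parallel assign/update iterations
kmeans = KMeans(k=3, seed=42, featuresCol="features", predictionCol="segment")
model = kmeans.fit(features)
model.transform(features).select("customer_id", "segment").show()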


43 of 91

Spark Implementation of K-Means (3)

Similar to earlier examples, the notebook randomly generates the number of transactions and total spending for 1000 customers

It uses the Spark ML implementation of K-Means

Runs and visualizes the output


https://colab.research.google.com/github/zubair-nabi/applied-big-data/blob/main/Notebooks/Lecture6/Spark-KMeans.ipynb

44 of 91

While standard K-Means partitions data into flat clusters, many datasets benefit from a hierarchical approach that recursively refines groupings for more balanced, insightful clusters

45 of 91

Enter Bisecting K-Means

Bisecting K-Means is a hierarchical clustering algorithm that recursively splits data into two clusters

It starts with all data in one cluster and divides the most heterogeneous clusters until the desired number of clusters is reached

This approach often yields more balanced, scalable, and refined clusters for large datasets

46 of 91

Bisecting K-Means Clustering Algorithm

1. Initialization: Start with one cluster that contains all the data points

2. Select a Cluster to Split: Identify the cluster with the highest heterogeneity (e.g., the largest sum of squared distances) or choose one at random

3. Bisection: Run standard K-Means with k=2 on the selected cluster, possibly repeating several times to find the best split

4. Replace Cluster: Replace the chosen cluster with the two new clusters from the split

5. Repeat: Repeat steps 2, 3, and 4 until the desired number of clusters is reached
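A compact sketch of this loop, using scikit-learn's KMeans as the k=2 subroutine (illustrative only; it always splits the cluster with the largest sum of squared distances and assumes every selected cluster has at least two points):

import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, n_clusters, seed=0):
    clusters = [np.arange(len(X))]          # start with one cluster holding every point
    while len(clusters) < n_clusters:
        # Pick the cluster with the highest SSE (heterogeneity) to split
        sse = [((X[idx] - X[idx].mean(axis=0)) ** 2).sum() for idx in clusters]
        target = clusters.pop(int(np.argmax(sse)))
        # Bisect it with a standard 2-means run
        labels = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(X[target])
        clusters.append(target[labels == 0])
        clusters.append(target[labels == 1])
    return clusters                          # list of index arrays, one per cluster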

47 of 91

Bisecting K-Means Clustering Algorithm (2)

Similar to vanilla K-Means, the notebook randomly generates the number of transactions and total spending for 1000 customers

Runs and visualizes the Bisecting algorithm step by step

https://colab.research.google.com/github/zubair-nabi/applied-big-data/blob/main/Notebooks/Lecture6/BisectingKMeansClustering.ipynb

48 of 91

Spark Implementation of Bisecting K-Means

1. Select Eligible Clusters: Identify clusters at the current hierarchical level that are eligible for splitting (e.g., those with high SSE or above a minimum size)

2. Group Clusters for Splitting: Group all eligible clusters together so that they can be processed simultaneously

3. Parallel k=2 Splitting: Execute the K-Means k=2 subroutine on each grouped cluster in parallel using Spark’s distributed RDD framework

4. Aggregate and Update: Aggregate the results from the parallel splits, evaluate each split, and update the cluster hierarchy by replacing the split clusters with the new sub-clusters

5. Iterate Until Completion: Repeat the above steps until the desired number of leaf clusters is reached or no further eligible clusters remain
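Spark ML exposes this algorithm directly. A minimal sketch, reusing the hypothetical features DataFrame assembled in the earlier K-Means example:

from pyspark.ml.clustering import BisectingKMeans

# Hierarchical, top-down clustering: repeatedly bisects clusters until k leaves remain
bkm = BisectingKMeans(k=3, seed=42, featuresCol="features", predictionCol="segment")
bkm_model = bkm.fit(features)

bkm_model.transform(features).select("customer_id", "segment").show()
print(bkm_model.clusterCenters())   # centroids of the final leaf clusters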

49 of 91

Spark Implementation of Bisecting K-Means (2)

Continuing with the same running example but using Bisecting K-Means


https://colab.research.google.com/github/zubair-nabi/applied-big-data/blob/main/Notebooks/Lecture6/Spark-BisectingKMeans.ipynb

50 of 91

K-Means faces two key limitations in real-world applications:

  1. It makes hard assignments, meaning each point is assigned to only one cluster
  2. It assumes clusters are spherical, which doesn't capture more complex, irregular cluster shapes

51 of 91

Let’s look at two examples:

In e-commerce, a customer might frequently purchase premium products yet occasionally hunt for deals, so a single hard cluster assignment cannot capture this mixed behavior

The purchasing behavior of seasonal shoppers is highly concentrated during holiday seasons, with sporadic activity during the rest of the year. This creates a cluster that's elongated along the "seasonality" dimension rather than forming a neat, spherical group

52 of 91

Enter Gaussian Mixture Model (GMM)

Gaussian Mixture Models (GMMs) model data as a mixture of Gaussian distributions

They assign each point a probability of belonging to each cluster, providing soft assignments

GMMs can capture complex, non-spherical cluster shapes by estimating unique covariances for each cluster

53 of 91

Key Properties of Gaussian Mixture Models

EM Algorithm: GMMs are typically optimized using the Expectation-Maximization (EM) algorithm, which iteratively refines the estimates of the Gaussian parameters and the probability of each point belonging to each cluster

Covariance Types: GMMs allow for different covariance structures (e.g., full, diagonal, spherical), giving flexibility in modeling the shape and orientation of clusters

Sensitivity to Initialization: Like many iterative methods, the outcome of GMMs can depend on the initial parameter settings, so multiple runs or careful initialization methods are often needed

54 of 91

Gaussian Mixture Model Algorithm

1. Initialization: Start with initial guesses for the parameters: mixing coefficients, means, and covariance matrices of the Gaussian components

2. Expectation (E) Step: For each data point, calculate the probability that it belongs to each Gaussian component based on the current parameters

3. Maximization (M) Step: Update the parameters (mixing coefficients, means, and covariances) by maximizing the expected likelihood computed in the E-step

4. Iteration: Repeat the E and M steps until the parameters converge or the changes become negligible
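A minimal scikit-learn sketch of these EM steps in practice (the synthetic data and number of components are illustrative assumptions):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)
X = rng.random((1000, 2))  # e.g., scaled number of transactions vs. total spending

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=42)
gmm.fit(X)                         # EM: alternating E and M steps until convergence

hard_labels = gmm.predict(X)       # most likely component per customer
soft_probs = gmm.predict_proba(X)  # probability of belonging to each component
print(soft_probs[:5].round(3))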

55 of 91

How to Evaluate a Gaussian Mixture Model?

Log-Likelihood (LL): Quantifies how probable the observed data is given the model's parameters. A higher log-likelihood means the model's parameters better explain the observed data. Use it as a baseline to assess overall model fit, especially when comparing models of equal complexity

Akaike Information Criterion (AIC): Balances model fit and complexity by penalizing the number of parameters to prevent overfitting. Rewards models that achieve high LL while penalizing those with excessive complexity; a lower AIC indicates a better model. Ideal for model selection when moderate dataset sizes are involved and slight complexity is acceptable if it significantly improves the fit

Bayesian Information Criterion (BIC): Similar to AIC but imposes a stricter penalty for model complexity, especially effective for large datasets. Strongly penalizes complex models; a lower BIC value indicates a model that balances fit and simplicity, reducing the risk of overfitting. Particularly useful when working with large datasets

56 of 91

Log-Likelihood (LL)

$$\mathrm{LL} = \sum_{i=1}^{n} \log\!\left( \sum_{j=1}^{k} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j) \right)$$

n: number of data points

k: number of Gaussian components

π_j: mixing coefficient for component j

N(x_i | μ_j, Σ_j): probability density of data point x_i under the Gaussian with mean μ_j and covariance Σ_j

Measures the overall probability of the observed data under the model, summing the log of weighted Gaussian probabilities for each data point

57 of 91

Akaike Information Criterion (AIC)

$$\mathrm{AIC} = 2k - 2\,\mathrm{LL}$$

k: number of estimated model parameters (which grows with the number of Gaussian components)

LL: log-likelihood of the model

Balances model fit and complexity by penalizing the total number of parameters, helping to select models that achieve good likelihood with fewer parameters

58 of 91

Bayesian Information Criterion (BIC)

$$\mathrm{BIC} = k \ln(n) - 2\,\mathrm{LL}$$

n: number of data points

k: number of estimated model parameters (which grows with the number of Gaussian components)

LL: log-likelihood of the model

Similar to AIC but with a stronger penalty based on the sample size, favoring simpler models to reduce the risk of overfitting, especially in large datasets
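With scikit-learn, all three criteria can be read directly off a fitted model, which makes it easy to compare different numbers of components (a sketch reusing the synthetic feature matrix X from the earlier GMM example):

from sklearn.mixture import GaussianMixture

for k in range(1, 7):
    gmm = GaussianMixture(n_components=k, random_state=42).fit(X)
    ll = gmm.score(X) * len(X)   # score() returns the average log-likelihood per sample
    print(f"k={k}  LL={ll:.1f}  AIC={gmm.aic(X):.1f}  BIC={gmm.bic(X):.1f}")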

59 of 91

Gaussian Mixture Model Algorithm

Let’s now apply GMM to our use case

https://colab.research.google.com/github/zubair-nabi/applied-big-data/blob/main/Notebooks/Lecture6/GMM.ipynb

60 of 91

Spark Implementation of Gaussian Mixture

1. Data Partitioning & Broadcasting: The dataset is split across multiple partitions, and current model parameters are broadcast to all worker nodes

2. Local E-Step: Each worker computes the soft assignments and partial statistics (e.g., weighted sums, log-likelihood contributions) for its data

3. Global Aggregation & M-Step: Partial results are combined (reduced) across partitions to update the model parameters, which are then broadcast for the next iteration

4. Iteration: Repeat the E and M steps until the parameters converge or the changes become negligible
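A minimal PySpark sketch, reusing the hypothetical features DataFrame assembled in the earlier K-Means example:

from pyspark.ml.clustering import GaussianMixture

gmm = GaussianMixture(k=3, seed=42, featuresCol="features")
model = gmm.fit(features)

# Soft assignments: "probability" holds the per-component membership probabilities
model.transform(features).select("customer_id", "prediction", "probability").show(truncate=False)

# Per-component parameters (mean vector and covariance matrix)
model.gaussiansDF.show(truncate=False)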

61 of 91

Spark Implementation of Gaussian Mixture (2)

Continuing with the same running example but using Spark’s implementation of GMM


https://colab.research.google.com/github/zubair-nabi/applied-big-data/blob/main/Notebooks/Lecture6/Spark-GMM.ipynb

62 of 91

Clustering can reveal hidden groups, but beware of the curse of high dimensionality!

63 of 91

The Curse of High Dimensionality

Curse of Dimensionality: In high-dimensional spaces, the distance between points becomes less meaningful, making it difficult for K-Means to accurately cluster similar data points. As dimensions increase, the data becomes sparser, and clusters may appear indistinguishable

Noise and Irrelevant Features: High-dimensional datasets often contain irrelevant or redundant features that can confuse the clustering process

Increased Computational Complexity: The computational cost of K-Means grows with the number of dimensions. High-dimensional datasets require more calculations for distance metrics and centroid updates, which can lead to longer processing times and resource consumption

Visualization and Interpretability: Visualizing clusters in high-dimensional spaces is inherently challenging, making it difficult to assess the clustering outcomes

64 of 91

The Curse of High Dimensionality (2)

The dataset contains 10 different features related to customer behavior (e.g., Total_Spending, Number_of_Transactions, etc.). As the dimensionality increases, the relationships between these features become more complex and harder to visualize

As dimensionality increases, the distances between points become less informative. This can lead to misclassifications where points that seem close in the selected dimensions (in the 3D plots) are actually distant when considering all dimensions. The clustering results might not accurately reflect the true data distribution, highlighting K-Means' limitations in high-dimensional contexts

https://colab.research.google.com/github/zubair-nabi/applied-big-data/blob/main/Notebooks/Lecture6/KMeans-HighDimensionality.ipynb

65 of 91

Enter Principal Component Analysis

A statistical technique used for dimensionality reduction in large datasets

Helps to simplify complex data while retaining its essential characteristics

PCA transforms high-dimensional data into a lower-dimensional space, enabling easier interpretation and insights

66 of 91

Principal Component Analysis

Definition: In the context of Principal Component Analysis (PCA), principal components are new variables derived from the original dataset that capture the most significant patterns of variance in the data. Variance is a statistical measure of the dispersion or spread of a set of data points around their mean; it quantifies how much the data points differ from the mean value of the dataset

Dimensionality Reduction: By transforming the original features into principal components, PCA reduces the number of dimensions while retaining the most important information, which allows for easier visualization and analysis of the data. PCA prioritizes the components with the highest variance, so retaining only the principal components that explain a significant proportion of the total variance preserves most of the dataset's structure

67 of 91

Principal Component Analysis (2)

Variance Capture: The first principal component captures the maximum variance in the data. The second principal component captures the maximum variance orthogonal to the first, and so on. This means:

The first principal component accounts for the largest proportion of variance

The second principal component accounts for the second largest proportion of variance, and is uncorrelated with the first

This process continues for as many principal components as there are original dimensions (up to the minimum of the number of observations or features)

Interpretation: Each principal component can be interpreted as a new feature that summarizes the information in the original features. They help in understanding the structure and relationships in the data, even if the original features are high-dimensional or complex. High variance along a particular direction indicates that the data points are widely spread out along that direction, suggesting that this direction contains significant information about the structure of the data. Conversely, low variance indicates that the data points are closely clustered around the mean along that direction, suggesting that it may not be as informative

68 of 91

Principal Component Analysis (3)

Variance Ratio: Each principal component has an associated explained variance ratio, which indicates the proportion of the dataset's total variance that is captured by that component. This helps in understanding how many components are needed to adequately represent the original data

Number of Principal Components: When you perform PCA, you can choose to keep only a subset of the principal components based on their explained variance. For example, you might decide to keep only the first 2 or 3 principal components if they explain a significant portion of the total variance (e.g., 90% or 95%)

69 of 91

Principal Component Analysis Algorithm

Input: Dataset X (n samples, m features)

1. Standardize the data. For each feature (column) in X:

  1. Calculate the mean of the feature
  2. Subtract the mean from each value in the feature
  3. Divide by the standard deviation to obtain z-scores

2. Calculate the covariance matrix C from the standardized dataset:

C = (1 / (n - 1)) * (X^T * X)

(where X^T is the transpose of X)

A covariance matrix indicates how much the variables change together, revealing their relationships and variances

3. Calculate the eigenvalues (λ) and eigenvectors (v) of the covariance matrix C:

- Solve the equation: C * v = λ * v

Eigenvectors: arrows that point in the directions along which the data varies the most

Eigenvalues: how much variance is present in the data along the directions specified by the eigenvectors

70 of 91

Principal Component Analysis Algorithm (2)

Input: Dataset X (n samples, m features)

4. Sort the eigenvalues in descending order and rearrange the corresponding eigenvectors accordingly. This enables the prioritization of the components that capture the most variance in the data

5. Choose the first k eigenvectors corresponding to the top k eigenvalues. These eigenvectors are the principal components

6. Create a new dataset Y by projecting the standardized dataset X onto the selected principal components:

Y = X * V_k

(where V_k is the matrix of the top k eigenvectors)

This essentially "translates" the variance structure of the original dataset into a new coordinate system

71 of 91

Principal Component Analysis Algorithm (3)

This code applies PCA to reduce a 10-dimensional e-commerce dataset to 6 principal components

The data is first standardized, then transformed into a lower-dimensional space

https://colab.research.google.com/github/zubair-nabi/applied-big-data/blob/main/Notebooks/Lecture6/PCA-HighDimensionality.ipynb

72 of 91

Spark Implementation of PCA

1. Divide the Dataset: Each partition gets processed in parallel

2. Standardize Each Partition: For each partition, compute the feature means and standard deviations to standardize the data. Then combine these statistics (e.g., using weighted averages) to get global standardization parameters

3. Compute Local Covariance Matrices: Calculate the covariance matrix for each partition, then aggregate them (typically via a weighted sum) to form the global covariance matrix

4. Compute Eigenvalues and Eigenvectors: Compute the eigen-decomposition of the global covariance matrix

5. Sort, Select, and Project: Sort the eigenvalues in descending order, select the top k eigenvectors, and project the standardized dataset onto these principal components to obtain the transformed dataset
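A minimal Spark ML sketch of these steps, reusing the hypothetical features DataFrame assembled in the earlier K-Means example (Spark's StandardScaler and PCA estimators handle the distributed statistics internally):

from pyspark.ml.feature import StandardScaler, PCA

# Standardize the assembled feature vector
scaler = StandardScaler(inputCol="features", outputCol="scaled", withMean=True, withStd=True)
scaled = scaler.fit(features).transform(features)

# Project onto the top-2 principal components
pca = PCA(k=2, inputCol="scaled", outputCol="pca_features")
pca_model = pca.fit(scaled)

print(pca_model.explainedVariance)   # variance ratio captured by each component
pca_model.transform(scaled).select("pca_features").show(truncate=False)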

73 of 91

Spark Principal Component Analysis Algorithm

Let’s reimplement our PCA example using Spark

https://colab.research.google.com/github/zubair-nabi/applied-big-data/blob/main/Notebooks/Lecture6/PCA-Spark.ipynb

74 of 91

Other than customer segmentation, another major use case in e-commerce is identifying co-occurrences

Co-occurrence analysis reveals which items or events frequently occur together, which is crucial for understanding relationships within the data that may not be immediately apparent

75 of 91

Representative Dataset: Transactions

transaction_id | customer_id | transaction_date | item_id | quantity
1 | 1001 | 2024-01-01 | 1 | 2
1 | 1001 | 2024-01-01 | 2 | 1
1 | 1001 | 2024-01-01 | 5 | 3
2 | 1002 | 2024-01-02 | 3 | 1
2 | 1002 | 2024-01-02 | 4 | 1
3 | 1001 | 2024-01-03 | 1 | 1
3 | 1001 | 2024-01-03 | 6 | 2
4 | 1003 | 2024-01-04 | 5 | 5
4 | 1003 | 2024-01-04 | 7 | 1
5 | 1002 | 2024-01-05 | 2 | 1
5 | 1002 | 2024-01-05 | 8 | 4

76 of 91

Representative Dataset: Items

item_id | item_name | category | price | quantity
1 | Milk | Dairy | 1.5 | 2
2 | Bread | Bakery | 2 | 1
3 | Eggs | Dairy | 3 | 3
4 | Butter | Dairy | 2.5 | 1
5 | Apples | Produce | 0.75 | 1
6 | Chicken | Meat | 5 | 1
7 | Cereal | Grocery | 4 | 2
8 | Yogurt | Dairy | 1.25 | 5

77 of 91

Let’s try SQL again

SELECT
  i1.item_name AS item1,
  i2.item_name AS item2,
  COUNT(*) AS frequency
FROM Transactions t
JOIN Items i1 ON t.item_id = i1.item_id
JOIN Transactions t2 ON t.transaction_id = t2.transaction_id
JOIN Items i2 ON t2.item_id = i2.item_id
WHERE i1.item_id < i2.item_id
GROUP BY i1.item_name, i2.item_name
HAVING COUNT(*) >= 10
ORDER BY frequency DESC;

78 of 91

We already know that SQL is not going to scale…

79 of 91

Enter Unsupervised Frequent Pattern Mining

Frequent item mining works by analyzing transactional datasets to identify itemsets that appear together in transactions more frequently than a specified threshold

It is widely applied in market basket analysis, helping businesses understand customer purchasing patterns and optimize product placement and promotions

80 of 91

Two Main Algorithms used for Frequent Pattern Mining

Aspect | Apriori | FP-Growth
Methodology | Iteratively identifies frequent itemsets by scanning the dataset multiple times | Constructs a compact data structure called the FP-tree, which retains the essential information from the dataset
Performance | While effective for smaller datasets, Apriori's performance can degrade significantly with larger datasets | More efficient for large datasets as it requires fewer passes over the data
Memory Usage | Requires more memory as it stores all candidate itemsets explicitly | More memory-efficient due to the use of the FP-tree structure, which summarizes the dataset in a compact form
Use Cases | Effective for smaller, less complex datasets | Best suited for large, complex datasets where performance is critical

81 of 91

FP-Growth Algorithm

FP-Tree Construction

Scan the Database: The algorithm scans the database to determine the frequency of each item. It creates a list of frequent items based on a user-defined minimum support threshold

Sort and Filter: The frequent items are sorted in descending order of their frequency

Build the FP-Tree: A compact data structure called the FP-tree is constructed. The FP-tree is a prefix tree that stores the frequent itemsets in a hierarchical manner. Each transaction is represented in the tree by adding its frequent items, creating paths that represent transactions

FP-Tree Mining

Conditional Pattern Base: For each item in the FP-tree, the algorithm constructs a conditional pattern base, which is a set of prefix paths in the tree that lead to that item

Build Conditional FP-Tree: From the conditional pattern base, a conditional FP-tree is built for each item. This tree represents all the transactions that contain the item and allows for mining the frequent itemsets that include that item

Recursion: The algorithm recursively mines the conditional FP-trees until no further frequent itemsets can be found

82 of 91

Let’s recap Prefix Tree from Lecture 3

Prefix Tree (Trie): Used to store a dynamic set of strings where common prefixes are shared, allowing for efficient retrieval, insertion, and prefix-based search operations

83 of 91

FP-Tree Construction Example

Minimum support threshold: 0.5

With 3 transactions, an item must appear in at least 2 transactions

Transaction ID | Items
1 | A, B
2 | B, C, D
3 | A, C, D, E

https://colab.research.google.com/github/zubair-nabi/applied-big-data/blob/main/Notebooks/Lecture6/FPGrowth-Steps.ipynb

84 of 91

FP-Growth Key Concepts

Antecedents: Items that are considered the "cause" or the condition under which the rule applies

Consequents: The items that are likely to be purchased as a result of the antecedents

Support: The proportion of transactions that contain the itemset (both antecedents and consequents)

Confidence: How often the consequent is purchased when the antecedent is purchased

Lift: How much more likely the consequent is purchased when the antecedent is purchased compared to when the antecedent is not purchased

85 of 91

FP-Growth Key Concepts (2)

When a customer buys Milk and Bread (antecedents), there is an 80% chance (confidence of 0.80) that they will also buy Butter

This rule applies to 5% of all transactions (support of 0.05)

The lift of 1.25 indicates that buying Milk and Bread increases the likelihood of buying Butter by 25% compared to the average purchase behavior

Antecedents | Consequents | Support | Confidence | Lift
{Milk, Bread} | {Butter} | 0.05 | 0.8 | 1.25

86 of 91

FP-Growth Implementation

The code generates 1000 random transactions and encodes data using One-Hot Encoding

FP-Growth is used to generate association rules from the frequent itemsets with antecedents, consequents, support, confidence, and lift

https://colab.research.google.com/github/zubair-nabi/applied-big-data/blob/main/Notebooks/Lecture6/FPGrowth.ipynb
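One common way to do this in Python is with the mlxtend library; a minimal sketch with a hand-made toy basket (the transactions and thresholds are illustrative assumptions):

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, association_rules

transactions = [
    ["Milk", "Bread", "Butter"],
    ["Milk", "Bread"],
    ["Bread", "Eggs"],
    ["Milk", "Butter"],
    ["Milk", "Bread", "Butter", "Eggs"],
]

# One-hot encode the transactions into a boolean item matrix
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Mine frequent itemsets with FP-Growth, then derive association rules
freq_itemsets = fpgrowth(onehot, min_support=0.4, use_colnames=True)
rules = association_rules(freq_itemsets, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])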

87 of 91

Parallel FP-Growth Algorithm

FP-Tree Construction

Scan the Database in Parallel: The algorithm scans the database in chunks to determine the frequency of each item. After processing all chunks, the frequency counts are aggregated to identify the global frequency of each item across the entire dataset

Sort and Filter: The frequent items are sorted in descending order of their frequency

Build the FP-Tree in Parallel: A local FP-tree is created for each chunk and then aggregated into a single FP-tree that represents all the frequent itemsets across the entire dataset

FP-Tree Mining

Conditional Pattern Base: The global FP-tree is traversed, and frequent itemsets are extracted in a parallel manner

Build Conditional FP-Tree: Subsets of the FP-tree are processed independently to create conditional FP-trees for further mining

Recursion: The algorithm recursively mines the conditional FP-trees until no further frequent itemsets can be found

88 of 91

Spark FP-Growth Implementation

The code generates 1000 random transactions and converts them to a Spark DataFrame

FP-Growth in Spark is used to generate association rules from the frequent itemsets in a distributed fashion, with antecedents, consequents, support, confidence, and lift

https://colab.research.google.com/github/zubair-nabi/applied-big-data/blob/main/Notebooks/Lecture6/FPGrowth-Spark.ipynb
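A minimal sketch of Spark's FPGrowth estimator (assuming an active SparkSession named spark, as in the earlier sketches; the toy transactions and thresholds are illustrative):

from pyspark.ml.fpm import FPGrowth

# Each row holds one transaction as an array of item names
transactions_df = spark.createDataFrame(
    [(1, ["Milk", "Bread", "Butter"]),
     (2, ["Milk", "Bread"]),
     (3, ["Bread", "Eggs"]),
     (4, ["Milk", "Butter"]),
     (5, ["Milk", "Bread", "Butter", "Eggs"])],
    ["transaction_id", "items"],
)

fp = FPGrowth(itemsCol="items", minSupport=0.4, minConfidence=0.6)
model = fp.fit(transactions_df)

model.freqItemsets.show()       # frequent itemsets with their counts
model.associationRules.show()   # antecedent, consequent, confidence, lift, support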

89 of 91

Some more use cases…

Graph clustering is used in GraphRAG (Lecture 5)

PCA is used for image compression and recognition by reducing the dimensionality of image data while preserving key features

Genomic datasets often contain thousands of genes, resulting in a high-dimensional space where each dimension corresponds to a gene. PCA helps reduce the dimensionality of gene expression data

Frequent itemset mining analyzes clinical data to identify common combinations of symptoms or diagnoses that occur together

90 of 91

Summary

Unsupervised learning identifies patterns in unlabeled data, allowing for the discovery of hidden structures and relationships

K-means clustering partitions data into K distinct clusters based on feature similarity, making it efficient for analyzing large datasets

PCA (Principal Component Analysis) is a dimensionality reduction technique that transforms high-dimensional data into a lower-dimensional space while preserving variance

FP-Growth is a frequent pattern mining algorithm that efficiently discovers frequent itemsets in large transaction datasets

All three algorithms apply to big data, as their implementations are easily parallelized and distributed

91 of 91

Recommended Reading

Chapters 6, 7, and 11 of “Mining of Massive Datasets” by Jure Leskovec, Anand Rajaraman, and Jeff Ullman

Spark ML guide: https://spark.apache.org/docs/latest/ml-guide.html