1 of 78

UNIT-IV

Association Analysis:

Basic Concepts and Algorithms


Association Rule Mining

  • Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction

Association Rule Mining

Monday, May 08, 2023

Market-Basket transactions

Example of Association Rules

{Diaper} → {Beer}, {Milk, Bread} → {Eggs, Coke}, {Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!


  • Binary Representation: market-basket data can be represented in a binary format, where each row corresponds to a transaction and each column corresponds to an item.
  • An item can be treated as a binary variable whose value is one if the item is present in a transaction and zero otherwise.
  • Because the presence of an item in a transaction is often considered more important than its absence, an item is an asymmetric binary variable.
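As a sketch, a few hypothetical transactions can be converted to this binary form:

```python
# Binary (0/1) representation of hypothetical market-basket data:
# one row per transaction, one column per item.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
]
items = sorted({i for t in transactions for i in t})
matrix = [[1 if i in t else 0 for i in items] for t in transactions]

print(items)
for row in matrix:
    print(row)
```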


Definition: Frequent Itemset

Itemset

  • Let I = {i1, i2, ..., id} be the set of all items in a market-basket data set
  • Let T = {t1, t2, ..., tN} be the set of all transactions
  • Each transaction ti contains a subset of items chosen from I
  • In association analysis, a collection of zero or more items is termed an itemset
  • If an itemset contains k items, it is called a k-itemset
  • Example: {Beer, Diapers, Milk} is a 3-itemset
  • The null or empty set is an itemset that does not contain any items

Support count (σ)

  • The support count is the number of transactions that contain a particular itemset. Mathematically, the support count σ(X) for an itemset X can be stated as follows: σ(X) = |{ti | X ⊆ ti, ti ∈ T}|


Support and Confidence

  • Support determines how often a rule is applicable to a given data set: s(X → Y) = σ(X ∪ Y) / N
  • Confidence determines how frequently items in Y appear in transactions that contain X: c(X → Y) = σ(X ∪ Y) / σ(X)
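Both measures can be computed directly; the transactions below are hypothetical and mirror the market-basket example:

```python
# Sketch: computing support and confidence for a rule X -> Y
# over a small hypothetical set of market-basket transactions.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    count = sum(1 for t in transactions if itemset <= t)
    return count / len(transactions)

def confidence(X, Y, transactions):
    """How often items in Y appear in transactions that contain X."""
    return support(X | Y, transactions) / support(X, transactions)

s = support({"Milk", "Diaper", "Beer"}, transactions)       # 2/5 = 0.4
c = confidence({"Milk", "Diaper"}, {"Beer"}, transactions)  # 2/3
print(s, round(c, 2))
```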


Definition: Association Rule


Example:

  • Association Rule
    • An implication expression of the form X → Y, where X and Y are itemsets
    • Example: {Milk, Diaper} → {Beer}

  • Rule Evaluation Metrics
    • Support (s)
      • Fraction of transactions that contain both X and Y
    • Confidence (c)
      • Measures how often items in Y appear in transactions that contain X


Association Rule Mining Task

  • Given a set of transactions T, the goal of association rule mining is to find all rules having
    • support ≥ minsup threshold
    • confidence ≥ minconf threshold

  • Brute-force approach:
    • List all possible association rules
    • Compute the support and confidence for each rule
    • Prune rules that fail the minsup and minconf thresholds



Computational Complexity

  • Given d unique items:
    • Total number of itemsets = 2^d
    • Total number of possible association rules: R = 3^d − 2^(d+1) + 1


If d=6, R = 602 rules
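This count follows from the formula R = 3^d − 2^(d+1) + 1, which can be checked directly:

```python
# Total number of possible association rules over d items:
# R = 3^d - 2^(d+1) + 1
def total_rules(d):
    return 3**d - 2**(d + 1) + 1

print(total_rules(6))  # 729 - 128 + 1 = 602
```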


Mining Association Rules


Example of Rules:

{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)

Observations:

  • All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
  • Rules originating from the same itemset have identical support but can have different confidence
  • Thus, we may decouple the support and confidence requirements


Mining Association Rules

  • Two-step approach:
    1. Frequent Itemset Generation
      • Generate all itemsets whose support ≥ minsup

    2. Rule Generation
      • Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset

  • Frequent itemset generation is still computationally expensive



Frequent Itemset Generation


Given d items, there are 2^d possible candidate itemsets


Frequent Itemset Generation

  • Brute-force approach:
    • Each itemset in the lattice is a candidate frequent itemset
    • Count the support of each candidate by scanning the database

    • Match each transaction against every candidate
    • Complexity ~ O(NMw) => Expensive since M = 2^d !!!



Frequent Itemset Generation Strategies

  • Reduce the number of candidates (M)
    • Complete search: M = 2^d
    • Use pruning techniques to reduce M

  • Reduce the number of transactions (N)
    • Reduce size of N as the size of itemset increases
    • Used by DHP and vertical-based mining algorithms

  • Reduce the number of comparisons (NM)
    • Use efficient data structures to store the candidates or transactions
    • No need to match every candidate against every transaction


Reducing Number of Candidates

  • Apriori principle:
    • If an itemset is frequent, then all of its subsets must also be frequent
  • Apriori principle holds due to the following property of the support measure:

    • The support of an itemset never exceeds the support of its subsets. This is known as the anti-monotone property of the support measure.



Illustrating Apriori Principle


[Lattice diagram: once an itemset is found to be infrequent, all of its supersets are pruned.]

Any measure that possesses an anti-monotone property can be incorporated directly into the mining algorithm to effectively prune the exponential search space of candidate itemsets


Frequent Itemset Generation in the Apriori Algorithm

  • Apriori was the first association rule mining algorithm to use support-based pruning to systematically control the exponential growth of candidate itemsets.
  • Example:


Illustrating Apriori Principle

Minimum Support = 3

Items (1-itemsets)

Pairs (2-itemsets)
(No need to generate candidates involving Coke or Eggs)

Triplets (3-itemsets)

If every subset is considered:
6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41 candidates

With support-based pruning:
6 + 6 + 4 = 16 candidates; pruning triplet candidates that have an infrequent subset further reduces this to 6 + 6 + 1 = 13

Apriori Algorithm

    • Fk: frequent k-itemsets
    • Lk: candidate k-itemsets

  • Algorithm
    • Let k=1
    • Generate F1 = {frequent 1-itemsets}
    • Repeat until Fk is empty
      • Candidate Generation: Generate Lk+1 from Fk
      • Candidate Pruning: Prune candidate itemsets in Lk+1 containing subsets of length k that are infrequent
      • Support Counting: Count the support of each candidate in Lk+1 by scanning the DB
      • Candidate Elimination: Eliminate candidates in Lk+1 that are infrequent, leaving only those that are frequent => Fk+1
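The loop above can be sketched in Python. This is a minimal illustration, not the textbook's pseudocode: transactions are assumed to be Python sets, minsup is an absolute count, and candidates are generated by merging pairs of frequent k-itemsets.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Level-wise Apriori sketch. `transactions` is a list of item sets,
    `minsup` an absolute support count. Returns {frozenset: support}."""
    # F1: frequent 1-itemsets
    items = sorted({i for t in transactions for i in t})
    freq = {}
    frontier = []
    for i in items:
        cnt = sum(1 for t in transactions if i in t)
        if cnt >= minsup:
            fs = frozenset([i])
            freq[fs] = cnt
            frontier.append(fs)
    k = 1
    while frontier:
        # Candidate generation: merge pairs of frequent k-itemsets
        candidates = {a | b for a, b in combinations(frontier, 2)
                      if len(a | b) == k + 1}
        # Candidate pruning: drop candidates with an infrequent k-subset
        candidates = {cand for cand in candidates
                      if all(frozenset(s) in freq
                             for s in combinations(cand, k))}
        # Support counting and candidate elimination
        frontier = []
        for cand in candidates:
            cnt = sum(1 for t in transactions if cand <= t)
            if cnt >= minsup:
                freq[cand] = cnt
                frontier.append(cand)
        k += 1
    return freq
```

On the market-basket example with minsup = 3, this yields four frequent items and four frequent pairs, and no frequent triplet.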



  • The Apriori algorithm has two important characteristics.
  • First, it is a level-wise algorithm: it traverses the itemset lattice one level at a time, from frequent 1-itemsets to the maximum size of frequent itemsets.
  • Second, it employs a generate-and-test strategy: at each iteration, new candidate itemsets are generated from the frequent itemsets found in the previous iteration. The support for each candidate is then counted and tested against the minsup threshold.
  • The total number of iterations needed by the algorithm is kmax + 1, where kmax is the maximum size of the frequent itemsets.


Candidate Generation and Pruning

  • The apriori-gen function generates candidate itemsets by performing the following two operations:

1. Candidate Generation. This operation generates new candidate k-itemsets based on the frequent (k − 1)-itemsets found in the previous iteration.

2. Candidate Pruning. This operation eliminates some of the candidate k-itemsets using the support-based pruning strategy.


Ways to Generate Candidate Itemsets

The following is a list of requirements for an effective candidate generation procedure:

  • It should avoid generating too many unnecessary candidates. A candidate itemset is unnecessary if at least one of its subsets is infrequent
  • It must ensure that the candidate set is complete, i.e., no frequent itemsets are left out by the candidate generation procedure
  • It should not generate the same candidate itemset more than once


Generating Association Rules from Frequent Itemsets

  • Once the frequent itemsets from transactions in a database D have been found, it is straightforward to generate strong association rules from them, where strong association rules satisfy both minimum support and minimum confidence.


Association rules can be generated as follows:

  • For each frequent itemset l, generate all nonempty subsets of l.
  • For every nonempty subset s of l, output the rule "s ⇒ (l − s)" if support_count(l) / support_count(s) ≥ min_conf, where min_conf is the minimum confidence threshold.
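This procedure can be sketched directly; `support_count` below is a hypothetical lookup table mapping itemsets (and all their subsets) to their support counts:

```python
from itertools import combinations

def gen_rules(l, support_count, min_conf):
    """For a frequent itemset l, emit rules s -> l - s whose confidence
    support_count[l] / support_count[s] >= min_conf.
    Assumes support_count has an entry for every subset of l."""
    l = frozenset(l)
    rules = []
    for r in range(1, len(l)):
        for s in combinations(sorted(l), r):
            s = frozenset(s)
            conf = support_count[l] / support_count[s]
            if conf >= min_conf:
                rules.append((set(s), set(l - s), conf))
    return rules

# Hypothetical support counts for {Milk, Diaper, Beer} and its subsets
counts = {frozenset({"Milk", "Diaper", "Beer"}): 2,
          frozenset({"Milk", "Diaper"}): 3, frozenset({"Milk", "Beer"}): 2,
          frozenset({"Diaper", "Beer"}): 3, frozenset({"Milk"}): 4,
          frozenset({"Diaper"}): 4, frozenset({"Beer"}): 3}
print(gen_rules({"Milk", "Diaper", "Beer"}, counts, 0.8))
```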


Example

  • Consider the frequent itemset l = {I1, I2, I5}.
  • The nonempty subsets of l are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, and {I5}.

  • The resulting association rules, one for each subset s (s ⇒ l − s), are then listed with their confidences and checked against the minimum confidence threshold.


  • Brute-Force Method: the brute-force method considers every k-itemset as a potential candidate and then applies the candidate pruning step to remove any unnecessary candidates.
  • Fk−1 × F1 Method: an alternative method for candidate generation is to extend each frequent (k − 1)-itemset with other frequent items.
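A minimal sketch of the Fk−1 × F1 extension (names illustrative):

```python
def candidates_fk1_f1(Fk_1, F1):
    """F(k-1) x F1 method (sketch): extend each frequent (k-1)-itemset
    with every frequent item not already in it."""
    cands = set()
    for itemset in Fk_1:
        for item in F1:
            if item not in itemset:
                cands.add(itemset | frozenset([item]))
    return cands

print(candidates_fk1_f1([frozenset({"Bread", "Milk"})],
                        ["Bread", "Milk", "Diaper"]))
```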


Support counting

  • Support counting is the process of determining the frequency of occurrence for every candidate item set that survives the candidate pruning step of the Apriori-gen function.
  • One approach for doing this is to compare each transaction against every candidate item set and to update the support counts of candidates contained in the transaction.
  • An alternative approach is to enumerate the item sets contained in each transaction and use them to update the support counts of their respective candidate item sets.
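The second approach, enumerating the itemsets contained in each transaction, can be sketched as follows (illustrative):

```python
from itertools import combinations

def count_supports(transactions, candidates, k):
    """Enumerate the k-itemsets contained in each transaction and
    update the counts of those that are candidates (sketch)."""
    counts = {c: 0 for c in candidates}
    for t in transactions:
        for subset in combinations(sorted(t), k):
            fs = frozenset(subset)
            if fs in counts:
                counts[fs] += 1
    return counts

tx = [{1, 2, 3}, {1, 2}, {2, 3}]
print(count_supports(tx, [frozenset({1, 2}), frozenset({2, 3})], 2))
```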


Support Counting Using a Hash Tree

  • In the Apriori algorithm, candidate itemsets are partitioned into different buckets and stored in a hash tree.
  • During support counting, itemsets contained in each transaction are also hashed into their appropriate buckets.
  • Instead of comparing each itemset in the transaction with every candidate itemset, it is matched only against candidate itemsets that belong to the same bucket.


Support Counting of Candidate Item sets

  • To reduce the number of comparisons, store the candidate itemsets in a hash structure
    • Instead of matching each transaction against every candidate, match it against candidates contained in the hashed buckets


Support Counting Using a Hash Tree

Hash function h(p) = p mod 3: items 1, 4, 7 hash to one branch; 2, 5, 8 to another; 3, 6, 9 to the third.

  • Hash tree structure: each internal node of the tree uses the hash function h(p) = p mod 3 to determine which branch of the current node to follow.
  • All candidate itemsets are stored at the leaf nodes of the hash tree.
  • Consider a transaction t = {1, 2, 3, 5, 6}.
  • To update the support counts of the candidate itemsets, the hash tree must be traversed in such a way that all the leaf nodes containing candidate 3-itemsets belonging to t are visited at least once.
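A simplified sketch of this bucketed matching with h(p) = p mod 3. Here every candidate 3-itemset is placed by its full hash path (a real hash tree splits nodes only until leaves are small enough), and the candidate list is illustrative:

```python
from itertools import combinations

def h(p):
    return p % 3

def bucket(itemset):
    """Path of hash values, one per level, for a sorted 3-itemset."""
    return tuple(h(i) for i in sorted(itemset))

# Illustrative candidate 3-itemsets, partitioned into leaf buckets
candidates = [(1, 4, 5), (1, 2, 4), (4, 5, 7), (1, 2, 5), (4, 5, 8),
              (1, 5, 9), (1, 3, 6), (2, 3, 4), (5, 6, 7), (3, 4, 5),
              (3, 5, 6), (3, 5, 7), (6, 8, 9), (3, 6, 7), (3, 6, 8)]
tree = {}
for c in candidates:
    tree.setdefault(bucket(c), []).append(frozenset(c))

def count_in_tree(transaction, tree):
    """Match each 3-subset of the transaction only against the bucket
    it hashes to, not against every candidate."""
    counts = {}
    for sub in combinations(sorted(transaction), 3):
        for cand in tree.get(bucket(sub), []):
            if cand == frozenset(sub):
                counts[cand] = counts.get(cand, 0) + 1
    return counts

print(count_in_tree({1, 2, 3, 5, 6}, tree))
```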


Apriori Algorithm Disadvantages

  • Apriori needs to generate candidate itemsets. These itemsets may be large in number if the number of items in the database is huge.
  • Apriori needs multiple scans of the database to check the support of each item set generated and this leads to high costs.


Rule Generation

  • Each frequent k-itemset Y can produce up to 2^k − 2 association rules
  • An association rule can be extracted by partitioning the itemset Y into two non-empty subsets, X and Y − X, such that

X → Y − X satisfies the confidence threshold


  • Because the support of each such rule is identical to the support of Y, the rules automatically satisfy the support threshold.

  • Computing the confidence of an association rule does not require additional scans of the transaction data set.


Confidence-Based Pruning

To compare rules generated from the same frequent itemset Y, the following theorem holds for the confidence measure: if a rule X → Y − X does not satisfy the confidence threshold, then any rule X′ → Y − X′, where X′ is a subset of X, does not satisfy the confidence threshold either.


Rule Generation in Apriori Algorithm

  • The Apriori algorithm uses a level-wise approach for generating association rules, where each level corresponds to the number of items that belong to the rule consequent.
  • Initially, all the high-confidence rules that have only one item in the rule consequent are extracted.


  • Consider a lattice structure for the association rules generated from the frequent itemset {a, b, c, d}.
  • If any node in the lattice has low confidence, the entire subgraph spanned by that node can be pruned immediately.


Limitations of Apriori Algorithm

  • It may need to generate a huge number of candidate sets. For example, if there are 10^4 frequent 1-itemsets, the Apriori algorithm will need to generate more than 10^7 candidate 2-itemsets.
  • The Algorithm incurs considerable I/O overhead since It may need to repeatedly scan the database and check a large set of candidates by pattern matching.


Compact Representation of Frequent Itemsets

  • The number of frequent itemsets produced from a transaction data set can be very large. It is useful to identify a small representative set of itemsets from which all other frequent itemsets can be derived.
  • Two such representations are presented in this section: maximal and closed frequent itemsets.


Closed Frequent Itemsets

An itemset X is closed if none of its immediate supersets has exactly the same support count as X. A closed frequent itemset is a closed itemset whose support is greater than or equal to minsup.

Maximal Frequent Itemsets


An itemset is maximal frequent if it is frequent and none of its immediate supersets is frequent
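The definition can be checked directly; `frequent_sets` below is a hypothetical collection of frequent itemsets:

```python
def is_maximal(itemset, frequent_sets, all_items):
    """An itemset is maximal frequent if it is frequent and no
    immediate superset (one extra item) is frequent."""
    fs = frozenset(itemset)
    if fs not in frequent_sets:
        return False
    return all(fs | {i} not in frequent_sets for i in all_items - fs)

# Hypothetical frequent itemsets over items {a, b, c}
frequent_sets = {frozenset({"a"}), frozenset({"b"}), frozenset({"a", "b"})}
print(is_maximal({"a", "b"}, frequent_sets, {"a", "b", "c"}))
print(is_maximal({"a"}, frequent_sets, {"a", "b", "c"}))
```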


FP-Growth Algorithm

  • FP-growth is an alternative algorithm that takes a radically different approach to discovering frequent itemsets.
  • The algorithm does not subscribe to the generate-and-test paradigm of Apriori. Instead, it encodes the data set using a compact data structure called an FP-tree.


Frequent Pattern Growth Algorithm

  • A frequent pattern is generated without the need for candidate generation.

  • FP growth algorithm represents the database in the form of a tree called a frequent pattern tree or FP tree.


FP Tree

  • Frequent Pattern Tree is a tree-like structure that is made with the initial item sets of the database.
  • The purpose of the FP tree is to mine the most frequent pattern. Each node of the FP tree represents an item of the item set.
  • The root node represents null, while the lower nodes represent the itemsets.
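The construction described here can be sketched as follows: a minimal FP-tree built in the algorithm's two passes over the data, without the header-table links that the mining phase would also need.

```python
class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, minsup):
    """FP-tree sketch: pass 1 counts items, pass 2 inserts each
    transaction with its frequent items in descending-frequency order."""
    # Pass 1: count item frequencies and keep the frequent ones
    freq = {}
    for t in transactions:
        for i in t:
            freq[i] = freq.get(i, 0) + 1
    freq = {i: c for i, c in freq.items() if c >= minsup}
    # Pass 2: insert transactions; the root represents null
    root = FPNode(None, None)
    for t in transactions:
        items = sorted((i for i in t if i in freq),
                       key=lambda i: (-freq[i], i))
        node = root
        for i in items:
            if i not in node.children:
                node.children[i] = FPNode(i, node)
            node = node.children[i]
            node.count += 1
    return root, freq

tx = [{"a", "b"}, {"b", "c", "d"}, {"a", "c", "d", "e"},
      {"a", "d", "e"}, {"a", "b", "c"}]
root, freq = build_fp_tree(tx, 2)
print(freq)
```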


Example of FP-Growth Algorithm


Advantages -FP Growth Algorithm

  • This algorithm needs to scan the database only twice.
  • Candidate pairs of items are not generated in this algorithm, which makes it faster.
  • The database is stored in a compact version in memory.
  • It is efficient and scalable for mining both long and short frequent patterns.
