1 of 78

UNIT-IV

Association Analysis:

Basic Concepts and Algorithms


Association Rule Mining

  • Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction

Association Rule Mining

Monday, May 08, 2023

Market-Basket transactions

Example of Association Rules

{Diaper} → {Beer}, {Milk, Bread} → {Eggs, Coke}, {Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!


  • Binary Representation: market-basket data can be represented in a binary format, where each row corresponds to a transaction and each column corresponds to an item.
  • An item can be treated as a binary variable whose value is one if the item is present in a transaction and zero otherwise.
  • Because the presence of an item in a transaction is often considered more important than its absence, an item is an asymmetric binary variable.
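As a sketch, a few hypothetical transactions can be converted to this binary form:

```python
# Binary (0/1) representation of hypothetical market-basket data:
# one row per transaction, one column per item.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
]
items = sorted({i for t in transactions for i in t})
matrix = [[1 if i in t else 0 for i in items] for t in transactions]

print(items)
for row in matrix:
    print(row)
```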


Definition: Frequent Itemset

Itemset

  • Let I = {i1, i2, ..., id} be the set of all items in a market-basket data set
  • Let T = {t1, t2, ..., tN} be the set of all transactions
  • Each transaction ti contains a subset of items chosen from I
  • In association analysis, a collection of zero or more items is termed an itemset
  • If an itemset contains k items, it is called a k-itemset
  • Example: {Beer, Diapers, Milk} is a 3-itemset
  • The null or empty set is an itemset that does not contain any items

Support count (σ)

  • The support count is the number of transactions that contain a particular itemset. Mathematically, the support count σ(X) for an itemset X can be stated as follows: σ(X) = |{ti | X ⊆ ti, ti ∈ T}|


Support and Confidence

  • Support determines how often a rule is applicable to a given data set: s(X → Y) = σ(X ∪ Y) / N
  • Confidence determines how frequently items in Y appear in transactions that contain X: c(X → Y) = σ(X ∪ Y) / σ(X)
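Both measures can be computed directly; the transactions below are hypothetical and mirror the market-basket example:

```python
# Sketch: computing support and confidence for a rule X -> Y
# over a small hypothetical set of market-basket transactions.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    count = sum(1 for t in transactions if itemset <= t)
    return count / len(transactions)

def confidence(X, Y, transactions):
    """How often items in Y appear in transactions that contain X."""
    return support(X | Y, transactions) / support(X, transactions)

s = support({"Milk", "Diaper", "Beer"}, transactions)       # 2/5 = 0.4
c = confidence({"Milk", "Diaper"}, {"Beer"}, transactions)  # 2/3
print(s, round(c, 2))
```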


Definition: Association Rule


Example:

  • Association Rule
    • An implication expression of the form X → Y, where X and Y are itemsets
    • Example: {Milk, Diaper} → {Beer}

  • Rule Evaluation Metrics
    • Support (s)
      • Fraction of transactions that contain both X and Y
    • Confidence (c)
      • Measures how often items in Y appear in transactions that contain X


Association Rule Mining Task

  • Given a set of transactions T, the goal of association rule mining is to find all rules having
    • support ≥ minsup threshold
    • confidence ≥ minconf threshold

  • Brute-force approach:
    • List all possible association rules
    • Compute the support and confidence for each rule
    • Prune rules that fail the minsup and minconf thresholds



Computational Complexity

  • Given d unique items:
    • Total number of itemsets = 2^d
    • Total number of possible association rules: R = 3^d − 2^(d+1) + 1


If d=6, R = 602 rules
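This count follows from the formula R = 3^d − 2^(d+1) + 1, which can be checked directly:

```python
# Total number of possible association rules over d items:
# R = 3^d - 2^(d+1) + 1
def total_rules(d):
    return 3**d - 2**(d + 1) + 1

print(total_rules(6))  # 729 - 128 + 1 = 602
```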


Mining Association Rules


Example of Rules:

{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)

Observations:

  • All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
  • Rules originating from the same itemset have identical support but can have different confidence
  • Thus, we may decouple the support and confidence requirements


Mining Association Rules

  • Two-step approach:
    1. Frequent Itemset Generation
      • Generate all itemsets whose support ≥ minsup

    2. Rule Generation
      • Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset

  • Frequent itemset generation is still computationally expensive



Frequent Itemset Generation


Given d items, there are 2^d possible candidate itemsets


Frequent Itemset Generation

  • Brute-force approach:
    • Each itemset in the lattice is a candidate frequent itemset
    • Count the support of each candidate by scanning the database

    • Match each transaction against every candidate
    • Complexity ~ O(NMw) => Expensive since M = 2^d !!!



Frequent Itemset Generation Strategies

  • Reduce the number of candidates (M)
    • Complete search: M = 2^d
    • Use pruning techniques to reduce M

  • Reduce the number of transactions (N)
    • Reduce size of N as the size of itemset increases
    • Used by DHP and vertical-based mining algorithms

  • Reduce the number of comparisons (NM)
    • Use efficient data structures to store the candidates or transactions
    • No need to match every candidate against every transaction


Reducing Number of Candidates

  • Apriori principle:
    • If an itemset is frequent, then all of its subsets must also be frequent
  • Apriori principle holds due to the following property of the support measure:

    • The support of an itemset never exceeds the support of its subsets. This is known as the anti-monotone property of the support measure.



Illustrating Apriori Principle


[Lattice diagram: once an itemset is found to be infrequent, all of its supersets are pruned.]

Any measure that possesses an anti-monotone property can be incorporated directly into the mining algorithm to effectively prune the exponential search space of candidate itemsets


Frequent Itemset Generation in the Apriori Algorithm

  • Apriori was the first association rule mining algorithm to use support-based pruning to systematically control the exponential growth of candidate itemsets.
  • Example:


Illustrating Apriori Principle

Minimum Support = 3

Items (1-itemsets)

Pairs (2-itemsets)
(No need to generate candidates involving Coke or Eggs)

Triplets (3-itemsets)

If every subset is considered:
6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41 candidates

With support-based pruning:
6 + 6 + 4 = 16 candidates; pruning triplet candidates that have an infrequent subset further reduces this to 6 + 6 + 1 = 13

Apriori Algorithm

    • Fk: frequent k-itemsets
    • Lk: candidate k-itemsets

  • Algorithm
    • Let k=1
    • Generate F1 = {frequent 1-itemsets}
    • Repeat until Fk is empty
      • Candidate Generation: Generate Lk+1 from Fk
      • Candidate Pruning: Prune candidate itemsets in Lk+1 containing subsets of length k that are infrequent
      • Support Counting: Count the support of each candidate in Lk+1 by scanning the DB
      • Candidate Elimination: Eliminate candidates in Lk+1 that are infrequent, leaving only those that are frequent => Fk+1
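The loop above can be sketched in Python. This is a minimal illustration, not the textbook's pseudocode: transactions are assumed to be Python sets, minsup is an absolute count, and candidates are generated by merging pairs of frequent k-itemsets.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Level-wise Apriori sketch. `transactions` is a list of item sets,
    `minsup` an absolute support count. Returns {frozenset: support}."""
    # F1: frequent 1-itemsets
    items = sorted({i for t in transactions for i in t})
    freq = {}
    frontier = []
    for i in items:
        cnt = sum(1 for t in transactions if i in t)
        if cnt >= minsup:
            fs = frozenset([i])
            freq[fs] = cnt
            frontier.append(fs)
    k = 1
    while frontier:
        # Candidate generation: merge pairs of frequent k-itemsets
        candidates = {a | b for a, b in combinations(frontier, 2)
                      if len(a | b) == k + 1}
        # Candidate pruning: drop candidates with an infrequent k-subset
        candidates = {cand for cand in candidates
                      if all(frozenset(s) in freq
                             for s in combinations(cand, k))}
        # Support counting and candidate elimination
        frontier = []
        for cand in candidates:
            cnt = sum(1 for t in transactions if cand <= t)
            if cnt >= minsup:
                freq[cand] = cnt
                frontier.append(cand)
        k += 1
    return freq
```

On the market-basket example with minsup = 3, this yields four frequent items and four frequent pairs, and no frequent triplet.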



  • The Apriori algorithm has two important characteristics.
  • First, it is a level-wise algorithm: it traverses the itemset lattice one level at a time, from frequent 1-itemsets to the maximum size of frequent itemsets.
  • Second, it employs a generate-and-test strategy: at each iteration, new candidate itemsets are generated from the frequent itemsets found in the previous iteration. The support for each candidate is then counted and tested against the minsup threshold.
  • The total number of iterations needed by the algorithm is kmax + 1, where kmax is the maximum size of the frequent itemsets.


Candidate Generation and Pruning

  • The apriori-gen function generates candidate itemsets by performing the following two operations:

1. Candidate Generation. This operation generates new candidate k-itemsets based on the frequent (k − 1)-itemsets found in the previous iteration.

2. Candidate Pruning. This operation eliminates some of the candidate k-itemsets using the support-based pruning strategy.


Ways to Generate Candidate Itemsets

The following is a list of requirements for an effective candidate generation procedure:

  • It should avoid generating too many unnecessary candidates. A candidate itemset is unnecessary if at least one of its subsets is infrequent
  • It must ensure that the candidate set is complete, i.e., no frequent itemsets are left out by the candidate generation procedure
  • It should not generate the same candidate itemset more than once


Generating Association Rules from Frequent Itemsets

  • Once the frequent itemsets from transactions in a database D have been found, it is straightforward to generate strong association rules from them, where strong association rules satisfy both minimum support and minimum confidence.


Association rules can be generated as follows:

  • For each frequent itemset l, generate all nonempty subsets of l.
  • For every nonempty subset s of l, output the rule "s ⇒ (l − s)" if support_count(l) / support_count(s) ≥ min_conf, where min_conf is the minimum confidence threshold.
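This procedure can be sketched directly; `support_count` below is a hypothetical lookup table mapping itemsets (and all their subsets) to their support counts:

```python
from itertools import combinations

def gen_rules(l, support_count, min_conf):
    """For a frequent itemset l, emit rules s -> l - s whose confidence
    support_count[l] / support_count[s] >= min_conf.
    Assumes support_count has an entry for every subset of l."""
    l = frozenset(l)
    rules = []
    for r in range(1, len(l)):
        for s in combinations(sorted(l), r):
            s = frozenset(s)
            conf = support_count[l] / support_count[s]
            if conf >= min_conf:
                rules.append((set(s), set(l - s), conf))
    return rules

# Hypothetical support counts for {Milk, Diaper, Beer} and its subsets
counts = {frozenset({"Milk", "Diaper", "Beer"}): 2,
          frozenset({"Milk", "Diaper"}): 3, frozenset({"Milk", "Beer"}): 2,
          frozenset({"Diaper", "Beer"}): 3, frozenset({"Milk"}): 4,
          frozenset({"Diaper"}): 4, frozenset({"Beer"}): 3}
print(gen_rules({"Milk", "Diaper", "Beer"}, counts, 0.8))
```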


Example

  • Consider the frequent itemset l = {I1, I2, I5}.
  • The nonempty subsets of l are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, and {I5}.

  • The resulting association rules, one for each subset s (s ⇒ l − s), are then listed with their confidences and checked against the minimum confidence threshold.


  • Brute-Force Method: the brute-force method considers every k-itemset as a potential candidate and then applies the candidate pruning step to remove any unnecessary candidates.
  • Fk−1 × F1 Method: an alternative method for candidate generation is to extend each frequent (k − 1)-itemset with other frequent items.
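A minimal sketch of the Fk−1 × F1 extension (names illustrative):

```python
def candidates_fk1_f1(Fk_1, F1):
    """F(k-1) x F1 method (sketch): extend each frequent (k-1)-itemset
    with every frequent item not already in it."""
    cands = set()
    for itemset in Fk_1:
        for item in F1:
            if item not in itemset:
                cands.add(itemset | frozenset([item]))
    return cands

print(candidates_fk1_f1([frozenset({"Bread", "Milk"})],
                        ["Bread", "Milk", "Diaper"]))
```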


Support counting

  • Support counting is the process of determining the frequency of occurrence for every candidate item set that survives the candidate pruning step of the Apriori-gen function.
  • One approach for doing this is to compare each transaction against every candidate item set and to update the support counts of candidates contained in the transaction.
  • An alternative approach is to enumerate the item sets contained in each transaction and use them to update the support counts of their respective candidate item sets.
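The second approach, enumerating the itemsets contained in each transaction, can be sketched as follows (illustrative):

```python
from itertools import combinations

def count_supports(transactions, candidates, k):
    """Enumerate the k-itemsets contained in each transaction and
    update the counts of those that are candidates (sketch)."""
    counts = {c: 0 for c in candidates}
    for t in transactions:
        for subset in combinations(sorted(t), k):
            fs = frozenset(subset)
            if fs in counts:
                counts[fs] += 1
    return counts

tx = [{1, 2, 3}, {1, 2}, {2, 3}]
print(count_supports(tx, [frozenset({1, 2}), frozenset({2, 3})], 2))
```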


Support Counting Using a Hash Tree

  • In the Apriori algorithm, candidate itemsets are partitioned into different buckets and stored in a hash tree.
  • During support counting, itemsets contained in each transaction are also hashed into their appropriate buckets.
  • Instead of comparing each itemset in the transaction with every candidate itemset, it is matched only against candidate itemsets that belong to the same bucket.


Support Counting of Candidate Item sets

  • To reduce the number of comparisons, store the candidate itemsets in a hash structure
    • Instead of matching each transaction against every candidate, match it against candidates contained in the hashed buckets


Support Counting Using a Hash Tree

Hash function h(p) = p mod 3: items 1, 4, 7 hash to one branch; 2, 5, 8 to another; 3, 6, 9 to the third.

  • Hash tree structure: each internal node of the tree uses the hash function h(p) = p mod 3 to determine which branch of the current node to follow.
  • All candidate itemsets are stored at the leaf nodes of the hash tree.
  • Consider a transaction t = {1, 2, 3, 5, 6}.
  • To update the support counts of the candidate itemsets, the hash tree must be traversed in such a way that all the leaf nodes containing candidate 3-itemsets belonging to t are visited at least once.
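A simplified sketch of this bucketed matching with h(p) = p mod 3. Here every candidate 3-itemset is placed by its full hash path (a real hash tree splits nodes only until leaves are small enough), and the candidate list is illustrative:

```python
from itertools import combinations

def h(p):
    return p % 3

def bucket(itemset):
    """Path of hash values, one per level, for a sorted 3-itemset."""
    return tuple(h(i) for i in sorted(itemset))

# Illustrative candidate 3-itemsets, partitioned into leaf buckets
candidates = [(1, 4, 5), (1, 2, 4), (4, 5, 7), (1, 2, 5), (4, 5, 8),
              (1, 5, 9), (1, 3, 6), (2, 3, 4), (5, 6, 7), (3, 4, 5),
              (3, 5, 6), (3, 5, 7), (6, 8, 9), (3, 6, 7), (3, 6, 8)]
tree = {}
for c in candidates:
    tree.setdefault(bucket(c), []).append(frozenset(c))

def count_in_tree(transaction, tree):
    """Match each 3-subset of the transaction only against the bucket
    it hashes to, not against every candidate."""
    counts = {}
    for sub in combinations(sorted(transaction), 3):
        for cand in tree.get(bucket(sub), []):
            if cand == frozenset(sub):
                counts[cand] = counts.get(cand, 0) + 1
    return counts

print(count_in_tree({1, 2, 3, 5, 6}, tree))
```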


Apriori Algorithm Disadvantages

  • Apriori needs to generate candidate itemsets. These itemsets may be large in number if the number of items in the database is huge.
  • Apriori needs multiple scans of the database to check the support of each item set generated and this leads to high costs.


Rule Generation

  • Each frequent k-itemset Y can produce up to 2^k − 2 association rules
  • An association rule can be extracted by partitioning the itemset Y into two non-empty subsets, X and Y − X, such that

X → Y − X satisfies the confidence threshold


  • Because the support of each such rule is identical to the support of Y, the rules automatically satisfy the support threshold.

  • Computing the confidence of an association rule does not require additional scans of the transaction data set.


Confidence-Based Pruning

To compare rules generated from the same frequent itemset Y, the following theorem holds for the confidence measure: if a rule X → Y − X does not satisfy the confidence threshold, then any rule X′ → Y − X′, where X′ is a subset of X, does not satisfy the confidence threshold either.


Rule Generation in Apriori Algorithm

  • The Apriori algorithm uses a level-wise approach for generating association rules, where each level corresponds to the number of items that belong to the rule consequent.
  • Initially, all the high-confidence rules that have only one item in the rule consequent are extracted.


  • Consider a lattice structure for the association rules generated from the frequent itemset {a, b, c, d}.
  • If any node in the lattice has low confidence, the entire subgraph spanned by that node can be pruned immediately.


Limitations of Apriori Algorithm

  • It may need to generate a huge number of candidate sets. For example, if there are 10^4 frequent 1-itemsets, the Apriori algorithm will need to generate more than 10^7 candidate 2-itemsets.
  • The Algorithm incurs considerable I/O overhead since It may need to repeatedly scan the database and check a large set of candidates by pattern matching.


Compact Representation of Frequent Itemsets

  • The number of frequent itemsets produced from a transaction data set can be very large. It is useful to identify a small representative set of itemsets from which all other frequent itemsets can be derived.
  • Two such representations are presented in this section: maximal and closed frequent itemsets.


Closed Frequent Itemsets

An itemset X is closed if none of its immediate supersets has exactly the same support count as X. A closed frequent itemset is a closed itemset whose support is greater than or equal to minsup.

Maximal Frequent Itemsets


An itemset is maximal frequent if it is frequent and none of its immediate supersets is frequent
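The definition can be checked directly; `frequent_sets` below is a hypothetical collection of frequent itemsets:

```python
def is_maximal(itemset, frequent_sets, all_items):
    """An itemset is maximal frequent if it is frequent and no
    immediate superset (one extra item) is frequent."""
    fs = frozenset(itemset)
    if fs not in frequent_sets:
        return False
    return all(fs | {i} not in frequent_sets for i in all_items - fs)

# Hypothetical frequent itemsets over items {a, b, c}
frequent_sets = {frozenset({"a"}), frozenset({"b"}), frozenset({"a", "b"})}
print(is_maximal({"a", "b"}, frequent_sets, {"a", "b", "c"}))
print(is_maximal({"a"}, frequent_sets, {"a", "b", "c"}))
```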


FP-Growth Algorithm

  • FP-growth is an alternative algorithm that takes a radically different approach to discovering frequent itemsets.
  • The algorithm does not subscribe to the generate-and-test paradigm of Apriori. Instead, it encodes the data set using a compact data structure called an FP-tree.


Frequent Pattern Growth Algorithm

  • A frequent pattern is generated without the need for candidate generation.

  • FP growth algorithm represents the database in the form of a tree called a frequent pattern tree or FP tree.


FP Tree

  • Frequent Pattern Tree is a tree-like structure that is made with the initial item sets of the database.
  • The purpose of the FP tree is to mine the most frequent pattern. Each node of the FP tree represents an item of the item set.
  • The root node represents null, while the lower nodes represent the itemsets.
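The construction described here can be sketched as follows: a minimal FP-tree built in the algorithm's two passes over the data, without the header-table links that the mining phase would also need.

```python
class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, minsup):
    """FP-tree sketch: pass 1 counts items, pass 2 inserts each
    transaction with its frequent items in descending-frequency order."""
    # Pass 1: count item frequencies and keep the frequent ones
    freq = {}
    for t in transactions:
        for i in t:
            freq[i] = freq.get(i, 0) + 1
    freq = {i: c for i, c in freq.items() if c >= minsup}
    # Pass 2: insert transactions; the root represents null
    root = FPNode(None, None)
    for t in transactions:
        items = sorted((i for i in t if i in freq),
                       key=lambda i: (-freq[i], i))
        node = root
        for i in items:
            if i not in node.children:
                node.children[i] = FPNode(i, node)
            node = node.children[i]
            node.count += 1
    return root, freq

tx = [{"a", "b"}, {"b", "c", "d"}, {"a", "c", "d", "e"},
      {"a", "d", "e"}, {"a", "b", "c"}]
root, freq = build_fp_tree(tx, 2)
print(freq)
```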


Example of FP-Growth Algorithm


Advantages -FP Growth Algorithm

  • This algorithm needs to scan the database only twice.
  • Candidate pairs of items are not generated in this algorithm, which makes it faster.
  • The database is stored in a compact version in memory.
  • It is efficient and scalable for mining both long and short frequent patterns.
