
DATA MINING AND DATA WAREHOUSING

B.TECH - VI SEMESTER

R17 Regulation

Dr. Anupriya Koneru


UNIT-III

Association Rule Mining: 

    • Overview: Frequent patterns
    • Mining Single-Dimensional Boolean Association Rules from Transactional Databases:
      • Apriori algorithm
      • FP-Growth algorithm
    • Mining Multilevel Association Rules from Transactional Databases
    • Mining Multidimensional Association Rules from Relational Databases
    • From Association Mining to Correlation Analysis
    • Constraint-Based Association Mining


Overview- Frequent patterns

  • Frequent patterns are patterns (e.g., itemsets, subsequences, or substructures) that appear frequently in a data set.
  • For example, a set of items, such as milk and bread, that appear frequently together in a transaction data set is a frequent itemset.
  • A subsequence, such as buying first a PC, then a digital camera, and then a memory card, if it occurs frequently in a shopping history database, is a (frequent) sequential pattern.
  • A substructure can refer to different structural forms, such as subgraphs, subtrees, or sublattices, which may be combined with itemsets or subsequences.
  • If a substructure occurs frequently, it is called a (frequent) structured pattern.
  • Finding frequent patterns plays an essential role in mining associations, correlations, and many other interesting relationships among data.
  • Moreover, it helps in data classification, clustering, and other data mining tasks.
  • Thus, frequent pattern mining has become an important data mining task and a focused theme in data mining research.
  • This unit is dedicated to methods of frequent itemset mining .

Market-Basket Analysis

  • A market basket is a collection of items purchased by a customer in a single transaction, which is a well-defined business activity.
  • For example, a customer's visits to a grocery store or an online purchase from a virtual store on the Web are typical customer transactions.
  • Retailers accumulate huge collections of transactions by recording business activities over time.
  • One common analysis run against a transactions database is to find sets of items, or itemsets, that appear together in many transactions.
  • The discovery of interesting correlation relationships among huge amounts of business transaction records can help in many business decision-making processes such as catalog design, cross-marketing, and customer shopping behavior analysis.
  • An itemset containing i items is called an i-itemset.
  • The percentage of transactions that contain an itemset is called the itemset's support.
  • For an itemset to be interesting, its support must be higher than a user-specified minimum. Such itemsets are said to be frequent.


  • A typical example of frequent itemset mining is market basket analysis.
  • This process analyzes customer buying habits by finding associations between the different items that customers place in their “shopping baskets”
  • The discovery of these associations can help retailers develop marketing strategies by gaining insight into which items are frequently purchased together by customers.
  • For instance, if customers are buying milk, how likely are they to also buy bread (and what kind of bread) on the same trip to the supermarket?
  • This information can lead to increased sales by helping retailers do selective marketing and plan their shelf space.
  • Let’s look at an example of how market basket analysis can be useful.


Suppose, as manager of an AllElectronics branch, you would like to learn more about the buying habits of your customers. Specifically, you wonder, “Which groups or sets of items are customers likely to purchase on a given trip to the store?”

  • To answer your question, market basket analysis may be performed on the retail data of customer transactions at your store.
  • You can then use the results to plan marketing or advertising strategies, or in the design of a new catalog.
  • For instance, market basket analysis may help you design different store layouts.
  • In one strategy, items that are frequently purchased together can be placed in proximity to further encourage the combined sale of such items.
  • If customers who purchase computers also tend to buy antivirus software at the same time, then placing the hardware display close to the software display may help increase the sales of both items.
  • In an alternative strategy, placing hardware and software at opposite ends of the store may entice customers who purchase such items to pick up other items along the way.


  • For instance, after deciding on an expensive computer, a customer may observe security systems for sale while heading toward the software display to purchase antivirus software, and may decide to purchase a home security system as well.
  • Market basket analysis can also help retailers plan which items to put on sale at reduced prices.
  • If customers tend to purchase computers and printers together, then having a sale on printers may encourage the sale of printers as well as computers.
  • If we think of the universe as the set of items available at the store, then each item has a Boolean variable representing the presence or absence of that item. Each basket can then be represented by a Boolean vector of values assigned to these variables.
  • The Boolean vectors can be analyzed for buying patterns that reflect items that are frequently associated or purchased together.
  • These patterns can be represented in the form of association rules.

computer ⇒ antivirus software [support = 2%, confidence = 60%]


  • Rule support and confidence are two measures of rule interestingness.
  • They respectively reflect the usefulness and certainty of discovered rules.
  • A support of 2% for the association rule means that 2% of all the transactions under analysis show that computer and antivirus software are purchased together.
  • A confidence of 60% means that 60% of the customers who purchased a computer also bought the software.
  • Typically, association rules are considered interesting if they satisfy both a minimum support threshold and a minimum confidence threshold.


Frequent Itemsets, Closed Itemsets, and Association Rules

  • Let I = {I1, I2,..., Im} be an itemset.
  • Let D, the task-relevant data, be a set of database transactions where each transaction T is a nonempty itemset such that T ⊆ I.
  • Each transaction is associated with an identifier, called a TID. Let A be a set of items. A transaction T is said to contain A if A ⊆ T.
  • An association rule is an implication of the form A ⇒ B, where A ⊂ I, B ⊂ I, A ≠ ∅, B ≠ ∅, and A ∩ B = ∅.
  • The rule A ⇒ B holds in the transaction set D with support s, where s is the percentage of transactions in D that contain A ∪B (i.e., the union of sets A and B say, or, both A and B).
  • This is taken to be the probability, P(A ∪B)
  • The rule A ⇒ B has confidence c in the transaction set D, where c is the percentage of transactions in D containing A that also contain B.

This is taken to be the conditional probability, P(B|A). That is,

support(A ⇒ B) = P(A ∪ B)

confidence(A ⇒ B) = P(B|A)


  • Rules that satisfy both a minimum support threshold (min sup) and a minimum confidence threshold (min conf ) are called strong.
  • By convention, support and confidence values are written as percentages between 0% and 100%, rather than as fractions between 0 and 1.0.
  • A set of items is referred to as an itemset. An itemset that contains k items is a k-itemset. The set {computer, antivirus software} is a 2-itemset.
  • The occurrence frequency of an itemset is the number of transactions that contain the itemset. This is also known, simply, as the frequency, support count, or count of the itemset. Note that the itemset support defined in Eq. is sometimes referred to as relative support, whereas the occurrence frequency is called the absolute support.
  • If the relative support of an itemset I satisfies a prespecified minimum support threshold (i.e., the absolute support of I satisfies the corresponding minimum support count threshold), then I is a frequent itemset.
  • The set of frequent k-itemsets is commonly denoted by Lk .
  • From Confidence Eq. we have

Confidence(A ⇒ B) = P(B|A) = support(A ∪ B) / support(A)

= support_count(A ∪ B) / support_count(A)
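The formulas above can be exercised directly on a toy transaction set. A minimal sketch; the three transactions below are illustrative assumptions, not from the slides:

```python
# Support and confidence of a rule A => B over a transaction set D,
# computed exactly as in the formulas above.

def support_count(itemset, D):
    """Absolute support: number of transactions containing the itemset."""
    return sum(1 for T in D if itemset <= T)

def support(A, B, D):
    """support(A => B) = P(A ∪ B): fraction of transactions with both A and B."""
    return support_count(A | B, D) / len(D)

def confidence(A, B, D):
    """confidence(A => B) = P(B|A) = support_count(A ∪ B) / support_count(A)."""
    return support_count(A | B, D) / support_count(A, D)

D = [
    {"computer", "antivirus"},
    {"computer"},
    {"printer"},
]
A, B = {"computer"}, {"antivirus"}
print(support(A, B, D))      # 1 of 3 transactions contains both
print(confidence(A, B, D))   # 1 of the 2 computer transactions also has antivirus
```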


In general, association rule mining can be viewed as a two-step process:

  1. Find all frequent itemsets: By definition, each of these itemsets will occur at least as frequently as a predetermined minimum support count, min sup.
  2. Generate strong association rules from the frequent itemsets: By definition, these rules must satisfy minimum support and minimum confidence.

  • Additional interestingness measures can be applied for the discovery of correlation relationships between associated items.
  • Because the second step is much less costly than the first, the overall performance of mining association rules is determined by the first step.
  • A major challenge in mining frequent itemsets from a large data set is that such mining often generates a huge number of itemsets satisfying the minimum support (min sup) threshold, especially when min sup is set low.
  • This is because if an itemset is frequent, each of its subsets is frequent as well.
  • A long itemset will therefore contain a combinatorial number of shorter, frequent sub-itemsets.
  • For example, a frequent itemset of length 100, such as {a1, a2, ..., a100}, contains C(100, 1) = 100 frequent 1-itemsets: {a1}, {a2}, ..., {a100}; C(100, 2) = 4950 frequent 2-itemsets: {a1, a2}, {a1, a3}, ..., {a99, a100}; and so on. The total number of frequent itemsets it contains is C(100, 1) + C(100, 2) + ··· + C(100, 100) = 2^100 − 1 ≈ 1.27 × 10^30.
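The closing count can be verified mechanically; `math.comb` gives exact binomial coefficients:

```python
# A frequent 100-itemset contains C(100,1) + C(100,2) + ... + C(100,100)
# frequent sub-itemsets, which equals 2^100 - 1.
import math

total = sum(math.comb(100, k) for k in range(1, 101))
print(total == 2**100 - 1)   # True
print(total)                 # an integer around 1.27e30
```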


  • This is too huge a number of itemsets for any computer to compute or store.
  • To overcome this difficulty, we introduce the concepts of closed frequent itemset and maximal frequent itemset.
  • An itemset X is closed in a data set D if there exists no proper super-itemset Y such that Y has the same support count as X in D.
  • An itemset X is a closed frequent itemset in set D if X is both closed and frequent in D.
  • An itemset X is a maximal frequent itemset (or max-itemset) in a data set D if X is frequent, and there exists no super-itemset Y such that X ⊂ Y and Y is frequent in D.
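A small sketch of these definitions on an assumed three-transaction data set; it enumerates the frequent itemsets by brute force and then filters for the closed and maximal ones:

```python
# Closed vs. maximal frequent itemsets on a tiny, illustrative data set.
from itertools import combinations

D = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b"}]
min_sup = 2

def support_count(X):
    return sum(1 for T in D if X <= T)

items = sorted(set().union(*D))
frequent = {frozenset(c): support_count(set(c))
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if support_count(set(c)) >= min_sup}

# X is closed: no proper superset has the same support count.
closed = [X for X in frequent
          if not any(X < Y and frequent[Y] == frequent[X] for Y in frequent)]
# X is maximal: no proper superset is frequent at all.
maximal = [X for X in frequent if not any(X < Y for Y in frequent)]

print([sorted(X) for X in closed])    # the closed sets: {a,b} and {a,b,c}
print([sorted(X) for X in maximal])   # the only maximal set: {a,b,c}
```

Note that {a} has support 3 but is not closed, because its superset {a, b} also has support 3; every maximal frequent itemset is closed, but not vice versa.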


Association Rule Mining: A Road Map

Boolean vs. quantitative associations (based on the types of values handled)

– buys(x, “SQLServer”) ∧ buys(x, “DMBook”) ⇒ buys(x, “DBMiner”) [0.2%, 60%]

– age(x, “30..39”) ∧ income(x, “42..48K”) ⇒ buys(x, “PC”) [1%, 75%]

• Single-dimensional vs. multidimensional associations (see the examples above)

• Single-level vs. multiple-level analysis

– What brands of beers are associated with what brands of diapers?

• Various extensions

– Correlation, causality analysis

• Association does not necessarily imply correlation or causality

– Maxpatterns and closed itemsets

– Constraints enforced

• E.g., do small sales (sum < 100) trigger big buys (sum > 1,000)?


Association Rule Mining: A Road Map

Market basket analysis is just one form of frequent pattern mining. In fact, there are many kinds of frequent patterns, association rules, and correlation relationships. Frequent pattern mining can be classified in various ways, based on the following criteria:

Based on the completeness of patterns to be mined:

    • As we discussed in the previous subsection, we can mine the complete set of frequent itemsets, the closed frequent itemsets, and the maximal frequent itemsets, given a minimum support threshold.
    • We can also mine constrained frequent itemsets (i.e., those that satisfy a set of user-defined constraints), approximate frequent itemsets (i.e., those that derive only approximate support counts for the mined frequent itemsets), near-match frequent itemsets (i.e., those that tally the support count of the near or almost matching itemsets), top-k frequent itemsets (i.e., the k most frequent itemsets for a user-specified value, k), and so on.
    • Different applications may have different requirements regarding the completeness of the patterns to be mined, which in turn can lead to different evaluation and optimization methods.
    • In this Unit, our study of mining methods focuses on mining the complete set of frequent itemsets, closed frequent itemsets, and constrained frequent itemsets.


Based on the levels of abstraction involved in the rule set:

    • Some methods for association rule mining can find rules at differing levels of abstraction.
    • For example, suppose that a set of association rules mined includes the following rules where X is a variable representing a customer:

buys(X, “computer”) ⇒ buys(X, “HP printer”)

buys(X, “laptop computer”) ⇒ buys(X, “HP printer”)

    • In the above Rules, the items bought are referenced at different levels of abstraction (e.g., “computer” is a higher-level abstraction of “laptop computer”).
    • We refer to the rule set mined as consisting of multilevel association rules.
    • If, instead, the rules within a given set do not reference items or attributes at different levels of abstraction, then the set contains single-level association rules.


Based on the number of data dimensions involved in the rule:

    • If the items or attributes in an association rule reference only one dimension, then it is a single-dimensional association rule.
    • Note that the rule computer ⇒ antivirus software seen earlier, for example, could be rewritten with an explicit predicate as:

buys(X, “computer”) ⇒ buys(X, “antivirus software”)

    • This rule, together with the two HP printer rules above, is a single-dimensional association rule because each refers to only one dimension, buys.
    • If a rule references two or more dimensions, such as the dimensions age, income, and buys, then it is a multidimensional association rule.
    • The following rule is an example of a multidimensional rule:

age(X, “30...39”)∧income(X, “42K...48K”)⇒buys(X, “high resolution TV”).


Based on the types of values handled in the rule:

    • If a rule involves associations between the presence or absence of items, it is a Boolean association rule.
    • For example, the computer ⇒ antivirus software rule and the HP printer rules seen earlier are Boolean association rules obtained from market basket analysis.
    • If a rule describes associations between quantitative items or attributes, then it is a quantitative association rule. In these rules, quantitative values for items or attributes are partitioned into intervals.
    • The age/income rule shown earlier is also considered a quantitative association rule. Note that the quantitative attributes, age and income, have been discretized.


Based on the kinds of rules to be mined:

    • Frequent pattern analysis can generate various kinds of rules and other interesting relationships.
    • Association rules are the most popular kind of rules generated from frequent patterns. Typically, such mining can generate a large number of rules, many of which are redundant or do not indicate a correlation relationship among itemsets.
    • Thus, the discovered associations can be further analyzed to uncover statistical correlations, leading to correlation rules.
    • We can also mine strong gradient relationships among itemsets, where a gradient is the ratio of the measure of an item when compared with that of its parent (a generalized itemset), its child (a specialized itemset), or its sibling (a comparable itemset).
    • One such example is: “The average sales from Sony Digital Camera increase over 16% when sold together with Sony Laptop Computer”: both Sony Digital Camera and Sony Laptop Computer are siblings, where the parent itemset is Sony.


Based on the kinds of patterns to be mined:

    • Many kinds of frequent patterns can be mined from different kinds of data sets.
    • For this chapter, our focus is on frequent itemset mining, that is, the mining of frequent itemsets (sets of items) from transactional or relational data sets.
    • However, other kinds of frequent patterns can be found from other kinds of data sets. Sequential pattern mining searches for frequent subsequences in a sequence data set, where a sequence records an ordering of events.
    • For example, with sequential pattern mining, we can study the order in which items are frequently purchased. For instance, customers may tend to first buy a PC, followed by a digital camera, and then a memory card. Structured pattern mining searches for frequent substructures in a structured data set.
    • Notice that structure is a general concept that covers many different kinds of structural forms, such as graphs, lattices, trees, sequences, sets, single items, or combinations of such structures. Single items are the simplest form of structure.
    • Each element of an itemset may contain a subsequence, a subtree, and so on, and such containment relationships can be defined recursively. Therefore, structured pattern mining can be considered as the most general form of frequent pattern mining.


Mining Single-Dimensional Boolean Association Rules from Transactional Databases

Mining Frequent Patterns

Apriori is the method that mines the complete set of frequent itemsets with candidate generation, based on the Apriori property.

Apriori property

All nonempty subsets of a frequent itemset must also be frequent.

– If an itemset I does not satisfy the minimum support threshold, min sup, then I is not frequent, i.e., support(I) < min sup.

– If an item A is added to the itemset I, then the resulting itemset (I ∪ A) cannot occur more frequently than I.

• Monotonic functions are functions that move in only one direction.

• This property is called anti-monotonic.

• If a set cannot pass a test, all of its supersets will fail the same test as well. The property is monotonic in the context of failing the test.
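The anti-monotone property can be observed directly: adding items to an itemset can only keep or lower its support count. The three transactions below are an illustrative assumption:

```python
# The Apriori (anti-monotone) property in action: support never rises
# as items are added to an itemset.
D = [{"beer", "diaper"}, {"beer", "diaper", "milk"}, {"milk", "bread"}]

def support_count(X):
    return sum(1 for T in D if X <= T)

I = {"beer"}
for item in ["diaper", "milk"]:
    grown = I | {item}
    # support(I ∪ {A}) can never exceed support(I)
    assert support_count(grown) <= support_count(I)
    I = grown

print(support_count({"beer"}),
      support_count({"beer", "diaper"}),
      support_count({"beer", "diaper", "milk"}))  # 2 2 1
```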


Definition: Frequent Itemset

  • Itemset
    • A collection of one or more items
      • Example: {Milk, Bread, Diaper}
    • k-itemset
      • An itemset that contains k items
  • Support count (σ)
    • Frequency of occurrence of an itemset
    • E.g. σ({Milk, Bread,Diaper}) = 2
  • Support
    • Fraction of transactions that contain an itemset
    • E.g. s({Milk, Bread, Diaper}) = 2/5
  • Frequent Itemset
    • An itemset whose support is greater than or equal to a minsup threshold
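The transaction table behind the quoted values (σ = 2, support = 2/5) did not survive extraction; the five baskets below are the standard market-basket example and reproduce both numbers:

```python
# Support count (sigma) and support (s) of {Milk, Bread, Diaper}
# over five market-basket transactions.
D = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(X):
    """Support count: number of transactions containing itemset X."""
    return sum(1 for T in D if X <= T)

X = {"Milk", "Bread", "Diaper"}
print(sigma(X))            # 2  (transactions 4 and 5)
print(sigma(X) / len(D))   # 0.4, i.e. 2/5
```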


Definition: Association Rule


  • Association Rule
    • An implication expression of the form X → Y, where X and Y are itemsets
    • Example: {Milk, Diaper} → {Beer}

  • Rule Evaluation Metrics
    • Support (s)
      • Fraction of transactions that contain both X and Y
    • Confidence (c)
      • Measures how often items in Y appear in transactions that contain X


Association Rule Mining

  • Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction

Market-Basket transactions

Example of Association Rules

{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!


Association Rule Mining Task

  • Given a set of transactions T, the goal of association rule mining is to find all rules having
    • support ≥ minsup threshold
    • confidence ≥ minconf threshold

  • Brute-force approach:
    • List all possible association rules
    • Compute the support and confidence for each rule
    • Prune rules that fail the minsup and minconf thresholds

Computationally prohibitive!


Mining Association Rules

Example of Rules:

{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)

Observations:

  • All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
  • Rules originating from the same itemset have identical support but can have different confidence
  • Thus, we may decouple the support and confidence requirements
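The six rules above can be regenerated by enumerating the binary partitions of {Milk, Diaper, Beer}. The five-transaction table is the standard example assumed here, since the slide's table is not shown:

```python
# All binary partitions of {Milk, Diaper, Beer}: one shared support,
# rule-specific confidences.
from itertools import combinations

D = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def count(X):
    return sum(1 for T in D if X <= T)

itemset = {"Milk", "Diaper", "Beer"}
for r in (1, 2):
    for lhs in combinations(sorted(itemset), r):
        A = set(lhs)
        B = itemset - A
        s = count(itemset) / len(D)     # identical for every rule: 2/5 = 0.4
        c = count(itemset) / count(A)   # differs with the antecedent A
        print(f"{sorted(A)} -> {sorted(B)}  s={s:.1f}, c={c:.2f}")
```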


“How is the Apriori property used in the algorithm?”

To understand this, let us look at how Lk−1 is used to find Lk for k ≥ 2. A two-step process is followed, consisting of join and prune actions.

  1. The join step: To find Lk , a set of candidate k-itemsets is generated by joining Lk−1 with itself. This set of candidates is denoted Ck .
    • Let l1 and l2 be itemsets in Lk−1. The notation li[j] refers to the jth item in li (e.g., l1[k − 2] refers to the second to the last item in l1). For efficient implementation, Apriori assumes that items within a transaction or itemset are sorted in lexicographic order.
    • For the (k − 1)-itemset, li , this means that the items are sorted such that li[1] < li[2] < ··· < li[k − 1]. The join, Lk−1 ✶ Lk−1, is performed, where members of Lk−1 are joinable if their first (k − 2) items are in common.
    • That is, members l1 and l2 of Lk−1 are joined if (l1[1] = l2[1]) ∧ (l1[2] = l2[2]) ∧ ··· ∧ (l1[k − 2] = l2[k − 2]) ∧(l1[k − 1] < l2[k − 1]). The condition l1[k − 1] < l2[k − 1] simply ensures that no duplicates are generated.
    • The resulting itemset formed by joining l1 and l2 is {l1[1], l1[2],..., l1[k − 2], l1[k − 1], l2[k − 1]}.
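A minimal sketch of the join step, representing each lexicographically sorted (k−1)-itemset as a tuple; the L2 values below are illustrative:

```python
# Join step: L_{k-1} joined with itself. Two (k-1)-itemsets are joinable
# when their first k-2 items agree and l1[k-1] < l2[k-1] (no duplicates).

def apriori_join(L_prev):
    candidates = []
    for l1 in L_prev:
        for l2 in L_prev:
            if l1[:-1] == l2[:-1] and l1[-1] < l2[-1]:
                # joined itemset: shared prefix plus both last items
                candidates.append(l1 + (l2[-1],))
    return candidates

# Sorted 2-itemsets as tuples; the values are illustrative.
L2 = [("I1", "I2"), ("I1", "I3"), ("I2", "I3"), ("I2", "I4")]
print(apriori_join(L2))   # [('I1', 'I2', 'I3'), ('I2', 'I3', 'I4')]
```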


2. The prune step: Ck is a superset of Lk , that is, its members may or may not be frequent, but all of the frequent k-itemsets are included in Ck .

    • A database scan to determine the count of each candidate in Ck would result in the determination of Lk (i.e., all candidates having a count no less than the minimum support count are frequent by definition, and therefore belong to Lk).
    • Ck , however, can be huge, and so this could involve heavy computation. To reduce the size of Ck , the Apriori property is used as follows.
    • Any (k − 1)-itemset that is not frequent cannot be a subset of a frequent k-itemset.
    • Hence, if any (k − 1)-subset of a candidate k-itemset is not in Lk−1, then the candidate cannot be frequent either and so can be removed from Ck.
    • This subset testing can be done quickly by maintaining a hash tree of all frequent itemsets.
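A minimal sketch of the prune step, using plain subset checks instead of a hash tree; the candidate and L2 values are illustrative:

```python
# Prune step: drop any candidate k-itemset that has a (k-1)-subset
# outside L_{k-1}, per the Apriori property.
from itertools import combinations

def has_infrequent_subset(candidate, L_prev):
    """True if some (k-1)-subset of the candidate is not frequent."""
    k = len(candidate)
    return any(frozenset(s) not in L_prev
               for s in combinations(candidate, k - 1))

def prune(Ck, L_prev):
    return [c for c in Ck if not has_infrequent_subset(c, L_prev)]

# Illustrative: {I1,I2}, {I1,I3}, {I2,I3} are frequent; {I1,I4} is not.
L2 = {frozenset(s) for s in [("I1", "I2"), ("I1", "I3"), ("I2", "I3")]}
C3 = [("I1", "I2", "I3"), ("I1", "I2", "I4")]
print(prune(C3, L2))   # [('I1', 'I2', 'I3')]
```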


Example: Apriori. Let’s look at a concrete example, based on the AllElectronics transaction database, D, of Table. There are nine transactions in this database, that is, |D| = 9. We use Figure to illustrate the Apriori algorithm for finding frequent itemsets in D.

  1. In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1. The algorithm simply scans all of the transactions in order to count the number of occurrences of each item.
  2. Suppose that the minimum support count required is 2, that is, min sup = 2. (Here, we are referring to absolute support because we are using a support count. The corresponding relative support is 2/9 = 22%). The set of frequent 1-itemsets, L1, can then be determined. It consists of the candidate 1-itemsets satisfying minimum support. In our example, all of the candidates in C1 satisfy minimum support.
  3. To discover the set of frequent 2-itemsets, L2, the algorithm uses the join L1 ⋈ L1 to generate a candidate set of 2-itemsets, C2. C2 consists of C(|L1|, 2) 2-itemsets. Note that no candidates are removed from C2 during the prune step because each subset of the candidates is also frequent.


4. Next, the transactions in D are scanned and the support count of each candidate itemset in C2 is accumulated, as shown in the middle table of the second row in Figure.

5. The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-itemsets in C2 having minimum support.

6. The generation of the set of candidate 3-itemsets, C3, is detailed in Figure. From the join step, we first get C3 = L2 ⋈ L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}. Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can determine that the latter four candidates cannot possibly be frequent. We therefore remove them from C3, thereby saving the effort of unnecessarily obtaining their counts during the subsequent scan of D to determine L3. Note that when given a candidate k-itemset, we only need to check if its (k−1)-subsets are frequent since the Apriori algorithm uses a level-wise search strategy. The resulting pruned version of C3 is shown in the first table of the bottom row of Figure.

7. The transactions in D are scanned in order to determine L3, consisting of those candidate 3-itemsets in C3 having minimum support.


8. The algorithm uses L3 ⋈ L3 to generate a candidate set of 4-itemsets, C4. Although the join results in {{I1, I2, I3, I5}}, this itemset is pruned because its subset {I2, I3, I5} is not frequent. Thus, C4 = ∅, and the algorithm terminates, having found all of the frequent itemsets.
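The whole worked example can be replayed end to end. The nine transactions are reproduced here from the standard AllElectronics example, since the slide's table is not shown, and the candidate generation below is a simple, unoptimized stand-in for the join and prune steps:

```python
# Level-wise frequent itemset mining on the nine-transaction example,
# with min_sup = 2 (absolute support count).
D = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
min_sup = 2

def count(X):
    return sum(1 for T in D if X <= T)

# L[0] = frequent 1-itemsets; each later level unions pairs of frequent
# itemsets from the previous level and keeps those meeting min_sup.
items = sorted(set().union(*D))
L = [{frozenset([i]) for i in items if count({i}) >= min_sup}]
while L[-1]:
    k = len(next(iter(L[-1]))) + 1
    Ck = {a | b for a in L[-1] for b in L[-1] if len(a | b) == k}
    L.append({c for c in Ck if count(c) >= min_sup})

L3 = L[2]
print(sorted(sorted(x) for x in L3))   # [['I1', 'I2', 'I3'], ['I1', 'I2', 'I5']]
```

The result matches the walkthrough: six frequent 2-itemsets, two frequent 3-itemsets, and an empty C4, at which point the loop stops.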


The Apriori Algorithm

• Join Step: Ck is generated by joining Lk−1 with itself

• Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset


• FP-growth is the method that mines the complete set of frequent itemsets without candidate generation.

• Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure

– highly condensed, but complete for frequent pattern mining

– avoids costly repeated database scans

• Develop an efficient, FP-tree-based frequent pattern mining method

– A divide-and-conquer methodology: decompose mining tasks into smaller ones

– Avoid candidate generation: sub-database test only!

Construct FP-tree from a Transaction DB
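The construction figure for this slide is not shown; below is a minimal FP-tree construction sketch (header table and node-links omitted, transactions illustrative). Each transaction is reordered by descending global item frequency before insertion, so frequent items share prefix paths:

```python
# Minimal FP-tree construction: sort each transaction's frequent items by
# descending global frequency, then insert along shared prefix paths.
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(D, min_sup):
    freq = Counter(i for T in D for i in T)
    # frequent items only, most frequent first (ties broken alphabetically)
    kept = [i for i, c in sorted(freq.items(), key=lambda x: (-x[1], x[0]))
            if c >= min_sup]
    order = {i: r for r, i in enumerate(kept)}
    root = Node(None, None)
    for T in D:
        path = sorted((i for i in T if i in order), key=order.get)
        node = root
        for item in path:   # walk or extend the shared prefix path
            node = node.children.setdefault(item, Node(item, node))
            node.count += 1
    return root

D = [{"f", "a", "c", "m"}, {"f", "a", "c", "b"}, {"f", "b"}]
root = build_fp_tree(D, min_sup=2)
# 'f' is the most frequent item, so all three transactions share one 'f' node
print(root.children["f"].count)   # 3
```

Note how the infrequent item "m" is discarded before insertion, and how the shared prefix keeps the tree no larger than the database.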


  • Completeness:

– never breaks a long pattern of any transaction

– preserves complete information for frequent pattern mining

• Compactness

– reduces irrelevant information: infrequent items are gone

– frequency-descending ordering: more frequent items are more likely to be shared

– never larger than the original database (not counting node-links and counts)

– Example: for the Connect-4 DB, the compression ratio can be over 100


• Mining Frequent Patterns Using FP-tree

General idea (divide-and-conquer)

– Recursively grow frequent pattern path using the FP-tree

• Method

– For each item, construct its conditional pattern-base, and then its conditional FP-tree

– Repeat the process on each newly created conditional FP-tree

– Until the resulting FP-tree is empty, or it contains only one path (single path will generate all the combinations of its sub-paths, each of which is a frequent pattern)

Major Steps to Mine FP-tree

  1. Construct conditional pattern base for each node in the FP-tree
  2. Construct conditional FP-tree from each conditional pattern-base
  3. Recursively mine conditional FP-trees and grow frequent patterns obtained so far

- If the conditional FP-tree contains a single path, simply enumerate all the patterns


• Mining Frequent Patterns Using FP-tree

Principles of Frequent Pattern Growth


Why Is Frequent Pattern Growth Fast?

• Our performance study shows

– FP-growth is an order of magnitude faster than Apriori, and is also faster than tree-projection

• Reasoning

– No candidate generation, no candidate test

– Use compact data structure

– Eliminate repeated database scan

– Basic operations are counting and FP-tree building


Mining Multilevel Association Rules from Transactional Databases

  • Multilevel association rules can be defined as association rules that apply over different levels of data abstraction.
  • For many applications, it is difficult to find strong associations among data items at low or primitive levels of abstraction due to the sparsity of data at those levels.
  • Strong associations discovered at high levels of abstraction may represent commonsense knowledge.
  • Moreover, what may represent common sense to one user may seem novel to another.
  • Therefore, data mining systems should provide capabilities for mining association rules at multiple levels of abstraction, with sufficient flexibility for easy traversal among different abstraction spaces.


  • Association rules generated from mining data at multiple levels of abstraction are called multiple-level or multilevel association rules. Multilevel association rules can be mined efficiently using concept hierarchies under a support-confidence framework.
  • In general, a top-down strategy is employed, where counts are accumulated for the calculation of frequent itemsets at each concept level, starting at concept level 1 and working downward in the hierarchy toward the more specific concept levels, until no more frequent itemsets can be found.
  • For each level, any algorithm for discovering frequent itemsets may be used, such as Apriori or its variations.
  • A number of variations to this approach are described, where each variation involves “playing” with the support threshold in a slightly different way.


  • Using uniform minimum support for all levels (referred to as uniform support):
    • The same minimum support threshold is used when mining at each level of abstraction.
    • For example, a minimum support threshold of 5% is used throughout (e.g., for mining from “computer” down to “laptop computer”).
    • Both “computer” and “laptop computer” are found to be frequent, while “desktop computer” is not.
    • When a uniform minimum support threshold is used, the search procedure is simplified.
    • The method is also simple in that users are required to specify only one minimum support threshold.
    • An Apriori-like optimization technique can be adopted


  • Using reduced minimum support at lower levels (referred to as reduced support):
    • Each level of abstraction has its own minimum support threshold. The deeper the level of abstraction, the smaller the corresponding threshold.
    • For example, in Figure 5.12, the minimum support thresholds for levels 1 and 2 are 5% and 3%, respectively. In this way, “computer,” “laptop computer,” and “desktop computer” are all considered frequent.

48 of 74

  • Using item or group-based minimum support (referred to as group-based support):
    • Because users or experts often have insight as to which groups are more important than others, it is sometimes more desirable to set up user-specific, item-based, or group-based minimum support thresholds when mining multilevel rules.
    • For example, a user could set up the minimum support thresholds based on product price, or on items of interest, such as by setting particularly low support thresholds for laptop computers and flash drives in order to pay particular attention to the association patterns containing items in these categories
  • A serious side effect of mining multilevel association rules is its generation of many redundant rules across multiple levels of abstraction due to the “ancestor” relationships among items.
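The reduced-support strategy can be sketched in a few lines of Python. The concept hierarchy, transactions, and thresholds below are illustrative only; a real miner would run Apriori (or a variation) at each level rather than counting single items.

```python
# A minimal sketch of top-down multilevel mining with reduced support.
# Level 1 ("computer") uses a 50% threshold; level 2 uses a reduced 30%.
hierarchy = {
    "laptop computer": "computer",
    "desktop computer": "computer",
}

transactions = [
    {"laptop computer"}, {"laptop computer"}, {"desktop computer"},
    {"laptop computer"}, {"desktop computer"}, {"laptop computer"},
]

def support(itemset, txns):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in txns) / len(txns)

def generalize(txns, mapping):
    """Replace each item by its parent concept for higher-level mining."""
    return [{mapping.get(i, i) for i in t} for t in txns]

level1 = generalize(transactions, hierarchy)
frequent_l1 = {i for t in level1 for i in t
               if support({i}, level1) >= 0.50}

# Only descend into children of concepts found frequent at level 1.
frequent_l2 = {i for t in transactions for i in t
               if hierarchy.get(i) in frequent_l1
               and support({i}, transactions) >= 0.30}

print(frequent_l1)  # {'computer'}
print(frequent_l2)  # both specific items pass the reduced 30% threshold
```

With the lower threshold at level 2, both “laptop computer” and “desktop computer” survive, mirroring the reduced-support example above.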

49 of 74

Mining Multidimensional Association Rules

  • So far we have studied association rules that involve a single predicate, namely, the predicate buys.
  • For instance, in mining our AllElectronics database, we may discover the Boolean association rule

buys(X, “digital camera”)=>buys(X, “HP printer”)

  • Following the terminology used in multidimensional databases, we refer to each distinct predicate in a rule as a dimension.
  • Hence, we can refer to the rule above as a single-dimensional or intradimensional association rule because it contains a single distinct predicate (e.g., buys) with multiple occurrences (i.e., the predicate occurs more than once within the rule).
  • As we have seen in the previous sections, such rules are commonly mined from transactional data.

50 of 74

  • Suppose, instead of using a transactional database, that sales and related information are stored in a relational database or data warehouse.
  • Such data stores are multidimensional, by definition.
  • For instance, in addition to keeping track of the items purchased in sales transactions, a relational database may record other attributes associated with the items, such as the quantity purchased or the price, or the branch location of the sale.
  • Additional relational information regarding the customers who purchased the items, such as customer age, occupation, credit rating, income, and address, may also be stored.
  • Considering each database attribute or warehouse dimension as a predicate, we can therefore mine association rules containing multiple predicates, such as

age(X, “20...29”)^occupation(X, “student”)=>buys(X, “laptop”)

51 of 74

  • Association rules that involve two or more dimensions or predicates can be referred to as multidimensional association rules.
  • The rule above contains three predicates (age, occupation, and buys), each of which occurs only once in the rule. Multidimensional association rules with no repeated predicates are called interdimensional association rules.
  • We can also mine multidimensional association rules with repeated predicates, which contain multiple occurrences of some predicates. These rules are called hybrid-dimensional association rules.
  • An example of such a rule is the following, where the predicate buys is repeated:

age(X, “20...29”)^buys(X, “laptop”)=>buys(X, “HP printer”)

52 of 74

  • Database attributes can be categorical or quantitative.
  • Categorical attributes have a finite number of possible values, with no ordering among the values (e.g., occupation, brand, color). Categorical attributes are also called nominal attributes, because their values are “names of things.”
  • Quantitative attributes are numeric and have an implicit ordering among values (e.g., age, income, price).
  • Techniques for mining multidimensional association rules can be categorized into two basic approaches regarding the treatment of quantitative attributes.
  • In the first approach, quantitative attributes are discretized using predefined concept hierarchies, replacing the original numeric values of the attribute with interval labels such as “0. . . 20K”, “21K . . . 30K”, “31K . . . 40K”, and so on.

53 of 74

  • Here the discretization is static and predetermined; this is known as mining multidimensional association rules using static discretization of quantitative attributes.
  • In the second approach, quantitative attributes are discretized or clustered into “bins” based on the distribution of the data.
  • These bins may be further combined during the mining process.
  • The discretization process is dynamic and established so as to satisfy some mining criteria, such as maximizing the confidence of the rules mined.
  • Because this strategy treats the numeric attribute values as quantities rather than as predefined ranges or categories, association rules mined from this approach are also referred to as (dynamic) quantitative association rules.
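As a small sketch of the first (static) approach, a numeric attribute such as income can be mapped to predefined interval labels before mining. The break points below are illustrative, not from any real concept hierarchy.

```python
# Static discretization: replace numeric incomes by interval labels.
def discretize_income(income):
    """Map a numeric income to a coarse, predefined interval label."""
    if income <= 20_000:
        return "0...20K"
    if income <= 30_000:
        return "21K...30K"
    if income <= 40_000:
        return "31K...40K"
    return "41K+"

print(discretize_income(25_000))  # 21K...30K
print(discretize_income(38_500))  # 31K...40K
```

After this transformation, each (attribute, interval-label) pair can be treated like an item, so the frequent itemset algorithms from earlier slides apply unchanged.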

54 of 74

Mining Multidimensional Association Rules Using Static Discretization of Quantitative Attributes

  • Quantitative attributes, in this case, are discretized before mining using predefined concept hierarchies or data discretization techniques, where numeric values are replaced by interval labels.
  • Categorical attributes may also be generalized to higher conceptual levels if desired.
  • If the resulting task-relevant data are stored in a relational table, then any of the frequent itemset mining algorithms we have discussed can be modified easily so as to find all frequent predicate sets rather than frequent itemsets.
  • In particular, instead of searching on only one attribute like buys, we need to search through all of the relevant attributes, treating each attribute-value pair as an itemset.

55 of 74

Mining Multidimensional Association Rules Using Static Discretization of Quantitative Attributes

  • Alternatively, the transformed multidimensional data may be used to construct a data cube.
  • Data cubes are well suited for the mining of multidimensional association rules: They store aggregates (such as counts), in multidimensional space, which is essential for computing the support and confidence of multidimensional association rules.
  • Figure shows the lattice of cuboids defining a data cube for the dimensions age, income, and buys.
  • The cells of an n-dimensional cuboid can be used to store the support counts of the corresponding n-predicate sets.

56 of 74

Mining Multidimensional Association Rules Using Static Discretization of Quantitative Attributes

  • The base cuboid aggregates the task-relevant data by age, income, and buys;
  • the 2-D cuboid, (age, income), aggregates by age and income, and so on;
  • the 0-D (apex) cuboid contains the total number of transactions in the task-relevant data.
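A dictionary of counts can stand in for the cuboids described above. The tuples below are hypothetical task-relevant rows; each cuboid is just an aggregate keyed by fewer and fewer dimensions.

```python
from collections import Counter

# Hypothetical task-relevant data: (age interval, income interval, item bought).
rows = [
    ("20...29", "31K...40K", "laptop"),
    ("20...29", "31K...40K", "laptop"),
    ("30...39", "41K...50K", "laptop"),
    ("20...29", "21K...30K", "printer"),
]

# Base cuboid aggregates by (age, income, buys); a 2-D cuboid by (age, income);
# the 0-D apex cuboid holds the total transaction count.
base = Counter(rows)
age_income = Counter((a, i) for a, i, _ in rows)
apex = len(rows)

# Support of a 3-predicate set, read straight from the base cuboid.
support = base[("20...29", "31K...40K", "laptop")] / apex
print(support)  # 0.5
```

Confidence of a rule such as age ∧ income ⇒ buys would divide the same base-cuboid count by the matching 2-D cuboid count, so no rescan of the data is needed.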

57 of 74

Mining Quantitative Association Rules

  • Quantitative association rules are multidimensional association rules in which the numeric attributes are dynamically discretized during the mining process so as to satisfy some mining criteria, such as maximizing the confidence or compactness of the rules mined.
  • In this section, we focus specifically on how to mine quantitative association rules having two quantitative attributes on the left-hand side of the rule and one categorical attribute on the right-hand side. These are referred to as two-dimensional quantitative association rules, because they contain two quantitative dimensions.

age(X, “30...39”)^income(X, “42K...48K”) => buys(X, “HDTV”)

58 of 74

  • “How can we find such rules?” Let’s look at an approach used in a system called ARCS (Association Rule Clustering System), which borrows ideas from image processing.
  • Essentially, this approach maps pairs of quantitative attributes onto a 2-D grid for tuples satisfying a given categorical attribute condition.
  • The grid is then searched for clusters of points from which the association rules are generated.

The following steps are involved in ARCS:

  • Binning: Quantitative attributes can have a very wide range of values defining their domain.
  • Just think about how big a 2-D grid would be if we plotted age and income as axes, where each possible value of age was assigned a unique position on one axis, and similarly, each possible value of income was assigned a unique position on the other axis!

59 of 74

  • To keep grids down to a manageable size, we instead partition the ranges of quantitative attributes into intervals.
  • These intervals are dynamic in that they may later be further combined during the mining process.
  • The partitioning process is referred to as binning, that is, where the intervals are considered “bins.”

Three common binning strategies are as follows:

  • Equal-width binning, where the interval size of each bin is the same
  • Equal-frequency binning, where each bin has approximately the same number of tuples assigned to it
  • Clustering-based binning, where clustering is performed on the quantitative attribute to group neighbouring points (judged based on various distance measures) into the same bin
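The first two strategies can be sketched directly; these helper functions are illustrative rather than any particular library's API (clustering-based binning would additionally need a distance measure and a clustering routine).

```python
# Sketches of equal-width and equal-frequency binning.
def equal_width_bins(values, k):
    """Split the value range into k intervals of equal size; returns bin edges."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    edges = [lo + i * width for i in range(k + 1)]
    edges[-1] = hi  # guard against float drift at the top edge
    return edges

def equal_frequency_bins(values, k):
    """Choose cut points so each bin holds roughly the same number of tuples."""
    s = sorted(values)
    step = len(s) / k
    return [s[0]] + [s[int(round(step * i)) - 1] for i in range(1, k + 1)]

ages = [21, 22, 25, 30, 34, 35, 40, 52]
print(equal_width_bins(ages, 4))      # each bin spans (52 - 21) / 4 = 7.75
print(equal_frequency_bins(ages, 4))  # roughly 2 tuples per bin
```

Note how the outlier age 52 stretches the last equal-width bin, whereas equal-frequency binning absorbs it without distorting the lower bins.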

60 of 74

  • ARCS uses equal-width binning, where the bin size for each quantitative attribute is input by the user.
  • A 2-D array for each possible bin combination involving both quantitative attributes is created.
  • Each array cell holds the corresponding count distribution for each possible class of the categorical attribute of the rule right-hand side.
  • By creating this data structure, the task-relevant data need only be scanned once.
  • The same 2-D array can be used to generate rules for any value of the categorical attribute, based on the same two quantitative attributes.
  • Finding frequent predicate sets: Once the 2-D array containing the count distribution for each category is set up, it can be scanned to find the frequent predicate sets that also satisfy minimum confidence.

61 of 74

  • Strong association rules can then be generated from these predicate sets, using a rule generation algorithm
  • Clustering the association rules: The strong association rules obtained in the previous step are then mapped to a 2-D grid. Figure shows a 2-D grid for 2-D quantitative association rules predicting the condition buys(X, “HDTV”) on the rule right-hand side, given the quantitative attributes age and income. The four Xs correspond to the rules

age(X, 34)^income(X, “31K...40K”)=>buys(X, “HDTV”)

age(X, 35)^income(X, “31K...40K”)=>buys(X, “HDTV”)

age(X, 34)^income(X, “41K...50K”)=>buys(X, “HDTV”)

age(X, 35)^income(X, “41K...50K”)=>buys(X, “HDTV”)

“Can we find a simpler rule to replace the above four rules?”

  • The four rules can be combined or “clustered” together to form the following simpler rule, which subsumes and replaces them:

age(X, “34...35”)^income(X, “31K...50K”)=>buys(X, “HDTV”)
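The clustering step can be sketched as merging adjacent grid cells that share the same right-hand side into one rule covering the combined ranges. The cells below mirror the four example rules; a full ARCS implementation would first detect which cells are actually adjacent.

```python
# Toy sketch of the ARCS clustering step: merge adjacent (age, income) cells
# that all predict buys(X, "HDTV") into a single covering rule.
cells = [
    (34, ("31K", "40K")), (35, ("31K", "40K")),
    (34, ("41K", "50K")), (35, ("41K", "50K")),
]

ages = [age for age, _ in cells]
incomes = [inc for _, inc in cells]
merged = {
    "age": (min(ages), max(ages)),
    "income": (min(i[0] for i in incomes), max(i[1] for i in incomes)),
}
rule = (f'age(X, "{merged["age"][0]}...{merged["age"][1]}") ^ '
        f'income(X, "{merged["income"][0]}...{merged["income"][1]}") '
        '=> buys(X, "HDTV")')
print(rule)
```

The printed rule reproduces the merged rule above: one age range and one income range subsume all four original cells.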

62 of 74

63 of 74

From Association Mining to Correlation Analysis

  • Most association rule mining algorithms employ a support-confidence framework.
  • Often, many interesting rules can be found using low support thresholds.
  • Although minimum support and confidence thresholds help weed out or exclude the exploration of a good number of uninteresting rules, many rules so generated are still not interesting to the users.
  • Unfortunately, this is especially true when mining at low support thresholds or mining for long patterns.
  • This has been one of the major bottlenecks for successful application of association rule mining.
  • We first look at how even strong association rules can be uninteresting and misleading.
  • We then discuss how the support-confidence framework can be supplemented with additional interestingness measures based on statistical significance and correlation analysis.

64 of 74

From Association Mining to Correlation Analysis
Strong Rules Are Not Necessarily Interesting: An Example

  • Whether or not a rule is interesting can be assessed either subjectively or objectively.
  • Ultimately, only the user can judge if a given rule is interesting, and this judgment, being subjective, may differ from one user to another.
  • However, objective interestingness measures, based on the statistics “behind” the data, can be used as one step toward the goal of weeding out uninteresting rules from presentation to the user.

“How can we tell which strong association rules are really interesting?”

A misleading “strong” association rule. Suppose we are interested in analyzing transactions at AllElectronics with respect to the purchase of computer games and videos. Let game refer to the transactions containing computer games, and video refer to those containing videos. Of the 10,000 transactions analyzed, the data show that 6,000 of the customer transactions included computer games, while 7,500 included videos, and 4,000 included both computer games and videos. Suppose that a data mining program for discovering association rules is run on the data, using a minimum support of, say, 30% and a minimum confidence of 60%. The following association rule is discovered:

65 of 74

From Association Mining to Correlation Analysis
Strong Rules Are Not Necessarily Interesting: An Example

buys(X, “computer games”)=>buys(X, “videos”) [support = 40%, confidence = 66%]

  • The rule is a strong association rule and would therefore be reported, since its support value of 4,000/10,000 = 40% and confidence value of 4,000/6,000 ≈ 66% satisfy the minimum support and minimum confidence thresholds, respectively.
  • However, the rule is misleading because the probability of purchasing videos is 75%, which is even larger than 66%. In fact, computer games and videos are negatively associated because the purchase of one of these items actually decreases the likelihood of purchasing the other.
  • Without fully understanding this phenomenon, we could easily make unwise business decisions based on the rule.
  • Hence, alternatives to the support-confidence framework can be useful in mining interesting data relationships.

66 of 74

From Association Mining to Correlation Analysis
From Association Analysis to Correlation Analysis

  • As we have seen above, the support and confidence measures are insufficient at filtering out uninteresting association rules.
  • To tackle this weakness, a correlation measure can be used to augment the support-confidence framework for association rules.
  • This leads to correlation rules of the form

A=>B [support, confidence, correlation]

  • That is, a correlation rule is measured not only by its support and confidence but also by the correlation between itemsets A and B.
  • There are many different correlation measures from which to choose. In this section, we study various correlation measures to determine which would be good for mining large data sets.

67 of 74

From Association Mining to Correlation Analysis
From Association Analysis to Correlation Analysis

  • Lift is a simple correlation measure that is given as follows:
  • The occurrence of itemset A is independent of the occurrence of itemset B if P(A U B) = P(A)P(B); otherwise, itemsets A and B are dependent and correlated as events.
  • The lift between the occurrence of A and B can be measured by computing

lift(A, B) = P(AUB)/P(A)P(B)

  • If lift(A, B) < 1, the occurrence of A is negatively correlated with the occurrence of B.
  • If lift(A, B) > 1, then A and B are positively correlated, meaning that the occurrence of one implies the occurrence of the other.
  • If lift(A, B) = 1, then A and B are independent and there is no correlation between them.

Equivalently, lift can be computed as P(B|A)/P(B), or conf(A=>B)/sup(B),

which is also referred to as the lift of the association (or correlation) rule A=>B.

68 of 74

From Association Mining to Correlation Analysis
From Association Analysis to Correlation Analysis

  • The second correlation measure is the χ² (chi-square) measure. To compute the χ² value, we take the squared difference between the observed and expected value for a slot (A and B pair) in the contingency table, divided by the expected value. This amount is summed over all slots of the contingency table.
  • Let’s examine two other correlation measures, all_confidence and cosine, as defined below.
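All four measures can be computed on the game/video contingency table from the earlier example (4,000 both; 2,000 games only; 3,500 videos only; 500 neither), using the standard definitions all_conf(A, B) = sup(A ∪ B)/max{sup(A), sup(B)} and cosine(A, B) = P(A ∪ B)/√(P(A)·P(B)).

```python
from math import sqrt

# Contingency counts for the game (g) / video (v) example.
n = 10_000
gv, g_only, v_only, neither = 4_000, 2_000, 3_500, 500
g, v = gv + g_only, gv + v_only  # 6,000 games, 7,500 videos

lift = (gv / n) / ((g / n) * (v / n))

# Chi-square over the 2x2 contingency table: sum of
# (observed - expected)^2 / expected, with expected = row_total*col_total/n.
chi2 = 0.0
for observed, (row, col) in [(gv, (g, v)), (g_only, (g, n - v)),
                             (v_only, (n - g, v)), (neither, (n - g, n - v))]:
    expected = row * col / n
    chi2 += (observed - expected) ** 2 / expected

all_conf = gv / max(g, v)
cosine = (gv / n) / sqrt((g / n) * (v / n))

print(round(lift, 3), round(chi2, 1), round(all_conf, 3), round(cosine, 3))
```

Lift below 1 and a large χ² both flag the negative dependence; all_confidence and cosine stay in [0, 1] and, unlike the first two, are unaffected by the number of null-transactions, which is the point of the D0/D1/D2 comparison on the next slides.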

69 of 74

From Association Mining to Correlation Analysis
From Association Analysis to Correlation Analysis

  • “Are these two measures better than lift and χ² in assessing the correlation relationship?”
  • To answer this question, we compare the four correlation measures on the game-and-video data.
  • Let D1 be the original game (g) and video (v) data set from Table. We add two more data sets, D0 and D2, where D0 has zero null-transactions, and D2 has 10,000 null-transactions (instead of only 500 as in D1). The values of all four correlation measures are shown in Table

70 of 74

From Association Mining to Correlation Analysis
From Association Analysis to Correlation Analysis

  • In summary, the use of only support and confidence measures to mine associations results in the generation of a large number of rules, most of which are uninteresting to the user.
  • Instead, we can augment the support-confidence framework with a correlation measure, resulting in the mining of correlation rules.
  • The added measure substantially reduces the number of rules generated, and leads to the discovery of more meaningful rules.
  • However, there seems to be no single correlation measure that works well for all cases.

71 of 74

Constraint-based Association mining

  • For a given set of task-relevant data, the data mining process may uncover thousands of rules, many of which are uninteresting to the user.
  • In constraint-based mining, mining is performed under the guidance of various kinds of constraints provided by the user. These constraints include the following.

1. Knowledge type constraints: These specify the type of knowledge to be mined, such as association.

2. Data constraints: These specify the set of task-relevant data.

3. Dimension/level constraints: These specify the dimension of the data, or levels of the concept hierarchies, to be used.

4. Interestingness constraints: These specify thresholds on statistical measures of rule interestingness, such as support and confidence.

5. Rule constraints: These specify the form of rules to be mined. Such constraints may be expressed as meta rules (rule templates), or by specifying the maximum or minimum number of predicates in the rule antecedent or consequent, or the satisfaction of particular predicates on attribute values, or their aggregates.

72 of 74

Meta rule-guided mining of association rules:

How are meta rules useful?

  • Meta rules allow users to specify the syntactic form of rules that they are interested in mining.
  • The rule forms can be used as constraints to help improve the efficiency of the mining process.
  • Meta rules may be based on the analyst's experience, expectations, or intuition regarding the data, or automatically generated based on the database schema.

Example of a metarule of user interest: P1(X, Y) ∧ P2(X, W) ⇒ buys(X, “office software”)

The data mining system can then search for rules that match the given metarule.

For instance, the following Rule matches or complies with Metarule stated above:

age(X, “30..39”) ∧ income(X, “41K..60K”)⇒buys(X, “office software”)
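Metarule-guided filtering can be sketched as a template check over candidate rules. The rule representation and candidates below are hypothetical; a real system would push the template into the mining algorithm itself rather than filter afterward.

```python
# Toy sketch of metarule matching for the template
#   P1(X, Y) ^ P2(X, W) => buys(X, "office software").
def matches_metarule(antecedent, consequent):
    """Two single-variable predicates on the left, fixed buys predicate on the right."""
    return (len(antecedent) == 2
            and consequent == ("buys", "office software"))

candidates = [
    ([("age", "30..39"), ("income", "41K..60K")], ("buys", "office software")),
    ([("age", "20..29")], ("buys", "office software")),                  # too few predicates
    ([("age", "30..39"), ("income", "41K..60K")], ("buys", "laptop")),   # wrong consequent
]

matching = [r for r in candidates if matches_metarule(*r)]
print(len(matching))  # only the first candidate complies with the metarule
```

Only the age/income rule survives, matching the compliant rule shown above.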

73 of 74

Mining guided by additional rule constraints:

  • Rule constraints specifying set/subset relationships, constant initialization of variables, and aggregate functions can be specified by the user.
  • These may be used together with, or as an alternative to, meta rule-guided mining.
  • Rule constraints are used to make the mining process more efficient.
  • Rule constraints can also be used to mine hybrid-dimensional association rules.

74 of 74

THANK YOU