DATA MINING AND DATA WAREHOUSING
B.TECH - VI SEMESTER
R17 Regulation
Dr. Anupriya Koneru
UNIT-III
Association Rule Mining:
Overview - Frequent patterns
Market basket analysis: A market basket is a collection of items purchased by a customer in a single transaction.
Suppose, as manager of an AllElectronics branch, you would like to learn more about the buying habits of your customers. Specifically, you wonder, “Which groups or sets of items are customers likely to purchase on a given trip to the store?”
computer ⇒ antivirus software [support = 2%,confidence = 60%].
Frequent Itemsets, Closed Itemsets, and Association Rules
A rule A ⇒ B holds with support s, the probability that a transaction contains A ∪ B, and confidence c, which is taken to be the conditional probability P(B|A). That is,
support(A ⇒ B) = P(A ∪ B)
confidence(A ⇒ B) = P(B|A)
confidence(A ⇒ B) = P(B|A) = support(A ∪ B) / support(A)
= support_count(A ∪ B) / support_count(A)
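The formulas above can be checked numerically. A minimal sketch follows; the counts used are illustrative values, not figures from these notes, and (following the textbook's notation) support(A ∪ B) means the fraction of transactions containing both A and B.

```python
# Sketch: support and confidence from support counts, per the formulas
# above. The counts below are illustrative, not from the notes.

def support(count_a_and_b, total_transactions):
    """support(A => B) = support_count(A u B) / |D|"""
    return count_a_and_b / total_transactions

def confidence(count_a_and_b, count_a):
    """confidence(A => B) = support_count(A u B) / support_count(A)"""
    return count_a_and_b / count_a

# Hypothetical store: 10,000 transactions, 500 contain "computer",
# 300 contain both "computer" and "antivirus software".
print(support(300, 10_000))   # 0.03
print(confidence(300, 500))   # 0.6
```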
In general, association rule mining can be viewed as a two-step process:
1. Find all frequent itemsets: each of these itemsets occurs at least as frequently as a predetermined minimum support count, min_sup.
2. Generate strong association rules from the frequent itemsets: these rules must satisfy minimum support and minimum confidence.
Association Rule Mining: A Road Map
Boolean vs. quantitative associations (based on the types of values handled)
– buys(x, "SQLServer") ∧ buys(x, "DMBook") → buys(x, "DBMiner") [0.2%, 60%]
– age(x, "30..39") ∧ income(x, "42..48K") → buys(x, "PC") [1%, 75%]
• Single-dimensional vs. multidimensional associations (see the examples above)
• Single level vs. multiple-level analysis
– What brands of beers are associated with what brands of diapers?
• Various extensions
– Correlation, causality analysis
• Association does not necessarily imply correlation or causality
– Maxpatterns and closed itemsets
– Constraints enforced
• E.g., small sales (sum < 100) trigger big buys (sum > 1,000)?
Market basket analysis is just one form of frequent pattern mining. In fact, there are many kinds of frequent patterns, association rules, and correlation relationships. Frequent pattern mining can be classified in various ways, based on the following criteria:
Based on the completeness of patterns to be mined:
Based on the levels of abstraction involved in the rule set:
buys(X, "computer") ⇒ buys(X, "HP printer")
buys(X, "laptop computer") ⇒ buys(X, "HP printer")
In the second rule, "laptop computer" is at a lower level of abstraction than "computer" in the first; a rule set mined at differing levels of abstraction is called multilevel.
Based on the number of data dimensions involved in the rule:
buys(X, "computer") ⇒ buys(X, "antivirus software")
age(X, "30...39") ∧ income(X, "42K...48K") ⇒ buys(X, "high resolution TV")
The first rule refers to only one dimension, buys, and is single-dimensional; the second involves three dimensions (age, income, buys) and is multidimensional.
Based on the types of values handled in the rule:
Based on the kinds of rules to be mined:
Based on the kinds of patterns to be mined:
Mining Single-Dimensional Boolean Association Rules from Transactional Databases
Mining Frequent Patterns
Apriori: a method that mines the complete set of frequent itemsets with candidate generation. It rests on the Apriori property, applied in the Apriori algorithm.
Apriori property
• All nonempty subsets of a frequent itemset must also be frequent.
– If an itemset I does not satisfy the minimum support threshold, min_sup, then I is not frequent, i.e., support(I) < min_sup.
– If an item A is added to the itemset I, then the resulting itemset (I ∪ A) cannot occur more frequently than I.
• Monotonic functions are functions that move in only one direction.
• This property is called antimonotone: if a set cannot pass a test, all of its supersets will fail the same test as well. The property is monotonic in the sense of failing the test.
Definition (Frequent Itemset): an itemset whose support is greater than or equal to a minimum support threshold, min_sup.
Definition (Association Rule): an implication of the form X → Y, where X and Y are disjoint itemsets.
Example:
Association Rule Mining
Market-Basket transactions
Example of Association Rules
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}
Implication means co-occurrence, not causality!
Association Rule Mining Task
Given a set of transactions T, find all rules having support ≥ minsup and confidence ≥ minconf. A brute-force approach lists every possible rule and computes the support and confidence of each
⇒ Computationally prohibitive!
Mining Association Rules
Example of Rules:
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)
Observations:
All of the rules above are binary partitions of the same itemset {Milk, Diaper, Beer}. Rules originating from the same itemset have identical support but can have different confidence, so the support and confidence requirements may be decoupled.
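The quoted values can be reproduced in code. The lecture's transaction table did not survive extraction, so the five-transaction dataset below is an assumption, chosen to be consistent with the quoted support and confidence numbers.

```python
# Recompute s and c for the rules above on an assumed five-transaction
# dataset consistent with the quoted values.

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def count(itemset):
    """Number of transactions containing every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

def metrics(antecedent, consequent):
    """(support, confidence) of the rule antecedent -> consequent."""
    both = antecedent | consequent
    s = count(both) / len(transactions)
    c = count(both) / count(antecedent)
    return round(s, 2), round(c, 2)

print(metrics({"Milk", "Diaper"}, {"Beer"}))   # (0.4, 0.67)
print(metrics({"Milk", "Beer"}, {"Diaper"}))   # (0.4, 1.0)
print(metrics({"Milk"}, {"Diaper", "Beer"}))   # (0.4, 0.5)
```

Note that every rule shares s = 0.4, since all six come from the same itemset {Milk, Diaper, Beer}, while the confidences differ.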
“How is the Apriori property used in the algorithm?”
To understand this, let us look at how Lk−1 is used to find Lk for k ≥ 2. A two-step process is followed, consisting of join and prune actions.
1. The join step: To find Lk, a set of candidate k-itemsets, Ck, is generated by joining Lk−1 with itself.
2. The prune step: Ck is a superset of Lk; that is, its members may or may not be frequent, but all of the frequent k-itemsets are included in Ck. A database scan counts each candidate to determine Lk, and by the Apriori property any candidate with an infrequent (k−1)-subset can be removed from Ck beforehand.
Example: Apriori. Let us look at a concrete example, based on the AllElectronics transaction database, D, of the table. There are nine transactions in this database, that is, |D| = 9. We use the figure to illustrate the Apriori algorithm for finding frequent itemsets in D.
1. In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1. The algorithm scans all of the transactions to count the occurrences of each item.
2. Suppose the minimum support count required is 2. The set of frequent 1-itemsets, L1, consists of the candidate 1-itemsets satisfying minimum support.
3. To discover the set of frequent 2-itemsets, L2, the algorithm uses the join L1 ⋈ L1 to generate a candidate set of 2-itemsets, C2.
4. Next, the transactions in D are scanned and the support count of each candidate itemset in C2 is accumulated, as shown in the middle table of the second row in the figure.
5. The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-itemsets in C2 having minimum support.
6. The generation of the set of candidate 3-itemsets, C3, is detailed in the figure. From the join step, we first get C3 = L2 ⋈ L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}. Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can determine that the four latter candidates cannot possibly be frequent. We therefore remove them from C3, thereby saving the effort of unnecessarily obtaining their counts during the subsequent scan of D to determine L3. Note that when given a candidate k-itemset, we only need to check whether its (k−1)-subsets are frequent, since the Apriori algorithm uses a level-wise search strategy. The resulting pruned version of C3 is shown in the first table of the bottom row of the figure.
7. The transactions in D are scanned in order to determine L3, consisting of those candidate 3-itemsets in C3 having minimum support.
8. The algorithm uses L3 ⋈ L3 to generate a candidate set of 4-itemsets, C4. Although the join results in {{I1, I2, I3, I5}}, this itemset is pruned because its subset {I2, I3, I5} is not frequent. Thus, C4 = φ, and the algorithm terminates, having found all of the frequent itemsets.
The Apriori Algorithm
• Join Step: Ck is generated by joining Lk−1 with itself
• Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
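The join and prune steps above can be sketched as a compact, level-wise Apriori. This is a minimal sketch, not the textbook's pseudocode; the nine transactions are the standard AllElectronics example reconstructed from the referenced table (the table itself did not survive extraction, so treat the data as an assumption). min_sup is an absolute count.

```python
from itertools import combinations

# Minimal level-wise Apriori sketch. Itemsets are sorted tuples; min_sup
# is an absolute support count.

def apriori(transactions, min_sup):
    transactions = [set(t) for t in transactions]
    # L1: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    L = {c: n for c, n in counts.items() if n >= min_sup}
    frequent = dict(L)
    k = 2
    while L:
        prev = sorted(L)
        # Join step: merge L(k-1) itemsets sharing their first k-2 items
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                a, b = prev[i], prev[j]
                if a[:-1] == b[:-1]:
                    c = tuple(sorted(set(a) | set(b)))
                    # Prune step: all (k-1)-subsets must be frequent
                    if all(s in L for s in combinations(c, k - 1)):
                        candidates.add(c)
        # Scan D to accumulate the support count of each candidate
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if set(c) <= t:
                    counts[c] += 1
        L = {c: n for c, n in counts.items() if n >= min_sup}
        frequent.update(L)
        k += 1
    return frequent

# Assumed AllElectronics database D (|D| = 9), min_sup = 2:
D = [
    ["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"],
    ["I1", "I2", "I4"], ["I1", "I3"], ["I2", "I3"],
    ["I1", "I3"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3"],
]
result = apriori(D, min_sup=2)
print(sorted(c for c in result if len(c) == 3))
# [('I1', 'I2', 'I3'), ('I1', 'I2', 'I5')]
```

The run reproduces the worked example: the four pruned 3-candidates never get counted, and the single joined 4-candidate is pruned, so the algorithm stops at L3.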
• FP-growth: a method that mines the complete set of frequent itemsets without candidate generation.
• Compress a large database into a compact, Frequent-Pattern tree (FP-tree) structure – highly condensed, but complete for frequent pattern mining – avoid costly database scans
• Develop an efficient, FP-tree-based frequent pattern mining method
– A divide-and-conquer methodology: decompose mining tasks into smaller ones
– Avoid candidate generation: sub-database test only!
Construct FP-tree from a Transaction DB
• Completeness
– never breaks a long pattern of any transaction
– preserves complete information for frequent pattern mining
• Compactness
– reduce irrelevant information—infrequent items are gone
– frequency descending ordering: more frequent items are more likely to be shared
– Example: for the Connect-4 DB, the compression ratio can be over 100
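The construction and compression effect above can be sketched in a few lines. This is a minimal FP-tree build, showing the two database scans, the frequency-descending ordering, and prefix sharing; the node layout and helper names are my own, and the nine transactions are the assumed AllElectronics example.

```python
from collections import defaultdict

# Minimal FP-tree construction sketch (node layout is my own; the nine
# transactions are the assumed textbook example).

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_sup):
    # Scan 1: count item supports; keep frequent items, ordered by
    # descending frequency.
    support = defaultdict(int)
    for t in transactions:
        for item in t:
            support[item] += 1
    freq = sorted((i for i in support if support[i] >= min_sup),
                  key=lambda i: -support[i])
    rank = {item: r for r, item in enumerate(freq)}
    # Scan 2: insert each transaction's frequent items in that order,
    # sharing prefixes; keep node-links in a header table.
    root = Node(None, None)
    header = defaultdict(list)
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            if item not in node.children:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header

def count_nodes(node):
    """Number of item nodes in the tree (excludes the null root)."""
    return (node.item is not None) + sum(
        count_nodes(c) for c in node.children.values())

D = [
    ["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"],
    ["I1", "I2", "I4"], ["I1", "I3"], ["I2", "I3"],
    ["I1", "I3"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3"],
]
root, header = build_fp_tree(D, min_sup=2)
print(count_nodes(root), "nodes store", sum(len(t) for t in D), "item occurrences")
```

On this small database the tree stores 23 frequent-item occurrences in 10 nodes; on dense real data the sharing is far greater.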
• Mining Frequent Patterns Using FP-tree
General idea (divide-and-conquer)
– Recursively grow frequent pattern path using the FP-tree
• Method
– For each item, construct its conditional pattern-base, and then its conditional FP-tree
– Repeat the process on each newly created conditional FP-tree
– Until the resulting FP-tree is empty, or it contains only one path (single path will generate all the combinations of its sub-paths, each of which is a frequent pattern)
Major Steps to Mine an FP-tree
1) Construct the conditional pattern base for each node in the FP-tree
2) Construct the conditional FP-tree from each conditional pattern base
3) Recursively mine the conditional FP-trees and grow the frequent patterns obtained so far
– If the conditional FP-tree contains a single path, simply enumerate all the patterns
Principles of Frequent Pattern Growth
– Pattern growth property: let α be a frequent itemset in DB, B be α's conditional pattern base, and β be an itemset in B; then α ∪ β is frequent in DB if and only if β is frequent in B.
Why Is Frequent Pattern Growth Fast?
• Our performance study shows
– FP-growth is an order of magnitude faster than Apriori, and is also faster than tree-projection
• Reasoning
– No candidate generation, no candidate test
– Use compact data structure
– Eliminate repeated database scan
– The basic operations are counting and FP-tree building
Mining multilevel association rules from transactional databases
Mining Multidimensional Association Rules
buys(X, “digital camera”)=>buys(X, “HP printer”)
age(X, “20...29”)^occupation(X, “student”)=>buys(X, “laptop”)
age(X, “20...29”)^buys(X, “laptop”)=>buys(X, “HP printer”)
Mining Multidimensional Association Rules Using Static Discretization of Quantitative Attributes
Mining Quantitative Association Rules
age(X, “30...39”)^income(X, “42K...48K”) => buys(X, “HDTV”)
The following steps are involved in ARCS:
1. Binning: the quantitative attributes are partitioned into intervals (bins).
2. Finding frequent predicate sets: the binned data are searched for frequent predicate sets satisfying minimum support and confidence.
3. Clustering the association rules: strong rules that are adjacent in the 2-D grid are clustered (merged) into more general rules.
Three common binning strategies are as follows:
– Equal-width binning: each bin covers an interval of the same size.
– Equal-frequency (equal-depth) binning: each bin holds approximately the same number of tuples.
– Clustering-based binning: bin boundaries follow clusters found in the attribute values.
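The first two strategies are easy to sketch in code. The income values below are illustrative, not data from the lecture.

```python
# Sketch of equal-width vs. equal-frequency binning on illustrative
# income values (in $K); not data from the lecture.

incomes = [21, 25, 28, 33, 34, 38, 42, 45, 47, 60]

def equal_width_edges(values, k):
    """k bins of identical width: the k-1 interior edges."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]

def equal_frequency_edges(values, k):
    """k bins holding roughly the same number of values."""
    s = sorted(values)
    return [s[(i * len(s)) // k] for i in range(1, k)]

print(equal_width_edges(incomes, 3))      # [34.0, 47.0]
print(equal_frequency_edges(incomes, 3))  # [33, 42]
```

Note how the outlier 60 stretches the equal-width bins, while the equal-frequency edges track where the data actually sit.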
age(X, 34)^income(X, “31K...40K”)=>buys(X, “HDTV”)
age(X, 35)^income(X, “31K...40K”)=>buys(X, “HDTV”)
age(X, 34)^income(X, “41K...50K”)=>buys(X, “HDTV”)
age(X, 35)^income(X, “41K...50K”)=>buys(X, “HDTV”)
“Can we find a simpler rule to replace the above four rules?”
age(X, “34...35”)^income(X, “31K...50K”)=>buys(X, “HDTV”)
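The merge above can be sketched as the grid-clustering step of ARCS. The cell encoding below is my own simplification: each rule's antecedent is reduced to an (age, income-bin) cell.

```python
# Sketch: merging the four adjacent rules above into one. Each rule's
# antecedent is a cell (age, income bin); bin 0 = 31K...40K, bin 1 = 41K...50K.
# This encoding is a simplification of the ARCS 2-D grid.

cells = {(34, 0), (35, 0), (34, 1), (35, 1)}  # the four rules' antecedents

ages = sorted({a for a, _ in cells})
bins = sorted({b for _, b in cells})

# The cells cover a complete rectangle of the grid, so a single merged
# rule with widened intervals replaces all four.
is_rectangle = all((a, b) in cells for a in ages for b in bins)
if is_rectangle:
    rule = (f'age(X, "{ages[0]}...{ages[-1]}") ^ '
            f'income(X, "31K...50K") => buys(X, "HDTV")')
    print(rule)
```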
From Association Mining to Correlation Analysis
Strong Rules Are Not Necessarily Interesting: An Example
“How can we tell which strong association rules are really interesting?”
A misleading “strong” association rule. Suppose we are interested in analyzing transactions at AllElectronics with respect to the purchase of computer games and videos. Let game refer to the transactions containing computer games, and video refer to those containing videos. Of the 10,000 transactions analyzed, the data show that 6,000 of the customer transactions included computer games, while 7,500 included videos, and 4,000 included both computer games and videos. Suppose that a data mining program for discovering association rules is run on the data, using a minimum support of, say, 30% and a minimum confidence of 60%. The following association rule is discovered:
buys(X, “computer games”)=>buys(X, “videos”) [support = 40%, confidence = 66%]
From Association Analysis to Correlation Analysis
A ⇒ B [support, confidence, correlation]
lift(A, B) = P(A ∪ B) / (P(A)P(B))
This equals P(B|A)/P(B), or conf(A⇒B)/sup(B), which is also referred to as the lift of the association (or correlation) rule A ⇒ B. If the lift is less than 1, the occurrence of A is negatively correlated with the occurrence of B; if it is greater than 1, they are positively correlated; if it equals 1, A and B are independent.
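Applying this to the computer-games/videos example above (10,000 transactions; 6,000 with games; 7,500 with videos; 4,000 with both) gives a lift below 1:

```python
# Worked lift computation for the games/videos example. As in the text,
# P(A u B) denotes the probability that a transaction contains both.

p_game = 6000 / 10000    # P(game)  = 0.60
p_video = 7500 / 10000   # P(video) = 0.75
p_both = 4000 / 10000    # P(game u video) = 0.40

lift = p_both / (p_game * p_video)
print(round(lift, 2))    # 0.89 < 1: buying games and buying videos are
                         # negatively correlated, even though the rule
                         # is "strong" (support 40%, confidence 66%)
```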
Constraint-based Association mining
1. Knowledge type constraints: These specify the type of knowledge to be mined, such as association.
2. Data constraints: These specify the set of task-relevant data.
3. Dimension/level constraints: These specify the dimension of the data, or levels of the concept hierarchies, to be used.
4. Interestingness constraints: These specify thresholds on statistical measures of rule interestingness, such as support and confidence.
5. Rule constraints: These specify the form of rules to be mined. Such constraints may be expressed as meta rules (rule templates), or by specifying the maximum or minimum number of predicates in the rule antecedent or consequent, or the satisfaction of particular predicates on attribute values, or their aggregates.
Meta rule-guided mining of association rules:
How are meta rules useful?
Example of a metarule of user interest: P1(X, Y) ∧ P2(X, W) ⇒ buys(X, “office software”)
The data mining system can then search for rules that match the given metarule.
For instance, the following rule matches (complies with) the metarule stated above:
age(X, “30..39”) ∧ income(X, “41K..60K”)⇒buys(X, “office software”)
Mining guided by additional rule constraints: rule constraints specifying set/subset relationships, constant initiation of variables, and aggregate functions can be specified by the user and, when pushed deep into the mining process, confine the search space.
THANK YOU