
Association Rule Mining

FP-Growth Algorithm


FP-growth (frequent pattern growth): Mining Frequent Itemsets without Candidate Generation

  • The first scan of the database is the same as in Apriori: it derives the set of frequent items (1-itemsets) and their support counts.
  • Let the minimum support count be 2.
  • The set of frequent items is sorted in the order of descending support count.


Example:


Step 1: Count Item Frequencies: We first count the frequency of each item in the entire dataset.

Item | Frequency | Priority
I1   | 6         | 2
I2   | 7         | 1
I3   | 6         | 3
I4   | 2         | 4
I5   | 2         | 5
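The counting in Step 1 can be sketched in Python (the nine transactions are the ones listed in the example; breaking frequency ties by item name is an assumption that happens to reproduce the priority column):

```python
from collections import Counter

# The nine transactions from the example.
transactions = [
    ["I2", "I1", "I5"], ["I2", "I4"], ["I2", "I3"],
    ["I2", "I1", "I4"], ["I1", "I3"], ["I2", "I3"],
    ["I1", "I3"], ["I2", "I1", "I3", "I5"], ["I2", "I1", "I3"],
]

# Step 1: count the support of every item in the database.
support = Counter(item for t in transactions for item in t)

# Rank items by descending support (ties broken by item name, an assumption).
priority = {item: rank for rank, (item, _) in enumerate(
    sorted(support.items(), key=lambda kv: (-kv[1], kv[0])), start=1)}

print(support)   # I2:7, I1:6, I3:6, I4:2, I5:2
print(priority)  # I2:1, I1:2, I3:3, I4:4, I5:5
```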


Step 2: Sort Items in Transactions Based on Frequency: Now we reorder each transaction in descending order of frequency. If two items have the same frequency, you can keep their original order.

TID | Items
T1  | I2, I1, I5
T2  | I2, I4
T3  | I2, I3
T4  | I2, I1, I4
T5  | I1, I3
T6  | I2, I3
T7  | I1, I3
T8  | I2, I1, I3, I5
T9  | I2, I1, I3
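The reordering in Step 2 can be sketched as follows, assuming the raw transactions list their items in arbitrary order; sorting by the Step 1 priority reproduces the ordered table (Python's stable sort preserves the original order of equally frequent items):

```python
# Priority table from Step 1 (lower number = more frequent).
priority = {"I2": 1, "I1": 2, "I3": 3, "I4": 4, "I5": 5}

# Raw transactions, items in arbitrary order (an assumption for illustration).
transactions = [
    ["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"],
    ["I1", "I2", "I4"], ["I1", "I3"], ["I2", "I3"],
    ["I1", "I3"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3"],
]

# Step 2: reorder each transaction by descending support (ascending priority).
ordered = [sorted(t, key=priority.get) for t in transactions]
print(ordered[0])  # ['I2', 'I1', 'I5']
```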



Step 3: Construct the FP-Tree: We start with a root node (labelled NULL) and insert the ordered transactions one by one, incrementing the count of each node along a shared prefix and creating new nodes where the paths diverge.

[Figure: step-by-step FP-tree construction for T1 through T9. The final tree is:

NULL
├── I2:7
│   ├── I1:4
│   │   ├── I5:1
│   │   ├── I4:1
│   │   └── I3:2
│   │       └── I5:1
│   ├── I4:1
│   └── I3:2
└── I1:2
    └── I3:2
]

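The insertion procedure of Step 3 can be sketched with a minimal node class (a simplification: a real implementation also maintains a header table of node-links per item):

```python
class Node:
    def __init__(self, item, parent):
        self.item, self.count, self.parent, self.children = item, 1, parent, {}

def build_fp_tree(ordered_transactions):
    root = Node(None, None)  # the NULL root
    for t in ordered_transactions:
        node = root
        for item in t:  # items already sorted by descending support
            if item in node.children:
                node.children[item].count += 1  # shared prefix: bump the count
            else:
                node.children[item] = Node(item, node)  # diverge: new branch
            node = node.children[item]
    return root

ordered = [
    ["I2", "I1", "I5"], ["I2", "I4"], ["I2", "I3"],
    ["I2", "I1", "I4"], ["I1", "I3"], ["I2", "I3"],
    ["I1", "I3"], ["I2", "I1", "I3", "I5"], ["I2", "I1", "I3"],
]
root = build_fp_tree(ordered)
print(root.children["I2"].count)                 # 7
print(root.children["I2"].children["I1"].count)  # 4
print(root.children["I1"].count)                 # 2
```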


Step 4: Mine the FP-Tree for Frequent Patterns
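Mining starts from each item's conditional pattern base: the set of prefix paths that lead to the item. The brute-force sketch below rescans the ordered transactions for clarity; FP-growth itself reads these paths directly off the tree:

```python
ordered = [
    ["I2", "I1", "I5"], ["I2", "I4"], ["I2", "I3"],
    ["I2", "I1", "I4"], ["I1", "I3"], ["I2", "I3"],
    ["I1", "I3"], ["I2", "I1", "I3", "I5"], ["I2", "I1", "I3"],
]

def conditional_pattern_base(item):
    # Each prefix preceding `item` in an ordered transaction is one prefix path.
    return [t[:t.index(item)] for t in ordered if item in t]

print(conditional_pattern_base("I5"))
# [['I2', 'I1'], ['I2', 'I1', 'I3']] -> I2 and I1 both appear twice with I5,
# so {I2,I5}:2, {I1,I5}:2 and {I2,I1,I5}:2 are frequent (minsup = 2).
```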


Mining Various Kinds of Association Rules: Mining Multilevel Association Rules

Association rules generated from mining data at multiple levels of abstraction are called multiple-level or multilevel association rules.


Using uniform minimum support for all levels (referred to as uniform support):

  • The same minimum support threshold is used when mining at each level of abstraction.
  • For example, a minimum support threshold of 5% is used throughout (e.g., for mining from “computer” down to “laptop computer”). Both “computer” and “laptop computer” are found to be frequent, while “desktop computer” is not.


Using reduced minimum support at lower levels (referred to as reduced support):

  • Each level of abstraction has its own minimum support threshold.
  • The deeper the level of abstraction, the smaller the corresponding threshold.


Using item or group-based minimum support (referred to as group-based support):

Because users or experts often have insight as to which groups are more important than others, it is sometimes more desirable to set up user-specific, item-, or group-based minimum support thresholds when mining multilevel rules.

For example, a user could set up the minimum support thresholds based on product price, or on items of interest, such as by setting particularly low support thresholds for laptop computers and flash drives in order to pay particular attention to the association patterns containing items in these categories.
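A minimal sketch of group-based support, with hypothetical group names and threshold values:

```python
# Hypothetical per-group minimum support thresholds (group-based support):
# the user lowers the threshold for categories of special interest.
group_minsup = {"laptop computer": 0.01, "flash drive": 0.01, "default": 0.05}

def minsup_for(item_group):
    return group_minsup.get(item_group, group_minsup["default"])

def is_frequent(item_group, support):
    return support >= minsup_for(item_group)

print(is_frequent("laptop computer", 0.02))   # True: low group threshold applies
print(is_frequent("desktop computer", 0.02))  # False: default 5% applies
```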


Mining Multidimensional Association Rules from Relational Databases and Data Warehouses

  • Association rules that contain a single predicate (here, the predicate buys) are called single-dimensional or intradimensional association rules.

Example:

buys(X, “digital camera”) ⇒ buys(X, “HP printer”)

  • Such a rule contains a single distinct predicate (e.g., buys) with multiple occurrences (i.e., the predicate occurs more than once within the rule).


  • Association rules that involve two or more dimensions or predicates are referred to as multidimensional association rules.

Example:

age(X, “20...29”) ∧ occupation(X, “student”) ⇒ buys(X, “laptop”)

  • Each predicate in this rule occurs only once; hence, we say that it has no repeated predicates.
  • Multidimensional association rules with no repeated predicates are called interdimensional association rules.


  • Multidimensional association rules with repeated predicates, i.e., containing multiple occurrences of some predicate, are called hybrid-dimensional association rules.

Example:

age(X, “20...29”) ∧ buys(X, “laptop”) ⇒ buys(X, “HP printer”)
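Representing a rule's antecedent and consequent as lists of predicate names (a hypothetical representation for illustration), the three rule categories can be told apart by counting predicate occurrences:

```python
def classify(antecedent, consequent):
    # antecedent/consequent are lists of predicate names, e.g. ["age", "buys"].
    preds = antecedent + consequent
    if len(set(preds)) == 1:
        return "single-dimensional"   # one distinct predicate, repeated
    if len(preds) == len(set(preds)):
        return "interdimensional"     # no predicate repeats
    return "hybrid-dimensional"       # several predicates, some repeated

print(classify(["buys"], ["buys"]))               # single-dimensional
print(classify(["age", "occupation"], ["buys"]))  # interdimensional
print(classify(["age", "buys"], ["buys"]))        # hybrid-dimensional
```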


Mining Multidimensional Association Rules Using Static Discretization of Quantitative Attributes

  • Techniques for mining multidimensional association rules can be categorized into two basic approaches regarding the treatment of quantitative attributes.
  • In the first approach, quantitative attributes are discretized using predefined concept hierarchies.
  • This discretization occurs before mining. For instance, a concept hierarchy for income may be used to replace the original numeric values of this attribute by interval labels, such as “0...20K”, “21K...30K”, “31K...40K”, and so on.
  • In the second approach, quantitative attributes are discretized or clustered into “bins” based on the distribution of the data.
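A minimal sketch of the first approach, with assumed bin edges for the income hierarchy:

```python
# Static discretization: replace numeric income values by interval labels
# from a predefined concept hierarchy (the bin edges here are an assumption).
bins = [(0, 20_000, "0...20K"), (20_001, 30_000, "21K...30K"),
        (30_001, 40_000, "31K...40K")]

def discretize(income):
    for lo, hi, label in bins:
        if lo <= income <= hi:
            return label
    return "over 40K"

print(discretize(15_000))  # 0...20K
print(discretize(25_000))  # 21K...30K
print(discretize(55_000))  # over 40K
```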


Graph Pattern Mining

  • Graph mining applies mining techniques to find patterns or relationships in a given real-world collection of graphs.
  • By mining the graphs, frequent substructures and relationships can be identified, which helps in clustering graph sets, finding relationships between graph sets, or discriminating or characterizing graphs.
  • This task is important since data is naturally represented as graphs in many domains (e.g., social networks, chemical molecules, road maps). It is thus desirable to analyze graph data to discover interesting, unexpected, and useful patterns that can be used to understand the data or make decisions.


What is a graph: A graph is a set of vertices and edges, possibly carrying labels.

Types of graphs: connected and disconnected 

A connected graph is a graph in which it is possible to reach any vertex from any other vertex by following the edges (just as any city can be reached from any other by following the roads). If a graph is not connected, it is said to be a disconnected graph.
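Connectivity can be checked with a breadth-first search (a standard sketch, treating the graph as an undirected edge list):

```python
from collections import deque

def is_connected(vertices, edges):
    """BFS from an arbitrary vertex; connected iff every vertex is reached."""
    if not vertices:
        return True
    adj = {v: [] for v in vertices}
    for u, v in edges:  # undirected: record both directions
        adj[u].append(v)
        adj[v].append(u)
    start = next(iter(vertices))
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return len(seen) == len(vertices)

print(is_connected({"A", "B", "C"}, [("A", "B"), ("B", "C")]))  # True
print(is_connected({"A", "B", "C"}, [("A", "B")]))              # False
```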


Types of graphs: directed and undirected

In an undirected graph, every edge can be travelled in both directions, while in a directed graph each edge can be travelled in only a single direction.


 Frequent Subgraph Mining

  • The goal of subgraph mining is to discover interesting subgraphs appearing in a set of graphs (a graph database).
  • A subgraph is considered interesting if it appears multiple times in the set of graphs; in other words, we want to discover subgraphs that are common to multiple graphs.
  • The task of finding frequent subgraphs in a set of graphs is called frequent subgraph mining.
  • Its input is a graph database (a set of graphs) and a parameter called the minimum support threshold (minsup).


A frequent subgraph is a subgraph that appears in at least minsup graphs from a graph database.

For example, if we set the minsup parameter to 3, a subgraph that appears in three of the input graphs is frequent and is said to have a support (a frequency) of 3.
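A heavily simplified support count, treating each graph as a set of labeled edges and a candidate as occurring in a graph when all its edges appear there (real miners such as gSpan test subgraph isomorphism instead; the example database here is hypothetical):

```python
# Hypothetical graph database: each graph is a set of labeled edges.
database = [
    {("A", "B"), ("B", "C")},
    {("A", "B"), ("B", "C"), ("C", "D")},
    {("A", "B"), ("B", "C")},
    {("B", "C"), ("C", "D")},
]

def support(candidate):
    # Number of graphs whose edge set contains all of the candidate's edges.
    return sum(candidate <= g for g in database)

minsup = 3
candidate = {("A", "B"), ("B", "C")}
print(support(candidate))            # 3
print(support(candidate) >= minsup)  # frequent: True
```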


The minsup parameter is generally set by trial and error. If it is set too high, few subgraphs will be found, while if it is set too low, hundreds or millions of subgraphs may be found, depending on the input database.


Mining frequent subgraphs in a single graph

  • A variation of frequent subgraph mining consists of finding all frequent subgraphs in a single graph rather than in a graph database.
  • For example, consider a graph containing seven vertices and six edges. If we perform frequent subgraph mining on this single graph with the minsup parameter set to 2, we obtain the subgraphs that appear at least twice in it.


These subgraphs are said to be frequent because they appear at least twice in the input graph.

Algorithms used: gSpan, CloseGraph


Applications

  • XML Structures
  • Network Analysis
  • Control Flow Analysis
  • Biological Structures
  • Road maps
  • Chemical molecules


Sequential Pattern Mining

  • Sequential pattern mining is the mining of frequently occurring ordered events or subsequences as patterns.
  • Sequential pattern mining has numerous real-life applications due to the fact that data is naturally encoded as sequences of symbols in many fields such as bioinformatics, e-learning, market basket analysis, texts, and webpage click-stream analysis.

Ex: Consider the following sequence database, representing the purchases made by customers in a retail store.


  • Each sequence represents the items purchased by a customer at different times. A sequence is an ordered list of itemsets (sets of items bought together). For example, in this database, the first sequence (SID 1) indicates that a customer bought some items a and b together, then purchased an item c, then purchased items f and g together, then purchased an item g, and then finally purchased an item e.
  • Sequential pattern mining is used to find subsequences that appear often in a sequence database, i.e. that are common to several sequences. Those subsequences are called the frequent sequential patterns.
  • Sequential pattern mining can be used to find the sequences of items frequently bought by customers. This can be useful to understand the behavior of customers in order to make marketing decisions.


  • For example, the patterns <{a}> and <{a}, {g}> are frequent, with supports of 3 and 2 sequences, respectively. In other words, these patterns appear in 3 and 2 sequences of the input database, respectively.
  • The pattern <{a}> appears in sequences 1, 2 and 3, while the pattern <{a}, {g}> appears in sequences 1 and 3. These patterns are interesting as they represent behavior common to several customers.
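A sketch of support counting for sequential patterns. Only sequence 1 is fully described above, so the database below is an assumed reconstruction chosen to be consistent with the stated support counts:

```python
# Assumed sequence database (only SID 1 is fully spelled out in the text).
db = [
    [{"a", "b"}, {"c"}, {"f", "g"}, {"g"}, {"e"}],     # SID 1
    [{"a", "d"}, {"c"}, {"b"}, {"a", "b", "e", "f"}],  # SID 2
    [{"a"}, {"b"}, {"f", "g"}, {"e"}],                 # SID 3
    [{"b"}, {"f", "g"}],                               # SID 4
]

def contains(sequence, pattern):
    """True if the pattern's itemsets occur, in order, inside the sequence."""
    i = 0
    for itemset in sequence:
        if i < len(pattern) and pattern[i] <= itemset:
            i += 1
    return i == len(pattern)

def support(pattern):
    return sum(contains(s, pattern) for s in db)

print(support([{"a"}]))         # 3 (sequences 1, 2 and 3)
print(support([{"a"}, {"g"}]))  # 2 (sequences 1 and 3)
```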


CONSTRAINT-BASED ASSOCIATION RULES:

  • A data mining process may uncover thousands of rules from a given set of data, most of which end up being unrelated or uninteresting to the users.
  • Users often have a good sense of which “direction” of mining may lead to interesting patterns and of the “form” of the patterns or rules they would like to find.
  • A good heuristic is therefore to have the users specify such intuition or expectations as constraints to confine the search space. This strategy is known as constraint-based mining.
  • Constraint-based mining provides:

User flexibility: the user provides constraints on what is to be mined.

System optimization: the system exploits the constraints to make mining more efficient.


The constraints can include the following:

Knowledge type constraints: These specify the type of knowledge to be mined, such as association or correlation.

Data constraints: These specify the set of task-relevant data.

Dimension/level constraints: These specify the desired dimensions (or attributes) of the data, or levels of the concept hierarchies, to be used in mining.

Interestingness constraints: These specify thresholds on statistical measures of rule interestingness, such as support, confidence, and correlation.

Rule constraints: These specify the form of rules to be mined. Such constraints may be expressed as rule templates, as the maximum or minimum number of predicates that can occur in the rule antecedent or consequent, or as relationships among attributes, attribute values, and/or aggregates. The constraints can be specified using a high-level declarative data mining query language and user interface.
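Several of these constraints can be applied as simple filters over candidate rules (a hypothetical rule representation; a real system would push such constraints into the mining process itself):

```python
# Candidate rules as (antecedent, consequent, support, confidence) tuples —
# a hypothetical representation for illustration.
rules = [
    ({"digital camera"}, {"HP printer"}, 0.06, 0.75),
    ({"bread", "milk", "eggs"}, {"butter"}, 0.02, 0.60),
    ({"laptop"}, {"mouse"}, 0.08, 0.40),
]

# Interestingness constraints: thresholds on support and confidence.
# Rule constraint: at most two predicates in the antecedent.
def satisfies(rule, minsup=0.05, minconf=0.5, max_antecedent=2):
    ant, _, sup, conf = rule
    return sup >= minsup and conf >= minconf and len(ant) <= max_antecedent

kept = [r for r in rules if satisfies(r)]
print(len(kept))  # 1: only the camera -> printer rule passes all constraints
```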