Data Mining: �Concepts and Techniques� � Mining sequence patterns in transactional databases
*
Data Mining: Concepts and Techniques
1
Chapter 8. Mining Stream, Time-Series, and Sequence Data
*
Data Mining: Concepts and Techniques
2
Sequence Databases & Sequential Patterns
*
Data Mining: Concepts and Techniques
3
What Is Sequential Pattern Mining?
*
Data Mining: Concepts and Techniques
4
A sequence database
A sequence : < (ef) (ab) (df) c b >
An element may contain a set of items.
Items within an element are unordered
and we list them alphabetically.
<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
Given support threshold min_sup =2, <(ab)c> is a sequential pattern
SID | sequence |
10 | <a(abc)(ac)d(cf)> |
20 | <(ad)c(bc)(ae)> |
30 | <(ef)(ab)(df)cb> |
40 | <eg(af)cbc> |
Challenges on Sequential Pattern Mining
*
Data Mining: Concepts and Techniques
5
Sequential Pattern Mining Algorithms
*
Data Mining: Concepts and Techniques
6
The Apriori Property of Sequential Patterns
*
Data Mining: Concepts and Techniques
7
<a(bd)bcb(ade)>
50
<(be)(ce)d>
40
<(ah)(bf)abf>
30
<(bf)(ce)b(fg)>
20
<(bd)cb(ac)>
10
Sequence
Seq. ID
Given support threshold min_sup =2
GSP—Generalized Sequential Pattern Mining
*
Data Mining: Concepts and Techniques
8
Finding Length-1 Sequential Patterns
*
Data Mining: Concepts and Techniques
9
<a(bd)bcb(ade)>
50
<(be)(ce)d>
40
<(ah)(bf)abf>
30
<(bf)(ce)b(fg)>
20
<(bd)cb(ac)>
10
Sequence
Seq. ID
min_sup =2
Cand | Sup |
<a> | 3 |
<b> | 5 |
<c> | 4 |
<d> | 3 |
<e> | 3 |
<f> | 2 |
<g> | 1 |
<h> | 1 |
GSP: Generating Length-2 Candidates
*
Data Mining: Concepts and Techniques
10
| <a> | <b> | <c> | <d> | <e> | <f> |
<a> | <aa> | <ab> | <ac> | <ad> | <ae> | <af> |
<b> | <ba> | <bb> | <bc> | <bd> | <be> | <bf> |
<c> | <ca> | <cb> | <cc> | <cd> | <ce> | <cf> |
<d> | <da> | <db> | <dc> | <dd> | <de> | <df> |
<e> | <ea> | <eb> | <ec> | <ed> | <ee> | <ef> |
<f> | <fa> | <fb> | <fc> | <fd> | <fe> | <ff> |
| <a> | <b> | <c> | <d> | <e> | <f> |
<a> | | <(ab)> | <(ac)> | <(ad)> | <(ae)> | <(af)> |
<b> | | | <(bc)> | <(bd)> | <(be)> | <(bf)> |
<c> | | | | <(cd)> | <(ce)> | <(cf)> |
<d> | | | | | <(de)> | <(df)> |
<e> | | | | | | <(ef)> |
<f> | | | | | | |
51 length-2
Candidates
The GSP Mining Process
*
Data Mining: Concepts and Techniques
11
<a> <b> <c> <d> <e> <f> <g> <h>
<aa> <ab> … <af> <ba> <bb> … <ff> <(ab)> … <(ef)>
<abb> <aab> <aba> <baa> <bab> …
<abba> <(bd)bc> …
<(bd)cba>
1st scan: 8 cand. 6 length-1 seq. pat.
2nd scan: 51 cand. 19 length-2 seq. pat. 10 cand. not in DB at all
3rd scan: 47 cand. 19 length-3 seq. pat. 20 cand. not in DB at all
4th scan: 8 cand. 6 length-4 seq. pat.
5th scan: 1 cand. 1 length-5 seq. pat.
Cand. cannot pass sup. threshold
Cand. not in DB at all
<a(bd)bcb(ade)>
50
<(be)(ce)d>
40
<(ah)(bf)abf>
30
<(bf)(ce)b(fg)>
20
<(bd)cb(ac)>
10
Sequence
Seq. ID
min_sup =2
Candidate Generate-and-test: Drawbacks
*
Data Mining: Concepts and Techniques
12
The SPADE Algorithm
*
Data Mining: Concepts and Techniques
13
The SPADE Algorithm
*
Data Mining: Concepts and Techniques
14
Bottlenecks of GSP and SPADE
*
Data Mining: Concepts and Techniques
15
Prefix and Suffix (Projection)
*
Data Mining: Concepts and Techniques
16
Prefix | Suffix (Prefix-Based Projection) |
<a> | <(abc)(ac)d(cf)> |
<aa> | <(_bc)(ac)d(cf)> |
<ab> | <(_c)(ac)d(cf)> |
Mining Sequential Patterns by Prefix Projections
*
Data Mining: Concepts and Techniques
17
SID | sequence |
10 | <a(abc)(ac)d(cf)> |
20 | <(ad)c(bc)(ae)> |
30 | <(ef)(ab)(df)cb> |
40 | <eg(af)cbc> |
Finding Seq. Patterns with Prefix <a>
*
Data Mining: Concepts and Techniques
18
SID | sequence |
10 | <a(abc)(ac)d(cf)> |
20 | <(ad)c(bc)(ae)> |
30 | <(ef)(ab)(df)cb> |
40 | <eg(af)cbc> |
Completeness of PrefixSpan
*
Data Mining: Concepts and Techniques
19
SID | sequence |
10 | <a(abc)(ac)d(cf)> |
20 | <(ad)c(bc)(ae)> |
30 | <(ef)(ab)(df)cb> |
40 | <eg(af)cbc> |
SDB
Length-1 sequential patterns
<a>, <b>, <c>, <d>, <e>, <f>
<a>-projected database
<(abc)(ac)d(cf)>
<(_d)c(bc)(ae)>
<(_b)(df)cb>
<(_f)cbc>
Length-2 sequential
patterns
<aa>, <ab>, <(ab)>,
<ac>, <ad>, <af>
Having prefix <a>
Having prefix <aa>
<aa>-proj. db
…
<af>-proj. db
Having prefix <af>
<b>-projected database
…
Having prefix <b>
Having prefix <c>, …, <f>
… …
Efficiency of PrefixSpan
*
Data Mining: Concepts and Techniques
20