1 of 20

Data Mining: �Concepts and Techniques� Mining sequence patterns in transactional databases

*

Data Mining: Concepts and Techniques

1

2 of 20

Chapter 8. Mining Stream, Time-Series, and Sequence Data

  • Mining data streams
  • Mining time-series data
  • Mining sequence patterns in transactional databases
  • Mining sequence patterns in biological data

*

Data Mining: Concepts and Techniques

2

3 of 20

Sequence Databases & Sequential Patterns

  • Transaction databases, time-series databases vs. sequence databases
  • Frequent patterns vs. (frequent) sequential patterns
  • Applications of sequential pattern mining
    • Customer shopping sequences:
      • First buy computer, then CD-ROM, and then digital camera, within 3 months.
    • Medical treatments, natural disasters (e.g., earthquakes), science & eng. processes, stocks and markets, etc.
    • Telephone calling patterns, Weblog click streams
    • DNA sequences and gene structures

*

Data Mining: Concepts and Techniques

3

4 of 20

What Is Sequential Pattern Mining?

  • Given a set of sequences, find the complete set of frequent subsequences

*

Data Mining: Concepts and Techniques

4

A sequence database

A sequence : < (ef) (ab) (df) c b >

An element may contain a set of items.

Items within an element are unordered

and we list them alphabetically.

<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>

Given support threshold min_sup =2, <(ab)c> is a sequential pattern

SID

sequence

10

<a(abc)(ac)d(cf)>

20

<(ad)c(bc)(ae)>

30

<(ef)(ab)(df)cb>

40

<eg(af)cbc>

5 of 20

Challenges on Sequential Pattern Mining

  • A huge number of possible sequential patterns are hidden in databases
  • A mining algorithm should
    • find the complete set of patterns, when possible, satisfying the minimum support (frequency) threshold
    • be highly efficient, scalable, involving only a small number of database scans
    • be able to incorporate various kinds of user-specific constraints

*

Data Mining: Concepts and Techniques

5

6 of 20

Sequential Pattern Mining Algorithms

  • Concept introduction and an initial Apriori-like algorithm
    • Agrawal & Srikant. Mining sequential patterns, ICDE’95
  • Apriori-based method: GSP (Generalized Sequential Patterns:)
  • Pattern-growth methods: FreeSpan & PrefixSpan
  • Vertical format-based mining: SPADE
  • Mining closed sequential patterns: CloSpan

*

Data Mining: Concepts and Techniques

6

7 of 20

The Apriori Property of Sequential Patterns

  • A basic property: Apriori (Agrawal & Sirkant’94)
    • If a sequence S is not frequent
    • Then none of the super-sequences of S is frequent
    • E.g, <hb> is infrequent 🡪 so do <hab> and <(ah)b>

*

Data Mining: Concepts and Techniques

7

<a(bd)bcb(ade)>

50

<(be)(ce)d>

40

<(ah)(bf)abf>

30

<(bf)(ce)b(fg)>

20

<(bd)cb(ac)>

10

Sequence

Seq. ID

Given support threshold min_sup =2

8 of 20

GSP—Generalized Sequential Pattern Mining

  • GSP (Generalized Sequential Pattern) mining algorithm
    • proposed by Agrawal and Srikanth
  • Outline of the method
    • Initially, every item in DB is a candidate of length-1
    • for each level (i.e., sequences of length-k) do
      • scan database to collect support count for each candidate sequence
      • generate candidate length-(k+1) sequences from length-k frequent sequences using Apriori
    • repeat until no frequent sequence or no candidate can be found
  • Major strength: Candidate pruning by Apriori

*

Data Mining: Concepts and Techniques

8

9 of 20

Finding Length-1 Sequential Patterns

  • Examine GSP using an example
  • Initial candidates: all singleton sequences
    • <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>
  • Scan database once, count support for candidates

*

Data Mining: Concepts and Techniques

9

<a(bd)bcb(ade)>

50

<(be)(ce)d>

40

<(ah)(bf)abf>

30

<(bf)(ce)b(fg)>

20

<(bd)cb(ac)>

10

Sequence

Seq. ID

min_sup =2

Cand

Sup

<a>

3

<b>

5

<c>

4

<d>

3

<e>

3

<f>

2

<g>

1

<h>

1

10 of 20

GSP: Generating Length-2 Candidates

*

Data Mining: Concepts and Techniques

10

<a>

<b>

<c>

<d>

<e>

<f>

<a>

<aa>

<ab>

<ac>

<ad>

<ae>

<af>

<b>

<ba>

<bb>

<bc>

<bd>

<be>

<bf>

<c>

<ca>

<cb>

<cc>

<cd>

<ce>

<cf>

<d>

<da>

<db>

<dc>

<dd>

<de>

<df>

<e>

<ea>

<eb>

<ec>

<ed>

<ee>

<ef>

<f>

<fa>

<fb>

<fc>

<fd>

<fe>

<ff>

<a>

<b>

<c>

<d>

<e>

<f>

<a>

<(ab)>

<(ac)>

<(ad)>

<(ae)>

<(af)>

<b>

<(bc)>

<(bd)>

<(be)>

<(bf)>

<c>

<(cd)>

<(ce)>

<(cf)>

<d>

<(de)>

<(df)>

<e>

<(ef)>

<f>

51 length-2

Candidates

11 of 20

The GSP Mining Process

*

Data Mining: Concepts and Techniques

11

<a> <b> <c> <d> <e> <f> <g> <h>

<aa> <ab> … <af> <ba> <bb> … <ff> <(ab)><(ef)>

<abb> <aab> <aba> <baa> <bab>

<abba> <(bd)bc> …

<(bd)cba>

1st scan: 8 cand. 6 length-1 seq. pat.

2nd scan: 51 cand. 19 length-2 seq. pat. 10 cand. not in DB at all

3rd scan: 47 cand. 19 length-3 seq. pat. 20 cand. not in DB at all

4th scan: 8 cand. 6 length-4 seq. pat.

5th scan: 1 cand. 1 length-5 seq. pat.

Cand. cannot pass sup. threshold

Cand. not in DB at all

<a(bd)bcb(ade)>

50

<(be)(ce)d>

40

<(ah)(bf)abf>

30

<(bf)(ce)b(fg)>

20

<(bd)cb(ac)>

10

Sequence

Seq. ID

min_sup =2

12 of 20

Candidate Generate-and-test: Drawbacks

  • A huge set of candidate sequences generated.
    • Especially 2-item candidate sequence.
  • Multiple Scans of database needed.
    • The length of each candidate grows by one at each database scan.
  • Inefficient for mining long sequential patterns.
    • A long pattern grow up from short patterns
    • The number of short patterns is exponential to the length of mined patterns.

*

Data Mining: Concepts and Techniques

12

13 of 20

The SPADE Algorithm

  • SPADE (Sequential PAttern Discovery using Equivalent Class) developed by Zaki 2001
  • A vertical format sequential pattern mining method
  • A sequence database is mapped to a large set of
    • Item: <SID, EID>
  • Sequential pattern mining is performed by
    • growing the subsequences (patterns) one item at a time by Apriori candidate generation

*

Data Mining: Concepts and Techniques

13

14 of 20

The SPADE Algorithm

*

Data Mining: Concepts and Techniques

14

15 of 20

Bottlenecks of GSP and SPADE

  • A huge set of candidates could be generated
    • 1,000 frequent length-1 sequences generate s huge number of length-2 candidates!
  • Multiple scans of database in mining
  • Mining long sequential patterns
    • Needs an exponential number of short candidates
    • A length-100 sequential pattern needs 1030 candidate sequences!

*

Data Mining: Concepts and Techniques

15

16 of 20

Prefix and Suffix (Projection)

  • <a>, <aa>, <a(ab)> and <a(abc)> are prefixes of sequence <a(abc)(ac)d(cf)>
  • Given sequence <a(abc)(ac)d(cf)>

*

Data Mining: Concepts and Techniques

16

Prefix

Suffix (Prefix-Based Projection)

<a>

<(abc)(ac)d(cf)>

<aa>

<(_bc)(ac)d(cf)>

<ab>

<(_c)(ac)d(cf)>

17 of 20

Mining Sequential Patterns by Prefix Projections

  • Step 1: find length-1 sequential patterns
    • <a>, <b>, <c>, <d>, <e>, <f>
  • Step 2: divide search space. The complete set of seq. pat. can be partitioned into 6 subsets:
    • The ones having prefix <a>;
    • The ones having prefix <b>;
    • The ones having prefix <f>

*

Data Mining: Concepts and Techniques

17

SID

sequence

10

<a(abc)(ac)d(cf)>

20

<(ad)c(bc)(ae)>

30

<(ef)(ab)(df)cb>

40

<eg(af)cbc>

18 of 20

Finding Seq. Patterns with Prefix <a>

  • Only need to consider projections w.r.t. <a>
    • <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>

  • Find all the length-2 seq. pat. Having prefix <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
    • Further partition into 6 subsets
      • Having prefix <aa>;
      • Having prefix <af>

*

Data Mining: Concepts and Techniques

18

SID

sequence

10

<a(abc)(ac)d(cf)>

20

<(ad)c(bc)(ae)>

30

<(ef)(ab)(df)cb>

40

<eg(af)cbc>

19 of 20

Completeness of PrefixSpan

*

Data Mining: Concepts and Techniques

19

SID

sequence

10

<a(abc)(ac)d(cf)>

20

<(ad)c(bc)(ae)>

30

<(ef)(ab)(df)cb>

40

<eg(af)cbc>

SDB

Length-1 sequential patterns

<a>, <b>, <c>, <d>, <e>, <f>

<a>-projected database

<(abc)(ac)d(cf)>

<(_d)c(bc)(ae)>

<(_b)(df)cb>

<(_f)cbc>

Length-2 sequential

patterns

<aa>, <ab>, <(ab)>,

<ac>, <ad>, <af>

Having prefix <a>

Having prefix <aa>

<aa>-proj. db

<af>-proj. db

Having prefix <af>

<b>-projected database

Having prefix <b>

Having prefix <c>, …, <f>

… …

20 of 20

Efficiency of PrefixSpan

  • No candidate sequence needs to be generated
  • Projected databases keep shrinking
  • Major cost of PrefixSpan: constructing projected databases
    • Can be improved by pseudo-projections

*

Data Mining: Concepts and Techniques

20