1 of 20

Data Mining: �Concepts and Techniques� � Mining sequence patterns in transactional databases

*

Data Mining: Concepts and Techniques

1

2 of 20

Chapter 8. Mining Stream, Time-Series, and Sequence Data

Mining data streams
Mining time-series data
Mining sequence patterns in transactional databases
Mining sequence patterns in biological data

*

Data Mining: Concepts and Techniques

2

3 of 20

Sequence Databases & Sequential Patterns

Transaction databases, time-series databases vs. sequence databases
Frequent patterns vs. (frequent) sequential patterns
Applications of sequential pattern mining

Customer shopping sequences:

First buy computer, then CD-ROM, and then digital camera, within 3 months.

Medical treatments, natural disasters (e.g., earthquakes), science & eng. processes, stocks and markets, etc.
Telephone calling patterns, Weblog click streams
DNA sequences and gene structures

*

Data Mining: Concepts and Techniques

3

4 of 20

What Is Sequential Pattern Mining?

Given a set of sequences, find the complete set of frequent subsequences

*

Data Mining: Concepts and Techniques

4

A sequence database

A sequence : < (ef) (ab) (df) c b >

An element may contain a set of items.

Items within an element are unordered

and we list them alphabetically.

<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>

Given support threshold min_sup =2, <(ab)c> is a sequential pattern

SID	sequence
10	<a(abc)(ac)d(cf)>
20	<(ad)c(bc)(ae)>
30	<(ef)(ab)(df)cb>
40	<eg(af)cbc>

5 of 20

Challenges on Sequential Pattern Mining

A huge number of possible sequential patterns are hidden in databases
A mining algorithm should

find the complete set of patterns, when possible, satisfying the minimum support (frequency) threshold
be highly efficient, scalable, involving only a small number of database scans
be able to incorporate various kinds of user-specific constraints

*

Data Mining: Concepts and Techniques

5

6 of 20

Sequential Pattern Mining Algorithms

Concept introduction and an initial Apriori-like algorithm

Agrawal & Srikant. Mining sequential patterns, ICDE’95

Apriori-based method: GSP (Generalized Sequential Patterns:)
Pattern-growth methods: FreeSpan & PrefixSpan
Vertical format-based mining: SPADE
Mining closed sequential patterns: CloSpan

*

Data Mining: Concepts and Techniques

6

7 of 20

The Apriori Property of Sequential Patterns

A basic property: Apriori (Agrawal & Sirkant’94)

If a sequence S is not frequent
Then none of the super-sequences of S is frequent
E.g, <hb> is infrequent 🡪 so do <hab> and <(ah)b>

*

Data Mining: Concepts and Techniques

7

<a(bd)bcb(ade)>

50

<(be)(ce)d>

40

<(ah)(bf)abf>

30

<(bf)(ce)b(fg)>

20

<(bd)cb(ac)>

10

Sequence

Seq. ID

Given support threshold min_sup =2

8 of 20

GSP—Generalized Sequential Pattern Mining

GSP (Generalized Sequential Pattern) mining algorithm

proposed by Agrawal and Srikanth

Outline of the method

Initially, every item in DB is a candidate of length-1
for each level (i.e., sequences of length-k) do

scan database to collect support count for each candidate sequence
generate candidate length-(k+1) sequences from length-k frequent sequences using Apriori

repeat until no frequent sequence or no candidate can be found

Major strength: Candidate pruning by Apriori

*

Data Mining: Concepts and Techniques

8

9 of 20

Finding Length-1 Sequential Patterns

Examine GSP using an example
Initial candidates: all singleton sequences

<a>, , <c>, <d>, <e>, <f>, <g>, <h>

Scan database once, count support for candidates

*

Data Mining: Concepts and Techniques

9

<a(bd)bcb(ade)>

50

<(be)(ce)d>

40

<(ah)(bf)abf>

30

<(bf)(ce)b(fg)>

20

<(bd)cb(ac)>

10

Sequence

Seq. ID

min_sup =2

Cand	Sup
<a>	3
<b>	5
<c>	4
<d>	3
<e>	3
<f>	2
<g>	1
<h>	1

10 of 20

GSP: Generating Length-2 Candidates

*

Data Mining: Concepts and Techniques

10

	<a>	<b>	<c>	<d>	<e>	<f>
<a>	<aa>	<ab>	<ac>	<ad>	<ae>	<af>
<b>	<ba>	<bb>	<bc>	<bd>	<be>	<bf>
<c>	<ca>	<cb>	<cc>	<cd>	<ce>	<cf>
<d>	<da>	<db>	<dc>	<dd>	<de>	<df>
<e>	<ea>	<eb>	<ec>	<ed>	<ee>	<ef>
<f>	<fa>	<fb>	<fc>	<fd>	<fe>	<ff>

	<a>	<b>	<c>	<d>	<e>	<f>
<a>		<(ab)>	<(ac)>	<(ad)>	<(ae)>	<(af)>
<b>			<(bc)>	<(bd)>	<(be)>	<(bf)>
<c>				<(cd)>	<(ce)>	<(cf)>
<d>					<(de)>	<(df)>
<e>						<(ef)>
<f>

51 length-2

Candidates

11 of 20

The GSP Mining Process

*

Data Mining: Concepts and Techniques

11

<abb> <aab> <aba> <baa> <bab> …

<abba> <(bd)bc> …

<(bd)cba>

1^st scan: 8 cand. 6 length-1 seq. pat.

2^nd scan: 51 cand. 19 length-2 seq. pat. 10 cand. not in DB at all

3^rd scan: 47 cand. 19 length-3 seq. pat. 20 cand. not in DB at all

4^th scan: 8 cand. 6 length-4 seq. pat.

5^th scan: 1 cand. 1 length-5 seq. pat.

Cand. cannot pass sup. threshold

Cand. not in DB at all

<a(bd)bcb(ade)>

50

<(be)(ce)d>

40

<(ah)(bf)abf>

30

<(bf)(ce)b(fg)>

20

<(bd)cb(ac)>

10

Sequence

Seq. ID

min_sup =2

12 of 20

Candidate Generate-and-test: Drawbacks

A huge set of candidate sequences generated.

Especially 2-item candidate sequence.

Multiple Scans of database needed.

The length of each candidate grows by one at each database scan.

Inefficient for mining long sequential patterns.

A long pattern grow up from short patterns
The number of short patterns is exponential to the length of mined patterns.

*

Data Mining: Concepts and Techniques

12

13 of 20

The SPADE Algorithm

SPADE (Sequential PAttern Discovery using Equivalent Class) developed by Zaki 2001
A vertical format sequential pattern mining method
A sequence database is mapped to a large set of

Item: <SID, EID>

Sequential pattern mining is performed by

growing the subsequences (patterns) one item at a time by Apriori candidate generation

*

Data Mining: Concepts and Techniques

13

14 of 20

The SPADE Algorithm

*

Data Mining: Concepts and Techniques

14

15 of 20

Bottlenecks of GSP and SPADE

A huge set of candidates could be generated

1,000 frequent length-1 sequences generate s huge number of length-2 candidates!

Multiple scans of database in mining
Mining long sequential patterns

Needs an exponential number of short candidates
A length-100 sequential pattern needs 10³⁰ candidate sequences!

*

Data Mining: Concepts and Techniques

15

16 of 20

Prefix and Suffix (Projection)

<a>, <aa>, <a(ab)> and <a(abc)> are prefixes of sequence <a(abc)(ac)d(cf)>
Given sequence <a(abc)(ac)d(cf)>

*

Data Mining: Concepts and Techniques

16

Prefix	Suffix (Prefix-Based Projection)
<a>	<(abc)(ac)d(cf)>
<aa>	<(_bc)(ac)d(cf)>
<ab>	<(_c)(ac)d(cf)>

17 of 20

Mining Sequential Patterns by Prefix Projections

Step 1: find length-1 sequential patterns

<a>, , <c>, <d>, <e>, <f>

Step 2: divide search space. The complete set of seq. pat. can be partitioned into 6 subsets:

The ones having prefix <a>;
The ones having prefix ;
…
The ones having prefix <f>

*

Data Mining: Concepts and Techniques

17

SID	sequence
10	<a(abc)(ac)d(cf)>
20	<(ad)c(bc)(ae)>
30	<(ef)(ab)(df)cb>
40	<eg(af)cbc>

18 of 20

Finding Seq. Patterns with Prefix <a>

Only need to consider projections w.r.t. <a>

<a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>

Find all the length-2 seq. pat. Having prefix <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>

Further partition into 6 subsets

Having prefix <aa>;
…
Having prefix <af>

*

Data Mining: Concepts and Techniques

18

SID	sequence
10	<a(abc)(ac)d(cf)>
20	<(ad)c(bc)(ae)>
30	<(ef)(ab)(df)cb>
40	<eg(af)cbc>

19 of 20

Completeness of PrefixSpan

*

Data Mining: Concepts and Techniques

19

SID	sequence
10	<a(abc)(ac)d(cf)>
20	<(ad)c(bc)(ae)>
30	<(ef)(ab)(df)cb>
40	<eg(af)cbc>

SDB

Length-1 sequential patterns

<a>-projected database

<(abc)(ac)d(cf)>

<(_d)c(bc)(ae)>

<(_b)(df)cb>

<(_f)cbc>

Length-2 sequential

patterns

<aa>, <ab>, <(ab)>,

Having prefix <a>

Having prefix <aa>

<aa>-proj. db

…

<af>-proj. db

Having prefix <af>

-projected database

…

Having prefix

Having prefix <c>, …, <f>

… …

20 of 20

Efficiency of PrefixSpan

No candidate sequence needs to be generated
Projected databases keep shrinking
Major cost of PrefixSpan: constructing projected databases

Can be improved by pseudo-projections

*

Data Mining: Concepts and Techniques

20