1 of 1

ASM in Action: A Fast and Practical Learned Cardinality Estimator

Sangoh Lee (POSTECH), Kyoungmin Kim (EPFL), Wook-Shin Han (POSTECH)

 

1. Autoregressive Model for per-table statistics estimation

2. Importance Sampling for join statistics estimation

3. Multi-dimensional Statistics Merging for all sub-query cardinality estimation

1. Database Setup & Issuing Query

3. Join Key Distributions and Sampling

4. Multi-dimensional Statistics Merging and Estimation Graph

Architecture of ASM

 

S

T

A

B

X

Y

A

R

A

B

S

B

T

 

 

AR model learned over table R (offline)

 

X

PR(X|X<3)

x ~ PR(X|X<3)

PR(X)

Y

A

Filter-During-Sample

PR(Y|x)

PR(Y|Y>5, x)

y ~ PR(Y|Y>5, x)

Sample Attributes

 

PR(A|x, y)

Importance distribution q(A)

Aggregate PR(A|x, y), PS(A)

Sample a ~ q(A)

q(A)

1/3

2

1

2/3

1/7

6/7

Element-wise product/avg

2

1

1/4

3/4

PR(A|x, y)

PS(A)

2

1

Importance distribution q(B|a)

Sample b ~ q(B|a)

Aggregate PS(B|a), PT(B)

Demo Scenarios

2. Hypergraph & Single-table Cardinalities

S

T

1.A

2.B

 

Subquery

Estimated Cardinality

 

 

 

 

 

 

Memo table

  • Using the statistics from the first (selectivity) and second phase (P(join key)), ASM estimates the cardinalities of all subqueries.
  • To estimate the cardinalities of thousands of subqueries, ASM exploits dynamic programming (DP).

Statistics from 1, 2

DP

The DBA starts the tour by issuing the query on the selected database, with the baseline estimators and ASM.

The DBA observes intra-/inter- table ordering of each table in the query, with the interactive join query hypergraph.

The DBA can watch how AR models have estimated the selectivity of filters.

The DBA observes how ASM inferred the join key distributions and the importance distributions.

The DBA interacts with the estimation graph, which includes a multi-dimensional statistics merging phase.

The DBA can analyze how the cost/cardinality of each subquery (node) is under-/over-estimated, and the resulting plan.

R

S

T

 

X

Y

A

A

B

B

A

B

PR(A|X<3 , Y>5)

PS(A)

PS(B|a)

PT(B)

Sample a

 

A

B

 

 

 

1

2

3

 

Cost Model

Plan Enumeration

Query Optimizer

SPJ Query

Learned CE for Query Optimization

Query Execution Plan

R

S

T

 

 

 

Learned CE

Query-driven (Supervised)

Data-driven (Unsupervised)

  • Large training data collection overhead
  • Not scalable to databases of large number of tables
  • Small query coverage, i.e., rare support for LIKE/In/Disjunctive filters (FactorJoin, NeuroCard)
  • Independence assumption between join keys (FactorJoin)
  • Estimate sub-queries independently (NeuroCard) or assume independence while utilizing dynamic programming (FactorJoin)

Also check our research paper!

Check our paper!