1 of 39

ASM: Harmonizing �Autoregressive Model, �Sampling, and �Multi-dimensional Statistics Merging �for Cardinality Estimation�

June 2024

Kyoungmin Kim (POSTECH🡪EPFL), Sangoh Lee (POSTECH), �Injung Kim (Handong Global University), Wook-Shin Han (POSTECH)

1

2 of 39

2

Kim et al., SIGMOD 18

Park et al., SIGMOD 20

Kim et al., SIGMOD 21

Kim et al., SIGMOD 22

Kim et al., PODS 23

Kim et al., SIGMOD 24

Lee et al., SIGMOD 24

Lee et al., SIGMOD 24

Ph.D. @ POSTECH

3 of 39

Cardinality Estimation (CE)

3

Learned estimators have tried to reduce estimation errors�of traditional approaches (e.g., histograms, sampling)

 

Cost Model

Plan Enumeration

Query

(SPJ)

Query plan

Query optimizer

 

R

S

T

 

 

4 of 39

Supervised Methods

4

Large overhead of �collecting training examples

Rare support of complex filters �(string, disjunction, …) and joins�(cyclic, self, …)

Q

 

Features

Offline/Online

5 of 39

Unsupervised Methods

5

 

 

 

Statistics

based on Q

 

 

 

Independence assumption (merging multiple statistics)

(Kim et al., 2022)

Offline

Online

6 of 39

Unsupervised Methods

6

 

 

Q

(S)

Statistics

based on Q

 

 

Normalization (inferring small queries)

(Kim et al., 2022)

Offline

Online

7 of 39

Unsupervised Methods

7

 

 

Q

(S)

Statistics

based on Q

 

 

Normalization (inferring small queries)

(Kim et al., 2022)

Offline

Online

Focus on estimating a single query…

What about estimating �multiple sub-queries efficiently?

8 of 39

FactorJoin (Wu et al., 2023)

8

JSingle tables

 

Offline

 

Join key distributions

(conditioned on filters)

 

 

 

 

Online

Captures correlation �between join keys

9 of 39

FactorJoin (Wu et al., 2023)

9

JSingle tables

 

Offline

 

Join key distributions

(conditioned on filters)

 

 

 

 

Online

Captures correlation �between join keys

+ Dynamic programming �to estimate sub-queries efficiently

… But introducing independence back

10 of 39

Why? Merge Statistics in 1D

10

 

 

 

 

R

S

 

To infer join selectivity

Join key distributions

 

RS

11 of 39

Why? Merge Statistics in 1D

11

 

 

 

 

R

S

 

 

RS

T

 

Join key distributions

 

 

 

 

Reuse = copy

12 of 39

Our Solution: ASM

  • Autoregressive (AR) model (unsupervised) to produce accurate per-table statistics (join key distributions conditioned on selection filters)
    • Mathematically explain why AR outperforms more recent models, e.g., normalizing flows,�due to its special sampling-based approximation, progressive sampling (Yang et al., 2019)
    • Optimize the overlooked bottlenecks
    • Extend to support complex & disjunctive filters (and discuss more potential to be extended)

  • Sampling to merge the join key distributions w/o independence
    • Sample join key values to estimate join selectivities
    • Importance sampling, unbiased estimations

  • Multi-dimensional statistics merging to estimate multiple sub-queries efficiently using dynamic programming but still w/o independence
    • Sample just once and share the samples across all sub-queries
    • Sharing even helps preserving ranking between sub-queries 🡪 better query plans

12

13 of 39

13

Autoregressive (AR) model for�per-table statistics estimation

14 of 39

AR Model

14

A1, A2, . . . . . . . . . , An

 

T

15 of 39

Attribute Orders (Offline)

  • Intra-table order: place join keys after the remaining attributes
  • Inter-table order: global order for all (equivalent) join keys in the schema

15

a

a

c

a

b

c

b

1

1

2

A

B

C

D

1

1

1

2

E

F

Intra-table

a

a

c

a

b

c

b

1

1

2

A

B

C

D

1

1

1

2

E

F

Inter-table

B

A

D

C

F

E

16 of 39

16

Sampling for �join query statistics estimation

17 of 39

Interleaving Sampling and Inference (Online)

17

B

A

D

C

1/3

a

a

c

a

b

c

b

1

1

2

A

B

C

D

2/3

2

1

 

 

Importance Sampling

 

 

1

1

1

2

 

E

F

F

E

1/4

3/4

2

1

 

Q

�AR models

1/3

2/3

1/4

3/4

Infer join key distributions for ❶

1/7

6/7

Sample (e.g., D=E=1)

Element-wise

product/avg

18 of 39

Interleaving Sampling and Inference (Online)

18

B

A

D

C

0

a

a

c

a

b

c

b

1

1

2

A

B

C

D

2/2

b

a

Importance Sampling

 

Use sampled values as conditions

 

1

1

1

2

E

F

F

E

�AR models

0

c

1/4

b

c

a

1/4

 

2/4

1/4

1/4

2/4

0

2/2

0

0

1

0

Sample (e.g., B=C=a)

 

 

 

Q

Infer join key distributions for ❷

19 of 39

Estimating |Q| (Online)

19

a

a

c

a

b

c

b

1

1

2

A

B

C

D

1

1

1

2

E

F

 

A

D

F

E

B

C

 

 

 

 

Q

20 of 39

20

Multi-dimensional statistics merging for sub-query statistics estimation

21 of 39

Dynamic Programming over Samples (Online)

  •  

21

A

A

T1

T2

T1 T2

 

 

22 of 39

Dynamic Programming over Samples (Online)

  •  

22

A

A

B

T1

T2

B

T3

T1 T2

T1 T2 T3

 

 

 

23 of 39

Dynamic Programming over Samples (Online)

  •  

23

A

A

B

T1

T2

B

T3

B

T4

T1 T2

T1 T2 T3

T1 T2 T3 T4

 

 

 

 

24 of 39

Advantages of Our Approach

  • Generally works for cyclic joins, self-joins, and even multi-key joins

  • No independence assumption between join keys

  • Indirectly preserves the ranking between sub-queries 🡪 better query plans
    • Because we share samples and factors across sub-queries
    • If we had sampled differently for each sub-query, �some sub-queries would’ve been over-estimated and some under-estimated

24

 

 

 

Over

Under

25 of 39

25

Experiments

26 of 39

Benchmarks

26

N: Numerical, C: Categorical, S: String matching, D: Disjunctive

Complex

Simple

27 of 39

CE Competitors

  • PostgreSQL (we also inject CE results into PostgreSQL)
  • TrueCard: oracle, exact cardinality given

  • CorrelatedSampling: estimate each sub-query using offline sampling

  • NeuroCard: AR model + one full join template
  • BayesCard: Bayesian network + multiple templates
  • DeepDB: SPN + multiple templates
  • FLAT: FSPN + multiple templates

  • FactorJoin: state-of-the-art (single-table models, 1D statistics merging)
    • FactorJoin (BayesCard) for STATS-CEB (ML)
    • FactorJoin (1% Sample) for IMDB-JOB and Stack (no ML)
      • FactorJoin (AR), our modification for better performance

27

Can run STATS-CEB only

28 of 39

End-to-End Performance on IMDB-JOB (113 Q’s)

28

❶ Near-optimal performance

Better

29 of 39

End-to-End Performance on Stack (200 Q’s)

29

❶ Robustness of AR for selective filters

Superiority of our sampling �+ multi-dimensional statistics merging

Larger gaps due to more selective filters and challenging join patterns (large, cyclic, self-joins)

Better

30 of 39

End-to-End Performance on STATS-CEB (146 Q’s)

30

Better

31 of 39

Case Analysis for Near-/Sub-optimal Plans

31

  • Visualized MEMO table
    • Upper nodes: larger sub-queries
    • Node sizes: plan costs of TrueCard
    • Red nodes: under-estimated by ASM
    • Blue nodes: over-estimated by ASM
  • Visualized MEMO table
    • Blue path: chosen by TrueCard
    • Green path: chosen by ASM

Homogenous node sizes

Near-optimal although some nodes are red

(High tolerance of inconsistency)

Heterogenous node sizes

Sub-optimal although the ranking is preserved

(Low tolerance of accuracy)

Come and check our demo on Thursday 5-6:30pm!

32 of 39

Conclusion

  • Autoregressive (AR) models for per-table statistics estimation,�low-variance sampling-based approximation and large query coverage

  • (Importance) Sampling for unbiased join query statistics estimation

  • Multi-dimensional statistics merging for multiple sub-query statistics estimation w/o independence between join keys

32

33 of 39

Thank you

33

34 of 39

New Factors to Multiply

  •  

34

A

A

B

T1

T2

B

T3

B

T4

T1 T2

T1 T2 T3

T1 T2 T3 T4

 

 

 

 

35 of 39

Summary

35

Approach

Efficient

Effective

Easy-to�-deploy

Large Coverage

Traditional

O

X

O

O

Learning-based

X

X

X

FactorJoin (SOTA)

O

O

ASM (Ours)

O

O

O

O

A

ASM

M

36 of 39

Model Size & Training Time

36

37 of 39

End-to-End Performance on STATS-CEB

37

  • The simplest benchmark (three queries take >80% of total time)

38 of 39

Analysis on IMDB-JOB

38

Type-I

Type-II

Type-III

Type-I

Type-II

Type-III

39 of 39

Data Updates

39