1 of 39

ASM: Harmonizing �Autoregressive Model, �Sampling, and �Multi-dimensional Statistics Merging �for Cardinality Estimation�

June 2024

Kyoungmin Kim (POSTECH🡪EPFL), Sangoh Lee (POSTECH), �Injung Kim (Handong Global University), Wook-Shin Han (POSTECH)

1

2 of 39

2

Kim et al., SIGMOD 18

Park et al., SIGMOD 20

Kim et al., SIGMOD 21

Kim et al., SIGMOD 22

Kim et al., PODS 23

Kim et al., SIGMOD 24

Lee et al., SIGMOD 24

Ph.D. @ POSTECH

3 of 39

Cardinality Estimation (CE)

3

Learned estimators have tried to reduce estimation errors�of traditional approaches (e.g., histograms, sampling)

Cost Model

Plan Enumeration

Query

(SPJ)

Query plan

Query optimizer

R

S

T

4 of 39

Supervised Methods

4

Large overhead of �collecting training examples

Rare support of complex filters �(string, disjunction, …) and joins�(cyclic, self, …)

Q

Features

Offline/Online

5 of 39

Unsupervised Methods

5

Statistics

based on Q

Independence assumption (merging multiple statistics)

(Kim et al., 2022)

Offline

Online

6 of 39

Unsupervised Methods

6

Q

(S)

Statistics

based on Q

Normalization (inferring small queries)

(Kim et al., 2022)

Offline

Online

7 of 39

Unsupervised Methods

7

Q

(S)

Statistics

based on Q

Normalization (inferring small queries)

(Kim et al., 2022)

Offline

Online

Focus on estimating a single query…

What about estimating �multiple sub-queries efficiently?

8 of 39

FactorJoin (Wu et al., 2023)

8

J�Single tables

Offline

Join key distributions

(conditioned on filters)

Online

Captures correlation �between join keys

9 of 39

FactorJoin (Wu et al., 2023)

9

J�Single tables

Offline

Join key distributions

(conditioned on filters)

Online

Captures correlation �between join keys

+ Dynamic programming �to estimate sub-queries efficiently

… But introducing independence back

10 of 39

Why? Merge Statistics in 1D

10

R

S

To infer join selectivity

Join key distributions

RS

11 of 39

Why? Merge Statistics in 1D

11

R

S

RS

T

Join key distributions

Reuse = copy

12 of 39

Our Solution: ASM

Autoregressive (AR) model (unsupervised) to produce accurate per-table statistics (join key distributions conditioned on selection filters)

Mathematically explain why AR outperforms more recent models, e.g., normalizing flows,�due to its special sampling-based approximation, progressive sampling (Yang et al., 2019)
Optimize the overlooked bottlenecks
Extend to support complex & disjunctive filters (and discuss more potential to be extended)

Sampling to merge the join key distributions w/o independence

Sample join key values to estimate join selectivities
Importance sampling, unbiased estimations

Multi-dimensional statistics merging to estimate multiple sub-queries efficiently using dynamic programming but still w/o independence

Sample just once and share the samples across all sub-queries
Sharing even helps preserving ranking between sub-queries 🡪 better query plans

12

13 of 39

13

Autoregressive (AR) model for�per-table statistics estimation

14 of 39

AR Model

14

A₁, A₂, . . . . . . . . . , A_n

T

15 of 39

Attribute Orders (Offline)

Intra-table order: place join keys after the remaining attributes
Inter-table order: global order for all (equivalent) join keys in the schema

15

a

c

a

b

c

b

1

2

A

B

C

D

1

2

E

F

Intra-table

a

c

a

b

c

b

1

2

A

B

C

D

1

2

E

F

Inter-table

❶

❷

B

A

D

C

F

E

16 of 39

16

Sampling for �join query statistics estimation

17 of 39

Interleaving Sampling and Inference (Online)

17

B

A

D

C

1/3

a

c

a

b

c

b

1

2

A

B

C

D

2/3

2

1

Importance Sampling

1

2

E

F

E

1/4

3/4

2

1

Q

�AR models

1/3

2/3

1/4

3/4

Infer join key distributions for ❶

1/7

6/7

Sample (e.g., D=E=1)

Element-wise

product/avg

❶

18 of 39

Interleaving Sampling and Inference (Online)

18

B

A

D

C

0

a

c

a

b

c

b

1

2

A

B

C

D

2/2

b

a

Importance Sampling

Use sampled values as conditions

1

2

E

F

E

�AR models

0

c

1/4

b

c

a

1/4

2/4

1/4

2/4

0

2/2

0

1

0

Sample (e.g., B=C=a)

Q

❷

Infer join key distributions for ❷

19 of 39

Estimating |Q| (Online)

19

a

c

a

b

c

b

1

2

A

B

C

D

1

2

E

F

A

D

F

E

B

C

Q

❶

❷

20 of 39

20

Multi-dimensional statistics merging for sub-query statistics estimation

21 of 39

Dynamic Programming over Samples (Online)

21

A

T₁

T₂


T₁T₂

22 of 39

Dynamic Programming over Samples (Online)

22

A

B

T₁

T₂

B

T₃


T₁T₂
T₁T₂T₃

23 of 39

Dynamic Programming over Samples (Online)

23

A

B

T₁

T₂

B

T₃

B

T₄


T₁T₂
T₁T₂T₃
T₁T₂T₃T₄
…

24 of 39

Advantages of Our Approach

Generally works for cyclic joins, self-joins, and even multi-key joins

No independence assumption between join keys

Indirectly preserves the ranking between sub-queries 🡪 better query plans

Because we share samples and factors across sub-queries
If we had sampled differently for each sub-query, �some sub-queries would’ve been over-estimated and some under-estimated

24

Over

Under

25 of 39

25

Experiments

26 of 39

Benchmarks

26

N: Numerical, C: Categorical, S: String matching, D: Disjunctive

Complex

Simple

27 of 39

CE Competitors

PostgreSQL (we also inject CE results into PostgreSQL)
TrueCard: oracle, exact cardinality given

CorrelatedSampling: estimate each sub-query using offline sampling

NeuroCard: AR model + one full join template
BayesCard: Bayesian network + multiple templates
DeepDB: SPN + multiple templates
FLAT: FSPN + multiple templates

FactorJoin: state-of-the-art (single-table models, 1D statistics merging)

FactorJoin (BayesCard) for STATS-CEB (ML)
FactorJoin (1% Sample) for IMDB-JOB and Stack (no ML)

FactorJoin (AR), our modification for better performance

27

Can run STATS-CEB only

28 of 39

End-to-End Performance on IMDB-JOB (113 Q’s)

28

❶ Near-optimal performance

Better

29 of 39

End-to-End Performance on Stack (200 Q’s)

29

❶ Robustness of AR for selective filters

❷ Superiority of our sampling �+ multi-dimensional statistics merging

Larger gaps due to more selective filters and challenging join patterns (large, cyclic, self-joins)

Better

30 of 39

End-to-End Performance on STATS-CEB (146 Q’s)

30

Better

31 of 39

Case Analysis for Near-/Sub-optimal Plans

31

Visualized MEMO table

Upper nodes: larger sub-queries
Node sizes: plan costs of TrueCard
Red nodes: under-estimated by ASM
Blue nodes: over-estimated by ASM

Visualized MEMO table

Blue path: chosen by TrueCard
Green path: chosen by ASM

Homogenous node sizes

Near-optimal although some nodes are red

(High tolerance of inconsistency)

Heterogenous node sizes

Sub-optimal although the ranking is preserved

(Low tolerance of accuracy)

Come and check our demo on Thursday 5-6:30pm!

32 of 39

Conclusion

Autoregressive (AR) models for per-table statistics estimation,�low-variance sampling-based approximation and large query coverage

(Importance) Sampling for unbiased join query statistics estimation

Multi-dimensional statistics merging for multiple sub-query statistics estimation w/o independence between join keys

32

33 of 39

Thank you

33

34 of 39

New Factors to Multiply

34

A

B

T₁

T₂

B

T₃

B

T₄

❶

❷


T₁T₂
T₁T₂T₃
T₁T₂T₃T₄
…

35 of 39

Summary

35

Approach	Efficient	Effective	Easy-to�-deploy	Large Coverage
Traditional	O	X	O	O
Learning-based	X		X	X
FactorJoin (SOTA)	O		O
ASM (Ours)	O	O	O	O

A

ASM

M

36 of 39

Model Size & Training Time

36

37 of 39

End-to-End Performance on STATS-CEB

37

The simplest benchmark (three queries take >80% of total time)

38 of 39

Analysis on IMDB-JOB

38

Type-I

Type-II

Type-III

Type-I

Type-II

Type-III

39 of 39

Data Updates

39