ASM: Harmonizing Autoregressive Model, Sampling, and Multi-dimensional Statistics Merging for Cardinality Estimation
June 2024
Kyoungmin Kim (POSTECH → EPFL), Sangoh Lee (POSTECH), Injung Kim (Handong Global University), Wook-Shin Han (POSTECH)
Kim et al., SIGMOD 18
Park et al., SIGMOD 20
Kim et al., SIGMOD 21
Kim et al., SIGMOD 22
Kim et al., PODS 23
Kim et al., SIGMOD 24
Lee et al., SIGMOD 24
Lee et al., SIGMOD 24
Ph.D. @ POSTECH
Cardinality Estimation (CE)
Learned estimators have tried to reduce estimation errors of traditional approaches (e.g., histograms, sampling)
[Figure: query optimizer pipeline. An SPJ query over tables R, S, and T goes through plan enumeration and a cost model, which relies on cardinality estimates, to produce a query plan.]
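For reference, cardinality estimation predicts the result size of a query or sub-query before it is executed; with the three tables above, the selectivity is the normalized form (a standard definition, not a formula specific to this work):
\[
\mathrm{sel}(Q) \;=\; \frac{|Q|}{|R|\cdot|S|\cdot|T|}
\]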
Supervised Methods
Large overhead of collecting training examples
Rare support of complex filters (string, disjunction, …) and joins (cyclic, self, …)
[Figure: supervised CE workflow. Offline, queries are encoded into features and a model is trained on query/cardinality examples; online, the model predicts cardinalities from the features of Q.]
Unsupervised Methods
Offline: build statistics from the data; online: merge the statistics relevant to Q
Limitation: independence assumption when merging multiple statistics (Kim et al., 2022)
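To make the merging step concrete, a generic form of the independence assumption (an illustration, not the exact formula of the cited paper): for a join of filtered tables,
\[
\mathrm{sel}\big(\sigma_{F_R}(R) \Join \sigma_{F_S}(S)\big) \;\approx\; \mathrm{sel}_R(F_R)\cdot \mathrm{sel}_S(F_S)\cdot \mathrm{sel}(R \Join S),
\]
which is exact only when the filter attributes are independent of the join keys.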
Unsupervised Methods
Limitation: normalization when inferring small queries (Kim et al., 2022)
Focus on estimating a single query…
What about estimating multiple sub-queries efficiently?
FactorJoin (Wu et al., 2023)
Offline: statistics over single tables
Online: join key distributions (conditioned on filters)
+ Captures correlation between join keys
+ Dynamic programming to estimate sub-queries efficiently
… but it brings the independence assumption back
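The value-level identity behind this approach (a simplified sketch; FactorJoin itself works with bins of join key values):
\[
\big|\sigma_{F_R}(R) \Join_{R.k=S.k} \sigma_{F_S}(S)\big| \;=\; \sum_{v} c_R(v)\cdot c_S(v),
\]
where \(c_T(v)\) is the number of rows of \(T\) that pass \(T\)'s filter and have join key value \(v\); these counts are estimated from join key distributions conditioned on the filters.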
Why? Merge Statistics in 1D
To infer the join selectivity of RS, the 1D join key distributions of R and S are merged.
When the intermediate result RS is joined further with T, its join key distribution is not re-derived; it is reused, i.e., copied from the base-table statistics (reuse = copy), which reintroduces independence.
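One way to formalize "reuse = copy" (my paraphrase of the slide, not a formula from either paper): when RS is joined with T on a second key \(k'\), the 1D merge uses
\[
P_{RS}(k'=v) \;\approx\; P_{S}(k'=v),
\]
i.e., the base table's distribution of \(k'\) is copied, which is only correct if \(k'\) is independent of the first join.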
Our Solution: ASM
Autoregressive (AR) model for per-table statistics estimation
AR Model
For each table T with attributes A1, A2, …, An, ASM builds an autoregressive model that predicts each attribute conditioned on the attributes before it.
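The underlying factorization is the textbook autoregressive decomposition, written here with the slide's attribute names:
\[
P_T(A_1, A_2, \ldots, A_n) \;=\; \prod_{i=1}^{n} P_T\big(A_i \mid A_1, \ldots, A_{i-1}\big).
\]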
Attribute Orders (Offline)
[Figure: two example tables, one with attributes A, B, C, D and one with attributes E, F. ❶ Intra-table: each table's attributes are put into an order for its AR model (shown as B, A, D, C and F, E). ❷ Inter-table: the per-table orders are additionally coordinated across the joined tables.]
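A minimal sketch of how such orders could be represented and applied before training each table's AR model (my own illustration; the table names "T_left"/"T_right" and the helper are hypothetical, not ASM's code):

# Per-table attribute orders from the figure.
intra_table_order = {
    "T_left":  ["B", "A", "D", "C"],
    "T_right": ["F", "E"],
}

def reorder_columns(rows, schema, order):
    """Permute each tuple so the AR model sees attributes in the chosen order."""
    idx = [schema.index(a) for a in order]
    return [tuple(row[i] for i in idx) for row in rows]

# e.g., tuples of the left table with schema ["A", "B", "C", "D"]
# are permuted to (B, A, D, C) before training its AR model.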
Sampling for join query statistics estimation
Interleaving Sampling and Inference (Online)
[Figure: step ❶ for query Q. Each table's AR model infers a distribution over its join key attribute: (1/3, 2/3) over the values {1, 2} for D and (1/4, 3/4) for E. The two distributions are merged element-wise (product/avg); renormalizing the product gives (1/7, 6/7), which serves as the importance-sampling proposal from which a join key value is sampled, e.g., D = E = 1.]
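A minimal sketch of step ❶ under my reading of the figure (element-wise product as the merge operator; the numbers come from the figure, but the assignment of probabilities to values and the code itself are illustrative, not ASM's):

import numpy as np

# Join key distributions over the shared domain {1, 2}, inferred by the two AR models.
p_D = np.array([1/3, 2/3])   # first table's AR model, distribution over D
p_E = np.array([1/4, 3/4])   # second table's AR model, distribution over E

# Element-wise product, renormalized, used as the importance-sampling proposal.
proposal = p_D * p_E
proposal = proposal / proposal.sum()   # -> [1/7, 6/7], matching the slide

# Draw a join key value for D = E from the proposal.
rng = np.random.default_rng(0)
key = int(rng.choice([1, 2], p=proposal))   # e.g., D = E = 1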
Interleaving Sampling and Inference (Online)
[Figure: step ❷. The sampled values from step ❶ are used as conditions: each AR model infers the join key distribution for join predicate ❷ conditioned on those values, the distributions are again merged element-wise, and the next join key value is sampled, e.g., B = C = a.]
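Continuing the sketch for step ❷ (the conditional distributions below are hard-coded stand-ins for "ask the AR model, conditioned on the values sampled so far"; the numbers are illustrative only):

import numpy as np

rng = np.random.default_rng(0)
sampled = {"D": 1, "E": 1}                 # join key values drawn in step ❶

def cond_dist_left(given):                 # stand-in for the first AR model's P(B | D)
    return np.array([0.25, 0.50, 0.25])    # over the domain ['a', 'b', 'c']

def cond_dist_right(given):                # stand-in for the second AR model's P(C | E)
    return np.array([1.0, 0.0, 0.0])

p_B, p_C = cond_dist_left(sampled), cond_dist_right(sampled)
proposal = p_B * p_C
proposal = proposal / proposal.sum()
value = rng.choice(np.array(["a", "b", "c"]), p=proposal)   # e.g., B = C = 'a'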
Estimating |Q| (Online)
[Figure: the example tables and the query Q with join predicates ❶ and ❷, used to assemble the estimate of |Q| from the sampled join key values.]
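A generic importance-sampling estimate consistent with this procedure (my reconstruction of the standard form, not a formula copied from the paper): with \(m\) sampled join key assignments \(k^{(1)},\ldots,k^{(m)}\) drawn from the proposal \(q\),
\[
\widehat{|Q|} \;=\; \frac{1}{m} \sum_{s=1}^{m} \frac{\prod_{i} |T_i|\; P_{T_i}\big(F_i,\; k^{(s)}\big)}{q\big(k^{(s)}\big)},
\]
where \(P_{T_i}(F_i, k^{(s)})\) is the probability, under table \(T_i\)'s AR model, that a tuple of \(T_i\) satisfies its filters \(F_i\) and takes the sampled join key values.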
Multi-dimensional statistics merging for sub-query statistics estimation
Dynamic Programming over Samples (Online)
[Figure: the join graph built up one table at a time: T1 joins T2 on key A, then T3 and T4 join on key B. The per-sample statistics computed for |T1 ⋈ T2| are reused to estimate |T1 ⋈ T2 ⋈ T3|, then |T1 ⋈ T2 ⋈ T3 ⋈ T4|, and so on, so the cardinalities of all sub-queries are obtained in one dynamic-programming pass over the samples.]
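A toy sketch of the idea, under the assumption that each sample carries one factor per table (the data structures, names, and numbers are mine, not ASM's):

import numpy as np

# One weight per (table, sample): w[T][s] is table T's contribution for sample s,
# e.g., |T| times the AR-model probability of T's filters and the sampled join keys.
m = 4
w = {
    "T1": np.array([3.0, 1.0, 2.0, 4.0]),
    "T2": np.array([0.5, 2.0, 1.0, 1.0]),
    "T3": np.array([1.0, 1.0, 0.0, 2.0]),
    "T4": np.array([2.0, 0.5, 1.0, 1.0]),
}
inv_q = np.array([2.0, 1.0, 4.0, 1.0])   # 1 / q(sample), applied once per sample

# DP over a join order: extend the running per-sample product one table at a time,
# so every sub-query estimate falls out of the same pass over the samples.
running = np.ones(m)
estimates = {}
joined = []
for t in ["T1", "T2", "T3", "T4"]:
    running = running * w[t]
    joined.append(t)
    if len(joined) >= 2:                  # sub-queries with at least one join
        estimates[" JOIN ".join(joined)] = float(np.mean(running * inv_q))

print(estimates)   # estimates for T1⋈T2, T1⋈T2⋈T3, T1⋈T2⋈T3⋈T4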
Advantages of Our Approach
[Figure: over-estimation vs. under-estimation behavior.]
Experiments
Benchmarks
N: Numerical, C: Categorical, S: String matching, D: Disjunctive
[Table: benchmark characteristics, ranging from simple to complex.]
CE Competitors
[Table: compared CE methods; some competitors can run STATS-CEB only.]
End-to-End Performance on IMDB-JOB (113 Q’s)
❶ Near-optimal performance
[Figure: end-to-end results on IMDB-JOB.]
End-to-End Performance on Stack (200 Q’s)
❶ Robustness of AR for selective filters
❷ Superiority of our sampling + multi-dimensional statistics merging
Larger gaps due to more selective filters and challenging join patterns (large, cyclic, and self-joins)
[Figure: end-to-end results on Stack.]
End-to-End Performance on STATS-CEB (146 Q’s)
[Figure: end-to-end results on STATS-CEB.]
Case Analysis for Near-/Sub-optimal Plans
Homogeneous node sizes: near-optimal plans even though some nodes are badly estimated (shown in red); high tolerance of inconsistency
Heterogeneous node sizes: sub-optimal plans even though the ranking of node sizes is preserved; low tolerance of inaccuracy
[Figure: example query plans with per-node estimation quality.]
Come and check our demo on Thursday 5-6:30pm!
Conclusion
Thank you
New Factors to Multiply
[Figure: the same dynamic-programming example (T1 ⋈ T2 on A; T3 and T4 on B); marks ❶ and ❷ highlight the new per-sample factors that are multiplied in when a table is added to the running sub-query.]
Summary
Approach | Efficient | Effective | Easy-to-deploy | Large Coverage |
Traditional | O | X | O | O |
Learning-based | X | | X | X |
FactorJoin (SOTA) | O | | O | |
ASM (Ours) | O | O | O | O |
[Figure: the ASM acronym: A = Autoregressive model, S = Sampling, M = Multi-dimensional statistics merging.]
Model Size & Training Time
End-to-End Performance on STATS-CEB
Analysis on IMDB-JOB
[Figure: analysis on IMDB-JOB broken down into Type-I, Type-II, and Type-III.]
Data Updates