AutoDiff: Combining Auto-encoder and Diffusion Model for Tabular Data Synthesis
Joint work with
Xiaofeng, Darren, Merhdad, and Guang
Namjoon Suh
Question :
Diffusion models were not originally designed for mixed continuous and categorical (“cont+cate”, i.e., heterogeneous) features.
--> How can a diffusion model be used as a tabular data synthesizer?
High-level Idea : Auto-encoder + diffusion model
[Figure: Input data -> Encoder -> Continuous Embedding Space (Bottleneck) -> Decoder; Generated Data in Embedding Space -> Decoder]
06.20.23
Very simple auto-encoder
Reconstruction head
Binary
Categorical
Continuous
Binary -> 0, 1
Categorical -> integers. Ex) Animal: cat : 1, dog : 2, bird : 3
Continuous -> normalizing
1. Using this pre-processing, we can preserve the original dimension of tabular data.
2. Sampling in the diffusion model can be fast, since the number of columns in real-world datasets is typically small.
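The encoding above can be sketched in a few lines. This is a minimal illustration, not the AutoDiff code: the function names `fit_category_codes` and `encode_table` are hypothetical, codes are assigned in order of first appearance (matching the cat : 1, dog : 2, bird : 3 example), and continuous columns are passed through unchanged because they are normalized in a separate step.

```python
def fit_category_codes(values):
    """Map each distinct category to 1, 2, 3, ... in order of first appearance."""
    codes = {}
    for v in values:
        if v not in codes:
            codes[v] = len(codes) + 1
    return codes


def encode_table(rows, column_types):
    """Encode rows column by column.

    column_types[j] is 'binary', 'categorical', or 'continuous' (hypothetical labels).
    Returns (encoded_rows, mappings); every encoded row keeps the original width.
    """
    n_cols = len(column_types)
    mappings = {}
    for j, kind in enumerate(column_types):
        column = [row[j] for row in rows]
        if kind == "categorical":
            mappings[j] = fit_category_codes(column)
        elif kind == "binary":
            codes = fit_category_codes(column)
            if len(codes) > 2:
                raise ValueError(f"column {j} is not binary")
            # Shift codes 1, 2 down to 0, 1.
            mappings[j] = {v: c - 1 for v, c in codes.items()}
        elif kind != "continuous":
            raise ValueError(f"unknown column type: {kind}")

    encoded = []
    for row in rows:
        encoded.append([
            mappings[j][row[j]] if j in mappings else float(row[j])
            for j in range(n_cols)
        ])
    return encoded, mappings
```

Because each column maps to exactly one encoded column (unlike one-hot encoding), the encoded table has the same number of columns as the input.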
Pre-processing for numerical variables
1. Min-max scaling (Stasy)
2. Gaussian Quantile Transformation (TabDDPM)
Gaussian Quantile Transformation
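To make the two options concrete, here is a small sketch of both transforms. It is a simplified, rank-based version of the Gaussian quantile transformation written with the Python standard library only; it is an assumption about the general technique and does not reproduce the exact pre-processing used in Stasy or TabDDPM. The function names are hypothetical.

```python
from statistics import NormalDist


def min_max_scale(values):
    """Linearly rescale values to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def gaussian_quantile_transform(values):
    """Map each value to the standard-normal quantile of its empirical rank.

    Tied values share their average rank, so they map to the same output.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    std_normal = NormalDist()
    # (rank - 0.5) / n lies strictly inside (0, 1), so inv_cdf is always defined.
    return [std_normal.inv_cdf((r - 0.5) / n) for r in ranks]
```

Min-max scaling keeps the shape of the original distribution (including skew and outliers), while the quantile transformation reshapes the marginal toward a standard normal.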
Mixed-type data
: Variables that take both continuous and discrete values.
CH20 variable in Obesity Data
Q : How to capture these types of variables ?
Idea :
(ex : 10% of the entire data points), we encode them through integers.
diffusion.
columns. We combine the information from these two columns. For instance…
[Example table: columns "CH20 from Auto-Diff", "Encoded Column from Auto-Diff", and "Final Synthetic data"; encoded values 0, 1, 2, 3; example continuous value 2.10]
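The bullets on this slide are only partly preserved, so the following is one possible reading of the idea, not the authors' implementation. Assumptions: a value counts as "discrete" when its relative frequency reaches a threshold (10% in the slide's example), discrete values get integer codes 1, 2, 3, ..., code 0 marks a row whose value is continuous, and the final column takes the discrete value when the code is nonzero and the continuous value otherwise. The names `split_mixed_column` and `combine_columns` are hypothetical.

```python
from collections import Counter


def split_mixed_column(values, threshold=0.10):
    """Build the integer-encoded companion column for a mixed-type variable.

    Returns (codes, code_to_value). Code 0 means "continuous value" (assumption).
    """
    n = len(values)
    counts = Counter(values)
    frequent = sorted(v for v, c in counts.items() if c / n >= threshold)
    value_to_code = {v: i + 1 for i, v in enumerate(frequent)}
    codes = [value_to_code.get(v, 0) for v in values]
    code_to_value = {code: v for v, code in value_to_code.items()}
    return codes, code_to_value


def combine_columns(continuous_col, code_col, code_to_value):
    """Merge a generated continuous column and a generated code column.

    A known nonzero code overrides the continuous value; code 0 keeps it.
    """
    return [code_to_value.get(c, x) for x, c in zip(continuous_col, code_col)]
```

For instance, with codes {1: 1.0, 2: 2.0, 3: 3.0}, a generated pair (2.10, code 0) stays 2.10, while a pair with code 1 becomes 1.0.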
Another Example of Mixed-type
: Variables that take both continuous and discrete values.
Hours per week in Adult Income dataset
Score-based Diffusion Model
The Euler-Maruyama method is used for sampling!
See “Score-based generative modeling through stochastic differential equations” (Song et al., 2020)
VP-SDE is used for perturbing data
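As a toy illustration of these two ingredients, the sketch below perturbs data with a VP-SDE and samples from the reverse-time SDE with Euler-Maruyama steps. Everything specific here is an assumption made for illustration: the linear noise schedule with `BETA_MIN = 0.1` and `BETA_MAX = 20.0`, the one-dimensional setting, and above all the score function, which is the exact score for Gaussian data and stands in for the trained score network used in practice.

```python
import math
import random

BETA_MIN, BETA_MAX = 0.1, 20.0  # assumed linear noise schedule on t in [0, 1]


def beta(t):
    return BETA_MIN + t * (BETA_MAX - BETA_MIN)


def alpha(t):
    """Mean scaling of the VP-SDE: exp(-0.5 * integral_0^t beta(s) ds)."""
    integral = BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t * t
    return math.exp(-0.5 * integral)


def perturb(x0, t, rng):
    """Forward perturbation: x_t | x_0 ~ N(alpha(t) * x0, 1 - alpha(t)^2)."""
    a = alpha(t)
    return a * x0 + math.sqrt(1.0 - a * a) * rng.gauss(0.0, 1.0)


def gaussian_score(x, t, mu, sigma):
    """Exact score of p_t when the data are N(mu, sigma^2) (toy stand-in for a network)."""
    a = alpha(t)
    var = a * a * sigma * sigma + 1.0 - a * a
    return -(x - a * mu) / var


def euler_maruyama_sample(score_fn, rng, n_steps=500, t_end=1e-3):
    """Integrate the reverse-time VP-SDE from t = 1 down to t = t_end."""
    x = rng.gauss(0.0, 1.0)  # prior sample at t = 1
    dt = (1.0 - t_end) / n_steps
    t = 1.0
    for _ in range(n_steps):
        b = beta(t)
        # Reverse drift: f(x, t) - g(t)^2 * score, with f = -0.5 * b * x and g^2 = b.
        drift = -0.5 * b * x - b * score_fn(x, t)
        x = x - drift * dt + math.sqrt(b * dt) * rng.gauss(0.0, 1.0)
        t -= dt
    return x
```

With `score_fn = lambda x, t: gaussian_score(x, t, 2.0, 0.5)`, repeated calls to `euler_maruyama_sample` produce samples whose mean and spread are close to the toy data distribution.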
Score-based Diffusion Model + min-max scaler
Score-based Diffusion Model + Gaussian Quantile Transformation
TabDDPM
(SOTA diffusion-based tabular synthesizer)
1. Uses two different diffusion models for modelling continuous and discrete variables.
forward process.
pre-processing continuous variable.
AutoGAN
to discrete dataset.
Can’t capture the correlations among features.
Quality of generated data : Fidelity, Utility, and Privacy
15 real-world datasets:
1. abalone (binary)
2. adult (binary)
3. Bean (regression)
4. Churn (multi-classification)
5. faults (multi-classification)
6. HTRU (binary)
7. Indian_liver_patient (binary)
…..
15. wilt (binary)
6 Models for comparisons :
1. CTGAN
2. CTABGAN+
3. TVAE
4. Stasy
5. TabDDPM
6. AutoGAN
7. Stasy-AutoDiff
8. Med-AutoDiff
9. Tab-AutoDiff
GAN-based method
Diffusion-based method
Generate 10 synthetic datasets per real dataset for each model!
-> 900 synthetic datasets in total
Our model
Same as Stasy-AutoDiff, but uses MSE loss for categorical variables.
Fidelity test
Numerical features: Wasserstein distance; categorical features: Jensen-Shannon divergence
Averaged ranking across the models.
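For reference, both per-column fidelity metrics can be computed with the standard library. This is a minimal sketch under stated assumptions: the Wasserstein distance is the 1-Wasserstein distance between two equal-size samples, and the Jensen-Shannon value is the divergence with base-2 logarithms (range 0 to 1), not its square root. The slides do not say which variants were used, and the function names are hypothetical.

```python
import math
from collections import Counter


def wasserstein_1d(real, synth):
    """1-Wasserstein distance between two equal-size one-dimensional samples."""
    if len(real) != len(synth):
        raise ValueError("this sketch assumes equal sample sizes")
    pairs = zip(sorted(real), sorted(synth))
    return sum(abs(a - b) for a, b in pairs) / len(real)


def js_divergence(real, synth):
    """Jensen-Shannon divergence (log base 2) between category frequencies."""
    p_counts, q_counts = Counter(real), Counter(synth)
    total = 0.0
    for category in set(p_counts) | set(q_counts):
        p = p_counts[category] / len(real)
        q = q_counts[category] / len(synth)
        m = 0.5 * (p + q)
        if p > 0:
            total += 0.5 * p * math.log2(p / m)
        if q > 0:
            total += 0.5 * q * math.log2(q / m)
    return total
```

Both values are 0 when the synthetic column matches the real column exactly, and larger values indicate a worse match.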
Num-Num (Pearson Corr), Cat-Cat (Theil’s U), Num-Cat (Correl Ratio)
L2-distance between correlation matrices of real and synthetic data.
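The three pairwise association measures and the matrix distance can be sketched as follows. Assumptions: Theil's U is computed as U(x | y) = (H(x) - H(x | y)) / H(x), the correlation ratio is the usual eta (square root of between-group over total variance), and the "L2-distance" between matrices is taken to be the Frobenius norm of their difference. The slides do not specify these details, and the function names are hypothetical.

```python
import math
from collections import Counter


def pearson(x, y):
    """Pearson correlation for a numerical-numerical pair."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def theils_u(x, y):
    """Theil's U for a categorical-categorical pair: how much y tells us about x."""
    h_x = entropy(x)
    if h_x == 0:
        return 1.0
    n = len(x)
    y_counts = Counter(y)
    h_x_given_y = -sum(
        (c / n) * math.log(c / y_counts[yv])
        for (_, yv), c in Counter(zip(x, y)).items()
    )
    return (h_x - h_x_given_y) / h_x


def correlation_ratio(categories, values):
    """Correlation ratio (eta) for a numerical-categorical pair."""
    grand_mean = sum(values) / len(values)
    groups = {}
    for c, v in zip(categories, values):
        groups.setdefault(c, []).append(v)
    between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values())
    total = sum((v - grand_mean) ** 2 for v in values)
    return math.sqrt(between / total) if total > 0 else 0.0


def matrix_l2_distance(a, b):
    """Frobenius norm of the difference between two equally shaped matrices."""
    return math.sqrt(sum(
        (x - y) ** 2 for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)
    ))
```

Filling one matrix with these measures for the real data and one for the synthetic data, then calling `matrix_l2_distance` on the pair, gives a single number per synthetic dataset.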
[Figure: correlation heatmaps, abalone dataset; panels: AutoDiff, TabDDPM]
Utility test
Privacy test : Measured via Distances to Closest Records (DCR)
[Figure: DCR is the closest L2 distance between synthetic and real records, computed for each row in the table; DCR shown for AutoDiff]
-> Privacy concerns
-> Future work
The higher the rank, the closer the synthetic data is to the real data distribution.
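The DCR metric from the privacy test is short enough to write out. This sketch assumes the common convention that, for each synthetic row, DCR is the L2 distance to the nearest real row, and that all columns are already numeric and on comparable scales. The function name is hypothetical.

```python
import math


def distances_to_closest_records(synthetic_rows, real_rows):
    """For each synthetic row, the L2 distance to its nearest real row."""
    return [
        min(math.dist(s, r) for r in real_rows)
        for s in synthetic_rows
    ]
```

A DCR of 0 means a synthetic row is an exact copy of a real row, so many very small DCR values suggest the synthesizer may be memorizing training records.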
Extension to time series setting with dependent rows
e.g. Medical Records
Q : Can we extend the idea of AutoDiff to time series?
Time Series : Stock data
…….
Auto-encoder through RNN
We can obtain the latent variables with an RNN structure, and it seems to work reasonably well.
Generate new latent var. via diffusion model
Goal : Learn the following distribution via score-based diffusion model
1st Feature
2nd Feature
Kth Feature
Looking for a collaborator for this project.