1 of 28

AutoDiff: combining Auto-encoder and Diffusion model for tabular data synthesizing

Joint work with

Xiaofeng, Darren, Merhdad, and Guang

Namjoon Suh

2 of 28

Question :

Diffusion models were not originally designed for mixed continuous and categorical (i.e., heterogeneous) features.

--> How can we use a diffusion model as a tabular data synthesizer?

High-level Idea : Auto-encoder + diffusion model

[Diagram labels: Encoder, Continuous Embedding Space, Generated Data in Embedding Space, Decoder]

3 of 28

[Diagram labels: Input data, Encoder, Bottleneck, Decoder]

06.20.23

Very simple auto-encoder

Reconstruction head : Binary, Categorical, Continuous

Pre-processing :

Binary -> 0, 1

Categorical -> integers. Ex) Animal: cat : 1, dog : 2, bird : 3

Continuous -> normalizing

1. With this pre-processing, the original dimension of the tabular data is preserved.

2. Sampling in the diffusion model can be fast, because the number of columns in real-world datasets is small.
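
The encoding above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `encode_table`, the list-of-dicts table layout, and the choice to number levels in order of first appearance are all hypothetical, and continuous columns are passed through unscaled here.

```python
def encode_table(rows, categorical_columns):
    """Integer-encode binary/categorical columns; the column count is unchanged.

    rows: list of dicts (one dict per table row); hypothetical layout.
    Binary columns map to 0, 1; columns with more levels map to 1, 2, 3, ...
    Levels are numbered in order of first appearance (an assumption).
    """
    mappings = {}
    for col in categorical_columns:
        levels = list(dict.fromkeys(row[col] for row in rows))
        start = 0 if len(levels) == 2 else 1
        mappings[col] = {level: start + k for k, level in enumerate(levels)}

    encoded = [
        {col: mappings[col][value] if col in mappings else value
         for col, value in row.items()}
        for row in rows
    ]
    return encoded, mappings


rows = [
    {"animal": "cat", "smoker": "no", "weight": 4.2},
    {"animal": "dog", "smoker": "yes", "weight": 30.0},
    {"animal": "bird", "smoker": "no", "weight": 0.1},
]
encoded, mappings = encode_table(rows, ["animal", "smoker"])
```

Each encoded row still has three columns, which is the point of item 1: no one-hot expansion, so the input dimension of the auto-encoder equals the number of table columns.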

4 of 28

Pre-processing for numerical variables

1. Min-max scaling (Stasy)

2. Gaussian Quantile Transformation (TabDDPM)
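
As a rough illustration of option 1, min-max scaling maps a numerical column to [0, 1] and must be inverted after generation. The helper names below are hypothetical, and this is a minimal sketch rather than the code used by any of the cited models.

```python
def fit_min_max(values):
    """Return the (min, max) of a numerical column."""
    return min(values), max(values)


def min_max_scale(values, lo, hi):
    """Map values linearly so that lo -> 0 and hi -> 1."""
    span = hi - lo
    if span == 0:
        # Constant column: every value maps to 0 (a convention chosen here).
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


def min_max_inverse(scaled, lo, hi):
    """Map scaled values back to the original range."""
    return [lo + s * (hi - lo) for s in scaled]


lo, hi = fit_min_max([2.0, 4.0, 6.0])
scaled = min_max_scale([2.0, 4.0, 6.0], lo, hi)
restored = min_max_inverse(scaled, lo, hi)
```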

5 of 28

Gaussian Quantile Transformation

Transforming Data to a Gaussian Distribution (geostatisticslessons.com)
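
A Gaussian quantile transformation can be sketched as follows: replace each value by its empirical quantile, then push that quantile through the inverse standard normal CDF. This is a simplified rank-based version written for illustration; the function name and the (rank + 0.5) / n quantile convention are assumptions, not the exact procedure used in TabDDPM.

```python
from statistics import NormalDist


def gaussian_quantile_transform(values):
    """Map a numerical column to approximately standard normal scores."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])

    # Assign each value its rank; tied values share the average rank.
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    # (rank + 0.5) / n lies strictly inside (0, 1), so inv_cdf is finite.
    standard_normal = NormalDist()
    return [standard_normal.inv_cdf((r + 0.5) / n) for r in ranks]


z = gaussian_quantile_transform([10.0, 1.0, 100.0])
```

The transform depends only on ranks, so a heavily skewed column such as `[10.0, 1.0, 100.0]` becomes symmetric around zero.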

6 of 28

Mixed-type data

: Variables that take both continuous and discrete values.

CH20 variable in Obesity Data

Q : How can we capture these types of variables?

Idea :

If certain data values are repeated more often than a given threshold (e.g., 10% of all data points), we encode them as integers.

Append the encoded vector to the original data frame, and run the auto-diffusion.

Through auto-diffusion, we obtain a newly generated CH20 column and its encoded column. We combine the information from these two columns. For instance…

[Figure: example table with columns "CH20 from Auto-Diff", "Encoded Column from Auto-Diff", and "Final Synthetic data"]
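
One way to read the idea above is sketched below. The slide does not spell out the combination rule, so the rule used here is an assumption: code 0 means "keep the generated continuous value", and any other code is mapped back to the repeated value it stands for. The function names and the strict "more than threshold" comparison are also hypothetical.

```python
from collections import Counter


def encode_repeated_values(column, threshold=0.10):
    """Integer-encode values whose frequency exceeds `threshold`.

    Returns the encoded column and the value -> code mapping.
    Code 0 is reserved for all remaining (continuous) values.
    """
    n = len(column)
    counts = Counter(column)
    frequent = sorted(v for v, c in counts.items() if c / n > threshold)
    code_of = {value: k + 1 for k, value in enumerate(frequent)}
    encoded = [code_of.get(v, 0) for v in column]
    return encoded, code_of


def combine(generated_values, generated_codes, code_of):
    """Merge a generated column with its generated encoded column.

    Assumed rule: a known non-zero code overrides the continuous value.
    """
    value_of = {code: value for value, code in code_of.items()}
    return [value_of.get(code, x)
            for x, code in zip(generated_values, generated_codes)]


column = [1.0] * 3 + [2.0] * 3 + [3.0] * 2 + [1.37, 2.10]
encoded, code_of = encode_repeated_values(column)
final = combine([2.10, 1.2, 2.9], [0, 1, 3], code_of)
```

In this toy column the values 1.0, 2.0 and 3.0 each cover more than 10% of the rows and receive codes 1, 2 and 3, while 1.37 and 2.10 receive code 0.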

8 of 28

Another Example of Mixed-type

: Variables that take both continuous and discrete values.

Hours per week in Adult Income dataset

9 of 28

Score-based Diffusion Model

The Euler-Maruyama method is used for sampling.

See “Score-based generative modeling through stochastic differential equations” (Song et al., 2020).

The VP-SDE is used for perturbing the data.
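
To make the sampling step concrete, here is a toy one-dimensional sketch of Euler-Maruyama applied to the reverse-time VP-SDE. The forward VP-SDE is dx = -0.5 β(t) x dt + sqrt(β(t)) dw, as in Song et al. Everything else is an assumption made for illustration: the linear β schedule, the Gaussian "data" distribution N(2, 0.5²), and above all the analytic score, which stands in for the trained score network.

```python
import math
import random

BETA_MIN, BETA_MAX = 0.1, 20.0  # assumed linear noise schedule
MU, SIGMA = 2.0, 0.5            # toy data distribution N(MU, SIGMA^2)


def beta(t):
    return BETA_MIN + t * (BETA_MAX - BETA_MIN)


def alpha(t):
    """exp(-0.5 * integral of beta from 0 to t): the mean shrink factor."""
    integral = BETA_MIN * t + 0.5 * t * t * (BETA_MAX - BETA_MIN)
    return math.exp(-0.5 * integral)


def score(x, t):
    """Analytic score of the perturbed toy distribution at time t.

    In a real model this is replaced by a trained score network.
    """
    a = alpha(t)
    mean = MU * a
    var = SIGMA ** 2 * a ** 2 + (1.0 - a ** 2)
    return -(x - mean) / var


def euler_maruyama_sample(rng, n_steps=300, t_end=1e-3):
    """Integrate the reverse-time SDE from t = 1 down to t = t_end."""
    x = rng.gauss(0.0, 1.0)  # prior sample: p_1 is close to N(0, 1)
    dt = (1.0 - t_end) / n_steps
    t = 1.0
    for _ in range(n_steps):
        b = beta(t)
        drift = -0.5 * b * x - b * score(x, t)  # f(x, t) - g(t)^2 * score
        x = x - drift * dt + math.sqrt(b * dt) * rng.gauss(0.0, 1.0)
        t -= dt
    return x
```

Drawing many samples with a shared `random.Random` instance should give values whose mean and spread are close to `MU` and `SIGMA`, up to discretization and sampling error.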

10 of 28

Score-based Diffusion Model

11 of 28

Score-based Diffusion Model

+ min-max scaler

+ Gaussian Quantile Transformation

12 of 28

TabDDPM

(SOTA diffusion-based tabular synthesizer)

1. Uses two different diffusion models for modelling continuous / discrete variables.

For each discrete variable, a separate forward process is used.

Gaussian Quantile Transformation is used for pre-processing continuous variables.

AutoGAN

Modified from MedGAN, which is only applicable to discrete datasets.

For pre-processing, a min-max scaler is used.

Can’t capture the correlations among features.

13 of 28

Quality of generated data : Fidelity, Utility, and Privacy

15 real-world datasets :

1. abalone (binary)

2. adult (binary)

3. Bean (regression)

4. Churn (multi-classification)

5. faults (multi-classification)

6. HTRU (binary)

7. Indian_liver_patient (binary)

…..

15. wilt (binary)

6 models for comparison :

1. CTGAN

2. CTABGAN+

3. TVAE

4. Stasy

5. TabDDPM

6. AutoGAN

7. Stasy-AutoDiff

8. Med-AutoDiff

9. Tab-AutoDiff

GAN-based method

Diffusion-based method

Generate 10 synthetic datasets per real dataset for each model.

-> 900 synthetic datasets in total (15 datasets x 6 models x 10)

Our model

Same as Stasy-AutoDiff, but uses MSE loss for categorical variables.

14 of 28

Fidelity test

Marginal distribution

Numerical features : Wasserstein distance. Categorical features : Jensen-Shannon divergence.

Rankings are averaged across the models.
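
The two marginal metrics can be sketched with the standard library alone. This is a minimal illustration, not the evaluation code behind the reported rankings: the Wasserstein function assumes both samples have the same size, and the Jensen-Shannon divergence uses log base 2 so that it lies in [0, 1]. Function names are hypothetical.

```python
import math
from collections import Counter


def wasserstein_1d(real, synthetic):
    """1-Wasserstein distance between two equal-size empirical samples."""
    if len(real) != len(synthetic):
        raise ValueError("this sketch assumes samples of equal size")
    pairs = zip(sorted(real), sorted(synthetic))
    return sum(abs(a - b) for a, b in pairs) / len(real)


def jensen_shannon_divergence(real, synthetic):
    """Jensen-Shannon divergence (base 2) between category frequencies."""
    p_counts, q_counts = Counter(real), Counter(synthetic)
    total = 0.0
    for level in set(p_counts) | set(q_counts):
        p = p_counts[level] / len(real)
        q = q_counts[level] / len(synthetic)
        m = 0.5 * (p + q)
        if p > 0:
            total += 0.5 * p * math.log2(p / m)
        if q > 0:
            total += 0.5 * q * math.log2(q / m)
    return total
```

Both metrics are 0 when the real and synthetic columns have identical empirical distributions, and smaller values indicate better fidelity.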

Joint distribution

Num-Num (Pearson correlation), Cat-Cat (Theil’s U), Num-Cat (correlation ratio)

L2 distance between the correlation matrices of the real and synthetic data.
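
For the numerical-numerical part, the comparison can be sketched as below. Only Pearson correlation is covered (Theil's U and the correlation ratio are omitted), and reading "L2 distance" as the Frobenius norm of the matrix difference is an assumption. Columns are passed as a list of equal-length lists, and the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numerical columns."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def correlation_matrix(columns):
    return [[pearson(a, b) for b in columns] for a in columns]


def correlation_l2_distance(real_columns, synthetic_columns):
    """Frobenius norm of the difference between the two correlation matrices."""
    real = correlation_matrix(real_columns)
    synth = correlation_matrix(synthetic_columns)
    size = len(real)
    return math.sqrt(sum((real[i][j] - synth[i][j]) ** 2
                         for i in range(size) for j in range(size)))


real = [[1, 2, 3, 4], [2, 4, 6, 8]]
flipped = [[1, 2, 3, 4], [8, 6, 4, 2]]
```

Here `flipped` reverses the sign of the correlation between the two columns, so both off-diagonal entries differ by 2 and the distance is sqrt(8).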

15 of 28

[Figure: correlation heatmaps on the abalone dataset. Panels: AutoDiff, TabDDPM]

16 of 28

Utility test

17 of 28

Privacy test : Measured via Distance to Closest Record (DCR)

DCR : for each row in the synthetic table, the closest L2 distance to a row in the real table.
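
A brute-force DCR computation is short enough to sketch directly. Rows are assumed to be already encoded as numerical tuples of equal length, and the function name is hypothetical.

```python
import math


def distance_to_closest_record(synthetic_rows, real_rows):
    """For each synthetic row, the L2 distance to its nearest real row."""
    return [min(math.dist(s, r) for r in real_rows) for s in synthetic_rows]


real_rows = [(0.0, 0.0), (3.0, 4.0)]
synthetic_rows = [(0.0, 0.0), (3.0, 0.0)]
dcr = distance_to_closest_record(synthetic_rows, real_rows)
```

A DCR of exactly 0, as for the first synthetic row here, means the synthesizer reproduced a real record, which is the memorization concern for privacy.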

AutoDiff

Good fidelity and utility, but it memorizes some rows in the table.

-> Privacy concerns

-> Future work

The higher the rank, the closer the synthetic data is to the real data distribution.

18 of 28

Extension to time series setting with dependent rows

e.g. Medical Records

Q : Can we extend the idea of AutoDiff to time series?

19 of 28

Time Series : Stock data

…….

20 of 28

Auto-encoder through RNN

We can obtain the latent variables via an RNN structure, and it seems to work reasonably well.

21 of 28

Generate new latent var. via diffusion model

Goal : Learn the following distribution via score-based diffusion model

1st Feature

2nd Feature

Kth Feature

28 of 28

Looking for a collaborator for this project.

Fully committed to the project (joint first authorship)
Familiar with or interested in diffusion models
Excited to work on time series problems and to study them with me