1 of 28

AutoDiff: combining Auto-encoder and Diffusion model for tabular data synthesizing

Joint work with

Xiaofeng, Darren, Merhdad, and Guang

Namjoon Suh

2 of 28

Question :

Diffusion models were not originally designed for mixed continuous and categorical (i.e., heterogeneous) features.

--> How can we use a diffusion model as a tabular data synthesizer?

High-level Idea : Auto-encoder + diffusion model

[Figure: pipeline diagram with labels Encoder, Continuous Embedding Space, Generated Data in Embedding Space, Decoder]

3 of 28

Very simple auto-encoder

06.20.23

[Figure: auto-encoder diagram with labels Input data, Encoder, Bottleneck, Decoder, and Reconstruction head (Binary, Categorical, Continuous)]

Binary: -> 0, 1

Categorical: -> integer codes

Ex) Animal: cat : 1, dog : 2, bird : 3

Continuous: Normalizing

1. With this pre-processing, the original dimension of the tabular data is preserved.

2. Sampling from the diffusion model can be fast, because the number of columns in real-world datasets is typically small.
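
A minimal sketch of the per-type pre-processing described on this slide. The toy table, the function names, and the choice of z-score standardization for "Normalizing" are assumptions made for illustration only. The point is that each column maps to exactly one column, so the table keeps its original dimension.

```python
from statistics import mean, pstdev


def encode_binary(column):
    """Map the two distinct values of a binary column to 0 and 1."""
    levels = sorted(set(column))
    if len(levels) != 2:
        raise ValueError("expected exactly two distinct values")
    return [levels.index(v) for v in column]


def encode_categorical(column):
    """Map each category to an integer code starting at 1 (order of first appearance)."""
    codes = {}
    for v in column:
        codes.setdefault(v, len(codes) + 1)
    return [codes[v] for v in column], codes


def normalize(column):
    """Assumed normalization: standardize to zero mean and unit variance."""
    mu, sigma = mean(column), pstdev(column)
    return [(v - mu) / sigma for v in column]


# Hypothetical toy table with one column of each type.
table = {
    "smoker": ["no", "yes", "no", "yes"],
    "animal": ["cat", "dog", "bird", "cat"],
    "weight": [4.0, 20.0, 0.5, 5.5],
}
animal_codes, mapping = encode_categorical(table["animal"])
processed = {
    "smoker": encode_binary(table["smoker"]),
    "animal": animal_codes,
    "weight": normalize(table["weight"]),
}
```

Three input columns give three output columns, unlike one-hot encoding, which would add one column per category.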

4 of 28

Pre-processing for numerical variables

1. Min-max scaling (Stasy)

2. Gaussian Quantile Transformation (TabDDPM)
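
A small sketch of option 1, min-max scaling, together with the inverse map needed to bring generated values back to the original scale. The function names and the example column are hypothetical.

```python
def minmax_fit(column):
    """Return the minimum and maximum of the training column."""
    lo, hi = min(column), max(column)
    if hi == lo:
        raise ValueError("constant column cannot be min-max scaled")
    return lo, hi


def minmax_transform(column, lo, hi):
    """Scale values so that the training range maps onto [0, 1]."""
    return [(v - lo) / (hi - lo) for v in column]


def minmax_inverse(scaled, lo, hi):
    """Map scaled values back to the original units."""
    return [lo + s * (hi - lo) for s in scaled]


ages = [18.0, 30.0, 42.0, 66.0]          # hypothetical numerical column
lo, hi = minmax_fit(ages)
scaled = minmax_transform(ages, lo, hi)
restored = minmax_inverse(scaled, lo, hi)
```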

5 of 28

Gaussian Quantile Transformation
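
A rough sketch of the idea: push each value through the empirical CDF of the training column, then through the inverse CDF of a standard normal. The mid-rank convention and the function names are assumptions. In practice a library routine such as scikit-learn's `QuantileTransformer(output_distribution="normal")` is the usual choice.

```python
from bisect import bisect_left
from statistics import NormalDist


def to_gaussian(column, reference):
    """Map values to standard-normal scores via the empirical CDF of `reference`."""
    n = len(reference)
    std_normal = NormalDist()
    scores = []
    for v in column:
        rank = bisect_left(reference, v)          # training values strictly below v
        u = min((rank + 0.5) / n, 1 - 0.5 / n)    # keep u strictly inside (0, 1)
        scores.append(std_normal.inv_cdf(u))
    return scores


def from_gaussian(scores, reference):
    """Map standard-normal scores back to values of the training column."""
    n = len(reference)
    std_normal = NormalDist()
    values = []
    for z in scores:
        u = std_normal.cdf(z)
        values.append(reference[min(int(u * n), n - 1)])
    return values


incomes = [1.0, 2.0, 2.5, 3.0, 50.0]      # hypothetical skewed column
reference = sorted(incomes)               # sorted training values define the CDF
z = to_gaussian(incomes, reference)
recovered = from_gaussian(z, reference)
```

The outlier 50.0 ends up as far from zero as the smallest value 1.0, so the heavy tail no longer dominates the scale.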

6 of 28

Mixed-type data

: Variables that take both continuous and discrete values.

CH20 variable in Obesity Data

Q : How to capture these types of variables?

Idea :

  1. If certain data values are repeated more often than a certain threshold (ex : 10% of the entire data points), we encode them as integers.

  2. Append the encoded vector to the original data frame, and run the auto-diffusion.

  3. Through auto-diffusion, we will have a newly generated CH20 column and its encoded column. We combine the information from these two columns. For instance…

[Figure: example table with columns "CH20 from Auto-Diff", "Encoded Column from Auto-Diff", and "Final Synthetic data"]
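
The three steps can be sketched as follows. The combination rule is an assumption made for illustration: code 0 means "keep the generated continuous value", and code k means "use the k-th repeated value". All names and numbers are hypothetical.

```python
from collections import Counter


def find_repeated_values(column, threshold=0.10):
    """Values whose relative frequency exceeds `threshold`, sorted ascending."""
    counts = Counter(column)
    n = len(column)
    return sorted(v for v, c in counts.items() if c / n > threshold)


def encode_repeated(column, repeated):
    """Integer code k >= 1 for the k-th repeated value, 0 for everything else."""
    code_of = {v: k for k, v in enumerate(repeated, start=1)}
    return [code_of.get(v, 0) for v in column]


def combine(generated_values, generated_codes, repeated):
    """Assumed rule: a nonzero code selects a repeated value, code 0 keeps the draw."""
    final = []
    for value, code in zip(generated_values, generated_codes):
        final.append(repeated[code - 1] if 1 <= code <= len(repeated) else value)
    return final


# Hypothetical real column: 1, 2 and 3 are frequent, the rest is continuous.
real = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 2.10, 1.37, 2.64, 1.85]
repeated = find_repeated_values(real, threshold=0.10)
codes = encode_repeated(real, repeated)      # column appended before training

# Hypothetical generator output for the value column and the encoded column.
generated_values = [1.12, 1.93, 2.88, 2.10]
generated_codes = [1, 2, 3, 0]
final = combine(generated_values, generated_codes, repeated)
```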

8 of 28

Another Example of Mixed-type

: Variables that take both continuous and discrete values.

Hours per week in Adult Income dataset

9 of 28

Score-based Diffusion Model

 

VP-SDE is used for perturbing the data.

The Euler-Maruyama method is used for sampling!
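
For reference, the standard VP-SDE is dx = -(1/2) beta(t) x dt + sqrt(beta(t)) dw, and sampling runs the reverse-time SDE dx = [-(1/2) beta(t) x - beta(t) * score(x, t)] dt + sqrt(beta(t)) dw backward from t = 1. The sketch below applies Euler-Maruyama steps to a toy 1-D case. The linear beta schedule, the Gaussian toy data, and the use of its exact score in place of a trained score network are all assumptions for illustration.

```python
import math
import random
from statistics import mean, pstdev

BETA_MIN, BETA_MAX = 0.1, 20.0    # assumed linear noise schedule
DATA_MEAN, DATA_STD = 2.0, 0.5    # toy 1-D data distribution N(2, 0.5^2)


def beta(t):
    return BETA_MIN + t * (BETA_MAX - BETA_MIN)


def alpha(t):
    """Signal scale exp(-0.5 * integral_0^t beta(s) ds) of the VP-SDE."""
    integral = BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t * t
    return math.exp(-0.5 * integral)


def perturb(x0, t, rng):
    """Forward marginal: x_t = alpha(t) * x0 + sqrt(1 - alpha(t)^2) * noise."""
    a = alpha(t)
    return a * x0 + math.sqrt(1.0 - a * a) * rng.gauss(0.0, 1.0)


def score(x, t):
    """Exact score of the perturbed toy distribution (stand-in for a trained network)."""
    a = alpha(t)
    var = (a * DATA_STD) ** 2 + 1.0 - a * a
    return -(x - a * DATA_MEAN) / var


def sample(rng, n_steps=200, t_end=1e-3):
    """Euler-Maruyama discretization of the reverse-time VP-SDE from t=1 to t_end."""
    dt = (1.0 - t_end) / n_steps
    x = rng.gauss(0.0, 1.0)                      # prior at t = 1 is close to N(0, 1)
    for i in range(n_steps):
        t = 1.0 - i * dt
        b = beta(t)
        drift = -0.5 * b * x - b * score(x, t)   # f(x, t) - g(t)^2 * score
        x = x - drift * dt + math.sqrt(b * dt) * rng.gauss(0.0, 1.0)
    return x


rng = random.Random(0)
samples = [sample(rng) for _ in range(1000)]
sample_mean, sample_std = mean(samples), pstdev(samples)
```

In this toy case the sample mean and standard deviation should land close to the data values 2.0 and 0.5.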

10 of 28

Score-based Diffusion Model

11 of 28

Score-based Diffusion Model

+ min-max scaler

+ Gaussian Quantile Transformation

12 of 28

TabDDPM

(SOTA diffusion-based tabular synthesizer)

  1. Uses two different diffusion models for modelling continuous / discrete variables.

  2. For each discrete variable, a separate forward process is used.

  3. Uses Gaussian Quantile Transformation for pre-processing continuous variables.

AutoGAN

  1. Modified from MedGAN, which is only applicable to discrete datasets.

  2. For pre-processing, a min-max scaler is used.

Can’t capture the correlations among features.

13 of 28

Quality of generated data : Fidelity, Utility, and Privacy

15 real-world datasets :

1. abalone (binary)

2. adult (binary)

3. Bean (regression)

4. Churn (multi-classification)

5. faults (multi-classification)

6. HTRU (binary)

7. Indian_liver_patient (binary)

…..

15. wilt (binary)

6 Models for comparisons :

1. CTGAN

2. CTABGAN+

3. TVAE

4. Stasy

5. TabDDPM

6. AutoGAN

7. Stasy-AutoDiff

8. Med-AutoDiff

9. Tab-AutoDiff

GAN-based method

Diffusion-based method

Generate 10 synthetic datasets per real dataset for each model!

-> synthesize 900 fake datasets

Our model

Same as Stasy-AutoDiff, but uses MSE error for categorical variables.

14 of 28

Fidelity test

  1. Marginal distribution

Numerical features: Wasserstein distance. Categorical features: Jensen-Shannon divergence.

Models are compared by their averaged ranking.

  2. Joint distribution

Num-Num (Pearson correlation), Cat-Cat (Theil’s U), Num-Cat (correlation ratio).

L2 distance between the correlation matrices of real and synthetic data.
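
A sketch of three of the metrics named above, written from their textbook definitions. It is not the evaluation code behind these slides. The equal-sample-size restriction for the Wasserstein distance and the base-2 logarithm for the Jensen-Shannon divergence are simplifying assumptions.

```python
import math
from collections import Counter
from statistics import mean


def wasserstein_1d(real, synthetic):
    """1-Wasserstein distance between two equal-size samples (mean gap of sorted values)."""
    if len(real) != len(synthetic):
        raise ValueError("this sketch assumes equal sample sizes")
    return mean(abs(a - b) for a, b in zip(sorted(real), sorted(synthetic)))


def js_divergence(real, synthetic):
    """Jensen-Shannon divergence (base 2, in [0, 1]) between category frequencies."""
    p, q = Counter(real), Counter(synthetic)
    n_p, n_q = len(real), len(synthetic)
    total = 0.0
    for category in set(p) | set(q):
        pi, qi = p[category] / n_p, q[category] / n_q
        mi = 0.5 * (pi + qi)
        if pi > 0:
            total += 0.5 * pi * math.log2(pi / mi)
        if qi > 0:
            total += 0.5 * qi * math.log2(qi / mi)
    return total


def correlation_ratio(categories, values):
    """Correlation ratio (eta, in [0, 1]) between a categorical and a numerical column."""
    groups = {}
    for c, v in zip(categories, values):
        groups.setdefault(c, []).append(v)
    grand = mean(values)
    between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    total = sum((v - grand) ** 2 for v in values)
    return math.sqrt(between / total) if total > 0 else 0.0
```

For example, `wasserstein_1d([0, 1, 2], [1, 2, 3])` is 1, and `js_divergence(["a"], ["b"])` is 1.0 because the two samples share no category.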

15 of 28

[Figure: correlation heatmaps, abalone dataset. Panels: AutoDiff, TabDDPM]

16 of 28

 

Utility test

17 of 28

Privacy test : Measured via Distances to Closest Records (DCR)

DCR: for each row in the synthetic table, the closest L2 distance to a row in the real table.
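
A minimal sketch of the DCR computation, assuming rows are already numeric and on comparable scales. The toy rows are hypothetical.

```python
import math


def distance_to_closest_record(synthetic_rows, real_rows):
    """For each synthetic row, the smallest L2 distance to any real row."""
    return [min(math.dist(s, r) for r in real_rows) for s in synthetic_rows]


real_rows = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0)]
synthetic_rows = [(1.0, 1.0), (4.0, 3.0)]   # the first row copies a real record
dcr = distance_to_closest_record(synthetic_rows, real_rows)
```

A DCR of exactly 0, as for the first synthetic row here, means the generator reproduced a real record.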

AutoDiff

  1. Good fidelity and utility.
  2. Memorizes some rows in the table

-> Privacy concerns

-> Future work

The higher the rank, the closer the synthetic data is to the real data distribution.

18 of 28

Extension to time series setting with dependent rows

e.g. Medical Records

Q : Can we extend the idea of AutoDiff to time series?

19 of 28

Time Series : Stock data

 

 

 

…….

20 of 28

Auto-encoder through RNN

We can obtain the latent variables via an RNN structure, and it seems to work reasonably well.
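
To make "latent variables via an RNN" concrete, here is a minimal forward pass of a plain Elman RNN encoder whose last hidden state serves as the latent vector. The architecture, the sizes, and the random untrained weights are assumptions, not the model used in this project. The decoder and the training loop are omitted.

```python
import math
import random


def init_rnn(input_size, hidden_size, rng):
    """Random (untrained) Elman RNN weights; a real model would learn these."""
    def matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    return {
        "W_in": matrix(hidden_size, input_size),
        "W_h": matrix(hidden_size, hidden_size),
        "b": [0.0] * hidden_size,
    }


def encode(sequence, params):
    """Run the RNN over the sequence and return the last hidden state as the latent."""
    hidden = [0.0] * len(params["b"])
    for x_t in sequence:
        hidden = [
            math.tanh(
                sum(w * x for w, x in zip(w_in_row, x_t))
                + sum(w * h for w, h in zip(w_h_row, hidden))
                + bias
            )
            for w_in_row, w_h_row, bias in zip(params["W_in"], params["W_h"], params["b"])
        ]
    return hidden


rng = random.Random(0)
params = init_rnn(input_size=2, hidden_size=3, rng=rng)
# Hypothetical series: 4 time steps, 2 features per step.
series = [[0.1, 0.5], [0.2, 0.4], [0.15, 0.6], [0.3, 0.55]]
latent = encode(series, params)
```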

 

21 of 28

Generate new latent var. via diffusion model

 

 

Goal : Learn the following distribution via score-based diffusion model

 

[Figure labels: 1st Feature, 2nd Feature, ..., Kth Feature]

28 of 28

Looking for a collaborator for this project.

  1. Fully committed to the project (joint first author)
  2. Familiar with / interested in diffusion models!
  3. Excited to work on time series problems, and to study with me.