
Discovering and Mitigating Biases in CLIP-based Text-to-Image Generation

Md Mehrab Tanjim1, Krishna Kumar Singh2, Kushal Kafle2, Ritwik Sinha2, Garrison W. Cottrell1

1UC San Diego, 2Adobe Research


Motivation

  • CLIP (Contrastive Language-Image Pre-Training) [1], despite being trained on a large dataset, suffers from biases against protected attributes such as gender and race.
  • To the best of our knowledge, no prior study has examined these biases in CLIP-based generative tasks.
  • In this work, we first reveal the queries for which the CLIP model biases the generated images in text-to-image synthesis.
  • We also propose several ways to mitigate these biases without retraining CLIP or the underlying generative model.

Contributions

  1. We discover the biases in the CLIP model and, using StyleCLIP [2], show their negative impact on text-to-image generation.
  2. We propose several techniques to debias the generation without retraining CLIP or the generator.

Discovering Biases

Our Debiasing Framework

[Figure: Outputs of StyleCLIP for the prompt ‘A plumber’ — the original image, the plain StyleCLIP output, text-based debiasing (CLIP + Gender), and gradient-based debiasing with identity-preserving loss combinations: CLIP + ID, CLIP + LPIPS, CLIP + Gender + ID, CLIP + Gender + LPIPS, CLIP + LPIPS + ID, and CLIP + Gender + LPIPS + ID.]
 

References

[1] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.

[2] Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. StyleCLIP: Text-driven manipulation of StyleGAN imagery. In ICCV, 2021.

[3] Sandhini Agarwal, Gretchen Krueger, Jack Clark, Alec Radford, Jong Wook Kim, and Miles Brundage. Evaluating CLIP: Towards characterization of broader capabilities and downstream implications. arXiv preprint arXiv:2108.02818, 2021.

[4] Md Mehrab Tanjim, Ritwik Sinha, Krishna Kumar Singh, Sridhar Mahadevan, David Arbour, Moumita Sinha, and Garrison W. Cottrell. Generating and controlling diversity in image search. In WACV, 2022.

[5] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. ArcFace: Additive angular margin loss for deep face recognition. In CVPR, 2019.

[6] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.

  • First, we collect stock images for occupation-related queries covering the following professions [4]:

‘Plumber’, ‘Nurse’, ‘Administrative Assistant’, ‘Farmer’, ‘Security Guard’, ‘Executive Manager’, ‘Military Person’, ‘Maids & Housekeepers’.

  • We have two sets: one annotated for gender (332 images in total) and one for race (379 in total).
  • The ROC curves for the gender and race image sets show that CLIP’s ranking performance is quite low for most of the queries.
  • The misrank comparisons show that when CLIP misranks images, certain genders and races are misranked more often than others; for example, female plumbers are frequently misranked for the query ‘plumber’.
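The discovery step above can be sketched as ranking images by text-image similarity and scoring the ranking with ROC AUC, both overall and per group. The similarity scores below are hypothetical stand-ins for cosine similarities from CLIP’s encoders:

```python
# Sketch of the bias-discovery step: rank stock images by CLIP
# text-image similarity for a query, then measure ranking quality as
# ROC AUC = probability that a relevant image outranks an irrelevant
# one (ties count half). Scores below are hypothetical.

def roc_auc(pos_scores, neg_scores):
    """Rank-statistic form of the area under the ROC curve."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical CLIP similarities for the query "a photo of a plumber":
male_plumbers   = [0.31, 0.29, 0.28]   # relevant, male-presenting
female_plumbers = [0.22, 0.21]         # relevant, female-presenting
non_plumbers    = [0.25, 0.24, 0.20]   # irrelevant images

auc_all = roc_auc(male_plumbers + female_plumbers, non_plumbers)
auc_f   = roc_auc(female_plumbers, non_plumbers)  # per-group AUC

print(round(auc_all, 3), round(auc_f, 3))  # prints: 0.733 0.333
```

The low per-group AUC for the female-presenting subset illustrates the kind of disparity the misrank comparison measures.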

Related Work

  • Researchers have audited the CLIP model for various classification tasks and discovered biases [3]. These studies focus on bias with respect to classification.

  • However, CLIP is also widely used to guide generative models, so any bias in CLIP can negatively impact the generation process as well. For this reason, we focus here on discovering biases in CLIP for generative tasks.

  • For the generative model, we have chosen one of the most popular CLIP-based generators, StyleCLIP [2].

[Figure: Qualitative results — original images with StyleCLIP outputs before and after debiasing for the prompts ‘Face of a carpenter’, ‘Face of a nurse’, ‘Face of a software engineer’, and ‘Face of an administrative assistant’.]

[Figures: ROC curves for the gender and race subsets; misrank comparisons for the gender and race subsets.]

[Figure: GradCAM visualizations (original image / GradCAM heatmap pairs) for the prompts ‘An image of a plumber’ and ‘An image of a farmer’.]
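GradCAM localizes what drives CLIP’s similarity score: each channel’s activation map is weighted by the global-average-pooled gradient of the score with respect to that channel, the weighted maps are summed, and a ReLU keeps only positively contributing regions. A minimal sketch with hypothetical 2×2 maps (real maps come from CLIP’s image encoder):

```python
# Sketch of GradCAM: weight per-channel activation maps by the mean
# gradient of the text-image similarity score, sum, then ReLU.
# The 2x2 activation/gradient maps below are hypothetical.

def gradcam(activations, gradients):
    """activations/gradients: per-channel 2-D maps of equal shape."""
    heatmap = [[0.0] * len(activations[0][0]) for _ in activations[0]]
    for act, grad in zip(activations, gradients):
        # Channel weight = global average pool of that channel's gradient.
        w = sum(sum(row) for row in grad) / (len(grad) * len(grad[0]))
        for i, row in enumerate(act):
            for j, a in enumerate(row):
                heatmap[i][j] += w * a
    # ReLU: keep only regions that positively drive the similarity.
    return [[max(v, 0.0) for v in row] for row in heatmap]

acts  = [[[1.0, 0.0], [0.0, 2.0]], [[0.0, 3.0], [1.0, 0.0]]]
grads = [[[0.4, 0.4], [0.4, 0.4]], [[-0.2, -0.2], [-0.2, -0.2]]]
print(gradcam(acts, grads))
```

A high heatmap value on a face region for one gender but on the tool/background for the other is exactly the asymmetry the visualizations above expose.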

  • The GradCAM visualizations above show examples where the CLIP model correctly focuses on the instrument or background for male faces, but for female images it focuses on the face, resulting in a misrank.
  • Results from our gradient-based debiasing framework with different combinations of identity-preserving losses are shown above. Here, the text prompt for StyleCLIP is ‘A plumber’.
  • Gradient-based debiasing works better than text-based debiasing. Also, combining three of the four losses, or all four, works best.
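The loss combinations compared above can be sketched as a weighted sum over whichever terms a configuration enables: the CLIP text-image loss, a gender-balancing loss, an ArcFace identity loss [5], and an LPIPS perceptual loss [6]. The weights and loss values below are hypothetical; in practice each term is computed by its own pretrained network:

```python
# Sketch of the gradient-based debiasing objective: StyleCLIP's latent
# optimization minimizes a weighted sum of loss terms. Configurations
# like "CLIP + ID" or "CLIP + Gender + LPIPS + ID" simply enable
# different subsets of the terms. Weights/values are hypothetical.

def total_loss(terms, weights):
    """Weighted sum over the loss terms a configuration enables."""
    return sum(weights[name] * value for name, value in terms.items())

weights = {"clip": 1.0, "gender": 0.5, "id": 0.3, "lpips": 0.8}

# "CLIP + Gender + LPIPS + ID" uses all four terms:
full = total_loss(
    {"clip": 0.42, "gender": 0.10, "id": 0.05, "lpips": 0.12}, weights
)

# "CLIP + ID" drops the gender and LPIPS terms:
clip_id = total_loss({"clip": 0.42, "id": 0.05}, weights)

print(full, clip_id)
```

The identity-preserving terms (ID, LPIPS) penalize drifting away from the input face while the gender term balances the output, which matches the observation that three- and four-term combinations work best.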

Some examples of biases in the CLIP-based generative model StyleCLIP [2], along with results from our debiasing framework, are shown above.