
Demographic Disparities in Zero-Shot Vision Tasks

Overview

We study whether zero-shot image-text models perform differently across demographic groups in their intended use case of general object recognition (evaluating CLIP[1]) and, if so, whether they propagate such disparities to downstream applications (evaluating Detic[2] for zero-shot detection and LSeg[3] for zero-shot segmentation).
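
Below is a minimal sketch of the zero-shot recognition setup evaluated for CLIP, written against the public openai/CLIP package; the prompt template, concept list, and image path are illustrative assumptions rather than our exact evaluation pipeline. Detic and LSeg consume CLIP text embeddings in a similar way to name concepts for detection and segmentation.

    # Minimal zero-shot recognition sketch with the public CLIP package.
    # Concept names, prompt template, and image path are illustrative only.
    import torch
    import clip  # https://github.com/openai/CLIP
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    concepts = ["necktie", "handbag", "baseball bat"]  # candidate object classes
    text = clip.tokenize([f"a photo of a {c}" for c in concepts]).to(device)
    image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

    with torch.no_grad():
        image_features = model.encode_image(image)
        text_features = model.encode_text(text)
        image_features /= image_features.norm(dim=-1, keepdim=True)
        text_features /= text_features.norm(dim=-1, keepdim=True)
        # One score per concept; the most similar prompt is the prediction.
        probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)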

We use a pair of disparity indicators for zero-shot vision tasks (a short computational sketch follows the list):

  • Disparity in outcomes by group, measured with Average Precision (AP)
  • Disparity in treatment by group, measured with Expected Calibration Error (ECE)[4]
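
A minimal sketch of how both indicators can be computed per group is shown below; the grouping variable, the 15-bin calibration binning, and the array names are assumptions made for illustration rather than a description of our exact evaluation code. The disparity indicators are the per-group gaps: the AP gap captures differences in outcomes, and the ECE gap captures differences in treatment.

    # Sketch: per-group AP (outcomes) and per-group ECE (treatment).
    # Array names and the 15-bin ECE are illustrative assumptions.
    import numpy as np
    from sklearn.metrics import average_precision_score

    def expected_calibration_error(confidence, correct, n_bins=15):
        """Binned ECE: |accuracy - confidence| per bin, weighted by bin size."""
        edges = np.linspace(0.0, 1.0, n_bins + 1)
        ece = 0.0
        for lo, hi in zip(edges[:-1], edges[1:]):
            in_bin = (confidence > lo) & (confidence <= hi)
            if in_bin.any():
                gap = abs(correct[in_bin].mean() - confidence[in_bin].mean())
                ece += in_bin.mean() * gap
        return ece

    def per_group_indicators(y_true, y_score, correct, confidence, groups):
        """Returns {group: {"AP": ..., "ECE": ...}}; disparities are the gaps between groups."""
        report = {}
        for g in np.unique(groups):
            m = groups == g
            report[g] = {
                "AP": average_precision_score(y_true[m], y_score[m]),
                "ECE": expected_calibration_error(confidence[m], correct[m]),
            }
        return report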

We find:

  • A similar pattern of concepts with unequal outcomes by gender exists across the models and across multiple evaluation datasets.
  • Models that use CLIP display a pattern of differential treatment of similar samples from different demographic groups.
  • These disparities are aligned with existing social biases.

Social Disparity Indicators

We observe a pattern of differential outcomes for many concepts between gender-annotated groups across all models.

Evaluation Dataset

Visual Genome[5]

  • 25K images with multi-class, hand-labeled object synsets from WordNet[6]. For example, "bat.n.01" (the animal) and "bat.n.05" (the sports equipment).
  • Binary gender annotations based on "man"- and "woman"-related synset terms, like "brother.n.01" and "wife.n.01" (see the sketch below).
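
The sketch below illustrates this annotation rule with hypothetical, non-exhaustive lists of gendered synset names; the exact term lists and the handling of ambiguous images in our evaluation are not reproduced here.

    # Sketch: binary gender annotation of a Visual Genome image from its object
    # synsets. The seed lists below are hypothetical and non-exhaustive.
    MAN_TERMS = {"man.n.01", "brother.n.01", "husband.n.01", "father.n.01"}
    WOMAN_TERMS = {"woman.n.01", "sister.n.01", "wife.n.01", "mother.n.01"}

    def gender_annotation(image_synsets):
        """image_synsets: set of synset names labeled in the image, e.g. {"bat.n.05", "wife.n.01"}."""
        labels = set(image_synsets)
        has_man, has_woman = bool(labels & MAN_TERMS), bool(labels & WOMAN_TERMS)
        if has_man and not has_woman:
            return "man-annotated"
        if has_woman and not has_man:
            return "woman-annotated"
        return None  # no gendered terms, or ambiguous (handling of such images is an assumption here)

    assert gender_annotation({"bat.n.05", "wife.n.01"}) == "woman-annotated"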

Limitations of how we define gender:

  • Binarization of gender is reductive and excludes other genders
  • Relies on annotators' inherent perception of gender, which can lead to misgendering
  • Static, external annotations based on visual representation are misaligned with an inclusive operationalization of gender[7]
  • People depicted in datasets deserve the agency to share and update their gender information throughout the dataset's lifespan

Additional Findings

We observe a pattern of differential treatment for many concepts between gender-annotated groups for Detic and LSeg.

Aggregate metrics like mean AP can obscure notable disparities between groups (a small sketch of this effect follows the list below).

Error analyses show several potential root causes of the observed disparities beyond model behavior:

  • Definitions of concepts can differ between groups (ex: "necktie.n.01")
  • Salience of object information changes between groups (ex: "hair.n.01")
  • Distributions of samples and correlating concepts vary between groups
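
To make the aggregation point concrete, the sketch below shows, under assumed array names and data layout, how a pooled mean AP can be computed alongside per-group mean APs whose gap it hides; it illustrates the failure mode rather than our evaluation code.

    # Sketch: pooled mean AP vs. per-group mean AP over a set of concepts.
    # Variable names and data layout are illustrative assumptions.
    import numpy as np
    from sklearn.metrics import average_precision_score

    def mean_ap(scores, labels, mask=None):
        """scores/labels: dict concept -> per-sample arrays; mask selects a subgroup."""
        aps = []
        for c in scores:
            y, s = labels[c], scores[c]
            if mask is not None:
                y, s = y[mask], s[mask]
            if y.any():  # AP is undefined without positive samples
                aps.append(average_precision_score(y, s))
        return float(np.mean(aps)) if aps else float("nan")

    # pooled = mean_ap(scores, labels)
    # gap = mean_ap(scores, labels, groups == "man") - mean_ap(scores, labels, groups == "woman")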

References

[1] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning. pp. 8748–8763. PMLR (2021)

[2] Zhou, X., Girdhar, R., Joulin, A., Krähenbühl, P., Misra, I.: Detecting twenty-thousand classes using image-level supervision. In: ECCV (2022)

[3] Li, B., Weinberger, K.Q., Belongie, S., Koltun, V., Ranftl, R.: Language-driven semantic segmentation. In: International Conference on Learning Representations (2022)

[4] Naeini, M.P., Cooper, G.F., Hauskrecht, M.: Obtaining well calibrated probabilities using Bayesian binning. In: AAAI (2015)

[5] Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.J., Shamma, D.A., et al.: Visual Genome: Connecting language and vision using crowdsourced dense image annotations. IJCV 123(1), 32–73 (2017)

[6] Miller, G.A.: WordNet: A lexical database for English. Commun. ACM 38(11), 39–41 (1995)

[7] Devinney, H., Björklund, J., Björklund, H.: Theories of "gender" in NLP bias research. In: ACM Conference on Fairness, Accountability, and Transparency (FAccT) (2022)

Melissa Hall (melissahall@meta.com)

Laura Gustafson

Aaron Adcock

Candace Ross*

Ishan Misra*

* equal contribution in project leadership
