Demographic Disparities in Zero-Shot Vision Tasks
Overview
We study whether zero-shot image-text models perform differently across demographic groups in their intended use case of general object recognition (evaluating CLIP[1]) and, if so, whether they perpetuate such disparities in downstream applications (evaluating Detic[2] for zero-shot detection and LSeg[3] for zero-shot segmentation).
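For context, zero-shot recognition here means scoring an image against text prompts for candidate concepts rather than training a classifier. Below is a minimal sketch using the public CLIP package; the prompt wording, concept list, and file name are illustrative and not the exact evaluation setup used in this work.

import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Illustrative concept list; the actual evaluation draws concepts from the dataset labels.
concepts = ["necktie", "hair", "dog"]
prompts = clip.tokenize([f"a photo of a {c}" for c in concepts]).to(device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    # CLIP returns image-to-text similarity logits; softmax gives per-concept scores.
    logits_per_image, _ = model(image, prompts)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print(dict(zip(concepts, probs[0])))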
We use a pair of disparity indicators for zero-shot vision tasks:
We find:
Social Disparity Indicators
We observe a pattern of differential outcomes for many concepts between gender-annotated groups across all models.
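As a rough illustration of the kind of per-concept comparison behind this observation, the sketch below computes recall for each (concept, group) pair and the gap between two gender-annotated groups. The record format and helper names are hypothetical and are not the paper's exact disparity indicator.

from collections import defaultdict

def per_group_recall(records):
    """Recall per (concept, group).

    `records` is a hypothetical list of dicts with keys "concept",
    "group" (the gender annotation of the sample), and "correct"
    (whether the zero-shot prediction matched the ground truth).
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        key = (r["concept"], r["group"])
        totals[key] += 1
        hits[key] += int(r["correct"])
    return {k: hits[k] / totals[k] for k in totals}

def recall_gap(recalls, concept, group_a, group_b):
    """Absolute difference in recall for a concept between two gender-annotated groups."""
    return abs(recalls[(concept, group_a)] - recalls[(concept, group_b)])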
Evaluation Dataset
Visual Genome[5]
Limitations of how we define gender:
Additional Findings
References
[1] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning. pp. 8748–8763. PMLR (2021)
[2] Zhou, X., Girdhar, R., Joulin, A., Krähenbühl, P., Misra, I.: Detecting twenty-thousand classes using image-level supervision. In: ECCV (2022)
[3] Li, B., Weinberger, K.Q., Belongie, S., Koltun, V., Ranftl, R.: Language-driven semantic segmentation. In: International Conference on Learning Representations (2022)
[4] Naeini, M.P., Cooper, G.F., Hauskrecht, M.: Obtaining well calibrated probabilities using Bayesian binning. In: AAAI (2015)
[5] Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.J., Shamma, D.A., et al.: Visual genome: Connecting language and vision using crowdsourced dense image annotations. IJCV 123(1), 32–73 (2017)
[6] Miller, G.A.: WordNet: A lexical database for English. Commun. ACM 38(11), 39–41 (1995)
[7] Devinney, H., Björklund, J., Björklund, H.: Theories of "gender" in NLP bias research (2022)
Melissa Hall (melissahall@meta.com)
Laura Gustafson
Aaron Adcock
Candace Ross*
Ishan Misra*
* equal contribution in project leadership
We observe a pattern of differential treatment for many concepts between gender-annotated groups for Detic and LSeg.
Aggregate metrics like mean average precision (mAP) can obscure notable disparities between groups:
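A minimal sketch of why this can happen, using a simplified classification-style AP rather than the full detection mAP protocol, and assuming hypothetical parallel numpy arrays of binary labels, model scores, and group annotations: the pooled value can look healthy while per-group values diverge.

import numpy as np
from sklearn.metrics import average_precision_score

def overall_and_per_group_ap(y_true, y_score, groups):
    """Compare pooled AP with AP computed separately for each annotated group."""
    overall = average_precision_score(y_true, y_score)
    per_group = {
        g: average_precision_score(y_true[groups == g], y_score[groups == g])
        for g in np.unique(groups)
    }
    return overall, per_group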
Error analyses show several potential root causes of observed disparities beyond model behavior:
Definitions of concepts can differ between groups (ex: "necktie.n.01")
Salience of object information changes between groups (ex: "hair.n.01")
Distributions of samples and correlated concepts vary between groups
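One simple way to probe this last point is to compare relative concept frequencies between groups in the ground-truth annotations. A sketch assuming a hypothetical iterable of (group, concept) pairs taken from dataset labels, not model predictions:

from collections import Counter

def concept_frequencies(annotations):
    """Relative concept frequencies per annotated group."""
    counts = {}
    for group, concept in annotations:
        counts.setdefault(group, Counter())[concept] += 1
    freqs = {}
    for group, cnt in counts.items():
        total = sum(cnt.values())
        # Normalize to relative frequencies so groups of different sizes are comparable.
        freqs[group] = {concept: n / total for concept, n in cnt.items()}
    return freqs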