
Out-of-Distribution Robustness in Computer Vision and NLP

Dan Hendrycks

Threat Model

Motivating the Rules of the Game for Adversarial Example Research, Gilmer et al. arXiv: 1807.06732


Out-of-Distribution Robustness

Image Credit: Aleksander Mądry

In reality, the test distribution will not match the training distribution.


Evaluating the Robustness of NLP Models

8 of 37

  • Constructed by pairing or splitting datasets

Evaluating the Robustness of NLP Models

Hendrycks and Liu et al., ACL 2020, arXiv:2004.06100

9 of 37

  • Constructed by pairing or splitting datasets

Sentiment Analysis

Evaluating the Robustness of NLP Models

Hendrycks and Liu et al., ACL 2020, arXiv:2004.06100

10 of 37

  • Constructed by pairing or splitting datasets

Sentiment Analysis

American Chinese, Italian, and Japanese

Evaluating the Robustness of NLP Models

Hendrycks and Liu et al., ACL 2020, arXiv:2004.06100

11 of 37

  • Constructed by pairing or splitting datasets

Sentiment Analysis

American Chinese, Italian, and Japanese

Semantic Similarity

Headlines Images

Evaluating the Robustness of NLP Models

Hendrycks and Liu et al., ACL 2020, arXiv:2004.06100

12 of 37

  • Constructed by pairing or splitting datasets

Sentiment Analysis

American Chinese, Italian, and Japanese

Semantic Similarity

Headlines Images

Reading Comprehension

CNN DailyMail

Evaluating the Robustness of NLP Models

Hendrycks and Liu et al., ACL 2020, arXiv:2004.06100

13 of 37

  • Constructed by pairing or splitting datasets

Sentiment Analysis

American Chinese, Italian, and Japanese

Semantic Similarity

Headlines Images

Reading Comprehension

CNN DailyMail

Textual Entailment

Telephone Letters

Evaluating the Robustness of NLP Models

Hendrycks and Liu et al., ACL 2020, arXiv:2004.06100
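The pairing/splitting recipe above can be sketched as: hold out one subpopulation (a review source, genre, or news outlet) as the OOD test set and train on the rest. A minimal sketch with toy data and illustrative field names (not the paper's actual preprocessing):

```python
# Toy labeled examples tagged with a domain (illustrative data only).
data = [
    {"text": "great pasta", "domain": "italian", "label": 1},
    {"text": "cold noodles were bland", "domain": "chinese", "label": 0},
    {"text": "best burger in town", "domain": "american", "label": 1},
    {"text": "fries were soggy", "domain": "american", "label": 0},
]

def ood_split(examples, train_domains):
    """Train on the listed domains; everything else becomes the OOD test set."""
    train = [e for e in examples if e["domain"] in train_domains]
    test = [e for e in examples if e["domain"] not in train_domains]
    return train, test

# e.g., train on American reviews, evaluate on other cuisines
train, test = ood_split(data, {"american"})
```

The same helper covers all four tasks above by swapping what counts as a "domain" (cuisine, caption source, news outlet, or MNLI genre).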

Pretrained Transformers are More Robust

Hendrycks and Liu et al., ACL 2020, arXiv:2004.06100


Note: Bigger Models Are Not Always Better

Hendrycks and Liu et al., ACL 2020, arXiv:2004.06100


Robustness in Vision: ImageNet-C

Hendrycks and Dietterich, ICLR 2019, arXiv:1903.12261
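ImageNet-C applies 15 corruption types, each at 5 severity levels, to the ImageNet test set. A minimal sketch of one such corruption (additive Gaussian noise) on a flat pixel list; the per-severity noise scales here are illustrative, not the benchmark's actual constants:

```python
import random

# Assumed per-severity noise scales (illustrative; ImageNet-C defines its own).
SEVERITY_SCALE = {1: 0.04, 2: 0.06, 3: 0.08, 4: 0.09, 5: 0.10}

def gaussian_noise(pixels, severity, rng):
    """Corrupt pixel intensities in [0, 1] with additive Gaussian noise,
    clipping back into the valid range."""
    sigma = SEVERITY_SCALE[severity]
    return [min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in pixels]

rng = random.Random(0)
clean = [0.5] * 8                      # a toy "image"
corrupted = gaussian_noise(clean, severity=3, rng=rng)
```

Crucially, the corruptions are used only at test time: training on them would defeat the point of measuring robustness to *unforeseen* shifts.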

Robustness in Vision: ImageNet-C

ResNet-50: 76% top-1 accuracy (IID)

Modern classifiers do not generalize well to unexpected images.

[Bar chart: ResNet-50 top-1 accuracy on individual ImageNet-C corruption types, ranging from roughly 29% to 70%, versus 76% on clean images]

Hendrycks and Dietterich, ICLR 2019, arXiv:1903.12261

Slide Credit: Justin Gilmer
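The benchmark summarizes these per-corruption numbers with the mean Corruption Error (mCE): each corruption's error is summed over the five severities, normalized by AlexNet's error on the same corruption, and the normalized errors are averaged across corruptions. A minimal sketch with made-up error rates:

```python
def corruption_error(model_errs, baseline_errs):
    """CE for one corruption: model error summed over the 5 severities,
    normalized by the baseline's (AlexNet's) summed error."""
    return sum(model_errs) / sum(baseline_errs)

def mean_corruption_error(model, baseline):
    """mCE: average the normalized CEs over all corruption types."""
    ces = [corruption_error(model[c], baseline[c]) for c in model]
    return sum(ces) / len(ces)

# Illustrative (made-up) per-severity error rates for two corruptions.
model = {"gaussian_noise": [0.40, 0.50, 0.60, 0.70, 0.80],
         "fog":            [0.30, 0.35, 0.40, 0.50, 0.60]}
alexnet = {"gaussian_noise": [0.60, 0.70, 0.80, 0.90, 0.95],
           "fog":            [0.50, 0.55, 0.60, 0.70, 0.80]}

mce = mean_corruption_error(model, alexnet)  # < 1.0: more robust than AlexNet
```

The AlexNet normalization keeps corruptions with very different difficulty (e.g., fog vs. impulse noise) from dominating the average.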


Larger Models Surprisingly Help

Hendrycks and Dietterich, ICLR 2019, arXiv:1903.12261


Just Train on Noise?

Generalizing to Unforeseen Corruptions is Difficult

Generalisation in humans and deep neural networks, Geirhos et al.


ImageNet-C Corruptions

PIL Operations

Hendrycks and Mu et al., ICLR 2020, arXiv:1912.02781

Diverse Augmentations with AugMix

Hendrycks and Mu et al., ICLR 2020, arXiv:1912.02781
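AugMix samples several short augmentation chains, combines their outputs with Dirichlet-sampled convex weights, then interpolates the mixture with the clean image using a Beta-sampled weight. A minimal sketch on toy "images" (flat float lists) with simple stand-in operations instead of the paper's PIL ops:

```python
import random

def dirichlet(rng, k, alpha=1.0):
    # Sample Dirichlet(alpha) weights via normalized Gamma draws.
    g = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    s = sum(g)
    return [x / s for x in g]

# Stand-in augmentation ops on a flat float list (real AugMix uses PIL ops).
OPS = [
    lambda img, rng: [min(1.0, p * rng.uniform(0.9, 1.1)) for p in img],        # brightness
    lambda img, rng: list(reversed(img)),                                        # mirror
    lambda img, rng: [min(1.0, max(0.0, (p - 0.5) * 1.2 + 0.5)) for p in img],   # contrast
]

def augmix(image, rng, width=3, depth=2, alpha=1.0):
    """Mix `width` random augmentation chains, then blend with the original."""
    weights = dirichlet(rng, width, alpha)
    mixed = [0.0] * len(image)
    for w in weights:
        aug = list(image)
        for _ in range(depth):                 # apply a short chain of ops
            aug = rng.choice(OPS)(aug, rng)
        mixed = [m + w * a for m, a in zip(mixed, aug)]
    m = rng.betavariate(alpha, alpha)          # interpolation with the clean image
    return [m * p + (1 - m) * q for p, q in zip(image, mixed)]

rng = random.Random(0)
out = augmix([0.1, 0.4, 0.6, 0.9], rng)
```

Mixing chains rather than applying one op at a time is what produces the diversity; the paper additionally trains with a Jensen-Shannon consistency loss across augmented views, which this sketch omits.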


CIFAR-10-C: Halving the Error Rate

Hendrycks and Mu et al., ICLR 2020, arXiv:1912.02781


Diverse Augmentation with DeepAugment

Hendrycks, Basart, Mu, Kadavath, Wang, Dorundo, Desai, Zhu, Parajuli, Guo, Song, Steinhardt, Gilmer. arXiv:2006.16241
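DeepAugment generates augmentations by passing images through image-to-image networks whose weights and activations are randomly perturbed. A toy sketch of the idea, with a one-layer elementwise "network" standing in for a real image-to-image model (illustrative only, not the paper's implementation):

```python
import random

def toy_net(image, weights):
    # A toy 1-layer "image-to-image network": per-pixel scale + bias, clipped.
    return [min(1.0, max(0.0, w * p + b)) for p, (w, b) in zip(image, weights)]

def deepaugment_sketch(image, rng, noise=0.2):
    """Randomly perturb the network's weights around identity, then run the
    image through it to obtain a semantically-preserving distortion."""
    weights = [(1.0 + rng.gauss(0.0, noise), rng.gauss(0.0, noise * 0.1))
               for _ in image]
    return toy_net(image, weights)

rng = random.Random(0)
aug = deepaugment_sketch([0.2, 0.5, 0.8], rng)
```

Because the distortions come from a learned network's function space rather than hand-written ops, they differ in kind from both the ImageNet-C corruptions and AugMix's PIL operations.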


Diverse Augmentations Can Help

Hendrycks, Basart, Mu, Kadavath, Wang, Dorundo, Desai, Zhu, Parajuli, Guo, Song, Steinhardt, Gilmer. arXiv:2006.16241

ImageNet-R

  • ImageNet-R(endition) tests robustness to texture changes with 30K images
  • Train on normal ImageNet, test on paintings, drawings, toys, sculptures, ...

Hendrycks, Basart, Mu, Kadavath, Wang, Dorundo, Desai, Zhu, Parajuli, Guo, Song, Steinhardt, Gilmer. arXiv:2006.16241


ImageNet-A

  • Naturally occurring examples can reliably confuse all known models
  • Adversarially mining examples against one model transfers to all tested models
  • We adversarially construct a new, hard ImageNet test set, ImageNet-A, on which a ResNeXt and a DenseNet each obtain accuracy below 3%

Hendrycks, Zhao, Basart, Steinhardt, Song. arXiv:1907.07174


Street View Store Fronts (SVSF)

  • Classify the business given an image of the storefront.
  • With our new dataset, we can test robustness to temporal, geographical, and hardware setup shifts.

Hendrycks, Basart, Mu, Kadavath, Wang, Dorundo, Desai, Zhu, Parajuli, Guo, Song, Steinhardt, Gilmer. arXiv:2006.16241

[Example storefront images: USA Old, USA New, France Old, France New]

DeepFashion Remixed

Hendrycks, Basart, Mu, Kadavath, Wang, Dorundo, Desai, Zhu, Parajuli, Guo, Song, Steinhardt, Gilmer. arXiv:2006.16241


The Many Faces of Robustness

Hendrycks, Basart, Mu, Kadavath, Wang, Dorundo, Desai, Zhu, Parajuli, Guo, Song, Steinhardt, Gilmer. arXiv:2006.16241


Contributions

  • Established robustness benchmarks for NLP and computer vision (ImageNet-C, ImageNet-A, ImageNet-R)
  • Created data augmentation techniques (AugMix, DeepAugment) that can sometimes greatly improve robustness
  • Clarified the limits of robustness techniques and showed no existing technique consistently improves robustness