1 of 28

Uncertainty modeling from 50M to 1B

Dustin Tran

2 of 28

A pervasive problem in the field

  • Most papers fix an architecture and try to improve log-likelihood.

  • Is it reasonable to compare 100 MCMC samples to the original model? Is it reasonable to compare an ensemble of size 10 to the original model?

  • Why are we ignoring the original model scaled up 10-100x?

3 of 28

All that matters is p(y | x).

4 of 28

The uncertainty-robustness frontier

[Figure: quality of uncertainty & robustness plotted against the number of parameters (0.1M–10B). Image source: NeurIPS 2020 tutorial on Uncertainty & Robustness.]

5 of 28

Ensembles as a Giant Model

  • Viewed as one giant model, the paths through different ensemble members are independent ⇒ independently SGD-trained members produce decorrelated predictions by construction.
  • Bridge the gap between a single model and an ensemble by sharing parameters and learning how to decorrelate predictions during training.

Image source: NeurIPS 2020 tutorial on Uncertainty & Robustness

6 of 28

Efficient Ensembles by Sharing Parameters

Parameterize each ensemble member's weight matrix as a weight matrix W multiplied elementwise by the outer product of two vectors r and s.

There is an independent set of r and s vectors for each ensemble member; W is shared.

Known as BatchEnsemble.
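
A minimal sketch of this construction, writing ∘ for the elementwise (Hadamard) product and indexing the K ensemble members by k (notation assumed here, not taken from the slides):

    W_k = W ∘ (r_k s_kᵀ),   k = 1, …, K

Only r_k and s_k are member-specific, so each layer adds roughly K·(d_in + d_out) extra parameters rather than K full weight matrices.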

7 of 28

Efficient Ensembles by Sharing Parameters

BatchEnsemble has a convenient vectorization.

Duplicate each example in a given mini-batch K times.

The model yields K outputs for each example.

Can interpret rank-1 weight perturbations as feature-wise transformations.
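
A minimal NumPy sketch of this vectorized forward pass for a single dense layer (function and variable names are illustrative assumptions, not the slides' implementation):

    import numpy as np

    def batchensemble_dense(x, W, R, S):
        """Vectorized BatchEnsemble dense layer.

        x: [B, d_in] mini-batch; W: [d_in, d_out] shared weights;
        R: [K, d_in] and S: [K, d_out] hold each member's rank-1 vectors.
        Returns [K*B, d_out]: K outputs per example.
        """
        K, B = R.shape[0], x.shape[0]
        x_rep = np.tile(x, (K, 1))          # duplicate each example K times
        r = np.repeat(R, B, axis=0)         # align member vectors with the duplicates
        s = np.repeat(S, B, axis=0)
        # Feature-wise view of the rank-1 perturbation:
        # (x ∘ r) W ∘ s  ==  x (W ∘ r sᵀ)  for each member.
        return (x_rep * r) @ W * s

    # Example: K = 4 members on a batch of 8 examples -> output shape (32, 10).
    out = batchensemble_dense(np.random.randn(8, 16),
                              np.random.randn(16, 10),
                              np.random.randn(4, 16),
                              np.random.randn(4, 10))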

8 of 28

The uncertainty-robustness frontier

[Figure: quality of uncertainty & robustness vs. number of parameters (0.1M–10B), with annotated points for Architecture (ResNet, ViT) and Efficient ensembles (BatchEnsemble).]

9 of 28

The value of Bayes is the prior distribution. Rank-1 BNNs place priors over BatchEnsemble's rank-1 weights, p(r) and p(s).

Rank-1 priors induce distribution over all weights.
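
A hedged sketch of the induced distribution, with the prior family left abstract (the slides do not specify it here):

    r ~ p(r),   s ~ p(s),   W' = W ∘ (r sᵀ)

so a prior over the low-dimensional vectors r and s induces a rank-1-structured distribution over the full weight matrix W', while the shared W stays deterministic.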

Can we improve ensembles with Bayes?

10 of 28

Rank-1 BNNs use mixture posterior distributions to combine multimodal representations with distributional uncertainty.
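
A hedged sketch of one common form of such a mixture posterior over a rank-1 vector (the exact parameterization may differ):

    q(r) = (1/K) Σ_{k=1..K} N(r | μ_k, diag(σ_k²)),   and similarly for q(s)

Each mixture component plays the role of one ensemble member (multimodality), while the per-component variances carry distributional uncertainty.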

Can we improve ensembles with Bayes?

11 of 28

The uncertainty-robustness frontier

[Figure: quality of uncertainty & robustness vs. number of parameters (0.1M–10B), with annotated points for Architecture (ResNet, ViT), Efficient ensembles (BatchEnsemble), and Priors (Rank-1 BNNs).]

12 of 28

Does data augmentation work? Well..

[Wen+ 2021]

[Figure: comparison of BatchEnsemble, MC Dropout, and Deep Ensemble.]

13 of 28

Does data augmentation work? Well..

[Figure: comparison of BatchEnsemble, MC Dropout, and Deep Ensemble.]

14 of 28

  • Ensembles reduce overconfidence.
  • Mixup makes the model underconfident.

⇒ Ensembles + Mixup are even more underconfident!

[Wen+ 2021]
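
A minimal sketch of mixup to make the underconfidence mechanism concrete: convex combinations of inputs and one-hot labels yield soft targets, so the model is explicitly trained to output probabilities below 1 (the names and the Beta(α, α) choice are standard mixup, not specific to the slides):

    import numpy as np

    def mixup_batch(x, y_onehot, alpha=0.2, rng=np.random.default_rng(0)):
        """Mixup: blend random pairs of examples and their one-hot labels."""
        lam = rng.beta(alpha, alpha)                          # mixing coefficient in [0, 1]
        perm = rng.permutation(len(x))                        # random pairing within the batch
        x_mix = lam * x + (1 - lam) * x[perm]                 # blended inputs
        y_mix = lam * y_onehot + (1 - lam) * y_onehot[perm]   # soft targets
        return x_mix, y_mix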

15 of 28

Data augmentation conflates model and data uncertainty.

[Wen+ 2021]

16 of 28

ResNet-50 on ImageNet

With adjustment, ensembles + DA reach state-of-the-art calibration.

17 of 28

The uncertainty-robustness frontier

[Figure: quality of uncertainty & robustness vs. number of parameters (0.1M–10B), with annotated points for Efficient ensembles (BatchEnsemble) and Priors & Invariances (Rank-1 BNNs, Mixup).]

18 of 28

The uncertainty-robustness frontier

[Figure: the same frontier with a question mark over the giant-model regime (around 1B+ parameters).]

19 of 28

What should we expect according to the literature?

  • Gains in accuracy and OOD generalization [Djolonga+ 2021]
  • But... “modern neural networks … are poorly calibrated.”
  • “...increasing depth and width … negatively affect model calibration.” (Guo et al. 2017; 1500 citations)
  • Many similar reports.

20 of 28

Experimental study

  • Models.
    • Weak supervision (BiT, ResNeXt WSL) and non-convolutional architectures (MLP-Mixer, ViT).
    • Semi-supervised (SimCLR).
    • Natural language (CLIP).
    • Classic (AlexNet).

  • Datasets. ImageNet, ImageNetV2, ImageNet-C, ImageNet-R, ImageNet-A.

[Minderer+ 2021]
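
Calibration in these comparisons is typically summarized by expected calibration error (ECE); a minimal sketch assuming equal-width confidence bins (not the paper's exact evaluation code):

    import numpy as np

    def expected_calibration_error(confidences, correct, n_bins=15):
        """ECE: bin-size-weighted average of |accuracy - confidence| per confidence bin."""
        edges = np.linspace(0.0, 1.0, n_bins + 1)
        ece = 0.0
        for lo, hi in zip(edges[:-1], edges[1:]):
            in_bin = (confidences > lo) & (confidences <= hi)
            if in_bin.any():
                acc = correct[in_bin].mean()        # empirical accuracy in the bin
                conf = confidences[in_bin].mean()   # mean predicted confidence in the bin
                ece += in_bin.mean() * abs(acc - conf)
        return ece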

21 of 28

1: Recent architectures are not miscalibrated.

[Minderer+ 2021]

22 of 28

1: Recent architectures are not miscalibrated.

[Minderer+ 2021]

23 of 28

2: Larger models deteriorate in-dist. but improve on OOD.

[Minderer+ 2021]

24 of 28

2: Larger models deteriorate in-dist. but improve on OOD.

[Minderer+ 2021]

25 of 28

2: Larger models deteriorate in-dist. but improve on OOD.

  • Accuracy and calibration deteriorate with distribution shift.
  • Larger models deteriorate less.

[Minderer+ 2021]

26 of 28

3: Accuracy predicts calibration.

Fit power law:
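
A hedged sketch of the kind of relation implied here, with a and b as placeholder coefficients (the paper's exact fitted form and values are not reproduced in this text):

    calibration error ≈ a · (classification error)^b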

27 of 28

Takeaways

We need to trace the uncertainty-robustness frontier: understand progress relative to scale.

  • Efficient ensembles are a dominant strategy for uncertainty & robustness performance.

  • Strategies with Bayes and data augmentation further improve the frontier.

  • In the giant-model regime: “Modern neural networks tend to make overconfident predictions.” ← Not true!

28 of 28

Thank you!

Rafael Müller

Matthias Minderer

Yeming Wen

Ghassen Jerfel

Josip Djolonga

Mike Dusenberry

Rob Romijnders

Jasper Snoek

Balaji Lakshminarayanan

Frances Hubis

Xiaohua Zhai

Neil Houlsby

Mario Lucic

Katherine Heller

Yian Ma

Jimmy Ba