1 of 146

Machine Learning Systems Design

ML beyond accuracy: Fairness, Security, Governance

CS 329S | Chip Huyen

Sara Hooker

2 of 146

My research agenda to-date has focused on:

  • Going beyond test-set accuracy
  • Training models that fulfill multiple desired criteria

Model Interpretability - reliable explanations for model behavior.

Model Compression - compact machine learning models to work in resource constrained environments.

Model fragility and security - deploy secure models that protect user privacy.

Fairness - imposes a constraint on optimization that reflects societal norms of what is fair.

3 of 146

I’ll mention research collaborations with my colleagues:

Nyalleng Moorosi, Gregory Clark, Samy Bengio, Emily Denton, Aaron Courville, Yann Dauphin, Andrea Frome, Chirag Agarwal, Daniel Souza, Dumitru Erhan, Oreva Ahia, Julia Kreutzer.

The Imperfect Objective

Understanding trade-offs between desiderata.

The Role of Model Design: Characterizing Bias and Developing Trustworthy AI Models

4 of 146

The imperfect objective.

5 of 146

6 of 146

What if discomfort is not uniform, but targeted?

7 of 146

Our goal is to bridge the technology design gap - develop technology that works for everyone.

8 of 146

Our goal is to bridge the technology design gap - develop technology that works for everyone.

Achieving this requires understanding how our modelling choices shape downstream impact.

9 of 146

Over the last decade, the “performance” of a model has been treated as synonymous with the pursuit of top-1 accuracy.

10 of 146

Top-line metrics do not guarantee that the trained function fulfills other properties we may care about.

Empirical risk minimization - train a representation to minimize average error.

Compactness

Interpretability

Robustness

Fairness

11 of 146

Compactness

Interpretability

Adversarial Robustness

Fairness

Typical loss functions in machine learning (MSE, hinge loss, and cross-entropy) impose no preference for functions that are interpretable, fair, robust, or privacy-preserving.
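To make this concrete, here is a minimal statement of the standard empirical risk minimization objective (generic notation): only the average loss over training examples is minimized, and no term references subgroups, robustness, interpretability, or privacy.

```latex
% Empirical risk minimization over n training examples (x_i, y_i).
% \ell can be MSE, hinge loss, or cross-entropy -- only average error matters.
\hat{f} = \arg\min_{f \in \mathcal{F}} \; \frac{1}{n} \sum_{i=1}^{n} \ell\big(f(x_i), y_i\big)
```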

12 of 146

Donald Knuth said “computers do exactly what they are told, no more and no less.”

A model can fulfill an objective in many ways, while violating the spirit of said objective.

13 of 146

The Clever Hans Effect 1891 - 1907

Hans the horse could reportedly:

  • perform arithmetic
  • identify colours
  • count the crowd

14 of 146

High accuracy without true learning.

Experimental Design -

Can Hans answer a question if the human does not know the answer?

Hans answered correctly by picking up on subtle cues from the questioner.

15 of 146

The under-specification of our objective function often leads to undesirable model behavior termed “shortcut learning.”

Cow

Limousine

Beery et al. (paper link)

Hooker et al. 2019 (paper link)

16 of 146

The under-specification of our objective function often leads to undesirable model behavior termed “shortcut learning.”

Sheep

Dog

Blog link

17 of 146

Task: predicting the ending of a story.

Pay Attention to the Ending -

Work by Cai et al. shows that discarding the story plots and training only to choose the better of two endings reaches 72.5% accuracy.

High accuracy without “true” learning.

The under-specification of our objective function often leads to undesirable model behavior termed “shortcut learning.”

18 of 146

“Shortcut learning” is due to relative under/overrepresentation of training features.

For example, a model may learn to correlate blonde hair with being female because there are far fewer blonde males in the training dataset.

This results in higher error rates on the long-tail -- the underrepresented features in the dataset.

19 of 146

When this happens in sensitive domains, there can be a huge cost to human welfare.

High accuracy without “true” learning.

Esteva et al. (link)

Zech et al. 2018 (link)

AlBadaway et al. 2018 (link)

Skin lesions

Pneumonia

20 of 146

Gender shades (link)

Shankar et al. (link)

Zhao, Jieyu et al. (2017)

How a model treats underrepresented features in the long-tail of the distribution often coincides with notions of fairness.

21 of 146

Geographic bias in how we collect our datasets. Shankar et al. (2017) show models perform far worse on locales undersampled in the training set.

No Classification without Representation: Assessing Geodiversity Issues in Open Data Sets for the Developing World (Shankar et al. (link))

22 of 146

Men also like shopping (and cooking too).

Undersampling/oversampling leads to undesirable spurious correlations.

Zhao, Jieyu et al. (2017) show that activity recognition datasets exhibit stereotype-aligned gender biases.

23 of 146

Delegating learning of the function to the model can lead (and has led) to Clever Hans moments.

High accuracy without “true” learning.

Overfitting to pattern matching

Memorization leads to leakage of private text

Carlini et al. 2021

Can generate factually incorrect statements

Brown et al., 2020

24 of 146

Open challenges in auditing and mitigating harmful bias.

25 of 146

Preferences about how our trained model should behave on a subset of sensitive or protected features.

Fairness

Legally protected features:

Certain attributes are protected by law. For example, in the US it is illegal to discriminate based upon race, color, religion, sex, national origin, disability.

The legal framework differs by country.

Sensitive features:

Income, eye color, hair, skin color, accent, locale.

These features may not be protected by law, but are often correlated with protected attributes.

26 of 146

Your choice of tool to audit and mitigate algorithmic bias will depend upon whether you:

- know the sensitive features which are adversely impacted

- have comprehensive labels for these features

  • Unknown bias
  • Incomplete or no labels for sensitive features
  • Known concern
  • Comprehensive labels

27 of 146

  • Unknown bias
  • Incomplete or no labels for sensitive features
  • Known concern
  • Comprehensive labels

Your choice of tool to audit and mitigate algorithmic bias will depend upon whether you:

- know the sensitive features which are adversely impacted

- have comprehensive labels for these features


28 of 146

  1. With known and comprehensive labels - track impact using intersectional metrics

What is it? Statistically evaluate model performance (e.g. accuracy, error rates) by “subgroup”, e.g. skin tone, gender, age.

Requires: Good, “balanced” test sets that are representative of the actual use-case(s) for the model in production.


[Table: metrics (Acc / FNR / FPR / other) reported for each intersection of gender (Male, Female, Non-binary) and Fitzpatrick Skin Type (Type I-VI).]
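A minimal sketch of such an intersectional audit using pandas, assuming an evaluation dataframe with hypothetical columns `gender`, `skin_type`, `label`, and `pred` (all names and data here are illustrative, not from the slides):

```python
import pandas as pd

def intersectional_report(df: pd.DataFrame) -> pd.DataFrame:
    """Accuracy, FPR and FNR for every (gender, skin_type) subgroup."""
    def metrics(g: pd.DataFrame) -> pd.Series:
        tp = ((g["pred"] == 1) & (g["label"] == 1)).sum()
        tn = ((g["pred"] == 0) & (g["label"] == 0)).sum()
        fp = ((g["pred"] == 1) & (g["label"] == 0)).sum()
        fn = ((g["pred"] == 0) & (g["label"] == 1)).sum()
        return pd.Series({
            "n": len(g),
            "accuracy": (tp + tn) / len(g),
            "fpr": fp / max(fp + tn, 1),  # false positive rate
            "fnr": fn / max(fn + tp, 1),  # false negative rate
        })
    return df.groupby(["gender", "skin_type"]).apply(metrics)

# Usage sketch:
# report = intersectional_report(eval_df)
# print(report.sort_values("fnr", ascending=False))
```

Sorting by the worst subgroup metric, rather than reporting only the average, is what surfaces the kind of gaps an audit like Gender Shades reports.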

29 of 146

Example of intersectional audit

Gender Shades - Evaluated classifiers’ performance across genders, skin types, and intersection of gender and skin type


30 of 146

When labels are known and complete, a range of remedies opens up to mitigate impact.


Data-Based

  1. Re-balance or re-weight sensitive features to balance training set.
  2. Remove problematic feature from training set (not always feasible).
  3. Tailored Augmentation Strategies

31 of 146

Examples of counterfactual data augmentation remedies:


Counterfactual data augmentation strategies (CDA):

    • Duplicating examples and swapping gendered terms in training data can help with debiasing word embeddings, pretrained language models, and coreference resolution models.

“the man who pioneered the church named it [...]”

Generate a counterfactual sentence by substituting the word’s gender-partner in its place:

“the woman who pioneered the church [...]”

[[Lu et al. (2018), Maudslay et al. (2019), Webster et al. (2021)]]
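A minimal sketch of the word-swap flavour of CDA described above, assuming a hand-curated dictionary of gender-partner terms (the word list is illustrative, not the one used in the cited papers, and real lists must be curated per language):

```python
import re

# Hypothetical, hand-curated gender-partner pairs (simplified: possessive "her"
# vs. object "her" is not disambiguated here).
SWAP = {"man": "woman", "woman": "man", "he": "she", "she": "he",
        "his": "her", "her": "his", "father": "mother", "mother": "father"}

def counterfactual(sentence: str) -> str:
    """Swap each gendered term for its partner, preserving capitalisation."""
    def swap(match: re.Match) -> str:
        word = match.group(0)
        partner = SWAP[word.lower()]
        return partner.capitalize() if word[0].isupper() else partner

    pattern = r"\b(" + "|".join(SWAP) + r")\b"
    return re.sub(pattern, swap, sentence, flags=re.IGNORECASE)

# Augment training data with both the original and the counterfactual sentence.
print(counterfactual("the man who pioneered the church named it"))
# -> "the woman who pioneered the church named it"
```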

32 of 146

When labels are known and complete, a range of remedies is available to mitigate impact.


Data-Based

  1. Re-balance or re-weight sensitive features to balance training set.
  2. Remove problematic feature from training set (not always feasible due to proxy variables).
  3. Tailored Augmentation Strategies

Model-Based

  1. MinDiff - penalizes the model for differences in treatment of distributions (see the sketch after this list).
  2. Rate constraint - guaranteeing recall, data co-occurrence, or another rate metric is at least/at most [x%].

[[Zhao et al. 2017]]

  3. Worst-case error constraint
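A minimal PyTorch-style sketch of the MinDiff idea in item 1, under simplifying assumptions: instead of the kernel-based penalty used in the original MinDiff work, this version penalises the squared gap in mean predicted score between two slices of the batch. Here `group` is a hypothetical 0/1 tensor marking the sensitive slice, and a batch is assumed to contain both groups.

```python
import torch
import torch.nn.functional as F

def loss_with_min_diff(logits, labels, group, weight=1.0):
    """Binary cross-entropy plus a penalty on the gap in mean positive-class
    score between group == 0 and group == 1 examples (simplified MinDiff-style
    term; assumes both groups are present in the batch)."""
    task_loss = F.binary_cross_entropy_with_logits(logits, labels.float())
    scores = torch.sigmoid(logits)
    gap = scores[group == 1].mean() - scores[group == 0].mean()
    return task_loss + weight * gap.pow(2)
```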

33 of 146

  • Unknown bias
  • Incomplete or no labels for sensitive features
  • Known concern
  • Comprehensive labels

What about cases where we don’t have complete labels for the sensitive attribute we care about?

34 of 146

How does my model perform...

Classification accuracy / precision-recall curve / logarithmic loss / area under the curve / mean squared error / mean absolute error / F1 score / standard deviation / variance / confidence intervals / KL divergence / false positive rate / false negative rate

How might my model perform...

on a sample of test data / on cross-slices of test data / on an individual data point / if a datapoint is perturbed / if model thresholds were different / across all values of a feature / when compared to a different model

Can you spot a metric which doesn’t require labels?

35 of 146

Most of our remedies are centered around the assumption that we have comprehensive labelling available. However, this is a tenuous assumption.

36 of 146

Most of our remedies are centered around the assumption that we have comprehensive labelling available. However, this is a tenuous assumption.

Any thoughts why? What challenges exist around ensuring comprehensive labelling?

37 of 146

More often than not, we do not have comprehensive labels from human annotators.

  1. For high dimensional problems, often time consuming/infeasible to comprehensively label.
  2. Labelling sensitive features is insufficient; it is necessary to label all proxy features.
  3. Legal obstacles around collecting certain sensitive features [[Andrus et al. 2021, Veal 2017]].
  4. Even when collected, issues with consistency in annotation [[Khan et al. 2021]]
  5. Annotation can be biased by the lived experience of the annotators
  6. What we deem to be harm is not static. It is shaped by political/geographical/economic/historical considerations.

38 of 146

1. For high dimensional problems, often time consuming/infeasible to comprehensively label.

[Example: an image labelled only “church” also contains a bird, nest, street lamp, cross, statue, window, and window grid.]

39 of 146

1.1. For NLP tasks, labels often need to be curated separately for different languages.

“the man who pioneered the church named it [...]”

Generate a counterfactual sentence by substituting the word’s gender-partner in its place:

“the woman who pioneered the church [...]”

This requires a curated list of words to substitute, and the list has to be curated separately for each language.

The GEM Benchmark [[Gehrmann et al. 2021]]

40 of 146

2. Labelling protected features is often insufficient; it is necessary to label all proxy features.

Task: Sleeping or awake?

Consider species to be the protected attribute; many other variables may be proxy variables (e.g. indoor/outdoor).

41 of 146

In languages like Portuguese, remedies like augmentation by swapping protected attributes require not only identifying protected nouns but also altering any words that are modified by the noun.

A minha mãe é Professora.

O meu pai é Professor.

My mom is a teacher.

My dad is a teacher.

2. Labelling protected features is often insufficient; it is necessary to label all proxy features.

42 of 146

A minha mãe é Professora.

Gender swapping is much harder in languages where the adjectives, articles, and pronouns that agree with these nouns also adjust to comply with gender.

O meu pai é Professor.

My mom is a teacher.

My dad is a teacher.

2. Labelling protected features is often insufficient; it is necessary to label all proxy features.

43 of 146

3. Legal obstacles around collecting certain sensitive features.

A recent dataset release by Facebook is notable for both compensating participants and obtaining consent.

44 of 146

4. Inconsistency in how sensitive features are labelled.


Images with high (left) and low (right) levels of agreement on how to annotate race.

45 of 146

4.1 Annotation errors are prevalent in most large scale datasets.


[[Bias in Bios: A Case Study of Semantic Representation Bias in a High-Stakes Setting, De-Arteaga et al. 2019]]

Tagged with the profession “model”:

"Hank Sheinkopf is a model for brilliant communications in a world where messages are broadcast at astounding rates. He’s a player in the PR world of political campaigns, both in domestic and foreign sectors."

Example found by Preethi Seshadri.

46 of 146

4.1 The subjectivity of certain labels may result in high annotator disagreement/variance.

What constitutes toxic speech?


[[Akin et al. 2018]]

47 of 146

4.1 The subjectivity of certain labels may result in high annotator disagreement/variance.

What constitutes toxic speech?


[[Akin et al. 2018]]

48 of 146

5. Annotation can be biased by the lived experience of the annotators


[[On Releasing Annotator-Level Labels and Information in Datasets, Prabhakaran, Davani et al. 2021]]

Label aggregation may introduce representational biases of individual and group perspectives.

49 of 146

6. What we deem to be harm is not static. It is shaped by political/geographical/economic/historical considerations.


50 of 146

6. What we deem to be harm is not static. It is shaped by political/geographical/economic/historical considerations.


1999 - Michigan removes a law prohibiting citizens from using “reproachful or contemptuous language” in print against anyone who declines a duel challenge.

In 1963, the Swiss government passed the first measures to make sure every inhabitant had access to a nuclear shelter.

The rules and social contract we ask citizens to abide by have changed over time, and vary by country.

51 of 146

Fairness considerations are not static across time or space.

6. What we deem to be harm is not static. It is shaped by political/geographical/economic/historical considerations.


52 of 146

We just covered several of the key challenges in ensuring comprehensive labelling.

  • For high dimensional problems, often time consuming/infeasible to comprehensively label.
  • Labelling sensitive features is insufficient; it is necessary to label all proxy features.
  • Legal obstacles around collecting certain sensitive features [[Andrus et al. 2021, Veal 2017]].
  • Even when collected, issues with consistency in annotation [[Khan et al. 2021]]
  • Annotation can be biased by the lived experience of the annotators
  • What we deem to be harm is not static. It is shaped by political/geographical/economic/historical considerations.

53 of 146


If we cannot guarantee we have fully addressed bias in the data pipeline, the overall harm in a system is a product of the interactions between the data and our model design choices.

54 of 146

Recognizing how model design impacts harm opens up new mitigation techniques that are far less burdensome than comprehensive data collection.

  • Unknown bias
  • Incomplete or no labels for sensitive features
  • Known concern
  • Comprehensive labels

55 of 146

Leveraging model signal to audit large scale datasets

56 of 146

Global feature importance - ranks dataset examples by how challenging they are.

Use cases: data cleaning, isolating a subset for relabelling, identifying issues with fairness.

Surfaces a tractable subset of the most challenging/least challenging examples for human inspection. Avoids the time-consuming need to inspect every example.

57 of 146

One of the biggest challenges in improving datasets is identifying where to invest limited human annotator time.

58 of 146

59 of 146

Compute the average variance of gradients (VoG) for an image over training.

[Figure: example gradients at 0 epochs vs. 90 epochs.]

Estimating Example Difficulty using Variance of Gradients, Agarwal, D'Souza and Hooker, 2020

Variance of Gradients (VoG) is an example of a global ranking tool.
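A minimal sketch of the VoG computation for a single example, under simplifying assumptions: the gradient of the true-class logit with respect to the input is collected from a handful of saved checkpoints, and the per-class normalisation used in the paper is omitted.

```python
import torch

def vog_score(checkpoints, x, y):
    """Variance of Gradients for one example: gradient of the true-class logit
    w.r.t. the input at each checkpoint, per-pixel variance across checkpoints,
    averaged over pixels."""
    grads = []
    for model in checkpoints:                   # models saved at several epochs
        model.eval()
        inp = x.clone().requires_grad_(True)
        logit = model(inp.unsqueeze(0))[0, y]   # pre-softmax score of true class
        grad, = torch.autograd.grad(logit, inp)
        grads.append(grad)
    stacked = torch.stack(grads)                # (num_checkpoints, *x.shape)
    return stacked.var(dim=0).mean().item()     # mean per-pixel variance

# Rank a dataset: scores = [vog_score(ckpts, x, y) for x, y in dataset]
```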

60 of 146

VoG computes a relative ranking of each class.

Estimating Example Difficulty using Variance of Gradients, Agarwal, D'Souza and Hooker, 2020

What examples does the model find challenging or easy to learn?

61 of 146

Easy examples are learnt early in training, while harder examples require memorization later in training.

Estimating Example Difficulty using Variance of Gradients, Agarwal, D'Souza and Hooker, 2020

[Figure: examples ranked from low to high variance at early-stage training (0 epochs) and late-stage training (90 epochs).]

62 of 146

63 of 146

Convergence rates differ between different types of examples. We can leverage and amplify these differences to distinguish between atypical and noisy examples.

[[A tale of two long tails, D’souza et al. 2021]]

64 of 146

How do system level and algorithm design choices impact model behavior?

65 of 146

In an introduction to machine learning, you are often presented with a linear polynomial fit like this.

66 of 146

As you increase the degree of the polynomial, you can see how it impacts how the model learns the distribution.
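A minimal numpy illustration of the same point, assuming noisy samples from a sine curve (the data here is purely illustrative): raising the polynomial degree changes what the model learns about the distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)  # noisy target

for degree in (1, 3, 9):
    coeffs = np.polyfit(x, y, degree)           # least-squares polynomial fit
    mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(f"degree {degree}: training MSE = {mse:.4f}")
# Training error falls as the degree rises, but high-degree fits chase the noise.
```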

67 of 146

This is one of the first lessons students learn -- the choice of model function matters.

68 of 146

Our modelling choices -- architecture, loss function, optimizer -- all express a preference for final model behavior.

69 of 146

Model Compression - compact machine learning models to work in resource constrained environments.

Model fragility and security - deploy secure models that protect user privacy.

Fairness - imposes a constraint on optimization that reflects societal norms of what is fair.

Often, ML literature makes the unrealistic assumption that optimizing for one property holds all others static.

How we often talk about different properties in the literature.

70 of 146

However, our design choices involve trade-offs between objectives.

Model Compression - compact machine learning models to work in resource constrained environments.

Model fragility and security - deploy secure models that protect user privacy.

Fairness - imposes a constraint on optimization that reflects societal norms of what is fair.

71 of 146

Case Study: How does model compression trade-off against other properties we care about such as robustness and fairness?

72 of 146

A “bigger is better” race in the number of model parameters has gripped the field of machine learning.

73 of 146

This characterizes both vision and NLP tasks.

74 of 146

An argument in favor of this approach:

  • Different regimes of capacity appear to allow for different generalization properties.
  • It is a very simple formula (throw more parameters at the model).

75 of 146

A key limitation of this approach:

The relationship between weights and generalization properties is not well understood.

76 of 146

The intriguing relationship between capacity and generalization.

Why do we need so many weights in the first place?

  • Diminishing returns to adding more weights.
  • Many redundancies between weights
  • We can remove most weights after training.

77 of 146

Diminishing returns to adding parameters. Millions of parameters are needed to eke out additional gains.

Almost double the number of weights for a gain of 2 percentage points.

78 of 146

Redundancies Between Weights

Denil et al. find that a small set of weights can be used to predict 95% of weights in the network.

79 of 146

Most weights can be removed after training is finished (while only losing a few % in test-set accuracy!)

[[The State of Sparsity in Deep Neural Networks, 2019, Gale, Elsen, Hooker]]

With 90% of the weights removed, a ResNet-50 only loses ~3% of performance (for certain pruning methods).

80 of 146

Sparse models easily outcompete dense models with the same parameter count.

Efficient Neural Audio Synthesis, Kalchbrenner et al., 2018

81 of 146

Understanding how capacity impacts generalization is an increasingly urgent question:

How do generalization properties change as models get bigger and bigger?

  • Fairness, robustness, privacy.

Increasingly, we are also making design choices at test-time that alter generalization properties - pruning, quantization, fine-tuning.

82 of 146

Why is this interesting? Theoretical reasons:

We are in an era of ever bigger models.

Yet, there is a limit to how much we can scale. Understanding the role of capacity can guide us to more efficient solutions.

If most weights are redundant, why do we need them in the first place?

Can these insights guide us to better training protocols?

83 of 146

Most of the world uses ML in a resource constrained environment.

The Low-Resource Double Bind: An Empirical Study of Pruning for Low-Resource Machine Translation, link

Why is this interesting? Practical reasons:

84 of 146

If we care about access to technology, we need to revisit our model design assumptions:

As you increase the size of networks:

  • More memory to store
  • Higher latency for each forward pass in training + inference time

ML at the edge:

  • Many different devices, hardware constraints
  • Many different resource constraints - memory, compute
  • Power, connectivity varies

85 of 146

0% pruning: 76.70% accuracy. 50% pruning: 76.20% accuracy.

How can networks with radically different structures and numbers of parameters have comparable performance?

86 of 146

One possibility is that top-line metrics are not a precise enough measure to capture how capacity impacts the generalization properties of the model.

To gain intuition into differences in generalization behavior, we go beyond topline metrics.

87 of 146

Sparsity in Deep Learning

Weight Sparsity

Sources: Pruning, sparse training

Example: 3-5x FLOP advantage for the same accuracy in CNNs [1][2]

Activation Sparsity

Sources: ReLU sparsity, sparse attention

Example: Asymptotic improvement in attention computational complexity, from O(N²) to O(N·√N) to O(N) [3][4]

Data Sparsity

Sources: Point clouds, graphs, etc.

Example: 3D object detection with targeted computation [5]

Image sources: https://bit.ly/3a05ovu

88 of 146

Sparsify networks by “removing” unimportant activations/weights (setting weights/neurons to zero).

Initial weight matrix

After activations have been removed.

Image source: OpenAI

89 of 146

Instead of starting sparse, most state-of-the-art sparsity methods introduce sparsity gradually over the course of training.

Image from Torsten Hoefler tutorial
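A minimal sketch of gradual magnitude pruning for a plain PyTorch model; the cubic ramp is one commonly used schedule, and the details (which layers, when the mask is applied, whether pruned weights stay frozen) are simplifications rather than the exact recipe of any one paper.

```python
import torch

def target_sparsity(step, total_steps, final_sparsity=0.9):
    """Cubic ramp from 0 to final_sparsity over the course of training."""
    frac = min(step / total_steps, 1.0)
    return final_sparsity * (1.0 - (1.0 - frac) ** 3)

@torch.no_grad()
def apply_magnitude_mask(model, sparsity):
    """Zero the smallest-magnitude fraction of weights in each linear/conv layer."""
    for module in model.modules():
        if isinstance(module, (torch.nn.Linear, torch.nn.Conv2d)):
            w = module.weight
            k = int(sparsity * w.numel())
            if k == 0:
                continue
            threshold = w.abs().flatten().kthvalue(k).values
            w.mul_((w.abs() > threshold).float())   # keep only the large weights

# Inside a training loop (sketch):
# for step, (x, y) in enumerate(loader):
#     ...forward, backward, optimizer.step()...
#     apply_magnitude_mask(model, target_sparsity(step, total_steps))
```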

90 of 146

Selective Brain Damage: Measuring the Disparate Impact of Model Pruning

Sara Hooker, Aaron Courville, Yann Dauphin, Andrea Frome

Learn more about PIEs at https://weightpruningdamage.github.io/

Sparsity of 90% means that by the end of training the model only has 10% of all weights remaining; a mask of 0 is applied to the pruned weights.

Initial weight matrix

After activations have been removed.

Image source

91 of 146

[Figure: overparameterized dense model (0% sparsity) vs. model with 90% of weights removed.]

Train populations of models to different end sparsities [0%, 30%, 50%, 70%, 90%, 95%, 99%], with minimal differences in test-set accuracy between them.

Valuable experimental set-up - we can precisely vary the level of final sparsity.

92 of 146

Here, we ask: how does model behavior diverge as we vary the level of compression?

  1. Measure divergence in class-level and exemplar classification performance.
  2. Robustness to certain types of distribution shift.

93 of 146

Why is a narrow part of the data distribution far more sensitive to varying capacity?

Key results upfront: top-level metrics hide critical differences in generalization between compressed and non-compressed populations of models.

Varying sparsity disproportionately and systematically impacts a small subset of classes and exemplars.

94 of 146

Pruning Identified Exemplars (PIEs)

are images where predictive behavior diverges between a population of independently trained compressed and non-compressed models.
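A minimal sketch of how PIEs can be identified from stored predictions, following the definition above; the array shapes and names are assumptions for illustration.

```python
import numpy as np
from scipy import stats

def find_pies(dense_preds: np.ndarray, sparse_preds: np.ndarray) -> np.ndarray:
    """dense_preds, sparse_preds: (num_models, num_examples) arrays of predicted
    class ids from independently trained dense and compressed models. Returns a
    boolean mask marking examples where the modal predictions disagree (PIEs)."""
    dense_mode = stats.mode(dense_preds, axis=0, keepdims=False).mode
    sparse_mode = stats.mode(sparse_preds, axis=0, keepdims=False).mode
    return dense_mode != sparse_mode

# pie_mask = find_pies(dense_preds, sparse_preds)
# pie_indices = np.flatnonzero(pie_mask)
```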

95 of 146

ImageNet test-set.

True label?

96 of 146

97 of 146

ImageNet test-set.

True label?

98 of 146

99 of 146

ImageNet test-set.

True label?

100 of 146

101 of 146

ImageNet test-set.

True label?

102 of 146

103 of 146

ImageNet test-set.

True label?

104 of 146

105 of 146

  • Restricting inference to PIEs drastically degrades model performance.
  • For ImageNet, removing PIEs from test-set improves top-1 accuracy beyond baseline.

PIEs are also more challenging for algorithms to classify.

106 of 146

Attribute Proportion of CelebA Training Data vs. relative representation in PIE

PIEs over-index on the long-tail of underrepresented attributes.

107 of 146

[Figure: overparameterized dense model (0% sparsity) vs. model with 90% of weights removed.]

Put differently, we are using the majority of our weights to encode a useful representation for a small fraction of our training distribution.

We lose the long-tail when we remove the majority of all training weights.

108 of 146

Low-frequency events - The majority of weights (90% of all weights) are used to memorize very rare examples in the dataset.

When we remove weights, models lose performance on rare examples.

109 of 146

Noisy Data Points

  • Data is improperly structured which corrupts information
    • Mislabelled
    • Severely corrupted
    • Multi-object

Misuse of parameters to represent these data points.

“Bad memorization”

110 of 146

Noisy PIEs: Incorrectly structured ImageNet data for single-image classification.

111 of 146

Noisy PIEs: Corrupted or incorrectly labeled data.

112 of 146

Noisy Data Points

  • Data is improperly structured which corrupts information
    • Mislabelled
    • Severely corrupted
    • Multi-object

Misuse of parameters to represent these data points.

“Bad memorization”

Atypical Data Points or Challenging Exemplars

  • Underrepresented vantage points (the long-tail of the dataset)
  • Image classification entails fine grained task

Valuable use of parameters to represent these data points.

“Good memorization”

113 of 146

Atypical PIEs: Unusual vantage points of the class category.

114 of 146

Whether pruning aids or impedes performance depends upon how relevant learning rare artefacts is for the task.

Two common types of examples in the long-tail:

  • Noisy
  • Atypical

115 of 146

This is very related to whether memorization aids or hurts generalization.

Noisy Data Points

  • Data is improperly structured which corrupts information.

Misuse of parameters to represent these data points.

“Bad memorization”

Atypical Data Points or Challenging Exemplars

  • Underrepresented vantage points (the long-tail of the dataset)

Valuable use of parameters to represent these data points.

“Good memorization”

In-distribution considerations:

116 of 146

This is very related to whether memorization aids or hurts generalization.

Test data is very different from my training distribution (typical of low data regimes)

Memorization of rare artefacts in the training data is unlikely to help with generalization to the test data.

“Bad memorization”

Test dataset is very similar to my training distribution (more common with VERY big data regimes)

Memorization of rare artefacts in the training data is likely to help with generalization to the test data.

“Good memorization”

Out-of-distribution considerations:

117 of 146

In-distribution considerations - amplification of error on rare/underrepresented attributes.

118 of 146

How does compression impact performance on the long-tail in-distribution?

Noisy Data Points

  • Data is improperly structured which corrupts information.

Misuse of parameters to represent these data points.

“Bad memorization”

Atypical Data Points or Challenging Exemplars

  • Underrepresented vantage points (the long-tail of the dataset)

Valuable use of parameters to represent these data points.

“Good memorization”

In-distribution considerations:

119 of 146

Attribute Proportion of CelebA Training Data vs. relative representation in PIE

Compression amplifies algorithmic harm when the protected feature is in the long-tail of the distribution.

120 of 146

Measuring Impact of Compression on Algorithmic Bias

CelebA: spurious correlation between gender, age, and hair color {Blond, Non-Blond}.

There are far fewer examples of ‘Blond Male’ (0.85% of the training set) and ‘Blond Old’ (2.43%).

121 of 146

We find sparsity disproportionately impacts underrepresented features.

122 of 146

Pruning amplifies algorithmic bias when the underrepresented feature is sensitive (age/gender)

123 of 146

Civil Comments: task of detecting toxic comments. The target label toxic is present for only ~8% of the training set.

124 of 146

Sparsity sharply degrades the model's ability to detect toxic comments. The most impacted sub-groups are the least represented in the training set.

125 of 146

126 of 146

How does compression impact performance on OOD data points?

Test data is very different from my training distribution (typical of low data regimes)

Memorization of rare artefacts in the training data is unlikely to help with generalization to the test data.

“Bad memorization”

Test dataset is very similar to my training distribution (more common with VERY big data regimes)

Memorization of rare artefacts in the training data is likely to help with generalization to the test data.

“Good memorization”

Out-of-distribution considerations:

127 of 146

Limited Data Regime

Compute resource constraints

Low resource double-bind

The Low-Resource Double Bind: An Empirical Study of Pruning for Low-Resource Machine Translation

128 of 146

Key results upfront: In a low data regime, sparsity disproportionately impacts performance on the long-tail.

Prototypical test-set

Random test set

129 of 146

Surprisingly, we also find that in this setting, high levels of sparsity consistently improve generalization to out-of-distribution datasets.

130 of 146

Relates to a wider question - when do we want to curb or aid memorization of rare features?

JW300 is a very specialized religious corpus. Rare artefacts are even rarer in other settings we wish to generalize to.

131 of 146

What all these settings have in common is that memorization is currently very expensive.

The majority of weights (90% of all weights) are used to memorize very rare examples in the dataset.

132 of 146

This has far-ranging implications. Most natural image, NLP, and audio datasets follow a Zipf distribution. If we want to model the world, we need to design and train models that can efficiently navigate low-frequency events.

133 of 146

Other model design choices which can amplify or curb harm.

134 of 146

Privacy trade-off with fairness - differential privacy disproportionately impacts underrepresented attributes.

135 of 146

Stopping training early disproportionately impacts performance on less common and more challenging features.

Recent research suggests there are distinct stages to training.

Characterizing Structural Regularities of Labeled Data in Overparameterized Models, 2020 (link)

Critical Learning Periods in Deep Neural Networks, 2017 (link)

Actionable:

  • Understand what features emerge when.
  • Allows us to fix incorrect labels/annotation error
  • Identify biases

136 of 146

  • Underrepresented attributes are disproportionately impacted by the introduction of stochasticity.

  • High variance is an AI-safety issue for sub-groups with sensitive attributes like race, gender, and age.

Tooling also can impact the generalization properties of your algorithm.

The non-determinism introduced by tooling disproportionately impacts underrepresented attributes.

Data points distribution in CelebA dataset

Randomness In Neural Network Training: Characterizing The Impact of Tooling [Donglin Zhuang, Xingyao Zhang, Shuaiwen Song, Sara Hooker]
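A minimal sketch of how such run-to-run variance can be surfaced, assuming per-subgroup accuracies have been collected from several training runs that differ only in seed or hardware (the numbers and column names below are purely illustrative):

```python
import pandas as pd

# Hypothetical results: one row per (training run, subgroup) with its accuracy.
runs = pd.DataFrame({
    "run":      [0, 0, 1, 1, 2, 2],
    "subgroup": ["blond_male", "non_blond"] * 3,
    "accuracy": [0.71, 0.95, 0.64, 0.95, 0.77, 0.94],
})

# Standard deviation of accuracy across runs, per subgroup: underrepresented
# subgroups typically show much higher variance across otherwise identical runs.
print(runs.groupby("subgroup")["accuracy"].agg(["mean", "std"]))
```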

137 of 146

Alan Blackwell said in 1997 that in computer science

“many sub-goals can be deferred to the degree that they become what is known amongst professional programmers as S.E.P - somebody else’s problem”

The belief that algorithmic bias is only a dataset problem invites diffusion of responsibility and misses important opportunities to curb harm.

138 of 146

Lord Kelvin reflected, “If you cannot measure it, you cannot improve it.”

Acknowledging that model design matters has the benefit of spurring more research focus on how it matters, and will inevitably surface new insights into how we can design models to minimize harm.

139 of 146

The way forward

140 of 146

Deploying an algorithm involves many different steps.

Data Collection → Data Labeling → Training using some objectives and metrics → User data filtered, ranked and aggregated → Users see an effect → User behavior informs further data collection

141 of 146

Training using some objectives and metrics on an open source curated dataset

Abstract away data collection

Abstract away deployment

However, the machine learning research community has disproportionately published around one step.

142 of 146

Training using some objectives and metrics on an open source curated dataset

Abstract away data collection

Abstract away deployment

The surprisingly widely held belief that models are impartial displaces responsibility for bias to those responsible for the data pipeline.

143 of 146

Training using some objectives and metrics on an open source curated dataset

Abstract away data collection

Abstract away deployment

If bias is not fully addressed in the data pipeline, harm is a product of both data and design choices. Model design choices can and do amplify harm.

144 of 146

Training using some objectives and metrics on an open source curated dataset

Abstract away data collection

Abstract away deployment

Understanding the interactions between model and dataset can open up new mitigation strategies for designing models that are better specified.

145 of 146

Closing Thoughts (and Q&A)

146 of 146

Moving beyond “algorithmic bias is a data problem.” Sara Hooker [[link]]

Estimating Example Difficulty using Variance of Gradients Chirag Agarwal*, Sara Hooker* [[link]]

The Low-Resource Double Bind: An Empirical Study of Pruning for Low-Resource Machine Translation, Orevaoghene Ahia, Julia Kreutzer, Sara Hooker [[link]]

What do compressed deep neural networks forget?, Sara Hooker, Aaron Courville, Gregory Clark, Yann Dauphin, Andrea Frome [[link]]

Characterizing Bias in Compressed Models Sara Hooker*, Nyalleng Moorosi*, Gregory Clark, Samy Bengio, Emily Denton [[link]]

Final takeaways:

Beyond test-set accuracy - It is not always possible to measure the trade-offs between criteria using test-set accuracy alone.

The myth of the compact, private, interpretable, fair model - Desiderata are not independent of each other. Training beyond test set accuracy requires trade-offs in our model preferences.

Understanding the interactions between model and dataset can open up new mitigation strategies.


Email: shooker@google.com

Questions?