Machine Learning Systems Design
ML beyond accuracy: Fairness, Security, Governance
CS 329S | Chip Huyen
Sara Hooker
My research agenda to-date has focused on:
Model Interpretability - reliable explanations for model behavior.
Model Compression - compact machine learning models to work in resource constrained environments.
Model fragility and security - deploy secure models that protect user privacy.
Fairness - imposes constraints on optimization that reflect societal norms of what is fair.
I’ll mention research collaborations with my colleagues:
Nyalleng Moorosi, Gregory Clark, Samy Bengio, Emily Denton, Aaron Courville, Yann Dauphin, Andrea Frome, Chirag Agarwal, Daniel Souza, Dumitru Erhan, Oreva Ahia, Julia Kreutzer.
The Imperfect Objective
Understanding trade-offs between desiderata.
The Role of Model Design: Characterizing Bias and Developing Trustworthy AI Models
The imperfect objective.
What if discomfort is not uniform, but targeted?
Our goal is to bridge the technology design gap - develop technology that works for everyone.
Achieving this requires understanding how our modelling choices shape downstream impact.
Over the last decade, the “performance” of a model has been treated as synonymous with the pursuit of top-1 accuracy.
Top-line metrics do not guarantee that the trained function fulfills other properties we may care about.
Empirical risk minimization - train a representation to minimize average error.
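For reference, a minimal statement of the ERM objective (standard notation, not specific to these slides): the trained function is the one that minimizes average training loss over a function class,

$$\hat{f} \;=\; \arg\min_{f \in \mathcal{F}} \; \frac{1}{n} \sum_{i=1}^{n} \ell\big(f(x_i),\, y_i\big),$$

and nothing in this objective expresses the desiderata listed next.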
Compactness
Interpretability
Adversarial Robustness
Fairness
Typical loss functions in machine learning (MSE, Hinge-Loss and CE) impose no preference for functions that are interpretable, fair, robust or guarantee privacy.
Donald Knuth said “computers do exactly what they are told, no more and no less.”
A model can fulfill an objective in many ways, while violating the spirit of said objective.
The Clever Hans Effect 1891 - 1907
Hans the horse:
High accuracy without true learning.
Experimental Design -
Can Hans answer a question if the human does not know the answer?
Hans answered correctly by picking up on subtle, involuntary cues from the questioner.
The under-specification of our objective function often leads to undesirable model behavior termed “shortcut learning.”
[Example image labels: cow, limousine, sheep, dog.]
Blog link
Task is predicting Ending of Story:
[[Cai et al. 2017]]
Pay Attention to the Ending -
Work by Cai et al. shows that you can discard the story plots entirely and train only to choose the better of two endings, reaching 72.5% accuracy.
High accuracy without “true” learning.
The under-specification of our objective function often leads to undesirable model behavior termed “shortcut learning.”
“Shortcut learning” is due to relative under/overrepresentation of training features.
For example, a model may learn to correlate blonde hair with being female because there are far fewer blonde males in the training dataset.
This results in higher error rates on the long-tail -- the underrepresented features in the dataset.
When this happens in sensitive domains, there can be a huge cost to human welfare.
High accuracy without “true” learning.
Skin lesions
Pneumonia
How a model treats underrepresented features in the long-tail of the distribution often coincides with notions of fairness.
Geographic bias in how we collect our datasets. Shankar et al. (2017) show models perform far worse on locales undersampled in the training set.
No Classification without Representation: Assessing Geodiversity Issues in Open Data Sets for the Developing World (Shankar et al., link)
Men also like shopping (and cooking too).
Undersampling/oversampling leads to undesirable spurious correlations.
Zhao, Jieyu et al. (2017) show Activity recognition datasets exhibit stereotype-aligned gender biases.
Delegating learning of the function to the model can (and has) led to Clever Hans moments.
High accuracy without “true” learning.
Overfitting to pattern matching
Memorization leads to Leakage of private text
Can generate factually incorrect statements
Open challenges in auditing and mitigating harmful bias.
Preferences about how our trained model should behave on a subset of sensitive or protected features.
Fairness
Legally protected features:
Certain attributes are protected by law. For example, in the US it is illegal to discriminate based upon race, color, religion, sex, national origin, disability.
Legal framework will differ by country.
Sensitive features:
Income, eye color, hair, skin color, accent, locale.
These features may not be protected by law, but are often correlated with protected attributes.
Your choice of tool to audit and mitigate algorithmic bias will depend upon whether you know:
- the sensitive features which are adversely impacted
- have comprehensive labels for these features
1.
What is it? Statistically evaluate model performance (e.g. accuracy, error rates) by “subgroup”, e.g. skin tone, gender, age.
Requires: Good, “balanced” test sets that are representative of the actual use-case(s) for the model in production.
| Fitzpatrick Skin Type | Male | Female | Non-binary |
| Type I | | | |
| Type II | | | |
| Type III | | | |
| Type IV | | | |
| Type V | | | |
| Type VI | | | |
Each cell: Acc / FNR / FPR / other metric.
Example of intersectional audit
Gender Shades - Evaluated classifiers’ performance across genders, skin types, and intersection of gender and skin type
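As a concrete illustration of this kind of audit, here is a minimal sketch in Python. The column names (`pred`, `label`, `gender`, `skin_type`) and the helper `subgroup_report` are hypothetical, not taken from Gender Shades:

```python
# Sketch of a disaggregated audit: accuracy, FNR and FPR per
# (gender, skin type) subgroup of a predictions dataframe.
import pandas as pd

def subgroup_report(df: pd.DataFrame) -> pd.DataFrame:
    rows = []
    for (gender, skin_type), g in df.groupby(["gender", "skin_type"]):
        tp = ((g.pred == 1) & (g.label == 1)).sum()
        tn = ((g.pred == 0) & (g.label == 0)).sum()
        fp = ((g.pred == 1) & (g.label == 0)).sum()
        fn = ((g.pred == 0) & (g.label == 1)).sum()
        rows.append({
            "gender": gender, "skin_type": skin_type, "n": len(g),
            "accuracy": (tp + tn) / len(g),
            "fnr": fn / max(fn + tp, 1),  # false negative rate
            "fpr": fp / max(fp + tn, 1),  # false positive rate
        })
    return pd.DataFrame(rows)
```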
When labels are known and complete - opens up range of remedies to mitigate impact
Data-Based
2. Remove problematic feature from training set (not always feasible)
3. Tailored Augmentation Strategies
Examples of counterfactual data augmentation remedies:
Counterfactual data augmentation strategies (CDA):
“the man who pioneered the church named it [...]”
Generate a counterfactual sentence by substituting the word’s gender-partner in its place:
“the woman who pioneered the church [...]”
[[Lu et al. (2018), Maudslay et al. (2019), Webster et al. (2021)]]
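A minimal sketch of the substitution idea above. The word list and function name are illustrative only; as discussed later, real pipelines need carefully curated, language-specific lists:

```python
# Toy counterfactual data augmentation: swap gendered words with their
# partner word. Capitalization handling and coverage are deliberately naive.
GENDER_PAIRS = {"man": "woman", "woman": "man",
                "he": "she", "she": "he",
                "his": "her", "her": "his"}

def counterfactual(sentence: str) -> str:
    swapped = []
    for token in sentence.split():
        core = token.strip(".,").lower()
        if core in GENDER_PAIRS:
            token = token.lower().replace(core, GENDER_PAIRS[core])
        swapped.append(token)
    return " ".join(swapped)

print(counterfactual("the man who pioneered the church named it after his mother"))
# -> "the woman who pioneered the church named it after her mother"
```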
When labels are known and complete - range of remedies to mitigate impact.
Data-Based
2. Remove problematic feature from training set (not always feasible due to proxy variables)
3. Tailored Augmentation Strategies
Model-Based
2. Rate constraint - guaranteeing recall, data co-occurrence or another rate metric is at least/most [x%].
[[Zhao et al. 2017]]
3. Worst case error constraint
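One way to read the model-based remedies above: add or enforce a group-aware term instead of minimizing only the average loss. Below is a sketch of a worst-case group loss (my own illustration, not the exact formulation in the cited work):

```python
# Worst-case group objective: minimize the loss of the worst-performing
# group in each batch rather than the batch average.
import torch
import torch.nn.functional as F

def worst_group_loss(logits, labels, group_ids):
    """logits: (B, C); labels: (B,); group_ids: (B,) integer group labels."""
    per_example = F.cross_entropy(logits, labels, reduction="none")
    group_losses = [per_example[group_ids == g].mean()
                    for g in torch.unique(group_ids)]
    return torch.stack(group_losses).max()  # optimize the hardest group
```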
Incomplete or no labels for sensitive features
What about where we don’t have complete labels for the sensitive attribute we care about?
2.
How does my model perform...
Classification accuracy / precision-recall curve / logarithmic loss / area under the curve / mean squared error / mean absolute error / F1 score / standard deviation / variance / confidence intervals / KL divergence / false positive rate / false negative rate
How might my model perform...
on a sample of test data / on cross-slices of test data / on an individual data point / if a datapoint is perturbed / if model thresholds were different / across all values of a feature / when compared to a different model
Can you spot a metric which doesn’t require labels?
Most of our remedies are centered around the assumption that we have comprehensive labelling available. However, this is a tenuous assumption.
Any thoughts why? What challenges exist around ensuring comprehensive labelling?
More often than not, we do not have comprehensive labels from human annotators.
1. For high dimensional problems, often time consuming/infeasible to comprehensively label.
Example: an image labeled “church” also contains a bird, nest, street lamp, cross, statue, window, window grid.
1.1. For NLP tasks, labels often must be curated separately for each language.
the man who pioneered the church named it [...])”
Generate a counterfactual sentence by substituting the word’s gender-partner in its place
“the woman who pioneered the church [...])”
Requires a curated list of words to substitute. This has to be curated for different languages.
The GEM Benchmark [[Gehrmann et al. 2021]]
2. Labelling protected features is often insufficient, necessary to label all proxy features.
Task: Sleeping or awake?
Consider species to be the protected attribute; many other variables (e.g. indoor/outdoor) may be proxy variables.
In languages like Portuguese, remedies like augmentation by swapping protected attributes require not only identifying protected nouns but also altering any words that are modified by the noun.
A minha mãe é Professora.
O meu pai é Professor.
My mom is a teacher.
My dad is a teacher.
2. Labelling protected features is often insufficient, necessary to label all proxy features.
Gender swapping is much harder in languages where the adjectives, articles, and pronouns that agree with these nouns also adjust to comply with gender.
2. Labelling protected features is often insufficient, necessary to label all proxy features.
3. Legal obstacles around collecting certain sensitive features.
A recent dataset release by Facebook is notable for both compensating participants and obtaining their consent.
[[Andrus et al. 2021, Veal 2017]].
4. Inconsistency in how sensitive features are labelled.
[[Khan et al. 2021]]
Images with high (left) and low (right) levels of agreement on how to annotate race.
4.1 Annotation errors are prevalent in most large scale datasets.
[[Bias in Bios: A Case Study of Semantic Representation Bias in a High-Stakes Setting, De-Arteaga et al. 2019]]
Tagged for the profession “model”:
"Hank Sheinkopf is a model for brilliant communications in a world where messages are broadcast at astounding rates. He’s a player in the PR world of political campaigns, both in domestic and foreign sectors."
Example found by Preethi Seshadri.
4.1 The subjectivity of certain labels may result in high annotator disagreement /variance.
What constitutes toxic speech?
[[Akin et al. 2018]]
5. Annotation can be biased by the lived experience of the annotators
[[On Releasing Annotator-Level Labels and Information in Datasets Prabhakaran, Davani et al. 2021]]
Label aggregation may introduce representational biases of individual and group perspectives.
6. What we deem to be harm is not static. It is shaped by political/geographical/economic/historical considerations.
1999 - Michigan removes a law prohibiting citizens from using “reproachful or contemptuous language” in print against anyone who declines a duel challenge.
In 1963, the Swiss government passed the first measures to make sure every inhabitant had access to a nuclear shelter.
The rules and social contract we ask citizens to abide by have changed over time, and vary by country.
Fairness considerations are not static across time or space.
We just covered several of the key challenges in ensuring comprehensive labelling.
If we cannot guarantee we have fully addressed bias in the data pipeline, the overall harm in a system is a product of the interactions between the data and our model design choices.
Recognizing how model design impacts harm opens up new mitigation techniques that are far less burdensome than comprehensive data collection.
Leveraging model signal to audit large scale datasets
Global feature importance - ranks dataset examples from most to least challenging.
Data Cleaning
Isolating subset for relabelling
Identify issues with fairness
Surfaces a tractable subset of the most challenging/least challenging examples for human inspection. Avoids time consuming need to inspect every example.
One of the biggest challenges in improving datasets is identifying where to invest limited human annotator time.
Compute average variance in gradients (VOG) for an image over training.
[Figure: 0 epochs vs. 90 epochs.]
Estimating Example Difficulty using Variance of Gradients, Agarwal, D'Souza and Hooker, 2020
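A rough sketch of the VoG computation, assuming you have saved checkpoints and a data loader that iterates in a fixed order. Names and normalization details are illustrative, not the released implementation:

```python
# For each example: gradient of the true-class logit w.r.t. the input at
# several checkpoints, then per-pixel variance across checkpoints,
# averaged over pixels to give one score per example.
import torch

def vog_scores(checkpoints, loader, device="cpu"):
    per_ckpt_grads = []
    for model in checkpoints:
        model.eval().to(device)
        grads = []
        for x, y in loader:  # loader must not shuffle between checkpoints
            x = x.to(device).requires_grad_(True)
            logits = model(x)
            true_class_logit = logits.gather(1, y.to(device).unsqueeze(1)).sum()
            g, = torch.autograd.grad(true_class_logit, x)
            grads.append(g.detach().cpu())
        per_ckpt_grads.append(torch.cat(grads))    # (N, C, H, W)
    stacked = torch.stack(per_ckpt_grads)          # (K, N, C, H, W)
    return stacked.var(dim=0, unbiased=False).mean(dim=(1, 2, 3))  # (N,)
```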
Variance of Gradients (VoG) is an example of a global ranking tool.
VoG computes a relative ranking of examples within each class.
Estimating Example Difficulty using Variance of Gradients, Agarwal, Souza and Hooker, 2020
What examples does the model find challenging or easy to learn?
Easy examples are learnt early in training, harder examples require memorization later in training.
Estimating Example Difficulty using Variance of Gradients, Agarwal, Souza and Hooker, 2020
[Figure: low-variance vs. high-variance examples at early stage training (0 epochs) and late stage training (90 epochs).]
Convergence rates differ between different types of examples. We can leverage and amplify these differences to distinguish between atypical and noisy examples.
[[A tale of two long tails, D’souza et al. 2021]]
How do system level and algorithm design choices impact model behavior?
In an introduction to machine learning, you are often presented with a simple polynomial fit like this.
As you increase the degree of the polynomial, you can see how it impacts how the model learns the distribution.
This is one of the first lessons students learn -- the choice of model function matters.
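A toy version of that first lesson (my own illustration, not from the slides): fit polynomials of increasing degree to the same noisy data and watch how the choice of function class changes what is learned.

```python
# Fit polynomials of degree 1, 3 and 9 to noisy samples of a sine curve.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

for degree in (1, 3, 9):
    coeffs = np.polyfit(x, y, deg=degree)   # least-squares fit
    y_hat = np.polyval(coeffs, x)
    print(f"degree={degree}: training MSE={np.mean((y - y_hat) ** 2):.3f}")
```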
Our modelling choices -- architecture, loss function, optimizer all express a preference for final model behavior.
Model Compression - compact machine learning models to work in resource constrained environments.
Model fragility and security - deploy secure models that protect user privacy.
Fairness - imposes constraints on optimization that reflect societal norms of what is fair.
Often, ML literature makes the unrealistic assumption that optimizing for one property holds all others static.
How we often talk about different properties in the literature.
However, our design choices involve trade-offs between objectives.
Model Compression - compact machine learning models to work in resource constrained environments.
Model fragility and security - deploy secure models that protect user privacy.
Fairness - imposes constraints on optimization that reflect societal norms of what is fair.
Case Study: How does model compression trade-off against other properties we care about such as robustness and fairness?
A “bigger is better” race in the number of model parameters has gripped the field of machine learning.
Canziani et al. 2016; OpenAI 2019
This characterizes both vision and NLP tasks.
Link here [Sharir et al. 2020]
An argument in favor of this approach:
A key limitation of this approach:
Relationship between weights and generalization properties is not well understood.
The intriguing relationship between capacity and generalization.
Why do we need so many weights in the first place?
Diminishing returns to adding parameters. Millions of parameters are needed to eke out additional gains.
Table: Kornblith et al., 2018
Almost double the number of weights for a gain of 2 percentage points.
Redundancies Between Weights
Denil et al. find that a small set of weights can be used to predict 95% of weights in the network.
Most weights can be removed after training is finished (while only losing a few % in test-set accuracy!)
[[The State of Sparsity in Deep Neural Networks, 2019, Gale, Elsen, Hooker]]
With 90% of the weights removed, a ResNet-50 only loses ~3% of performance (for certain pruning methods).
Sparse models easily outcompete dense models with same parameter count.
Efficient Neural Audio Synthesis, Kalchbrenner et al., 2018
Understanding how capacity impacts generalization is an increasingly urgent question:
How do generalization properties change as models get bigger and bigger?
Increasingly, we are also making design choices at test-time that alter generalization properties - pruning, quantization, fine-tuning.
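For intuition on one of those test-time choices, a toy post-training quantization sketch (illustrative only; production schemes are per-channel, calibrated and hardware-aware):

```python
# Symmetric int8 quantization with a single scale, then dequantize to
# inspect the rounding error introduced at test time.
import numpy as np

def quantize_int8(w: np.ndarray):
    scale = max(float(np.abs(w).max()) / 127.0, 1e-12)
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
print("max abs error:", np.abs(w - dequantize(q, scale)).max())  # ~ scale / 2
```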
Why is this interesting? Theoretical reasons:
We are in an era of ever bigger models.
Yet, there is a limit to how much we can scale. Understanding the role of capacity can guide us to more efficient solutions.
If most weights are redundant, why do we need them in the first place?
Can these insights guide us to better training protocols?
Most of the world uses ML in a resource constrained environment.
The Low-Resource Double Bind: An Empirical Study of Pruning for Low-Resource Machine Translation, link
Why is this interesting? Practical reasons:
If we care about access to technology, we need to revisit our model design assumptions:
As you increase size of networks:
ML at the edge:
0% pruning: 76.70%; 50% pruning: 76.20%.
How can networks with radically different structures and number of parameters have comparable performance?
One possibility is that top-line metrics are not a precise enough measure to capture how capacity impacts the generalization properties of the model.
To gain intuition into differences in generalization behavior, we go beyond topline metrics.
Sparsity in Deep Learning
Weight Sparsity
Sources: Pruning, sparse training
Example: 3-5x FLOP advantage for the same accuracy in CNNs [1][2]
Activation Sparsity
Sources: ReLU sparsity, sparse attention
Example: Asymptotic improvement in attention computational complexity, from O(N²) to O(N·√N) to O(N) [3][4]
Data Sparsity
Sources: Point clouds, graphs, etc.
Example: 3D object detection with targeted computation [5]
Image sources: https://bit.ly/3a05ovu
Sparsify networks by “removing” unimportant activations/weights (setting weights/neurons to zero).
Initial weight matrix
After activations have been removed.
Image source: OpenAI
Instead of starting sparse, most state-of-the-art sparsity methods introduce sparsity gradually over the course of training.
Image from Torsten Hoefler tutorial
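A minimal one-shot magnitude-pruning sketch (illustrative; the methods cited here typically introduce sparsity gradually and per layer rather than in one shot):

```python
# Zero out the fraction `sparsity` of weights with smallest magnitude
# and return the binary mask that records which weights survive.
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float):
    k = int(np.round(sparsity * weights.size))
    if k == 0:
        return weights.copy(), np.ones_like(weights, dtype=bool)
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask, mask

w = np.random.randn(256, 256)
w_pruned, mask = magnitude_prune(w, sparsity=0.9)
print(f"remaining weights: {mask.mean():.2%}")  # roughly 10% survive
```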
Selective Brain Damage: Measuring the Disparate Impact of Model Pruning
Sara Hooker, Aaron Courville, Yann Dauphin, Andrea Frome
Learn more about PIEs at https://weightpruningdamage.github.io/
Sparsity of 90% means that by the end of training the model only has 10% of all weights remaining. A mask of 0 is applied to the removed weights.
0% sparsity: overparameterized dense model. 90% sparsity: model with 90% of weights removed.
Train populations of models with minimal differences in test-set accuracy to different end sparsities [0%, 30%, 50%, 70%, 90%, 95%, 99%].
Valuable experimental set-up - we can precisely vary the level of final sparsity.
Measure divergence in class level and exemplar classification performance.
Robustness to certain types of distribution shift.
Here, we ask:
1. How does model behavior diverge as we vary the level of compression?
2. Why is a narrow part of the data distribution far more sensitive to varying capacity?
Key results upfront: top-level metrics hide critical differences in generalization between compressed and non-compressed populations of models.
Varying sparsity disproportionately and systematically impacts a small subset of classes and exemplars.
Pruning Identified Exemplars (PIEs)
are images where predictive behavior diverges between a population of independently trained compressed and non-compressed models.
ImageNet test-set.
True label?
PIEs are also more challenging for algorithms to classify.
Attribute Proportion of CelebA Training Data vs. relative representation in PIE
PIEs over-index on the long-tail of underrepresented attributes.
0% sparsity: overparameterized dense model. 90% sparsity: model with 90% of weights removed.
Put differently, we are using the majority of our weights to encode a useful representation for a small fraction of our training distribution.
We lose the long-tail when we remove the majority of all training weights.
Low-frequency events - The majority of weights (90% of all weights) are used to memorize very rare examples in the dataset.
When we remove weights, models lose performance on rare examples.
Noisy Data Points
Misuse of parameters to represent these data points.
“Bad memorization”
Noisy PIEs: Incorrectly structured ImageNet data for single-image classification.
Noisy PIEs: Corrupted or incorrectly labeled data.
Noisy Data Points
Misuse of parameters to represent these data points.
“Bad memorization”
Atypical Data Points or Challenging Exemplars
Valuable use of parameters to represent these data points.
“Good memorization”
Atypical PIEs: Unusual vantage points of the class category.
Whether pruning aids or impedes performance depends upon how relevant learning rare artefacts is for the task.
Two common types of examples in the long-tail:
This is very related to whether memorization aids or hurts generalization.
Noisy Data Points
Misuse of parameters to represent these data points.
“Bad memorization”
Atypical Data Points or Challenging Exemplars
Valuable use of parameters to represent these data points.
“Good memorization”
In-distribution considerations:
This is very related to whether memorization aids or hurts generalization.
Test data is very different from my training distribution (typical of low data regimes)
Memorization of rare artefacts in the training data is unlikely to help with generalization to the test data.
“Bad memorization”
Test dataset is very similar to my training distribution (more common with VERY big data regimes)
Memorization of rare artefacts in the training data is likely to help with generalization to the test data.
“Good memorization”
Out-of-distribution considerations:
In-distribution considerations - amplification of error on rare/underrepresented attributes.
How does compression impact performance on the long-tail in-distribution?
Noisy Data Points
Misuse of parameters to represent these data points.
“Bad memorization”
Atypical Data Points or Challenging Exemplars
Valuable use of parameters to represent these data points.
“Good memorization”
In-distribution considerations:
Attribute Proportion of CelebA Training Data vs. relative representation in PIE
Compression amplifies algorithmic harm when the protected feature is in the long-tail of the distribution.
Measuring Impact of Compression on Algorithmic Bias
CelebA: Spurious correlation between gender, age and hair color {Blond, Non-Blond}.
‘Blond Male’ (0.85%) and ‘Blond Old’ (2.43%) make up a far smaller share of the training set.
We find sparsity disproportionately impacts underrepresented features.
Pruning amplifies algorithmic bias when the underrepresented feature is sensitive (age/gender)
Civil Comments: Task of detecting toxic comments. The target label toxic is only present for ~8% of the training set.
Sparsity sharply degrades the model's ability to detect toxic comments. The most impacted sub-groups are the least represented in the training set.
How does compression impact performance on OOD data points?
Test data is very different from my training distribution (typical of low data regimes)
Memorization of rare artefacts in the training data is unlikely to help with generalization to the test data.
“Bad memorization”
Test dataset is very similar to my training distribution (more common with VERY big data regimes)
Memorization of rare artefacts in the training data is likely to help with generalization to the test data.
“Good memorization”
Out-of-distribution considerations:
Limited Data Regime
Compute resource constraints
Low resource double-bind
The Low-Resource Double Bind: An Empirical Study of Pruning for Low-Resource Machine Translation
Key results upfront: In a low data regime, sparsity disproportionately impacts performance on the long-tail.
Prototypical test-set
Random test set
Surprisingly, we also find that in this setting, high levels of sparsity consistently improve generalization to out-of-distribution datasets.
Relates to a wider question - when do we want to curb or aid memorization of rare features?
JW300 is a very specialized religious corpus. Rare artefacts are even rarer in other settings we wish to generalize to.
What all these settings have in common is that memorization is currently very expensive.
The majority of weights (90% of all weights) are used to memorize very rare examples in the dataset.
This has far-reaching implications. Most natural image, NLP and audio datasets follow a Zipf distribution. If we want to model the world, we need to design and train models that can efficiently navigate low-frequency events.
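A quick synthetic illustration of what a Zipf-like long tail implies (assumed 1/rank frequencies, not a real dataset): most of the probability mass sits outside the handful of most frequent items.

```python
# Zipf-like frequencies: the k-th most frequent item occurs ~ 1/k as often.
import numpy as np

ranks = np.arange(1, 10_001)
freqs = 1.0 / ranks
freqs /= freqs.sum()
print(f"mass beyond the top 100 items: {freqs[100:].sum():.1%}")
```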
Other model design choices which can amplify or curb harm.
Privacy trade-off with fairness - differential privacy disproportionately impacts underrepresented attributes.
Stopping training early disproportionately impacts performance on less common and more challenging features.
Recent research suggests there are distinct stages to training.
Characterizing Structural Regularities of Labeled Data in Overparameterized Models, 2020 (link)
Critical Learning Periods in Deep Neural Networks, 2017 (link)
Actionable:
Tooling also can impact the generalization properties of your algorithm.
The non-determinism introduced by tooling disproportionately impacts underrepresented attributes.
Distribution of data points in the CelebA dataset
Randomness In Neural Network Training: Characterizing The Impact of Tooling [Donglin Zhuang, Xingyao Zhang, Shuaiwen Song, Sara Hooker]
Alan Blackwell said in 1997 that in computer science
“many sub-goals can be deferred to the degree that they become what is known amongst professional programmers as S.E.P - somebody else’s problem”
The belief that algorithmic bias is only a dataset problem invites diffusion of responsibility and misses important opportunities to curb harm.
Lord Kelvin reflected, “If you cannot measure it, you cannot improve it.”
Acknowledging that model design matters has the benefit of spurring more research focus on how it matters and will inevitably surface new insights into how we can design models to minimize harm.
The way forward
Deploying an algorithm involves many different steps.
Data Collection
Training using some objectives and metrics
User data filtered, ranked and aggregated
Users see an effect
user behavior informs further data collection
Data Labeling
Training using some objectives and metrics on an open source curated dataset
Abstract away data collection
Abstract away deployment
However, the machine learning research community has disproportionately published around one step.
Training using some objectives and metrics on an open source curated dataset
Abstract away data collection
Abstract away deployment
The surprisingly widely held belief that models are impartial displaces responsibility for bias to those responsible for the data pipeline.
Training using some objectives and metrics on an open source curated dataset
Abstract away data collection
Abstract away deployment
If bias is not fully addressed in the data pipeline, harm is a product of both data and design choices. Model design choices can and do amplify harm.
Training using some objectives and metrics on an open source curated dataset
Abstract away data collection
Abstract away deployment
Understanding the interactions between model and dataset can open up new mitigation strategies for designing models that are better specified.
Closing Thoughts (and Q&A)
Moving beyond “algorithmic bias is a data problem”, Sara Hooker [[link]]
Estimating Example Difficulty using Variance of Gradients, Chirag Agarwal*, Sara Hooker* [[link]]
The Low-Resource Double Bind: An Empirical Study of Pruning for Low-Resource Machine Translation, Orevaoghene Ahia, Julia Kreutzer, Sara Hooker [[link]]
What do compressed deep neural networks forget?, Sara Hooker, Aaron Courville, Gregory Clark, Yann Dauphin, Andrea Frome [[link]]
Characterizing Bias in Compressed Models, Sara Hooker*, Nyalleng Moorosi*, Gregory Clark, Samy Bengio, Emily Denton [[link]]
Final takeaways:
Beyond test-set accuracy - It is not always possible to measure the trade-offs between criteria using test-set accuracy alone.
The myth of the compact, private, interpretable, fair model - Desiderata are not independent of each other. Training beyond test set accuracy requires trade-offs in our model preferences.
Understanding the interactions between model and dataset can open up new mitigation strategies.
Email: shooker@google.com
Questions?