Understanding User Sensemaking in
Machine Learning Fairness Assessment Systems
Ziwei Gu*, Jing Nathan Yan*, Jeffrey M. Rzeszotarski
Cornell University
*equal contribution
Machine Learning: Everywhere
Machine Learning: Complex
Training data
Model construction
Model training & testing
Model deployment
Prediction
Result evaluation
Bias and Discrimination
AI-based facial recognition
Sentencing recommendations
ML Fairness Assessment: Metrics
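Metric | Focus |
Demographic Parity (DP) | Group fairness: same probability of a positive prediction across groups |
Conditional Statistical Parity (CSP) | Controlled group fairness: same probability after conditioning on legitimate factors |
Equalized Odds (EO) | Group fairness: same false positive rate (FPR) and false negative rate (FNR) |
Predictive Parity (PP) | Group fairness: same positive predictive value (PPV) |
Causal Fairness | Sensitive attributes are not causes of the prediction |
… | … |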
Examples of Statistic- and Causality-based metrics
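To make the statistical metrics in the table concrete, here is a minimal Python sketch that computes the per-group quantities each one compares. The function names (`group_rates`, `gap`) and the toy data are illustrative assumptions and are not taken from any of the toolkits discussed in this talk.

```python
from collections import defaultdict


def group_rates(y_true, y_pred, groups):
    """Per-group positive rate, FPR, FNR, and PPV for binary labels (0/1)."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for t, p, g in zip(y_true, y_pred, groups):
        # "t"/"f" = prediction correct or not, "p"/"n" = predicted positive or negative
        key = ("t" if t == p else "f") + ("p" if p == 1 else "n")
        counts[g][key] += 1

    def ratio(num, den):
        return num / den if den else float("nan")

    rates = {}
    for g, c in counts.items():
        n = sum(c.values())
        rates[g] = {
            "positive_rate": ratio(c["tp"] + c["fp"], n),  # compared by DP
            "fpr": ratio(c["fp"], c["fp"] + c["tn"]),      # compared by EO
            "fnr": ratio(c["fn"], c["fn"] + c["tp"]),      # compared by EO
            "ppv": ratio(c["tp"], c["tp"] + c["fp"]),      # compared by PP
        }
    return rates


def gap(rates, name):
    """Largest between-group difference for one rate (0 means parity)."""
    values = [r[name] for r in rates.values()]
    return max(values) - min(values)


# Toy data (made up for illustration): two groups, four people each.
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

rates = group_rates(y_true, y_pred, groups)
dp_gap = gap(rates, "positive_rate")
```

A gap of zero on `positive_rate` corresponds to demographic parity, zero gaps on both `fpr` and `fnr` to equalized odds, and a zero gap on `ppv` to predictive parity. The same model can satisfy one of these and violate another, which is part of why choosing a metric is hard.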
Mitigation Algorithms
ML Fairness Assessment: Integrated Toolkits and Systems
Ten state-of-the-art bias mitigation algorithms that can address bias throughout AI systems
One Simple Command!
Are metrics and toolkits sufficient?
Metrics |
Demographic Parity (DP) |
Conditional Statistical Parity (CSP) |
Equalized Odds (EO) |
Predictive Parity (PP) |
Causal Fairness |
… |
Users may be unable to select an appropriate metric or mitigation strategy…
…and they may be unable to effectively judge the results of a system
What about exploratory tools?
Interactive exploratory tools may help users learn how metrics work and test mitigation strategies…
…but they require more time, effort, and training to use.
What happened?
Expectation
Reality
How do practitioners pursue fairness assessment?
What does their sensemaking process look like?
Case study: Interactive de-biasing tool logs
How did participants make use of interactive de-biasing tools?
How do tools help participants reason about data and model fairness?
How does interactive de-biasing shape participants’ hypotheses and goals during sessions?
How might tools’ use have been shaped by specific affordances?
Case study: Interactive de-biasing tool logs
Answers Changed!
…however, this is inferred purely from logs
Remaining questions:
How do we answer these questions?
We employ a think-aloud methodology
Participants are prompted to voice their thoughts and reasoning as they use tools
Deeper insight into participant workflow and influence of tool design features
Think-aloud Study: ML Fairness Toolkits
AIF 360
Silva
Google What-If
Exploration
Recommendation
Exploration + Recommendation
Think-aloud Study: Setup
7
5
12 undergraduate students after pre-screening
- Each tool+dataset received even exposure
- Order was counter-balanced
Three benchmark datasets
Three candidate toolkits
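One way to read the counterbalancing described above is sketched below in Python. It assumes each participant completes two tasks, each with a different tool and a different dataset, and it rotates through all ordered pairs so that every tool and every dataset appears equally often in each position. The dataset names are hypothetical placeholders, and the assignment rule is an assumption for illustration, not the study's documented procedure.

```python
from collections import Counter
from itertools import permutations

TOOLS = ["AIF 360", "Silva", "Google What-If"]
DATASETS = ["dataset_A", "dataset_B", "dataset_C"]  # hypothetical placeholders


def counterbalanced_schedule(n_participants=12):
    """Return, per participant, a list of (tool, dataset) pairs for task 1 and task 2."""
    tool_orders = list(permutations(TOOLS, 2))     # 6 ordered tool pairs
    data_orders = list(permutations(DATASETS, 2))  # 6 ordered dataset pairs
    schedule = []
    for i in range(n_participants):
        tools = tool_orders[i % len(tool_orders)]
        # Shift the dataset order on the second pass through the tool orders
        # so that tool/dataset pairings vary between the two passes.
        data = data_orders[(i + i // len(tool_orders)) % len(data_orders)]
        schedule.append(list(zip(tools, data)))
    return schedule


schedule = counterbalanced_schedule()
tool_exposure = Counter(tool for tasks in schedule for tool, _ in tasks)
first_tool = Counter(tasks[0][0] for tasks in schedule)
```

With 12 participants this gives each tool and each dataset 8 sessions in total, 4 as the first task and 4 as the second.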
Think-aloud Study: Data Collection
Tutorial
Pre-survey
Task 1 w/ Tool 1
Task 2 w/ Tool 2
Post-survey
Interview (optional)
Think-aloud:
Participants vocalize thoughts and activities
Think-aloud Study: Data Encoding
First Pass: Coding videos with timestamped markers for user actions
Second Pass: Examining codes to identify patterns in participant sensemaking (connecting to the notional model from Pirolli et al.)
Third Pass: Sketching state models describing each participant session
Final Pass: Identifying higher-level themes and patterns across state models
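As a rough illustration of how timestamped codes can feed a state model, the Python sketch below counts transitions between consecutive coded actions in one session. The code labels and timestamps are invented for the example, and the function is an assumption about one possible bookkeeping step; the encoding in the study was a manual, qualitative process.

```python
from collections import Counter

# Hypothetical coded events for one session: (seconds from start, code label)
events = [
    (5, "inspect_data"),
    (42, "form_hypothesis"),
    (60, "adjust_model"),
    (95, "inspect_data"),
    (130, "form_hypothesis"),
]


def transition_counts(events):
    """Count how often each code is immediately followed by another code."""
    ordered = [code for _, code in sorted(events)]  # sort by timestamp
    return Counter(zip(ordered, ordered[1:]))


transitions = transition_counts(events)
```

Each key in `transitions` is an edge in a simple state diagram, and repeated edges (here, `inspect_data` followed by `form_hypothesis`) point to loops worth examining in later passes.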
Think-aloud Study: Data Encoding Sample
Raw logs
Researcher Encodings
Interface actions
Participants’ Hypotheses
Encoding Summary
Key Findings 1: Exploratory Tools Invite Iterations
Overlapping Hypotheses
Dependency and Iterations
Key Findings 2: Information Overload When Balancing Exploration and Recommendation
Key Findings 3: Recommendations Require Less Investment
Limitations
Conclusion