1 of 26

Understanding User Sensemaking in

Machine Learning Fairness Assessment Systems

Ziwei Gu*, Jing Nathan Yan*, Jeffrey M. Rzeszotarski

Cornell University

*equal contribution

2 of 26

Machine Learning: Everywhere

3 of 26

Machine Learning: Complex

Training data

Model construction

Model training & testing

Model deployment

Prediction

Result evaluation

7 of 26

Bias and Discrimination

AI-based facial recognition

Sentencing recommendations

8 of 26

ML Fairness Assessment: Metrics

Metrics	Focus
Demographic Parity (DP)	Group Fairness – Same probability
Conditional Statistical Parity (CSP)	Controlled Group Fairness – Same probability
Equalized Odds (EO)	Group Fairness – Same FPR, FNR
Predictive Parity (PP)	Group Fairness – Same PPV
Causal Fairness	Sensitive Attributes are not causes
…	…

Examples of Statistic- and Causality-based metrics
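As a rough illustration of the statistical metrics in the table, the sketch below computes the per-group quantities that each one compares: positive prediction rate for DP, FPR and FNR for EO, and PPV for PP. The function names (`group_rates`, `gap`) and the toy data are assumptions made for this example and do not come from any of the toolkits discussed in the talk.

```python
from collections import defaultdict

def group_rates(y_true, y_pred, groups):
    """Per-group positive rate, FPR, FNR, and PPV for binary labels/predictions."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for t, p, g in zip(y_true, y_pred, groups):
        # "t"/"f": prediction correct or not; "p"/"n": predicted positive or negative
        key = ("t" if t == p else "f") + ("p" if p == 1 else "n")
        counts[g][key] += 1

    def ratio(num, den):
        return num / den if den else float("nan")

    rates = {}
    for g, c in counts.items():
        n = c["tp"] + c["fp"] + c["tn"] + c["fn"]
        rates[g] = {
            "positive_rate": ratio(c["tp"] + c["fp"], n),  # compared under DP
            "fpr": ratio(c["fp"], c["fp"] + c["tn"]),       # compared under EO
            "fnr": ratio(c["fn"], c["fn"] + c["tp"]),       # compared under EO
            "ppv": ratio(c["tp"], c["tp"] + c["fp"]),       # compared under PP
        }
    return rates

def gap(rates, metric):
    """Largest between-group difference for one metric (0 means parity)."""
    values = [r[metric] for r in rates.values()]
    return max(values) - min(values)

# Toy data (invented for illustration): two groups, four people each.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 1, 0, 1, 1, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

rates = group_rates(y_true, y_pred, groups)
for metric in ("positive_rate", "fpr", "fnr", "ppv"):
    print(metric, gap(rates, metric))
```

In this toy case both groups receive positive predictions at the same rate, so the DP gap is zero, while the FPR, FNR, and PPV gaps are each 0.5. The same model can therefore look fair under one metric and unfair under another.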

Mitigation Algorithms

9 of 26

ML Fairness Assessment: Integrated Toolkits and Systems

Ten state-of-the-art bias mitigation algorithms that can address bias throughout AI systems

One Simple Command!

10 of 26

Are metrics and toolkits sufficient?

Metrics
Demographic Parity (DP)
Conditional Statistical Parity (CSP)
Equalized Odds (EO)
Predictive Parity (PP)
Causal Fairness
…

Users may be unable to select an appropriate metric or mitigation strategy…

…and they may be unable to effectively judge the results of a system

11 of 26

What about exploratory tools?

Interactive exploratory tools may help users learn how metrics work and test mitigation strategies…

…but they require more time, effort, and training to use.

12 of 26

What happened?

Expectation

Reality

How do practitioners pursue fairness assessment?

What does their sensemaking process look like?

13 of 26

Case study: Interactive de-biasing tool logs

How did participants make use of interactive de-biasing tools?

How do tools help participants reason about data and model fairness?

How does interactive de-biasing shape participants’ hypotheses and goals during sessions?

How might tools’ use have been shaped by specific affordances?

14 of 26

Case study: Interactive de-biasing tool logs

Answers Changed!

…however, this is purely inferential from logs

Remaining questions:

In what specific ways do automated and exploratory de-biasing tools shape participants’ analyses?
Which interface affordances have the most or least impact on a de-biasing session?
Does expertise play a role in the efficacy of automated and exploratory tools?
How do we go about designing better tools for de-biasing datasets and models?

15 of 26

How do we answer these questions?

We employ a think-aloud methodology

Participants are prompted to voice their thoughts and reasoning as they use tools

Deeper insight into participant workflow and influence of tool design features

16 of 26

Think-aloud Study: ML Fairness Toolkits

AIF360

Silva

Google What-If

Exploration

Recommendation

Exploration + Recommendation

17 of 26

Think-aloud Study: Setup

12 undergraduate students after pre-screening

- Each tool+dataset received even exposure

- Order was counter-balanced

Three benchmark datasets

Three candidate toolkits
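One way such a counterbalanced assignment could be generated is sketched below. It assumes each participant completes two tasks, each with a different tool and a different dataset; the dataset names are placeholders, and the rotation scheme is an assumption for illustration, not the authors' actual assignment procedure. The sketch balances tools and datasets separately (overall and by task position); it does not guarantee that every tool/dataset pairing occurs equally often.

```python
from itertools import permutations

TOOLS = ["AIF360", "Silva", "What-If"]
DATASETS = ["dataset_A", "dataset_B", "dataset_C"]  # placeholder names

def build_schedule(n_participants=12):
    """Return, per participant, a list of two (tool, dataset) sessions."""
    tool_orders = list(permutations(TOOLS, 2))     # 6 ordered tool pairs
    data_orders = list(permutations(DATASETS, 2))  # 6 ordered dataset pairs
    schedule = []
    for i in range(n_participants):
        tools = tool_orders[i % len(tool_orders)]
        # Shift the dataset order by one on each full cycle of tool orders,
        # so a repeated tool order is paired with a different dataset order.
        shift = i // len(tool_orders)
        data = data_orders[(i + shift) % len(data_orders)]
        schedule.append(list(zip(tools, data)))
    return schedule

for pid, sessions in enumerate(build_schedule(), start=1):
    print(f"P{pid:02d}", sessions)
```

With 12 participants, each ordered pair of tools is used twice, so every tool appears four times as the first task and four times as the second.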

18 of 26

Think-aloud Study: Data Collection

Tutorial

Pre-survey

Task 1 w/ Tool 1

Task 2 w/ Tool 2

Post-survey

Interview (optional)

Think-aloud:

Participants vocalize thoughts and activities

19 of 26

Think-aloud Study: Data Encoding

First Pass
Coding videos with timestamped markers for user actions

Second Pass
Examining codes to identify patterns in participant sensemaking (connecting to the notional model from Pirolli et al.)

Third Pass
Sketching state models describing each participant session

Final Pass
Identifying higher-level themes and patterns across state models
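To make the step from timestamped codes to a per-session state model concrete, here is a minimal sketch. The code labels and timestamps are invented for illustration and are not the study's actual coding scheme; the sketch simply counts transitions between consecutive codes and sums the time spent in each one.

```python
from collections import Counter

# Hypothetical coded events: (seconds into the session, code label).
events = [
    (12, "inspect_data"),
    (47, "form_hypothesis"),
    (95, "test_hypothesis"),
    (140, "form_hypothesis"),
    (180, "test_hypothesis"),
    (230, "conclude"),
]

def state_model(events):
    """Count code-to-code transitions and total dwell time per code."""
    ordered = sorted(events)  # sort by timestamp
    transitions = Counter()
    dwell = Counter()
    for (t0, s0), (t1, s1) in zip(ordered, ordered[1:]):
        transitions[(s0, s1)] += 1
        dwell[s0] += t1 - t0  # time until the next coded event
    return transitions, dwell

transitions, dwell = state_model(events)
for (src, dst), n in transitions.items():
    print(f"{src} -> {dst}: {n}")
print(dict(dwell))
```

A repeated edge such as `form_hypothesis -> test_hypothesis` together with a return edge `test_hypothesis -> form_hypothesis` is the kind of loop that would show up as iteration when comparing sessions.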

20 of 26

Think-aloud Study: Data Encoding Sample

Raw logs

Researcher Encodings

Interface actions

Participants’ hypotheses

21 of 26

Encoding Summary

22 of 26

Key Findings 1: Exploratory Tools Invite Iterations

More hypotheses!
More loops!
More dependencies between hypotheses!

Overlapping Hypotheses

Dependency and Iterations

23 of 26

Key Findings 2: Information Overload When Balancing Exploration and Recommendation

More time on each hypothesis
Limited dependencies between hypotheses
More time memorizing!

24 of 26

Key Findings 3: Recommendations Require Less Investment

Higher efficiency compared with the other two tools
Less accountability and transparency
More complaints from skilled participants

25 of 26

Limitations

Study methodology & think-aloud feedback

Higher cognitive load
Relatively small N
Skill effects

Data analysis

Coding & diagramming methodology

Opportunities for deeper study:

More use cases, skill levels
Integrating new hybrid tools to flesh out continuum

26 of 26

Conclusion

Think-aloud as a method for evaluating de-biasing systems

Account for expertise in exploration and recommendation

Balancing tuning and broader contexts in tool design

Motivating efficient exploration through hybridization