1 of 26

Understanding User Sensemaking in Machine Learning Fairness Assessment Systems

Ziwei Gu*, Jing Nathan Yan*, Jeffrey M. Rzeszotarski

Cornell University

*equal contribution

2 of 26

Machine Learning: Everywhere

3 of 26

Machine Learning: Complex

Training data

Model construction

Model training & testing

Model deployment

Prediction

Result evaluation

7 of 26

Bias and Discrimination

AI-based facial recognition

Sentencing recommendations

8 of 26

ML Fairness Assessment: Metrics

Metric: Focus

Demographic Parity (DP): Group Fairness – Same probability

Conditional Statistical Parity (CSP): Controlled Group Fairness – Same probability

Equalized Odds (EO): Group Fairness – Same FPR, FNR

Predictive Parity (PP): Group Fairness – Same PPV

Causal Fairness: Sensitive attributes are not causes

Examples of Statistic- and Causality-based metrics
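The statistical metrics listed above all compare simple per-group rates. The following is a minimal sketch in plain Python, assuming binary labels and predictions (0/1) and a single sensitive attribute. The names `group_rates` and `max_gap` and the toy data are illustrative assumptions; they are not the API of any toolkit discussed in this talk, and Conditional Statistical Parity and Causal Fairness are not covered.

```python
from collections import defaultdict

def group_rates(y_true, y_pred, groups):
    """Per-group selection rate, FPR, FNR and PPV for binary (0/1) labels."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for t, p, g in zip(y_true, y_pred, groups):
        # "t"/"f" = prediction correct or not, "p"/"n" = predicted class
        key = ("t" if t == p else "f") + ("p" if p == 1 else "n")
        counts[g][key] += 1

    def ratio(num, den):
        return num / den if den else float("nan")

    rates = {}
    for g, c in counts.items():
        n = sum(c.values())
        rates[g] = {
            "selection_rate": ratio(c["tp"] + c["fp"], n),  # compared by DP
            "fpr": ratio(c["fp"], c["fp"] + c["tn"]),       # compared by EO
            "fnr": ratio(c["fn"], c["fn"] + c["tp"]),       # compared by EO
            "ppv": ratio(c["tp"], c["tp"] + c["fp"]),       # compared by PP
        }
    return rates

def max_gap(rates, metric):
    """Largest between-group difference for one rate (0 means parity)."""
    values = [r[metric] for r in rates.values()]
    return max(values) - min(values)

# Toy data (hypothetical): two groups, four people each.
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

rates = group_rates(y_true, y_pred, groups)
print("DP gap:", max_gap(rates, "selection_rate"))
print("EO gaps:", max_gap(rates, "fpr"), max_gap(rates, "fnr"))
print("PP gap:", max_gap(rates, "ppv"))
```

A gap of zero on the relevant rate corresponds to the metric being satisfied exactly. On this toy data the gaps are all nonzero, and the groups differ on different rates, which hints at why picking a single metric can be hard for users.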

Mitigation Algorithms

9 of 26

ML Fairness Assessment: Integrated Toolkits and Systems

Ten state-of-the-art bias mitigation algorithms that can address bias throughout AI systems

One Simple Command!

10 of 26

Are metrics and toolkits sufficient?

Metrics

Demographic Parity (DP)

Conditional Statistical Parity (CSP)

Equalized Odds (EO)

Predictive Parity (PP)

Causal Fairness

Users may be unable to select an appropriate metric or mitigation strategy…

…and they may be unable to effectively judge the results of a system

11 of 26

What about exploratory tools?

Interactive exploratory tools may help users learn how metrics work and test mitigation strategies…

…but they require more time, effort, and training to use.

12 of 26

What happened?

Expectation

Reality

How do practitioners pursue fairness assessment?

What does their sensemaking process look like?

13 of 26

Case study: Interactive de-biasing tool logs

How did participants make use of interactive de-biasing tools?

How do tools help participants reason about data and model fairness?

How does interactive de-biasing shape participants’ hypotheses and goals during sessions?

How might tools’ use have been shaped by specific affordances?

14 of 26

Case study: Interactive de-biasing tool logs

Answers Changed!

…however, this is purely inferential from logs

Remaining questions:

  • In what specific ways do automated and exploratory de-biasing tools shape participants’ analyses?
  • Which interface affordances have the most or least impact on a de-biasing session?
  • Does expertise play a role in the efficacy of automated and exploratory tools?
  • How do we go about designing better tools for de-biasing datasets and models?

15 of 26

How do we answer these questions?

We employ a think-aloud methodology

Participants are prompted to voice their thoughts and reasoning as they use tools

Deeper insight into participant workflow and influence of tool design features

16 of 26

Think-aloud Study: ML Fairness Toolkits

AIF 360

Silva

Google What-If

Exploration

Recommendation

Exploration + Recommendation 

17 of 26

Think-aloud Study: Setup

7

5

12 undergraduate students after pre-screening

- Each tool+dataset received even exposure

- Order was counter-balanced

Three benchmark datasets

Three candidate toolkits
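The slides do not describe how tools and datasets were actually assigned. As one possible illustration of counterbalancing, the sketch below assumes 12 participants who each complete two tasks, each with a different tool and a different dataset. The dataset names are placeholders, and `make_plan` and its shift rule are hypothetical, not the study's procedure. With 24 sessions spread over 9 tool+dataset combinations, exactly equal exposure is not possible, so the sketch aims for counts of 2 or 3 per combination.

```python
from collections import Counter
from itertools import permutations

TOOLS = ["AIF 360", "Silva", "Google What-If"]
DATASETS = ["dataset_0", "dataset_1", "dataset_2"]  # placeholders, not the study's datasets

def make_plan(n_participants=12):
    """Return one [(tool, dataset), (tool, dataset)] session list per participant."""
    tool_orders = list(permutations(range(len(TOOLS)), 2))  # 6 ordered tool pairs
    plan = []
    for i in range(n_participants):
        k, cycle = i % len(tool_orders), i // len(tool_orders)
        # Rotate which dataset goes with which tool; the offset depends on the
        # tool pair and on how many times that pair has already been used.
        shift = (k // 2 + 1 + cycle) % len(DATASETS)
        plan.append([(TOOLS[t], DATASETS[(t + shift) % len(DATASETS)])
                     for t in tool_orders[k]])
    return plan

plan = make_plan()

# How often each tool appears as the first or second task.
position_counts = Counter((pos, tool) for tasks in plan
                          for pos, (tool, _) in enumerate(tasks))
# How often each tool+dataset combination and each dataset is used.
combo_counts = Counter(pair for tasks in plan for pair in tasks)
dataset_counts = Counter(ds for tasks in plan for _, ds in tasks)

print(position_counts)
print(combo_counts)
print(dataset_counts)
```

Under these assumptions every tool appears four times in each task position, every dataset is used eight times, and no participant sees the same tool or dataset twice.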

18 of 26

Think-aloud Study: Data Collection

Tutorial 

Pre-survey

Task 1 w/ Tool 1

Task 2 w/ Tool 2

Post-survey

Interview (optional)

Think-aloud:

Participants vocalize thoughts and activities

19 of 26

Think-aloud Study: Data Encoding

First Pass: Coding videos with timestamped markers for user actions

Second Pass: Examining codes to identify patterns in participant sensemaking (connecting to the notional model from Pirolli et al.)

Third Pass: Sketching state models describing each participant session

Final Pass: Identifying higher-level themes and patterns across state models
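To make the step from timestamped codes to a per-session state model concrete, here is a minimal sketch. The code labels, timestamps, and the functions `transition_counts` and `revisit_count` are hypothetical illustrations; the slides do not list the actual codebook or how the state models were drawn.

```python
from collections import Counter

# Hypothetical coded events for one session: (seconds from start, code).
events = [
    (12, "explore_data"),
    (47, "form_hypothesis"),
    (90, "test_hypothesis"),
    (130, "form_hypothesis"),
    (171, "test_hypothesis"),
    (205, "conclude"),
]

def transition_counts(events):
    """Count code-to-code transitions in time order (edges of a state model)."""
    codes = [code for _, code in sorted(events)]
    return Counter(zip(codes, codes[1:]))

def revisit_count(events):
    """Count how often the session returns to a code it has already visited."""
    seen, revisits = set(), 0
    for _, code in sorted(events):
        if code in seen:
            revisits += 1
        seen.add(code)
    return revisits

edges = transition_counts(events)
for (src, dst), n in edges.items():
    print(f"{src} -> {dst}: {n}")
print("revisits:", revisit_count(events))
```

Edges such as `test_hypothesis -> form_hypothesis` are the kind of loop that a state model makes visible, and counts like these could be compared across sessions when looking for higher-level patterns.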

20 of 26

Think-aloud Study: Data Encoding Sample

Raw logs

Researcher Encodings

Interface actions

Participants’ Hypothesis

21 of 26

Encoding Summary

22 of 26

Key Findings 1: Exploratory Tools Invite Iterations

  • More hypotheses!
  • More loops!
  • More dependencies between hypotheses!

Overlapping Hypotheses

Dependency and Iterations

23 of 26

Key Findings 2: Information Overload When Balancing Exploration and Recommendation

  • More time on each hypothesis
  • Limited dependencies between hypotheses
  • More time memorizing!

24 of 26

Key Findings 3: Recommendations Require Less Investment

  • Higher efficiency compared with the other two tools
  • Less accountability and transparency
  • More complaints from skilled participants

25 of 26

Limitations

  • Study methodology & think-aloud feedback
    • Higher cognitive load
    • Relatively small N
    • Skill effects

  • Data analysis
    • Coding & diagramming methodology

  • Opportunities for deeper study:
    • More use cases, skill levels
    • Integrating new hybrid tools to flesh out continuum

26 of 26

Conclusion

  • Think-aloud as a method for evaluating de-biasing systems

  • Account for expertise in exploration and recommendation

  • Balancing tuning and broader contexts in tool design

  • Motivating efficient exploration through hybridization