1 of 17

AuditBot:

Status pages for AI APIs

@neal_lathia

Stanford AI Audit Challenge, Award for Greatest Potential

2 of 17

Background

  • Previously a Staff Machine Learning Engineer @ Monzo Bank
  • Monzo is the fastest-growing app-only bank in the UK (7M+ customers)
  • As with all banks, audits are regularly conducted on all kinds of decisioning systems (not just AI)
  • Audits are often long, labour-intensive, and do not fit well with the growing range of AI-as-a-service products

3 of 17

How can we audit the performance of artificial intelligence models that are behind proprietary APIs, on an ongoing basis?

4 of 17

AuditBot merges two key ideas into one:

model cards and system status pages

It is a system that automatically audits proprietary AI APIs every week

5 of 17

(1) Model Cards

M. Mitchell et al. Model Cards for Model Reporting. Proceedings of the Conference on Fairness, Accountability, and Transparency, 2019

  • Are increasingly popular in the machine learning space
  • Describe the model, including how it should be used and known risks & limitations
  • Are often accompanied by performance metrics comparing the model to other baselines

6 of 17

(1) Model Cards

  • Often give a single point-in-time snapshot of a model’s performance
  • Require people to manually update them if the model behind an API is updated
  • Are often authored by the model’s creators rather than by an independent party

7 of 17

(2) Status Pages

  • Are widely used across the software engineering industry
  • Show the uptime status of an API and other system metrics like API request latency
  • Used to alert people when core services are unhealthy or down and track incidents

8 of 17

(2) Status Pages

  • Are not used to assess the quality of AI systems, just their availability
  • Cannot currently be used to detect or inspect changes to proprietary AI models

9 of 17

AuditBot uses open data to automatically evaluate AI APIs on an ongoing basis

The current system is live here:

https://sentiment-ai-api-audit.herokuapp.com/

10 of 17

High-level structure

The system runs weekly and focuses on proprietary sentiment detection APIs.

  1. An open dataset is sourced from the Hugging Face dataset hub
  2. Each entry in the dataset is used to query the API for a prediction
  3. Performance metrics are calculated and a sample of errors are tracked
  4. The results are automatically published online as an audit trail
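
A minimal Python sketch of these four steps, assuming the Rotten Tomatoes dataset and Google’s Natural Language sentiment endpoint; the names, thresholds, and error-sample size below are illustrative, not the actual ai-auditor-cron code:

```python
# Illustrative sketch of the weekly audit loop (not the exact ai-auditor-cron code).
from datasets import load_dataset
from google.cloud import language_v1
from sklearn.metrics import f1_score

client = language_v1.LanguageServiceClient()

def predict_sentiment(text: str) -> int:
    """Query the proprietary API and map its score to a binary label."""
    document = language_v1.Document(
        content=text, type_=language_v1.Document.Type.PLAIN_TEXT
    )
    score = client.analyze_sentiment(
        request={"document": document}
    ).document_sentiment.score
    return 1 if score >= 0 else 0  # thresholding at 0 is an assumption

# 1. Source an open dataset from the Hugging Face dataset hub
dataset = load_dataset("rotten_tomatoes", split="test")

# 2. Use each entry in the dataset to query the API for a prediction
predictions = [predict_sentiment(row["text"]) for row in dataset]

# 3. Calculate performance metrics and keep a sample of errors
f1 = f1_score(dataset["label"], predictions)
errors = [
    row["text"]
    for row, pred in zip(dataset, predictions)
    if pred != row["label"]
][:20]

# 4. Publish the results online as part of the audit trail (e.g. via the web app)
print({"f1": round(f1, 2), "num_errors_sampled": len(errors)})
```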

11 of 17

Low-level architecture

The system was built using Google Cloud Run, Cloud Scheduler, and Heroku.

Scalability: this system can be extended to any use case that has a reference dataset for the type of problem that a proprietary API solves.

Replicability: the system is open source on GitHub (ai-auditor-cron and ai-auditor-web), where it can be amended or independently deployed.
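
As a sketch of that extension point, a new audit target could be declared as a small configuration object; this layout is hypothetical and not necessarily how the repositories are structured:

```python
# Hypothetical configuration sketch for registering audit targets.
from dataclasses import dataclass
from typing import Callable

@dataclass
class AuditTarget:
    name: str                      # human-readable API name
    dataset_id: str                # Hugging Face reference dataset
    split: str                     # evaluation split
    predict: Callable[[str], int]  # thin client around one proprietary API call

def google_sentiment(text: str) -> int:
    """Placeholder for the Google Sentiment API client shown earlier."""
    raise NotImplementedError

TARGETS = [
    AuditTarget(
        name="Google Sentiment Prediction API",
        dataset_id="rotten_tomatoes",
        split="test",
        predict=google_sentiment,
    ),
    # Extending the system to a new use case means adding another entry with
    # its own reference dataset and a thin client for that API.
]
```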

12 of 17

Early Insights from the data

Consistency. The Google Sentiment Prediction API produced consistent results over the evaluation period (October–December 2022); no improvements or regressions were detected.

Performance. The Google API performed better¹ on the Rotten Tomatoes dataset (F1 = 0.77) than two publicly available baselines evaluated in this notebook:

Model                                               F1 Score
distilbert-base-uncased-finetuned-sst-2-english     0.66
bertweet-base-sentiment-analysis                    0.66
Google Sentiment Prediction API                     0.77

¹ This may be an unfair comparison, as we don’t know what dataset Google’s model was trained on.

Proposed next step: open source the data collected by this system for academic research.
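
For reference, one of the open baselines in the table could be scored with the same F1 metric on the same split roughly as follows; the exact model ID and label mapping are assumptions about the linked notebook, not a copy of it:

```python
# Sketch of scoring an open baseline on the same Rotten Tomatoes test split.
from datasets import load_dataset
from transformers import pipeline
from sklearn.metrics import f1_score

dataset = load_dataset("rotten_tomatoes", split="test")

classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)
outputs = classifier(dataset["text"])

# This checkpoint emits "POSITIVE"/"NEGATIVE"; map onto the dataset's 1/0 labels.
preds = [1 if o["label"] == "POSITIVE" else 0 for o in outputs]
print("F1:", round(f1_score(dataset["label"], preds), 2))

# The bertweet baseline follows the same pattern, with its POS/NEU/NEG labels
# mapped down to the binary scheme.
```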

13 of 17

Who would benefit from this tool?

Software engineers who build systems that rely on models behind proprietary APIs can be alerted when performance changes.

Academics can source reference performance metrics about proprietary models and study how they change over time.

Policy makers can use this type of tool to gain insight into how proprietary models perform on real data (e.g. the upcoming EU AI Act requires audits).

The tool is open source and could be re-deployed by software companies.

The system is building up a unique dataset of how AI APIs perform over time.

Public, ongoing audit data can be used to create transparency over how models are being changed.

14 of 17

How can we audit the performance of artificial intelligence models that are behind proprietary APIs, on an ongoing basis?

AuditBot

  • Assesses the quality of AI systems, not (just) their availability
  • Can be used to detect and inspect changes to proprietary AI models over time

15 of 17

AuditBot:

Status pages for AI APIs

@neal_lathia, neal.lathia@gmail.com

Stanford AI Audit Challenge, Award for Greatest Potential

16 of 17

Proposal for next steps

Depth of sentiment API coverage

  • Add Amazon’s Sentiment API to demonstrate how this system can be used to audit several different proprietary APIs
  • Add endpoints so that audit results can be retrieved from the system (a hypothetical sketch follows this list)
  • Add API endpoints so that specific text examples can be added to the audit
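
One hypothetical shape for the proposed results endpoint (routes, names, and the in-memory store are illustrative only; nothing here exists in the repo yet):

```python
# Hypothetical Flask endpoint for retrieving weekly audit results.
from flask import Flask, jsonify

app = Flask(__name__)

# Stand-in for wherever the weekly cron job persists its results.
AUDIT_RESULTS = [
    {"week": "2022-12-05", "api": "google-sentiment", "f1": 0.77},
]

@app.route("/audits/<api_name>")
def get_audits(api_name: str):
    """Return the weekly audit history for one proprietary API."""
    history = [r for r in AUDIT_RESULTS if r["api"] == api_name]
    return jsonify(history)
```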

Breadth of AI APIs

  • Explore other use cases, e.g. speech-to-text (using the minds14 dataset)

Open source the dataset

  • Enable academic research over proprietary APIs’ temporal performance

17 of 17

Threat model

Similar to the Volkswagen emissions scandal, companies could circumvent this system by detecting when they are being queried with entries from open datasets and returning ‘fake’ responses.