1 of 17

AuditBot:

Status pages for AI APIs

@neal_lathia

Stanford AI Audit Challenge, Award for Greatest Potential

2 of 17

Background

  • Previously a Staff Machine Learning Engineer @ Monzo Bank
  • Monzo is the fastest-growing app-only bank in the UK (7M+ customers)
  • As with all banks, audits are regularly conducted on all kinds of decisioning systems (not just AI)
  • Audits are often long, labour-intensive, and do not fit well with the growing range of AI-as-a-service products

3 of 17

How can we audit the performance of artificial intelligence models that are behind proprietary APIs, on an ongoing basis?

4 of 17

AuditBot merges two key ideas into one:

model cards and system status pages

It is a system that automatically audits proprietary AI APIs every week

5 of 17

(1) Model Cards

M. Mitchell et al. Model Cards for Model Reporting. Proceedings of the Conference on Fairness, Accountability, and Transparency, 2019

  • Are increasingly popular in the machine learning space
  • Describe the model, including how it should be used and known risks & limitations
  • Are often accompanied by performance metrics comparing the model to other baselines

6 of 17

(1) Model Cards

  • Often give a single point-in-time snapshot of a model’s performance
  • Require people to manually update them if the model behind an API is updated
  • Are often authored by the model’s creators rather than by an independent party

7 of 17

(2) Status Pages

  • Are widely used across the software engineering industry
  • Show the uptime status of an API and other system metrics like API request latency
  • Used to alert people when core services are unhealthy or down and track incidents

8 of 17

(2) Status Pages

  • Are not used to assess the quality of AI systems, just their availability
  • Cannot currently be used to detect or inspect changes to proprietary AI models

9 of 17

AuditBot uses open data to automatically evaluate AI APIs on an ongoing basis

The current system is live here:

https://sentiment-ai-api-audit.herokuapp.com/

10 of 17

High-level structure

The system runs weekly and focuses on proprietary sentiment detection APIs.

  1. An open dataset is sourced from the Hugging Face dataset hub
  2. Each entry in the dataset is used to query the API for a prediction
  3. Performance metrics are calculated and a sample of errors are tracked
  4. The results are automatically published online as an audit trail
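
A minimal Python sketch of these four steps, assuming the Rotten Tomatoes dataset and Google’s Natural Language sentiment endpoint; the names, thresholds, and error-sample size below are illustrative, not the actual ai-auditor-cron code:

```python
# Illustrative sketch of the weekly audit loop (not the exact ai-auditor-cron code).
from datasets import load_dataset
from google.cloud import language_v1
from sklearn.metrics import f1_score

client = language_v1.LanguageServiceClient()

def predict_sentiment(text: str) -> int:
    """Query the proprietary API and map its score to a binary label."""
    document = language_v1.Document(
        content=text, type_=language_v1.Document.Type.PLAIN_TEXT
    )
    score = client.analyze_sentiment(
        request={"document": document}
    ).document_sentiment.score
    return 1 if score >= 0 else 0  # thresholding at 0 is an assumption

# 1. Source an open dataset from the Hugging Face dataset hub
dataset = load_dataset("rotten_tomatoes", split="test")

# 2. Use each entry in the dataset to query the API for a prediction
predictions = [predict_sentiment(row["text"]) for row in dataset]

# 3. Calculate performance metrics and keep a sample of errors
f1 = f1_score(dataset["label"], predictions)
errors = [
    row["text"]
    for row, pred in zip(dataset, predictions)
    if pred != row["label"]
][:20]

# 4. Publish the results online as part of the audit trail (e.g. via the web app)
print({"f1": round(f1, 2), "num_errors_sampled": len(errors)})
```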

11 of 17

Low-level architecture

The system was built using Google Cloud Run, Cloud Scheduler, and Heroku.

Scalability: this system can be extended to any use case that has a reference dataset for the type of problem that a proprietary API solves.

Replicability: the system is open source on GitHub (ai-auditor-cron and ai-auditor-web), where it can be amended or independently deployed.
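
As a sketch of that extension point, a new audit target could be declared as a small configuration object; this layout is hypothetical and not necessarily how the repositories are structured:

```python
# Hypothetical configuration sketch for registering audit targets.
from dataclasses import dataclass
from typing import Callable

@dataclass
class AuditTarget:
    name: str                      # human-readable API name
    dataset_id: str                # Hugging Face reference dataset
    split: str                     # evaluation split
    predict: Callable[[str], int]  # thin client around one proprietary API call

def google_sentiment(text: str) -> int:
    """Placeholder for the Google Sentiment API client shown earlier."""
    raise NotImplementedError

TARGETS = [
    AuditTarget(
        name="Google Sentiment Prediction API",
        dataset_id="rotten_tomatoes",
        split="test",
        predict=google_sentiment,
    ),
    # Extending the system to a new use case means adding another entry with
    # its own reference dataset and a thin client for that API.
]
```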

12 of 17

Early Insights from the data

Consistency. The Google Sentiment Prediction API produced consistent results over the evaluation period (October–December 2022); no improvements or regressions were detected.

Performance. The Google API performed better¹ on the Rotten Tomatoes dataset (F1 = 0.77) than two publicly available baselines evaluated in this notebook:

Model                                               F1 Score
distilbert-base-uncased-finetuned-sst-2-english     0.66
bertweet-base-sentiment-analysis                    0.66
Google Sentiment Prediction API                     0.77

¹ This may be an unfair comparison, as we don’t know what dataset Google’s model was trained on.

Proposed next step: open source the data collected by this system for academic research.
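
For reference, one of the open baselines in the table could be scored with the same F1 metric on the same split roughly as follows; the exact model ID and label mapping are assumptions about the linked notebook, not a copy of it:

```python
# Sketch of scoring an open baseline on the same Rotten Tomatoes test split.
from datasets import load_dataset
from transformers import pipeline
from sklearn.metrics import f1_score

dataset = load_dataset("rotten_tomatoes", split="test")

classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)
outputs = classifier(dataset["text"])

# This checkpoint emits "POSITIVE"/"NEGATIVE"; map onto the dataset's 1/0 labels.
preds = [1 if o["label"] == "POSITIVE" else 0 for o in outputs]
print("F1:", round(f1_score(dataset["label"], preds), 2))

# The bertweet baseline follows the same pattern, with its POS/NEU/NEG labels
# mapped down to the binary scheme.
```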

13 of 17

Who would benefit from this tool?

Software engineers who build systems that rely on models behind proprietary APIs can be alerted when performance changes.

Academics can source reference performance metrics about proprietary models and study how they change over time.

Policy makers can use this type of tool to gain insight into how proprietary models perform on real data (e.g. the upcoming EU AI Act requires audits).

The tool is open source and could be re-deployed by software companies.

The system is building up a unique dataset of how AI APIs perform over time.

Public, ongoing audit data can be used to create transparency over how models are being changed.

14 of 17

How can we audit the performance of artificial intelligence models that are behind proprietary APIs, on an ongoing basis?

AuditBot

  • Assesses the quality of AI systems, not (just) their availability
  • Can be used to detect and inspect changes to proprietary AI models over time

15 of 17

AuditBot:

Status pages for AI APIs

@neal_lathia, neal.lathia@gmail.com

Stanford AI Audit Challenge, Award for Greatest Potential

16 of 17

Proposal for next steps

Depth of sentiment API coverage

  • Add Amazon’s Sentiment API to demonstrate how this system can be used to audit several different proprietary APIs
  • Add endpoints so that audit results can be retrieved from the system (a hypothetical sketch follows this list)
  • Add API endpoints so that specific text examples can be added to the audit
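
One hypothetical shape for the proposed results endpoint (routes, names, and the in-memory store are illustrative only; nothing here exists in the repo yet):

```python
# Hypothetical Flask endpoint for retrieving weekly audit results.
from flask import Flask, jsonify

app = Flask(__name__)

# Stand-in for wherever the weekly cron job persists its results.
AUDIT_RESULTS = [
    {"week": "2022-12-05", "api": "google-sentiment", "f1": 0.77},
]

@app.route("/audits/<api_name>")
def get_audits(api_name: str):
    """Return the weekly audit history for one proprietary API."""
    history = [r for r in AUDIT_RESULTS if r["api"] == api_name]
    return jsonify(history)
```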

Breadth of AI APIs

  • Explore other use cases, e.g. speech-to-text (using the minds14 dataset)

Open source the dataset

  • Enable academic research over proprietary APIs’ temporal performance

17 of 17

Threat model

Similar to the Volkswagen emissions scandal, companies could circumvent this system by detecting when they are being queried with entries from open datasets and returning ‘fake’ responses.