AuditBot:
Status pages for AI APIs
@neal_lathia
Stanford AI Audit Challenge, Award for Greatest Potential
Background
M. Mitchell et al. Model Cards for Model Reporting. Proceedings of the Conference on Fairness, Accountability, and Transparency, 2019
How can we audit the performance of artificial intelligence models that are behind proprietary APIs, on an ongoing basis?
AuditBot merges two key ideas into one:
model cards and system status pages
It is a system that automatically audits proprietary AI APIs every week.
(2) Status Pages
AuditBot uses open data to automatically evaluate AI APIs on an ongoing basis
The current system is live here:
High-level structure
The system runs weekly and focuses on proprietary sentiment detection APIs.
Low-level architecture
The system was built using Google Cloud Run, Cloud Scheduler, and Heroku.
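As an illustration of the weekly trigger, a Cloud Scheduler job can invoke the Cloud Run audit service on a cron schedule. The job name, URI, and schedule below are hypothetical, not the deployed configuration:

```shell
# Create a weekly Cloud Scheduler job that POSTs to the Cloud Run audit service.
# All values here are illustrative placeholders.
gcloud scheduler jobs create http auditbot-weekly \
  --schedule="0 6 * * MON" \
  --uri="https://ai-auditor-cron-example.a.run.app/run" \
  --http-method=POST
```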
Scalability: the system can be extended to any use case with a reference dataset for the problem that a proprietary API solves.
Replicability: the system is open source on GitHub (ai-auditor-cron and ai-auditor-web), where it can be amended or independently deployed.
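The audit loop described above can be sketched as follows. The API client and dataset here are stand-ins (the real system calls proprietary sentiment APIs such as Google's from the ai-auditor-cron job); the F1 computation is the standard definition:

```python
# Minimal sketch of one audit run: score a (stand-in) sentiment API against
# a labelled reference dataset and report F1.

def predict_sentiment(text: str) -> int:
    """Stand-in for a proprietary API call: 1 = positive, 0 = negative."""
    positive_words = {"great", "good", "love"}
    return 1 if any(w in text.lower().split() for w in positive_words) else 0

def f1_score(y_true, y_pred) -> float:
    """Binary F1 from true/predicted labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

def run_audit(dataset):
    """One audit run: (text, label) pairs in, F1 out."""
    preds = [predict_sentiment(text) for text, _ in dataset]
    labels = [label for _, label in dataset]
    return f1_score(labels, preds)

sample = [("A great film", 1), ("I love it", 1), ("Dull and slow", 0)]
print(round(run_audit(sample), 2))  # → 1.0 on this toy dataset
```

In the deployed system, each run's score is stored with a timestamp so that week-over-week regressions can be detected and published on the status page.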
Early Insights from the data
Consistency. The Google Sentiment Prediction API produced consistent results over the evaluation period (October to December 2022); no improvements or regressions were detected.
Performance. The Google API performed better¹ on the Rotten Tomatoes dataset (F1 = 0.77) than two publicly available baselines evaluated in this notebook:
Proposed next step: open source the data collected by this system for academic research.
| Model | F1 Score |
|---|---|
| distilbert-base-uncased-finetuned-sst-2-english | 0.66 |
| bertweet-base-sentiment-analysis | 0.66 |
| Google Sentiment Prediction API | 0.77 |
¹ This may be an unfair comparison, as we don't know which dataset Google's model was trained on.
Who would benefit from this tool?
Software engineers who build systems on models behind proprietary APIs can be alerted when performance changes.
Academics can source reference performance metrics about proprietary models and study how they change over time.
Policy makers can use this type of tool to gain insight into how proprietary models perform on real data (e.g. the upcoming EU AI Act requires audits).
The tool is open source and could be re-deployed by software companies.
The system is building up a unique dataset of how AI APIs perform over time.
Public, ongoing audit data can be used to create transparency over how models are being changed.
Proposal for next steps
Depth of sentiment API coverage
Breadth of AI APIs
Open source the dataset
Threat model
Similar to the Volkswagen Emissions Scandal, companies could circumvent this system by detecting when they are being queried with entries from open datasets, and return ‘fake’ responses.
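The evasion above can be made concrete with a short sketch: a provider could hash every entry of the known public benchmarks and special-case any incoming query that matches. The benchmark set and normalisation below are illustrative, not a known provider behaviour:

```python
# Sketch of benchmark-detection evasion: flag queries that exactly match
# entries of a known public dataset. Benchmark contents are hypothetical.
import hashlib

KNOWN_BENCHMARK = {"a great film", "dull and slow"}  # e.g. Rotten Tomatoes lines

_benchmark_hashes = {
    hashlib.sha256(text.encode()).hexdigest() for text in KNOWN_BENCHMARK
}

def is_audit_query(text: str) -> bool:
    """Return True if a query matches a known benchmark entry (after normalising)."""
    normalised = text.strip().lower()
    return hashlib.sha256(normalised.encode()).hexdigest() in _benchmark_hashes

print(is_audit_query("A great film"))    # → True: matches a benchmark entry
print(is_audit_query("An average film")) # → False: unseen text
```

One possible mitigation for the auditor is to perturb or paraphrase reference entries before querying, so exact-match detection of this kind fails.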