Paper link: https://storage.googleapis.com/pub-tools-public-publication-data/pdf/aad9f93b86b7addfea4c419b9100c6cdd26cacea.pdf
How to compute: The final test score is computed as follows:
• For each test, half a point is awarded for executing the test manually, with the results documented and distributed.
• A full point is awarded if there is a system in place to run that test automatically on a repeated basis.
• Sum the scores for each of the four sections individually.
• The final ML Test Score is the minimum of the four section sums (see the sketch below).
We choose the minimum because all four sections are important, and a system must attend to all of them in order to raise its score.
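To make the arithmetic concrete, here is a minimal sketch in Python. The section names and per-test scores are hypothetical examples, not values from the paper; it only illustrates the sum-per-section, minimum-across-sections rule.

# Each test scores 0.0 (not tested), 0.5 (tested manually, results
# documented and distributed), or 1.0 (run automatically on a repeated basis).
scores = {
    "features_and_data": [0.5, 1.0, 0.0, 0.5, 1.0, 0.0, 0.5],
    "model_development": [1.0, 0.5, 0.5, 0.0, 0.5, 1.0, 0.0],
    "ml_infrastructure": [0.5, 0.5, 1.0, 1.0, 0.0, 0.5, 0.5],
    "monitoring":        [1.0, 0.5, 0.5, 0.5, 0.0, 0.5, 1.0],
}

# Sum the points within each of the four sections.
section_sums = {name: sum(tests) for name, tests in scores.items()}

# The final ML Test Score is the minimum of the four section sums,
# so the weakest section caps the overall score.
ml_test_score = min(section_sums.values())

print(section_sums)   # {'features_and_data': 3.5, 'model_development': 3.5, ...}
print(ml_test_score)  # 3.5

With these example numbers the section sums are 3.5, 3.5, 4.0, and 4.0, so the overall score is 3.5; raising it means improving the weakest section rather than adding points to an already strong one.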
TESTS FOR FEATURES AND DATA
Scoring per test: Test Manually (0.5), Test Automatically (1.0). Record a note and evidence for each test.
Data 1: Feature expectations are captured in a schema. Score: 0
Data 2: All features are beneficial. Score: 0
Data 3: No feature’s cost is too much. Score: 0
Data 4: Features adhere to meta-level requirements. Score: 0
Data 5: The data pipeline has appropriate privacy controls. Score: 0
Data 6: New features can be added quickly. Score: 0
Data 7: All input feature code is tested. Score: 0
Sum: 0
TESTS FOR MODEL DEVELOPMENT
Model 1: Every model specification undergoes a code review and is checked in to a repository. Score: 0
Model 2: Offline proxy metrics correlate with actual online impact metrics. Score: 0
Model 3: All hyperparameters have been tuned. Score: 0
Model 4: The impact of model staleness is known. Score: 0
Model 5: A simpler model is not better. Score: 0
Model 6: Model quality is sufficient on all important data slices. Score: 0
Model 7: The model has been tested for considerations of inclusion. Score: 0
Sum: 0
TESTS FOR ML INFRASTRUCTURE
Infra 1: Training is reproducible. Score: 0
Infra 2: Model specification code is unit tested. Score: 0
Infra 3: The full ML pipeline is integration tested. Score: 0
Infra 4: Model quality is validated before attempting to serve it. Score: 0
Infra 5: The model allows debugging by observing the step-by-step computation of training or inference on a single example. Score: 0
Infra 6: Models are tested via a canary process before they enter production serving environments. Score: 0
Infra 7: Models can be quickly and safely rolled back to a previous serving version. Score: 0
Sum: 0
MONITORING TESTS FOR ML
Monitor 1: Dependency changes result in notification. Score: 0
Monitor 2: Data invariants hold in training and serving inputs. Score: 0
Monitor 3: Training and serving features compute the same values. Score: 0
Monitor 4: Models are not too stale. Score: 0
Monitor 5: The model is numerically stable. Score: 0
Monitor 6: The model has not experienced a dramatic or slow-leak regression in training speed, serving latency, throughput, or RAM usage. Score: 0
Monitor 7: The model has not experienced a regression in prediction quality on served data. Score: 0
Sum: 0
OVERALL ML TEST SCORE: 0