Paper link: https://storage.googleapis.com/pub-tools-public-publication-data/pdf/aad9f93b86b7addfea4c419b9100c6cdd26cacea.pdf
How to compute: The final test score is computed as follows:
• For each test, half a point is awarded for executing the test manually, with the results documented and distributed.
• A full point is awarded if there is a system in place to run that test automatically on a repeated basis.
• Sum the scores for each of the four sections individually.
• The final ML Test Score is the minimum of the four section sums (see the sketch below).
We choose the minimum because all four sections are important, and a system must attend to all of them in order to raise its score.
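To make the arithmetic concrete, here is a minimal sketch in Python. The section names and per-test scores are hypothetical examples, not values from the paper; it only illustrates the sum-per-section, minimum-across-sections rule.

# Each test scores 0.0 (not tested), 0.5 (tested manually, results
# documented and distributed), or 1.0 (run automatically on a repeated basis).
scores = {
    "features_and_data": [0.5, 1.0, 0.0, 0.5, 1.0, 0.0, 0.5],
    "model_development": [1.0, 0.5, 0.5, 0.0, 0.5, 1.0, 0.0],
    "ml_infrastructure": [0.5, 0.5, 1.0, 1.0, 0.0, 0.5, 0.5],
    "monitoring":        [1.0, 0.5, 0.5, 0.5, 0.0, 0.5, 1.0],
}

# Sum the points within each of the four sections.
section_sums = {name: sum(tests) for name, tests in scores.items()}

# The final ML Test Score is the minimum of the four section sums,
# so the weakest section caps the overall score.
ml_test_score = min(section_sums.values())

print(section_sums)   # {'features_and_data': 3.5, 'model_development': 3.5, ...}
print(ml_test_score)  # 3.5

With these example numbers the section sums are 3.5, 3.5, 4.0, and 4.0, so the overall score is 3.5; raising it means improving the weakest section rather than adding points to an already strong one.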
TESTS FOR FEATURES AND DATA
Scoring per test: Test Manually (0.5), Test Automatically (1.0). Record a note and evidence for each test.
Data 1: Feature expectations are captured in a schema. Score: 0
Data 2: All features are beneficial. Score: 0
Data 3: No feature’s cost is too much. Score: 0
Data 4: Features adhere to meta-level requirements. Score: 0
Data 5: The data pipeline has appropriate privacy controls. Score: 0
Data 6: New features can be added quickly. Score: 0
Data 7: All input feature code is tested. Score: 0
Sum: 0
TESTS FOR MODEL DEVELOPMENT
Model 1: Every model specification undergoes a code review and is checked in to a repository. Score: 0
Model 2: Offline proxy metrics correlate with actual online impact metrics. Score: 0
Model 3: All hyperparameters have been tuned. Score: 0
Model 4: The impact of model staleness is known. Score: 0
Model 5: A simpler model is not better. Score: 0
Model 6: Model quality is sufficient on all important data slices. Score: 0
Model 7: The model has been tested for considerations of inclusion. Score: 0
Sum: 0
TESTS FOR ML INFRASTRUCTURE
Infra 1: Training is reproducible. Score: 0
Infra 2: Model specification code is unit tested. Score: 0
Infra 3: The full ML pipeline is integration tested. Score: 0
Infra 4: Model quality is validated before attempting to serve it. Score: 0
Infra 5: The model allows debugging by observing the step-by-step computation of training or inference on a single example. Score: 0
Infra 6: Models are tested via a canary process before they enter production serving environments. Score: 0
Infra 7: Models can be quickly and safely rolled back to a previous serving version. Score: 0
Sum: 0
MONITORING TESTS FOR ML
Monitor 1: Dependency changes result in notification. Score: 0
Monitor 2: Data invariants hold in training and serving inputs. Score: 0
Monitor 3: Training and serving features compute the same values. Score: 0
Monitor 4: Models are not too stale. Score: 0
Monitor 5: The model is numerically stable. Score: 0
Monitor 6: The model has not experienced a dramatic or slow-leak regression in training speed, serving latency, throughput, or RAM usage. Score: 0
Monitor 7: The model has not experienced a regression in prediction quality on served data. Score: 0
Sum: 0
OVERALL ML TEST SCORE: 0