1 of 14

Learning Hyper Label Model for Programmatic Weak Supervision

Renzhi Wu*, Shen-En Chen*, Jieyu Zhang†, Xu Chu*

*Georgia Tech   †University of Washington

2 of 14

Data is the Bottleneck for ML

ML ≈ Model + Data

Sources:

https://www.semafor.com/article/01/27/2023/openai-has-hired-an-army-of-contractors-to-make-basic-coding-obsolete

https://www.datanami.com/2023/01/20/openai-outsourced-data-labeling-to-kenyan-workers-earning-less-than-2-per-hour-time-report/

Model is gradually commoditized

  • Out-of-the-box invocation of ML libraries gives decent results
  • Transformers for “all” tasks

Data is the bottleneck

3 of 14

Manual vs. Programmatic Supervision

Labeling individual data points

Writing Labeling Functions (LFs), where each LF abstracts a supervision source (e.g., heuristics, existing models, external KBs, …); a minimal example is sketched below

Source: https://www.snorkel.org/use-cases/01-spam-tutorial#3-writing-more-labeling-functions
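
To make this concrete, below is a minimal, hypothetical labeling function in the spirit of the Snorkel spam tutorial linked above; the keyword heuristic and the {1, -1, 0} encoding (spam / ham / abstain) are illustrative assumptions, not the tutorial's exact code.

    # Hypothetical LF for a spam-comment task; encoding is illustrative:
    # 1 = spam, -1 = ham, 0 = abstain (matching the entries of X below).
    def lf_contains_check_out(comment: str) -> int:
        """Heuristic: spam comments often ask readers to 'check out' a link."""
        text = comment.lower()
        if "check out" in text or "subscribe to" in text:
            return 1   # vote spam
        return 0       # abstain: no evidence either way

Each such LF labels every data point (or abstains), producing one column of the weak label matrix X.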

4 of 14

Programmatic Supervision

Challenge:

Incomplete, noisy and conflicting weak labels from LFs

[Figure: weak label matrix X, with one row per data point (Data point 1 … Data point x) and one column per labeling function (LF1, LF2, LF3, …, LFX); entries take values such as 1, -1, and 0]

5 of 14

Label Model

[Figure: the label model takes the weak label matrix X (rows = data points, columns = LF1 … LFX) as input and outputs the inferred ground-truth labels y]

This is the same problem setup as in labeling by crowdsourcing.

6 of 14

Label Model

All existing methods require ad-hoc, per-dataset parameter learning:

  1. Assume an underlying distribution p(y, X, Θ)
  2. Learn parameter Θ according to some objective function
  3. Predict y using p(y, X, Θ)

Example: Dawid and Skene’s method

  • Model the accuracy Θ of each LF
  • Learn the accuracies Θ with an Expectation-Maximization (EM) algorithm (a minimal sketch follows below):
    1. Initialize y by majority vote
    2. Estimate the accuracy Θ of each LF given the current y
    3. Update y by maximizing p(X | y, Θ); repeat steps 2 and 3 until convergence
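
Below is a minimal sketch of this EM loop for the binary case; it keeps a single accuracy parameter per LF and uses the {1, -1, 0} encoding, which simplifies the full confusion-matrix formulation of Dawid and Skene.

    import numpy as np

    def dawid_skene_binary(X, n_iter=50):
        """Simplified Dawid-Skene-style EM (illustrative sketch).

        X: (n_points, n_lfs) matrix with entries in {1, -1, 0} (0 = abstain).
        Returns soft estimates p(y_i = 1) for every data point.
        """
        n, m = X.shape
        # Step 1: initialize y by (soft) majority vote.
        p = (X == 1).sum(1) / np.maximum((X != 0).sum(1), 1)
        for _ in range(n_iter):
            # Step 2: estimate each LF's accuracy theta_j given the current y.
            y_hat = np.where(p >= 0.5, 1, -1)
            theta = np.array([
                np.mean(X[X[:, j] != 0, j] == y_hat[X[:, j] != 0])
                if (X[:, j] != 0).any() else 0.5
                for j in range(m)
            ]).clip(1e-3, 1 - 1e-3)
            # Step 3: update y by maximizing p(X | y, theta), assuming LFs
            # vote independently given y; steps 2 and 3 repeat each iteration.
            log_pos = np.zeros(n)   # log-likelihood of y_i = 1
            log_neg = np.zeros(n)   # log-likelihood of y_i = -1
            for j in range(m):
                mask = X[:, j] != 0
                agree = X[mask, j] == 1
                log_pos[mask] += np.where(agree, np.log(theta[j]), np.log(1 - theta[j]))
                log_neg[mask] += np.where(~agree, np.log(theta[j]), np.log(1 - theta[j]))
            p = 1.0 / (1.0 + np.exp(log_neg - log_pos))
        return p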

7 of 14

Hyper Label Model

Hyper label model

  1. Pre-trained once and works for all datasets.
  2. Predicts y in a single forward pass, y = net(X); no dataset-specific learning is required.

[Figure: the label model block is replaced by a pre-trained neural network h that maps the weak label matrix X to labels y]

Can we replace the label model with a pre-trained neural network?
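
If so, applying the network to a new dataset would look roughly like the sketch below; how the model is saved/loaded and the tensor conventions are illustrative assumptions, not a released API.

    import numpy as np
    import torch

    # Pre-trained once on synthetic data, then reused for every dataset.
    net = torch.load("hyper_label_model.pt")   # hypothetical checkpoint path
    net.eval()

    # Weak label matrix for a new dataset (rows = data points, columns = LFs).
    X = np.array([[ 1, -1,  0],
                  [ 0,  1,  1],
                  [ 1,  1, -1]])

    with torch.no_grad():
        # A single forward pass; no dataset-specific training.
        y = net(torch.as_tensor(X, dtype=torch.float32))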

8 of 14

Neural Network Architecture Design

Requirements:

  1. Handle an arbitrary size of X
  2. Permuting the LFs (columns of X) should not affect y (permutation invariance)
  3. Permuting the data points (rows of X) should permute the rows of y accordingly (permutation equivariance)

$V^{k}_{1,0} = f^{k}\left[\, V^{k-1}_{1,0},\;\; W_1 \tfrac{1}{3}\bigl(V^{k-1}_{0,0} + V^{k-1}_{1,0} + V^{k-1}_{2,0}\bigr),\;\; W_2 \tfrac{1}{2}\bigl(V^{k-1}_{1,0} + V^{k-1}_{1,1}\bigr)\right]$

1. The model parameters are independent of the size of X. (→ Req 1)

2. Permuting the columns or rows of X permutes the corresponding nodes of the graph. (→ Req 3)

3. Average pooling along the rows. (→ Req 2)
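
Below is a minimal sketch of one layer with these properties; it illustrates parameter sharing plus mean pooling and is not the exact message-passing rule of the paper.

    import torch
    import torch.nn as nn

    class EquivariantLayer(nn.Module):
        """One illustrative layer over entry embeddings V of shape (n, m, d),
        i.e. one d-dimensional vector per entry of the weak label matrix X.

        - The same weights are shared across all rows and columns, so the
          parameter count is independent of (n, m)            -> Req 1.
        - Permuting rows/columns of the input permutes the output the same
          way (equivariance)                                  -> Req 3.
        - A final mean pooling over the LF dimension makes the prediction
          invariant to the ordering of LFs                    -> Req 2.
        """
        def __init__(self, d):
            super().__init__()
            self.w_self = nn.Linear(d, d)
            self.w_col = nn.Linear(d, d)   # aggregate over the same LF (column)
            self.w_row = nn.Linear(d, d)   # aggregate over the same data point (row)

        def forward(self, V):
            col_mean = V.mean(dim=0, keepdim=True).expand_as(V)  # over data points
            row_mean = V.mean(dim=1, keepdim=True).expand_as(V)  # over LFs
            return torch.relu(self.w_self(V) + self.w_col(col_mean) + self.w_row(row_mean))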

9 of 14

Learning to be an optimal solution

Pretrain h to be an “optimal” solution.

Step 1: Derive an analytical optimal solution (exponential complexity, so it cannot be used directly).

Step 2: Generate synthetic training data such that the trained model is asymptotically close to the optimal solution.

10 of 14

An analytical “optimal” solution

The Better-Than-Random (BTR) Assumption:

LFs are developed by humans, so we expect the majority of LFs (the majority of columns in X) to be better-than-random estimates of y.

Feasible region Q for y:

[Figure: candidate label vectors y, partitioned into a region where BTR is satisfied (the feasible region Q) and a region where BTR is violated]

The distribution of the latent y over Q is unknown.

Intuition: X is the only information we have, so there is nothing to support preferring some choices of y in Q over others.

The uniform distribution over Q is optimal in both the average case and the worst case.

Under BTR and the uniform distribution, the optimal estimate of y (the one that minimizes expected squared error) is the centroid of Q.
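
A minimal membership test for Q in the binary case might look like the sketch below; the precise definition of "better than random" used here is an illustrative assumption.

    import numpy as np

    def in_feasible_region(X, y):
        """Return True if y lies in the feasible region Q under BTR.

        X: (n, m) weak label matrix with entries in {1, -1, 0} (0 = abstain).
        y: (n,) candidate labels in {1, -1}.
        y is feasible if a majority of LFs agree with y on more than half of
        the points they label, i.e. are better-than-random estimates of y.
        """
        better = 0
        for j in range(X.shape[1]):
            voted = X[:, j] != 0
            if voted.any() and np.mean(X[voted, j] == y[voted]) > 0.5:
                better += 1
        return better > X.shape[1] / 2

Intuitively, the centroid y* averages over all feasible label vectors, which is why computing it exactly is exponential in the size of X (next slide).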

11 of 14

Learning to be the optimal solution

(1) X determines a feasible region Q; (2) Q determines its centroid y*.

Computing y* directly has complexity exponential in the size of X.

Can we use the analytical solution as a supervision signal for learning our hyper label model h such that y* = h(X)?

The analytical solution cannot be directly used

12 of 14

Learning to be the optimal solution


Naive method (infeasible, since computing each y* is exponential): randomly generate many pairs (X, y*)

Our method: randomly generate many pairs (X, y′), where y′ is a uniformly sampled random point in the feasible region Q defined by X


Training data generation based on the analytical solution

  • This can be done efficiently: generate a random pair (X, y′) and drop it if y′ is not in Q (see the sketch below).
  • Theoretical guarantee: the trained model (the one minimizing training loss) is asymptotically close to the “optimal” solution as the training dataset size goes to infinity.
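
A minimal rejection-sampling sketch of this generator is below; the matrix sizes, the {1, -1, 0} encoding, and the samplers are illustrative assumptions, and it reuses the in_feasible_region test sketched on the BTR slide.

    import numpy as np

    rng = np.random.default_rng(0)

    def generate_training_pair(n, m, feasible=in_feasible_region):
        """Sample a random (X, y') and keep it only if y' lies in the
        feasible region Q defined by X (illustrative sketch)."""
        while True:
            X = rng.choice([-1, 0, 1], size=(n, m))   # random weak label matrix
            y = rng.choice([-1, 1], size=n)            # uniformly random candidate y'
            if feasible(X, y):
                return X, y

    # One synthetic training pair; in practice many such pairs are generated
    # and the network is trained to predict y' (e.g., with a squared-error loss).
    X_train, y_train = generate_training_pair(n=100, m=5)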

13 of 14

Experiments: accuracy and efficiency

  • On average 1.4 points better in accuracy than the best existing method
  • 6 times faster

All existing methods (except majority vote) require dataset-specific training.

14 of 14

Summary

  1. A hyper label model
    1. Only needs to be pretrained once on synthetic data; works for all datasets
    2. Faster: inferred labels are obtained in one single forward pass
    3. Better: more accurate than existing label models
  2. Technical innovations:
    • GNN-based model architecture design
      1. Supporting arbitrary input matrix size
      2. Invariant/equivariant to permutations of columns/rows in input matrix
    • The first analytically optimal method
      • but cannot be directly used due to its exponential complexity
    • Principled training data generation
      • The trained model is asymptotically close to the optimal solution