Learning Hyper Label Model for Programmatic Weak Supervision
Renzhi Wu*, Shen-En Chen*, Jieyu Zhang†, Xu Chu*
*Georgia Tech, †University of Washington
Data is the Bottleneck for ML
ML ≈ Model + Data
Sources:
https://www.semafor.com/article/01/27/2023/openai-has-hired-an-army-of-contractors-to-make-basic-coding-obsolete
https://www.datanami.com/2023/01/20/openai-outsourced-data-labeling-to-kenyan-workers-earning-less-than-2-per-hour-time-report/
Model is gradually commoditized
Data is the bottleneck
Manual vs. Programmatic Supervision
Manual: labeling individual data points one by one
Programmatic: writing Labeling Functions (LFs), where each LF abstracts a supervision source (e.g., heuristics, existing models, external KBs, …)
Source: https://www.snorkel.org/use-cases/01-spam-tutorial#3-writing-more-labeling-functions
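For concreteness, a minimal sketch of what LFs can look like, loosely in the style of the Snorkel spam tutorial linked above (the label constants and heuristics here are illustrative assumptions, not the tutorial's exact code):

```python
# Illustrative labeling functions for a spam-detection task.
# Assumed label convention (Snorkel-style): ABSTAIN = -1, HAM = 0, SPAM = 1.
ABSTAIN, HAM, SPAM = -1, 0, 1

def lf_keyword_check_out(text):
    # Heuristic: self-promotional phrasing suggests spam.
    return SPAM if "check out" in text.lower() else ABSTAIN

def lf_contains_url(text):
    # Heuristic: comments with links are often spam.
    return SPAM if "http" in text.lower() else ABSTAIN

def lf_short_comment(text):
    # Heuristic: very short comments tend to be legitimate.
    return HAM if len(text.split()) < 5 else ABSTAIN
```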
Programmatic Supervision
Challenge:
Incomplete, noisy and conflicting weak labels from LFs
[Figure: weak label matrix X, with one row per data point (data point 1, 2, …) and one column per LF (LF1, LF2, LF3, …); each entry is the weak label (or abstention) that the LF assigns to the data point.]
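Applying every LF to every data point yields the weak label matrix X; a minimal sketch using the illustrative LFs defined above:

```python
import numpy as np

comments = [
    "check out my channel http://example.com",
    "nice song",
    "I listen to this every day, great video",
]
lfs = [lf_keyword_check_out, lf_contains_url, lf_short_comment]

# One row per data point, one column per LF; -1 marks an abstaining LF.
X = np.array([[lf(c) for lf in lfs] for c in comments])
# X == [[ 1,  1, -1],
#       [-1, -1,  0],
#       [-1, -1, -1]]
```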
Label Model
[Figure: a label model takes the weak label matrix X (one row per data point, one column per LF) and outputs the inferred ground-truth labels y, one label per data point.]
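A label model is any function that maps X to y. The simplest baseline (referenced later in the experiments slide) is majority vote; a minimal sketch, assuming the abstain = -1 convention used above:

```python
import numpy as np

def majority_vote(X, abstain=-1):
    """Per data point, return the most common non-abstaining weak label."""
    y = np.full(X.shape[0], abstain)
    for i, row in enumerate(X):
        votes = row[row != abstain]
        if votes.size > 0:
            values, counts = np.unique(votes, return_counts=True)
            y[i] = values[np.argmax(counts)]
    return y
```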
The same problem setup arises in labeling by crowdsourcing
Label Model
Existing methods all require ad-hoc parameter learning on each dataset
Example: Dawid and Skene’s method
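To illustrate what such dataset-specific learning looks like, below is a minimal EM sketch in the spirit of Dawid and Skene's method for binary labels (abstain = -1 is treated as a third LF output). Initialization, smoothing, and the fixed iteration count are simplifying assumptions, not the original formulation:

```python
import numpy as np

def dawid_skene(X, n_classes=2, n_iter=50, abstain=-1):
    """EM for a Dawid-Skene-style label model: learn one confusion matrix per LF."""
    n, m = X.shape
    # Map weak labels to output symbols 0..n_classes-1, with symbol n_classes = abstain.
    O = np.where(X == abstain, n_classes, X).astype(int)
    n_out = n_classes + 1

    # Initialize the posterior over true labels with a soft majority vote.
    mu = np.ones((n, n_classes))
    for c in range(n_classes):
        mu[:, c] += (X == c).sum(axis=1)
    mu /= mu.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # M-step: class prior and per-LF confusion matrices theta[j, true class, output].
        prior = mu.mean(axis=0) + 1e-6
        theta = np.zeros((m, n_classes, n_out))
        for j in range(m):
            for l in range(n_out):
                theta[j, :, l] = mu[O[:, j] == l].sum(axis=0)
        theta += 1e-6                                   # smoothing to avoid log(0)
        theta /= theta.sum(axis=2, keepdims=True)

        # E-step: posterior over true labels given each LF's output.
        log_mu = np.tile(np.log(prior), (n, 1))
        for j in range(m):
            log_mu += np.log(theta[j][:, O[:, j]].T)    # (n, n_classes)
        mu = np.exp(log_mu - log_mu.max(axis=1, keepdims=True))
        mu /= mu.sum(axis=1, keepdims=True)
    return mu                                           # soft labels; argmax gives y
```

Every new dataset requires running this kind of fitting procedure from scratch.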
Hyper Label Model
[Figure: the hyper label model replaces the per-dataset label model with a pretrained neural network h that maps the weak label matrix X directly to labels y.]
Can we replace the label model with a pretrained neural network?
Neural Network Architecture Design
Requirements:
Example update for node $V_{1,0}$ (one node per entry of X) at layer $k$:
$V^{k}_{1,0} = f^{k}\!\left[\, V^{k-1}_{1,0},\ W_1 \tfrac{V^{k-1}_{0,0} + V^{k-1}_{1,0} + V^{k-1}_{2,0}}{3},\ W_2 \tfrac{V^{k-1}_{1,0} + V^{k-1}_{1,1}}{2} \,\right]$
1. Model parameters are shared and invariant to the size of X. (-> Req 1)
2. Permuting rows or columns of X permutes the corresponding nodes in the graph accordingly. (-> Req 3)
3. Average pooling along the rows of the node states. (-> Req 2)
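A minimal PyTorch sketch of a layer with these properties, following the update rule above (the dimensions, activations, and readout here are illustrative assumptions, not the paper's exact architecture):

```python
import torch
import torch.nn as nn

class HyperLMLayer(nn.Module):
    """One message-passing layer over the node states V of shape (n, m, d):
    n data points, m LFs, one node per entry of X."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.W1 = nn.Linear(d_in, d_in, bias=False)   # aggregation over the column (same LF)
        self.W2 = nn.Linear(d_in, d_in, bias=False)   # aggregation over the row (same data point)
        self.f = nn.Sequential(nn.Linear(3 * d_in, d_out), nn.ReLU())

    def forward(self, V):
        col_mean = V.mean(dim=0, keepdim=True).expand_as(V)   # average over data points per LF
        row_mean = V.mean(dim=1, keepdim=True).expand_as(V)   # average over LFs per data point
        return self.f(torch.cat([V, self.W1(col_mean), self.W2(row_mean)], dim=-1))

class HyperLMSketch(nn.Module):
    """Stack of layers + average pooling along each row -> one score per data point."""
    def __init__(self, d=16, n_layers=3):
        super().__init__()
        self.embed = nn.Linear(1, d)                  # lift each entry of X to d dimensions
        self.layers = nn.ModuleList([HyperLMLayer(d, d) for _ in range(n_layers)])
        self.readout = nn.Linear(d, 1)

    def forward(self, X):                             # X: (n, m) float tensor of weak labels
        V = self.embed(X.unsqueeze(-1))               # (n, m, d)
        for layer in self.layers:
            V = layer(V)
        # Avg pool along the rows -> (n, d) -> one (assumed binary) label score per data point.
        return torch.sigmoid(self.readout(V.mean(dim=1))).squeeze(-1)
```

Because the linear maps act only on the feature dimension, the parameters do not depend on the size of X, and permuting rows or columns of X simply permutes the node states.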
Learning to be an optimal solution
Pretrain h to be an “optimal” solution.
Step 1: Derive an analytical optimal solution (its complexity is exponential, so it cannot be used directly):
Step 2: Generate synthetic training data so that the trained model is asymptotically close to the optimal solution:
An analytical “optimal” solution
The Better-Than-Random (BTR) Assumption:
LFs are developed by humans, so we expect the majority of LFs (the majority of the columns in X) to be better-than-random estimates of y
Feasible region Q for y:
[Figure: candidate label vectors y for which BTR is satisfied form the feasible region Q; the remaining label vectors violate BTR.]
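A minimal sketch of one reading of this construction: a candidate y is in Q when more than half of the LFs agree with it more often than they disagree on their non-abstaining points (the exact definition in the paper may handle ties and coverage differently):

```python
import numpy as np

def in_feasible_region(X, y, abstain=-1):
    """Illustrative BTR check: does a majority of LFs beat random guessing w.r.t. y?"""
    better = 0
    for j in range(X.shape[1]):
        votes = X[:, j] != abstain
        if votes.any():
            agreement = (X[votes, j] == y[votes]).mean()
            better += agreement > 0.5
    return better > X.shape[1] / 2
```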
Distribution of latent y on Q is unknown:
Intuition: X is the only information we have; nothing supports preferring some choices of y in Q over others.
The uniform distribution is optimal in both the average case and the worst case.
The optimal estimate of y (the one that minimizes the expected squared error) under BTR and the uniform distribution is the centroid of Q.
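The centroid claim follows from the standard decomposition of squared error (stated here for completeness): for $y \sim \mathrm{Unif}(Q)$, $\mathbb{E}\,\lVert y-\hat{y}\rVert_2^2 = \mathbb{E}\,\lVert y-\mathbb{E}[y]\rVert_2^2 + \lVert \mathbb{E}[y]-\hat{y}\rVert_2^2$, which is minimized by $\hat{y} = \mathbb{E}[y] = \mathrm{centroid}(Q)$.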
Learning to be the optimal solution
[Diagram: X → feasible region Q → centroid y*. (1) X determines a feasible region Q; (2) Q determines its centroid y*.]
Computational complexity is exponential in the size of X
Can we use the analytical solution as a supervision signal for learning our hyper label model h such that y* = h(X)?
The analytical solution cannot be directly used
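To make the exponential cost concrete, a brute-force computation of the centroid would enumerate all 2^n candidate label vectors (using in_feasible_region from the earlier sketch); this is only feasible for toy-sized X, which is why the analytical solution is used as a training target rather than at inference time:

```python
import itertools
import numpy as np

def exact_centroid(X, abstain=-1):
    """Brute-force centroid y* of Q: average of all feasible binary label vectors.

    Enumerates 2^n candidates, so the cost is exponential in the number of
    data points n; intended only to illustrate why this cannot be run directly.
    """
    n = X.shape[0]
    members = [np.array(y) for y in itertools.product([0, 1], repeat=n)
               if in_feasible_region(X, np.array(y), abstain)]
    return np.mean(members, axis=0) if members else None
```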
Learning to be the optimal solution
Naive method (infeasible): randomly generate many pairs (X, y*).
Our method: randomly generate many pairs (X, y'), where y' is a uniformly sampled random point in the feasible region Q defined by X (see the sampling sketch on the next slide).
Training data generation based on the analytical solution
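A minimal sketch of the idea behind Step 2, reusing in_feasible_region and HyperLMSketch from the earlier sketches: draw a random synthetic X, rejection-sample y' uniformly from its feasible region Q, and regress h(X) onto many such y' so that h(X) approaches E[y' | X], i.e., the centroid y*. The paper's actual generator and training setup are more refined than this; the matrix sizes and rates below are assumptions.

```python
import numpy as np
import torch

def sample_training_pair(n=8, m=5, abstain=-1, rng=None):
    """Generate one synthetic (X, y') pair for pretraining the hyper label model."""
    if rng is None:
        rng = np.random.default_rng()
    while True:
        # Random synthetic weak label matrix (abstain/class rates are assumptions).
        X = rng.choice([abstain, 0, 1], size=(n, m), p=[0.3, 0.35, 0.35])
        # Rejection sampling: uniform over {0,1}^n restricted to Q is uniform over Q.
        for _ in range(1000):
            y = rng.integers(0, 2, size=n)
            if in_feasible_region(X, y, abstain):
                return X, y
        # If Q appears empty for this X (rare), draw a fresh X and retry.

# Pretraining loop sketch: regressing h(X) onto uniformly sampled y' drives
# h(X) toward E[y' | X] = centroid y*, without ever computing y* explicitly.
model = HyperLMSketch(d=16, n_layers=3)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(10000):
    X_np, y_np = sample_training_pair(n=np.random.randint(5, 50),
                                      m=np.random.randint(3, 15))
    X_t = torch.tensor(X_np, dtype=torch.float32)
    y_t = torch.tensor(y_np, dtype=torch.float32)
    loss = torch.nn.functional.mse_loss(model(X_t), y_t)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Once pretrained this way, the model is applied to the weak label matrix of any new dataset without dataset-specific training.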
Experiments: accuracy and efficiency
1.4 accuracy points better (on average) than the best existing methods
6 times faster (on average)
Existing methods (except majority vote) all require training for each dataset
Summary