1 of 14

Learning Hyper Label Model for Programmatic Weak Supervision

Renzhi Wu*, Shen-En Chen*, Jieyu Zhang†, Xu Chu*

*Georgia Tech   †University of Washington

2 of 14

Data is the Bottleneck for ML

ML ≈ Model + Data

Sources:

https://www.semafor.com/article/01/27/2023/openai-has-hired-an-army-of-contractors-to-make-basic-coding-obsolete

https://www.datanami.com/2023/01/20/openai-outsourced-data-labeling-to-kenyan-workers-earning-less-than-2-per-hour-time-report/

Model is gradually commoditized

  • Out-of-the-box invocation of ML libraries gives decent results
  • Transformers for “all” tasks

Data is the bottleneck

3 of 14

Manual vs. Programmatic Supervision

Labeling individual data points

Writing Labeling Functions (LFs), where each LF abstracts a supervision source (e.g., heuristics, existing models, external KBs, …); a minimal example is sketched below

Source: https://www.snorkel.org/use-cases/01-spam-tutorial#3-writing-more-labeling-functions
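
To make this concrete, below is a minimal, hypothetical labeling function in the spirit of the Snorkel spam tutorial linked above; the keyword heuristic and the {1, -1, 0} encoding (spam / ham / abstain) are illustrative assumptions, not the tutorial's exact code.

    # Hypothetical LF for a spam-comment task; encoding is illustrative:
    # 1 = spam, -1 = ham, 0 = abstain (matching the entries of X below).
    def lf_contains_check_out(comment: str) -> int:
        """Heuristic: spam comments often ask readers to 'check out' a link."""
        text = comment.lower()
        if "check out" in text or "subscribe to" in text:
            return 1   # vote spam
        return 0       # abstain: no evidence either way

Each such LF labels every data point (or abstains), producing one column of the weak label matrix X.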

4 of 14

Programmatic Supervision

Challenge:

Incomplete, noisy and conflicting weak labels from LFs

[Figure: weak label matrix X, with one row per data point (Data point 1 … Data point x) and one column per labeling function (LF1, LF2, LF3, …, LFX); entries take values such as 1, -1, and 0]

5 of 14

Label Model

[Figure: the label model takes the weak label matrix X (rows = data points, columns = LF1 … LFX) as input and outputs the inferred ground-truth labels y]

This is the same problem setup as in labeling by crowdsourcing.

6 of 14

Label Model

All existing methods require ad-hoc, per-dataset parameter learning:

  1. Assume an underlying distribution p(y, X, Θ)
  2. Learn parameter Θ according to some objective function
  3. Predict y using p(y, X, Θ)

Example: Dawid and Skene’s method

  • Model the accuracy Θ of each LF
  • Learn the accuracies Θ with an Expectation-Maximization (EM) algorithm (a minimal sketch follows below):
    1. Initialize y by majority vote
    2. Estimate the accuracy Θ of each LF given the current y
    3. Update y by maximizing p(X | y, Θ); repeat steps 2 and 3 until convergence
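
Below is a minimal sketch of this EM loop for the binary case; it keeps a single accuracy parameter per LF and uses the {1, -1, 0} encoding, which simplifies the full confusion-matrix formulation of Dawid and Skene.

    import numpy as np

    def dawid_skene_binary(X, n_iter=50):
        """Simplified Dawid-Skene-style EM (illustrative sketch).

        X: (n_points, n_lfs) matrix with entries in {1, -1, 0} (0 = abstain).
        Returns soft estimates p(y_i = 1) for every data point.
        """
        n, m = X.shape
        # Step 1: initialize y by (soft) majority vote.
        p = (X == 1).sum(1) / np.maximum((X != 0).sum(1), 1)
        for _ in range(n_iter):
            # Step 2: estimate each LF's accuracy theta_j given the current y.
            y_hat = np.where(p >= 0.5, 1, -1)
            theta = np.array([
                np.mean(X[X[:, j] != 0, j] == y_hat[X[:, j] != 0])
                if (X[:, j] != 0).any() else 0.5
                for j in range(m)
            ]).clip(1e-3, 1 - 1e-3)
            # Step 3: update y by maximizing p(X | y, theta), assuming LFs
            # vote independently given y; steps 2 and 3 repeat each iteration.
            log_pos = np.zeros(n)   # log-likelihood of y_i = 1
            log_neg = np.zeros(n)   # log-likelihood of y_i = -1
            for j in range(m):
                mask = X[:, j] != 0
                agree = X[mask, j] == 1
                log_pos[mask] += np.where(agree, np.log(theta[j]), np.log(1 - theta[j]))
                log_neg[mask] += np.where(~agree, np.log(theta[j]), np.log(1 - theta[j]))
            p = 1.0 / (1.0 + np.exp(log_neg - log_pos))
        return p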

7 of 14

Hyper Label Model

Hyper label model

  1. Pre-trained once and works for all datasets.
  2. Predicts y in a single forward pass, y = net(X); no dataset-specific learning is required.

[Figure: the label model block is replaced by a pre-trained neural network h that maps the weak label matrix X to labels y]

Can we replace the label model with a pre-trained neural network?
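
If so, applying the network to a new dataset would look roughly like the sketch below; how the model is saved/loaded and the tensor conventions are illustrative assumptions, not a released API.

    import numpy as np
    import torch

    # Pre-trained once on synthetic data, then reused for every dataset.
    net = torch.load("hyper_label_model.pt")   # hypothetical checkpoint path
    net.eval()

    # Weak label matrix for a new dataset (rows = data points, columns = LFs).
    X = np.array([[ 1, -1,  0],
                  [ 0,  1,  1],
                  [ 1,  1, -1]])

    with torch.no_grad():
        # A single forward pass; no dataset-specific training.
        y = net(torch.as_tensor(X, dtype=torch.float32))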

8 of 14

Neural Network Architecture Design

Requirements:

  1. Handle an arbitrary size of X
  2. Permuting the LFs (columns of X) should not affect y (permutation invariance)
  3. Permuting the data points (rows of X) should permute the rows of y accordingly (permutation equivariance)

$V^{k}_{1,0} = f^{k}\left[\, V^{k-1}_{1,0},\;\; W_1 \tfrac{1}{3}\bigl(V^{k-1}_{0,0} + V^{k-1}_{1,0} + V^{k-1}_{2,0}\bigr),\;\; W_2 \tfrac{1}{2}\bigl(V^{k-1}_{1,0} + V^{k-1}_{1,1}\bigr)\right]$

1. The model parameters are independent of the size of X. (→ Req 1)

2. Permuting the columns or rows of X permutes the corresponding nodes of the graph. (→ Req 3)

3. Average pooling along the rows. (→ Req 2)
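
Below is a minimal sketch of one layer with these properties; it illustrates parameter sharing plus mean pooling and is not the exact message-passing rule of the paper.

    import torch
    import torch.nn as nn

    class EquivariantLayer(nn.Module):
        """One illustrative layer over entry embeddings V of shape (n, m, d),
        i.e. one d-dimensional vector per entry of the weak label matrix X.

        - The same weights are shared across all rows and columns, so the
          parameter count is independent of (n, m)            -> Req 1.
        - Permuting rows/columns of the input permutes the output the same
          way (equivariance)                                  -> Req 3.
        - A final mean pooling over the LF dimension makes the prediction
          invariant to the ordering of LFs                    -> Req 2.
        """
        def __init__(self, d):
            super().__init__()
            self.w_self = nn.Linear(d, d)
            self.w_col = nn.Linear(d, d)   # aggregate over the same LF (column)
            self.w_row = nn.Linear(d, d)   # aggregate over the same data point (row)

        def forward(self, V):
            col_mean = V.mean(dim=0, keepdim=True).expand_as(V)  # over data points
            row_mean = V.mean(dim=1, keepdim=True).expand_as(V)  # over LFs
            return torch.relu(self.w_self(V) + self.w_col(col_mean) + self.w_row(row_mean))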

9 of 14

Learning to be an optimal solution

Pretrain h to be an “optimal” solution.

Step 1: Derive an analytical optimal solution (exponential complexity, so it cannot be used directly).

Step 2: Generate synthetic training data such that the trained model is asymptotically close to the optimal solution.

10 of 14

An analytical “optimal” solution

The Better-Than-Random (BTR) Assumption:

LFs are developed by humans, so we expect the majority of LFs (the majority of columns in X) to be better-than-random estimates of y.

Feasible region Q for y:

[Figure: candidate label vectors y, partitioned into a region where BTR is satisfied (the feasible region Q) and a region where BTR is violated]

The distribution of the latent y over Q is unknown.

Intuition: X is the only information we have, so there is nothing to support preferring some choices of y in Q over others.

The uniform distribution over Q is optimal in both the average case and the worst case.

Under BTR and the uniform distribution, the optimal estimate of y (the one that minimizes expected squared error) is the centroid of Q.
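
A minimal membership test for Q in the binary case might look like the sketch below; the precise definition of "better than random" used here is an illustrative assumption.

    import numpy as np

    def in_feasible_region(X, y):
        """Return True if y lies in the feasible region Q under BTR.

        X: (n, m) weak label matrix with entries in {1, -1, 0} (0 = abstain).
        y: (n,) candidate labels in {1, -1}.
        y is feasible if a majority of LFs agree with y on more than half of
        the points they label, i.e. are better-than-random estimates of y.
        """
        better = 0
        for j in range(X.shape[1]):
            voted = X[:, j] != 0
            if voted.any() and np.mean(X[voted, j] == y[voted]) > 0.5:
                better += 1
        return better > X.shape[1] / 2

Intuitively, the centroid y* averages over all feasible label vectors, which is why computing it exactly is exponential in the size of X (next slide).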

11 of 14

Learning to be the optimal solution

(1) X determines a feasible region Q; (2) Q determines its centroid y*.

Computing y* directly has complexity exponential in the size of X.

Can we use the analytical solution as a supervision signal for learning our hyper label model h such that y* = h(X)?

The analytical solution cannot be directly used

12 of 14

Learning to be the optimal solution


Naive method (infeasible, since computing each y* is exponential): randomly generate many pairs (X, y*)

Our method: randomly generate many pairs (X, y′), where y′ is a uniformly sampled random point in the feasible region Q defined by X


Training data generation based on the analytical solution

  • This can be done efficiently: generate a random pair (X, y′) and drop it if y′ is not in Q (see the sketch below).
  • Theoretical guarantee: the trained model (the one minimizing training loss) is asymptotically close to the “optimal” solution as the training dataset size goes to infinity.
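
A minimal rejection-sampling sketch of this generator is below; the matrix sizes, the {1, -1, 0} encoding, and the samplers are illustrative assumptions, and it reuses the in_feasible_region test sketched on the BTR slide.

    import numpy as np

    rng = np.random.default_rng(0)

    def generate_training_pair(n, m, feasible=in_feasible_region):
        """Sample a random (X, y') and keep it only if y' lies in the
        feasible region Q defined by X (illustrative sketch)."""
        while True:
            X = rng.choice([-1, 0, 1], size=(n, m))   # random weak label matrix
            y = rng.choice([-1, 1], size=n)            # uniformly random candidate y'
            if feasible(X, y):
                return X, y

    # One synthetic training pair; in practice many such pairs are generated
    # and the network is trained to predict y' (e.g., with a squared-error loss).
    X_train, y_train = generate_training_pair(n=100, m=5)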

13 of 14

Experiments: accuracy and efficiency

  • On average 1.4 points better in accuracy than the best existing method
  • 6 times faster

All existing methods (except majority vote) require dataset-specific training.

14 of 14

Summary

  1. A hyper label model
    1. Only needs to be pretrained once on synthetic data; works for all datasets
    2. Faster: inferred labels are obtained in one single forward pass
    3. Better: more accurate than existing label models
  2. Technical innovations:
    • GNN-based model architecture design
      1. Supporting arbitrary input matrix size
      2. Invariant/equivariant to permutations of columns/rows in input matrix
    • The first analytically optimal method
      • but cannot be directly used due to its exponential complexity
    • Principled training data generation
      • The trained model is asymptotically close to the optimal solution