1 of 19

Validating Labeling Functions in Domain Shift

20223137 Yewon Kim

20224560 Seungjoo Lee

Presentation template was adopted from SlidesCarnival.

2 of 19

To obtain more labeled training data, weak supervision leverages cheaper but noisy labels

  • Get cheaper labels from non-experts (e.g., crowdsourcing)
  • Get higher-level supervision from experts (e.g., labeling functions)
  • Get pseudo-labels from pre-trained models (e.g., knowledge distillation)

3 of 19

A labeling function (LF) is a lightweight, cost-effective way to generate labels for unlabeled data.

Source: https://snorkel.ai/weak-supervision-modeling/

4 of 19

However, what if data distribution shifts?

Example scenario: sentiment analysis from review data

Figure: a data stream over time, with an LF maintained on the developer side.

  • LF (developer side): ^(?=.*\bkid\b)(?=.*\blikes?\b).*$
  • Domain: toy. Review: "My kid likes it" → positive

5 of 19

However, what if data distribution shifts?

Example scenario: sentiment analysis from review data

Figure: a data stream over time, with an LF maintained on the developer side.

  • LF (developer side): ^(?=.*\bkid\b)(?=.*\blikes?\b).*$
  • Domain: toy. Review: "My kid likes it" → positive
  • Domain: book. Review: "A true love story" → ???

LFs are no longer valid; need to update!
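The scenario can be reproduced with a few lines of Python. This is a minimal sketch, not the authors' code: the label constants `POSITIVE` and `ABSTAIN` and the function name `kid_likes_lf` are assumptions made for illustration, and the pattern uses `likes?` so that the inflected form in the example review is matched.

```python
import re

# Hypothetical label values; the slides do not specify them.
POSITIVE, ABSTAIN = 1, -1

# Both "kid" and "like"/"likes" must appear somewhere in the review.
KID_LIKES = re.compile(r"^(?=.*\bkid\b)(?=.*\blikes?\b).*$", re.IGNORECASE)


def kid_likes_lf(review: str) -> int:
    """Vote POSITIVE when the pattern matches, otherwise abstain."""
    return POSITIVE if KID_LIKES.search(review) else ABSTAIN


toy_review = "My kid likes it"      # toy domain: the LF fires
book_review = "A true love story"   # book domain: the LF abstains

print(kid_likes_lf(toy_review), kid_likes_lf(book_review))
```

The LF written for toy reviews fires on the toy review and abstains on the book review, so after the shift it no longer contributes any label.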

6 of 19

However, what if data distribution shifts?

Example scenario: sentiment analysis from review data

Figure: a data stream over time, with an LF maintained on the developer side.

  • LF (developer side): ^(?=.*\bkid\b)(?=.*\blikes?\b).*$
  • Domain: toy. Review: "My kid likes it" → positive
  • Domain: book. Review: "A true love story" → ???

Detecting the data shift accurately and promptly, and prompting engineers to update LFs, is crucial to ensure reliable performance of the end model!

7 of 19

Our idea: Use the outputs from LFs to detect domain shift!

  • Previous works: inspect the input itself to determine whether it is out-of-distribution (OOD)
    • Specifically, define a score function $s(x)$ and classify an input $x$ as OOD if $s(x) < \tau$, where $\tau$ is a predefined threshold.
    • Score functions: e.g., language models
  • Instead, we observe outputs of LFs to determine OOD
    • Outputs of LFs contain richer information, as LFs are specifically designed to identify certain aspects of the data.
    • More efficient and scalable, as it does not necessarily require models to capture important features from raw data.

8 of 19

Method: (1) Changing discrete LFs to continuous LFs

Example of a discrete LF for NLP sentiment analysis: keyword-based heuristic function

@labeling_function
def positive_keyword_lf(text, keyword):
    # Vote POSITIVE when the keyword occurs in the lowercased text; otherwise abstain.
    if keyword in text.lower():
        return POSITIVE
    return ABSTAIN

  • Outputs are limited to a few values (POSITIVE, NEGATIVE, ABSTAIN)
  • Information from discrete LFs is not enough to detect OOD

t-SNE result, 8 discrete LFs

9 of 19

Method: (1) Changing discrete LFs to continuous LFs

t-SNE result, 8 discrete LFs

10 of 19

Method: (1) Changing discrete LFs to continuous LFs

Example of a continuous LF: cosine similarity of GloVe word embeddings

@labeling_function
def positive_keyword_lf(text, keyword):
    # Embed the passage tokens and the keyword with GloVe, then compare the two embeddings.
    text_emb = glove(text.lower().split())
    keyword_emb = glove([keyword])
    return get_cosine_similarity(text_emb, keyword_emb)

  • Cosine similarity between passage and keyword
  • Outputs continuous values → dense information

t-SNE result, 8 continuous LFs
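A self-contained sketch of the continuous LF idea follows. It is illustrative only: `TOY_VECTORS` is a made-up 3-dimensional table standing in for GloVe, mean pooling of token vectors is an assumption (the slide does not say how the passage embedding is aggregated), and returning 0.0 when no token is known is likewise an assumption.

```python
import math

# Hypothetical 3-d word vectors standing in for GloVe; the values are made up.
TOY_VECTORS = {
    "great": [0.9, 0.1, 0.0],
    "excellent": [0.8, 0.2, 0.1],
    "movie": [0.1, 0.9, 0.2],
    "boring": [-0.7, 0.3, 0.4],
}


def embed(tokens):
    """Mean-pool the vectors of known tokens; return None if no token is known."""
    vecs = [TOY_VECTORS[t] for t in tokens if t in TOY_VECTORS]
    if not vecs:
        return None
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def continuous_keyword_lf(text, keyword):
    """Return a similarity in [-1, 1] instead of a discrete vote."""
    text_emb = embed(text.lower().split())
    keyword_emb = embed([keyword])
    if text_emb is None or keyword_emb is None:
        return 0.0  # assumption: no usable embedding means no signal
    return cosine_similarity(text_emb, keyword_emb)


pos = continuous_keyword_lf("great movie", "excellent")
neg = continuous_keyword_lf("boring movie", "excellent")
print(pos > neg)
```

Unlike the discrete version, every passage receives a graded score, even when the keyword itself never appears in the text.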

11 of 19

Method: (2) Kernel density estimation

12 of 19

Overall pipeline: Training phase

Unlabeled training data → Discrete LFs → Continuous LFs → Kernel density estimation

13 of 19

Overall pipeline: Testing phase

  • Unlabeled test data → Discrete LFs → Labeled data
  • Unlabeled test data → Continuous LFs → Kernel density estimation → OOD detection
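The two phases can be sketched end to end in plain Python. This is a hedged illustration rather than the authors' implementation: the continuous LF outputs are simulated with synthetic 2-d Gaussian samples, the Gaussian kernel and the choice of the threshold `tau` as the 5th percentile of held-out ID scores are assumptions, and only the bandwidth h = 0.05 is taken from the reported setup.

```python
import math
import random


def log_kde(x, train, h):
    """Log-density of x under a Gaussian KDE with bandwidth h fitted on `train`."""
    d = len(x)
    log_norm = -math.log(len(train)) - d * math.log(h) - 0.5 * d * math.log(2 * math.pi)
    exps = [-sum((a - b) ** 2 for a, b in zip(x, t)) / (2 * h * h) for t in train]
    m = max(exps)  # log-sum-exp trick for numerical stability
    return log_norm + m + math.log(sum(math.exp(e - m) for e in exps))


rng = random.Random(0)


def sample(center, n, std=0.05):
    """Synthetic stand-in for the continuous LF outputs of n documents."""
    return [[rng.gauss(c, std) for c in center] for _ in range(n)]


# Training phase: fit the KDE on LF outputs of unlabeled in-distribution data.
h = 0.05
train = sample((0.6, 0.2), 200)

# Assumption: pick the threshold from held-out ID data (5th percentile of scores).
val_scores = sorted(log_kde(x, train, h) for x in sample((0.6, 0.2), 100))
tau = val_scores[len(val_scores) // 20]


def is_ood(x):
    """Testing phase: flag an input whose LF-output density falls below tau."""
    return log_kde(x, train, h) < tau


id_test = sample((0.6, 0.2), 100)
ood_test = sample((0.3, 0.4), 100)  # shifted LF outputs simulate a new domain
id_flag_rate = sum(is_ood(x) for x in id_test) / len(id_test)
ood_flag_rate = sum(is_ood(x) for x in ood_test) / len(ood_test)
print(f"flagged ID: {id_flag_rate:.2f}, flagged OOD: {ood_flag_rate:.2f}")
```

The density is estimated over LF outputs only, so the detector never looks at the raw text.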

14 of 19

Evaluation setup: task and dataset

Sentiment analysis task (binary classification); we used IMDB [1], Yelp [2], and Amazon reviews [3].

  • In-distribution (ID): IMDB
  • Out-of-distribution (OOD): Yelp, Amazon-baby, Amazon-electronics, Amazon-jewelry, Amazon-home, Amazon-sports
  • Train: ID (20,000) / Test: ID (5,000) + OOD (5,000)

[1] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. Learning Word Vectors for Sentiment Analysis. The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).

[2] Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).

[3] Ruining He and Julian McAuley. Ups and Downs: Modeling the Visual Evolution of Fashion Trends with One-Class Collaborative Filtering. WWW 2016.

15 of 19

Evaluation setup: LF development

  • Keyword-based interactive LF generation using Argilla [1]

[1] https://docs.argilla.io/en/latest/

16 of 19

Evaluation setup: LF development

  • Keyword-based interactive LF generation using Argilla [1]
    • 12 positive keywords, 20 negative keywords

Label      Keywords
Positive   impress, adorable, enjoy, excellent, beautiful, wonderful, recommend, best, masterpiece, performance * best, performance * good
Negative   terrible, poor, stupid, wrong, disappoint, painful, awful, boring, worse, worst, bad, cliche, killer, unnecessary, waste, least try, nothing * special, nothing * even, performance * worst, acting * bad

[1] https://docs.argilla.io/en/latest/

17 of 19

Results: OOD Detection

KDE bandwidth h = 0.05, batch size fixed at 16

ID     OOD                  AUROC   OOD accuracy (coverage)   ID accuracy (coverage)
IMDB   Yelp                 0.93    0.78 (0.57)               0.74 (0.82)
IMDB   Amazon-baby          0.96    0.78 (0.41)               0.74 (0.82)
IMDB   Amazon-electronics   0.95    0.75 (0.42)               0.74 (0.82)
IMDB   Amazon-jewelry       1.00    0.86 (0.39)               0.74 (0.82)
IMDB   Amazon-home          0.98    0.80 (0.39)               0.74 (0.82)
IMDB   Amazon-sports        0.99    0.79 (0.33)               0.74 (0.82)

18 of 19

Batch-AUROC Tradeoff

  • Batch size = 1: AUC = 0.68
  • Batch size = 8: AUC = 0.89
  • Batch size = 16: AUC = 0.96
  • Batch size = 32: AUC = 0.99
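One plausible reading of this trend is that scoring a batch rather than a single input averages out per-sample noise. The sketch below illustrates that effect on synthetic data only. The slides do not state how batches are scored, so averaging per-sample scores within a batch is an assumption, as are the Gaussian score distributions and the helper names `auroc` and `batch_means`. The numbers it prints are unrelated to the results reported above.

```python
import random


def auroc(id_scores, ood_scores):
    """Probability that a random OOD score exceeds a random ID score (ties count 0.5)."""
    wins = sum((o > i) + 0.5 * (o == i) for o in ood_scores for i in id_scores)
    return wins / (len(ood_scores) * len(id_scores))


def batch_means(scores, batch_size):
    """Average consecutive, non-overlapping batches of per-sample scores."""
    return [
        sum(scores[k:k + batch_size]) / batch_size
        for k in range(0, len(scores) - batch_size + 1, batch_size)
    ]


rng = random.Random(0)
# Synthetic per-sample OOD scores (higher = more OOD); the two classes overlap heavily.
id_scores = [rng.gauss(0.0, 1.0) for _ in range(960)]
ood_scores = [rng.gauss(0.7, 1.0) for _ in range(960)]

auc_by_batch = {
    b: auroc(batch_means(id_scores, b), batch_means(ood_scores, b))
    for b in (1, 8, 16, 32)
}
for b, auc in auc_by_batch.items():
    print(f"batch size {b:>2}: AUROC = {auc:.2f}")
```

Larger batches separate the two score distributions more cleanly, at the cost of waiting for more inputs before a shift can be flagged.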

19 of 19

Discussion & Future work

  • Providing explainable prompts to engineers
    • Train a separate OOD detector for each LF
    • When OOD is detected, run the per-LF OOD detectors to find the LFs that are no longer valid
  • Other ways to convert discrete LFs to continuous LFs
  • Using coverage as an OOD predictor
    • Coverage drops significantly on OOD data
  • Experiments on different shift scenarios & domains
    • Only IMDB is used as the source distribution
    • Applying to other NLP tasks
    • Applying to other domains such as vision

OOD                  OOD accuracy (coverage)
Yelp                 0.78 (0.57)
Amazon-baby          0.78 (0.41)
Amazon-electronics   0.75 (0.42)
Amazon-jewelry       0.86 (0.39)
Amazon-home          0.80 (0.39)
Amazon-sports        0.79 (0.33)
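The coverage idea could be prototyped as follows. This is a speculative sketch of the future-work item, not something evaluated in the slides: coverage is assumed to mean the fraction of inputs on which at least one discrete LF does not abstain, and the label values, the function `coverage_drop_alarm`, and its `max_drop` tolerance are hypothetical.

```python
# Hypothetical label values for discrete LF votes.
POSITIVE, NEGATIVE, ABSTAIN = 1, 0, -1


def coverage(label_matrix):
    """Fraction of rows (inputs) with at least one non-abstaining LF vote."""
    if not label_matrix:
        return 0.0
    covered = sum(any(v != ABSTAIN for v in row) for row in label_matrix)
    return covered / len(label_matrix)


def coverage_drop_alarm(reference_coverage, batch_matrix, max_drop=0.2):
    """Flag a batch whose coverage falls more than max_drop below the reference."""
    return reference_coverage - coverage(batch_matrix) > max_drop


# Rows are inputs, columns are LFs.
reference = [
    [POSITIVE, ABSTAIN],
    [ABSTAIN, NEGATIVE],
    [ABSTAIN, ABSTAIN],
    [POSITIVE, POSITIVE],
    [NEGATIVE, ABSTAIN],
]
shifted_batch = [
    [ABSTAIN, ABSTAIN],
    [ABSTAIN, ABSTAIN],
    [POSITIVE, ABSTAIN],
    [ABSTAIN, ABSTAIN],
]

reference_coverage = coverage(reference)
alarm = coverage_drop_alarm(reference_coverage, shifted_batch)
print(reference_coverage, coverage(shifted_batch), alarm)
```

Such a check needs only the discrete LF votes, so it could run alongside the density-based detector at almost no extra cost.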