Validating Labeling Functions
under Domain Shift
20223137 Yewon Kim
20224560 Seungjoo Lee
Presentation template adapted from SlidesCarnival.
To obtain more labeled training data,
weak supervision leverages cheaper but noisier labels
Get cheaper labels
from non-experts
e.g., crowdsourcing
Get higher-level supervision from experts
e.g., labeling functions
Get pseudo-labels from pre-trained models
e.g., knowledge distillation
A labeling function (LF) is a lightweight, cost-effective way to generate labels for unlabeled data.
Source: https://snorkel.ai/weak-supervision-modeling/
However, what if the data distribution shifts?
[Diagram: a data stream arriving over time; on the developer side, LFs such as:]
^(?=.*\bkid\b)(?=.*\blike).*$
Domain: toy
review: "My kid likes it"
⇒ positive
Example scenario:
sentiment analysis from review data
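The toy-domain LF on this slide is just a lookahead regex. A runnable sketch (the function name and labels are hypothetical; the trailing \b after "like" is dropped here so inflections such as "likes" also match the slide's example review):

```python
import re

# Requires "kid" as a whole word and any word starting with "like".
PATTERN = re.compile(r"^(?=.*\bkid\b)(?=.*\blike).*$", re.IGNORECASE)

def toy_domain_lf(review):
    """Vote 'positive' if the review mentions both 'kid' and 'like*'."""
    return "positive" if PATTERN.search(review) else "abstain"

print(toy_domain_lf("My kid likes it"))    # matches -> "positive"
print(toy_domain_lf("A true love story"))  # no match -> "abstain"
```

On a book review the LF silently abstains: exactly the failure mode that makes shift detection necessary.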
Domain: book
review: "A true love story"
⇒ ???
LFs are no longer valid;
need to update!
Detecting data shift accurately and promptly, and prompting engineers to update the LFs,
is crucial for ensuring reliable performance of the end model!
Our idea: Use the outputs from LFs to detect domain shift!
Method: (1) Changing discrete LFs to continuous LFs
Example of a discrete LF for NLP sentiment analysis: a keyword-based heuristic function
@labeling_function()
def positive_keyword_lf(text, keyword):
    # Vote POSITIVE if the keyword appears; otherwise abstain.
    if keyword in text.lower():
        return POSITIVE
    return ABSTAIN
t-SNE result, 8 discrete LFs
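Applied over a corpus, a set of discrete LFs like the one above yields a label matrix with one column per LF. A minimal, library-free sketch (all names here are hypothetical; the slides use Snorkel's @labeling_function decorator instead):

```python
# Label constants are an assumption; Snorkel typically uses ints with -1 = abstain.
POSITIVE, NEGATIVE, ABSTAIN = 1, 0, -1

def make_keyword_lf(keyword, label):
    """Build a discrete LF that votes `label` when `keyword` appears."""
    def lf(text):
        return label if keyword in text.lower() else ABSTAIN
    return lf

lfs = [make_keyword_lf("enjoy", POSITIVE), make_keyword_lf("boring", NEGATIVE)]

reviews = ["I really enjoy this toy", "A boring book"]
label_matrix = [[lf(r) for lf in lfs] for r in reviews]
print(label_matrix)  # [[1, -1], [-1, 0]]
```

Each row of the matrix is the LF-output vector for one example; these discrete vectors are what the t-SNE plot visualizes.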
Example of a continuous LF: using the cosine similarity of GloVe word embeddings
@labeling_function()
def positive_keyword_lf(text, keyword):
    # Score by embedding similarity instead of exact keyword matching.
    text_emb = glove(text.lower().split())
    keyword_emb = glove([keyword])
    return get_cosine_similarity(text_emb, keyword_emb)
t-SNE result, 8 continuous LFs
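A self-contained sketch of the continuous-LF idea, with a tiny hand-made embedding table standing in for GloVe (the table, tokenizer, and helper names are all assumptions for illustration):

```python
import numpy as np

# Toy word vectors standing in for GloVe embeddings.
EMB = {
    "love": np.array([0.9, 0.1]),
    "like": np.array([0.8, 0.2]),
    "story": np.array([0.1, 0.9]),
    "kid": np.array([0.2, 0.8]),
}

def embed(tokens):
    """Average the embeddings of the known tokens."""
    vecs = [EMB[t] for t in tokens if t in EMB]
    return np.mean(vecs, axis=0) if vecs else np.zeros(2)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def continuous_lf(text, keyword="like"):
    """Continuous LF: cosine similarity between the review and a keyword."""
    return cosine(embed(text.lower().split()), embed([keyword]))

print(continuous_lf("A true love story"))  # graded score, not a hard vote
```

Unlike the discrete version, this LF never abstains outright: even a book review with no "like" keyword gets a graded score, which gives the density estimator a richer signal to work with.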
Method: (2) Kernel density estimation
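A minimal sketch of the density-estimation step, assuming each example's continuous-LF outputs form one score vector (the data below is synthetic; the slides use 8 LFs and tune the bandwidth h separately, while this sketch keeps scipy's default Scott's-rule bandwidth):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# Hypothetical continuous-LF outputs: one row per training example,
# one column per LF (8 LFs, as in the slides); values are synthetic.
train_scores = rng.normal(0.6, 0.1, size=(500, 8))

# Fit a KDE on the in-distribution LF outputs; scipy expects shape (d, n).
kde = gaussian_kde(train_scores.T)

def batch_score(batch):
    """Mean log-density of a batch under the ID KDE; lower = more OOD-like."""
    return float(np.log(kde(batch.T) + 1e-12).mean())

id_batch = rng.normal(0.6, 0.1, size=(16, 8))   # looks in-distribution
ood_batch = rng.normal(0.2, 0.1, size=(16, 8))  # shifted LF outputs

print(batch_score(id_batch) > batch_score(ood_batch))  # True
```

The KDE is fit once on the training-time LF outputs; at test time, a batch whose LF outputs fall in low-density regions gets a low score.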
Overall pipeline: training phase
Unlabeled training data
⋮
Discrete LFs
⋮
Continuous LFs
Kernel density estimation
Overall pipeline: testing phase
Unlabeled test data
⋮
Discrete LFs
Labeled data
⋮
Continuous LFs
Kernel density estimation
OOD detection
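Once the KDE is fitted in the training phase, test-time OOD detection reduces to thresholding the batch score. A sketch with synthetic scores (the 5% false-alarm budget is an assumption, not a value from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-batch log-density scores of held-out ID batches,
# as produced by the KDE from the training phase (values are synthetic).
id_scores = rng.normal(-5.0, 1.0, size=1000)

# Flag a test batch as OOD when its score falls below a low percentile
# of the ID scores, i.e., accept ~5% false alarms on ID data.
threshold = np.percentile(id_scores, 5)

def is_ood(batch_score):
    return bool(batch_score < threshold)

print(is_ood(-20.0), is_ood(-5.0))  # True False
```

When a batch is flagged, the developer is prompted to revisit the LFs for the new domain.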
Evaluation setup: task and datasets
In-distribution (ID) | Out-of-distribution (OOD)
IMDB | Yelp
 | Amazon-baby
 | Amazon-electronics
 | Amazon-jewelry
 | Amazon-home
 | Amazon-sports
Sentiment analysis task (binary classification); we used IMDB [1], Yelp [2], and Amazon reviews [3].
[1] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).
[2] Xiang Zhang, Junbo Zhao, and Yann LeCun. (2015). Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).
[3] R. He and J. McAuley. (2016). Ups and Downs: Modeling the Visual Evolution of Fashion Trends with One-Class Collaborative Filtering. The 25th International Conference on World Wide Web (WWW 2016).
Evaluation setup: LF development
[1] https://docs.argilla.io/en/latest/
Label | Keywords
Positive | impress, adorable, enjoy, excellent, beautiful, wonderful, recommend, best, masterpiece, performance * best, performance * good
Negative | terrible, poor, stupid, wrong, disappoint, painful, awful, boring, worse, worst, bad, cliche, killer, unnecessary, waste, least try, nothing * special, nothing * even, performance * worst, acting * bad
Results: OOD Detection
KDE bandwidth h = 0.05, batch size fixed at 16
ID | OOD | AUROC | OOD Accuracy (Coverage) | ID Accuracy (Coverage)
IMDB | Yelp | 0.93 | 0.78 (0.57) | 0.74 (0.82)
 | Amazon-baby | 0.96 | 0.78 (0.41) |
 | Amazon-electronics | 0.95 | 0.75 (0.42) |
 | Amazon-jewelry | 1.00 | 0.86 (0.39) |
 | Amazon-home | 0.98 | 0.80 (0.39) |
 | Amazon-sports | 0.99 | 0.79 (0.33) |
Batch-AUROC Tradeoff
Batch size = 1:  AUC = 0.68
Batch size = 8:  AUC = 0.89
Batch size = 16: AUC = 0.96
Batch size = 32: AUC = 0.99
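The tradeoff above can be reproduced in miniature: averaging k per-example scores shrinks the score variance by 1/k, which separates the ID and OOD score distributions and pushes AUROC up. A sketch with synthetic score distributions (the means and spread are assumptions, not the experimental values):

```python
import numpy as np

def auroc(id_scores, ood_scores):
    """Rank-based AUROC (Mann-Whitney): probability that a random ID
    batch scores higher than a random OOD batch."""
    id_s = np.asarray(id_scores)[:, None]
    ood_s = np.asarray(ood_scores)[None, :]
    return float((id_s > ood_s).mean() + 0.5 * (id_s == ood_s).mean())

rng = np.random.default_rng(0)
aucs = []
for k in (1, 8, 16, 32):
    # Batch score = mean of k per-example scores; variance shrinks as 1/k.
    id_b = rng.normal(0.5, 1.0, size=(2000, k)).mean(axis=1)
    ood_b = rng.normal(0.0, 1.0, size=(2000, k)).mean(axis=1)
    aucs.append(auroc(id_b, ood_b))
    print(k, round(aucs[-1], 2))
```

The cost of larger batches is detection latency: more examples must arrive before a shift can be flagged.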
Discussion & Future work
OOD | OOD Accuracy (Coverage)
Yelp | 0.78 (0.57)
Amazon-baby | 0.78 (0.41)
Amazon-electronics | 0.75 (0.42)
Amazon-jewelry | 0.86 (0.39)
Amazon-home | 0.80 (0.39)
Amazon-sports | 0.79 (0.33)