Learning, evaluating and explaining (XAI) transferable text representations from grumpy Big Small Data.
Nils Rethmeier*+, Prof. Isabelle Augenstein*, Vageesh Saxena+
NLP with Friends • December 2, 2020
XAI: explain & interpret, as defined by [1]
Decision understanding (mostly explain; model = black-box)
Model understanding (mostly interpret; grey-box methods)
[1] Visual Interaction with Deep Learning Models through Collaborative Semantic Inference, 2019, arx:1907.10739
[2] Learning Comment Controversy Prediction in Web Discussions Using Incidentally Supervised Multi-Task CNNs, 2018, acl:W18-6246
[3] TX-Ray: Quantifying and Explaining Model-Knowledge Transfer in (Un-)Supervised NLP, 2020, arx:1912.00982
[4] EffiCare: Better Prognostic Models via Resource-Efficient Health Embeddings, 2020, medarx:10.1101/2020.07.21.20157610v1
Legend: [n] = shown today; [n] = shown, but not ours
[Figure: example input "Do antidepressants correct the sleep disturbances in these diseases? …" with topic label "medicine"; highlighted tokens such as "disease", "vomit", "fever"; applies to text and images; methods from [5] and [3]]
[3] Transfer XAI / measurement
Why: limits of supervised probes / decision-understanding XAI
What: model transfer understanding: quantify neuron transfer
How: adapt self-supervised interpretability [5] for NLP [3]
[3] TX-Ray: Quantifying and Explaining Model-Knowledge Transfer in (Un-)Supervised NLP, 2020, arx:1912.00982
[5] Visualizing Higher-Layer Features of a Deep Network, 2009, http://www.iro.umontreal.ca/~lisa/publications2/index.php/publications/show/247
Activation maximization, 2 methods:
(1) In vision: generate the input that maximally activates a neuron, because we do not know the visual primitives (objects, material, color etc.)
(2) In NLP: record the inputs (features) that maximally activate a neuron
→ Neuron := preference distribution over input tokens/words

Example corpus:
I like cake better than cookies.
The learning-cake has cherries inside.
Supervision is the cake's frosting.
Self-supervision is the cake's corpus.
The cake is a lie.

Per token, record only the maximally activated (preferred) neuron; pk = token activation probability, collected while training.
[Figure: neuron 1 prefers sweets ("cake", "cookies"); neuron 6 prefers stop words ("I", "than")]
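A minimal Python sketch of this recording step (not the released TX-Ray code; it assumes a hypothetical `encoder` returning per-token activations of shape (seq_len, num_neurons) and an `id2word` vocabulary mapping):

```python
import torch
from collections import Counter, defaultdict

def record_token_preferences(encoder, token_ids, id2word):
    """Activation maximization for NLP: per token, record only the
    maximally activated (preferred) neuron, then normalize the counts
    into each neuron's token-preference distribution pk."""
    neuron_counts = defaultdict(Counter)           # neuron id -> token counts
    with torch.no_grad():
        acts = encoder(token_ids)                  # assumed shape: (seq_len, num_neurons)
    winners = acts.argmax(dim=-1)                  # preferred neuron per token
    for tok, neuron in zip(token_ids.tolist(), winners.tolist()):
        neuron_counts[neuron][id2word[tok]] += 1
    return {n: {w: c / sum(cnt.values()) for w, c in cnt.items()}
            for n, cnt in neuron_counts.items()}
```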
RQ1: pretraining 'builds' knowledge
How does each neuron abstract/transfer knowledge during
RQ1: pretraining, RQ2: zero-shot application, RQ3: supervised fine-tuning?

Token-preference distributions visualize the knowledge abstraction of each neuron.
[Figure, stage (1) pre-training: pretrain encoder E on the corpus Xpre; each neuron nn is summarized as a token-preference distribution pk over features f1 … fk]

How neuron preference changes during the (50) pre-training epochs, epoch 1 vs. 48 or 49 (the loss converges at epoch 49):
POS word-cluster activations (normalized to sum to 1); POS clusters are used because otherwise the x-axis has 100k tokens → unreadable
unique words: average activation (not normalized)
POS tags via the FLAIR tagger on WikiText-2 (2M tokens) → noisy tags
little change (same word + POS) → high transfer; large change → little transfer

Post-hoc reasoning (not in paper): self-supervision as next-word prediction → messy feature preferences. We often only care about transfer to the supervised task (RQ3).

Uncomfortable question: are learning objectives that produce intelligible representations better, and would more self-supervised evaluation help?

Insight: LSTM LM pre-training learns POS during the early epochs (measured via the FLAIR tagger), as also shown in 2018 by [6].
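A small sketch of the POS-cluster view, assuming the preference distributions from the previous sketch and an illustrative `word2pos` dict of (noisy) FLAIR POS tags: activations are summed per POS cluster and renormalized, shrinking the x-axis from ~100k tokens to a handful of clusters.

```python
from collections import defaultdict

def pos_cluster_view(preference, word2pos):
    """Collapse one neuron's token-preference distribution into POS clusters,
    normalized to sum to 1 (a 100k-token x-axis would be unreadable)."""
    clusters = defaultdict(float)
    for word, p in preference.items():
        clusters[word2pos.get(word, "UNK")] += p
    total = sum(clusters.values()) or 1.0
    return {pos: p / total for pos, p in clusters.items()}
```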
RQ2: apply knowledge to new inputs
How does each neuron abstract/transfer knowledge during
RQ1: pretraining, RQ2: zero-shot application, RQ3: supervised fine-tuning?

Feed new text Xend to the frozen pre-trained encoder from (1).
[Figure, stage (2) zero-shot: evaluate E on the new corpus Xend; fk := token / POS; each neuron gets a zero-shot preference distribution]

The Hellinger distance H(·,·) between the pre-training and zero-shot preference distributions measures change/transfer; H acts as a symmetric alternative to the Kullback-Leibler divergence.
A neuron transfers if its zero-shot preference distribution shows small change vs. pre-training; no neuron transfer if it shows large change.
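A minimal sketch of the transfer measure, assuming both preference distributions are dicts over token strings (tokens missing from one distribution get probability 0); H = 0 means identical preferences (full transfer), H = 1 means disjoint support (no transfer):

```python
import math

def hellinger(p, q):
    """Hellinger distance H(p, q) between two token-preference distributions:
    H = sqrt(0.5 * sum_w (sqrt(p_w) - sqrt(q_w))^2), bounded in [0, 1]."""
    vocab = set(p) | set(q)
    return math.sqrt(0.5 * sum(
        (math.sqrt(p.get(w, 0.0)) - math.sqrt(q.get(w, 0.0))) ** 2 for w in vocab))
```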
RQ3: transfer to + from supervision, and 'forgetting' due to supervision
How does each neuron abstract/transfer knowledge during
RQ1: pretraining, RQ2: zero-shot application, RQ3: supervised fine-tuning?

Pretrain → supervise: feed the new text Xend to the pre-trained encoder from (1), then fine-tune the encoder with labels (supervised) only.
[Figure, stage (3) supervised: Eend := E fit on labels Yend, evaluated on Xend; pk := token probability, e.g. p1 = .05]

Pruning effects on supervised performance: prune neurons that, post supervision, are no longer preferred, i.e. neurons without a max activation post supervision.
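A hedged sketch of this pruning criterion (the paper's exact procedure may differ): after supervised fine-tuning, keep only neurons that win the per-token argmax at least once, and zero out the rest.

```python
import torch

def preferred_neuron_mask(acts):
    """acts: (num_tokens, num_neurons) activations of the fine-tuned encoder
    on Xend. Returns a float mask keeping neurons that are preferred
    (maximally activated) by at least one token post supervision."""
    winners = acts.argmax(dim=-1)                         # preferred neuron per token
    keep = torch.zeros(acts.shape[-1], dtype=torch.bool)
    keep[winners.unique()] = True                         # preferred at least once
    return keep.float()

# usage sketch: pruned_outputs = encoder(x) * preferred_neuron_mask(acts)
```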
TX-Ray take aways
can explore generalization and specialization at neuron-level
23
[3] TX-Ray: Quantifying and Explaining Model-Knowledge
Transfer in (Un-)Supervised NLP, 2020, arx:1912.00982
!
TX-Ray take aways
can explore generalization and specialization at neuron-level
24
[3] TX-Ray: Quantifying and Explaining Model-Knowledge
Transfer in (Un-)Supervised NLP, 2020, arx:1912.00982
!
TX-Ray take aways
can explore generalization and specialization at neuron-level
25
[3] TX-Ray: Quantifying and Explaining Model-Knowledge
Transfer in (Un-)Supervised NLP, 2020, arx:1912.00982
!
XAI: further (newer) reading
XAI conceptualizing neurons / NN components:
[7] Understanding the role of individual units in a deep neural network, PNAS, 2020, arx:2009.05041 (newer)
[8] Activation Atlas, 2019, distill.pub/2019/activation-atlas/ - gave the idea that neurons should be explorable (but only worked for supervised learning). TX-Ray explores unsupervised representations → removes the probing-task limitation and delays expectation bias.
Pruning:
[9] Movement Pruning: Adaptive Sparsity by Fine-Tuning, 2020, arx:2005.07683 (newer) - prunes weights that change during training/transfer in NLP, no XAI
Open questions / caveats
Pruning:
pruning always hurts (MSc thesis: Saxena)
So, is pruning only for overparameterized models?
XAI:
TX-Ray in patient care prediction [4]
Evaluate the preferred data source and model component for a task.
Task: will the patient die in 24h?
Limited training data → 1. low-resource pretraining + 2. XAI, since no XL pretrained model is available.
[Figure: neuron activations; 1-hour-bucket pooling in the low layer, pooling over hour buckets in the higher layers]
Insights:
pooling over hour buckets matters
pooling matters most
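A minimal sketch of the two pooling stages as read from the figure (a reconstruction under assumptions, not the EffiCare architecture): clinical-event embeddings are max-pooled within each 1-hour bucket at a low layer, then pooled across buckets at a higher layer.

```python
import torch

def pool_patient_events(event_embs, hour_ids, num_hours):
    """event_embs: (num_events, dim) embeddings of clinical events;
    hour_ids: (num_events,) 1-hour bucket index of each event."""
    dim = event_embs.shape[-1]
    buckets = torch.zeros(num_hours, dim)          # empty hours stay zero vectors
    for h in range(num_hours):                     # low layer: pool within each hour bucket
        mask = hour_ids == h
        if mask.any():
            buckets[h] = event_embs[mask].max(dim=0).values
    return buckets.max(dim=0).values               # higher layer: pool over hour buckets
```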
Evaluation: approaches, questions
Supervised: "catch them all" supervised benchmarks
Problem:
[Figure from [10]: base embeddings vs. MoRTy, overall better and best per task]
[10] MoRTy: Unsupervised Learning of Task-specialized Word Embeddings by Autoencoding, 2020, acl:W19-4307, github.com/NilsRethmeier/MoRTy
[11] Designing and Interpreting Probes with Control Tasks, 2019, acl:D19-1275
Evaluation: approaches, questions
Supervised evaluation via probe-design problems, continued
Nagging question 2: how much do probe benchmarks tell us then?
Nagging question 3: are new pretraining models better, or do they just use more data? A lack of controlled experiments!
1990-2010: learning is easy when you have enough (supervised) data
2018-20xx: learning is easy when you have web-scale (self-supervised) data
[11] Designing and Interpreting Probes with Control Tasks, 2019, acl:D19-1275
[12] The Hardware Lottery, 2020, arx:2009.06489
[13] Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping, 2020
[14] A Survey on Contextual Embeddings, 2020
[15] How can we accelerate progress towards human-like linguistic generalization?, 2020, acl:2020.acl-main.465
Learning from small data
Evaluation idea: addressing [16-20]
Problem: how to learn a text encoder at small scale,
using small pre-training data [16, 18-21]?
Considering 1+2: we chose CNNs
[16] Learning and Evaluating General Linguistic Intelligence, 2019, arx:1901.11373
[17] How can we accelerate progress towards human-like linguistic generalization?, 2020, acl:2020.acl-main.465
[18] Mogrifier LSTM, 2020
[19] Transformer on a Diet, 2020
[20] Character-Aware Neural Language Models, 2016, arx:1508.06615
[21] Pointer Sentinel Mixture Models, 2016
[22] Multi-Task Label Embedding for Text Classification, 2018
[23] X-BERT: eXtreme Multi-label Text Classification with BERT
[24] Long-Tail Zero and Few-Shot Learning via Contrastive Pretraining on and for Small Data, 2020, arx:2010.01061
Pre-training from small data
Self-supervised CNN pre-training (e.g. as an SSL transformer alternative), as sketched in code below:
1. feed the words ('measuring..') plus positive and negative labels through the embedding layer
2. feed the word embeddings through the text encoder NN → text encoding T
3. and the label(-word) embeddings through the label encoder NN → label encodings L
4. concatenate each label encoding L with a copy of the text encoding T: Tk = repeat(T, |L1 … Lk|)
5. match the label encodings L to the text copies Tk: matcher = a learned similarity function with sim ∈ [0, 1]
6. train with binary cross-entropy over k labels × batch size (k = number of positive + negative labels)
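A hedged, self-contained sketch of steps 1-6 (a reconstruction under assumptions, not the released code of [24]; the layer sizes and the matcher MLP are illustrative):

```python
import torch
import torch.nn as nn

class LabelTextMatcher(nn.Module):
    """Steps 1-6: embed words and label words, encode text (CNN) and labels,
    copy the text encoding per label, score each (text, label) pair with a
    learned similarity, and train with binary cross-entropy."""
    def __init__(self, vocab_size, dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)                     # (1) embedding layer
        self.text_enc = nn.Sequential(                               # (2) CNN text encoder
            nn.Conv1d(dim, dim, kernel_size=3, padding=1),
            nn.ReLU(), nn.AdaptiveMaxPool1d(1))
        self.label_enc = nn.Linear(dim, dim)                         # (3) label encoder
        self.matcher = nn.Sequential(                                # (5) learned similarity
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, text_ids, label_ids):
        # text_ids: (seq_len,); label_ids: (k,) positive + negative (pseudo-)labels
        T = self.text_enc(self.emb(text_ids).T.unsqueeze(0)).squeeze()  # text encoding T
        L = self.label_enc(self.emb(label_ids))                         # label encodings L
        Tk = T.unsqueeze(0).repeat(len(label_ids), 1)                   # (4) Tk = repeat(T, k)
        return self.matcher(torch.cat([Tk, L], dim=-1)).squeeze(-1)     # (k,) match logits

# (6) loss = nn.BCEWithLogitsLoss()(model(text_ids, label_ids), targets)  # 1 = pos, 0 = neg
```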
Pre-training from small data
Supervised and self-supervised training objectives
Detail: we use a CNN and deeper encoders than [25].
Pre-training from small data
Pseudo-label embedding NCE (PLENCE) properties:
[25] Learning Semantic Representations for Novel Words: Leveraging Both Form and Context, 2018
Pre-training from small data
Evaluate zero-shot (ZS), few-shot and long-tail learning via PLENCE:
[Figure: allowing less pre-training text (75% … 10%); using a larger net + 3.3x self-supervision pseudo-labels]
Insight: when large external data (signal) is unavailable, increase the self-supervision; this works despite the long-tail problem worsening in the few-shot setting.
Pre-training from small data
Evaluate zero-shot, few-shot (FS) and long-tail learning via PLENCE:
[Figure: learning curves with self-supervised pretraining (PLENCE) are stable; without self-supervised pretraining they show pseudo-plateaus]
Insight: pseudo-label SSL pretraining boosts data efficiency, learning speed, learning stability and few-shot end performance, even on 10% of the data.
Pre-training from small data
Evaluate zero-shot (ZS), few-shot and long-tail learning via PLENCE:
Evaluating a long-tail: the y-axis shows class-label frequency in log scale → extreme imbalance, 1305 classes.
[Figure: supervised (SL) label embeddings (LE) vs. SSL-pretrained LE vs. XL SSL-pretrained LE with a larger net + 3.3x labels]
Insight: pseudo-label SSL pretraining boosts long-tail learning; a larger model + more SSL signal boosts the long tail further.
Summary
self-supervised XAI:
challenging, low-resource evaluation:
data-efficient learning:
→ evidence that instead of pretraining on large external data, we can simply boost the SSL learning signal on small data
FIN
XAI: further/ newer reading (backup)
[7] Understanding the role of individual units in a deep neural network, PNAS, 2020, arx:2009.05041
[Figures from [7]: an image-classification CNN; redundancy in representations]