
Learning, evaluating and explaining (XAI) transferable text representations from grumpy Big Small Data.

Nils Rethmeier*+, Prof. Isabelle Augenstein*, Vageesh Saxena+

NLP with Friends • December 2, 2020


XAI: explain & interpret, as defined by [1]

Decision understanding (mostly "explain"; model = black box)

  • explains input-to-output/task relevance
  • needs supervised labels
    • incidental labels also work [2]
  • often explains one instance at a time
    • gives no idea what the model as a whole does

Model understanding (mostly "interpret"; grey-box methods)

  • interprets neuron activity
  • interprets over an entire corpus
  • independent of outputs (unsupervised)
  • prune unused neurons [3]
  • find redundant neurons [4]

(Figure: LRP text-classification demo, src:lrpserver.hhi.fraunhofer.de/text-classification. Input: "Do antidepressants correct the sleep disturbances in these diseases? …"; predicted topic label: medicine, with relevance highlighted on words such as "disease", "vomit", "fever". Related explanation examples exist for text [3] and images [5].)

Legend: [n] = shown today, [n] = shown, not ours.

[1] Visual Interaction with Deep Learning Models through Collaborative Semantic Inference, 2019, arx:1907.10739
[2] Learning Comment Controversy Prediction in Web Discussions Using Incidentally Supervised Multi-Task CNNs, 2018, acl:W18-6246
[3] TX-Ray: Quantifying and Explaining Model-Knowledge Transfer in (Un-)Supervised NLP, 2020, arx:1912.00982
[4] EffiCare: Better Prognostic Models via Resource-Efficient Health Embeddings, 2020, AMIA Symposium, medarx:10.1101/2020.07.21.20157610v1

[3]: transfer XAI / measurement

Why: limits of supervised probes/decision-understanding XAI

  • needs labels - evaluates expected semantics only
    • unforeseen insights are not discoverable
  • the probe can mismatch the (pre-)training domain

What: model transfer understanding - quantify neuron transfer

  • RQ1: self-supervised 'knowledge' abstraction (pre-training)
  • RQ2: zero-shot 'knowledge' transfer to a new domain
  • RQ3: transfer during supervised adaptation (fine-tuning)

How: adapt self-supervised interpretability [5] for NLP [3]

  • visualize which inputs (features) each neuron prefers (maximally activates on) - activation maximization, originally from RBMs [5]

[5] Visualizing Higher-Layer Features of a Deep Network, 2009, http://www.iro.umontreal.ca/~lisa/publications2/index.php/publications/show/247

Activation maximization, 2 methods

(1) In vision: generate the input that maximally activates a neuron, because we do not know the visual primitives (objects, material, color, etc.).

(2) In NLP: record the inputs (features) that maximally activate a neuron.

→ Neuron := preference distribution over input tokens/words

Per token, record only the maximally activated (preferred) neuron; pk = token activation probability, recorded while training. (A recording sketch follows below.)

(Figure: example sentences - "I like cake better than cookies.", "The learning-cake has cherries inside.", "Supervision is the cake's frosting.", "Self-supervision is the cake's corpus.", "The cake is a lie." Neuron 1 maximally activates on 'cake' and 'cookies' → this neuron prefers sweets; neuron 6 activates on 'I' and 'than' → this neuron prefers stop words.)
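A minimal sketch of method (2), recording per-token neuron preferences. It assumes a PyTorch encoder whose forward pass returns per-token hidden activations; the function and argument names (`record_neuron_preferences`, `id2tok`) are illustrative, not taken from the paper's code.

```python
# Sketch: TX-Ray-style activation maximization for NLP, method (2).
# Assumption: encoder(batch) returns activations of shape [1, seq_len, n_neurons].
from collections import Counter, defaultdict

import torch

def record_neuron_preferences(encoder, token_batches, id2tok):
    """Per token, record only the maximally activated (preferred) neuron,
    building one token-preference distribution (Counter) per neuron."""
    neuron_prefs = defaultdict(Counter)  # neuron index -> token counts
    encoder.eval()
    with torch.no_grad():
        for tokens in token_batches:                        # LongTensor [seq_len]
            acts = encoder(tokens.unsqueeze(0)).squeeze(0)  # [seq_len, n_neurons]
            winners = acts.argmax(dim=-1)                   # preferred neuron per token
            for tok_id, neuron in zip(tokens.tolist(), winners.tolist()):
                neuron_prefs[neuron][id2tok[tok_id]] += 1
    return neuron_prefs  # normalize each Counter to obtain the pk distributions
```

Normalizing each neuron's counter by its total count yields the token activation probabilities pk shown in the figure.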

RQ1: pretraining 'builds' knowledge

How does each neuron abstract/transfer knowledge during RQ1: pretraining, RQ2: zero-shot application, RQ3: supervised fine-tuning?

Token-preference distributions visualize the knowledge abstraction of each neuron.

(Figure, stage (1) pre-training: pretrain encoder E on X_pre; each neuron nn := token-preference distribution pk over input features f1 … fk.)

How neuron preference changes during pre-training (50 epochs): epoch 1 vs. 48 or 49 - the loss converges at epoch 49.

  • FLAIR POS tagger applied to Wikitext-2 (2M tokens) → noisy tags
  • POS word-cluster activation (normalized to sum to 1); unique words: average activation (not normalized)
  • POS clusters are used because otherwise the x-axis would have 100k tokens → unreadable (see the POS-cluster sketch below)

Observations:

  • large change epoch 1 vs. 48, no change epoch 48 vs. 49
  • neuron well initialized?!
  • very specific neurons (one word only)
  • little change (same word + POS) → high transfer; large change → little transfer

Post-hoc reasoning (not in the paper): self-supervision as next-word prediction → messy feature preferences.

We often only care about transfer to a supervised task, i.e. the representation has to transfer, not be intelligible.

Uncomfortable question: are learning objectives that produce intelligible representations better, and would more self-supervised evaluation help?!

LSTM language-model pre-training learns POS (via the FLAIR tagger) during early epochs - also shown in 2018 by [6].

[6] Language Models Learn POS First, 2018, acl:W18-5438
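To keep the x-axis readable, token preferences are aggregated into POS clusters. Here is a small sketch of that aggregation, under the assumption that a tagger (e.g. FLAIR) supplies a token-to-POS map; `tok2pos` is a hypothetical name.

```python
# Sketch: aggregate one neuron's token-preference counts into POS clusters
# and normalize so the cluster activations sum to 1.
from collections import Counter

def pos_cluster_activation(token_counts: Counter, tok2pos: dict) -> dict:
    clusters = Counter()
    for token, mass in token_counts.items():
        clusters[tok2pos.get(token, "UNK")] += mass  # untagged tokens collapse to UNK
    total = sum(clusters.values()) or 1.0
    return {pos: mass / total for pos, mass in clusters.items()}
```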

RQ2: apply knowledge to new inputs

How does each neuron abstract/transfer knowledge during RQ1: pretraining, RQ2: zero-shot application, RQ3: supervised fine-tuning?

Feed new text X_end to the frozen pre-trained encoder from stage (1).

(Figure, stage (2) zero-shot: evaluate E on new X_end; fk := token/POS; compare each neuron's zero-shot preference distribution against its pre-training one.)

The Hellinger distance H(·,·) measures change/transfer. Unlike the Kullback-Leibler divergence, H is symmetric (and bounded). A sketch follows below.

  • no neuron transfer if the zero-shot preference distribution changes strongly vs. pre-training
  • the neuron transfers if the zero-shot preference distribution changes little
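For concreteness, a sketch of the Hellinger distance between two neuron token-preference distributions; aligning the two distributions over the union of their tokens is an assumption about the implementation.

```python
# Hellinger distance H(p, q) = (1/sqrt(2)) * ||sqrt(p) - sqrt(q)||_2,
# symmetric and bounded in [0, 1]; 0 = identical preference distributions.
import numpy as np

def hellinger(p_counts: dict, q_counts: dict) -> float:
    vocab = sorted(set(p_counts) | set(q_counts))
    p = np.array([p_counts.get(t, 0.0) for t in vocab], dtype=float)
    q = np.array([q_counts.get(t, 0.0) for t in vocab], dtype=float)
    p, q = p / max(p.sum(), 1e-12), q / max(q.sum(), 1e-12)  # normalize counts
    return float(np.linalg.norm(np.sqrt(p) - np.sqrt(q)) / np.sqrt(2))
```

A small H between the pre-training and zero-shot distributions means the neuron transfers; a large H means it does not.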

RQ3: transfer to + from supervision ('forgetting' due to supervision)

How does each neuron abstract/transfer knowledge during RQ1: pretraining, RQ2: zero-shot application, RQ3: supervised fine-tuning?

Fine-tune the encoder from stage (1) with labels (supervised) only.

(Figure, stage (3) supervised: E_end := E fit on labels Y_end; evaluate E_end on X_end; pk := token probability, e.g. p1 = .05.)

Pretrain → supervise

  • pretrain a language model on Wikitext-2, stage (1)
  • fine-tune on IMDB binary review labels, stage (3)

Pruning effects on supervised performance: prune neurons that, post supervision,

  • are specialized (re-fit by supervision)
  • become preferred (activated)
  • are 'avoided' (now have an empty distribution)

i.e., neurons without any max activation post supervision - a sketch follows below.
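A hedged sketch of the 'avoided neuron' criterion; masking rows of an encoder weight matrix is one possible realization, not necessarily the paper's exact pruning mechanism, and the shapes are assumptions.

```python
# Sketch: prune neurons that win the per-token argmax for no token after
# fine-tuning, i.e. neurons whose post-supervision preference distribution
# is empty ('avoided' neurons).
import torch

def prune_avoided_neurons(weight: torch.Tensor, neuron_prefs: dict) -> torch.Tensor:
    """weight: [n_neurons, dim] encoder output weights (hypothetical shape);
    neuron_prefs: neuron index -> token Counter recorded post supervision."""
    keep = torch.zeros(weight.shape[0], dtype=torch.bool)
    for neuron, counts in neuron_prefs.items():
        if sum(counts.values()) > 0:  # neuron still maximally activates on something
            keep[neuron] = True
    return weight * keep.unsqueeze(1)  # zero out the avoided neurons
```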

TX-Ray takeaways

TX-Ray can explore generalization and specialization at the neuron level:

  • self-supervised pre-training builds general knowledge
    • preference is spread across many neurons
  • zero-shot application matches model knowledge against new data
    • preference is less spread -- partial generalization
  • supervised fine-tuning sparsifies (concentrates) activation
    • preference becomes peaked and concentrated -- domain over-specialization; supervision forces the model to forget pretraining

XAI: further (newer) reading

XAI conceptualizing neurons/NN components:

[7] Understanding the Role of Individual Units in a Deep Neural Network, PNAS, 2020, arx:2009.05041 (newer)

[8] Activation Atlas, 2019, distill.pub/2019/activation-atlas/ - gave the idea that neurons should be explorable (but only worked for supervised learning). TX-Ray explores unsupervised representations → removes the probing-task limitation and delays expectation bias.

Pruning:

[9] Movement Pruning: Adaptive Sparsity by Fine-Tuning, 2020 (newer), arx:2005.07683 - prunes neurons that change during training/transfer in NLP, no XAI

Open questions/caveats

Pruning:

  • is forced forgetting → I'd expect it to lose generalization beyond the test set
  • when predicting many classes with a small model, pruning always hurts (MSc thesis: Saxena)

So, is pruning only for overparameterized models?

XAI:

  • explainability is sensitive to supervised learning quality
  • TX-Ray (interpretability) + 'its' pruning are sensitive to self-supervised learning quality (see the early epochs above)
    • However, if sensitivity means worse abstractions, then could better abstractions indicate better self-supervised learning -- i.e. allow evaluation without probes?

TX-Ray in patient care prediction [4]

Evaluate the preferred data source and model component for a task.

Task: will the patient die in 24h? Limited training data → 1. low-resource pretraining + 2. XAI, since no XL pretrained model is available.

(Figure: neuron activation per data source and model component; 1-hour-bucket pooling in a low layer, pooling over hour buckets in higher layers.)

Insights:

  1. output-events data is not used
  2. pooling over hour buckets matters
  3. for pooling over time, the pooling matters most

[4] EffiCare: Better Prognostic Models via Resource-Efficient Health Embeddings, 2020, AMIA Symposium, medarx:10.1101/2020.07.21.20157610v1

Evaluation: approaches, questions

Supervised: "catch them all" supervised benchmarks

  • example: [10] - an 18-task word-embedding benchmark
  • a (complete) autoencoder retrofits word embeddings
    • better single-task + all-task performance
    • more improvement on smaller pre-training data

Problem:

  • the probe model (depth) limits performance [10, 11]

(Figure: benchmark table - base embeddings vs. retrofitted ones; retrofitting is overall better and best per task.)

[10] MoRTy: Unsupervised Learning of Task-specialized Word Embeddings by Autoencoding, 2019, acl:W19-4307, github.com/NilsRethmeier/MoRTy
[11] Designing and Interpreting Probes with Control Tasks, 2019, acl:D19-1275

Evaluation: approaches, questions

Supervised evaluation via probe design, problems continued:

  • probe models limit performance [11] -- deeper probes are better

Nagging question 2: how much do probe benchmarks tell us, then?

  • SotA scores depend on:
    • hardware budget [12] or seed hacking [13]
    • using ever more pre-training data [14, 15]

Nagging question 3: are new pretraining models better, or do they just use more data? -- a lack of controlled experiments!

1990-2010: learning is easy when you have enough (supervised) data

  • simple, small models will do

2018-20xx: learning is easy when you have web-scale (self-supervised) data

  • simple, XL models [14, 15]

[12] The Hardware Lottery, 2020, arx:2009.06489
[13] Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping, 2020
[14] A Survey on Contextual Embeddings, 2020
[15] How Can We Accelerate Progress Towards Human-like Linguistic Generalization?, 2020, acl:2020.acl-main.465

Learning from small data

Evaluation idea, addressing [16-20]:

  • evaluate small-data pre-training for few- and zero-shot learning
    • no large external pre-training
    • pre-train on a few GPUs (~1-4)
  • long-tailed (imbalanced) data
    • the tail is always few-shot to zero-shot, and the tail grows with data size

Problem: how to learn a text encoder at small scale?

  1. We need zero-shot + long-tail capabilities
    • e.g. label-embedding CNNs [22], where labels = word embeddings
    • the label-embedding BERT of [23] needs ELMo word embeddings for labels
  2. CNNs/LSTMs [20] outperform transformers when using small pre-training data [16, 18-21]

Considering 1+2: we chose CNNs.

[16] Learning and Evaluating General Linguistic Intelligence, 2019, arx:1901.11373
[17] How Can We Accelerate Progress Towards Human-like Linguistic Generalization?, 2020, acl:2020.acl-main.465
[18] Mogrifier LSTM, 2020
[19] Transformer on a Diet, 2020
[20] Character-Aware Neural Language Models, 2016, arx:1508.06615
[21] Pointer Sentinel Mixture Models, 2016
[22] Multi-Task Label Embedding for Text Classification, 2018
[23] X-BERT: eXtreme Multi-label Text Classification with BERT, 2019
[24] Long-Tail Zero and Few-Shot Learning via Contrastive Pretraining on and for Small Data, 2020, arx:2010.01061

Pre-training from small data

Self-supervised CNN pre-training (e.g. as an SSL transformer alternative):

  1. feed words ('measuring …') + positive labels + negative labels through the embedding layer
  2. feed word embeddings through the text encoder NN → text encoding T
  3. feed label(-word) embeddings through the label encoder NN → label encoding L
  4. concatenate each label encoding L with a copy of the text encoding T -- Tk = torch.repeat(T, |L1 … Lk|)
  5. match label encodings L to text copies Tk -- the matcher is a learned similarity function with sim = 0..1
    • the matcher is a single class that predicts k label-to-text 'similarities'
  6. train with binary cross-entropy over k labels * batch size (k = num positive + negative labels)
    • (a) labels = supervision labels, or (b) words as pseudo-labels (sampled from text in the batch)

A code sketch of steps 1-6 follows below.
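A minimal PyTorch sketch of steps 1-6. The module sizes, the mean pooling, and the class name `LabelTextMatcher` are illustrative assumptions, not the paper's architecture (which uses CNN encoders).

```python
# Sketch: label-embedding matcher trained with BCE over k candidate labels.
import torch
import torch.nn as nn

class LabelTextMatcher(nn.Module):
    def __init__(self, vocab_size: int, emb_dim: int = 128, hid: int = 256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)                        # step 1
        self.text_enc = nn.Sequential(nn.Linear(emb_dim, hid), nn.ReLU())   # step 2
        self.label_enc = nn.Sequential(nn.Linear(emb_dim, hid), nn.ReLU())  # step 3
        self.matcher = nn.Sequential(nn.Linear(2 * hid, hid), nn.ReLU(),
                                     nn.Linear(hid, 1))                     # step 5

    def forward(self, text_ids, label_ids):
        # text_ids: [seq_len]; label_ids: [k] positive + negative label words
        T = self.text_enc(self.emb(text_ids).mean(dim=0))   # text encoding T
        L = self.label_enc(self.emb(label_ids))             # k label encodings
        Tk = T.unsqueeze(0).repeat(L.shape[0], 1)           # step 4: copy T per label
        return self.matcher(torch.cat([Tk, L], dim=-1)).squeeze(-1)  # k logits

# step 6: binary cross-entropy, target 1 for positives and 0 for negatives
loss_fn = nn.BCEWithLogitsLoss()
```

In self-supervised mode (b), `label_ids` are simply words sampled from the batch's texts as positive/negative pseudo-labels.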

Pre-training from small data

Supervised and self-supervised training objectives:

  • noise contrastive estimation (NCE)
  • NCE is BCE with 1 positive and k negative samples
  • we use g positives + b negatives: N = g + b
  • previous work used supervised labels g, and optionally b [22]
  • we add positive (w+) and negative (w-) input words as pseudo-labels
    • N = g + b for the supervised learning mode
    • N = w+ + w- for self-supervised learning
      • i.e. contrastive partial autoencoding
  • detail: we use a CNN and deeper encoders than [22]

A worked form of the objective follows below.
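A worked form of the objective, assuming s_i denotes the matcher logit for candidate label i (notation added here for illustration):

```latex
% Binary cross-entropy over N = g + b sampled labels (sketch):
% y_i = 1 for the g positives (labels or pseudo-label words w+),
% y_i = 0 for the b negatives (labels or sampled words w-).
\mathcal{L} \;=\; -\frac{1}{N} \sum_{i=1}^{N}
  \Big[\, y_i \log \sigma(s_i) \;+\; (1 - y_i)\, \log\big(1 - \sigma(s_i)\big) \Big]
```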

Pre-training from small data

Pseudo-label-embedding NCE (PLENCE) properties:

  • self-supervised (SSL) pretraining → allows SSL zero-shot learning
  • NCE allows infinite label sets and vocabulary sets
    • no softmax → one less bottleneck
    • 'text-to-text' becomes 'embedding-to-embedding'
      • the network does not have to translate to anonymous {0,1} labels
      • labels now have meaning (semantics) and are words (like for humans)
      • we pretrain word embeddings, ergo, labels are also pretrained :)
    • unknown labels can be inferred via fastText or Attentive Mimicking [25]

[25] Learning Semantic Representations for Novel Words: Leveraging Both Form and Context, 2018

Pre-training from small data

Evaluate zero-shot (ZS), few-shot + long-tail learning via PLENCE:

Allowing less pre-training text (75% … 10%):

  • takes longer to train
  • but reaches comparable ZS performance

3.3x self-supervision pseudo-labels:

  • better/faster ZS performance

Using a larger net + more pseudo-labels:

  • much better ZS performance

Insight: when large external data (signal) is unavailable, increase self-supervision. This works despite the long-tail problem worsening with few-shot.

Pre-training from small data

Evaluate zero-shot, few-shot (FS) + long-tail learning via PLENCE:

With self-supervised pretraining (PLENCE):

  • FS performance is much better
  • trains fast, without pseudo-plateaus, with stable learning curves
  • epoch-1 performance is much higher -- practical

Without self-supervised pretraining:

  • few-shot performance is bad
  • learning takes long (200 epochs)
  • FS learning pseudo-plateaus

Insight: pseudo-label SSL pretraining boosts data efficiency, learning speed, learning stability and few-shot end performance -- even on 10% of the data.

Pre-training from small data

Evaluate zero-shot (ZS), few-shot + long-tail learning via PLENCE:

Evaluating a long-tail:

  • 5 class-frequency-balanced buckets (a bucketing sketch follows below)
    • i.e. each bucket has the same amount of positive (real) labels
  • tail buckets have many more labels
    • classes per bucket are still very imbalanced

(Figure: y-axis = class label frequency, log scale -- extreme imbalance, 1305 classes.)
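A sketch of the bucketing, under the stated assumption that each bucket should carry the same positive-label mass; `class_freqs` (class → positive-label count) is a hypothetical input name.

```python
# Sketch: split classes (sorted head to tail) into 5 buckets of roughly
# equal positive-label mass; tail buckets end up with many more classes.
def frequency_balanced_buckets(class_freqs: dict, n_buckets: int = 5):
    classes = sorted(class_freqs, key=class_freqs.get, reverse=True)
    target = sum(class_freqs.values()) / n_buckets
    buckets, current, mass = [], [], 0
    for c in classes:
        current.append(c)
        mass += class_freqs[c]
        if mass >= target and len(buckets) < n_buckets - 1:
            buckets.append(current)
            current, mass = [], 0
    buckets.append(current)  # final (tail) bucket: many rare classes
    return buckets
```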

Pre-training from small data

Evaluate zero-shot (ZS), few-shot + long-tail learning via PLENCE:

Supervised (SL) label embeddings (LE):

  • better performance than [0,1]-BCE, especially on the long tail

SSL-pretrained LE:

  • slightly better yet

XL SSL-pretrained LE with a larger net + 3.3x labels:

  • strong long-tail performance boost

Pre-training from small data

Evaluate zero-shot (ZS), few-shot + long-tail learning via PLENCE:

SSL-pretrained label embeddings:

  • learn the long tail from epoch 1

XL SSL-pretrained LE:

  • learns the long tail even faster
  • needs 3x fewer epochs than the smaller SSL-pretrained model

Insight: pseudo-label SSL pretraining boosts long-tail learning. A larger model + more SSL signal boost the long tail further.

Summary

Self-supervised XAI:

  • for self-supervised (transfer learning) evaluation
  • without being limited to supervised probing of (expected) semantics

Challenging, low-resource evaluation:

  • zero- + few-shot settings reveal pretraining effects
  • low-resource (LR), long-tail (LT) setups show learning efficiency
    • the long-tail problem grows with data size
    • in the tail we are always in the low-resource setting

Data-efficient learning:

  • more self-supervision boosts LR few- + zero-shot performance

→ Evidence that, instead of pretraining on large external data, we can simply boost the SSL learning signal on small data.

FIN

XAI: further/newer reading

[7] Understanding the Role of Individual Units in a Deep Neural Network, PNAS, 2020, arx:2009.05041

(Figure: an image-classification CNN.)

Redundancy in representations