
Learning, evaluating and explaining (XAI) transferable text representations from grumpy Big Small Data.

Nils Rethmeier*+, Prof. Isabelle Augenstein*, Vageesh Saxena+

NLP with Friends • December 2, 2020


XAI: explain & interpret, as defined by [1]

Decision understanding (mostly "explain"; model = black box)

  • explains input-to-output/task relevance
  • needs supervised labels
    • incidental labels also work [2]
  • often explains one instance at a time
    • gives no idea what the model as a whole does

Model understanding (mostly "interpret"; grey-box methods)

  • interprets neuron activity
  • interprets over an entire corpus
  • independent of outputs (unsupervised)
  • prune unused neurons [3]
  • find redundant neurons [4]

(Figure: LRP text-classification demo, src:lrpserver.hhi.fraunhofer.de/text-classification. Input: "Do antidepressants correct the sleep disturbances in these diseases? …"; predicted topic label: medicine, with relevance highlighted on words such as "disease", "vomit", "fever". Related explanation examples exist for text [3] and images [5].)

Legend: [n] = shown today, [n] = shown, not ours.

[1] Visual Interaction with Deep Learning Models through Collaborative Semantic Inference, 2019, arx:1907.10739
[2] Learning Comment Controversy Prediction in Web Discussions Using Incidentally Supervised Multi-Task CNNs, 2018, acl:W18-6246
[3] TX-Ray: Quantifying and Explaining Model-Knowledge Transfer in (Un-)Supervised NLP, 2020, arx:1912.00982
[4] EffiCare: Better Prognostic Models via Resource-Efficient Health Embeddings, 2020, AMIA Symposium, medarx:10.1101/2020.07.21.20157610v1

[3]: transfer XAI / measurement

Why: limits of supervised probes/decision-understanding XAI

  • needs labels - evaluates expected semantics only
    • unforeseen insights are not discoverable
  • the probe can mismatch the (pre-)training domain

What: model transfer understanding - quantify neuron transfer

  • RQ1: self-supervised 'knowledge' abstraction (pre-training)
  • RQ2: zero-shot 'knowledge' transfer to a new domain
  • RQ3: transfer during supervised adaptation (fine-tuning)

How: adapt self-supervised interpretability [5] for NLP [3]

  • visualize which inputs (features) each neuron prefers (maximally activates on) - activation maximization, originally from RBMs [5]

[5] Visualizing Higher-Layer Features of a Deep Network, 2009, http://www.iro.umontreal.ca/~lisa/publications2/index.php/publications/show/247

Activation maximization, 2 methods

(1) In vision: generate the input that maximally activates a neuron, because we do not know the visual primitives (objects, material, color, etc.).

(2) In NLP: record the inputs (features) that maximally activate a neuron.

→ Neuron := preference distribution over input tokens/words

Per token, record only the maximally activated (preferred) neuron; pk = token activation probability, recorded while training. (A recording sketch follows below.)

(Figure: example sentences - "I like cake better than cookies.", "The learning-cake has cherries inside.", "Supervision is the cake's frosting.", "Self-supervision is the cake's corpus.", "The cake is a lie." Neuron 1 maximally activates on 'cake' and 'cookies' → this neuron prefers sweets; neuron 6 activates on 'I' and 'than' → this neuron prefers stop words.)
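A minimal sketch of method (2), recording per-token neuron preferences. It assumes a PyTorch encoder whose forward pass returns per-token hidden activations; the function and argument names (`record_neuron_preferences`, `id2tok`) are illustrative, not taken from the paper's code.

```python
# Sketch: TX-Ray-style activation maximization for NLP, method (2).
# Assumption: encoder(batch) returns activations of shape [1, seq_len, n_neurons].
from collections import Counter, defaultdict

import torch

def record_neuron_preferences(encoder, token_batches, id2tok):
    """Per token, record only the maximally activated (preferred) neuron,
    building one token-preference distribution (Counter) per neuron."""
    neuron_prefs = defaultdict(Counter)  # neuron index -> token counts
    encoder.eval()
    with torch.no_grad():
        for tokens in token_batches:                        # LongTensor [seq_len]
            acts = encoder(tokens.unsqueeze(0)).squeeze(0)  # [seq_len, n_neurons]
            winners = acts.argmax(dim=-1)                   # preferred neuron per token
            for tok_id, neuron in zip(tokens.tolist(), winners.tolist()):
                neuron_prefs[neuron][id2tok[tok_id]] += 1
    return neuron_prefs  # normalize each Counter to obtain the pk distributions
```

Normalizing each neuron's counter by its total count yields the token activation probabilities pk shown in the figure.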

RQ1: pretraining 'builds' knowledge

How does each neuron abstract/transfer knowledge during RQ1: pretraining, RQ2: zero-shot application, RQ3: supervised fine-tuning?

Token-preference distributions visualize the knowledge abstraction of each neuron.

(Figure, stage (1) pre-training: pretrain encoder E on X_pre; each neuron nn := token-preference distribution pk over input features f1 … fk.)

How neuron preference changes during pre-training (50 epochs): epoch 1 vs. 48 or 49 - the loss converges at epoch 49.

  • FLAIR POS tagger applied to Wikitext-2 (2M tokens) → noisy tags
  • POS word-cluster activation (normalized to sum to 1); unique words: average activation (not normalized)
  • POS clusters are used because otherwise the x-axis would have 100k tokens → unreadable (see the POS-cluster sketch below)

Observations:

  • large change epoch 1 vs. 48, no change epoch 48 vs. 49
  • neuron well initialized?!
  • very specific neurons (one word only)
  • little change (same word + POS) → high transfer; large change → little transfer

Post-hoc reasoning (not in the paper): self-supervision as next-word prediction → messy feature preferences.

We often only care about transfer to a supervised task, i.e. the representation has to transfer, not be intelligible.

Uncomfortable question: are learning objectives that produce intelligible representations better, and would more self-supervised evaluation help?!

LSTM language-model pre-training learns POS (via the FLAIR tagger) during early epochs - also shown in 2018 by [6].

[6] Language Models Learn POS First, 2018, acl:W18-5438
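To keep the x-axis readable, token preferences are aggregated into POS clusters. Here is a small sketch of that aggregation, under the assumption that a tagger (e.g. FLAIR) supplies a token-to-POS map; `tok2pos` is a hypothetical name.

```python
# Sketch: aggregate one neuron's token-preference counts into POS clusters
# and normalize so the cluster activations sum to 1.
from collections import Counter

def pos_cluster_activation(token_counts: Counter, tok2pos: dict) -> dict:
    clusters = Counter()
    for token, mass in token_counts.items():
        clusters[tok2pos.get(token, "UNK")] += mass  # untagged tokens collapse to UNK
    total = sum(clusters.values()) or 1.0
    return {pos: mass / total for pos, mass in clusters.items()}
```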

RQ2: apply knowledge to new inputs

How does each neuron abstract/transfer knowledge during RQ1: pretraining, RQ2: zero-shot application, RQ3: supervised fine-tuning?

Feed new text X_end to the frozen pre-trained encoder from stage (1).

(Figure, stage (2) zero-shot: evaluate E on new X_end; fk := token/POS; compare each neuron's zero-shot preference distribution against its pre-training one.)

The Hellinger distance H(·,·) measures change/transfer. Unlike the Kullback-Leibler divergence, H is symmetric (and bounded). A sketch follows below.

  • no neuron transfer if the zero-shot preference distribution changes strongly vs. pre-training
  • the neuron transfers if the zero-shot preference distribution changes little
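For concreteness, a sketch of the Hellinger distance between two neuron token-preference distributions; aligning the two distributions over the union of their tokens is an assumption about the implementation.

```python
# Hellinger distance H(p, q) = (1/sqrt(2)) * ||sqrt(p) - sqrt(q)||_2,
# symmetric and bounded in [0, 1]; 0 = identical preference distributions.
import numpy as np

def hellinger(p_counts: dict, q_counts: dict) -> float:
    vocab = sorted(set(p_counts) | set(q_counts))
    p = np.array([p_counts.get(t, 0.0) for t in vocab], dtype=float)
    q = np.array([q_counts.get(t, 0.0) for t in vocab], dtype=float)
    p, q = p / max(p.sum(), 1e-12), q / max(q.sum(), 1e-12)  # normalize counts
    return float(np.linalg.norm(np.sqrt(p) - np.sqrt(q)) / np.sqrt(2))
```

A small H between the pre-training and zero-shot distributions means the neuron transfers; a large H means it does not.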

RQ3: transfer to + from supervision ('forgetting' due to supervision)

How does each neuron abstract/transfer knowledge during RQ1: pretraining, RQ2: zero-shot application, RQ3: supervised fine-tuning?

Fine-tune the encoder from stage (1) with labels (supervised) only.

(Figure, stage (3) supervised: E_end := E fit on labels Y_end; evaluate E_end on X_end; pk := token probability, e.g. p1 = .05.)

Pretrain → supervise

  • pretrain a language model on Wikitext-2, stage (1)
  • fine-tune on IMDB binary review labels, stage (3)

Pruning effects on supervised performance: prune neurons that, post supervision,

  • are specialized (re-fit by supervision)
  • become preferred (activated)
  • are 'avoided' (now have an empty distribution)

i.e., neurons without any max activation post supervision - a sketch follows below.
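A hedged sketch of the 'avoided neuron' criterion; masking rows of an encoder weight matrix is one possible realization, not necessarily the paper's exact pruning mechanism, and the shapes are assumptions.

```python
# Sketch: prune neurons that win the per-token argmax for no token after
# fine-tuning, i.e. neurons whose post-supervision preference distribution
# is empty ('avoided' neurons).
import torch

def prune_avoided_neurons(weight: torch.Tensor, neuron_prefs: dict) -> torch.Tensor:
    """weight: [n_neurons, dim] encoder output weights (hypothetical shape);
    neuron_prefs: neuron index -> token Counter recorded post supervision."""
    keep = torch.zeros(weight.shape[0], dtype=torch.bool)
    for neuron, counts in neuron_prefs.items():
        if sum(counts.values()) > 0:  # neuron still maximally activates on something
            keep[neuron] = True
    return weight * keep.unsqueeze(1)  # zero out the avoided neurons
```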

TX-Ray takeaways

TX-Ray can explore generalization and specialization at the neuron level:

  • self-supervised pre-training builds general knowledge
    • preference is spread across many neurons
  • zero-shot application matches model knowledge against new data
    • preference is less spread -- partial generalization
  • supervised fine-tuning sparsifies (concentrates) activation
    • preference becomes peaked and concentrated -- domain over-specialization; supervision forces the model to forget pretraining

XAI: further (newer) reading

XAI conceptualizing neurons/NN components:

[7] Understanding the Role of Individual Units in a Deep Neural Network, PNAS, 2020, arx:2009.05041 (newer)

[8] Activation Atlas, 2019, distill.pub/2019/activation-atlas/ - gave the idea that neurons should be explorable (but only worked for supervised learning). TX-Ray explores unsupervised representations → removes the probing-task limitation and delays expectation bias.

Pruning:

[9] Movement Pruning: Adaptive Sparsity by Fine-Tuning, 2020 (newer), arx:2005.07683 - prunes neurons that change during training/transfer in NLP, no XAI

Open questions/caveats

Pruning:

  • is forced forgetting → I'd expect it to lose generalization beyond the test set
  • when predicting many classes with a small model, pruning always hurts (MSc thesis: Saxena)

So, is pruning only for overparameterized models?

XAI:

  • explainability is sensitive to supervised learning quality
  • TX-Ray (interpretability) + 'its' pruning are sensitive to self-supervised learning quality (see the early epochs above)
    • However, if sensitivity means worse abstractions, then could better abstractions indicate better self-supervised learning -- i.e. allow evaluation without probes?

TX-Ray in patient care prediction [4]

Evaluate the preferred data source and model component for a task.

Task: will the patient die in 24h? Limited training data → 1. low-resource pretraining + 2. XAI, since no XL pretrained model is available.

(Figure: neuron activation per data source and model component; 1-hour-bucket pooling in a low layer, pooling over hour buckets in higher layers.)

Insights:

  1. output-events data is not used
  2. pooling over hour buckets matters
  3. for pooling over time, the pooling matters most

[4] EffiCare: Better Prognostic Models via Resource-Efficient Health Embeddings, 2020, AMIA Symposium, medarx:10.1101/2020.07.21.20157610v1

Evaluation: approaches, questions

Supervised: "catch them all" supervised benchmarks

  • example: [10] - an 18-task word-embedding benchmark
  • a (complete) autoencoder retrofits word embeddings
    • better single-task + all-task performance
    • more improvement on smaller pre-training data

Problem:

  • the probe model (depth) limits performance [10, 11]

(Figure: benchmark table - base embeddings vs. retrofitted ones; retrofitting is overall better and best per task.)

[10] MoRTy: Unsupervised Learning of Task-specialized Word Embeddings by Autoencoding, 2019, acl:W19-4307, github.com/NilsRethmeier/MoRTy
[11] Designing and Interpreting Probes with Control Tasks, 2019, acl:D19-1275

Evaluation: approaches, questions

Supervised evaluation via probe design, problems continued:

  • probe models limit performance [11] -- deeper probes are better

Nagging question 2: how much do probe benchmarks tell us, then?

  • SotA scores depend on:
    • hardware budget [12] or seed hacking [13]
    • using ever more pre-training data [14, 15]

Nagging question 3: are new pretraining models better, or do they just use more data? -- a lack of controlled experiments!

1990-2010: learning is easy when you have enough (supervised) data

  • simple, small models will do

2018-20xx: learning is easy when you have web-scale (self-supervised) data

  • simple, XL models [14, 15]

[12] The Hardware Lottery, 2020, arx:2009.06489
[13] Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping, 2020
[14] A Survey on Contextual Embeddings, 2020
[15] How Can We Accelerate Progress Towards Human-like Linguistic Generalization?, 2020, acl:2020.acl-main.465

Learning from small data

Evaluation idea, addressing [16-20]:

  • evaluate small-data pre-training for few- and zero-shot learning
    • no large external pre-training
    • pre-train on a few GPUs (~1-4)
  • long-tailed (imbalanced) data
    • the tail is always few-shot to zero-shot, and the tail grows with data size

Problem: how to learn a text encoder at small scale?

  1. We need zero-shot + long-tail capabilities
    • e.g. label-embedding CNNs [22], where labels = word embeddings
    • the label-embedding BERT of [23] needs ELMo word embeddings for labels
  2. CNNs/LSTMs [20] outperform transformers when using small pre-training data [16, 18-21]

Considering 1+2: we chose CNNs.

[16] Learning and Evaluating General Linguistic Intelligence, 2019, arx:1901.11373
[17] How Can We Accelerate Progress Towards Human-like Linguistic Generalization?, 2020, acl:2020.acl-main.465
[18] Mogrifier LSTM, 2020
[19] Transformer on a Diet, 2020
[20] Character-Aware Neural Language Models, 2016, arx:1508.06615
[21] Pointer Sentinel Mixture Models, 2016
[22] Multi-Task Label Embedding for Text Classification, 2018
[23] X-BERT: eXtreme Multi-label Text Classification with BERT, 2019
[24] Long-Tail Zero and Few-Shot Learning via Contrastive Pretraining on and for Small Data, 2020, arx:2010.01061

Pre-training from small data

Self-supervised CNN pre-training (e.g. as an SSL transformer alternative):

  1. feed words ('measuring …') + positive labels + negative labels through the embedding layer
  2. feed word embeddings through the text encoder NN → text encoding T
  3. feed label(-word) embeddings through the label encoder NN → label encoding L
  4. concatenate each label encoding L with a copy of the text encoding T -- Tk = torch.repeat(T, |L1 … Lk|)
  5. match label encodings L to text copies Tk -- the matcher is a learned similarity function with sim = 0..1
    • the matcher is a single class that predicts k label-to-text 'similarities'
  6. train with binary cross-entropy over k labels * batch size (k = num positive + negative labels)
    • (a) labels = supervision labels, or (b) words as pseudo-labels (sampled from text in the batch)

A code sketch of steps 1-6 follows below.
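A minimal PyTorch sketch of steps 1-6. The module sizes, the mean pooling, and the class name `LabelTextMatcher` are illustrative assumptions, not the paper's architecture (which uses CNN encoders).

```python
# Sketch: label-embedding matcher trained with BCE over k candidate labels.
import torch
import torch.nn as nn

class LabelTextMatcher(nn.Module):
    def __init__(self, vocab_size: int, emb_dim: int = 128, hid: int = 256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)                        # step 1
        self.text_enc = nn.Sequential(nn.Linear(emb_dim, hid), nn.ReLU())   # step 2
        self.label_enc = nn.Sequential(nn.Linear(emb_dim, hid), nn.ReLU())  # step 3
        self.matcher = nn.Sequential(nn.Linear(2 * hid, hid), nn.ReLU(),
                                     nn.Linear(hid, 1))                     # step 5

    def forward(self, text_ids, label_ids):
        # text_ids: [seq_len]; label_ids: [k] positive + negative label words
        T = self.text_enc(self.emb(text_ids).mean(dim=0))   # text encoding T
        L = self.label_enc(self.emb(label_ids))             # k label encodings
        Tk = T.unsqueeze(0).repeat(L.shape[0], 1)           # step 4: copy T per label
        return self.matcher(torch.cat([Tk, L], dim=-1)).squeeze(-1)  # k logits

# step 6: binary cross-entropy, target 1 for positives and 0 for negatives
loss_fn = nn.BCEWithLogitsLoss()
```

In self-supervised mode (b), `label_ids` are simply words sampled from the batch's texts as positive/negative pseudo-labels.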

Pre-training from small data

Supervised and self-supervised training objectives:

  • noise contrastive estimation (NCE)
  • NCE is BCE with 1 positive and k negative samples
  • we use g positives + b negatives: N = g + b
  • previous work used supervised labels g, and optionally b [22]
  • we add positive (w+) and negative (w-) input words as pseudo-labels
    • N = g + b for the supervised learning mode
    • N = w+ + w- for self-supervised learning
      • i.e. contrastive partial autoencoding
  • detail: we use a CNN and deeper encoders than [22]

A worked form of the objective follows below.
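A worked form of the objective, assuming s_i denotes the matcher logit for candidate label i (notation added here for illustration):

```latex
% Binary cross-entropy over N = g + b sampled labels (sketch):
% y_i = 1 for the g positives (labels or pseudo-label words w+),
% y_i = 0 for the b negatives (labels or sampled words w-).
\mathcal{L} \;=\; -\frac{1}{N} \sum_{i=1}^{N}
  \Big[\, y_i \log \sigma(s_i) \;+\; (1 - y_i)\, \log\big(1 - \sigma(s_i)\big) \Big]
```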

Pre-training from small data

Pseudo-label-embedding NCE (PLENCE) properties:

  • self-supervised (SSL) pretraining → allows SSL zero-shot learning
  • NCE allows infinite label sets and vocabulary sets
    • no softmax → one less bottleneck
    • 'text-to-text' becomes 'embedding-to-embedding'
      • the network does not have to translate to anonymous {0,1} labels
      • labels now have meaning (semantics) and are words (like for humans)
      • we pretrain word embeddings, ergo, labels are also pretrained :)
    • unknown labels can be inferred via fastText or Attentive Mimicking [25]

[25] Learning Semantic Representations for Novel Words: Leveraging Both Form and Context, 2018

Pre-training from small data

Evaluate zero-shot (ZS), few-shot + long-tail learning via PLENCE:

Allowing less pre-training text (75% … 10%):

  • takes longer to train
  • but reaches comparable ZS performance

3.3x self-supervision pseudo-labels:

  • better/faster ZS performance

Using a larger net + more pseudo-labels:

  • much better ZS performance

Insight: when large external data (signal) is unavailable, increase self-supervision. This works despite the long-tail problem worsening with few-shot.

Pre-training from small data

Evaluate zero-shot, few-shot (FS) + long-tail learning via PLENCE:

With self-supervised pretraining (PLENCE):

  • FS performance is much better
  • trains fast, without pseudo-plateaus, with stable learning curves
  • epoch-1 performance is much higher -- practical

Without self-supervised pretraining:

  • few-shot performance is bad
  • learning takes long (200 epochs)
  • FS learning pseudo-plateaus

Insight: pseudo-label SSL pretraining boosts data efficiency, learning speed, learning stability and few-shot end performance -- even on 10% of the data.

Pre-training from small data

Evaluate zero-shot (ZS), few-shot + long-tail learning via PLENCE:

Evaluating a long-tail:

  • 5 class-frequency-balanced buckets (a bucketing sketch follows below)
    • i.e. each bucket has the same amount of positive (real) labels
  • tail buckets have many more labels
    • classes per bucket are still very imbalanced

(Figure: y-axis = class label frequency, log scale -- extreme imbalance, 1305 classes.)
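A sketch of the bucketing, under the stated assumption that each bucket should carry the same positive-label mass; `class_freqs` (class → positive-label count) is a hypothetical input name.

```python
# Sketch: split classes (sorted head to tail) into 5 buckets of roughly
# equal positive-label mass; tail buckets end up with many more classes.
def frequency_balanced_buckets(class_freqs: dict, n_buckets: int = 5):
    classes = sorted(class_freqs, key=class_freqs.get, reverse=True)
    target = sum(class_freqs.values()) / n_buckets
    buckets, current, mass = [], [], 0
    for c in classes:
        current.append(c)
        mass += class_freqs[c]
        if mass >= target and len(buckets) < n_buckets - 1:
            buckets.append(current)
            current, mass = [], 0
    buckets.append(current)  # final (tail) bucket: many rare classes
    return buckets
```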

Pre-training from small data

Evaluate zero-shot (ZS), few-shot + long-tail learning via PLENCE:

Supervised (SL) label embeddings (LE):

  • better performance than [0,1]-BCE, especially on the long tail

SSL-pretrained LE:

  • slightly better yet

XL SSL-pretrained LE with a larger net + 3.3x labels:

  • strong long-tail performance boost

Pre-training from small data

Evaluate zero-shot (ZS), few-shot + long-tail learning via PLENCE:

SSL-pretrained label embeddings:

  • learn the long tail from epoch 1

XL SSL-pretrained LE:

  • learns the long tail even faster
  • needs 3x fewer epochs than the smaller SSL-pretrained model

Insight: pseudo-label SSL pretraining boosts long-tail learning. A larger model + more SSL signal boost the long tail further.

Summary

Self-supervised XAI:

  • for self-supervised (transfer learning) evaluation
  • without being limited to supervised probing of (expected) semantics

Challenging, low-resource evaluation:

  • zero- + few-shot settings reveal pretraining effects
  • low-resource (LR), long-tail (LT) setups show learning efficiency
    • the long-tail problem grows with data size
    • in the tail we are always in the low-resource setting

Data-efficient learning:

  • more self-supervision boosts LR few- + zero-shot performance

→ Evidence that, instead of pretraining on large external data, we can simply boost the SSL learning signal on small data.

FIN

XAI: further/newer reading

[7] Understanding the Role of Individual Units in a Deep Neural Network, PNAS, 2020, arx:2009.05041

(Figure: an image-classification CNN.)

Redundancy in representations