1 of 56

Understanding Dataset Difficulty with 𝒱-Usable Information

Kawin Ethayarajh¹, Yejin Choi²,³, Swabha Swayamdipta²
¹Stanford University | ²AI2 | ³Paul G. Allen School of Computer Science & Engineering, UW

ICML 2022

Presented by

Pritam S. Kadasi

ICML 2022 Outstanding Paper

12 April, 2023

2 of 56

Introduction

Dataset

  • How difficult is it?
  • What is difficulty?
  • How do we measure difficulty?

3 of 56

Models vs. Humans

[Figure: state-of-the-art models on the MultiNLI leaderboard, annotated "I'm here"]

Ref: PapersWithCode: https://paperswithcode.com/sota/natural-language-inference-on-multinli

4 of 56

Probability

Random event (rolling a die) → random variable: the value of a random variable represents the outcome of a random event.

  • Random variable X: represents any number that comes up on the die.
  • Random variable Y: represents whether an even number comes up on the die.
  • P(X): the probability distribution associated with the random variable.

5 of 56

Information theory

  • The basic intuition behind information theory is that learning that an unlikely event has occurred is more informative than learning that a likely event has occurred.

  • Example: learning that "the sun rose this morning" is barely informative (it was very likely), whereas learning that "there was a solar eclipse this morning" is highly informative (it was unlikely).

Ref: https://github.com/janishar/mit-deep-learning-book-pdf/blob/master/chapter-wise-pdf/%5B7%5Dpart-1-chapter-3.pdf

6 of 56

Information Content

  • Likely events should have low information content, and in the extreme case, events that are guaranteed to happen should have no information content whatsoever.
  • Less likely events should have higher information content.
  • Independent events should have additive information. For example, finding out that a tossed coin has come up as heads twice should convey twice as much information as finding out that a tossed coin has come up as heads once.
    • I(2H) = 2*I(1H)


Ref: https://github.com/janishar/mit-deep-learning-book-pdf/blob/master/chapter-wise-pdf/%5B7%5Dpart-1-chapter-3.pdf

H(X, Y) = H(X) + H(Y)  [additivity of information for independent X and Y]
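These desiderata are satisfied by the standard definition of self-information (base-2 here, so information is measured in bits; the referenced chapter uses the natural logarithm):

I(x) = -\log_2 P(x)

A guaranteed event (P(x) = 1) then carries zero information, and for independent events I(x, y) = -\log_2\big(P(x)P(y)\big) = I(x) + I(y).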


10 of 56

y = −log(x)

11 of 56

Entropy

H(X) = E[−log P(x)]

12 of 56

Entropy

  • Shannon defined the entropy H of a discrete random variable X which takes values in the alphabet 𝓧 and is distributed according to p : 𝓧 → [0, 1] such that p(x) ≔ P[X = x].

Ref: https://en.wikipedia.org/wiki/Entropy_(information_theory)

Intuitively, it tells how unpredictable a random variable is
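Written out, the entropy is

H(X) = -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x) = E[-\log_2 p(X)]

For example, a fair coin has H = 1 bit, while a fair six-sided die has H = \log_2 6 ≈ 2.58 bits.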


14 of 56

Relative Entropy or KL Divergence
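The standard definition, for discrete distributions P and Q over the same alphabet:

D_{KL}(P \,\|\, Q) = \sum_{x} P(x) \log_2 \frac{P(x)}{Q(x)} = E_{x \sim P}\left[\log_2 \frac{P(x)}{Q(x)}\right]

It is non-negative, equals zero only when P = Q, and is not symmetric, so it is not a true distance.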


Ref: https://github.com/janishar/mit-deep-learning-book-pdf/blob/master/chapter-wise-pdf/%5B7%5Dpart-1-chapter-3.pdf

15 of 56

Mutual Information

  • It is a quantity that measures the relationship between two random variables that are sampled simultaneously.
  • It measures how much information one random variable communicates, on average, about another.
  • It quantifies the reduction in uncertainty about one random variable due to knowing another.

16 of 56

Mutual Information

How much does one random variable tell about another?

  • Case 1: Dependent events → nonzero mutual information
    • X represents the roll of a fair 6-sided die, and
    • Y represents whether the roll is even (0 if even, 1 if odd).

  • Case 2: Independent events → no mutual information
    • X represents the roll of one fair die, and
    • Z represents the roll of another fair die.

Both cases are worked through in the sketch below.

Ref: https://people.cs.umass.edu/~elm/Teaching/Docs/mutInf.pdf
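A minimal sketch of both cases, computing I(X; Y) in bits directly from a joint distribution table (the helper below is illustrative, not from the paper):

```python
import numpy as np

def mutual_information(joint):
    """I(X;Y) = sum over x,y of p(x,y) * log2( p(x,y) / (p(x) p(y)) ), in bits."""
    px = joint.sum(axis=1, keepdims=True)   # marginal p(x)
    py = joint.sum(axis=0, keepdims=True)   # marginal p(y)
    prod = px @ py                          # product distribution p(x)p(y)
    mask = joint > 0                        # 0 * log(0) terms contribute nothing
    return float(np.sum(joint[mask] * np.log2(joint[mask] / prod[mask])))

# Case 1: X = fair 6-sided die roll, Y = parity of the roll (0 if even, 1 if odd).
joint_xy = np.zeros((6, 2))
for face in range(1, 7):
    joint_xy[face - 1, 0 if face % 2 == 0 else 1] = 1 / 6
print(mutual_information(joint_xy))   # 1.0 bit: X fully determines Y

# Case 2: X and Z are two independent fair dice.
joint_xz = np.full((6, 6), 1 / 36)
print(mutual_information(joint_xz))   # 0.0 bits: Z says nothing about X
```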

17 of 56

Mutual Information

The mutual information I(X; Y) is the relative entropy between the joint distribution p(x, y) and the product distribution p(x)p(y).
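Written out (standard definition, in bits):

I(X; Y) = D_{KL}\big(p(x, y) \,\|\, p(x)\,p(y)\big) = \sum_{x, y} p(x, y) \log_2 \frac{p(x, y)}{p(x)\, p(y)}

It is zero exactly when X and Y are independent, i.e., when p(x, y) = p(x)p(y) everywhere.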


20 of 56

Mutual Information and Entropy
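The standard relationship this slide illustrates (often drawn as a Venn diagram over H(X) and H(Y)):

I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X, Y)

That is, mutual information is the reduction in uncertainty about one variable from observing the other.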


21 of 56

Tensorflow Playground


22 of 56

π“₯-entropy

  • Consider model family π“₯ which can be trained to map X to its label Y
  • Now, translate the text of X to complex grammar, it would be harder to predict Y given X using π“₯
  • So, how do we measure this difficulty?

22

23 of 56

π“₯-entropy

By using shannon mutual information I(X;Y)?

  • NO, as mutual information would not change even after X is encrypted (translated) as it allows for unbounded computation including any need to decrypt the text.
  • Intuitively, the task is easier when X is unencrypted because the information it contains is usable by π“₯
  • When X is encrypted, the information still exists but becomes unusable, this quantity is called π“₯-usable information

23

24 of 56

π“₯-entropy

24

Maximizes the log-likelihood of label data without input

Maximizes the log-likelihood of label data given input
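Formally (as defined in the paper, following Xu et al., 2020), with f[x] denoting the label distribution that model f produces given input x, and \varnothing a null input carrying no information about Y:

H_{\mathcal{V}}(Y) = \inf_{f \in \mathcal{V}} E\big[-\log_2 f[\varnothing](Y)\big]
H_{\mathcal{V}}(Y \mid X) = \inf_{f \in \mathcal{V}} E\big[-\log_2 f[X](Y)\big]
I_{\mathcal{V}}(X \to Y) = H_{\mathcal{V}}(Y) - H_{\mathcal{V}}(Y \mid X)

The 𝒱-usable information is therefore the drop in 𝒱-entropy from being given the input.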

25 of 56

π“₯-entropy (Cont.)

So, how do we maximize the log-likelihood with and without input?

  • Training with the cross-entropy loss finds the 𝑓 ∊ π“₯ that maximizes the log-likelihood of Y given X
  • Thus, Hπ“₯(Y | X) can be easily computed by standard training or by finetuning a pre-trained model.
  • We estimate Hπ“₯(Y) by training or finetuning another model where X is replaced by βˆ…, intended to fit the label distribution.
  • As such, computing π“₯-information involves training or finetuning only two models.

25

26 of 56

π“₯-Usable information

26

27 of 56

Measuring Pointwise Difficulty

  • Measures π“₯-information for individual instances.

27
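Concretely, the pointwise 𝒱-information (PVI) of a held-out instance (x, y) is defined in the paper as

\mathrm{PVI}(x \to y) = -\log_2 g[\varnothing](y) + \log_2 g'[x](y)

where g' ∈ 𝒱 is finetuned on the (input, label) pairs and g ∈ 𝒱 is finetuned on (∅, label) pairs. Averaging PVI over a dataset recovers I_{\mathcal{V}}(X \to Y); higher PVI means an easier instance for 𝒱.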

28 of 56

Implications

π“₯-usable information allows us to compare

  • different models π“₯ by computing Iπ“₯ (X β†’ Y) for the same X, Y
  • different datasets {(π‘₯, 𝑦)} by computing Iπ“₯(X β†’Y) for the same π“₯
  • different input variables Xi by computing Iπ“₯(Xi β†’ Y) for the same π“₯ and Y

28

29 of 56

π“₯-Usable information in practice

  • Model performance tracks π“₯-information.
  • π“₯-information is more sensitive to over-fitting than held-out performance.
  • Different datasets for the same task can have different amounts of π“₯-usable information.

29


32 of 56

PVI vs. PMI

PVI is to 𝒱-information what PMI is to Shannon information.

33 of 56

Algorithm for calculating 𝒱-information
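A minimal, self-contained sketch of the two-model procedure described on the previous slides, with scikit-learn logistic regression standing in for the model family 𝒱 (the paper finetunes pretrained language models instead):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def v_information(X_train, y_train, X_test, y_test):
    """Estimate H_V(Y), H_V(Y|X), I_V(X -> Y) and per-instance PVI, all in bits.
    Assumes every held-out label also appears among the training labels."""
    # g': the model family fit on the real inputs
    g_prime = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    # g: the same family fit on a null (constant) input, so it can only learn the label prior
    g_null = LogisticRegression(max_iter=1000).fit(np.zeros((len(y_train), 1)), y_train)

    rows = np.arange(len(y_test))
    # log2-probability each model assigns to the gold label of each held-out instance
    logp_x = np.log2(g_prime.predict_proba(X_test)[rows, np.searchsorted(g_prime.classes_, y_test)])
    logp_null = np.log2(g_null.predict_proba(np.zeros((len(y_test), 1)))[rows, np.searchsorted(g_null.classes_, y_test)])

    pvi = -logp_null + logp_x   # pointwise V-information of each instance
    return {"H_V(Y)": -logp_null.mean(),
            "H_V(Y|X)": -logp_x.mean(),
            "I_V(X->Y)": pvi.mean(),
            "PVI": pvi}
```

In the paper, g' and g are the same pretrained model finetuned on (x, y) and (∅, y) pairs respectively, and PVI is computed on held-out data.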

34 of 56

Implications

PVI allows us to compare

  • different instances (x, y), by computing PVI(x → y) for the same X, Y, 𝒱
  • different slices or subsets of the data, by computing the average PVI over instances in each slice

35 of 56

PVI in Practice

  • PVI can be used to find mislabelled instances.
  • The PVI threshold at which predictions become incorrect is similar across datasets.
  • PVI estimates are highly consistent across models, training epochs, and random initializations.
    • Cross-epoch correlation: the correlation is high (r > 0.80 during the first 5 epochs), suggesting that an instance that is easy (difficult) early on tends to remain easy (difficult).
    • Cross-seed correlation: the correlation is high (r > 0.87), suggesting that what a model finds difficult is not due to chance.


37 of 56

PVI estimates are highly consistent across models


38 of 56

PVI estimates are highly consistent across human annotators


39 of 56

Uncovering Dataset Artefacts

  • Input Transformations
  • Slicing Datasets
  • Token-level Artefacts
  • Conditioning Out Information


40 of 56

Uncovering Dataset Artefacts

  • Input Transformations
  • Slicing Datasets


41 of 56

Input Transformations

  • The approach involves applying different transformations τᵢ(X) to isolate an attribute a, followed by calculating I𝒱(τᵢ(X) → Y) to measure how much information (usable by 𝒱) the attribute contains about the label.
  • Given that a transformation may make information more accessible (e.g., decrypting some encrypted text), it is possible for I𝒱(τᵢ(X) → Y) ≥ I𝒱(X → Y), so the latter shouldn't be treated as an upper bound.

42 of 56

Some transformations for SNLI

  • shuffled (shuffle tokens randomly)
  • hypothesis-only (only include the hypothesis)
  • premise-only (only include the premise)
  • overlap (tokens in both the premise and hypothesis)
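A rough illustration of these four transformations (simple whitespace tokenization here, just to make the definitions concrete; the paper uses the model's own tokenizer):

```python
import random

def snli_transformations(premise: str, hypothesis: str):
    """Return the transformed inputs listed above for one premise-hypothesis pair."""
    p_tokens, h_tokens = premise.split(), hypothesis.split()
    shuffled = p_tokens + h_tokens
    random.shuffle(shuffled)                      # destroy word order, keep token identity
    overlap = [t for t in h_tokens if t in set(p_tokens)]
    return {
        "shuffled": " ".join(shuffled),
        "hypothesis-only": hypothesis,
        "premise-only": premise,
        "overlap": " ".join(overlap),
    }

print(snli_transformations("A man is playing a guitar on stage.",
                           "A man is playing music."))
```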


43 of 56

Transformations Results

  • Token identity alone provides most of the usable information in SNLI.
  • Hate speech detection might have lexical biases.
    • In DWMW17, the text contains 0.724 bits of BERT-usable information about the label.
    • Additionally, if one removed all tokens from the input post hoc except for 50 (potentially) offensive ones (comprising common racial and homophobic slurs), there would still remain 0.490 bits of BERT-usable information.
    • In other words, just 50 (potentially) offensive words contain most of the BERT-usable information in DWMW17.

44 of 56

Transformations Results

  • Hate speech detection might have lexical biases.
    • In the DWMW17 dataset:
      • I𝒱 = 0.724 bits
      • After removing all tokens except 50 (potentially) offensive tokens from the dataset: I𝒱 = 0.490 bits
      • In other words, just 50 (potentially) offensive words contain most of the BERT-usable information in DWMW17.
  • Token identity alone provides most of the usable information in SNLI.

45 of 56

In DWMW17 dataset,

  • Ex-1: I don't want my child going to school with [LGBTQ+ community], they are all perverts.
    • PVI: 0.724 bits
  • After removing all the tokens except 50 (potentially) offensive tokens from the dataset:
    • Ex-1 → Ex-2: perverts.
    • PVI: 0.490 bits
  • In other words, just 50 (potentially) offensive words contain most of the BERT-usable information in DWMW17



47 of 56

Slicing Datasets

  • Certain attributes are more useful for certain classes.
  • Certain subsets of each class are more difficult than others.


48 of 56

Certain attributes are more useful for certain classes.

  • Note: π“₯-information = mean PVI (entire dataset)
  • Comparing the usefulness of an attribute across classes can be useful for identifying systemic annotation artefacts.
    • Created slices of data with respect to classes.
    • Calculate mean PVI for these slices and compare it.
    • Ex: the premise-hypothesis overlap contains much more BERT usable information about the β€˜entailment’ class than β€˜contradiction’ or β€˜neutral’.

48


50 of 56

Certain subsets of each class are more difficult than others

  • We bin the examples in each SNLI class by the level of hypothesis-premise overlap and plot the average PVI.
  • Entailment instances with no hypothesis-premise overlap are the most difficult (i.e., lowest mean PVI), while contradiction instances with no overlap are the easiest (i.e., highest mean PVI).
  • We can combine dataset cartography with PVI comparisons across subsets to dig deeper into the data.


53 of 56

Key Takeaways

  • We can estimate the difficulty of a dataset, of slices of the dataset, and even of an individual instance.
  • We can isolate the attributes of the dataset (or of individual instances) that lead to difficulty.
  • The higher the 𝒱-information, the easier the dataset.
  • In dataset cartography, we cannot say why an individual instance is difficult/easy/ambiguous, or on the basis of which attributes.
  • Dataset cartography also cannot quantify the difficulty of an instance or a dataset.

54 of 56

References


55 of 56

Thank You !!!


56 of 56

Token-level Artefacts

How do we know whether a token is helpful in deciding the label?

  • Token-level signals and artefacts can be discovered using leave-one-out.
  • We compute the change in the 𝒱-information estimate after removing t, which yields the modified input x¬t. We use the same model g′, but evaluate only on a slice of the data, D_C,t, which contains the token t and belongs to the class C of interest. This simplifies to measuring the increase in conditional entropy.
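One way to write the quantity described above (the Δ symbol is introduced here for illustration): for a token t and class C,

\Delta_{C,t} = \frac{1}{|D_{C,t}|} \sum_{(x, y) \in D_{C,t}} \Big( \log_2 g'[x](y) - \log_2 g'[x_{\neg t}](y) \Big)

i.e., the average increase in conditional 𝒱-entropy (equivalently, the average drop in PVI) when t is removed. Large positive values flag tokens that carry much of the label information for class C, i.e., potential token-level artefacts.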
