Understanding Dataset Difficulty with 𝒱-Usable Information
Kawin Ethayarajh¹, Yejin Choi²,³, Swabha Swayamdipta² | ¹Stanford University | ²AI2 | ³Paul G. Allen School of CS, UW
Presented by
Pritam S. Kadasi
ICML 2022 Outstanding Paper
12 April 2023
Introduction
Dataset
How difficult is it?
What is difficulty?
How do we measure difficulty?
Models vs. Humans
I'm here
Ref: PapersWithCode: https://paperswithcode.com/sota/natural-language-inference-on-multinli
Probability
A random variable represents the outcome of a random event.
Random event (rolling a die) → Random variable X: represents any number that comes up on the die.
Random variable Y: represents an even number that comes up on the die.
P(X): the probability distribution associated with the random variable X.
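A minimal sketch (my addition, not from the slides) making the two random variables concrete, assuming a fair die:

```python
from fractions import Fraction

die = [1, 2, 3, 4, 5, 6]            # outcomes of a fair six-sided die

# X: any number that comes up -> uniform over all six outcomes.
P_X = {x: Fraction(1, 6) for x in die}

# Y: an even number that comes up -> uniform over {2, 4, 6}.
evens = [x for x in die if x % 2 == 0]
P_Y = {y: Fraction(1, len(evens)) for y in evens}

print(P_X)   # {1: Fraction(1, 6), ..., 6: Fraction(1, 6)}
print(P_Y)   # {2: Fraction(1, 3), 4: Fraction(1, 3), 6: Fraction(1, 3)}
```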
Information Theory
Ref: https://github.com/janishar/mit-deep-learning-book-pdf/blob/master/chapter-wise-pdf/%5B7%5Dpart-1-chapter-3.pdf
Information Content
The self-information of an outcome x is I(x) = -log P(x): the less probable the outcome, the more information its occurrence carries.
For independent random variables, information adds up: H(X, Y) = H(X) + H(Y)  [additivity of information]
[Figure: the curve y = -log(x); information content falls to 0 as P(x) → 1 and grows without bound as P(x) → 0]
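A small illustration (mine, not from the slides) of how -log probability behaves:

```python
import math

def information_content(p: float) -> float:
    """Self-information -log2(p), in bits."""
    return -math.log2(p)

# Rare outcomes carry more information than common ones.
for p in [0.99, 0.5, 0.1, 0.01]:
    print(f"P(x) = {p:<5} -> I(x) = {information_content(p):.2f} bits")

# Additivity for independent events: I(x, y) = I(x) + I(y).
p_x, p_y = 0.5, 0.25
assert math.isclose(information_content(p_x * p_y),
                    information_content(p_x) + information_content(p_y))
```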
Entropy
H(X) = E[-log P(x)]
Ref: https://en.wikipedia.org/wiki/Entropy_(information_theory)
Intuitively, entropy tells us how unpredictable a random variable is.
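A quick sketch (my addition) computing entropy for a few simple distributions:

```python
import math

def entropy(probs) -> float:
    """Shannon entropy H(X) = E[-log2 P(x)], in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([1/6] * 6))        # fair die: log2(6) ≈ 2.585 bits
print(entropy([0.5, 0.5]))       # fair coin: 1.0 bit
print(entropy([0.99, 0.01]))     # predictable coin: ≈ 0.081 bits
print(entropy([1.0]))            # certain outcome: 0 bits
```

The more unpredictable the variable, the higher the entropy; a certain outcome carries none.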
Relative Entropy or KL Divergence
D_KL(P ‖ Q) = E_{x∼P}[log P(x) - log Q(x)]: the extra information needed to describe samples from P when using a code optimized for Q.
Ref: https://github.com/janishar/mit-deep-learning-book-pdf/blob/master/chapter-wise-pdf/%5B7%5Dpart-1-chapter-3.pdf
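A direct translation of that definition into code (my addition):

```python
import math

def kl_divergence(p, q) -> float:
    """D_KL(P ‖ Q) = Σ p(x) * log2(p(x) / q(x)), in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]               # fair coin
q = [0.9, 0.1]               # biased coin
print(kl_divergence(p, q))   # > 0: distributions differ
print(kl_divergence(p, p))   # 0: identical distributions
print(kl_divergence(q, p))   # note the asymmetry: D_KL(P‖Q) ≠ D_KL(Q‖P)
```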
Mutual Information
How much does one random variable tell us about another?
Ref: https://people.cs.umass.edu/~elm/Teaching/Docs/mutInf.pdf
The mutual information I(X; Y) is the relative entropy between the joint distribution p(x, y) and the product distribution p(x)p(y):
I(X; Y) = D_KL(p(x, y) ‖ p(x)p(y))
Mutual Information and Entropy
I(X; Y) = H(X) + H(Y) - H(X, Y) = H(Y) - H(Y | X): the reduction in uncertainty about Y from knowing X.
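A small numeric check (my addition) that the KL-based definition and the entropy identity agree, on a toy joint distribution:

```python
import math

# Joint distribution p(x, y) for two binary variables, as a 2x2 table.
joint = [[0.4, 0.1],
         [0.1, 0.4]]

p_x = [sum(row) for row in joint]           # marginal of X
p_y = [sum(col) for col in zip(*joint)]     # marginal of Y

# I(X; Y) = Σ p(x, y) * log2( p(x, y) / (p(x) * p(y)) )
mi = sum(joint[i][j] * math.log2(joint[i][j] / (p_x[i] * p_y[j]))
         for i in range(2) for j in range(2) if joint[i][j] > 0)
print(f"I(X; Y) = {mi:.3f} bits")           # ≈ 0.278 bits

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Same quantity via entropies: H(X) + H(Y) - H(X, Y).
h_joint = H([p for row in joint for p in row])
assert math.isclose(mi, H(p_x) + H(p_y) - h_joint)
```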
TensorFlow Playground
𝒱-entropy
𝒱-entropy
Can we measure difficulty with Shannon mutual information I(X; Y)? Not directly: Shannon information assumes a computationally unbounded observer, whereas real model families are constrained.
𝒱-entropy
H_𝒱(Y) = inf_{f∈𝒱} E[-log2 f[∅](Y)]: maximizes the log-likelihood of the labels without the input.
H_𝒱(Y | X) = inf_{f∈𝒱} E[-log2 f[X](Y)]: maximizes the log-likelihood of the labels given the input.
𝒱-entropy (Cont.)
So how do we maximize the log-likelihood with and without the input? In practice: train (finetune) a model from 𝒱 on (input, label) pairs to estimate H_𝒱(Y | X), and on (∅, label) pairs, where ∅ is an empty input, to estimate H_𝒱(Y).
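A toy worked example (the numbers are mine): with two equally likely labels, a model that ignores the input can at best predict 0.5 per label, while one that reads the input may put most of its mass on the true label.

```python
import math

# Without the input, the best f in V can only match the label marginal:
# with two balanced labels it predicts 0.5, so H_V(Y) = 1 bit.
H_V_Y = -math.log2(0.5)

# With the input, suppose the best f in V puts 0.9 on the true label.
H_V_Y_given_X = -math.log2(0.9)          # ≈ 0.152 bits

# V-usable information: how much the input helps this model family.
print(H_V_Y - H_V_Y_given_X)             # ≈ 0.848 bits
```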
𝒱-Usable Information
I_𝒱(X → Y) = H_𝒱(Y) - H_𝒱(Y | X): how much information about Y models in 𝒱 can extract from X. The higher the 𝒱-usable information, the easier the dataset is for 𝒱.
Measuring Pointwise Difficulty
PVI(x → y) = -log2 g'[∅](y) + log2 g[x](y), where g is trained on (x, y) pairs and g' on (∅, y) pairs. Higher PVI means an easier instance, and the mean PVI over held-out data estimates I_𝒱(X → Y).
Implications
𝒱-usable information allows us to compare:
- different datasets with respect to the same model family 𝒱
- different model families with respect to the same dataset
- different attributes of the same dataset
𝒱-Usable Information in Practice
PVI vs. PMI
PVI is to 𝒱-information what PMI is to Shannon information: a per-instance quantity whose expectation over the data recovers the aggregate measure.
Algorithm for Calculating 𝒱-Information
Given a dataset {(x_i, y_i)}: (1) train g ∈ 𝒱 on the (x_i, y_i) pairs and g' ∈ 𝒱 on the (∅, y_i) pairs; (2) for each held-out instance, compute PVI(x_i → y_i) = -log2 g'[∅](y_i) + log2 g[x_i](y_i); (3) average the PVI values to estimate I_𝒱(X → Y).
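A runnable sketch of this procedure (my own simplification: the paper finetunes pretrained language models such as BERT, while here 𝒱 is logistic regression over bag-of-words features, and the null input ∅ is an empty string, so g' sees all-zero features and can only learn the label marginal):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy sentiment data standing in for a real dataset.
train = [("a great movie", 1), ("truly wonderful film", 1),
         ("boring and bad", 0), ("a terrible mess", 0)] * 10
test = [("wonderful film", 1), ("terrible movie", 0), ("a film", 1)]
X_tr, y_tr = [x for x, _ in train], [y for _, y in train]
X_te, y_te = [x for x, _ in test], [y for _, y in test]

vec = CountVectorizer().fit(X_tr)

# g: trained on (x, y) pairs -> its held-out log-loss estimates H_V(Y | X).
g = LogisticRegression().fit(vec.transform(X_tr), y_tr)

# g': trained on (null, y) pairs -> its log-loss estimates H_V(Y).
g_null = LogisticRegression().fit(vec.transform([""] * len(y_tr)), y_tr)

# PVI(x -> y) = -log2 g'[null](y) + log2 g[x](y), per held-out instance.
p_cond = g.predict_proba(vec.transform(X_te))
p_null = g_null.predict_proba(vec.transform([""] * len(y_te)))
pvis = [np.log2(p_cond[i, y]) - np.log2(p_null[i, y])
        for i, y in enumerate(y_te)]
for (x, y), v in zip(test, pvis):
    print(f"{x!r} (label {y}): PVI = {v:.3f} bits")

# The mean PVI estimates the V-usable information I_V(X -> Y).
print("I_V estimate:", np.mean(pvis))
```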
Implications
PVI allows us to compare:
- instances within the same dataset
- instances across different datasets
- instances across different models
PVI in Practice
PVI estimates are highly consistent across models
PVI estimates are highly consistent across human annotators
Uncovering Dataset Artefacts
Input Transformations
Some transformations for SNLI (e.g., shuffling the tokens, keeping only the hypothesis, keeping only the premise) isolate which attributes of the input carry usable information; see the sketch below.
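A sketch of what such transformations look like (hypothetical helper functions; the paper's exact transformation set differs):

```python
import random

def shuffle_tokens(premise: str, hypothesis: str) -> str:
    """Destroy word order while keeping the vocabulary."""
    tokens = (premise + " " + hypothesis).split()
    random.shuffle(tokens)
    return " ".join(tokens)

def hypothesis_only(premise: str, hypothesis: str) -> str:
    """Drop the premise: how much can the hypothesis alone reveal?"""
    return hypothesis

def premise_only(premise: str, hypothesis: str) -> str:
    """Drop the hypothesis: the premise alone should be uninformative."""
    return premise

# Re-estimating I_V(tau(X) -> Y) under each transformation tau shows
# which attributes of the input the model family actually relies on.
transforms = [shuffle_tokens, hypothesis_only, premise_only]
```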
Transformation Results
In the DWMW17 hate speech dataset (Davidson et al., 2017), much of the usable information is concentrated in a small set of offensive tokens, a dataset artefact.
Slicing Datasets
Certain attributes are more useful for certain classes.
Certain subsets of each class are more difficult than others
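A sketch of this kind of slicing (hypothetical toy numbers; assumes per-instance PVI values were computed as in the algorithm sketch above):

```python
import pandas as pd

# Hypothetical per-instance results: gold label and PVI, in bits.
df = pd.DataFrame({
    "label": ["entailment", "entailment", "neutral", "neutral",
              "contradiction", "contradiction"],
    "pvi":   [1.9, 0.3, 1.1, -0.4, 2.0, 1.5],
})

# Mean PVI per class: lower-PVI classes are harder for the model family.
print(df.groupby("label")["pvi"].mean().sort_values())

# Within a class, the lowest-PVI instances form its hardest subset.
hardest = df.sort_values("pvi").groupby("label").head(1)
print(hardest)
```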
Key Takeaways
Thank You!
Token-level Artefacts
How do we know whether a token is helpful in deciding the label? One natural check, in the spirit of the paper's token-level analysis: compare an instance's PVI with and without the token, as sketched below.
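A minimal leave-one-out sketch (my framing; `pvi` is assumed to be a callable like the PVI computation in the algorithm sketch above, mapping an input string and label to a PVI value):

```python
def token_importance(text: str, label: int, pvi) -> dict:
    """How much does each token contribute to the instance's PVI?

    Computes PVI(x -> y) - PVI(x without token -> y) for every token:
    a large positive drop means that token was doing the work.
    """
    tokens = text.split()
    base = pvi(text, label)
    scores = {}
    for i, tok in enumerate(tokens):
        reduced = " ".join(tokens[:i] + tokens[i + 1:])
        scores[tok] = base - pvi(reduced, label)
    return scores
```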