Compositional Explanations of Neurons

Jesse Mu

Joint work with Jacob Andreas

Stanford CogSci Seminar, 2020-03-04

[Image slides: a river scene. Individual neurons appear to detect concepts like “boat”, “water”, and “river”; but do they really? boat? water? river?]

Representation-level analyses (“probing”)

Structural Probe (Hewitt and Manning, 2019) on BERT (Devlin et al., 2018)
Liu et al., 2019
and many more

...but how complex should probing methods be?

“Has the probe just memorized the task?” (Hewitt and Liang, 2019)
“The more complex the probe, the better” (Pimentel et al., 2020)


Analyzing neurons allows us to measure disentanglement of representations and inspect surface-level behavior.

Neuron Interpretability

NeuroX (Dalvi et al., 2018; also see Radford et al., 2017; Bau et al., 2019)

NetDissect (Bau et al., 2017; updated in Bau et al., 2020, PNAS; also see Carter et al., 2019; Olah et al., 2020)

NetDissect (Bau et al., 2017)

[Maximally activating images for a neuron; NetDissect labels this neuron “bullring”.]

This work: Neurons are not just simple feature detectors, but rather encode complex concepts composed of multiple primitives!

bullring OR pitch OR volleyball court OR batters box OR baseball stadium OR tennis court OR badminton court AND (NOT football field) AND (NOT railing)

NetDissect scores each neuron–concept pair by IoU: the intersection over union of the neuron’s thresholded activation mask and the concept’s annotation mask.

Given compositional operators AND, OR, and NOT, construct logical forms via beam search.
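As a minimal illustration (not the authors' implementation; concept masks are assumed to be flat 0/1 tuples, and `iou`, `apply_op`, and `beam_search` are invented names), the IoU scoring and compositional beam search might look like:

```python
def iou(a, b):
    # Intersection over union of two flat boolean (0/1) masks.
    inter = sum(x & y for x, y in zip(a, b))
    union = sum(x | y for x, y in zip(a, b))
    return inter / union if union else 0.0

def apply_op(op, m1, m2):
    # Elementwise AND / OR of two masks.
    return tuple((x & y) if op == "AND" else (x | y) for x, y in zip(m1, m2))

def negate(m):
    return tuple(1 - x for x in m)

def beam_search(neuron_mask, concepts, length=3, beam_size=5):
    # concepts: dict name -> 0/1 mask. Start from single concepts, then
    # compose with AND, OR, and AND NOT, keeping the best-scoring beam.
    beam = sorted(
        ((iou(m, neuron_mask), name, m) for name, m in concepts.items()),
        key=lambda t: t[0], reverse=True,
    )[:beam_size]
    for _ in range(length - 1):
        candidates = list(beam)
        for _, form, mask in beam:
            for name, cmask in concepts.items():
                for op, m2 in (("AND", cmask), ("OR", cmask), ("AND NOT", negate(cmask))):
                    new = apply_op(op.split()[0], mask, m2)
                    candidates.append((iou(new, neuron_mask), f"({form} {op} {name})", new))
        beam = sorted(candidates, key=lambda t: t[0], reverse=True)[:beam_size]
    return beam[0]  # (score, formula, mask)
```

For a neuron that fires on both water and river pixels, the search should discover a disjunction like `(water OR river)` rather than either concept alone.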

3 Questions

  1. Do neurons learn compositional concepts?
  2. Does interpretability relate to model performance?
  3. Can we use our explanations to probe model behavior?

1. Do neurons learn compositional concepts?

We probe the final convolutional layer (before the softmax) of a ResNet-18 trained on the Places365 scene prediction task...

...then generate explanations from the primitive concepts (incl. objects, parts, scenes, and colors) in the Broden dataset.

68% increase in explanation quality (0.059 → 0.099 IoU).

[Charts breaking down the generated explanations: 39%, 22%, 8%, 31%.]

2. Does interpretability relate to model performance?

What is the model accuracy on inputs where the neuron is active?

r = 0.31, p < 1e−13
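Concretely, this measurement can be sketched per neuron as follows (illustrative helper names, not the paper's code; `active` and `correct` are assumed per-input boolean flags):

```python
def neuron_accuracy(active, correct):
    # Model accuracy restricted to the inputs where this neuron fires.
    hits = [c for a, c in zip(active, correct) if a]
    return sum(hits) / len(hits) if hits else float("nan")

def pearson_r(xs, ys):
    # Plain Pearson correlation, e.g. between per-neuron explanation
    # IoU and per-neuron accuracy.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)
```

Computing `neuron_accuracy` for every neuron and correlating it with explanation quality yields the reported r.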

3. Can explanations help us probe model behavior?

Target class: swimming hole.

Units probed: 324 (ResNet18), 483 (AlexNet), 304 (ResNet50), 326 (DenseNet161).

Explanations (IoUs 0.38, 0.27, 0.29): (water OR river) AND (NOT blue); forest-broad OR waterfall OR forest-needle; creek OR waterfall OR desert-sand.

All four models originally predict swimming hole; after manipulating the inputs according to the explanations, they predict grotto, grotto, grotto, and hot spring.

3. Can explanations help us probe model behavior?

Target class: clean room.

Units probed: 93 (ResNet18), 413 (AlexNet), 473 (ResNet50), 209 (DenseNet161).

Explanations (IoUs 0.34, 0.32, 0.34): pool table OR machine OR bank vault; martial arts gym OR ice OR fountain; batters box OR martial arts gym OR clean room.

All four models originally predict corridor; after manipulating the inputs, they predict clean room, alcove, igloo, and corridor.

3. Can explanations help us probe model behavior?

Target class: viaduct.

Units probed: 347 (ResNet18), 26 (AlexNet), 378 (ResNet50), 308 (DenseNet161).

Explanations (IoUs 0.48, 0.36, 0.46): aqueduct OR viaduct OR cloister-indoor; bridge OR viaduct OR aqueduct; washer OR laundromat OR viaduct.

All four models originally predict forest path; after manipulating the inputs, they predict viaduct, viaduct, viaduct, and laundromat.


Natural language inference (NLI)

Pre: A woman in a light blue jacket is riding a bike.

Hyp: A woman in a jacket riding a bike. → entailment
Hyp: A woman in a jacket riding a bus. → contradiction
Hyp: A woman in a jacket riding a bike to the store. → neutral

Natural language “inference”? (Poliak et al., 2018)

A model given both Pre and Hyp reaches 78% accuracy...
...but a model given only the Hyp still reaches 69% accuracy! (chance: 33%)

Rule: predict entail when all hypothesis words appear in the premise (McCoy et al., 2019). This rule alone achieves 90% accuracy.

Adversarial NLI datasets (accuracy: standard → adversarial):

HANS (McCoy et al., 2019): 78% → <10%
Counterfactual SNLI (Kaushik et al., 2020): 72% → 40%
...and more

Model: Pre and Hyp are each encoded by an RNN, and the combined representation predicts the label, e.g. entailment (Bowman et al., 2016).

Bag-of-words concepts

Pre: A woman in a light blue jacket is riding a bike.
Hyp: A woman in a jacket riding a bike.

Concepts: {pre:woman, pre:light, pre:NN, pre:JJ, hyp:woman, hyp:jacket, hyp:NN, hyp:VBG, overlap-75%}

Compositions: AND, OR, NOT, NEIGHBORS

NEIGHBORS(bike) = (bike OR bicycle OR biking OR car OR bus)
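A rough sketch of how such concept features might be extracted from a sentence pair (illustrative names; POS tags are passed in as a stub dictionary rather than produced by a real tagger, and the 25%-bucketing of overlap is an assumption based on the "overlap-75%" feature shown above):

```python
def concept_features(pre, hyp, pos=None):
    # pos: optional dict word -> POS tag; a real pipeline would run a tagger.
    pre_w, hyp_w = pre.lower().split(), hyp.lower().split()
    feats = {f"pre:{w}" for w in pre_w} | {f"hyp:{w}" for w in hyp_w}
    if pos:
        feats |= {f"pre:{pos[w]}" for w in pre_w if w in pos}
        feats |= {f"hyp:{pos[w]}" for w in hyp_w if w in pos}
    # Fraction of hypothesis tokens that also occur in the premise,
    # bucketed to the nearest 25% (yielding features like "overlap-75%").
    overlap = sum(w in set(pre_w) for w in hyp_w) / len(hyp_w)
    feats.add(f"overlap-{round(overlap * 4) * 25}%")
    return feats
```

Each input sentence pair then becomes a set of binary concepts, over which AND/OR/NOT (and NEIGHBORS) formulas can be searched just as with image masks.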

1. Do neurons learn compositional concepts?

Probing a BiLSTM baseline model (Bowman et al., 2016) on the SNLI validation set.


Some of the resulting explanations encode lexical overlap heuristics (McCoy et al., 2019).

Others pick out words with high pointwise mutual information (PMI) with the class label (Gururangan et al., 2018).
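PMI between a word and a label can be estimated from counts (a sketch with illustrative names; `examples` pairs each hypothesis's word set with its gold label):

```python
from collections import Counter
from math import log

def pmi(examples):
    # examples: list of (word_set, label).
    # Returns {(word, label): log p(word, label) / (p(word) p(label))}.
    word_counts, label_counts, joint = Counter(), Counter(), Counter()
    n = len(examples)
    for ws, lab in examples:
        label_counts[lab] += 1
        for w in ws:
            word_counts[w] += 1
            joint[w, lab] += 1
    return {
        (w, lab): log(c * n / (word_counts[w] * label_counts[lab]))
        for (w, lab), c in joint.items()
    }
```

Words like "nobody" or "sleeping", which annotators disproportionately use when writing contradictions, show up with high PMI for that label.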

2. Does interpretability relate to model performance?

Interpretability is not a priori correlated with performance; it depends on the concept space. Are we searching for meaningful abstractions or spurious heuristics?

3. Can explanations help us probe model behavior?

[Example slides.]

Parting thoughts

Compositional explanations:

  • Identify abstractions, polysemanticity, and spurious correlations localized to specific neurons in deep models
  • Can disambiguate better vs. worse neurons w.r.t. task performance
  • Allow us to predictably manipulate model behavior

Future questions:

  • Can we look at connections between layers to better understand information flow within a network? (Circuits; Olah et al., 2020)
  • Can we use interpretability as a regularization signal during training?

Thanks!

Joint work with Jacob Andreas. Thanks to David Bau, Eric Chu, Alex Tamkin, Mike Wu, and Noah Goodman.

Funding from the NSF GRFP and the Office of Naval Research.

Code: github.com/jayelm/compexp
arXiv: arxiv.org/abs/2006.14032

Additional Slides


Concept Uniqueness


Local explanations: LIME, Anchors (Ribeiro et al., 2016, 2018)

Natural language explanations (Andreas et al., 2017; Hendricks et al., 2016, 2018a,b)

Grad-CAM (Selvaraju et al., 2018)

3. Can explanations help us probe model behavior?

Target class: fire escape.

Units probed: 143 (ResNet18), 199 (AlexNet), 30 (ResNet50), 104 (DenseNet161).

Explanations (IoUs 0.57, 0.26, 0.30): fire escape OR bridge OR staircase; house OR porch OR townhouse; cradle OR autobus OR fire escape.

All four models originally predict street; after manipulating the inputs, they predict fire escape, street, cradle, and fire escape.

Legend: successful change to the intended class; change to a different class (e.g. “aqueduct”); no change.
