Compositional Explanations of Neurons
Jesse Mu
Joint work with Jacob Andreas
Stanford CogSci Seminar, 2020-03-04
[Figure: which concepts (river, boat, water) does the model's representation actually encode?]
Representation-level analyses ("probing")
Structural Probe (Hewitt and Manning, 2019) applied to BERT (Devlin et al., 2018)
Liu et al., 2019, and many more
...but how complex should probing methods be?
Hewitt and Liang, 2019: "Has the probe just memorized the task?"
Pimentel et al., 2020: "The more complex the probe, the better"
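At its simplest, a probe is just a shallow classifier trained on frozen representations. A toy sketch with random stand-in data (not any particular paper's setup):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in data: frozen 768-d representations and a 5-way property label
# (e.g. a part-of-speech tag) we try to decode from them.
reps = np.random.randn(1000, 768)
labels = np.random.randint(0, 5, size=1000)

probe = LogisticRegression(max_iter=1000).fit(reps[:800], labels[:800])
print("probe accuracy:", probe.score(reps[800:], labels[800:]))
```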
Analyzing neurons allows us to measure disentanglement of representations and inspect surface-level behavior.
Neuron Interpretability
NeuroX (Dalvi et al., 2018; also see Radford et al., 2017; Bau et al., 2019)
NetDissect (Bau et al., 2017; new: Bau et al., 2020, PNAS; also see Carter et al., 2019; Olah et al., 2020)
NetDissect (Bau et al., 2017)
Maximally activating images; NetDissect's label for this neuron: bullring
This work: neurons are not just simple feature detectors, but rather encode complex concepts composed of multiple primitives!
Our explanation: bullring OR pitch OR volleyball court OR batters box OR baseball stadium OR tennis court OR badminton court AND (NOT football field) AND (NOT railing)
NetDissect scores a neuron n against a concept C by the intersection over union of the neuron's thresholded activation mask M_n(x) and the concept's annotation mask C(x), summed over the dataset:

IoU(n, C) = Σ_x |M_n(x) ∧ C(x)| / Σ_x |M_n(x) ∨ C(x)|
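A minimal sketch of this IoU score over binarized masks (function and argument names are mine, not the compexp implementation):

```python
import numpy as np

def iou(neuron_masks: np.ndarray, concept_masks: np.ndarray) -> float:
    """IoU between a neuron's binarized activation masks and a concept's
    annotation masks, summed over the dataset.

    Both arrays are boolean with shape (num_images, H, W).
    """
    intersection = np.logical_and(neuron_masks, concept_masks).sum()
    union = np.logical_or(neuron_masks, concept_masks).sum()
    return intersection / union if union > 0 else 0.0
```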
Given compositional operators AND, OR, and NOT, construct logical forms via beam search
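A compressed sketch of that beam search (helper names and the exact candidate expansion are mine; the released code at github.com/jayelm/compexp differs in details):

```python
import numpy as np

def _iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU between two boolean mask arrays."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union > 0 else 0.0

def explain_neuron(neuron_masks, primitives, beam_size=10, max_length=3):
    """Beam search over logical forms built from primitive concept masks
    with AND, OR, and NOT, scored by IoU with the neuron's masks.

    primitives: dict mapping concept name -> boolean mask array.
    """
    # Beam entries are (formula string, boolean mask array);
    # start from the best-scoring single primitives.
    beam = sorted(primitives.items(),
                  key=lambda c: _iou(neuron_masks, c[1]),
                  reverse=True)[:beam_size]
    for _ in range(max_length - 1):
        candidates = list(beam)
        for formula, mask in beam:
            for name, pmask in primitives.items():
                candidates += [
                    (f"({formula} AND {name})", mask & pmask),
                    (f"({formula} OR {name})", mask | pmask),
                    (f"({formula} AND NOT {name})", mask & ~pmask),
                ]
        candidates.sort(key=lambda c: _iou(neuron_masks, c[1]), reverse=True)
        beam = candidates[:beam_size]
    return beam[0]  # best (formula, mask) found
```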
3 Questions
1. Do neurons learn compositional concepts?
2. Does interpretability relate to model performance?
3. Can explanations help us probe model behavior?
1. Do neurons learn compositional concepts?
We probe the final convolutional layer (before softmax) of a ResNet-18 trained on the Places365 scene prediction task, then generate explanations from the primitive concepts (including objects, parts, scenes, and colors) in the Broden dataset.
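A sketch of the activation-extraction step, assuming a torchvision ResNet-18 with Places365 weights loaded separately; the 0.995 quantile and the per-image statistic here are illustrative stand-ins for the dataset-level threshold:

```python
import torch
import torchvision.models as models

# Hypothetical setup: Places365 weights would be loaded into this model.
model = models.resnet18(num_classes=365)
model.eval()

activations = {}

def hook(module, inputs, output):
    # output: (batch, 512, H, W) feature maps from the last conv stage.
    activations["layer4"] = output.detach()

model.layer4.register_forward_hook(hook)

with torch.no_grad():
    _ = model(torch.randn(1, 3, 224, 224))  # stand-in for a Places365 image

feats = activations["layer4"]  # (1, 512, 7, 7)
# Binarize each neuron at (say) its top-0.5% activation quantile; the real
# threshold is computed over the whole dataset, not per image.
thresh = torch.quantile(feats.flatten(2), 0.995, dim=2)[..., None, None]
masks = feats > thresh
```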
68% increase in explanation quality (0.059 → 0.099 IoU)
[Chart: breakdown of explanation types across neurons: 39%, 31%, 22%, 8%]
2. Does interpretability relate to model performance?
What is the model accuracy on inputs where the neuron is active?
Across neurons, this accuracy correlates with explanation IoU: r = 0.31, p < 1e−13
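The correlation itself is a one-liner; neuron_ious and neuron_accs are hypothetical per-neuron arrays gathered from the probing run:

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical per-neuron statistics from the probing run.
neuron_ious = np.random.rand(512)   # explanation IoU per neuron
neuron_accs = np.random.rand(512)   # model accuracy when each neuron fires

r, p = pearsonr(neuron_ious, neuron_accs)
print(f"r = {r:.2f}, p = {p:.1e}")
```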
3. Can explanations help us probe model behavior?
Use each neuron's explanation to construct "copy-paste" examples that change the model's prediction.

Class: swimming hole

Model       | Neuron | Explanation                                 | IoU  | Prediction (before → after edit)
ResNet18    | 324    | (water OR river) AND (NOT blue)             |      | swimming hole → grotto
AlexNet     | 483    | forest-broad OR waterfall OR forest-needle  | 0.38 | swimming hole → grotto
ResNet50    | 304    | creek OR waterfall OR desert-sand           | 0.27 | swimming hole → grotto
DenseNet161 | 326    |                                             | 0.29 | swimming hole → hot spring
Class: clean room

Model       | Neuron | Explanation                                   | IoU  | Prediction (before → after edit)
ResNet18    | 93     | pool table OR machine OR bank vault           |      | corridor → clean room
AlexNet     | 413    | martial arts gym OR ice OR fountain           | 0.34 | corridor → alcove
ResNet50    | 473    | batters box OR martial arts gym OR clean room | 0.32 | corridor → igloo
DenseNet161 | 209    |                                               | 0.34 | corridor → corridor
Class: viaduct

Model       | Neuron | Explanation                            | IoU  | Prediction (before → after edit)
ResNet18    | 347    | aqueduct OR viaduct OR cloister-indoor |      | forest path → viaduct
AlexNet     | 26     | bridge OR viaduct OR aqueduct          | 0.48 | forest path → viaduct
ResNet50    | 378    | washer OR laundromat OR viaduct        | 0.36 | forest path → viaduct
DenseNet161 | 308    |                                        | 0.46 | forest path → laundromat
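A toy sketch of one such copy-paste edit (file names and paste location are hypothetical); the real probes choose patches suggested by each neuron's explanation, e.g. blue water because the swimming-hole explanation contains "AND (NOT blue)":

```python
import torch
from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

def predict(model, img):
    """Return the model's argmax class index for a PIL image."""
    with torch.no_grad():
        return model(preprocess(img).unsqueeze(0)).argmax(dim=1).item()

# Hypothetical files: a swimming-hole scene and a blue-water patch.
scene = Image.open("swimming_hole.jpg").convert("RGB")
patch = Image.open("blue_water.jpg").convert("RGB").resize((80, 80))

edited = scene.copy()
edited.paste(patch, (72, 120))  # paste location is arbitrary here

# With a Places365 model in hand:
# predict(model, scene)   -> e.g. swimming hole
# predict(model, edited)  -> e.g. grotto
```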
Natural language inference (NLI)
Pre: A woman in a light blue jacket is riding a bike.
Hyp: A woman in a jacket riding a bike. → entailment
Hyp: A woman in a jacket riding a bus. → contradiction
Hyp: A woman in a jacket riding a bike to the store. → neutral
Natural language "inference" (NLI)
Pre: A woman in a light blue jacket is riding a bike.
Hyp: A woman in a jacket riding a bike. → entailment
Full model (Pre + Hyp → M → entail): 78% accuracy (Poliak et al., 2018)
Hypothesis-only (Hyp → M → entail): 69% accuracy! (chance is 33%) (Poliak et al., 2018)
Rule: predict entail when all hyp words are in pre (sketched in code below): 90% accuracy (McCoy et al., 2019)
Adversarial NLI datasets:
HANS (McCoy et al., 2019): 78% → <10%
Counterfactual SNLI (Kaushik et al., 2020): 72% → 40%
...and more
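The McCoy et al. rule is simple enough to write down; a toy version with whitespace tokenization:

```python
def overlap_rule(premise: str, hypothesis: str) -> str:
    """Heuristic from McCoy et al., 2019: predict 'entail' whenever every
    hypothesis word also appears in the premise (toy tokenization)."""
    pre = set(premise.lower().split())
    hyp = set(hypothesis.lower().split())
    return "entail" if hyp <= pre else "other"

# The slide's example (punctuation stripped for the toy tokenizer):
print(overlap_rule(
    "a woman in a light blue jacket is riding a bike",
    "a woman in a jacket riding a bike",
))  # -> entail
```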
Model (Bowman et al., 2016): Pre → RNN; Hyp → RNN; combined → entailment
Bag-of-words concepts:
Pre: A woman in a light blue jacket is riding a bike.
Hyp: A woman in a jacket riding a bike.
→ {pre:woman, pre:light, pre:NN, pre:JJ, hyp:woman, hyp:jacket, hyp:NN, hyp:VBG, overlap-75%}
Compositions: AND, OR, NOT, NEIGHBORS
NEIGHBORS(bike) = (bike OR bicycle OR biking OR car OR bus)
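One plausible way to realize NEIGHBORS, via cosine similarity over word vectors (the vectors dict, e.g. GloVe embeddings, is an assumption here):

```python
import numpy as np

def neighbors(word, vectors, k=4):
    """Return `word` plus its k nearest words by cosine similarity.
    `vectors` maps word -> embedding vector (e.g. GloVe)."""
    v = vectors[word]
    def cosine(u):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
    ranked = sorted((w for w in vectors if w != word),
                    key=lambda w: cosine(vectors[w]), reverse=True)
    return [word] + ranked[:k]

# With GloVe-style vectors, neighbors("bike", vectors) might return
# ["bike", "bicycle", "biking", "car", "bus"], as on the slide.
```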
1. Do neurons learn compositional concepts?
Probing a BiLSTM baseline model (Bowman et al., 2016) on the SNLI validation set.
Explanations recover lexical overlap heuristics (McCoy et al., 2019) and words with high pointwise mutual information (PMI) with a class label (Gururangan et al., 2018).
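A sketch of the PMI statistic (my helper, not Gururangan et al.'s code):

```python
import math

def pmi(examples, word, label):
    """PMI between a hypothesis word and a class label:
    log [ p(word, label) / (p(word) * p(label)) ].
    `examples` is a list of (hypothesis_tokens, label) pairs."""
    n = len(examples)
    n_word = sum(word in toks for toks, _ in examples)
    n_label = sum(lab == label for _, lab in examples)
    n_both = sum(word in toks and lab == label for toks, lab in examples)
    if n_both == 0:
        return float("-inf")
    return math.log((n_both / n) / ((n_word / n) * (n_label / n)))

# On SNLI, words like "sleeping" have high PMI with "contradiction"
# (an annotation artifact noted by Gururangan et al., 2018).
```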
2. Does interpretability relate to model performance?
Interpretability is not a priori correlated with performance; it depends on the concept space.
Are we searching for meaningful abstractions or spurious heuristics?
3. Can explanations help us probe model behavior?
Parting thoughts
Compositional explanations
Future questions
Thanks!
Jacob Andreas
Thanks to David Bau, Eric Chu, Alex Tamkin, Mike Wu, and Noah Goodman.
Funding from NSF GRFP and Office of Naval Research.
Code: github.com/jayelm/compexp
arXiv: arxiv.org/abs/2006.14032
Additional Slides
Concept Uniqueness
Local explanations: LIME, Anchors (Ribeiro et al., 2016, 2018)
Natural language explanations (Andreas et al., 2017; Hendricks et al., 2016, 2018a,b)
Grad-CAM (Selvaraju et al., 2018)
3. Can explanations help us probe model behavior?
Class: fire escape

Model       | Neuron | Explanation                        | IoU  | Prediction (before → after edit)
ResNet18    | 143    | fire escape OR bridge OR staircase |      | street → fire escape
AlexNet     | 199    | house OR porch OR townhouse        | 0.57 | street → street
ResNet50    | 30     | cradle OR autobus OR fire escape   | 0.26 | street → cradle
DenseNet161 | 104    |                                    | 0.30 | street → fire escape
Result legend for the copy-paste examples: successful change to intended class; change to a different class (e.g. "aqueduct"); no change.