What does it take to convince ourselves that a system is exhibiting compositionality?
Najoung Kim
May 4, 2025, RepL4NLP
najoung@bu.edu
Mostly AI, but humans too!!
Agenda
Mostly a “takes” talk; the empirical work and discussions will be brief.
What do we actually mean by compositionality?
Following the tradition in Linguistics & Philosophy: it’s about meaning
Cited from Partee (1984, p. 281); brackets mine (Kim 2021)
But it is well known that this broad construal of the Principle of Compositionality is not very meaningful as is.
The generality of the Principle of Compositionality
Why? The literature on this is vast, but here’s an intuition:
Fig from https://cameronrwolfe.substack.com
Take these to be parts
Take these to be meanings
Take the triangular masking scheme as defining “a way that parts are syntactically combined” (or if no masking, everything can combine with everything)
The output is literally a function of “the meanings of parts and the way they are syntactically combined”
Have we solved compositionality? :)
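To make the vacuity concrete, here is a minimal sketch (Python/numpy; the toy dimensions, random embeddings, and tied queries/keys/values are my own illustration, not part of the figure). If we grant that token embeddings are the “meanings of parts” and the causal mask is “the way parts are syntactically combined,” then one masked attention layer’s output is, trivially, a function of exactly those two things:

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_attention(E, mask):
    # The output is literally a function of (E, mask): "meanings" + "syntax".
    scores = E @ E.T / np.sqrt(E.shape[1])    # toy: queries = keys = values = E
    scores = np.where(mask, scores, -np.inf)  # the mask fixes what combines with what
    return softmax(scores) @ E

T, d = 4, 8
E = np.random.randn(T, d)                      # "the meanings of the parts"
causal = np.tril(np.ones((T, T), dtype=bool))  # "the way they are syntactically combined"
out = masked_attention(E, causal)              # f(parts, combination)... compositional?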
If your answers are “no”...
Some objections may be:
Then…
Non-vacuous claims of compositionality require commitment to the compositional machinery.
Aside: Scopes of claims
I’ve used linguistic compositionality to make the point, but the point about commitment generalizes to other domains of compositionality in Cognitive Science and AI (which need not concern “meaning” per se).
Visual perception (from Lande 2024)
Question-answering (from Liu et al. 2022)
Aside 2: about co-occurring concepts
Hupkes et al. (2020)
Kudo et al. (2023)
These are not properties of compositional systems (in general), nor are they generally “tests for compositionality”, nor are they “types of composition”. Certain compositional commitments can give rise to all or a subset of these properties. If one claims that “these are hallmarks of a compositional system”, they are likely making certain implicit commitments.
Will skip, but feel free to ask me about them later
No commitments, no (meaningful) compositionality! (NCNC)
(Corollary: no system is “generally” compositional)
With this bottom line in mind, let’s dive into some specifics…
Notions useful to tease apart
Process compositionality
State compositionality
Defining the target of analysis
*I (currently) view states as relating to representations and processes as relating to computations, both of which could be analyzed at all (Marrean) levels of analysis, for those of you who care
Process vs. State Compositionality
(Note: I’m adopting Nefdt (2020)’s terminology, but I am not using the terms in exactly the same sense as he does, especially regarding claims about correspondence to Marr’s levels.)
Depending on your commitments, your system may exhibit both. But they can be dissociated!
Some examples from Nefdt (2020); Coppock & Champollion (2024)
Agnetha loves Bjorn:
Aristotle taught a king:
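For concreteness, here is a minimal sketch of what a process-compositional derivation of “Agnetha loves Bjorn” could look like, in the standard function-application style (the lambda encoding and string “meanings” are illustrative stand-ins for model-theoretic objects, not the textbook’s exact setup):

loves = lambda obj: lambda subj: f"loves({subj}, {obj})"  # transitive verb: e -> (e -> t)
vp = loves("bjorn")        # the verb combines with its object first...
sentence = vp("agnetha")   # ...then the VP combines with the subject
print(sentence)            # -> loves(agnetha, bjorn)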
3 + (5 + 9) * 2:
[Figure: a classifier network that sums its inputs and outputs 0 or 1]
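A toy sketch of the dissociation (the threshold, the expressions, and the “memorized” table below are invented for illustration): a recursive evaluator and a pure lookup can agree on every observed input, but only the former embodies a compositional process, and only it generalizes to unseen inputs; this previews the observation on the next slide.

import ast, operator as op

OPS = {ast.Add: op.add, ast.Mult: op.mul}

def evaluate(node):
    # Process-compositional: the value of the whole is computed from the
    # values of its parts, following the syntax of the expression.
    if isinstance(node, ast.Constant):
        return node.value
    return OPS[type(node.op)](evaluate(node.left), evaluate(node.right))

def classify_by_process(expr):
    return int(evaluate(ast.parse(expr, mode="eval").body) > 20)

# A memorizing alternative: identical behavior on everything it has stored,
# but no compositional process to appeal to.
memorized = {"3 + (5 + 9) * 2": 1, "2 * 3": 0}
def classify_by_lookup(expr):
    return memorized[expr]  # KeyError on unseen inputs: no predicted generalization

assert classify_by_process("3 + (5 + 9) * 2") == classify_by_lookup("3 + (5 + 9) * 2")
print(classify_by_process("(5 + 9) * 2 + 3"))  # unseen input: only the process generalizes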
Taking a step back: why do we want compositionality?
In the context of AI, we often cite the following reasons:
Recall: 3 + (5 + 9) * 2
Observation: Process compositionality predicts (specific) generalizations when not all possible I/O mappings are observed; state compositionality does not.
→ Ideally, we want to (at least) test for process compositionality.
What do we want from an AI system?
Very broadly: Doing well on tasks that we subject them to.
Does this require process-level compositionality for every query?
No! If a memorized response is a great one, using that is fine.
But process-level compositionality allows the models to go beyond memory limitations (in predictable ways that the process allows for).
Main take: what we really want to evaluate is the availability of a process-level compositional route
The availability of a process-level compositional route
…as opposed to whether the model takes this route for every possible complex input.
The same observation holds for people too: people store and retrieve chunks of frequent expressions (that are not idioms, like “I don’t know”)
cf. Baggio (2021)’s “at least one (strictly compositional) meaning” argument.
What empirical evidence would convince us that this route is available?
Notions useful to tease apart
Process compositionality
State compositionality
Behavioral evidence
Non-behavioral evidence
Defining the target of analysis
Available observations
Behavioral evidence
Working definition of behavioral:
Note: generalization (responding to unseen stimuli) is also behavior!
Non-behavioral evidence
Non-behavioral evidence concerns everything else.
This, for humans, may include:
For models that are commonly used today, what are they?
🤔
Notions useful to tease apart
Process compositionality
State compositionality
Behavioral evidence
Non-behavioral evidence
“Neural” parts of the model
Non-neural parts of the model
Defining the target of analysis
Available observations
Specificities of the object of study
Today’s language models are neurosymbolic models
https://en.wikipedia.org/wiki/Neuro-symbolic_AI
…at a certain level of abstraction, it is symbols in, symbols out!
Stimulus: What is the weather like in Albuquerque?
Things in between (“model internals”)
Acting upon/interacting with the stimulus: Let me look that up for you…
Notions useful to tease apart
Process compositionality
State compositionality
Behavioral evidence
Non-behavioral evidence (henceforth: “mechanistic” or “model internal” evidence)
“Neural” parts of the model
Non-neural parts of the model
Not equivalent in general; they often happen to align this way because of the dominant paradigm
Defining the target of analysis
Available observations
Specificities of the object of study
Relation between types of compositionality / evidence
How do the following types of evidence help us convince ourselves that a system exhibits process / state compositionality?
Behavioral evidence & State compositionality
Relation to state compositionality: room for philosophical qualms
What is the weather like in Albuquerque?
Let me look that up for you…
Is this state-compositional? (Debatable)
What is the weather like in Albuquerque?
is_warm(albuquerque, May 4th)
Pragmatically: if we define some sort of metalanguage that leads to (linearized) structured expressions, we could see whether the output conforms to the syntax of this metalanguage (philosophical qualms still remain)
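A sketch of this pragmatic move, assuming a metalanguage of linearized predicate-argument expressions (the tiny grammar below is my own minimal invention, and I normalize “May 4th” to may_4th for tokenization; the point is only that syntactic conformance is mechanically checkable):

import re

TOKEN = r"[a-z_][a-z0-9_]*"
EXPR = re.compile(rf"^{TOKEN}\({TOKEN}(?:,\s*{TOKEN})*\)$")

def conforms(output: str) -> bool:
    # Purely syntactic check: does the output parse as pred(arg, ..., arg)?
    return EXPR.match(output) is not None

print(conforms("is_warm(albuquerque, may_4th)"))   # True
print(conforms("Let me look that up for you..."))  # False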
Behavioral evidence & Process compositionality
What about process compositionality?
I said earlier that only process compositionality predicts generalization behavior. One way to go about this is to test whether the models’ generalization behavior conforms to those predictions.
However…
Behavioral evidence is necessary but not sufficient
This behavioral alignment is a necessary but not sufficient condition for process compositionality.
If a model fails on an unseen example, we know for sure that the model is not employing the hypothesized mechanism (for handling that example).
However, if a model succeeds on the test, we need additional evidence that the model is indeed succeeding in the “right way”, which is difficult to derive solely from observation of behavior. (Argument from rule induction from finite examples)
cf. Researchers’ views vary on this! (McCurdy et al. 2024)
Behavioral tests
My prior work COGS (Kim and Linzen 2020) & SLOG (Li et al. 2023) concerned behavioral tests showing that fairly recent models still fail on certain types of predicted generalizations.
Given that models can do this…
Fig. from Weissenhorn et al. (2022)
Can they generalize to deeper nested structures?
But suppose they do very well. What then? How can we strengthen our inference?
Behavioral evidence′: ruling out plausible alternatives by building models of system behavior
Behavioral tests for compositionality involve positing a model of composition, deriving predictions from it, and seeing if predictions align with the system’s behavior.
Proposal: While we cannot directly get sufficient evidence that this is an outcome of the hypothesized compositional process, we can do similar things for competing hypotheses (i.e., thinking about plausible alternatives: if not this, then what?)
(This capitalizes on the fact that behavior can be used to rule things out)
Behavioral evaluation of Adj+N membership inferences
Ross, Davidson & Kim (2024), Ross, Kim & Davidson (2024)
fake professor
professor
bored professor
Bored professors are professors
Fake professors are not professors
Subsective inference
Privative inference
Properties of adjectives?
Hypothesis about compositional machinery
Privativity inferences are context-sensitive
“fake crowd”
paid actors (still a crowd)
dummies on a movie set (probably not a crowd)
Ross et al. (2024)
Theoretical commitments
Not super important for today’s purposes; the important point is that these commitments make specific predictions
Collect human judgments…
Steering effect of context in predicted ways for novel Adj+Noun combinations (0 frequency in C4)
[Plot: goodness of membership inference]
Subject LLMs to the test
[Setup: CONTEXT + 5-shot prompt → LLM; measured via logprobs or a single response]
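A sketch of the scoring side of this setup; score() below is a stand-in for whatever LM API returns continuation log-probabilities (not a real library call), and the prompt template is illustrative rather than the paper’s exact one:

def score(prompt: str, continuation: str) -> float:
    """Stand-in for an LM call returning log P(continuation | prompt)."""
    raise NotImplementedError  # plug in the LM under test here

def membership_inference(context: str, adj: str, noun: str) -> str:
    # Illustrative template; the actual 5-shot prompt would go here.
    prompt = f"{context}\nQuestion: Is a {adj} {noun} a {noun}?\nAnswer:"
    log_yes = score(prompt, " Yes")  # subsective reading: Adj Ns ARE Ns
    log_no = score(prompt, " No")    # privative reading: Adj Ns are NOT Ns
    return "subsective" if log_yes > log_no else "privative"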
Behavioral evidence
(Probably) not memorization!
A case where an LLM does show behavioral alignment with predicted generalizations. What now?
Behavioral evidence′: ruling out (pure) inference by analogy
Mechanistic evidence & State/Process compositionality
What is the weather like in Albuquerque?
Let me look that up for you…
Some philosophizing to do here about what the commitment-relevant states and processes are
Claim: State and process compositionality are assessable with “symbols-in-neurons” approaches: e.g., tensor product representations and operations (not the only way)
How might we look for process compositionality?
Work in progress with Aditya Yedetore
Preliminary commitments (drawing from Smolensky 1990, i.a.):
An idea: The representations and computations of neural networks that perform symbol manipulation tasks may be approximated by Tensor Product Representations and Operations that describe the hypothesized compositional machinery.
If this is the case, we gain direct evidence for state and process compositionality.
We need very precise hypotheses
Symbol manipulation hypothesis: Copying in a GRU
Let’s say the input sequence is bcd. The network needs to construct something like b ⊗ r_1 + c ⊗ r_2 + d ⊗ r_3. (~b in the first position, c in the second, d in the third)
A GRU processes inputs sequentially (reps of b, c, d are provided as input). The target representation can be constructed if, at each timestep, the binding of the input and the new role is added to the previous step’s representation.
h_1: b ⊗ r_1
h_2: b ⊗ r_1 + c ⊗ r_2
h_3: b ⊗ r_1 + c ⊗ r_2 + d ⊗ r_3
This requires a “Current role fetcher” as a processing mechanism: basically, getting the role of the current position given the previous hidden representation
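A numpy sketch of the hypothesized construction (the dimensions and random filler/role vectors are illustrative choices, not the trained network’s): the state is built timestep by timestep as a sum of outer-product bindings, and, with linearly independent roles, each filler can be unbound again via the dual role vectors.

import numpy as np

rng = np.random.default_rng(0)
d_f, d_r = 8, 4                                       # toy filler / role dimensions
fillers = {s: rng.standard_normal(d_f) for s in "bcd"}
roles = [rng.standard_normal(d_r) for _ in range(3)]  # r_1, r_2, r_3

# Sequential construction, as the GRU would have to do it:
# h_t = h_{t-1} + (current filler) outer (current role)
h = np.zeros((d_f, d_r))
for t, sym in enumerate("bcd"):
    h = h + np.outer(fillers[sym], roles[t])  # bind the input to the fetched role

# Unbinding check: multiply by the dual (pseudoinverse) role vectors to
# recover each filler from the final state.
U = np.linalg.pinv(np.stack(roles))            # (d_r, 3); columns are dual roles
print(np.allclose(h @ U[:, 1], fillers["c"]))  # True: c was bound to r_2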
Test for state compositionality
Test for process compositionality
State compositionality (of each hidden state)
Extension of McCoy et al. (2019)’s TPDN method
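A simplified sketch of the TPDN-style check (in the actual method, filler and role embeddings are learned by gradient descent; here I assume a hypothesized positional-role scheme and fit only a linear read-out by least squares; H, seqs, and the embedding dictionaries are placeholders for your trained GRU’s data):

import numpy as np

def tpr(filler_vecs, role_vecs, seq):
    # Flattened b (x) r_1 + c (x) r_2 + ... for a symbol sequence.
    return sum(np.outer(filler_vecs[s], role_vecs[t]).ravel()
               for t, s in enumerate(seq))

def tpdn_fit_error(H, seqs, filler_vecs, role_vecs):
    # H: (N, d_h) matrix of the GRU's final hidden states, one per sequence.
    # Fit a linear map W so that H ~ X @ W, where X stacks hypothesized TPRs;
    # low relative error = the states are well approximated by the TPR scheme.
    X = np.stack([tpr(filler_vecs, role_vecs, s) for s in seqs])  # (N, d_f*d_r)
    W, *_ = np.linalg.lstsq(X, H, rcond=None)
    return np.linalg.norm(X @ W - H) / np.linalg.norm(H)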
Process-level compositionality?
We find that the “current role fetcher” mechanism coincides with the W_hz matrix in the GRU
But it makes reference to specific filler-role bindings observed during training, which predicts a lack of generalization to symbols in unseen positions (behaviorally verified): not fully compositional processing in this sense.
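The behavioral verification can be as simple as a position-holdout evaluation (a sketch; model.copy() is a placeholder for the trained GRU’s copying behavior, and the held-out symbol/position pair is assumed to be absent from training):

import itertools

def position_holdout_eval(model, symbols="bcd", held_out=("d", 0), length=3):
    # Evaluate copying only on sequences that place the held-out symbol in the
    # held-out position (unseen during training). A mechanism tied to specific
    # filler-role bindings predicts low accuracy here.
    sym, pos = held_out
    test = [s for s in map("".join, itertools.product(symbols, repeat=length))
            if s[pos] == sym]
    return sum(model.copy(s) == s for s in test) / len(test)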
Summary
(0) No meaningful claims of compositionality without commitments (NCNC)
(1) Availability of process compositionality is what we want in AI systems and therefore what we want to test for
(2) The key behavioral evidence is predicted generalization to unseen examples. However, this is necessary but not sufficient for process compositionality. It can be strengthened by gathering behavioral evidence that rules out plausible alternatives.
Work with Hayley Ross and Kate Davidson
Work with Aditya Yedetore
(3) Mechanistic evidence can help make sufficiency arguments but requires much more precise commitments to machinery, which is often hard even when the phenomenon you're looking at is extremely simple. Tensor product representations and operations are a promising (but not the only) way to conduct this investigation in neural networks.
Aside: getting harder and harder to test properly in today’s AI landscape
Thank you!