1 of 60

What does it take to convince ourselves that a system is exhibiting compositionality?

Najoung Kim

May 4, 2025, RepL4NLP

najoung@bu.edu

Mostly AI, but humans too!!

2 of 60

Agenda


  1. What do we actually mean by compositionality?
  2. Main take: what we actually want in an AI system is the availability of a process-level compositional route
  3. The role of behavioral and mechanistic evidence
    1. Ex. 1: Contextual inferences from adjective + noun compositions
    2. Ex. 2: Looking for Tensor Product Operations for symbol manipulation

Mostly a “takes” talk; the empirical work and discussions will be brief.

3 of 60

What do we actually mean by compositionality?


Following the tradition in Linguistics & Philosophy: it’s about meaning

Cited from Partee (1984, p281); brackets mine (Kim 2021)

But it is well known that this broad construal of the “Principle of Compositionality” is not very meaningful as is.

4 of 60

The generality of the Principle of Compositionality


Why? The literature on this question is vast, but here’s an intuition:

Fig from https://cameronrwolfe.substack.com

Take these to be parts

Take these to be meanings

Take the triangular masking scheme as defining “a way that parts are syntactically combined” (or if no masking, everything can combine with everything)

The output is literally a function of “the meanings of parts and the way they are syntactically combined”

Have we solved compositionality? :)
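To make the provocation concrete, here is a minimal sketch (mine, not from the slides) of this reading: random token vectors stand in for the “meanings of parts”, the triangular mask stands in for “the way they are syntactically combined”, and the output is then literally a function of both.

```python
# A tongue-in-cheek "compositional" system: masked self-attention over parts.
import numpy as np

rng = np.random.default_rng(0)
T, d = 4, 8
parts = rng.normal(size=(T, d))               # take these to be "meanings of parts"

mask = np.tril(np.ones((T, T), dtype=bool))   # the triangular masking scheme: "syntax"

scores = parts @ parts.T / np.sqrt(d)
scores = np.where(mask, scores, -np.inf)      # position i only combines with positions <= i
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

output = weights @ parts  # literally a function of the parts and how they are combined
```

Of course, the objections on the next slide apply: nothing forces us to accept these identifications.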

5 of 60

If your answer is “no”...


Some objections may be:

  • “But that’s not syntax!”
  • “Those aren’t meaning!”
  • “Those aren’t the right parts!”
  • “The model’s output is the next token. That isn’t meaning!”
  • … (list goes on forever)

Then…

  • “Then, what is syntax?”
  • “Then, what is meaning?”
  • “Then, what are the right parts?”
  • “Then, where should we be looking and what should we be looking for?”
  • …(responses also go on forever)

Non-vacuous claims of compositionality require commitment to the compositional machinery.

6 of 60

Aside: Scopes of claims


I’ve used linguistic compositionality to make the point, but the point about commitment generalizes to other domains of compositionality in Cognitive Science and AI (that might also not necessarily concern “meaning” per se).

Visual perception (from Lande 2024)

Question-answering (from Liu et al. 2022)

7 of 60

Aside 2: about co-occurring concepts


Hupkes et al. (2020)

Kudo et al. (2023)

These are not properties of compositional systems (in general), nor are they generally “tests for compositionality”, nor are they “types of composition”. Certain compositional commitments can give rise to all or a subset of these properties. If one claims that “these are hallmarks of a compositional system”, they are likely making certain implicit commitments.

Will skip, but feel free to ask me about them later

8 of 60

No commitments, no (meaningful) compositionality! (NCNC)

(Corollary: no system is “generally” compositional)


With this bottom line in mind, let’s dive into some specifics…

9 of 60

Notions useful to tease apart


Process compositionality

State compositionality

Defining the target of analysis

*I (currently) view states as relating to representations and processes as relating to computations, both of which could be analyzed at all (Marrean) levels of analysis, for those of you who care

10 of 60

Process vs. State Compositionality


  • Process compositionality: a system follows a compositional procedure, putting together meaningful parts in hypothesized ways
  • State compositionality: the end state can be decomposed into meaningful parts

(Note: I’m adopting Nefdt (2020)’s terminology, but I am not using the terms in exactly the same sense as he does, especially regarding claims about correspondence to Marr’s levels.)

11 of 60

Process vs. State Compositionality


Depending on your commitments, your system may exhibit both. But they can be dissociated!

  • State C -/-> Process C:

Some examples from Nefdt (2020); Coppock & Champollion (2024)

Agnetha loves Bjorn:

Aristotle taught a king:

12 of 60

Process vs. State Compositionality


Depending on your commitments, your system may exhibit both. But they can be dissociated!

  • State C -/-> Process C:

Some examples from Nefdt (2020); Coppock & Champollion (2024)

3 + (5 + 9) * 2:

13 of 60

Process vs. State Compositionality


Depending on your commitments, your system may exhibit both. But they can be dissociated!

  • Process C -/-> State C:

A classifier that sums its (meaningful) input parts and outputs 0 or 1: the procedure composes the parts, but the end state (a single bit) is not decomposable into meaningful parts.

14 of 60

Taking a step back: why do we want compositionality?


In the context of AI, we often cite the following reasons:

  • Robustness and generalization
  • Interpretability (not so much the topic today)

Recall: 3 + (5 + 9) * 2

Observation: Process compositionality predicts (specific) generalizations when not all possible I/O mappings are observed; state compositionality does not.
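A toy contrast illustrating the observation (my code, not from the slides): a recursive evaluator embodies a compositional process and therefore generalizes to unseen expressions; a memorized input-output table, however decomposable each stored entry, does not.

```python
# Process compositionality (recursive evaluation) vs. a memorized I/O mapping.
import ast
import operator

OPS = {ast.Add: operator.add, ast.Mult: operator.mul}

def evaluate(node):
    """Recursively compose part meanings: a compositional *process*."""
    if isinstance(node, ast.Constant):
        return node.value
    if isinstance(node, ast.BinOp):
        return OPS[type(node.op)](evaluate(node.left), evaluate(node.right))
    raise ValueError("unsupported expression")

def eval_expr(s):
    return evaluate(ast.parse(s, mode="eval").body)

# A memorized mapping: no process is available for inputs outside the table.
table = {"3 + (5 + 9) * 2": 31}

print(eval_expr("3 + (5 + 9) * 2"))  # 31
print(eval_expr("7 + (2 + 1) * 4"))  # 19: unseen, but predicted by the process
print(table.get("7 + (2 + 1) * 4"))  # None: no generalization from the table
```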

15 of 60

Taking a step back: why do we want compositionality?


In the context of AI, we often cite the following reasons:

  • Robustness and generalization
  • Interpretability (not so much the topic today)

Observation: Process compositionality predicts (specific) generalizations when not all possible I/O mappings are observed; state compositionality does not.

Ideally, we want to (at least) test for process compositionality.

16 of 60

What do we want from an AI system?


Very broadly: Doing well on tasks that we subject them to.

Does this require process-level compositionality for every query?

No! If a memorized response is a great one, using that is fine.

But process-level compositionality allows the models to go beyond memory limitations (in predictable ways that the process allows for).

17 of 60

Main take: what we really want to evaluate is the availability of a process-level compositional route


18 of 60

The availability of a process-level compositional route


…as opposed to whether the model takes this route for every possible complex input.

The same observation holds for people too: people store and retrieve chunks of frequent expressions (that are not idioms, like “I don’t know”)

cf) Also see Baggio (2021)’s “at least one (strictly compositional) meaning” argument.

What empirical evidence would convince us that this route is available?

19 of 60

Notions useful to tease apart

19

Process compositionality

State compositionality

Behavioral evidence

Non-behavioral evidence

Defining the target of analysis

Available observations

20 of 60

Behavioral evidence


Working definition of behavioral:

  • Something that a system does in response to stimuli, acting upon/interacting with them

Note: generalization (responding to unseen stimuli) is also behavior!

21 of 60

Non-behavioral evidence


Non-behavioral evidence concerns everything else.

This, for humans, may include:

  • Physiological measures (like blood pressure or heart rate)
  • Neural measures (like EEG, fMRI)

22 of 60

Non-behavioral evidence

22

For models that are commonly used today, what are they?

🤔

23 of 60

Notions useful to tease apart


Process compositionality

State compositionality

Behavioral evidence

Non-behavioral evidence

“Neural” parts of the model

Non-neural parts of the model

Defining the target of analysis

Available observations

Specificities of the object of study

24 of 60

Today’s language models are neurosymbolic models


https://en.wikipedia.org/wiki/Neuro-symbolic_AI

25 of 60

Today’s language models are neurosymbolic models


…at a certain level of abstraction, it is symbols in, symbols out!

Stimulus: “What is the weather like in Albuquerque?”

Acting upon/interacting with the stimulus: “Let me look that up for you…”

Things in between: “model internals”

26 of 60

Notions useful to tease apart


Process compositionality

State compositionality

Behavioral evidence

Non-behavioral evidence

“Neural” parts of the model

Non-neural parts of the model

Not equivalent in general; they often happen to align this way because of the dominant paradigm

Defining the target of analysis

Available observations

Specificities of the object of study

27 of 60

Notions useful to tease apart


Process compositionality

State compositionality

Behavioral evidence

“Mechanistic” or “model internal” evidence

“Neural” parts of the model

Non-neural parts of the model

Not equivalent in general; they often happen to align this way because of the dominant paradigm

Defining the target of analysis

Available observations

Specificities of the object of study

28 of 60

Relation between types of compositionality / evidence


How do the following types of evidence help us convince ourselves that a system exhibits process / state compositionality?

  1. Behavioral
  2. Mechanistic

29 of 60

Behavioral evidence & State compositionality


Relation to state compositionality: room for philosophical qualms

What is the weather like in Albuquerque?

Let me look that up for you…

Is this state-compositional?

(Debatable)

30 of 60

Behavioral evidence & State compositionality


Relation to state compositionality: room for philosophical qualms

What is the weather like in Albuquerque?

is_warm(albuquerque, May 4th)

Pragmatically: if we define some sort of metalanguage that leads to (linearized) structured expressions, we could see whether the output conforms to the syntax of this metalanguage (philosophical qualms still remain)
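For instance, a minimal conformance check might look as follows; the metalanguage grammar and the predicate names here are hypothetical placeholders for illustration, not the actual formalism from the slide.

```python
# A minimal sketch, assuming a hypothetical metalanguage of the form
# predicate(arg, ..., arg). Grammar and names are illustrative assumptions.
import re

TOKEN = r"[a-z_][a-z0-9_]*"
WELL_FORMED = re.compile(rf"^{TOKEN}\({TOKEN}(,\s*{TOKEN})*\)$")

def conforms(output: str) -> bool:
    """Check whether a model's linearized output parses under the metalanguage."""
    return bool(WELL_FORMED.match(output.strip()))

print(conforms("is_warm(albuquerque, may_4)"))  # True
print(conforms("warm weather in Albuquerque"))  # False: not metalanguage syntax
```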

31 of 60

Behavioral evidence & Process compositionality


What about process compositionality?

I said earlier that only process compositionality predicts generalization behavior. One way to go about this is to test whether the models’ generalization behavior conforms to the predictions.

However…

32 of 60

Behavioral evidence is necessary but not sufficient

32

This behavioral alignment is a necessary but not sufficient condition for process compositionality.

If a model fails on an unseen example, we know for sure that the model is not employing the hypothesized mechanism (for handling that example).

33 of 60

Behavioral evidence is necessary but not sufficient


However, if a model succeeds on the test, we need additional evidence that the model is indeed succeeding in the “right way”, which is difficult to derive solely from observation of behavior. (Argument from rule induction from finite examples)

cf) Researchers’ views vary on this! (McCurdy et al. 2024)

34 of 60

Behavioral tests

34

My prior work COGS (Kim and Linzen 2020) & SLOG (Li et al. 2023) concerned behavioral tests showing that fairly recent models still fail on certain types of predicted generalizations.

Given that models can do this…

Fig. from Weissenhorn et al. (2022)

35 of 60

Behavioral tests


My prior work COGS (Kim and Linzen 2020) & SLOG (Li et al. 2023) concerned behavioral tests showing that fairly recent models still fail on certain types of predicted generalizations.

Can they generalize to deeper nested structures?
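As an illustration of the kind of split involved (a simplified sketch of depth generalization, not the actual COGS/SLOG data generation pipeline), one can hold out deeper recursion depths at training time:

```python
# A simplified sketch: train on shallow PP recursion, test on deeper nesting.
# The sentences and depths are illustrative, not the benchmark's actual items.
PPS = ["in the box", "on the mat", "beside the tree", "near the lake", "under the bench"]

def nest_pp(depth: int) -> str:
    """'Emma saw the cat in the box on the mat ...' with `depth` stacked PPs."""
    return "Emma saw the cat " + " ".join(PPS[:depth]) + "."

train = [nest_pp(d) for d in (1, 2)]    # depths attested in training
test = [nest_pp(d) for d in (3, 4, 5)]  # deeper structures, unseen in training
print(test[-1])
```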

36 of 60

But suppose they do very well. What then? How can we strengthen our inference?


37 of 60

Behavioral evidence′: ruling out plausible alternatives by building models of system behavior


Behavioral tests for compositionality involve positing a model of composition, deriving predictions from it, and seeing if predictions align with the system’s behavior.

Proposal: While we cannot directly get sufficient evidence that this is an outcome of the hypothesized compositional process, we can do similar things for competing hypotheses (i.e., thinking about plausible alternatives: if not this, then what?).

(This capitalizes on the fact that behavior can be used to rule things out)

38 of 60

Behavioral evaluation of Adj+N membership inferences


Ross, Davidson & Kim (2024), Ross, Kim & Davidson (2024)

  • Subsective inference: bored professor → “Bored professors are professors”
  • Privative inference: fake professor → “Fake professors are not professors”

Properties of adjectives? Or hypotheses about the compositional machinery?

39 of 60

Privativity inferences are context-sensitive


“fake crowd”

  • paid actors (still a crowd)
  • dummies on a movie set (probably not a crowd)

Ross et al. (2024)

40 of 60

Theoretical commitments


  • The compositional operator is function application
  • The functions are “faithful”: they (mostly) preserve the meanings of the arguments rather than deleting or adding material arbitrarily (note that this is a material commitment: func(a,b) = a + c is a perfectly valid function)
  • The functions can consider the effects of contexts in a constrained manner (formal semantics territory - see Hayley’s dissertation!)

Not super important for today’s purposes; the important point is that these commitments make specific predictions.

41 of 60

Collect human judgments…


Steering effect of context in predicted ways for novel Adj+Noun combinations (0 frequency in C4)

Goodness of membership inference

42 of 60

Subject LLMs to the test


Setup: CONTEXT → LLM → 5-shot logprobs / single response
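A minimal sketch of a logprob-based evaluation of this kind (the model name, prompt, and answer tokens below are placeholders of mine; the actual experimental setup differs):

```python
# Score "Yes"/"No" continuations of a prompt with a causal LM's logprobs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def continuation_logprob(prompt: str, continuation: str) -> float:
    """log p(continuation | prompt); assumes a clean token boundary at the join."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    full_ids = tok(prompt + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = model(full_ids).logits.log_softmax(-1)
    n = prompt_ids.shape[1]
    targets = full_ids[0, n:]
    return logprobs[0, n - 1 : -1].gather(-1, targets.unsqueeze(-1)).sum().item()

prompt = "Q: Is a fake crowd still a crowd? A:"  # the real prompts are 5-shot with context
print(continuation_logprob(prompt, " Yes") > continuation_logprob(prompt, " No"))
```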

43 of 60

Behavioral evidence

  • Sufficiently powerful LLMs are able to draw human-like inferences in context


  • They perform similarly on high-frequency and zero-frequency bigrams

(Probably) not memorization!

A case where an LLM does show behavioral alignment with predicted generalizations. What now?

44 of 60

Behavioral evidence′: ruling out (pure) inference by analogy


  • Approach: build an analogical reasoner (a model that computes inferences by analogy to known examples) and see how well it explains LLM behavior (see the sketch below)
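A minimal sketch of such an analogical baseline, under assumptions of mine (similarity-weighted averaging over embeddings of attested pairs); the actual model in Ross et al. (2024) differs in its details:

```python
# Predict the inference score for a novel Adj+N pair from similarity-weighted
# scores of known pairs. Embeddings and scores below are random stand-ins.
import numpy as np

def analogical_score(query_vec, known_vecs, known_scores, temp=0.1):
    """Similarity-weighted average of inference scores for known examples."""
    sims = known_vecs @ query_vec / (
        np.linalg.norm(known_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9
    )
    w = np.exp(sims / temp)
    return (w * known_scores).sum() / w.sum()

rng = np.random.default_rng(0)
known_vecs = rng.normal(size=(100, 32))     # embeddings of attested Adj+N pairs
known_scores = rng.uniform(0, 1, size=100)  # inference ratings for those pairs
query = rng.normal(size=32)                 # a zero-frequency Adj+N pair
print(analogical_score(query, known_vecs, known_scores))
```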

45 of 60

Behavioral evidence′: ruling out (pure) inference by analogy


  • A regression analysis predicting the LLM’s inferences (for zero-frequency examples) from the analogy model’s inferences shows a significant effect, but explains only 12% of the variance (sketched below)
  • This rules out (pure) analogy as a core mechanism
  • It still does not show that composition (as we hypothesize) IS a core mechanism, but we’ve ruled out what people have proposed as an important competitor
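The regression step, sketched on synthetic data (illustrative only; the 12% figure comes from the actual experiments, not from this toy):

```python
# Regress LLM inference scores on the analogy model's predictions; read off R^2.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
analogy_pred = rng.uniform(0, 1, 500)                        # analogy model's scores
llm_scores = 0.35 * analogy_pred + rng.normal(0, 0.25, 500)  # weakly related LLM scores

X = analogy_pred.reshape(-1, 1)
reg = LinearRegression().fit(X, llm_scores)
print(reg.score(X, llm_scores))  # R^2; ~0.1-0.15 with this synthetic setup
```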

46 of 60

Behavioral evidence′: ruling out (pure) inference by analogy


  • If your priors are sufficiently strong, one could stop here and make the “What else could it be? It must be composition” argument 🙂
  • Interestingly, people seem more likely to believe this for humans than for LLMs, even when observations of a similar flavor hold for both humans and LLMs, as in our experiments
    • (i.e., behavior aligns with compositional predictions & the analogy model significantly explains some, but not all, of the variance in the inference data)

47 of 60

Mechanistic evidence & State/Process compositionality


What is the weather like in Albuquerque?

Let me look that up for you…

There is some philosophizing to do here about which states and processes are commitment-relevant

48 of 60

Mechanistic evidence & State/Process compositionality


What is the weather like in Albuquerque?

Let me look that up for you…

Claim: State and process compositionality are assessable with “symbols-in-neurons” approaches: e.g., tensor product representations and operations (not the only way)

49 of 60

How might we look for process compositionality?


Work in progress with Aditya Yedetore

Preliminary commitments (drawing from Smolensky 1990, i.a.):

  • Symbolic structures can be encoded as distributed representations via tensor products of filler and role embeddings (Tensor Product Representations: TPRs)
  • Operations over TPRs define the compositional machinery (Tensor Product Operations)
    • E.g., complex expressions can be constructed via sums (+) of TPRs
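A minimal sketch of these commitments in code (following Smolensky 1990; the dimensions are arbitrary, and unbinding assumes linearly independent role vectors):

```python
# Bind fillers to roles with outer products; compose by summation.
import numpy as np

rng = np.random.default_rng(0)
d_f, d_r = 6, 4
fillers = {s: rng.normal(size=d_f) for s in "bcd"}    # symbol (filler) embeddings
roles = {i: rng.normal(size=d_r) for i in (1, 2, 3)}  # positional role embeddings

def bind(f, r):
    return np.outer(f, r)  # filler (x) role

# TPR of the sequence "bcd": b (x) r_1 + c (x) r_2 + d (x) r_3
tpr = sum(bind(fillers[s], roles[i]) for i, s in enumerate("bcd", start=1))

# Unbinding with dual role vectors: recover the filler occupying role 2.
R = np.stack([roles[i] for i in (1, 2, 3)], axis=1)  # d_r x 3
unbind = np.linalg.pinv(R).T                         # duals: r_i . u_j = delta_ij
c_hat = tpr @ unbind[:, 1]
print(np.allclose(c_hat, fillers["c"]))              # True (up to float error)
```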

50 of 60

How might we look for process compositionality?

50

Work in progress with Aditya Yedetore

An idea: The representations and computations of neural networks that perform symbol manipulation tasks may be approximated by Tensor Product Representations and Operations that describe the hypothesized compositional machinery.

If this is the case, we gain direct evidence for state and process compositionality.

51 of 60

How might we look for process compositionality?


Work in progress with Aditya Yedetore

An idea: The representations and computations of neural networks that perform symbol manipulation tasks may be approximated by Tensor Product Representations and Operations that describe the hypothesized compositional machinery.

If this is the case, we gain direct evidence for state and process compositionality.

We need very precise hypotheses

52 of 60

Symbol manipulation hypothesis: Copying in a GRU


Let’s say the input sequence is bcd. The network needs to construct something like b ⊗ r_1 + c ⊗ r_2 + d ⊗ r_3. (~b in the first position, c in the second, d in the third)

A GRU processes inputs sequentially (reps of b, c, d are provided as input). The target representation can be constructed if, at each timestep, the binding of the input and the new role is added to the previous step’s representation.

h_1: b ⊗ r_1

h_2: b ⊗ r_1 + c ⊗ r_2

h_3: b ⊗ r_1 + c ⊗ r_2 + d ⊗ r_3

53 of 60

Symbol manipulation hypothesis: Copying in a GRU


Let’s say the input sequence is bcd. The network needs to construct something like b ⊗ r_1 + c ⊗ r_2 + d ⊗ r_3. (~b in the first position, c in the second, d in the third)

A GRU processes inputs sequentially (reps of b, c, d are provided as input). The target representation can be constructed if, at each timestep, the binding of the input and the new role is added to the previous step’s representation.

h_1: b ⊗ r_1

h_2: b ⊗ r_1 + c ⊗ r_2

h_3: b ⊗ r_1 + c ⊗ r_2 + d ⊗ r_3

This requires a “Current role fetcher” as a processing mechanism: basically, getting the role of the current position given the previous hidden representation
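Sketched in numpy (assumptions mine): the hypothesized process adds, at each timestep, the binding of the current input filler to a freshly fetched role. The `fetch_role` stand-in below cheats by indexing on t; the empirical question is whether the GRU computes this role from the previous hidden state.

```python
# The hypothesized copying process: h_t = h_{t-1} + x_t (x) r_t.
import numpy as np

rng = np.random.default_rng(0)
d_f, d_r, n = 6, 4, 3
fillers = {s: rng.normal(size=d_f) for s in "bcd"}
roles = [rng.normal(size=d_r) for _ in range(n)]

def fetch_role(h_prev, t):
    # Stand-in for the "current role fetcher": here we cheat and index by t;
    # the claim under test is that the GRU derives this from h_{t-1}.
    return roles[t]

h = np.zeros((d_f, d_r))
for t, s in enumerate("bcd"):
    h = h + np.outer(fillers[s], fetch_role(h, t))  # add the new filler-role binding
# h now equals b (x) r_1 + c (x) r_2 + d (x) r_3
```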

54 of 60

Symbol manipulation hypothesis: Copying in a GRU


Test for state compositionality

Test for process compositionality

55 of 60

State compositionality (of each hidden state)


Extension of McCoy et al. (2019)’s Tensor Product Decomposition Network (TPDN) method
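For reference, a rough sketch of the basic TPDN idea being extended (a learned approximator of hidden states as sums of filler-role bindings; the architecture details below are assumptions of mine, not the actual extension):

```python
# Learn filler/role embeddings whose summed tensor products, linearly mapped,
# approximate the network's actual hidden states. Dimensions are arbitrary.
import torch

n_symbols, n_roles, d_f, d_r, d_h = 10, 5, 8, 8, 64
fillers = torch.nn.Embedding(n_symbols, d_f)
role_emb = torch.nn.Embedding(n_roles, d_r)
proj = torch.nn.Linear(d_f * d_r, d_h)  # map the TPR into the GRU's hidden space

def tpdn(symbol_ids, role_ids):
    # (batch, seq) index tensors -> (batch, d_h) approximated hidden states
    bindings = torch.einsum("btf,btr->btfr", fillers(symbol_ids), role_emb(role_ids))
    return proj(bindings.sum(dim=1).flatten(1))  # sum of filler (x) role, projected

opt = torch.optim.Adam(
    [*fillers.parameters(), *role_emb.parameters(), *proj.parameters()], lr=1e-3
)

def fit_step(symbol_ids, role_ids, hidden_states):
    """One step of fitting the TPDN to observed hidden states (MSE)."""
    loss = torch.nn.functional.mse_loss(tpdn(symbol_ids, role_ids), hidden_states)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Usage: hidden_states would come from the trained GRU; here a random stand-in.
sym = torch.randint(0, n_symbols, (32, 3))
rol = torch.arange(3).expand(32, 3)
print(fit_step(sym, rol, torch.randn(32, d_h)))
```

A good fit (low approximation error under the hypothesized role scheme) is evidence of state compositionality of the hidden states.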

56 of 60

Process-level compositionality?


We find that the “next role getter” mechanism coincides with the W_hz matrix in the GRU

57 of 60

Process-level compositionality?


But it makes reference to specific filler-role bindings observed during training, which predicts a lack of generalization to symbols in unseen positions (behaviorally verified); in this sense, the processing is not fully compositional.

58 of 60

Summary


(0) No meaningful claims of compositionality without commitments (NCNC)

(1) Availability of process compositionality is what we want in AI systems and therefore what we want to test for

59 of 60

Summary


(2) The key behavioral evidence is predicted generalization to unseen examples. However, this is necessary but not sufficient for process compositionality. It can be strengthened by behavioral evidence that rules out plausible alternatives.

Work with Hayley Ross and Kate Davidson

Work with Aditya Yedetore

(3) Mechanistic evidence can help make sufficiency arguments, but it requires much more precise commitments to machinery, which is often hard even when the phenomenon you're looking at is extremely simple. Tensor product representations and operations are promising (but not the only) ways to conduct this investigation in neural networks.

Aside: getting harder and harder to test properly in today’s AI landscape

60 of 60

Thank you!

