1 of 37

Linguistic Insights Deepen Our Understanding of AI Systems: The Cases of Reference Frames and Logical Reasoning

Freda Shi

University of Waterloo, Vector Institute, Canada CIFAR AI Chair

fhs@uwaterloo.ca

2 of 37

Interaction between Linguistics and CS


1950s-1960s

Russian (Romanized): Mi pyeryedayem mislyi posryedstvom ryechyi.

English translation: We transmit thoughts by means of speech.

The Georgetown–IBM experiment

1990s

Recurrent Neural Networks

2020s

Are linguistic insights still helpful?

Noam Chomsky

Jeffrey Elman

3 of 37

Language Models for Linguistics


4 of 37

This Talk: Linguistics for Language Models

What tasks are hard for current large language and vision-language models?

  • Linguistic theories can guide the design of our analyses.
  • Frame of reference theory (Levinson, 2003): identify the reasons for spatial reasoning issues in vision-language models.
  • Possible world semantics and modal logic (Kripke, 1959, 1963): explain text-only language model behaviors.

The analyses also offer insights on what we should do to improve the models.


[Levinson. 2003. Space in language and cognition: Explorations in cognitive diversity. Vol. 5. Cambridge University Press]

[Kripke. 1959. A completeness theorem in modal logic. The Journal of Symbolic Logic, 24(1), 1-14]

[Kripke. 1963. Semantical considerations on modal logic. Acta Philosophica Fennica, 16]


6 of 37

Common Vision-Language Models

Image-Text Mapping, e.g., CLIP (Radford et al., 2021)

The similarity metric (e.g., the probability after a softmax over candidate captions) quantifies the endorsement level.
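To make the endorsement computation concrete, here is a minimal sketch using the Hugging Face transformers CLIP interface; the image path and candidate captions are hypothetical placeholders, not part of the original experiments.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Hypothetical image path and candidate captions, for illustration only.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("scene.jpg")
captions = [
    "The cat is to the left of the potted plant.",
    "The cat is to the right of the potted plant.",
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits_per_image = model(**inputs).logits_per_image  # shape: (1, num_captions)

# Softmax over the candidate captions: each probability quantifies
# how strongly the model endorses the corresponding description.
probs = logits_per_image.softmax(dim=-1)
print(dict(zip(captions, probs[0].tolist())))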


[Kiros et al. 2014. Unifying visual-semantic embeddings with multimodal neural language models]

[Radford et al. 2021. Learning transferable visual models from natural language supervision; Figure credit: Radford et al.]

7 of 37

Common Vision-Language Models

Conditioned Generative Models, e.g., LLaVA (Liu et al., 2023)

Predict a natural language response conditioned on the visual and textual context.

The probability of generated text quantifies the endorsement level.
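As a sketch of probability-as-endorsement for generative models, the snippet below scores a response under a text-only causal LM (gpt2 as a stand-in); a VLM such as LLaVA additionally conditions on image features, but the scoring logic is the same in spirit.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def response_logprob(context: str, response: str) -> float:
    """Sum of token log-probabilities of `response` given `context`."""
    ctx_len = tokenizer(context, return_tensors="pt").input_ids.shape[1]
    all_ids = tokenizer(context + response, return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = model(all_ids).logits.log_softmax(dim=-1)
    # The token at position i is predicted by the logits at position i - 1.
    total = 0.0
    for i in range(ctx_len, all_ids.shape[1]):
        total += logprobs[0, i - 1, all_ids[0, i]].item()
    return total

print(response_logprob("Q: Is the pizza behind the dining table? A:", " Yes"))
```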


[Vinyals et al. 2015. Show and tell: A neural image caption generator. In CVPR;

Liu et al. 2023. Visual instruction tuning. In NeurIPS; Figure credit: Liu et al.]

8 of 37

Spatial Reasoning in VLMs

Liu et al. (2023): state-of-the-art VLMs fail in spatial reasoning.

  • Evaluation: compare endorsement for ground-truth captions and distractors.

Source of complexity: spatial relations are subjective and not always clearly defined.


The pizza is at the edge of the dining table.

The pizza is behind the dining table.

The cat is in front of the potted plant.

The cat is to the left of the potted plant.

[Liu et al. 2023. Visual Spatial Reasoning. In TACL; Figure credit: MSCOCO, Lin et al. (2014)]

9 of 37

Why Endorsement?

An alternative way to evaluate generative models is matching-based accuracy.

However, there are two major issues:

  • Language models tend to answer Yes to binary questions (Dentella et al., 2023).
  • Exact-match-based accuracy does not reflect the “competence” of models.

With ground-truth answer No, case 1 is better than case 2, although both are considered correct.

Most generative models are built on top of probability distributions.

It is thus natural to use probability-based metrics to quantify models’ competence.
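For illustration (with hypothetical numbers): suppose case 1 assigns P(No) = 0.9 while case 2 assigns P(No) = 0.55. Both answer No under argmax decoding and count as exact-match correct, yet case 1 endorses the correct answer far more strongly.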


[Dentella et al. 2023. Systematic testing of three language models reveals low language accuracy, absence of response stability, and a yes-response bias. In PNAS]

10 of 37

Frame of Reference (FoR): An Example

Frame of reference theories offer a taxonomy for a subset of spatial relations.

For each of the following captions, do you think it’s a good description of the image?

Do VLMs also endorse these descriptions?


The tree is behind the car.

(reflection-relative frame)

The tree is in front of the car.

(translation-relative frame)

The tree is to the right of the car.

(ground-intrinsic frame)

11 of 37

FoR Knowledge in VLMs

A synthetic dataset that enables analyzing FoR knowledge in VLMs.


Ziqiao Ma

Zheyuan Zhang

[Zhang et al. 2025. Do vision-language models represent space and how? Evaluating spatial frame of reference under ambiguities. In ICLR]

(Camera) Q: From the camera’s viewpoint, is the basketball to the right of the car? A: Yes.

(Addressee) Q: From the woman’s viewpoint, is the basketball to the right of the car? A: Yes.

(Ground Object) Q: From the car’s viewpoint, is the basketball to the right of the car? A: Yes.

12 of 37

Evaluation Metric

Soft accuracy (normalized ground-truth probability)

A few other metrics (for cross-validation purposes) can be found in the paper.


(Camera) Q: From the camera’s viewpoint, is the basketball to the right of the car? A: Yes.

(Addressee) Q: From the woman’s viewpoint, is the basketball to the right of the car? A: Yes.

(Ground Object) Q: From the car’s viewpoint, is the basketball to the right of the car? A: Yes.
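A minimal sketch of the metric, under the assumption that “soft accuracy” is the ground-truth answer’s probability renormalized over the candidate answers {Yes, No}; the log-probabilities below are hypothetical:

```python
import math

def soft_accuracy(logp_yes: float, logp_no: float, gold: str) -> float:
    """Probability of the ground-truth answer, renormalized over {Yes, No}."""
    p_yes, p_no = math.exp(logp_yes), math.exp(logp_no)
    p_gold = p_yes if gold == "Yes" else p_no
    return p_gold / (p_yes + p_no)

# Hypothetical answer log-probabilities from a VLM:
print(soft_accuracy(logp_yes=-0.4, logp_no=-2.1, gold="Yes"))  # ~0.85
```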

13 of 37

Soft Accuracy (↑): Results


Only some of the models achieve nontrivial accuracy from the camera’s view.

[Chart: soft accuracy per model and viewpoint; a horizontal line marks the random-guess baseline]

14 of 37

From the Camera’s View: A Closer Look


 

 

 

[Plots: model endorsement for “in front of” and “to the right of”]

15 of 37

From the Camera’s View: A Closer Look

Plot: the models’ endorsement of each relation as a function of the target object’s angular position; an intuitive reference curve is overlaid for comparison.

Is the red ball [relation] the blue ball?


[Plots: XComposer and Instruct-BLIP-13B endorsement curves for “to the left of,” “to the right of,” “in front of,” and “behind”]

[Hayward and Tarr. 1995. Spatial language and spatial representations. In Cognition]

16 of 37

From the Camera’s View: A Closer Look

XComposer (behind)


What happened here?

Similar phenomena are also observed in strong models such as GPT-4o.

The tree is behind the car.

(reflection-relative frame)

The tree is in front of the car.

(translation-relative frame)

17 of 37

From the Camera’s View: A Closer Look

XComposer (behind)

(reflection-relative frame)

  • The tree is behind the car.
  • The car is in front of the tree.

(translation-relative frame)

  • The tree is in front of the car.
  • The car is behind the tree.

Least liked by humans.

The drop for “in front of” is much milder than that for “behind.”

[Bender et al. 2020. Being in front is good---but where is in front? Preferences for spatial referencing affect evaluation. In CogSci]

18 of 37

Cross-Lingual FoR Adoption

Speakers of different languages show clearly different preferences in FoR adoption.

Bender et al. (2020): Chinese, German, and Japanese native speakers are more likely to adopt the translation-relative FoR than English native speakers.

Hill (1982): Hausa native speakers adopt the translation-relative frame more than the reflection-relative one.

Clark (1973): Tamil native speakers generally adopt the rotation-relative frame more than English native speakers.

Can we observe similar preferences in multilingual VLMs?


[Bender et al. 2020. Being in front is good---but where is in front? Preferences for spatial referencing affect evaluation. In CogSci]

[Hill. 1982. Up/down, front/back, left/right. A contrastive study of Hausa and English. In Here and there: Cross-linguistic studies on deixis and demonstration]

[Clark. 1973. Space, time, semantics, and the child. In Cognitive development and the acquisition of language]

19 of 37

Cross-Lingual FoR Adoption in VLMs?

GPT-4o on 109 languages (darker = stronger preference for the reflection-relative frame).

No preference toward the intrinsic frame is observed!

Conjecture: this may be due to the dominance of English in the training data.


20 of 37

A Complementary View

Data scarcity: VLMs don’t see sufficient training examples for many spatial relations.

The most frequent 17% of VSR spatial relations account for >90% of the MSCOCO training examples.

Proposed solution: use LLMs to synthesize spatial QA pairs from detailed captions, and finetune VLMs on the synthesized image-QA data (a sketch follows the example below).


[Ogezi and Shi. 2025. SpaRE: Enhancing Spatial Reasoning in Vision-Language Models with Synthetic Data. Under Review]

Mike Ogezi

A front view of a small gray elephant figurine on the left; in the middle, there is an orange and black tiger; and on the right, there is a papier-mâché rhino head. They are all positioned side by side, with space between them…

What is to the right of the orange tiger?

The rhino head.

Which animal figurine is located on the leftmost side?

The elephant.

What animal is in the middle of the arrangement?

The tiger.

What is in the background?

Wallpaper.
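A sketch of the synthesis step described above; the prompt wording and helper function are illustrative assumptions, not the exact prompt from the paper:

```python
# Hypothetical prompt template for LLM-based spatial QA synthesis;
# the exact prompt used in the paper may differ.
SYNTHESIS_PROMPT = """Below is a detailed description of an image.
Write question-answer pairs that test spatial relations between
the objects (left/right, front/behind, middle, background, ...).

Description: {caption}

Format:
Q: <question>
A: <short answer>"""

def build_synthesis_prompt(caption: str) -> str:
    """Fill the template with one dense image caption."""
    return SYNTHESIS_PROMPT.format(caption=caption)
```

The synthesized QA pairs are then paired with the original images to form finetuning data.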

21 of 37

Finetuning Results

Finetuning improves spatial reasoning over the base models (Qwen2-VL; Wang et al., 2024), while maintaining comparable performance on other tasks.

There is still a large gap between human and VLM performance.


[Wang et al. 2024. Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution]

22 of 37

This Talk: Linguistics for Language Models

What tasks are hard for current large language and vision-language models?

  • Linguistic theories can guide the design of our analyses.
  • Frame of reference theory (Levinson, 2003): identify the reasons for spatial reasoning issues in vision-language models.
  • Possible world semantics and modal logic (Kripke, 1959, 1963): explain text-only language model behaviors.

The analyses also offer insights on what we should do to improve the models.


[Levinson. 2003. Space in language and cognition: Explorations in cognitive diversity]

[Kripke. 1959. A completeness theorem in modal logic. The Journal of Symbolic Logic]

[Kripke. 1963. Semantical considerations on modal logic. Acta Philosophica Fennica]

23 of 37

LLMs perform better on high-frequency tasks
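For reference, the standard definition of a model’s perplexity on a token sequence (background knowledge, not a formula taken from the cited papers):

```latex
\mathrm{PPL}(x_{1:n}) \;=\; \exp\!\left(-\frac{1}{n}\sum_{i=1}^{n}\log p_\theta(x_i \mid x_{<i})\right)
```

Lower perplexity indicates that the prompt or task format looks more frequent to the model.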


[Gonen et al. 2023. Demystifying Prompts in Language Models via Perplexity Estimation. In Findings of EMNLP.]

[McCoy et al. 2024. Embers of autoregression show how large language models are shaped by the problem they are trained to solve. In PNAS]

[Source: Gonen et al.]

[Source: McCoy et al.]

24 of 37

A Complementary View

Besides random noise, does probability/perplexity explain everything about accuracy?

Quick answer: No. Logical forms complement it.

Propositional and modal logic give us a framework for synthesizing evaluation materials.


[Wang and Shi. 2025. Logical forms complement probability in understanding language model (and human) performance. Under Review]

Yixuan Wang

25 of 37

Hypothetical and Disjunctive Syllogisms

We consider both logic sequents and fallacies.
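For concreteness, the standard valid sequents and corresponding fallacies of hypothetical and disjunctive reasoning are listed below; that the paper uses exactly this set is an assumption, though modus tollens is discussed explicitly later in the talk:

```latex
\begin{align*}
\text{Modus ponens:}          && p \to q,\; p        &\;\vdash\; q \\
\text{Modus tollens:}         && p \to q,\; \lnot q  &\;\vdash\; \lnot p \\
\text{Disjunctive syllogism:} && p \lor q,\; \lnot p &\;\vdash\; q \\
\intertext{Fallacies:}
\text{Affirming the consequent:} && p \to q,\; q        &\;\nvdash\; p \\
\text{Denying the antecedent:}   && p \to q,\; \lnot p  &\;\nvdash\; \lnot q \\
\text{Affirming a disjunct:}     && p \lor q,\; p       &\;\nvdash\; \lnot q
\end{align*}
```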



27 of 37

Adding Modalities

All formulas still hold if we add the same modality quantifier, □ (must) or ◇ (may), to each propositional variable.


It is certain that… (□, must)

It is possible that… (◇, may)
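A sketch of what this looks like for modus tollens: applying the same operator to each propositional variable leaves the sequent an instance of the same propositional form, with □p and □q (or ◇p and ◇q) treated as atoms:

```latex
% Modus tollens with the same modal operator on every variable:
\Box p \to \Box q,\; \lnot\Box q \;\vdash\; \lnot\Box p
\qquad
\Diamond p \to \Diamond q,\; \lnot\Diamond q \;\vdash\; \lnot\Diamond p
```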

28 of 37

Interpreting Logic Formulas

We interpret the logic formulas as real-world scenarios to create the materials for the main experiment.

We manually selected the interpretations to minimize content effects.


If Freda is giving a talk, then Freda is reading a book.

Freda is giving a talk.

Is Freda reading a book?

29 of 37

Fitting a Mixed-Effects Model

We obtain the LLMs’ soft accuracy and fit a mixed-effects model (see the sketch below).

Fixed effects: analogous to the coefficients in linear regression.

  • Likelihood ratio tests support the inclusion of all fixed-effect predictors.

Random effects: individual-specific effects.

  • Each LLM may have systematically different performance.
  • Each LLM may systematically assign higher or lower perplexity to our material.
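A minimal sketch of such a fit with statsmodels; the column names (soft_acc, modality, arg_form, ppl, model) and the data file are hypothetical, and the paper’s exact specification may differ:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data layout: one row per (LLM, item) pair.
df = pd.read_csv("llm_results.csv")  # soft_acc, modality, arg_form, ppl, model

md = smf.mixedlm(
    "soft_acc ~ C(modality) + C(arg_form) + ppl",  # fixed effects
    df,
    groups=df["model"],   # per-LLM random effects
    re_formula="~ppl",    # random intercept + random perplexity slope
)
fit = md.fit()
print(fit.summary())
```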


30 of 37

Results: Fixed Effects of Modality and Argument Forms

LLMs are more likely to respond Yes under the possibility modality, and No under the necessity modality.

Results on argument forms resonate with Wason (1968)---modus tollens is harder for humans; related results on LLMs also reported by Huang and Wurgaft (2023).


[Wason. 1968. Reasoning about a rule. In Quarterly journal of experimental psychology]

[McKenzie et al. 2023. Inverse Scaling: When bigger isn’t better. In TMLR; Credit to Huang and Wurgaft: Task Modus Tollens]

31 of 37

Results: Mixed Effects of Perplexity


99.9% confidence intervals: the mixed effects of perplexity are negative for all LLMs, in line with the results of Gonen et al. (2023) and McCoy et al. (2024).

32 of 37

Results: Constant Random Effects per LLM


The models’ soft-accuracy ranking (numbers in parentheses; ↓) shows a trend similar to the constant random effects (↑).

33 of 37

Accuracy Distributions


Left and middle: similar perplexity but clearly different performance, which is better explained by modality and argument forms.

Right: significantly different perplexity, but similar performance.

34 of 37

This Talk: Linguistics for Language Models

What tasks are hard for current large language and vision-language models?

  • Linguistic theories can guide the design of our analyses.
  • Frame of reference theory (Levinson, 2003): identify the reasons for spatial reasoning issues in vision-language models.
  • Possible world semantics and modal logic (Kripke, 1959, 1963): explain text-only language model behaviors.

The analyses also offer insights on what we should do to improve the models. What’s next?


[Levinson. 2003. Space in language and cognition: Explorations in cognitive diversity. Vol. 5. Cambridge University Press]

[Kripke. 1959. A completeness theorem in modal logic. The Journal of Symbolic Logic, 24(1), 1-14]

[Kripke. 1963. Semantical considerations on modal logic. Acta Philosophica Fennica, 16]

35 of 37

Thoughts on Synthetic Data

All work in this talk is based on synthetic data.

Synthetic data doesn’t represent the entire world.

  • Success on synthetic data does not necessarily imply success on real data.
  • Recursive training on synthetic data leads to model collapse (Shumailov et al., 2023).

Synthetic data provides a clean testbed for analysis purposes.

  • Failures on synthetic data raise caveats for real-world applications.


[Shumailov et al. 2023. The curse of recursion: Training on generated data makes models forget]

36 of 37

Thoughts on Generative AI

Generative AI has exciting potential.

But what is our goal in building generative AI?

Generative AI as (1) a powerful human assistant, or (2) a human cognitive model?

  • These goals differ in research questions, benchmarking protocols, desired training-data availability, etc.
  • Both require better understanding and interpretability.
  • Both require better personalized and pragmatic understanding of humans.
  • We are in an era where interdisciplinary insights are more valuable than ever.
  • We are in an era where interdisciplinary insights are more valuable than ever.


37 of 37

Thanks!

The Computation, Language, Intelligence, and Grounding (CompLING) Laboratory at the University of Waterloo: compling-wat.com
