Why Does ChatGPT “Delve” So Much? Exploring the Sources of Lexical

Overrepresentation in Large Language Models

21 Jan 2025 @ COLING25

T. S. Juzek & Z. B. Ward

Joint work with Zina Ward

Throughout the project, we received great input from Gordon Erlebacher

Links

Background

Language changes over time

Scientific English changes over time (→ Elke Teich’s Team at Saarland University)

Examples: [figures]

Background

There have been rapid changes recently

These changes are hard to explain ‘naturally’

Background

  • That this is happening is well established

Koppenburg, 2024; Nguyen, 2024; Shapira, 2024; Gray, 2024; Kobak et al., 2024; Liang et al., 2024; Liu and Bu, 2024; Matsui, 2024; Juzek and Ward, 2025

Broader impacts:

(virtually) unprecedented language change

Background

Early on, these changes were attributed to the influence of Large Language Models (LLMs) like ChatGPT

Background

However:

  • Handpicked items only
  • The link to LLMs needs strengthening
  • Critically: it is not clear WHY LLMs do this
    • Informed speculation → RLHF: Hern, 2024; Sheikh, 2024

Background

Our work:

  • A procedure to systematically identify overused items
  • An investigation of why LLMs overuse certain words

The procedure:
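The procedure itself was presented as figures here. As a minimal sketch of the general idea, flagging words that are far more frequent in LLM output than in human-written text; the file names, add-one smoothing, and top-20 cut-off are assumptions for illustration, not the paper's exact settings:

```python
# Minimal sketch: flag words that are much more frequent in LLM output
# than in human-written text. File names and cut-offs are hypothetical.
from collections import Counter

def word_counts(path):
    with open(path, encoding="utf-8") as f:
        return Counter(f.read().lower().split())

human = word_counts("human_corpus.txt")  # e.g., pre-LLM scientific abstracts
llm = word_counts("llm_corpus.txt")      # e.g., LLM-generated abstracts

human_total = sum(human.values())
llm_total = sum(llm.values())

ratios = {}
for word, llm_count in llm.items():
    # Add-one smoothing avoids division by zero for words
    # that never occur in the human corpus
    human_rate = (human[word] + 1) / (human_total + 1)
    llm_rate = (llm_count + 1) / (llm_total + 1)
    ratios[word] = llm_rate / human_rate

# Candidate "focal words": largest LLM-to-human frequency ratios
for word, ratio in sorted(ratios.items(), key=lambda kv: -kv[1])[:20]:
    print(f"{word}\t{ratio:.1f}x")
```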

List of factors

  • Initial training data
  • Fine-tuning
  • Model architecture
  • Choice of algorithms
  • Context priming
  • Learning from Human Feedback
  • Other factors (parameter settings, etc.)

Most of these: possible, but no strong starting points

Learning from Human Feedback: compare the language output of Llama Base (-LHF) vs Llama Instruct (+LHF) → indicator (see the sketch below)
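A minimal sketch of such a comparison, assuming access to the gated Llama 3 checkpoints on Hugging Face; the prompt, sample size, and generation settings are illustrative, not the study's configuration:

```python
# Sketch: compare how often "delve" (and variants) appear in the output
# of a Base vs an Instruct model. Model IDs require access approval;
# the prompt and generation settings are assumptions.
from transformers import pipeline

PROMPT = "Rewrite this abstract in polished academic English: ..."

def delve_rate(model_id, n_samples=20):
    gen = pipeline("text-generation", model=model_id)
    outputs = gen([PROMPT] * n_samples, max_new_tokens=200, do_sample=True)
    text = " ".join(out[0]["generated_text"] for out in outputs).lower()
    tokens = text.split()
    # Rate of delve/delves/delving per token
    return sum(t.startswith("delv") for t in tokens) / len(tokens)

print("Base (-LHF):    ", delve_rate("meta-llama/Meta-Llama-3-8B"))
print("Instruct (+LHF):", delve_rate("meta-llama/Meta-Llama-3-8B-Instruct"))
```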

N.B.: RLHF, DPO, and LHF

(Illustration from Rafailov et al., 2024)

Direct Preference Optimization (DPO)

Rafailov et al., 2024
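For reference, the DPO objective from Rafailov et al. (2024), which replaces an explicit reward model and RL loop with a single classification-style loss over preference pairs, where y_w is the preferred and y_l the dispreferred response:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[
    \log \sigma \left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
  \right]
```

Here \pi_\theta is the model being tuned, \pi_{\mathrm{ref}} the frozen reference (SFT) model, \sigma the logistic function, and \beta a scaling parameter.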

Umbrella term

Learning from Human Feedback (LHF)

Chosen because Llama 3 relies on Direct Preference Optimization rather than classic RLHF

LHF

  • LHF is a very plausible candidate
  • Others have pointed to it
  • Experimental validation is needed

LHF

  • typically:
    • performed in the Global South
    • under precarious conditions
    • Toxtli et al., 2021; Roberts, 2022; Novick, 2023
  • lack of transparency

Experiment: Emulate LHF

  • IRB approval
  • LAMP stack for the rating website
  • Decision log
    • virtually everything pre-planned
    • pre-designed: coefficient, 2.5 vs 2
    • “preliminary” results
  • Recruitment via Prolific
  • Emulated procedure: demographics
    • Global South
  • Highest standards, incl. NCP
  • Random item order, random item position, etc.

Analysis

  • chi-square (minimal sketch below)
  • exploratory multifactorial regression
  • → details in the paper
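A minimal sketch of the chi-square step, assuming a 2x2 contingency table of token counts; the numbers are invented for illustration, not the study's data:

```python
# Sketch: 2x2 chi-square test of independence for one word's frequency.
# All counts below are made up for illustration only.
from scipy.stats import chi2_contingency

# Rows: corpora (human, LLM); columns: "delve" tokens, all other tokens
table = [[40, 999_960],    # hypothetical human corpus, ~1M tokens
         [400, 999_600]]   # hypothetical LLM corpus, ~1M tokens

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p:.2e}")
```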

Results

  • Exclusion rate
  • “Delve” pushback
    • We will come back to this

→ virtually no chance of obtaining conclusive experimental results

→ conjecture → follow-up

Limitations

  • Issues with the experiment
  • For ethical reasons, we cannot truly emulate the procedure
  • Other factors still need to be explored

Intellectual merit

  • A procedure to identify overused words (“focal words”)
  • Factors contributing to lexical bias
    • stronger, but not fully conclusive, evidence → Learning from Human Feedback

Broader impacts

  • Technology is strongly affecting language usage
  • It was not clear what to make of the recent changes
  • Nor what to make of their causes

Broader impacts

  • The big unknown:
    • variety vs demographics (age)
  • It could be just ‘normal’ language change!
  • Or just the task!
  • → follow-up

Broader impacts

  • Critically:
    • Insights can be gained
    • It is tough, though, partly because:
      • a lack of procedure and data transparency slows down progress

Thank you