1 of 40


You don't need a personality test to know these models are unreliable:

Assessing the Reliability of Large Language Models on Psychometric Instruments

Bangzhao Shu*, Minje Choi, David Jurgens, Lechen Zhang*, Lavinia Dunagan, Lajanugen Logeswaran, Moontae Lee, Dallas Card

* Equal Contribution

2 of 40

LLMs can be replicas of human agents


3 of 40

  • For human agents, we want to know their personalities to identify latent factors that influence their downstream behavior

  • And we want to know the same for our LLM agents!

But do LLMs actually have consistent personas?

LLMs can be replicas of human agents

4 of 40

Research Questions


  1. How can we systematically measure the personas of LLMs?
  2. How can we systematically measure the reliability of LLM responses to persona tests?
  3. Can we improve the consistency of LLM responses by adding personas to the prompt?

5 of 40

What is a Persona?

  • Personas: attributes that make up a person’s identity
    • personality, demographics, values


  • How do we usually measure a person’s persona?

Q: Do you agree that you are the life of the party?
A: Yes!

6 of 40

How Could We Measure LLMs’ Personas?

6

Statement:

<Statement> (You are the life of the party.)

Question:

Do you agree with the statement? Reply with only ‘Yes’ or ‘No’ without explaining your reasoning.

Answer:
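As a minimal sketch, the probe above can be assembled programmatically; the snippet below is illustrative (PROBE_TEMPLATE and build_probe are our names, not the paper's code):

```python
# Minimal sketch of the yes/no persona probe above (illustrative, not the repo's code).
PROBE_TEMPLATE = (
    "Statement: {statement}\n"
    "Question: Do you agree with the statement? "
    "Reply with only 'Yes' or 'No' without explaining your reasoning.\n"
    "Answer:"
)

def build_probe(statement: str) -> str:
    """Fill the template with one questionnaire item."""
    return PROBE_TEMPLATE.format(statement=statement)

print(build_probe("You are the life of the party."))
```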

7 of 40

Model-Personas: A Comprehensive Dataset for Measuring Personas

  • Collects many psychological instruments in the form of questionnaires


  • Based on a large survey of existing persona attributes

8 of 40

Model-Personas: A Comprehensive Dataset for Measuring Personas


Questions (693), e.g.:
  • I am interested in people
  • I sympathize with others’ feelings
  • It would be okay if some people were treated differently from others
  • It would be okay if someone acted unfairly

Instruments (39), e.g.: AIS, OCEAN, EPQ, MFT, MBTI, ACI

Persona Axes (115), e.g.: Agreeableness, Extroversion, Authority, Avoid harm, Fairness
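To make the structure concrete, each item can be thought of as a (question, instrument, axis) triple; the field names below are illustrative, not the dataset's actual schema:

```python
# One plausible record layout for a Model-Personas item (illustrative schema,
# not the dataset's actual field names).
item = {
    "question": "I am interested in people",
    "instrument": "OCEAN",    # one of the 39 instruments
    "axis": "Agreeableness",  # one of the 115 persona axes
}
```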

9 of 40

Which LLMs might have personas?


  • Falcon-7B
  • BLOOMZ (560M, 1B1, 3B, 7B1)
  • Llama2 (7B, 7B-Chat, 13B, 13B-Chat)
  • RedPajama-7B
  • FLAN-T5 (Small, Base, Large, XL)
  • GPT (2, 3.5 & 4)

10 of 40

We have instruments to measure with. But how reliable are the LLM-generated responses?

Sample responses to equivalent items:
  • “Do you agree that you are the life of the party?” → Yes / Yes
  • “Do you agree that you are not the life of the party?” → No / Yes
  • “Do you agree that you avoid being the life of the party?” → No / Yes

11 of 40

Research Questions


  • How can we systematically measure the personas of LLMs?
  • How can we systematically measure the reliability of LLM responses to persona tests?
  • Can we improve the consistency of LLM responses by adding personas to the prompt?

12 of 40

Criteria for Examining the Reliability of LLMs


  • Comprehensibility: can an LLM understand the instruction and return “Yes” or “No”?

“Do you agree?” → GPT-2: “Gracias!” / GPT-4: “Yes”
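One plausible way to score this criterion, assuming a strict match on the normalized reply (the paper's exact normalization may differ):

```python
def is_comprehensible(response: str) -> bool:
    """Count a reply as comprehensible only if it reduces to 'yes' or 'no'."""
    cleaned = response.strip().rstrip(".!?").lower()
    return cleaned in {"yes", "no"}

assert is_comprehensible(" Yes ")         # a valid answer
assert not is_comprehensible("Gracias!")  # an off-instruction reply, like GPT-2's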

13 of 40

Criteria for Examining the Reliability of LLMs


  • Sensitivity: does an LLM answer vary with spurious changes to the question format?

Same item, three prompt endings:
  • “Answer?” → Yes / Yes
  • “Answer:” → Yes / No
  • “Answer:\n” → Yes / No
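Such spurious variants can be generated mechanically; a minimal sketch (the endings mirror the slides, the helper name is ours):

```python
# Same question content, different spurious prompt endings.
ENDINGS = ["Answer?", "Answer:", "Answer:\n", "Answer: "]  # note the trailing space

def format_variants(question: str) -> list[str]:
    """Attach each spurious ending to an otherwise identical prompt."""
    return [f"{question}\n{ending}" for ending in ENDINGS]
```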

14 of 40

Criteria for Examining the Reliability of LLMs


  • Consistency: does an LLM answer vary with content-level variation of the same question?

15 of 40

[Consistency] Types of content-level variation

  • Option consistency
    • Reply with only ‘Yes’ or ‘No’
    • Reply with only ‘True’ or ‘False’
  • Order consistency
    • Reply with only ‘Yes’ or ‘No’
    • Reply with only ‘No’ or ‘Yes’
  • Paraphrastic negation consistency: reverse meaning w/o negation word
    • If an action could harm an innocent other, it still can be done.
    • If an action could harm an innocent other, it should be prohibited.
  • Direct negation consistency: add negation word
    • If an action could harm an innocent other, it still can be done.
    • If an action could harm an innocent other, then it should not be done.

These consistency tests are easy for people, but what about LLMs?
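A sketch of the four variant types as data, paraphrasing the examples above (the pairings and key names are illustrative):

```python
# Content-level variants of one probe, keyed by consistency type.
VARIANTS = {
    "option": ("Reply with only 'Yes' or 'No'.",
               "Reply with only 'True' or 'False'."),
    "order":  ("Reply with only 'Yes' or 'No'.",
               "Reply with only 'No' or 'Yes'."),
    "paraphrastic_negation": (
        "If an action could harm an innocent other, it still can be done.",
        "If an action could harm an innocent other, it should be prohibited."),
    "direct_negation": (
        "If an action could harm an innocent other, it still can be done.",
        "If an action could harm an innocent other, then it should not be done."),
}
# Option/order variants should leave the underlying answer unchanged;
# negation variants should flip it.
```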

16 of 40

Results


17 of 40

Comprehensibility: Can LLMs answer with Yes or No even after spurious prompt changes?


  • Most LLMs have high comprehensibility
    • e.g., GPT-3.5/4, BLOOMZ, and FLAN-T5 families
  • But there is variation across models
    • Falcon-7B: “Answer:” → 1.00, but “Answer: ” (trailing space) → 0.00
    • Llama-2-13B: “Answer:” → 1.00, but “Answer?” → 0.03

18 of 40

Sensitivity: Do models retain their answers even after spurious prompt changes?

19 of 40

Sensitivity: Do models retain their answers even after spurious prompt changes?

[Chart: fraction of retained answers by prompt ending, with a random-choice baseline]

LLMs flip their answers after an added space


21 of 40

Sensitivity: Do models retain their answers even after spurious prompt changes?


The model’s architecture matters a lot!

22 of 40

Consistency: Do models respond consistently to content-level prompt changes?



24 of 40

Consistency: Do models respond consistently to content-level prompt changes?


  • Most models achieve moderate (>0.7) order consistency except Llama2-13B-Chat
  • Larger models do not necessarily have higher order consistency.

[Chart: order consistency, “Yes or No” vs. “No or Yes”]
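One way such a score could be computed, assuming consistency is the fraction of items whose answer survives the option swap (our reading of the metric, not the paper's code):

```python
def order_consistency(orig: list[str], swapped: list[str]) -> float:
    """Fraction of items answered identically under 'Yes or No' vs. 'No or Yes'."""
    assert len(orig) == len(swapped)
    return sum(a == b for a, b in zip(orig, swapped)) / len(orig)

# e.g. order_consistency(["Yes", "No", "Yes"], ["Yes", "Yes", "Yes"]) -> 2/3
```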


26 of 40

Consistency: Do models respond consistently to content-level prompt changes?


  • Option consistency is slightly harder to achieve, especially for the Llama2 family

[Chart: option consistency, “Yes or No” vs. “True or False”, with random baseline]


28 of 40

Consistency: Do models respond consistently to content-level prompt changes?


  • Very few models significantly beat the random choice baseline
  • Larger and instruction-tuned models perform better

[Chart: direct negation consistency, e.g. “should be done” vs. “should not be done”, with random baseline]
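Negation consistency plausibly scores the opposite pattern: a consistent model must flip its answer on the negated item (again, a sketch under our reading of the metric):

```python
def negation_consistency(orig: list[str], negated: list[str]) -> float:
    """Fraction of items where the answer correctly flips on the negated wording."""
    flip = {"Yes": "No", "No": "Yes"}
    return sum(flip.get(a) == b for a, b in zip(orig, negated)) / len(orig)

# e.g. negation_consistency(["Yes", "No"], ["No", "No"]) -> 0.5
```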


30 of 40

Consistency: Do models respond consistently to content-level prompt changes?


  • Paraphrastic negation is similar to direct negation, but even harder for all LLMs!

[Chart: paraphrastic negation consistency, e.g. “should be done” vs. “should be prohibited”, with random baseline]

31 of 40

Consistency: Do models respond consistently to content-level prompt changes?


  • While LLMs do decently on option and order consistency, negation consistency is much harder to achieve

32 of 40

LLMs struggle to provide consistent answers


Can we improve their consistency?

33 of 40

Research Questions


  • How can we systematically measure the personas of LLMs?
  • How can we systematically measure the reliability of LLM responses to persona tests?
  • Can we improve the consistency of LLM responses by adding personas to the prompt?

34 of 40

Adding personas to prompts

You are an {extroverted} person {who is outgoing and energized by interactions with other people}.
“Do you agree that you are the life of the party?” → “Yes!”

Will this happen?

“Do you agree that you are the life of the party?” → “Maybe”


35 of 40

Adding personas to prompts


  • “You are a {persona} person.” <Do you agree that …>
    • Normal: You are a {normal} person.
    • Specific personality: You are an {extroverted} person {who is outgoing and energized by interactions with other people}.
    • Highly personified: You are a {35 different personality characteristics} person.
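A minimal sketch of persona injection, with wording paraphrased from this slide (the dict and helper are ours, not the repo's API):

```python
# Persona prefixes prepended to every probe (wording paraphrased from the slide).
PERSONAS = {
    "normal": "You are a normal person.",
    "extroverted": ("You are an extroverted person who is outgoing and "
                    "energized by interactions with other people."),
}

def with_persona(persona: str, question: str) -> str:
    """Prepend the persona description to the probe."""
    return f"{PERSONAS[persona]} {question}"

print(with_persona("extroverted", "Do you agree that you are the life of the party?"))
```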

36 of 40

Does adding personas improve consistency?


37 of 40

Does adding personas improve consistency?


  • Adding a persona did increase consistency on its targeted axis (e.g., extroversion), at the expense of a general drop on other axes

38 of 40

Does adding personas improve consistency?


  • Adding any persona to the prompt generally decreases negation consistency

39 of 40

Summary


  • We present Model-Personas, a dataset for measuring a wide range of persona dimensions
  • We show that nearly all LLMs fail to return consistent answers under both spurious and semantic prompt variations
  • Even injecting specific personas into prompts does not make models more consistent

40 of 40


Bangzhao Shu*, Minje Choi, David Jurgens, Lechen Zhang*, Lavinia Dunagan, Lajanugen Logeswaran, Moontae Lee, Dallas Card

Thanks for listening!

GitHub: https://github.com/orange0629/llm-personas

Email: leczhang@umich.edu & bangzhao@umich.edu