You don't need a personality test to know these models are unreliable:
Assessing the Reliability of Large Language Models on Psychometric Instruments
Bangzhao Shu*, Minje Choi, David Jurgens, Lechen Zhang*, Lavinia Dunagan, Lajanugen Logeswaran, Moontae Lee, Dallas Card
* Equal Contribution
LLMs can be replicas of human agents
But do LLMs actually have consistent personas?
Research Questions
What is a Persona?
Do you agree that you are the life of the party?
Yes!
How Could We Measure LLMs’ Personas?
Statement:
<Statement> (You are the life of the party.)
Question:
Do you agree with the statement? Reply with only ‘Yes’ or ‘No’ without explaining your reasoning.
Answer:
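Concretely, this template can be assembled with a small helper like the sketch below (the function name is ours; the wording follows the slide's template).

```python
def build_prompt(statement: str) -> str:
    """Wrap a psychometric item in the Yes/No prompt template shown above."""
    return (
        "Statement:\n"
        f"{statement}\n"
        "Question:\n"
        "Do you agree with the statement? Reply with only 'Yes' or 'No' "
        "without explaining your reasoning.\n"
        "Answer:"
    )

print(build_prompt("You are the life of the party."))
```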
Model-Personas: A Comprehensive Dataset for Measuring Personas
Questions (693): "I am interested in people", "I sympathize with others' feelings", …, "It would be okay if some people were treated differently from others", "It would be okay if someone acted unfairly", …
Instruments (39): AIS, OCEAN, EPQ, …, MFT, MBTI, ACI, …
Persona Axes (115): Agreeableness, Extroversion, …, Authority, Avoid harm, Fairness, …
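One possible way to represent how the dataset ties questions, instruments, and persona axes together is sketched below; the field names and the specific pairings are illustrative, not the released schema.

```python
# Illustrative records linking a question to an instrument and a persona axis;
# field names and pairings are our own, not necessarily the released schema.
dataset = [
    {"question": "I am interested in people",
     "instrument": "OCEAN", "persona_axis": "Agreeableness"},
    {"question": "It would be okay if someone acted unfairly",
     "instrument": "MFT", "persona_axis": "Fairness"},
]

# In total: 693 questions drawn from 39 instruments, covering 115 persona axes.
```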
Which LLMs might have personas?
We have the instruments to measure with. But how reliable are the LLM-generated responses?
The same trait can be probed with negated or paraphrased wording, and a consistent respondent's answers should track the rewording:
"Do you agree that you are the life of the party?" → Yes
"Do you agree that you are not the life of the party?" → No (but a model may still answer Yes)
"Do you agree that you avoid being the life of the party?" → No (but a model may still answer Yes)
Research Questions
Criteria for Examining the Reliability of LLMs
[Comprehensibility] Asked "Do you agree?", GPT-2 replies "Gracias!" while GPT-4 replies "Yes".
[Sensitivity] The same item asked with slightly different prompt endings ("Answer?", "Answer:", "Answer:\n") can yield different answers (e.g., Yes vs. No).
[Consistency] Types of content-level variation
These consistency tests are easy for people, but what about LLMs?
Results
Comprehensibility: Can LLMs answer with Yes or No even after spurious prompt changes?
Sensitivity: Do models retain their answers even after spurious prompt changes?
[Figures: results by prompt ending; "Random" marks the random baseline]
LLMs flip their answers after an added space.
The model's architecture matters a lot!
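One way this sensitivity could be quantified is sketched below: query the model under several spurious prompt-ending variants and count how often its answer flips. Here `ask_model` stands in for whatever inference call is used, and the variant list is illustrative.

```python
from typing import Callable

# Spurious prompt-ending variants; "Answer: " adds the trailing space mentioned
# in the slides. The exact set of variants used in the paper may differ.
ENDINGS = ["Answer?", "Answer:", "Answer:\n", "Answer: "]

def flip_rate(statement: str, ask_model: Callable[[str], str]) -> float:
    """Fraction of ending variants whose answer differs from the first variant's."""
    answers = []
    for ending in ENDINGS:
        prompt = (
            f"Statement:\n{statement}\n"
            "Question:\n"
            "Do you agree with the statement? Reply with only 'Yes' or 'No' "
            "without explaining your reasoning.\n"
            f"{ending}"
        )
        answers.append(ask_model(prompt).strip().lower())
    flips = sum(a != answers[0] for a in answers[1:])
    return flips / (len(answers) - 1)
```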
Consistency: Do models respond consistently to content-level prompt changes?
[Figures: consistency under each content-level variation; "Random" marks the random baseline]
"Yes or No" vs. "No or Yes" (swapped option order)
"Yes or No" vs. "True or False" (paraphrased options)
"should be done" vs. "should not be done" (negated statement)
"should be done" vs. "should be prohibited" (paraphrased statement)
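A minimal sketch of how consistency under these content-level variations could be scored, assuming the Yes/No answers have already been collected; the helper names are ours.

```python
def consistent_under_option_swap(answer_yes_no: str, answer_no_yes: str) -> bool:
    """Swapping 'Yes or No' to 'No or Yes' should not change the chosen answer."""
    return answer_yes_no.strip().lower() == answer_no_yes.strip().lower()

def consistent_under_negation(answer_original: str, answer_negated: str) -> bool:
    """Negating the statement ('should be done' -> 'should not be done')
    should flip the answer."""
    flip = {"yes": "no", "no": "yes"}
    return flip.get(answer_original.strip().lower()) == answer_negated.strip().lower()

# Example: answering Yes to both the original and the negated statement
# counts as inconsistent.
assert consistent_under_negation("Yes", "No")
assert not consistent_under_negation("Yes", "Yes")
```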
LLMs struggle to provide consistent answers
Can we improve their consistency?
Research Questions
Adding personas to prompts
You are an {extroverted} person {who is outgoing and energized by interactions with other people}.
With the persona added, will the model now answer "Yes!" to "Do you agree that you are the life of the party?", or will it still answer "Maybe"?
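A sketch of how the persona prefix could be combined with the earlier Yes/No template; the function is illustrative and the trait text is the slide's example.

```python
def build_persona_prompt(trait: str, description: str, statement: str) -> str:
    """Prefix the Yes/No prompt with a persona, following the slide's template."""
    # Note: the article "an" matches the slide's example ("an extroverted person");
    # real code would adjust it per trait.
    persona = f"You are an {trait} person {description}."
    return (
        f"{persona}\n"
        f"Statement:\n{statement}\n"
        "Question:\n"
        "Do you agree with the statement? Reply with only 'Yes' or 'No' "
        "without explaining your reasoning.\n"
        "Answer:"
    )

print(build_persona_prompt(
    "extroverted",
    "who is outgoing and energized by interactions with other people",
    "You are the life of the party.",
))
```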
Does adding personas improve consistency?
Summary
Bangzhao Shu*, Minje Choi, David Jurgens, Lechen Zhang*, Lavinia Dunagan, Lajanugen Logeswaran, Moontae Lee, Dallas Card
Thank you for listening!