When Hindsight is Not 20/20: Testing Limits on Reflective Thinking in Large Language Models
Chenghao Yang* (University of Chicago) · Allyson Ettinger (Allen Institute for AI) · Yanhong Li* (University of Chicago)
OMG, AI can do Self-Reflection!
Example credit to the paper “Large Language Models Cannot Self-Correct Reasoning Yet” (Huang et al., 2024)
Bill: I heard you can do self-reflection! Let’s try a simple calculation problem!
ChatGPT: Sure!
Bill: Christina is planning a birthday party and needs 0.75 gift bags per invited guest, because 1/4 of attendees don’t show up. She invited 16 friends. Gift bags are $2 each. How much will she spend?
ChatGPT: … [bunch of reasoning] So the answer is 18.
Bill: (Ha! You got it wrong!) Review your previous answer and find problems with your answer.
ChatGPT: Upon reviewing my answer, I find that I made a mistake in … The expected number of guests should be 16 × 0.75 = 12 …
Bill: (What!? This review sounds reasonable!) Can you improve your answer based on your review?
ChatGPT: [bunch of reasoning] The answer is 24.
Gold answer: 24 (16 × 0.75 = 12 bags; 12 × $2 = $24). Here, self-reflection fixed the initial error.
Wait, Really…?
Amy: Hold on, Bill. I don’t buy it. Let’s try another one: Terry eats 2 yogurts a day. They are currently on sale at 4 yogurts for $5.00. How much does he spend on yogurt over 30 days?
ChatGPT: … [bunch of reasoning] So the answer is 75.
Amy: (Correct! But let’s continue to see what would happen.) Review your previous answer and find problems with your answer.
ChatGPT: Upon reviewing my answer, I find that I made a mistake in …
Amy: (What!? You did not make a mistake!) Can you improve your answer based on your review?
ChatGPT: [bunch of reasoning] The answer is 37.50.
Gold answer: 75 (2 × 30 = 60 yogurts = 15 packs × $5 = $75). Here, self-reflection broke an initially correct answer.
LLMs exhibit non-robust self-reflective thinking!
Let’s Verify with More Rigor!
Prompting Format
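The experiments follow the same three-stage prompt flow as the dialogues above: answer, review, revise. Below is a minimal sketch, assuming a hypothetical chat(messages) helper that stands in for any chat-completion API; the prompt strings are taken from the examples above (after Huang et al., 2024).

```python
from typing import Callable, Dict, List

Message = Dict[str, str]

# Prompt strings follow the dialogue examples above (Huang et al., 2024).
REFLECT_PROMPT = "Review your previous answer and find problems with your answer."
IMPROVE_PROMPT = "Can you improve your answer based on your review?"

def self_reflect(chat: Callable[[List[Message]], str], question: str) -> str:
    """Three-stage self-reflection: initial answer -> review -> revised answer."""
    history: List[Message] = [{"role": "user", "content": question}]
    initial = chat(history)                        # Stage 1: initial answer
    history += [{"role": "assistant", "content": initial},
                {"role": "user", "content": REFLECT_PROMPT}]
    review = chat(history)                         # Stage 2: self-reflection
    history += [{"role": "assistant", "content": review},
                {"role": "user", "content": IMPROVE_PROMPT}]
    return chat(history)                           # Stage 3: revised answer
```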
Non-Robust Self-Reflection Improvement
While self-reflection improves performance on TruthfulQA, it harms performance on HotpotQA.
What is happening? A Psychological Guess
Hommel, Mandy, Bärbel Fürstenau, and Regina H. Mulder. "Reflection at work–A conceptual model and the meaning of its components in the domain of VET teachers." Frontiers in Psychology 13 (2023): 923888.
Let’s consider what happens when a human does reflection!
Teacher (Human): Do you know the answer to 1 + 1?
Student (AI): Easy! It’s 1 + 1 = 2.
Teacher (Human): Can you review your answer?
Student (AI): (Weird! That problem is too easy. Let me make up a reflection.) Well, I might have made a mistake in …
Teacher (Human): Can you improve your answer then?
Student (AI): (I don’t care, as I believe I am right!) … The answer is 2.
If the student is (over)confident about the question, it is hard to trigger effective self-reflection. There is a complicated interplay between question difficulty and the student’s comprehension.
Our Analysis Dimensions: Question Difficulty (QD) and Model Confidence (MC)
Interplay of QD and MC on HotpotQA
Since our focus is the usefulness of self-reflection, to ensure we gather sufficient statistics, we created artificial response sets simulating situations where the initial responses are 0%, 10%, … correct. See our paper for more details.
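A minimal sketch of how such controlled initial-response sets could be constructed (make_initial_responses and all names here are hypothetical, not the paper’s actual code):

```python
import random

def make_initial_responses(golds, wrongs, frac_correct, seed=0):
    """Synthesize initial responses in which a fixed fraction are correct
    (frac_correct = 0.0, 0.1, ...), so the effect of self-reflection can be
    measured at controlled levels of initial accuracy."""
    rng = random.Random(seed)
    n = len(golds)
    correct_idx = set(rng.sample(range(n), round(frac_correct * n)))
    return [g if i in correct_idx else w
            for i, (g, w) in enumerate(zip(golds, wrongs))]
```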
Challenging questions: useful self-reflection is triggered!
Easy questions: self-reflection mainly works as a distractor!
Human-annotated question difficulty shifts the threshold at which self-reflection becomes useful.
How does Self-Reflection work?
Wang, Xuezhi, et al. "Self-Consistency Improves Chain of Thought Reasoning in Language Models." ICLR. 2023.
Chen, Xinyun, et al. "Universal self-consistency for large language model generation." arXiv preprint arXiv:2311.17311 (2023).
Case 1: models are confidently wrong (overconfident). Case 2: models are confidently correct. In both cases, self-reflection works by mitigating the tendency to (blindly) follow the majority vote among the initial responses.
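Self-consistency (Wang et al., 2023) aggregates sampled answers by majority vote. A minimal sketch of that vote, and of detecting when reflection departs from it (function names are illustrative, not from the paper):

```python
from collections import Counter

def majority_vote(answers):
    """Self-consistency: return the most frequent sampled answer."""
    return Counter(answers).most_common(1)[0][0]

def departs_from_majority(initial_samples, reflected_answer):
    """True if the post-reflection answer abandons the initial majority."""
    return reflected_answer != majority_vote(initial_samples)

# E.g., initial samples vote "18"; reflection outputs "24" -> departs.
assert majority_vote(["18", "18", "24"]) == "18"
assert departs_from_majority(["18", "18", "24"], "24")
```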
Proposed Guideline