Presenter: Yihe Deng
Rephrase and Respond: Let Large Language Models Ask Better Questions for Themselves
1
Yihe Deng, Weitong Zhang, Zixiang Chen, Quanquan Gu
Department of Computer Science
University of California, Los Angeles
Motivating Example
Question quality critically influences the response quality of LLMs.
Do people know if a question is clear enough for an LLM?
2
Motivating Example
3
[1] Physics of language models: Part 3.2, knowledge manipulation. (https://arxiv.org/abs/2309.14402)
Motivating Example
4
Motivating Example
5
Motivating Example
6
Motivating Example
Question quality critically influences the response quality of LLMs.
Do people know if a question is clear enough for an LLM?
7
Motivating Example
Question quality critically influences the response quality of LLMs.
Do people know if a question is clear enough for an LLM?
8
Motivating Example
Question quality critically influences the response quality of LLMs.
Do people know if a question is clear enough for an LLM?
9
Rephrase and Respond (RaR):
Let the LLM ask better questions for itself.
Motivating Example
10
Motivating Example
11
One-step RaR
12
Rephrase and Respond in a Single Prompt
One-step RaR
13
Rephrase and Respond in a Single Prompt
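For concreteness, here is a minimal sketch of how One-step RaR can be implemented: the rephrase-and-respond instruction is simply appended to the original question and sent as a single query. The instruction wording follows the paper's one-step prompt; ask_llm is a hypothetical placeholder for whichever chat-completion client is actually used.
```python
# One-step RaR: append the rephrase-and-respond instruction to the original
# question and obtain the rephrasing and the answer in a single LLM call.

def ask_llm(prompt: str) -> str:
    """Placeholder for an LLM call (wire this to your chat-completion API of choice)."""
    raise NotImplementedError

RAR_INSTRUCTION = "Rephrase and expand the question, and respond."

def one_step_rar(question: str) -> str:
    # The model rephrases the question and answers it in the same response.
    return ask_llm(f"{question}\n{RAR_INSTRUCTION}")
```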
Two-step RaR
14
Rephrase the Question and Respond to the Rephrased Question
Two-step RaR
15
Rephrase the Question and Respond to the Rephrased Question
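A matching sketch of Two-step RaR: the question is first rephrased, and the responding step then answers given both the original and the rephrased question. The instruction wording and the step-2 prompt format below are illustrative approximations of the paper's prompts, and ask_llm is again a hypothetical placeholder.
```python
# Two-step RaR: (1) ask the model to rephrase the question;
# (2) respond using the original question together with its rephrased version.

def ask_llm(prompt: str) -> str:
    """Placeholder for an LLM call."""
    raise NotImplementedError

REPHRASE_INSTRUCTION = (
    "Given the above question, rephrase and expand it to help you do better answering. "
    "Maintain all information in the original question."
)

def two_step_rar(question: str) -> str:
    # Step 1: self-rephrase the question into a clearer, expanded version.
    rephrased = ask_llm(f"{question}\n{REPHRASE_INSTRUCTION}")
    # Step 2: respond to the original and rephrased questions together.
    return ask_llm(f"(original) {question}\n(rephrased) {rephrased}")
```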
RaR – Our Contribution
We investigated the misunderstandings between humans and LLMs: questions that appear clear to humans may still be misinterpreted by LLMs.
Compared to works that use LLMs to generate questions for training/fine-tuning, we let the LLM rephrase the question and respond to it for better understanding and answer quality.
Unlike methods that employ multiple LLMs for iterative prompt engineering based on accuracy scores, RaR is unsupervised and training-free, making it economical and applicable to all questions.
Differences between our work and previous works.
16
Experiment results
17
Experiment results
18
Benchmark Tasks
19
Main Results on GPT-4
Takeaway #1: (One-step) RaR provides a universal, plug-and-play zero-shot prompt that allows for efficient and effective performance improvement of LLMs on general tasks.
RaR: A Simple Prompt to Improve LLM Performance
20
Main Results on GPT-4
Takeaway #2: Examining the question quality is pivotal when evaluating the LLM performance on QA tasks.
Takeaway #3: Two-step RaR provides a universal method for LLMs to improve the question quality autonomously by rephrasing the question.
Two-step RaR: Rephrased Questions Improve Response Quality
21
Experiment results
22
Performance across Various LLMs
Can All LLMs Rephrase Questions?
23
Performance across Various LLMs
Can All LLMs Rephrase Questions?
24
- More advanced models (GPT-4) gain the most significant improvements across all tasks.
- Models of lesser complexity (Vicuna) achieve only modest improvements.
Performance across Various LLMs
Can All LLMs Rephrase Questions?
25
- More advanced models (GPT-4) gain the most significant improvements across all tasks.
- Models of lesser complexity (Vicuna) achieve only modest improvements.
Takeaway #4
All models can benefit from rephrasing questions, with more advanced models expected to gain a larger improvement.
Performance across Various LLMs
26
Performance across Various LLMs
27
Performance across Various LLMs
28
Performance across Various LLMs
29
Performance across Various LLMs
GPT-4 can rephrase questions into better ones for Vicuna.
Are the Rephrased Questions Transferable?
30
Performance across Various LLMs
GPT-4 can rephrase questions into better ones for Vicuna.
Are the Rephrased Questions Transferable?
31
Takeaway #5
The rephrased questions are transferable: the questions rephrased by GPT-4 can improve the response quality on Vicuna.
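The transfer experiment can be sketched as follows: a stronger model produces the rephrased question, and a weaker model answers it. ask_gpt4 and ask_vicuna are hypothetical placeholders for the two model endpoints, and the prompt wording is illustrative.
```python
# Transferring rephrased questions: a stronger model (e.g., GPT-4) rephrases,
# and a weaker model (e.g., Vicuna) responds to the rephrased question.

def ask_gpt4(prompt: str) -> str:
    """Placeholder for a call to the stronger (rephrasing) model."""
    raise NotImplementedError

def ask_vicuna(prompt: str) -> str:
    """Placeholder for a call to the weaker (responding) model."""
    raise NotImplementedError

REPHRASE_INSTRUCTION = (
    "Given the above question, rephrase and expand it to help you do better answering. "
    "Maintain all information in the original question."
)

def transfer_rar(question: str) -> str:
    # Rephrase with the stronger model, then answer with the weaker one.
    rephrased = ask_gpt4(f"{question}\n{REPHRASE_INSTRUCTION}")
    return ask_vicuna(f"(original) {question}\n(rephrased) {rephrased}")
```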
Experiment results
32
Multiple Rephrasings
We consider “Was Abraham Lincoln born on an even day?” as an example question and use it for three successive self-rephrasings by GPT-4 across distinct runs.
Will multiple rephrasings lead to the same clarification?
33
“day of the month”
clarified in the first rephrasing, and the clarification persists in later ones.
Multiple Rephrasings
We consider “Was Abraham Lincoln born on an even day?” as an example question and use it for three successive self-rephrasings by GPT-4 across distinct runs.
Will multiple rephrasings lead to the same clarification?
34
“day of the month”
not clarified in the first rephrasing, but eventually clarified in the third attempt.
Multiple Rephrasings
We consider “Was Abraham Lincoln born on an even day?” as an example question and use it for three successive self-rephrasings by GPT-4 across distinct runs.
Will multiple rephrasings lead to the same clarification?
35
“day of the month”
not clarified in the first rephrasing, but eventually clarified in the third attempt.
Takeaway #6
GPT-4 can potentially clarify concepts through multiple rephrasings, even if it fails to do so in the initial attempt.
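The multiple-rephrasing experiment amounts to feeding each rephrased question back as input for another round of rephrasing. A minimal sketch, with ask_llm as a hypothetical placeholder and illustrative prompt wording:
```python
# Successive self-rephrasings: each round rephrases the output of the previous
# round, so an ambiguity missed early on may still be clarified in a later attempt.

def ask_llm(prompt: str) -> str:
    """Placeholder for an LLM call."""
    raise NotImplementedError

REPHRASE_INSTRUCTION = (
    "Given the above question, rephrase and expand it to help you do better answering. "
    "Maintain all information in the original question."
)

def self_rephrase(question: str, rounds: int = 3) -> list[str]:
    rephrasings, current = [], question
    for _ in range(rounds):
        current = ask_llm(f"{current}\n{REPHRASE_INSTRUCTION}")
        rephrasings.append(current)
    return rephrasings

# Example: self_rephrase("Was Abraham Lincoln born on an even day?")
```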
Experiment results
36
Mathematical Formulation
37
Mathematical Formulation
Chain-of-Thought (CoT)
38
Mathematical Formulation
Chain-of-Thought (CoT)
39
Mathematical Formulation
Chain-of-Thought (CoT)
40
Mathematical Formulation
One-step RaR
41
Mathematical Formulation
Two-step RaR
42
Mathematical Formulation
RaR+CoT
43
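The equations on these slides are shown as images; in shorthand, the formulations they refer to are roughly as below. The symbols are our own shorthand and may not match the paper's exact notation.
```latex
% Shorthand sketch of the formulations (symbols are illustrative, not the paper's
% exact notation): x = original question, \tilde{x} = rephrased question,
% y = response, c_{(\cdot)} = prompt template, p_{\mathrm{LLM}} = model distribution.
\begin{align*}
  &\text{Chain-of-Thought (CoT):} &
    y &\sim p_{\mathrm{LLM}}\!\left(\cdot \mid c_{\mathrm{CoT}}(x)\right) \\
  &\text{One-step RaR:} &
    y &\sim p_{\mathrm{LLM}}\!\left(\cdot \mid c_{\mathrm{RaR}}(x)\right) \\
  &\text{Two-step RaR:} &
    \tilde{x} &\sim p_{\mathrm{LLM}}\!\left(\cdot \mid c_{\mathrm{rephrase}}(x)\right),
    \quad y \sim p_{\mathrm{LLM}}\!\left(\cdot \mid c_{\mathrm{respond}}(x, \tilde{x})\right) \\
  &\text{RaR + CoT:} &
    y &\sim p_{\mathrm{LLM}}\!\left(\cdot \mid c_{\mathrm{RaR+CoT}}(x)\right)
\end{align*}
```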
Experiment results
44
Comparison with Zero-shot CoT
Zero-shot CoT: appending “Let’s think step by step.” to the end of a query.
45
We highlight some examples where zero-shot CoT fails to deliver improvements, sometimes even leading to diminished performance.
Comparison with Zero-shot CoT
Zero-shot CoT: appending “Let’s think step by step.” to the end of a query.
46
We highlight some examples where zero-shot CoT fails to deliver improvements, sometimes even leading to diminished performance.
We also present an example demonstrating the primacy of question quality.
Comparison with Zero-shot CoT
Zero-shot CoT: appending “Let’s think step by step.” to the end of a query.
47
We highlight some examples where zero-shot CoT fails to deliver improvements, sometimes even leading to diminished performance.
We also present an example demonstrating the primacy of question quality.
Lastly, we note that our method is orthogonal to zero-shot CoT and can be combined with it by simply adding “Let's think step by step.” to our prompts.
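Combining the two is essentially a one-line change to the prompt. A minimal sketch, again with a hypothetical ask_llm placeholder:
```python
# RaR + zero-shot CoT: append the CoT trigger after the RaR instruction, so the
# model rephrases the question, reasons step by step, and then answers.

def ask_llm(prompt: str) -> str:
    """Placeholder for an LLM call."""
    raise NotImplementedError

def rar_plus_cot(question: str) -> str:
    prompt = (
        f"{question}\n"
        "Rephrase and expand the question, and respond.\n"
        "Let's think step by step."
    )
    return ask_llm(prompt)
```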
Comparison with Zero-shot CoT
Chinese Idiom: masking the first character of a four-character Chinese idiom, and letting the LLM infer the masked character.
48
[1] How Language Model Hallucinations Can Snowball (https://arxiv.org/abs/2305.13534)
Comparison with Zero-shot CoT
Chinese Idiom: masking the first character of a four-character Chinese idiom, and letting the LLM infer the masked character.
49
[1] How Language Model Hallucinations Can Snowball (https://arxiv.org/abs/2305.13534)
Comparison with Zero-shot CoT
StereoSet: assesses the stereotypical biases present in LLMs with respect to gender, race, profession, and religion.
50
Comparison with Zero-shot CoT
StereoSet: assesses the stereotypical biases present in LLMs with respect to gender, race, profession, and religion.
51
Q: My swim coach is Muslim. Which of these is more accurate?
Stereo: He fasted in late August.
Anti-stereo: He likes rap music and pork.
Unrelated: Druids lived in what is now Ireland.
Comparison with Zero-shot CoT
StereoSet
52
[2] found, on other language models, that zero-shot CoT may result in undesired reasoning towards bias and toxicity.
[2] On second thought, let’s not think step by step! bias and toxicity in zero-shot reasoning. (https://arxiv.org/abs/2212.08061)
Question quality comes before reasoning
Coin Flip: A coin is heads up. aluino flips the coin. arthor flips the coin. Is the coin still heads up?
53
Question quality comes before reasoning
Coin Flip: A coin is heads up. aluino flips the coin. arthor flips the coin. Is the coin still heads up?
54
Question quality comes before reasoning
Coin Flip: A coin is heads up. aluino flips the coin. arthor flips the coin. Is the coin still heads up?
55
Question quality comes before reasoning
Coin Flip: A coin is heads up. aluino flips the coin. arthor flips the coin. Is the coin still heads up?
56
Experiment results
57
Improvement on Few-Shot CoT
Few-shot CoT has been the most effective CoT technique.
58
Improvement on Few-Shot CoT
Few-shot CoT has been the most effective CoT technique.
59
How do LLMs respond when the human-crafted examples are flawed or contain errors?
Improvement on Few-Shot CoT
60
Improvement on Few-Shot CoT
61
The LLM tends to stick to the logic of our modified prompt, resulting in an arbitrary final answer.
Improvement on Few-Shot CoT
62
RaR enables the LLM to correct pitfalls in the logic of the given examples.
Improvement on Few-Shot CoT
63
RaR enables the LLM to correct pitfalls in the logic of the given examples.
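One plausible way to apply RaR on top of few-shot CoT is to keep the human-crafted exemplars untouched and append the RaR instruction after the test question, so the model can rephrase the question (and repair flawed exemplar logic) before answering. Whether this matches the exact setup in the paper is an assumption; ask_llm and few_shot_cot_with_rar are hypothetical names.
```python
# RaR on top of few-shot CoT (sketch): the few-shot exemplars are left as given;
# the RaR instruction is appended after the test question.

def ask_llm(prompt: str) -> str:
    """Placeholder for an LLM call."""
    raise NotImplementedError

RAR_INSTRUCTION = "Rephrase and expand the question, and respond."

def few_shot_cot_with_rar(exemplars: list[str], question: str) -> str:
    # `exemplars` are human-crafted (question, step-by-step rationale, answer) blocks.
    prompt = "\n\n".join(exemplars) + f"\n\nQ: {question}\n{RAR_INSTRUCTION}\nA:"
    return ask_llm(prompt)
```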
Conclusion
In summary, our contributions are:
64
Questions?
Yihe Deng, Weitong Zhang, Zixiang Chen, Quanquan Gu
65
Rephrase and Respond: Let Large Language Models Ask Better Questions for Themselves
Please check our project page for more details.
Thank you!