1 of 65

Presenter: Yihe Deng

Rephrase and Respond: Let Large Language Models Ask Better Questions for Themselves

1

Yihe Deng, Weitong Zhang, Zixiang Chen, Quanquan Gu

Department of Computer Science

University of California, Los Angeles

2 of 65

Motivating Example

Question quality critically influences the response quality of LLMs.

Do people know if a question is clear enough for an LLM?

2

3 of 65

Motivating Example

3

[1] Physics of language models: Part 3.2, knowledge manipulation. (https://arxiv.org/abs/2309.14402)

4 of 65

Motivating Example

4

5 of 65

Motivating Example

5

6 of 65

Motivating Example

6

7 of 65

Motivating Example

Question quality critically influences the response quality of LLMs.

Do people know if a question is clear enough for an LLM?

7

8 of 65

Motivating Example

Question quality critically influences the response quality of LLMs.

Do people know if a question is clear enough for an LLM?

  • Not really. Misunderstandings persist in the communication between humans and LLMs.

  • We need better questions, but how?

8

9 of 65

Motivating Example

Question quality critically influences the response quality of LLMs.

Do people know if a question is clear enough for an LLM?

  • Not really. Misunderstandings persist in the communication between humans and LLMs.

  • We need better questions, but how?

9

Rephrase and Respond (RaR):

Let the LLM ask better questions for itself.

10 of 65

Motivating Example

10

11 of 65

Motivating Example

11

12 of 65

One-step RaR

12

Rephrase and Respond in a Single Prompt

13 of 65

One-step RaR

13

Rephrase and Respond in a Single Prompt
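Below is a minimal sketch of how One-step RaR can be issued in practice. It assumes an OpenAI-style chat API via the openai Python package; the model name and the exact instruction wording are illustrative rather than the paper's verbatim prompt.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def one_step_rar(question: str, model: str = "gpt-4") -> str:
    # One-step RaR: a single prompt asks the model to rephrase and expand
    # the question, then answer the improved question in the same turn.
    prompt = f'"{question}"\nRephrase and expand the question, and respond.'
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Example: print(one_step_rar("Was Abraham Lincoln born on an even day?"))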

14 of 65

Two-step RaR

14

Rephrase the Question and Respond to the Rephrased Question

15 of 65

Two-step RaR

15

Rephrase the Question and Respond to the Rephrased Question
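A corresponding sketch of Two-step RaR, under the same assumptions as the One-step sketch above (openai package, illustrative prompt wording): the rephrasing prompt and the responding prompt are issued as two separate calls, and the responding model sees both the original and the rephrased question.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask(prompt: str, model: str = "gpt-4") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def two_step_rar(question: str,
                 rephrasing_model: str = "gpt-4",
                 responding_model: str = "gpt-4") -> str:
    # Step 1: ask the rephrasing LLM to produce a clearer, expanded question.
    rephrased = ask(
        f'"{question}"\n'
        "Given the above question, rephrase and expand it to help you do better answering. "
        "Maintain all information in the original question.",
        model=rephrasing_model,
    )
    # Step 2: answer using both the original and the rephrased question.
    # Using different models here reproduces the transfer setting
    # (e.g., rephrase with GPT-4, respond with Vicuna).
    return ask(
        f"(original) {question}\n(rephrased) {rephrased}\n"
        "Use your answer for the rephrased question to answer the original question.",
        model=responding_model,
    )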

16 of 65

RaR – Our Contribution

  • Motivation

We investigated the existing misunderstandings between humans and LLMs: questions that appear clear to humans may still be misinterpreted by LLMs.

  • Goal

Compared to works that use LLMs to generate questions for training or fine-tuning, we aim to let the LLM rephrase and respond to improve its understanding and answer quality.

  • Methodology

Unlike methods that employ multiple LLMs for iterative prompt engineering based on accuracy scores, RaR is unsupervised and training-free, making it economical and applicable to all questions.

Differences between our work and previous works.

16

17 of 65

Experiment results

  1. Rephrased Questions Effectively Improve LLM Responses.

     • Main results: performance on GPT-4.

     • Investigation: performance across various LLMs.

     • Investigation: Will multiple rephrasings lead to the same clarification?

  2. Discussions on Chain-of-Thought (CoT).

     • Mathematical formulation: RaR and CoT.

     • Comparison with zero-shot CoT.

     • Improvement on few-shot CoT.

17

18 of 65

Experiment results

  1. Rephrased Questions Effectively Improve LLM Responses.

     • Main results: performance on GPT-4.

     • Investigation: performance across various LLMs.

     • Investigation: Will multiple rephrasings lead to the same clarification?

  2. Discussions on Chain-of-Thought (CoT).

     • Mathematical formulation: RaR and CoT.

     • Comparison with zero-shot CoT.

     • Improvement on few-shot CoT.

18

19 of 65

Benchmark Tasks

19

20 of 65

Main Results on GPT-4

Takeaway #1: (One-step) RaR provides a universal, plug-and-play zero-shot prompt that allows for efficient and effective performance improvement of LLMs on general tasks.

RaR: A Simple Prompt to Improve LLM Performance

20

21 of 65

Main Results on GPT-4

Takeaway #2: Examining question quality is pivotal when evaluating LLM performance on QA tasks.

Takeaway #3: Two-step RaR provides a universal method for LLMs to improve the question quality autonomously by rephrasing the question.

Two-step RaR: Rephrased Questions Improve Response Quality

21

22 of 65

Experiment results

  1. Rephrased Questions Effectively Improve LLM Responses.

     • Main results: performance on GPT-4.

     • Investigation: performance across various LLMs.

     • Investigation: Will multiple rephrasings lead to the same clarification?

  2. Discussions on Chain-of-Thought (CoT).

     • Mathematical formulation: RaR and CoT.

     • Comparison with zero-shot CoT.

     • Improvement on few-shot CoT.

22

23 of 65

Performance across Various LLMs

Can All LLMs Rephrase Questions?

23

24 of 65

Performance across Various LLMs

Can All LLMs Rephrase Questions?

24

- More advanced models (GPT-4) gain the most significant improvements across all tasks.

- Models of lesser complexity (Vicuna) achieve only modest improvements.

25 of 65

Performance across Various LLMs

Can All LLMs Rephrase Questions?

25

- More advanced models (GPT-4) gain the most significant improvements across all tasks.

- Models of lesser complexity (Vicuna) achieve only modest improvements.

Takeaway #4

All models can benefit from rephrasing questions, with more advanced models expected to gain a larger improvement.

26 of 65

Performance across Various LLMs

26

27 of 65

Performance across Various LLMs

27

28 of 65

Performance across Various LLMs

28

29 of 65

Performance across Various LLMs

29

30 of 65

Performance across Various LLMs

GPT-4 can rephrase better questions for Vicuna.

  • We observe that GPT-4’s rephrased questions markedly enhance Vicuna-13b-v1.5's performance on several tasks, especially where Vicuna's self-rephrased questions exhibited low quality.

Are the Rephrased Questions Transferable?

30

31 of 65

Performance across Various LLMs

GPT-4 can rephrase better questions for Vicuna.

  • We observe that GPT-4’s rephrased questions markedly enhance Vicuna-13b-v1.5's performance on several tasks, especially where Vicuna's self-rephrased questions exhibited low quality.

Are the Rephrased Questions Transferable?

31

Takeaway #5

The rephrased questions are transferable: the questions rephrased by GPT-4 can improve the response quality on Vicuna.

32 of 65

Experiment results

  1. Rephrased Questions Effectively Improve LLM Responses.

     • Main results: performance on GPT-4.

     • Investigation 1: performance across various LLMs.

     • Investigation 2: Will multiple rephrasings lead to the same clarification?

  2. Discussions on Chain-of-Thought (CoT).

     • Mathematical formulation: RaR and CoT.

     • Comparison with zero-shot CoT.

     • Improvement on few-shot CoT.

32

33 of 65

Multiple Rephrasings

We consider “Was Abraham Lincoln born on an even day?” as an example question and use it for three successive self-rephrasings by GPT-4 across distinct runs.

  • The key clarification that needs to be made here is on the concept of “even day”.

Will multiple rephrasings lead to the same clarification?

33

“Day of the month” is clarified in the first rephrasing and persists in later ones.

34 of 65

Multiple Rephrasings

We consider “Was Abraham Lincoln born on an even day?” as an example question and use it for three successive self-rephrasings by GPT-4 across distinct runs.

  • The key clarification that needs to be made here is on the concept of “even day”.

Will multiple rephrasings lead to the same clarification?

34

“Day of the month” is not clarified in the first rephrasing, but is eventually clarified in the third attempt.

35 of 65

Multiple Rephrasings

We consider “Was Abraham Lincoln born on an even day?” as an example question and use it for three successive self-rephrasings by GPT-4 across distinct runs.

  • The key clarification that needs to be made here is on the concept of “even day”.

Will multiple rephrasings lead to the same clarification?

35

“Day of the month” is not clarified in the first rephrasing, but is eventually clarified in the third attempt.

Takeaway #6

GPT-4 can potentially clarify concepts through multiple rephrasings, even if it fails to do so in the initial attempt.

36 of 65

Experiment results

  1. Rephrased Questions Effectively Improve LLM Responses.

     • Main results: performance on GPT-4.

     • Investigation: performance across various LLMs.

     • Investigation: Will multiple rephrasings lead to the same clarification?

  2. Discussions on Chain-of-Thought (CoT).

     • Mathematical formulation: RaR and CoT.

     • Comparison with zero-shot CoT.

     • Improvement on few-shot CoT.

36

37 of 65

Mathematical Formulation

37

38 of 65

Mathematical Formulation

Chain-of-Thought (CoT)

38

(4.1)

39 of 65

Mathematical Formulation

Chain-of-Thought (CoT)

39

40 of 65

Mathematical Formulation

Chain-of-Thought (CoT)

40

41 of 65

Mathematical Formulation

One-step RaR

41

(4.2)

42 of 65

Mathematical Formulation

Two-step RaR

42

43 of 65

Mathematical Formulation

RaR+CoT

43
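Since the equations themselves are not reproduced on these slides, the following is only a rough sketch, in notation of my own choosing (x: original question, y: answer, x̃: rephrased question), of how such prompting strategies are commonly formalized; it is not necessarily the paper's exact formulation.

% Illustrative notation only; not necessarily the paper's exact equations (4.1)-(4.2).
\begin{align*}
\text{Zero-shot CoT:}\quad & y \sim p_{\mathrm{LLM}}(\,\cdot \mid x,\ p_{\mathrm{CoT}}),
  && p_{\mathrm{CoT}} = \text{``Let's think step by step.''}\\
\text{One-step RaR:}\quad & y \sim p_{\mathrm{LLM}}(\,\cdot \mid x,\ p_{\mathrm{RaR}}),
  && p_{\mathrm{RaR}} = \text{``Rephrase and expand the question, and respond.''}\\
\text{Two-step RaR:}\quad & \tilde{x} \sim p_{\mathrm{LLM}}(\,\cdot \mid x,\ p_{\mathrm{rephrase}}),
  && y \sim p_{\mathrm{LLM}}(\,\cdot \mid x,\ \tilde{x})\\
\text{RaR + CoT:}\quad & y \sim p_{\mathrm{LLM}}(\,\cdot \mid x,\ p_{\mathrm{RaR}},\ p_{\mathrm{CoT}})
\end{align*}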

44 of 65

Experiment results

  1. Rephrased Questions Effectively Improve LLM Responses.

     • Main results: performance on GPT-4.

     • Investigation: performance across various LLMs.

     • Investigation: Will multiple rephrasings lead to the same clarification?

  2. Discussions on Chain-of-Thought (CoT).

     • Mathematical formulation: RaR and CoT.

     • Comparison with zero-shot CoT.

     • Improvement on few-shot CoT.

44

45 of 65

Comparison with Zero-shot CoT

Zero-shot CoT: appending “Let’s think step by step.” to the end of a query.

45

We highlight some examples where zero-shot CoT fails to deliver improvements, sometimes even leading to diminished performance.

  • In contrast, RaR consistently demonstrates effectiveness.

46 of 65

Comparison with Zero-shot CoT

Zero-shot CoT: appending “Let’s think step by step.” to the end of a query.

46

We highlight some examples where zero-shot CoT fails to deliver improvements, sometimes even leading to diminished performance.

  • In contrast, RaR consistently demonstrates effectiveness.

We also provide an example demonstrating the primacy of question quality.

47 of 65

Comparison with Zero-shot CoT

Zero-shot CoT: appending “Let’s think step by step.” to the end of a query.

47

We highlight some examples where zero-shot CoT fails to deliver improvements, sometimes even leading to diminished performance.

  • In contrast, RaR consistently demonstrates effectiveness.

We also provide an example demonstrating the primacy of question quality.

Lastly, we note that our method is orthogonal to zero-shot CoT and can be combined with it by simply adding “Let's think step by step” to our prompts.
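As an illustration of that combination, a single combined prompt (wording again illustrative, not the paper's verbatim template) might look like this:

def rar_plus_cot_prompt(question: str) -> str:
    # One-step RaR and zero-shot CoT are orthogonal: the rephrase-and-respond
    # instruction and the step-by-step trigger can simply be concatenated.
    return (
        f'"{question}"\n'
        "Rephrase and expand the question, and respond. "
        "Let's think step by step."
    )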

48 of 65

Comparison with Zero-shot CoT

Chinese Idiom: masking the first character of a four-character Chinese idiom and letting the LLM infer the masked character.

48

  • Zero-shot CoT may result in worse performance on such difficult tasks, as the LLM tends to hallucinate during the intermediate steps.

  • A similar phenomenon has been described as hallucination snowballing in [1].

[1] How Language Model Hallucinations Can Snowball (https://arxiv.org/abs/2305.13534)

49 of 65

Comparison with Zero-shot CoT

Chinese Idiom: masking the first character of a four-character Chinese idiom and letting the LLM infer the masked character.

49

  • Zero-shot CoT may result in worse performance on such difficult tasks, as the LLM tends to hallucinate during the intermediate steps.
  • A similar phenomenon has been described as hallucination snowballing in [1].

[1] How Language Model Hallucinations Can Snowball (https://arxiv.org/abs/2305.13534)

50 of 65

Comparison with Zero-shot CoT

StereoSet: assesses the stereotypical biases present in LLMs with respect to gender, race, profession, and religion.

50

  • Each example contains a context sentence and three choices: one stereotypical, one anti-stereotypical, and one unrelated.
  • Language Modeling Score: whether the LLM selects related options over unrelated ones.
  • Fair Score: whether the LLM can determine that neither of the two related options can be concluded solely from the context sentence.

51 of 65

Comparison with Zero-shot CoT

StereoSet: assesses the stereotypical biases present in LLMs with respect to gender, race, profession, and religion.

51

  • Each example contains a context sentence and three choices: one stereotypical, one anti-stereotypical, and one unrelated.
  • Language Modeling Score: whether the LLM selects related options over unrelated ones.
  • Fair Score: whether the LLM can determine that neither of the two related options can be concluded solely from the context sentence.

Q: My swim coach is Muslim. Which of these is more accurate?

Stereo: He fasted in late August.

Anti-stereo: He likes rap music and pork.

Unrelated: Druids lived in what is now Ireland.
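As a rough illustration of how the two scores above could be tallied from recorded model choices, here is a sketch under my own simplified definitions; the official StereoSet metrics may differ in detail.

from typing import List, Tuple

def stereoset_scores(choices: List[str]) -> Tuple[float, float]:
    # Each entry is the option the LLM picked for one example:
    # "stereo", "anti-stereo", "unrelated", or "cannot-tell"
    # (i.e., neither related option follows from the context alone).
    n = len(choices)
    # Language Modeling Score: the LLM prefers a related option over the unrelated one.
    lm_score = 100.0 * sum(c != "unrelated" for c in choices) / n
    # Fair Score: the LLM recognizes that neither related option is entailed by the context.
    fair_score = 100.0 * sum(c == "cannot-tell" for c in choices) / n
    return lm_score, fair_score

# Example: stereoset_scores(["cannot-tell", "stereo", "unrelated"]) returns roughly (66.7, 33.3).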

52 of 65

Comparison with Zero-shot CoT

StereoSet

52

[2] discovered, on other language models, that zero-shot CoT may result in undesired reasoning toward bias and toxicity.

  • While zero-shot CoT fails to improve the Language Modeling Score, RaR improves it significantly to 97.73.

  • RaR also achieves the best performance on the Fair Score.

[2] On second thought, let’s not think step by step! bias and toxicity in zero-shot reasoning. (https://arxiv.org/abs/2212.08061)

53 of 65

Question quality comes before reasoning

Coin Flip: A coin is heads up. aluino flips the coin. arthor flips the coin. Is the coin still heads up?

53

54 of 65

Question quality comes before reasoning

Coin Flip: A coin is heads up. aluino flips the coin. arthor flips the coin. Is the coin still heads up?

54

55 of 65

Question quality comes before reasoning

Coin Flip: A coin is heads up. aluino flips the coin. arthor flips the coin. Is the coin still heads up?

55

56 of 65

Question quality comes before reasoning

Coin Flip: A coin is heads up. aluino flips the coin. arthor flips the coin. Is the coin still heads up?

56

57 of 65

Experiment results

  1. Rephrased Questions Effectively Improve LLM Responses.

     • Main results: performance on GPT-4.

     • Investigation: performance across various LLMs.

     • Investigation: Will multiple rephrasings lead to the same clarification?

  2. Discussions on Chain-of-Thought (CoT).

     • Mathematical formulation: RaR and CoT.

     • Comparison with zero-shot CoT.

     • Improvement on few-shot CoT.

57

58 of 65

Improvement on Few-Shot CoT

  • Employs a small set of human-crafted QA examples to facilitate LLMs in addressing similar questions with a congruent structure.

  • Providing question-answer pairs effectively communicates the human-desired logical structure to the LLM in solving similar questions.

Few-shot CoT has been the most effective CoT technique.

58

59 of 65

Improvement on Few-Shot CoT

  • Employs a small set of human-crafted QA examples to facilitate LLMs in addressing similar questions with a congruent structure.

  • Providing question-answer pairs effectively communicates the human-desired logical structure to the LLM in solving similar questions.

Few-shot CoT has been the most effective CoT technique.

59

How do LLMs respond when the human-crafted examples are flawed or contain errors?
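For concreteness, here is a minimal sketch of the few-shot CoT prompt structure described above; the coin-flip exemplar is a placeholder of my own, not one of the paper's actual demonstrations or the flawed variants it studies.

# A hypothetical human-crafted exemplar that communicates the desired reasoning structure.
FEW_SHOT_EXEMPLARS = (
    "Q: A coin is heads up. Alice flips the coin. Bob does not flip the coin. "
    "Is the coin still heads up?\n"
    "A: The coin was flipped by Alice only, so it was flipped 1 time, which is an odd number. "
    "The coin started heads up, so after an odd number of flips it is tails up. The answer is no.\n"
)

def few_shot_cot_prompt(question: str) -> str:
    # The new question is appended in the same Q/A format, so the LLM follows
    # the exemplar's logic; if the exemplar's logic is flawed, the LLM tends to
    # replicate the flaw unless the question is first rephrased (RaR).
    return f"{FEW_SHOT_EXEMPLARS}\nQ: {question}\nA:"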

60 of 65

Improvement on Few-Shot CoT

60

61 of 65

Improvement on Few-Shot CoT

61

The LLM tends to stick to the logic of our modified prompt, resulting in an arbitrary final answer.

62 of 65

Improvement on Few-Shot CoT

62

RaR enables the LLM to correct logical pitfalls in the given examples.

63 of 65

Improvement on Few-Shot CoT

63

RaR enables the LLM to correct logical pitfalls in the given examples.

64 of 65

Conclusion

In summary, our contributions are:

  • Our findings suggest the necessity of examining question quality before evaluating LLMs.

  • We proposed RaR and its variant Two-step RaR, highlighting the LLM's potential to autonomously rephrase questions for the better.

  • We presented detailed discussions of RaR and CoT methods, showing that RaR can provide improvements over CoT.

64

65 of 65

Questions?

Yihe Deng, Weitong Zhang, Zixiang Chen, Quanquan Gu

65

Rephrase and Respond: Let Large Language Models Ask Better Questions for Themselves

Please check our project page for more details.

Thank you!