Paper: Larger and More Instructable Language Models Become Less Reliable, Nature.
Authors: Lexin Zhou, Wout Schellaert, Fernando Martínez-Plumed, Yael Moros-Daval, Cèsar Ferri and José Hernández-Orallo
X-thread Summary: https://x.com/lexin_zhou/status/1838961179936293098
Description: This doc presents dozens of anecdotal examples demonstrating that several recent LLMs (which we could not cover in the paper) keep exhibiting unreliability issues such as difficulty discordance, lack of avoidance on challenging tasks beyond the model’s capacity, and prompt sensitivity. These models are o1-mini, o1-preview, Claude-3.5-Sonnet and LLaMA-3.1-405B-Instruct-Turbo; LLaMA was run with temperature=0, while the other LLMs were run on their official chat interfaces. Please note that the content of some interactions is cut because the massive amount of output would hinder readability; in such cases, we usually include only the prompt and the final answer.
TL;DR. Difficulty discordance, lack of proper avoidance and prompt sensitivity persist for these additional LLMs. Surprisingly, the o1 models often take >100 sec (very costly!) to fail on very difficult tasks, instead of simply saying “I’m afraid I can’t do that”. Below we show evidence for difficulty discordance, lack of avoidance and prompt sensitivity.
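As a reference point for reproducing the LLaMA runs, the sketch below queries LLaMA-3.1-405B-Instruct-Turbo at temperature=0 through an OpenAI-compatible chat-completions endpoint. This is a minimal sketch: the endpoint URL, credential variable and model identifier are illustrative assumptions, not details taken from the paper or this doc.

```python
# Minimal sketch (assumptions, not the paper's setup): query
# LLaMA-3.1-405B-Instruct-Turbo at temperature=0 via an
# OpenAI-compatible chat-completions endpoint.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.together.xyz/v1",   # assumed provider endpoint
    api_key=os.environ["TOGETHER_API_KEY"],   # assumed credential variable
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo",  # assumed model ID
    messages=[{"role": "user",
               "content": "Unscramble the letters 'myyum' into an English word."}],
    temperature=0,  # greedy decoding, as used for LLaMA in this doc
)
print(response.choices[0].message.content)
```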
Here, we show two pairs of examples for each LLM. Each pair first includes an instance where the LLM solves a very challenging task, followed by another instance (from the same domain) where the same LLM produces an incorrect response, illustrating the difficulty discordance of all these LLMs across multiple domains. For example, o1-preview recognises that “tnelcccerneiumleoes” can be unscrambled to form the word “electroluminescence”, but fails on the far simpler anagram “myyum” (o1-preview outputs “mummy” instead of the target word, “yummy”). A mechanical check for such anagram answers is sketched after the examples. More shocking results like this: ⬇️
o1-preview:
# Actual solution: Yummy
# Actual solution: A
o1-mini:
# Actual solution: 17-07-2004
# Actual solution: Oslo
Claude-3.5-Sonnet:
# Actual solution: A
# Actual solution: Munich
LLaMA-3.1-405B-Instruct-Turbo:
# Actual solution: like the 2nd digit but ending with 214
# Actual solution: yummy
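A quick way to check such anagram answers mechanically: a candidate is a valid unscrambling only if it uses exactly the same multiset of letters as the scrambled string. The sketch below (ours, not from the paper) shows that o1-preview’s answer “mummy” fails this check while the target “yummy” passes.

```python
# Sketch (ours, not from the paper): verify whether a candidate word is a
# letter-for-letter rearrangement of a scrambled string.
from collections import Counter

def is_anagram(scrambled: str, candidate: str) -> bool:
    """True iff candidate uses exactly the same multiset of letters."""
    return Counter(scrambled.lower()) == Counter(candidate.lower())

print(is_anagram("tnelcccerneiumleoes", "electroluminescence"))  # True
print(is_anagram("myyum", "yummy"))  # True  (the target word)
print(is_anagram("myyum", "mummy"))  # False ('mummy' needs three m's; 'myyum' has two)
```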
Here we demonstrate that these four additional LLMs remain overconfident, outputting responses to very challenging[1] randomly extracted tasks from multiple domains that they cannot solve. For example, o1-mini and o1-preview usually spend from 50 to more than 140 seconds thinking about many challenging tasks, yet ultimately err on all of them instead of simply avoiding the task by saying “I am unable to solve this”. A crude heuristic for flagging avoidant responses is sketched after the examples below.
We start with examples for o1-mini and o1-preview:
[...]
[...]
[...]
[...]
The same finding holds for Claude-3.5-Sonnet and LLaMA-3.1-405B-Instruct-Turbo:
[...]
[...]
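For readers who want to quantify this behaviour, the sketch below gives a crude phrase-matching heuristic (ours, not the paper’s grading protocol) for flagging whether a response attempts an answer or avoids the task. The marker list is an illustrative assumption; robust grading would require human review.

```python
# Crude heuristic sketch (ours, not the paper's grading protocol): flag a
# response as "avoidant" if it contains a refusal/hedging phrase; otherwise
# treat it as an attempted answer.
AVOIDANCE_MARKERS = (
    "i am unable to solve",
    "i'm afraid i can't",
    "i cannot determine",
    "i don't know",
)

def is_avoidant(response: str) -> bool:
    text = response.lower()
    return any(marker in text for marker in AVOIDANCE_MARKERS)

print(is_avoidant("I am unable to solve this."))  # True  -> avoidance
print(is_avoidant("The answer is mummy."))        # False -> attempted (and wrong)
```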
Here we show some examples that demonstrate prompt sensitivity in the four LLMs analysed in this document. Each example below starts with a prompt template that elicited the correct answer, followed by another prompt template asking the same question (with a different formulation) that elicited an incorrect answer instead.
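This kind of prompt sensitivity can be probed programmatically: ask the same underlying question under several paraphrased templates and compare the answers. In the sketch below, `ask` is a hypothetical helper standing in for whichever query mechanism is used (API call or chat interface), and the templates are illustrative assumptions.

```python
# Sketch (ours): probe prompt sensitivity by asking the same question under
# several paraphrased templates. `ask` is a hypothetical helper standing in
# for whichever query mechanism is used (API call or chat interface).
TEMPLATES = [
    "Unscramble the letters '{letters}' to form an English word.",
    "Which English word can be formed by rearranging '{letters}'?",
    "'{letters}' is an anagram of which word?",
]

def probe(ask, letters: str) -> dict[str, str]:
    """Return the model's answer under each template; divergent answers
    across templates indicate prompt sensitivity."""
    return {t: ask(t.format(letters=letters)) for t in TEMPLATES}
```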
[1] In this particular document, we use the top 1% most difficult tasks in each domain of our analysis.