Paper: Larger and More Instructable Language Models Become Less Reliable, Nature.

Authors: Lexin Zhou, Wout Schellaert, Fernando Martínez-Plumed, Yael Moros-Daval, Cèsar Ferri and José Hernández-Orallo

X-thread Summary: https://x.com/lexin_zhou/status/1838961179936293098

Description: This Doc presents dozens of anecdotal examples showing that several recent LLMs (which we could not cover in the paper) keep exhibiting unreliability issues such as difficulty discordance, lack of avoidance on challenging tasks beyond the model’s capacity, and prompt sensitivity. These models are o1-mini, o1-preview, Claude-3.5-Sonnet and LLaMA-3.1-405B-Instruct-Turbo; LLaMA was run with temperature=0, while the other LLMs were run on their official chat interfaces. Please note that the content of some interactions is cut because the sheer amount of output would hinder readability; in such cases, we usually include only the prompt and the final answer.
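For reference, a minimal sketch of how a temperature=0 query to LLaMA-3.1-405B-Instruct-Turbo might look, assuming an OpenAI-compatible serving endpoint (the base_url and model identifier below are illustrative assumptions, not necessarily the exact setup used):

    # Minimal sketch, assuming an OpenAI-compatible endpoint serving LLaMA-3.1-405B-Instruct-Turbo.
    # The base_url and model identifier are illustrative assumptions.
    from openai import OpenAI

    client = OpenAI(base_url="https://api.together.xyz/v1", api_key="YOUR_API_KEY")

    response = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo",
        messages=[{"role": "user", "content": "Unscramble 'myyum' into an English word."}],
        temperature=0,  # greedy decoding, matching the LLaMA runs in this document
    )
    print(response.choices[0].message.content)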

TL;DR. Difficulty discordance, lack of proper avoidance and prompt sensitivity persist in these additional LLMs. Surprisingly, the o1 models can often take >100 sec (very costly!) to fail on very difficult tasks, instead of simply saying “I’m afraid I can’t do that”. Below we show evidence of difficulty discordance, lack of task avoidance and prompt sensitivity.

Difficulty Discordance

Here we show two pairs of examples for each LLM. In each pair, the first instance shows the LLM solving a very challenging task, and the second instance (from the same domain) shows the same LLM giving an incorrect response to a much simpler one, illustrating difficulty discordance across multiple domains for all of these LLMs. For example, o1-preview recognises that “tnelcccerneiumleoes” can be unscrambled to form the word “electroluminescence” but gives an erroneous response for the anagram “myyum” (o1-preview outputs “mummy” instead of the target word, “yummy”). More shocking results like this: ⬇️
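As an aside, anagram answers like these are easy to verify mechanically: a valid unscrambling must use exactly the same multiset of letters as the scrambled string. A minimal sketch of such a check (the helper name is ours, purely illustrative):

    from collections import Counter

    def is_valid_unscrambling(scrambled: str, candidate: str) -> bool:
        """True iff the candidate uses exactly the same letters as the scrambled string."""
        return Counter(scrambled.lower()) == Counter(candidate.lower())

    print(is_valid_unscrambling("myyum", "mummy"))   # False: o1-preview's answer has three m's
    print(is_valid_unscrambling("myyum", "yummy"))   # True: the target word
    print(is_valid_unscrambling("tnelcccerneiumleoes", "electroluminescence"))  # True: the hard instance solved above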

o1-preview:

  • A pair of examples: solving a very complicated anagram task (top) but erring at a much simpler one (bottom):

# Actual solution: yummy

  • A pair of examples: solving a very complicated science task (top) but erring at a much simpler one (bottom):

 # Actual solution: A

o1-mini:

  • A pair of examples: solving a very complicated transform task (top) but erring at a much simpler one (bottom):

# Actual solution: 17-07-2004

  • A pair of examples: solving a very complicated locality task (top) but erring at a much simpler one (bottom):

# Actual solution: Oslo

Claude-3.5-Sonnet:

  • A pair of examples: solving a very complicated science task (top) but erring at a much simpler one (bottom):

# Actual solution: A

  • A pair of examples: solving a very complicated locality task (top) but erring at a much simpler one (bottom):

# Actual solution: Munich

LLaMA-3.1-405B-Instruct-Turbo:

  • A pair of examples: solving a very complicated addition task (top) but erring at a much simpler one (bottom):

# Actual solution: like the 2nd digit but ending with 214

  • A pair of examples: solving a very complicated anagram task (top) but erring at a much simpler one (bottom):

# Actual solution: yummy

Lack of Task Avoidance

Here we demonstrate that these four additional LLMs remain overconfident, outputting responses to very challenging[1], randomly extracted tasks from multiple domains that they cannot solve. For example, o1-mini and o1-preview usually spend from 50 to more than 140 seconds thinking about many challenging tasks but end up erring on all of them, instead of simply avoiding the task by saying “I am unable to solve this”.
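As a rough illustration of what counts as avoidance here: a response is avoidant when it declines the task rather than attempting an answer. The keyword heuristic below is only an illustrative approximation of that idea, not the grading procedure used in the paper:

    import re

    # Illustrative, non-exhaustive markers of avoidance.
    AVOIDANCE_PATTERNS = [
        r"\bI(?:'m| am) (?:unable|not able) to\b",
        r"\bI can(?:not|'t) (?:solve|answer|do)\b",
        r"\bI don't know\b",
        r"\bI'm afraid I can't\b",
    ]

    def looks_avoidant(response: str) -> bool:
        """True if the response appears to decline the task rather than attempt it."""
        return any(re.search(p, response, flags=re.IGNORECASE) for p in AVOIDANCE_PATTERNS)

    print(looks_avoidant("I am unable to solve this."))   # True: proper avoidance
    print(looks_avoidant("The answer is 934,217."))       # False: an attempt (possibly wrong)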

We start with examples for o1-mini and o1-preview:

  • A clearly erroneous response from o1-mini that required 103 sec of internal thinking:

[…]

  • Another erroneous case from o1-mini that required >53 sec:

[...]

  • An erroneous case from o1-preview on the same 1st example above (>55 sec):

[...]

  • Another erroneous case from o1-preview on the same 2nd example above (>102 sec):

[...]

  • An example of o1-mini erring on a challenging anagram (actual solution: entrepreneurialism):

  • An example of o1-mini erring on a challenging locality task (actual solution: Shiprock):

  • An example of o1-preview erring on a challenging science task (actual solution: A):

  • An example of o1-preview erring on a challenging transform task:

The same finding holds for Claude-3.5-Sonnet and LLaMA-3.1-405B-Instruct-Turbo:

  • An example of Claude-3.5-Sonnet erring on a challenging addition task:

  • An example of Claude-3.5-Sonnet erring on a challenging locality task (actual solution: Shiprock):

  • An example of Claude-3.5-Sonnet erring on a challenging science task (actual solution: A):

[...]

  • An example of LLaMA-3.1-405B-Instruct-Turbo erring on a challenging addition task:

[...]

  • An example of LLaMA-3.1-405B-Instruct-Turbo erring on a challenging anagram task (actual solution: compartmentalisation):

  • An example of LLaMA-3.1-405B-Instruct-Turbo erring on a challenging transform task:

Prompt Sensitivity

Here we show some examples illustrating prompt sensitivity in the four LLMs that we analyse in this document (a minimal probing sketch is included after the list below). Each example starts with a prompt template that got the correct answer, followed by another prompt template asking the same question (with a different prompt formulation) that got an incorrect answer instead.

  • An anagram example from o1-mini:

  • A science example from o1-mini:

  • A locality example from o1-preview:

  • A transform example from Claude-3.5-Sonnet:

  • An addition example from LLaMA-3.1-405B-Instruct-Turbo:
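As mentioned above, a minimal sketch of how this kind of prompt sensitivity can be probed: ask the same underlying question under different prompt templates and compare the answers (the templates and the ask_model helper below are illustrative assumptions, not the exact templates from our benchmarks):

    # Sketch: probe prompt sensitivity by asking the same question under two templates.
    # `ask_model` is a hypothetical callable wrapping whichever chat API is being tested.
    TEMPLATES = [
        "Unscramble the following letters to form an English word: {letters}",
        "{letters}\nRearrange these letters into a valid English word. Reply with the word only.",
    ]

    def probe_prompt_sensitivity(ask_model, letters: str, target: str) -> list:
        """Return, for each template, whether the model's answer matches the target word."""
        results = []
        for template in TEMPLATES:
            answer = ask_model(template.format(letters=letters))
            results.append(answer.strip().lower() == target.lower())
        return results

    # Toy usage with a stubbed model: sensitivity shows up as a mix of True and False
    # for what is semantically the same question.
    fake_model = lambda prompt: "yummy" if "Unscramble" in prompt else "mummy"
    print(probe_prompt_sensitivity(fake_model, "myyum", "yummy"))  # [True, False]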


[1] In this particular document, we use the top 1% most difficult tasks in a given domain of our analysis.
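For concreteness, selecting such a difficulty tail amounts to keeping the items whose difficulty score falls at or above the 99th percentile within a domain; the snippet below is an illustrative sketch of that selection (the difficulty scores themselves come from the paper's difficulty proxies and are not computed here):

    import numpy as np

    def top_one_percent_hardest(items, difficulty_scores):
        """Keep items whose difficulty score is in the top 1% for the domain (a sketch)."""
        scores = np.asarray(difficulty_scores, dtype=float)
        threshold = np.quantile(scores, 0.99)
        return [item for item, score in zip(items, scores) if score >= threshold]

    # Toy usage with made-up difficulty scores:
    items = [f"task_{i}" for i in range(1000)]
    scores = np.random.default_rng(0).random(1000)
    print(len(top_one_percent_hardest(items, scores)))  # roughly 10 out of 1000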