HCAST: can they do stuff yet?

12.05.2025

Gavin Leech, Arb Research

1 Introduction

Read the following talk as red-teaming:

Can Gavin manage to not believe this if he tries really hard?


OK, AI is improving. But how quick? On real tasks?


1st serious measure of progress towards AGI

  • ARC-AGI (2019): nah. Boring, static, public, ~3-minute tasks, no credible human baseline, procedural, isolated
  • HCAST:
    • Hard endpoint: actual cognitively-loaded professional tasks
    • Actual human baselines
    • New tasks specially commissioned
    • Vetted contractors, great pay


How hard is stuff?

  • difficulty(task) ≈ the time it takes a human expert to solve it

  • A 1-hour task is harder than a 1-minute task.

  • Universal: any task can be placed on the “how long does it take?” scale, as in the sketch below.
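A minimal sketch of that scale, with invented task times (not HCAST data): any task, from any domain, gets a single coordinate.

```python
# The "universal difficulty scale": rank any task by the time a human
# expert needs. All numbers here are invented for illustration.
tasks_hours = {
    "fix a typo in a README": 1 / 60,
    "write a small CLI parser": 0.5,
    "debug a race condition": 4.0,
    "stand up a training pipeline": 16.0,
}
for name, hours in sorted(tasks_hours.items(), key=lambda kv: kv[1]):
    print(f"{hours:6.2f} h  {name}")
```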


“Persnickety” (accurate) version

  • Their metric is the human-time of tasks at which the average AI success rate is 50%, given arbitrary amounts of AI-time
  • (jagged!)

  • “Doubling every 2-12 months on automatically-scoreable, relatively clean + greenfield software tasks from a few distributions”
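The statistic can be sketched as a logistic fit: model success probability against log2(human completion time) and read off where the curve crosses 50%. A toy reconstruction with made-up data, not METR's actual pipeline:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up results: per-task human completion time and agent success (0/1).
human_minutes = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480])
agent_success = np.array([1, 1, 1, 1, 1, 1, 0, 1, 0, 0])

# P(success) as a logistic function of log2(human time).
X = np.log2(human_minutes).reshape(-1, 1)
fit = LogisticRegression().fit(X, agent_success)

# The curve crosses p = 0.5 where the linear predictor is zero.
b0, b1 = fit.intercept_[0], fit.coef_[0, 0]
print(f"50% time horizon ≈ {2 ** (-b0 / b1):.0f} human-minutes")
```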


Their prediction

“If the measured trend from the past 6 years continues for 2-4 more years, generalist autonomous agents will be capable of performing a wide range of week-long tasks.”

  • Estimated doubling time = 7 months ± 5

  • Supposed current horizon = 1-2 hours
  • Feb 2027 prediction = 16 hours
  • Apr 2028 prediction = 5 days
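The arithmetic behind those dates is compound doubling. A back-of-envelope sketch; the 1.5-hour starting horizon and the March 2025 anchor are my assumptions, and the outputs are sensitive to them (that sensitivity is what the ±5 months buys):

```python
from datetime import date

H0_HOURS = 1.5          # assumed current horizon, hours (my assumption)
DOUBLING_MONTHS = 7.0   # their central doubling-time estimate
T0 = date(2025, 3, 1)   # assumed anchor date (my assumption)

def horizon_hours(target: date) -> float:
    months = (target.year - T0.year) * 12 + (target.month - T0.month)
    return H0_HOURS * 2 ** (months / DOUBLING_MONTHS)

for target in (date(2027, 2, 1), date(2028, 4, 1)):
    h = horizon_hours(target)
    print(f"{target}: ~{h:.0f} h (~{h / 8:.1f} eight-hour days)")
```

Feb 2027 lands near their 16-hour figure; the Apr 2028 number moves by days depending on the assumed anchor.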

Fractional AGI?: t-AGI


Details

  • ~$200k bounties
  • $100k baselining infra
  • $150k baselining tasks
  • ~$200k absurd API costs (or maybe free for them?)
  • ~$500k extremely annoying labour (design, recruitment, herding cats, etc.)

  • n = 189
  • SOFTWARE CODING TASKS ONLY
  • Human completion times range from 1 second to 16 hours.


  • Mean task “messiness” in HCAST is 3/16.
  • Task horizon is human-time:
    • not counting lines of code
    • not counting how long the AI agent takes to do the task
  • For the AI agents, what counts is whether they can do it autonomously >50% of the time.


One-third leakable


Reasoning trend break?



So is this it? 2027 is it?

  • Well, it beats the arse off naive AR-loss scaling “laws” (not to mention Aschenbrenner’s fake extrapolation of them)
  • But no:

1) HCAST predicts Sep 2030 for a ~3-month-AGI

2) I don’t buy it


Rebuttal #1: External validity of coding

  • Core mystery of AI training: does coding improve noncoding skill?

  • METR thesis: “coding is transformative”
    • if an AI can complete one-month tasks from a similar distribution, it could be transformative for society, by automating software engineering and research
  • METR thesis: disregard absolute times; the doubling time is consistent across tasks



Messiness factors

  • Real-life source
  • Resource limited
  • Not easily resettable
  • Irreversible mistake availability
  • Dynamic environment
  • Difficult counterfactuals
  • Not purely automatic scoring
  • Implicit generalizability required
  • Novel situation
  • Nonexplicit scoring description
  • Is suboptimal behavior exploited
  • No provided verification mechanisms
  • Real-time coordination
  • Self-modification required
  • Self-improvement required
  • Information-seeking required


Rebuttal #2: Just watch AIs work

  • Again, the HCAST metric is not “AI does in 1 hour what a human does in 1 hour”. In practice AIs take forever, floundering.
  • o3’s ARC “solution” involved 40 million tokens per task.
  • Gemini took 888 hours to minimally beat Pokemon, with a dozen intense human hand-holds like tile labelling.
  • Rebuttal rebuttal: “ok fine, I retract specific numerical claims. But the trend is up, my dude, and not slow.”


Rebuttal #3: AI too jagged for one scalar

  • The HCAST metric is ~just code! And no figure title admits it.
  • AI struggles at some tasks that take humans <5 minutes (OSWorld),
  • yet is faster and better at some human-day tasks.

    • Task-specific. On chess, other AI is already a 20-year-AGI.
    • Needle-in-haystack retrieval: 99.7% accuracy on 1M tokens. A 12-hour-AGI.


#4: Reliability multiplies down

Classic contrived example:

  • Say the AI is R = 90% reliable on every mini-task.
  • But real tasks involve chaining n mini-tasks together, some of which are crucial (conjunctive).
  • So success is R^n (e.g. 0.9^20 ≈ 12%).
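The arithmetic, for the record:

```python
# Conjunctive chaining: per-step reliability r over n crucial steps,
# assumed independent. Whole-task success collapses geometrically.
r, n = 0.90, 20
print(f"P(whole task succeeds) = {r ** n:.1%}")  # -> 12.2%
```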


Reliability multiplies down

  • Is reliability R improving?
  • HCAST fans can say: yes! “We can infer this from the [code] task horizon expanding.”
  • I say “I don’t know”: the inference is shockingly confounded, as all measurements of generalisation are, by memorisation and semantic duplicates from a closed and unfathomably vast pretraining corpus.


Rebuttal rebuttal:

  • Well, current models have almost 100% success on [code] tasks taking humans <4 minutes.
  • Seems pretty reliable, and maybe chainable!

  • Weak r-r-rebuttal: simple tasks were conceivably in the training set?
  • It takes me 4 mins to get ffmpeg to do anything at all, for instance.


#5: Problems with the human time estimates

  • Noobs: the contractors weren’t familiar with the specific context of each task. They’re in the disoriented situation of a new hire on their first day of work.
  • Contractors took 5-18x longer to do the tasks than METR staff.

  • “Human baseliners allowed to drop tasks they weren’t making progress with, and mostly cut off at 8hrs; baselines average time taken for successful runs; thus biased down for difficult tasks” (toy simulation below).
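A toy simulation of that censoring bias, assuming (my assumption, not theirs) lognormal true completion times: averaging only the runs that finish inside the 8-hour cutoff understates how long the hard tasks really take.

```python
import random

random.seed(0)

# Assumed true completion times in hours (lognormal; parameters invented).
true_hours = [random.lognormvariate(1.5, 1.0) for _ in range(10_000)]

# Baseliners give up at 8 hours; only successful runs enter the average.
kept = [t for t in true_hours if t <= 8]

print(f"true mean:                 {sum(true_hours) / len(true_hours):.1f} h")
print(f"mean over successful runs: {sum(kept) / len(kept):.1f} h")
```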


#5: Problems with the human time estimates (cont.)

  • Incentive to overreport hours: “the number of hours my playtester reported. paid more the higher that number was”


#6: Product overhang exhausted?

  • September 2023: "Q*"
  • November 2023: Altman's coup
  • May 2024: Sutskever officially out
  • Jul 2024: "Strawberry"
  • Sep 2024: o1-preview

i.e. we’re seeing 18 months into the past, not translated but compressed into 4. GPT-4 was released ~9 months after pretraining.

Release overhang: the shocking tempo (o1→R1→o3) was a one-time artefact of dropping safety culture.


#7: Idea overhang exhausted?

  • More importantly, the coup double-decimated their pipeline.
  • Sutskever, Schulman, Zoph, Radford, Weng, ... all gone.
  • Fragmentation means the HCAST trend is too optimistic??

  • My prediction: the labs will slow. Anthropic didn’t get most of the OAI leavers.


#8: Speculation: auto-scoring enables sneakiness

  • We know that the “reasoning” models are sneakier.
  • This has embarrassed various people (the Sakana AI Scientist’s bullshit Torch code fooled auto-evaluation and cursory human inspection).
  • No specific signs of this in HCAST; I’m just paranoid.


#9: I refute it thus

  • A “1 hour” horizon @ 50%?
  • How long does booking a flight take you? 20 mins?

Do you fuck it up 50% of the time?

Does o3 work for booking flights? Does any scaffolded thing?

  • Even if they did get 50% success, this would be useless in most jobs! They do look at 80% and above, and these aren’t scaling as fast (sketch below).
  • Just look.
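Why the 80% horizon is mechanically shorter: in the logistic picture from earlier, you read the curve off at logit(0.8) ≈ 1.39 rather than at 0. Coefficients below are illustrative, not METR's:

```python
import math

# Logistic fit of P(success) vs log2(human minutes); slope b1 < 0 because
# success falls with task length. Coefficients are illustrative only.
b0, b1 = 5.0, -1.0

def horizon_minutes(p: float) -> float:
    # Invert p = sigmoid(b0 + b1 * log2(t)) for t.
    return 2 ** ((math.log(p / (1 - p)) - b0) / b1)

print(f"50% horizon: {horizon_minutes(0.5):.0f} min")  # 32 min
print(f"80% horizon: {horizon_minutes(0.8):.0f} min")  # ~12 min
```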


But Pokemon!

  • Claude 3.7 gets 3 badges (about 6 hours).

  • Completing Blue is about 26 hours, 106,000 in-game actions.


  • Bullshit. Substantial hard-coding and “scaffolds”.


Bottom line on HCAST

  • Very good work
  • The exponential fit is plausible
  • Oversold, unwarranted extrapolation. Code ~only!!!
  • Small n
  • Still not beating the leakage worry
  • Ignore the y-axis and the slope
    • or divide by like 6
  • But yes, the time horizon is improving
  • But yes, the model ranks are probably correct

  • They’re not stupid, but they are jagged



Bottom line on AGI 2027/8

  • “the content of AI 2027 was all but finalized before the METR report came out.”
  • Yet also: “We heavily draw from METR’s recent report which catalogues a trend of increasing time horizon.”


Tangent: How did they get 2022 models to test?

  • Maybe just METR magic backdoor access?
  • GPT-3 is dead.
  • GPT-3.5-T is a heavily changed checkpoint.
