HCAST: can they do stuff yet?

12.05.2025

Gavin Leech, Arb Research

1 Introduction

Read the following talk as red-teaming:

Can Gavin manage to not believe this if he tries really hard?


OK, AI is improving. But how quick? On real tasks?


1st serious measure of progress towards AGI

  • ARC-AGI (2019): nah. Boring, static, public, ~3-minute tasks, no credible human baseline, procedural, isolated
  • HCAST:
    • Hard endpoint: actual cognitively-loaded professional tasks
    • Actual human baselines
    • New tasks specially commissioned
    • Vetted contractors, great pay


How hard is stuff?

  • difficulty(task) ≈ the time it takes a human expert to solve it

  • A 1-hour task is harder than a 1-minute task.

  • Universal: any task can be placed on the “how long does it take?” scale, as in the sketch below.
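A minimal sketch of that scale, with invented task times (not HCAST data): any task, from any domain, gets a single coordinate.

```python
# The "universal difficulty scale": rank any task by the time a human
# expert needs. All numbers here are invented for illustration.
tasks_hours = {
    "fix a typo in a README": 1 / 60,
    "write a small CLI parser": 0.5,
    "debug a race condition": 4.0,
    "stand up a training pipeline": 16.0,
}
for name, hours in sorted(tasks_hours.items(), key=lambda kv: kv[1]):
    print(f"{hours:6.2f} h  {name}")
```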


“Persnickety” (accurate) version

  • Their metric is the human-time of tasks at which the average AI success rate is 50%, given arbitrary amounts of AI-time
  • (jagged!)

  • “Doubling every 2-12 months on automatically-scoreable, relatively clean + greenfield software tasks from a few distributions”
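The statistic can be sketched as a logistic fit: model success probability against log2(human completion time) and read off where the curve crosses 50%. A toy reconstruction with made-up data, not METR's actual pipeline:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up results: per-task human completion time and agent success (0/1).
human_minutes = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480])
agent_success = np.array([1, 1, 1, 1, 1, 1, 0, 1, 0, 0])

# P(success) as a logistic function of log2(human time).
X = np.log2(human_minutes).reshape(-1, 1)
fit = LogisticRegression().fit(X, agent_success)

# The curve crosses p = 0.5 where the linear predictor is zero.
b0, b1 = fit.intercept_[0], fit.coef_[0, 0]
print(f"50% time horizon ≈ {2 ** (-b0 / b1):.0f} human-minutes")
```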


Their prediction

“If the measured trend from the past 6 years continues for 2-4 more years, generalist autonomous agents will be capable of performing a wide range of week-long tasks.”

  • Estimated doubling time = 7 months ± 5

  • Supposed current horizon = 1-2 hours
  • Feb 2027 prediction = 16 hours
  • Apr 2028 prediction = 5 days
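The arithmetic behind those dates is compound doubling. A back-of-envelope sketch; the 1.5-hour starting horizon and the March 2025 anchor are my assumptions, and the outputs are sensitive to them (that sensitivity is what the ±5 months buys):

```python
from datetime import date

H0_HOURS = 1.5          # assumed current horizon, hours (my assumption)
DOUBLING_MONTHS = 7.0   # their central doubling-time estimate
T0 = date(2025, 3, 1)   # assumed anchor date (my assumption)

def horizon_hours(target: date) -> float:
    months = (target.year - T0.year) * 12 + (target.month - T0.month)
    return H0_HOURS * 2 ** (months / DOUBLING_MONTHS)

for target in (date(2027, 2, 1), date(2028, 4, 1)):
    h = horizon_hours(target)
    print(f"{target}: ~{h:.0f} h (~{h / 8:.1f} eight-hour days)")
```

Feb 2027 lands near their 16-hour figure; the Apr 2028 number moves by days depending on the assumed anchor.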

Fractional AGI?: t-AGI


Details

  • ~$200k bounties
  • $100k baselining infra
  • $150k baselining tasks
  • ~$200k absurd API costs (or maybe free for them?)
  • ~$500k extremely annoying labour (design, recruitment, herding cats, etc.)

  • n = 189
  • SOFTWARE CODING TASKS ONLY
  • Human completion times range from 1 second to 16 hours.


  • Mean task “messiness” in HCAST is 3/16.
  • Task horizon is human-time:
    • not counting lines of code
    • not counting how long the AI agent takes to do the task
  • For the AI agents, what counts is whether they can do it autonomously >50% of the time.


One-third leakable


Reasoning trend break?



So is this it? 2027 is it?

  • Well, it beats the arse off naive AR-loss scaling “laws” (not to mention Aschenbrenner’s fake extrapolation of them)
  • But no:

1) HCAST predicts Sep 2030 for a ~3-month-AGI

2) I don’t buy it


Rebuttal #1: External validity of coding

  • Core mystery of AI training: does coding improve noncoding skill?

  • METR thesis: “coding is transformative”
    • if an AI can complete one-month tasks from a similar distribution, it could be transformative for society, by automating software engineering and research
  • METR thesis: disregard absolute times; the doubling time is consistent across tasks



Messiness factors

  • Real-life source
  • Resource limited
  • Not easily resettable
  • Irreversible mistake availability
  • Dynamic environment
  • Difficult counterfactuals
  • Not purely automatic scoring
  • Implicit generalizability required
  • Novel situation
  • Nonexplicit scoring description
  • Is suboptimal behavior exploited
  • No provided verification mechanisms
  • Real-time coordination
  • Self-modification required
  • Self-improvement required
  • Information-seeking required


Rebuttal #2: Just watch AIs work

  • Again, the HCAST metric is not “AI does in 1 hour what a human does in 1 hour”. In practice AIs take forever, floundering.
  • o3’s ARC “solution” involved 40 million tokens per task.
  • Gemini took 888 hours to minimally beat Pokemon, with a dozen intense human hand-holds like tile labelling.
  • Rebuttal rebuttal: “ok fine, I retract specific numerical claims. But the trend is up, my dude, and not slow.”


Rebuttal #3: AI too jagged for one scalar

  • The HCAST metric is ~just code! And no figure title admits it.
  • AI struggles at some tasks that take humans <5 minutes (OSWorld),
  • yet is faster and better at some human-day tasks.

    • Task-specific. On chess, other AI is already a 20-year-AGI.
    • Needle-in-haystack retrieval: 99.7% accuracy on 1M tokens. A 12-hour-AGI.


#4: Reliability multiplies down

Classic contrived example:

  • Say the AI is R = 90% reliable on every mini-task.
  • But real tasks involve chaining n mini-tasks together, some of which are crucial (conjunctive).
  • So success is R^n (e.g. 0.9^20 ≈ 12%).
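The arithmetic, for the record:

```python
# Conjunctive chaining: per-step reliability r over n crucial steps,
# assumed independent. Whole-task success collapses geometrically.
r, n = 0.90, 20
print(f"P(whole task succeeds) = {r ** n:.1%}")  # -> 12.2%
```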


Reliability multiplies down

  • Is reliability R improving?
  • HCAST fans can say: yes! “We can infer this from the [code] task horizon expanding.”
  • I say “I don’t know”: the inference is shockingly confounded, as all measurements of generalisation are, by memorisation and semantic duplicates from a closed and unfathomably vast pretraining corpus.


Rebuttal rebuttal:

  • Well, current models have almost 100% success on [code] tasks taking humans <4 minutes.
  • Seems pretty reliable, and maybe chainable!

  • Weak r-r-rebuttal: simple tasks were conceivably in the training set?
  • It takes me 4 mins to get ffmpeg to do anything at all, for instance.


#5: Problems with the human time estimates

  • Noobs: the contractors weren’t familiar with the specific context of each task. They’re in the disoriented situation of a new hire on their first day of work.
  • Contractors took 5-18x longer to do the tasks than METR staff.

  • “Human baseliners allowed to drop tasks they weren’t making progress with, and mostly cut off at 8hrs; baselines average time taken for successful runs; thus biased down for difficult tasks” (toy simulation below).
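A toy simulation of that censoring bias, assuming (my assumption, not theirs) lognormal true completion times: averaging only the runs that finish inside the 8-hour cutoff understates how long the hard tasks really take.

```python
import random

random.seed(0)

# Assumed true completion times in hours (lognormal; parameters invented).
true_hours = [random.lognormvariate(1.5, 1.0) for _ in range(10_000)]

# Baseliners give up at 8 hours; only successful runs enter the average.
kept = [t for t in true_hours if t <= 8]

print(f"true mean:                 {sum(true_hours) / len(true_hours):.1f} h")
print(f"mean over successful runs: {sum(kept) / len(kept):.1f} h")
```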


#5: Problems with the human time estimates (cont.)

  • Incentive to overreport hours: “the number of hours my playtester reported. paid more the higher that number was”


#6: Product overhang exhausted?

  • September 2023: "Q*"
  • November 2023: Altman's coup
  • May 2024: Sutskever officially out
  • Jul 2024: "Strawberry"
  • Sep 2024: o1-preview

i.e. we’re seeing 18 months into the past, not translated but compressed into 4. GPT-4 was released ~9 months after pretraining.

Release overhang: the shocking tempo (o1→R1→o3) was a one-time artefact of dropping safety culture.


#7: Idea overhang exhausted?

  • More importantly, the coup double-decimated their pipeline.
  • Sutskever, Schulman, Zoph, Radford, Weng, ... all gone.
  • Fragmentation means the HCAST trend is too optimistic??

  • My prediction: the labs will slow. Anthropic didn’t get most of the OAI leavers.


#8: Speculation: auto-scoring enables sneakiness

  • We know that the “reasoning” models are sneakier.
  • This has embarrassed various people (the Sakana AI Scientist’s bullshit Torch code fooled auto-evaluation and cursory human inspection).
  • No specific signs of this in HCAST; I’m just paranoid.


#9: I refute it thus

  • A “1 hour” horizon @ 50%?
  • How long does booking a flight take you? 20 mins?

Do you fuck it up 50% of the time?

Does o3 work for booking flights? Does any scaffolded thing?

  • Even if they did get 50% success, this would be useless in most jobs! They do look at 80% and above, and these aren’t scaling as fast (sketch below).
  • Just look.
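Why the 80% horizon is mechanically shorter: in the logistic picture from earlier, you read the curve off at logit(0.8) ≈ 1.39 rather than at 0. Coefficients below are illustrative, not METR's:

```python
import math

# Logistic fit of P(success) vs log2(human minutes); slope b1 < 0 because
# success falls with task length. Coefficients are illustrative only.
b0, b1 = 5.0, -1.0

def horizon_minutes(p: float) -> float:
    # Invert p = sigmoid(b0 + b1 * log2(t)) for t.
    return 2 ** ((math.log(p / (1 - p)) - b0) / b1)

print(f"50% horizon: {horizon_minutes(0.5):.0f} min")  # 32 min
print(f"80% horizon: {horizon_minutes(0.8):.0f} min")  # ~12 min
```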


But Pokemon!

  • Claude 3.7 gets 3 badges (about 6 hours).

  • Completing Blue is about 26 hours, 106,000 in-game actions.


  • Bullshit. Substantial hard-coding and “scaffolds”.


Bottom line on HCAST

  • Very good work
  • The exponential fit is plausible
  • Oversold, unwarranted extrapolation. Code ~only!!!
  • Small n
  • Still not beating the leakage worry
  • Ignore the y-axis and the slope
    • or divide by like 6
  • But yes, the time horizon is improving
  • But yes, the model ranks are probably correct

  • They’re not stupid, but they are jagged



Bottom line on AGI 2027/8

  • “the content of AI 2027 was all but finalized before the METR report came out.”
  • Yet also: “We heavily draw from METR’s recent report which catalogues a trend of increasing time horizon.”


Tangent: How did they get 2022 models to test?

  • Maybe just METR magic backdoor access?
  • GPT-3 is dead.
  • GPT-3.5-T is a heavily changed checkpoint.
