HCAST: can they do stuff yet?
12.05.2025
Gavin Leech, Arb Research
Read the following talk as red-teaming:
Can Gavin manage to not believe this if he tries really hard?
OK, AI is improving. But how quickly? On real tasks?
1st serious measure of progress towards AGI
How hard is stuff?
“Persnickety” (accurate) version
Their prediction:
“If the measured trend from the past 6 years continues for 2-4 more years, generalist autonomous agents will be capable of performing a wide range of week-long tasks.”
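A rough sanity check of the arithmetic behind this claim, as a minimal sketch. The ~1-hour current horizon and ~7-month doubling time are my assumptions (roughly METR’s reported figures), not stated on the slide:

    import math

    # Assumed inputs (mine, not the slide's): a ~1-hour 50%-success time
    # horizon at measurement time, and METR's ~7-month doubling time.
    current_horizon_hours = 1.0
    doubling_time_months = 7.0

    for label, target_hours in [("week-long (~40h)", 40.0),
                                ("3-month (~500h)", 500.0)]:
        doublings = math.log2(target_hours / current_horizon_hours)
        months = doublings * doubling_time_months
        print(f"{label}: {doublings:.1f} doublings, ~{months / 12:.1f} years")

    # week-long (~40h): 5.3 doublings, ~3.1 years (inside "2-4 more years")
    # 3-month (~500h): 9.0 doublings, ~5.2 years (i.e. ~2030; cf. the
    # Sep 2030 figure later in the talk)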
Fractional AGI?: t-AGI
(Ngo’s framing: a t-AGI beats most human experts on cognitive tasks that take time t or less.)
Details
One-third leakable
Reasoning trend break?
So is this it? Is 2027 it?
1) HCAST predicts Sep 2030 for a ~3-month-AGI
2) I don’t buy it
Rebuttal #1: External validity of coding
Messiness factors
- Real life source
- Resource limited
- Not easily resettable
- Irreversible mistake availability
- Dynamic environment
- Difficult counterfactuals
- Not purely automatic scoring
- Implicit generalizability required
- Novel situation
- Nonexplicit scoring description
- Is suboptimal behavior exploited
- No provided verification mechanisms
- Real-time coordination
- Self modification required
- Self improvement required
- Information-seeking required
Rebuttal #2: Just watch AIs work
Rebuttal #3: AI too jagged for one scalar
#4: Reliability multiplies down
Classic contrived example:
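The slide leaves the example implicit; a minimal version, assuming a task of k independent steps, each succeeding with probability p:

    # End-to-end reliability of a k-step chain with per-step success p:
    # P(task) = p ** k, so per-step reliability multiplies down fast.
    def chain_success(p: float, k: int) -> float:
        return p ** k

    print(chain_success(0.99, 100))  # ~0.37: 99% per step, ~37% per task
    print(chain_success(0.95, 100))  # ~0.006: small per-step gaps compound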
Reliability multiplies down
…shockingly confounded, as all measurements of generalisation are, by memorisation and semantic duplicates from a closed and unfathomably vast pretraining corpus.
Rebuttal rebuttal:
#5: Problems with the human time estimates
#6: Product overhang exhausted?
Release overhang: the shocking tempo (o1→R1→o3) is a one-time artefact of dropping safety culture.
i.e. We’re seeing 18 months into the past, not translated but compressed into 4. GPT-4 was released ~9 months after pretraining ended.
#7: Idea overhang exhausted?
#8: Speculation: auto scoring enables sneakiness
#9: I refute it thus
Do you fuck it up 50% of the time?
Does o3 work for booking flights? Does any scaffolded thing?
But Pokémon!
Bottom line on HCAST
Bottom line on AGI 2027/8
Tangent: How did they get 2022 models to test?