The questions will resolve based on the best GPT-4.5-based system (including post-training enhancements) 6 months after the release of the first model named GPT-4.5. If nothing named GPT-4.5 is released, nothing resolves. If GPT-4.5 is not applied to a well-specified benchmark, that question resolves as ambiguous; there is no analogous condition for non-benchmark questions. You could alternatively forecast for GPT-5, or for a specified date rather than a model.
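The resolution rules above can be sketched as a small decision function. This is only an illustration under stated assumptions: the `Question` fields, the status strings, and the reading of "6 months" as six calendar months (with the day clamped to the end of shorter months) are hypothetical and not part of the sheet.

```python
import calendar
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Question:
    name: str
    is_benchmark: bool        # hypothetical flag: is this a benchmark question?
    evaluated_on_gpt45: bool  # hypothetical flag: was a GPT-4.5-based system run on it?


def add_months(d: date, months: int) -> date:
    """Shift a date by whole calendar months, clamping the day if needed."""
    index = d.month - 1 + months
    year, month = d.year + index // 12, index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def resolution_status(q: Question, release: Optional[date]) -> str:
    """Apply the stated rules in order; `release` is None if no GPT-4.5 exists."""
    if release is None:
        return "unresolved"  # nothing named GPT-4.5 was released
    if q.is_benchmark and not q.evaluated_on_gpt45:
        return "ambiguous"   # benchmark question without a GPT-4.5 result
    # Non-benchmark questions have no analogous ambiguity condition.
    return f"resolves on {add_months(release, 6).isoformat()}"
```

For example, with a placeholder release date of `date(2025, 1, 15)` (not a real date), a benchmark question that has a GPT-4.5 result would return `"resolves on 2025-07-15"`.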
All questions are optional but encouraged! The 10th and 90th percentiles are optional; the median is required. For yes/no questions, input your probability where the median would go.
Forecasts
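Reference columns: GPT-3.5, Human, Expert Human, Current Best.

| Question | GPT-3.5 | Human | Expert Human | Current Best | 10th percentile (optional) | Median (required) | 90th percentile (optional) | Rationale (highly encouraged) |
|---|---|---|---|---|---|---|---|---|
| GPQA | 28% | 34% | 74% | 60% | | | | |
| WebArena | 6% | 78% | - | 14% | | | | |
| GAIA | - | 92% | | 24% | | | | |
| SWE-Bench, unassisted | 0.2% | - | - | 13.86% | | | | |
| InterCode | 46.50% | - | - | 48.50% | | | | |
| Anthropic ASL-3 ARA evals (note: only 5 tasks) | 0% | 0% | 100% | 0% | | | | |
| Chatbot Arena ELO, relative to GPT-4-1106-preview | -138 | - | - | 0 | | | | |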
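The entry rules for a forecast row (median required, 10th and 90th percentiles optional, probability in the median slot for yes/no questions) could be checked with a small helper like the one below. It is a sketch, not part of the sheet: the function name, the error messages, and the assumption that yes/no probabilities are entered as fractions between 0 and 1 are all hypothetical. The ordering check follows from the definition of percentiles.

```python
from typing import List, Optional


def validate_forecast(
    median: Optional[float],
    p10: Optional[float] = None,
    p90: Optional[float] = None,
    yes_no: bool = False,
) -> List[str]:
    """Return a list of problems with one forecast row (empty if it is fine)."""
    if median is None:
        return ["median is required"]
    errors = []
    # Assumption: yes/no probabilities are fractions in [0, 1], not percentages.
    if yes_no and not 0.0 <= median <= 1.0:
        errors.append("probability must be between 0 and 1")
    # Percentiles are optional, but when given they must bracket the median.
    if p10 is not None and p10 > median:
        errors.append("10th percentile is above the median")
    if p90 is not None and p90 < median:
        errors.append("90th percentile is below the median")
    return errors
```

A row with only a median passes, and so does a row where the 10th percentile, median, and 90th percentile are in non-decreasing order.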