Metaculus Forecasting AI Progress

	C	D	E	F	G	H	I	J	K	L	M	N	O	P	Q	R	S	T	U	V
1	Question	Category	Language benchmark?	Reverse-code?	Exclude?	Resolution	Metaculus prediction (25%)	Metaculus prediction (50%)	Metaculus prediction (75%)	Optimism	Relative error	|Relative error|	Relative log error	CDF	CDF (rev)	Log score	Horizon (days)	Forecasters	Predictions	Notes

2	What will the state-of-the-art object detection performance on COCO be, at 2021-06-14 in box Average Precision (box AP)?	Benchmark				58.70	57.32	57.36	57.61	Q4	-2%	2%	0.977	97%	97%	-2.61	120	35	83	Metaculus expected ~no change from the top score at the time but a new paper that beat the SotA came out ~1 month after question close and ~3 months before resolution. Looking at the chart plotting SotA over time on PapersWithCode, this doesn't seem like a big, unexpected jump. Unclear why the community was so overconfident on "no change" since no-one left any comments.
3	What will the value of the herein defined Image Classification Performance Index be on 2021-06-14?	Benchmark				120.87	115.08	115.47	117.09	Q4	-4%	4%	0.955	96%	96%	0.08	281	60	227	The initial prediction was pretty close to the true value but it declined over time, presumably due to lack of progress. I guess people overupdated on that and didn't put enough probability mass on the possibility of a big jump. For short horizon questions (a few months) like this one, there's always the problem of seasonal publication dynamics (e.g. big conferences) swamping actual progress as the main factor.
4	What will the state-of-the-art performance on semantic segmentation on Cityscapes be at 2021-06-14 in mean IoU in percent (MIoU%)?	Benchmark				84.30	83.32	83.34	83.42	Q4	-1%	1%	0.989	96%	96%	-0.39	281	60	195	I have no idea what's going on here. Resolution and SotA at question open as implied by the question description don't match what's on PapersWithCode. The community prediction was consistently below the PapersWithCode SotA during all the time the question was open.
5	What will the state-of-the-art performance on semantic segmentation of PASCAL-Context be at 2021-06-14 in mean IoU in percent (MIoU%)?	Benchmark				60.50	58.94	58.95	59.01	Q4	-3%	3%	0.974	95%	95%	-1.24	225	52	175	Seems like another case of anchoring to the current SotA as time goes by without new results, plus a result coming out after question close that beats the SotA in a way that's totally unsurprising if one looks at the trend over several years.
6	What will the state-of-the-art object detection performance on COCO be, at 2022-01-14 in box average precision (box AP)?	Benchmark				63.30	57.59	58.86	62.07	Q4	-7%	7%	0.930	93%	93%	-2.17	373	30	96	Looking at this graph I don't see any obvious jumps between 2021 and 2022. Not sure why the community underestimated progress so much.
7	What will the state-of-the-art performance on image classification on ImageNet be at 2022-01-14 in top-1 accuracy?	Benchmark				88.30	86.53	86.91	87.58	Q4	-2%	2%	0.984	92%	92%	-1.14	373	48	168
8	What will the state-of-the-art performance on semantic segmentation of PASCAL-Context be on 2023-02-14 in mean IoU in percent (MIoU%), amongst models not trained on extra data?	Benchmark				68.87	61.70	63.56	65.53	Q4	-8%	8%	0.923	92%	92%	-1.95	670	28	119
9	What will the state-of-the-art language modelling performance on One Billion Word be on 2022-01-14, in perplexity?	Benchmark				20.25	21.16	21.46	21.54	Q4	-6%	6%	0.944	8%	92%	-1.09	373	41	181	SOTA was beaten a few days before question close and the Metaculus prediction shows a very narrow peak around that figure, while the community prediction is more spread out. This suggests to me that top forecasters predicted nothing would happen in the 9 months between close and resolution, while other forecasters either failed to update on the latest result or were less overconfident.
10	What will the state-of-the-art performance on semantic segmentation on Cityscapes be at 2022-01-14 in mean IoU in percent (MIoU%)?	Benchmark				84.40	83.32	83.41	83.88	Q4	-1%	1%	0.988	89%	89%	-0.12	373	35	122
11	What will the highest Exact Match rate of the best-performing model on SQuAD2.0 be on 2021-06-14?	Benchmark				90.94	90.73	90.74	90.77	Q4	0%	0%	0.998	88%	88%	0.64	121	59	208
12	What will be the state-of-the-art language modelling performance (in perplexity) on WikiText-103 by the following dates? (January 14, 2022)	Benchmark				14.80	15.30	15.71	15.76	Q4	-6%	6%	0.942	12%	88%	0.62	325	44	125	This one was confusing because the fine print restricted how models had to be trained to count towards resolution, which made it hard to find reliable info: e.g. PapersWithCode listed models that didn't meet the restriction. See this comment thread for details.
13	What will the state-of-the-art performance on one-shot image classification on MiniImageNet be, on 2021-06-14, in accuracy?	Benchmark				84.81	82.94	82.97	83.77	Q4	-2%	2%	0.978	85%	85%	-0.11	276	100	314
14	What will the state-of-the-art performance on image classification on ImageNet be at 2021-06-14 in top-1 accuracy?	Benchmark				86.78	86.46	86.53	86.64	Q4	0%	0%	0.997	83%	83%	0.78	121	74	291
15	What will the state-of-the-art language modelling performance on WikiText-103 be on 2023-02-14 in perplexity, amongst models not trained on extra data?	Benchmark				14.80	15.35	15.73	15.76	Q4	-6%	6%	0.941	18%	82%	-0.16	670	26	97
16	What will the state-of-the-art language modelling performance on One Billion Word be on 2023-02-14, in perplexity, amongst models not trained on extra data?	Benchmark				20.25	20.42	21.22	21.51	Q4	-5%	5%	0.954	21%	79%	0.82	670	36	109
17	What will the state-of-the-art object detection performance on COCO be, on 2023-02-14 in box average precision (box AP) amongst all models?	Benchmark				65.40	60.93	63.49	66.20	Q3	-3%	3%	0.971	78%	78%	-0.17	671	31	94	Same as the question below, no discontinuous jumps.
18	What will the state-of-the-art performance on one-shot image classification on miniImageNet be, at 2022-01-14 in accuracy?	Benchmark				84.81	82.95	83.71	85.31	Q3	-1%	1%	0.987	70%	70%	-0.27	373	34	137
19	What share (in %) of the world's super-compute performance will be based in the United States in the November 2022 publication of TOP500 list?	Compute				43.74	31.64	38.64	45.74	Q3	-12%	12%	0.883	69%	69%	1.41	681	32	101	IIRC there were some shenanigans re: China under-reporting these numbers. Lots of info in the comments of this similar INFER Pub question that I was heavily involved in, but I forgot the details.
20	What percent will software and information services contribute to US GDP in Q1 of 2021?	Econ				3.09	2.94	3.04	3.14	Q3	-2%	2%	0.983	64%	64%	1.91	217	83	226
21	What percent of total GDP will software and information services contribute to US GDP in Q3 of 2021?	Econ				3.22	3.05	3.17	3.29	Q3	-2%	2%	0.984	61%	61%	1.23	298	38	92
22	What will the state-of-the-art performance on semantic segmentation on Cityscapes be on 2023-02-14 in mean IoU in percent (MIoU%), amongst models not trained on extra data?	Benchmark				84.40	83.35	84.00	85.06	Q3	0%	0%	0.995	61%	61%	0.69	670	29	93
23	What will the price of IGM be, on 2021-06-14?	Econ				392.89	363.61	387.07	411.36	Q3	-1%	1%	0.985	59%	59%	1.00	120	80	220
24	What share (in %) of the world's super-compute performance will be United States-based in the TOP500 list on the following dates? (June 2021)	Compute				30.55	26.03	30.28	36.31	Q3	-1%	1%	0.991	55%	55%	0.82	133	65	191
25	What will the value of the herein defined Object Detection Performance Index be on 2023-02-15?	Benchmark				135.27	129.12	134.28	140.20	Q3	-1%	1%	0.993	55%	55%	0.82	671	34	105
26	What will the value of the herein defined Image Classification Performance Index be on 2022-01-14?	Benchmark				123.71	120.49	123.46	126.66	Q3	0%	0%	0.998	53%	53%	0.87	373	39	121
27	How many e-prints on AI Safety, Interpretability or Explainability will be published on arXiv over the 2020-12-14 to 2021-06-14 period?	Biblio				260.00	223.81	256.86	288.96	Q3	-1%	1%	0.988	53%	53%	1.10	121	84	221
28	How many e-prints on multi-modal machine learning will be published on arXiv over the 2020-12-14 to 2021-06-14 period?	Biblio				109.00	93.91	107.92	122.07	Q3	-1%	1%	0.990	51%	51%	1.10	121	65	196
29	What will the value of the herein defined Object Detection Performance Index be on 2021-06-14?	Benchmark				121.81	119.73	122.29	125.81	Q2	0%	0%	1.004	51%	51%	0.90	281	54	198
30	What will the state-of-the-art performance on SuperGLUE be on 2021-06-14?	Benchmark				90.40	90.31	90.40	90.62	Q3	0%	0%	1.000	51%	51%	2.27	120	71	266
31	What will be the state-of-the-art language modelling performance (in perplexity) on WikiText-103 by the following dates? (June 14, 2021)	Benchmark				15.79	15.72	15.76	15.77	Q1	0%	0%	1.002	50%	50%	5.92	157	63	188
32	What will the state-of-the-art language text-to-SQL performance on WikiSQL be at 2021-06-14 in logical form test accuracy?	Benchmark				87.80	87.81	87.83	88.23	Q1	0%	0%	1.000	50%	50%	5.94	120	66	216
33	What will the combined sector weighting of Information Technology and Communications be, in the S&P 500 on 2021-06-14?	Econ				39.16	38.08	39.20	40.36	Q2	0%	0%	1.001	49%	49%	2.29	226	75	256
34	How many e-prints on Few-Shot Learning will be published on arXiv over the 2020-12-14 to 2021-06-14 period?	Biblio				744.00	694.73	745.57	812.96	Q2	0%	0%	1.002	48%	48%	1.17	121	73	185
35	How much will the average degree of automation change for key US professions from December 2020 to the following dates? (January 2022)	Econ				1.31	-0.45	1.75	4.46	Q2	33%	33%	1.335	45%	45%	2.48	357	50	151	Wide range, true value came close to median :shrug:
36	How many e-prints on multi-modal learning will be published on ArXiv over the 2021-12-14 to 2022-01-14 period?	Biblio				248.00	223.54	259.35	303.87	Q2	5%	5%	1.046	43%	43%	1.77	311	35	117
37	How many e-prints on AI Safety, Interpretability or Explainability will be published on arXiv over the 2021-01-14 to 2022-01-14 period?	Biblio				560.00	515.60	575.86	636.73	Q2	3%	3%	1.028	42%	42%	2.59	311	39	141
38	What will the value of the herein defined Object Detection Performance Index be on 2022-01-14?	Benchmark				125.77	123.59	126.96	130.36	Q2	1%	1%	1.009	41%	41%	0.68	373	35	103
39	What will the value of the herein defined Image Classification Performance Index be on 2023-02-14?	Benchmark				128.94	125.71	130.87	136.61	Q2	1%	1%	1.015	40%	40%	0.99	672	32	109
40	What will the Federal Reserves' Industrial Production Index be for April 2021, for semiconductors, printed circuit boards and related products?	Econ				198.80	197.98	199.82	201.86	Q2	1%	1%	1.005	35%	35%	1.89	122	65	177
41	What will the highest Exact Match rate of the best-performing model on SQuAD2.0 be on 2022-01-14?	Benchmark				90.94	90.90	91.05	91.28	Q2	0%	0%	1.001	35%	35%	1.01	326	39	152
42	What will the combined sector weighting of Information Technology and Communications be, in the S&P 500 on 2022-01-14?	Econ				38.35	37.67	39.17	40.69	Q2	2%	2%	1.021	35%	35%	1.50	311	36	129
43	What will be the maximum compute (in petaFLOPS-days) ever used in training an AI experiment by the following dates? (Feb-2023)	Compute				31712.96	26933.26	57277.55	116671.87	Q2	81%	81%	1.806	29%	29%	0.87	670	31	114
44	How many e-prints on Few-Shot Learning will be published on ArXiv over the 2021-02-14 to 2023-02-14 period?	Biblio				4089.00	3999.78	4476.93	4978.83	Q2	9%	9%	1.095	29%	29%	1.85	679	31	97
45	What will the performance be of the top-performing supercomputer (in exaFLOPS) in the TOP500 according to their November 2022 list?	Compute				1.10	1.02	1.28	1.49	Q2	16%	16%	1.157	28%	28%	1.31	680	35	122
46	How many e-prints on Few-Shot Learning will be published on ArXiv over the 2021-01-14 to 2022-01-14 period?	Biblio				1671.00	1668.62	1833.78	1985.18	Q2	10%	10%	1.097	25%	25%	1.46	311	36	123
47	How many e-prints on AI Safety, interpretability or explainability will be published on ArXiv over the 2021-02-14 to 2023-02-14 period?	Biblio				1211.00	1230.03	1432.61	1652.76	Q1	18%	18%	1.183	23%	23%	2.59	734	31	99
48	What will be the sum of the performance (in exaFLOPS) of the top 500 supercomputers in the following dates? (November 2022)	Compute				4.86	4.72	5.42	6.17	Q2	11%	11%	1.114	22%	22%	1.71	680	39	108
49	What will be the sum of the performance (in exaFLOPS) of the top 500 supercomputers in the following dates? (June 2021)	Compute				2.80	2.87	3.17	3.99	Q1	13%	13%	1.133	20%	20%	-0.10	134	91	321
50	What will the state-of-the-art performance on one-shot image classification on miniImageNet be, on 2023-02-14 in accuracy, amongst models not trained on extra data?	Benchmark				86.11	87.06	89.93	91.01	Q1	4%	4%	1.044	20%	20%	-1.32	671	30	99
51	How many Reinforcement Learning e-prints will be published on arXiv over the 2020-12-14 to 2021-06-14 period?	Biblio				1590.00	1586.84	1700.14	1815.22	Q2	7%	7%	1.069	19%	19%	1.77	121	59	163
52	What percent of total GDP will software and information services contribute to US GDP in Q3 of 2022?	Econ				3.13	3.16	3.29	3.43	Q1	5%	5%	1.050	19%	19%	1.92	681	32	76
53	How many Computer Vision and Pattern Recognition e-prints will be published on arXiv over the 2020-12-14 to 2021-06-14 period?	Biblio				8207.00	8338.51	8713.27	9200.80	Q1	6%	6%	1.062	18%	18%	0.94	120	60	172
54	How many Computation and Language e-prints will be published on arXiv over the 2020-12-14 to 2021-06-14 period?	Biblio				3938.00	4109.83	4431.17	4841.95	Q1	13%	13%	1.125	15%	15%	1.22	120	62	179
55	What will be the maximum compute (in petaFLOPS-days) ever used in training an AI experiment by the following dates? (January 14, 2022)	Compute				13542.00	8407.30	16229.32	31621.25	Q2	20%	20%	1.198	12%	12%	0.87	372	42	129
56	What will the highest Exact Match rate of the best-performing model on SQuAD2.0 be on 2023-02-14?	Benchmark				90.94	91.35	91.85	92.53	Q1	1%	1%	1.010	11%	11%	0.34	672	30	118
57	What will be the sum of the performance (in exaFLOPS) of the top 500 supercomputers in the following dates? (November 2021)	Compute				3.04	3.47	3.95	4.47	Q1	30%	30%	1.300	7%	7%	0.88	247	36	124
58	How many Reinforcement Learning e-prints will be published on arXiv over the 2021-02-14 to 2023-02-14 period?	Biblio				7202.00	8148.60	8768.55	9391.50	Q1	22%	22%	1.218	6%	6%	0.34	680	37	102
59	What will the price of IGM be, on 2023-02-14, in 2019 USD?	Econ				280.98	391.28	460.52	537.20	Q1	64%	64%	1.639	4%	4%	-1.07	671	32	119	Probably a similar story, haven't run the numbers.
60	How many Computer Vision and Pattern Recognition e-prints will be published on arXiv over the 2021-02-14 to 2023-02-14 period?	Biblio				37123.00	42429.86	44676.12	46966.40	Q1	20%	20%	1.203	4%	4%	-0.59	735	29	83
61	How many Natural Language Processing e-prints will be published on arXiv over the 2021-02-14 to 2023-02-14 period?	Biblio				17199.00	19848.96	21082.88	22363.20	Q1	23%	23%	1.226	3%	3%	-0.39	679	31	106
62	How many Computer Vision and Pattern Recognition e-prints will be published on arXiv over the 2021-01-14 to 2022-01-14 period?	Biblio				17249.00	18920.18	19666.65	20399.52	Q1	14%	14%	1.140	3%	3%	0.25	311	43	99
63	How many Reinforcement Learning e-prints will be published on arXiv over the 2021-01-14 to 2022-01-14 period?	Biblio				3375.00	3825.48	4015.12	4215.56	Q1	19%	19%	1.190	2%	2%	0.06	310	37	93
64	How many Natural Language Processing e-prints will be published on arXiv over the 2021-01-14 to 2022-01-14 period?	Biblio				8066.00	9142.36	9581.54	10083.44	Q1	19%	19%	1.188	2%	2%	-0.26	311	41	98	Growth was surprisingly weak compared to previous years. Metaculus bet on exponential growth continuing.
65	What will Alphabet Inc.'s market capitalisation be at market close on 2023-02-14?	Econ				1.05	1.54	1.84	2.20	Q1	75%	75%	1.750	1%	1%	-1.07	671	31	90	Market cap at question close (2021-04-15) was ~$1.46T, so the median prediction implies ~12% growth per year. That's surprisingly bullish imo. Prediction looks even worse because 2022 saw a stock market slump.
66	What will be the average top price performance (in G3D Mark /$) of the best available GPU on the following dates? (June 14, 2021)	Compute				40.80	60.90	62.94	65.93	Q1	54%	54%	1.543	0%	0%	-8.07	120	74	203	GPU price spike; see the note on the January 14, 2022 version of this question.
67	What will be the average top price performance (in G3D Mark /$) of the best available GPU on the following dates? (January 14, 2022)	Compute				23.20	63.69	68.83	73.77	Q1	197%	197%	2.967	0%	0%	-5.67	307	45	133	Biggest miss in terms of relative error. I blame it on the post-covid price spike (driven by supply chain disruptions and crypto mining afaik). I don't quite remember when exactly prices peaked but this piece from Nov 2021 suggests they were still on the way up back then.
68	What will the highest score be, on Atari 2600 Montezuma's Revenge, by any ML model that is un-augmented with domain knowledge on 2022-01-14?	Benchmark				43791.00	44321.88	45356.53	140920.33	Q1	4%	4%	1.036	0%	0%	6.82	373	42	173	Resolution coincided with the lower bound of the range, so this number is slightly fake. The Metaculus prediction was bunched up against the lower bound but still predicted higher numbers.
69	What will the highest score of any ML model that is un-augmented with domain knowledge on Atari 2600 Montezuma's Revenge be on 2023-02-14?	Benchmark				43791.00	98544.46	1483981.70	1501456.21	Q1	3289%	3289%	33.888	0%	0%	5.34	670	35	151	Same as above, though less so because there was a significant peak on the upper end too.
70	What will the state-of-the-art language text-to-SQL performance on WikiSQL be on 2023-02-14 in logical form test accuracy?	Benchmark				87.80	90.52	91.60	92.72	Q1	4%	4%	1.043	0%	0%	2.47	670	29	85	Roughly same as above, though much less so because there was a good chunk of probability mass in the upper part of the range.
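
The derived columns can be roughly reproduced from the Resolution and quartile columns. The sketch below is a reconstruction inferred from the numbers in the table, not the sheet's actual formulas: it assumes "Relative log error" is the ratio of the median prediction to the resolution (inverted for reverse-coded questions such as perplexity, where lower is better), that "Relative error" is that ratio minus 1, and that "Optimism" is the quartile bucket of the prediction in which the resolution landed. The `Row` class, its field names, and the `reverse_coded` flag are hypothetical; the export leaves the "Reverse-code?" column blank, so the flag is guessed from rows where CDF and CDF (rev) differ.

```python
from dataclasses import dataclass


@dataclass
class Row:
    """One question: resolution plus the 25%/50%/75% prediction quartiles."""
    resolution: float
    q25: float
    q50: float
    q75: float
    reverse_coded: bool = False  # assumed True when lower values mean more progress


def ratio(row: Row) -> float:
    """Median / resolution, inverted for reverse-coded rows (assumed formula)."""
    r = row.q50 / row.resolution
    return 1 / r if row.reverse_coded else r


def relative_error(row: Row) -> float:
    """Signed relative error of the median, taken as ratio - 1 (assumed formula)."""
    return ratio(row) - 1


def quartile(row: Row) -> str:
    """Quartile bucket of the prediction that contains the resolution."""
    # Count how many quartile boundaries the resolution meets or exceeds.
    passed = sum(row.resolution >= q for q in (row.q25, row.q50, row.q75))
    if row.reverse_coded:
        passed = 3 - passed  # flip the scale so Q4 still means "more progress"
    return f"Q{passed + 1}"


# Three rows copied from the table above.
coco_2021 = Row(58.70, 57.32, 57.36, 57.61)
one_billion_word_2022 = Row(20.25, 21.16, 21.46, 21.54, reverse_coded=True)
gpu_2022 = Row(23.20, 63.69, 68.83, 73.77)

for name, row in [("COCO 2021", coco_2021),
                  ("One Billion Word 2022", one_billion_word_2022),
                  ("GPU price performance 2022", gpu_2022)]:
    print(name, quartile(row), f"{relative_error(row):+.0%}", f"{ratio(row):.3f}")
```

For reverse-coded rows the signed relative error could equally be computed against the resolution rather than the median; both variants round to the values shown in the table, so the sheet's exact choice cannot be recovered from the export.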