Testing CoT

3	Models	BIG-bench (Base)	MMLU (Base)	Popular (Base)	Putnam (Base)
4	Mistral 7B	57.14%	62.71%	38%	13%
5	Llama 3 8B	59.52%	66.10%	50%	13%
6	Mixtral 8x7B	58.73%	71.19%	50%	0%
7	Gemma 2 9B	69.05%	75.71%	63%	0%
8	Llama 3.1 70B	67.46%	82.49%	50%	0%
9	GPT-4o	69.84%	82.49%	50%	19%
10	Claude Haiku 3.5	74.60%	84.66%	63%	0%
11	Grok-Beta	76.19%	87.01%	88%	25%
12	Claude Sonnet 3.5	80.80%	88.14%	63%	38%
13	Gemini 1.5 Pro			50%	56%
14	o1-preview			88%	63%
15	GPT-4			63%	19%
16
17
18
19
20
21	Models	BIG-bench (Base)	BIG-bench (Auto-CoT)	MMLU (Base)	MMLU (Auto-CoT)	Popular (Base)	Popular (Auto-CoT)	Putnam (Base)	Putnam (Auto-CoT)
22	Mistral 7B	57.14%		62.71%		38%		13%
23	Llama 3 8B	59.52%		66.10%		50%		13%
24	Mixtral 8x7B	58.73%		71.19%		50%		0%
25	Gemma 2 9B	69.05%		75.71%		63%		0%
26	Llama 3.1 70B	67.46%	78.57%	82.49%	91.53%	50%	75.00%	0%	25.00%
27	GPT-4o	69.84%		82.49%		50%		19%
28	Claude Haiku 3.5	74.60%	76.98%	84.66%	84.66%	63%	50.00%	0%	18.75%
29	Grok-Beta	76.19%	79.37%	87.01%	92.09%	88%	75.00%	25%	31.25%
30	Claude Sonnet 3.5	80.80%	80.16%	88.14%	90.40%	63%	75.00%	38%	68.75%
31	Gemini 1.5 Pro	-				50%		56%
32	o1-preview	-				88%		63%
33	GPT-4	-				63%		19%
34
35
36	Note: very small dataset, fewer than 400 questions were used.
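To make the small-dataset caveat concrete, the sketch below computes a Wilson score interval for a single accuracy figure. The counts are hypothetical: the sheet does not list per-benchmark question counts, so 4 correct out of 8 is used only as an illustration of a score that would display as 50%.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical counts (not taken from the sheet): 4 of 8 questions correct.
low, high = wilson_interval(4, 8)
print(f"4/8 correct: plausible accuracy from {low:.1%} to {high:.1%}")
```

With so few questions the interval spans most of the scale, so a one-question difference between two models on the smallest benchmarks should not be read as a real gap.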
37
38
39
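40	Models	BIG-bench (Base -> Auto-CoT, relative change)	MMLU (Base -> Auto-CoT, relative change)	Popular (Base -> Auto-CoT, relative change)	Putnam (Base -> Auto-CoT, relative change; absolute gain where Base is 0%)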
41	Llama 3.1 70B	16.47%	10.96%	50%	25%
42	Claude Haiku 3.5	3.19%	0.00%	-21%	19%
43	Grok-Beta	4.17%	5.84%	-15%	25%
44	Claude Sonnet 3.5	-0.79%	2.56%	19%	81%
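The figures in this table appear to be relative changes, (Auto-CoT - Base) / Base, rather than percentage-point differences. A minimal sketch that reproduces the BIG-bench column from the scores listed earlier in the sheet follows. The function name `relative_change` is an illustrative choice, and returning `None` for a 0% base is an assumption: the relative change is undefined there, and the Putnam entries with a 0% base match the absolute gain instead.

```python
from typing import Optional


def relative_change(base: float, cot: float) -> Optional[float]:
    """Relative change from base to cot, in percent; None when base is 0."""
    if base == 0:
        return None  # undefined: cannot divide by a 0% base score
    return (cot - base) / base * 100


# BIG-bench (Base, Auto-CoT) scores copied from the table above, in percent.
big_bench = {
    "Llama 3.1 70B": (67.46, 78.57),
    "Claude Haiku 3.5": (74.60, 76.98),
    "Grok-Beta": (76.19, 79.37),
    "Claude Sonnet 3.5": (80.80, 80.16),
}

for model, (base, cot) in big_bench.items():
    print(f"{model}: {relative_change(base, cot):+.2f}%")
```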
45
46	Notes:
47	If the difficulty of a problem is low, a model tends to overanalyze and can get stuck in endless loops.
48	Larger models have more inherent capability and therefore perform better when forced to reason for longer, especially on math problems.
49	It is best to gauge the difficulty of a problem upfront.
50	A model sometimes produces wrong results and sometimes correct ones, so a CoT-SC (self-consistency) or Decoding CoT system could prove fruitful.
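The last note can be illustrated with a minimal self-consistency (CoT-SC) sketch: sample several reasoning paths and keep the most common final answer. Everything here is an assumption for illustration. `sample_answer` stands in for a model call that returns one final answer per sampled chain of thought, and `make_noisy_model` is a toy replacement for a real model, not part of any library.

```python
import random
from collections import Counter
from typing import Callable, Tuple


def self_consistent_answer(
    sample_answer: Callable[[str], str], question: str, n_samples: int = 5
) -> Tuple[str, float]:
    """Sample n_samples answers and return the majority answer and its vote share."""
    answers = [sample_answer(question) for _ in range(n_samples)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n_samples


def make_noisy_model(correct: str, wrong: str, p_correct: float, seed: int = 0):
    """Toy stand-in for a sampled CoT model: right with probability p_correct."""
    rng = random.Random(seed)

    def sample(question: str) -> str:
        return correct if rng.random() < p_correct else wrong

    return sample


model = make_noisy_model(correct="42", wrong="41", p_correct=0.7)
answer, agreement = self_consistent_answer(model, "toy question", n_samples=9)
print(f"majority answer: {answer} (agreement {agreement:.0%})")
```

The vote share doubles as a rough confidence signal: low agreement suggests the question is hard for the model and may deserve more samples.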
51
52
53
54
55
56	Llama 3 8B	BIG-bench	MMLU	Popular	Putnam	Total
57	Base	59.52%	66.10%	50%	13%	61.10%
58	Greedy Decoding	65.87%	74.86%	75.00%	12.50%	69.91%
59	Entropy Encoding	66.67%	76.55%	75.00%	18.75%	71.00%
60	Decoding CoT	73.02%	78.25%	75.00%	18.75%	74.56%
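The sheet does not describe how the "Decoding CoT" row was produced. Assuming it refers to CoT-decoding, where several candidate continuations are generated by branching on the top-k first tokens and the path with the most confident answer is kept, the selection step can be sketched as below. The probabilities are made up for illustration, and `answer_confidence` and `pick_path` are hypothetical helper names.

```python
from typing import Dict, Sequence, Tuple

# Each answer token is summarised by (top-1 probability, top-2 probability).
TokenProbs = Sequence[Tuple[float, float]]


def answer_confidence(top2_probs: TokenProbs) -> float:
    """Mean gap between top-1 and top-2 probabilities over the answer tokens."""
    if not top2_probs:
        raise ValueError("need at least one answer token")
    return sum(p1 - p2 for p1, p2 in top2_probs) / len(top2_probs)


def pick_path(paths: Dict[str, TokenProbs]) -> str:
    """Return the name of the decoding path with the highest answer confidence."""
    return max(paths, key=lambda name: answer_confidence(paths[name]))


# Made-up numbers: the greedy path is unsure of its answer, branch 2 is not.
paths = {
    "greedy path": [(0.48, 0.41), (0.52, 0.40)],
    "branch 2": [(0.91, 0.05), (0.88, 0.07)],
}
print(pick_path(paths))
```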