| Models | BIG-bench (Base) | MMLU (Base) | Popular (Base) | Putnam (Base) |
|---|---|---|---|---|
| Mistral 7B | 57.14% | 62.71% | 38% | 13% |
| Llama 3 8B | 59.52% | 66.10% | 50% | 13% |
| Mixtral 8x7B | 58.73% | 71.19% | 50% | 0% |
| Gemma 2 9B | 69.05% | 75.71% | 63% | 0% |
| Llama 3.1 70B | 67.46% | 82.49% | 50% | 0% |
| GPT-4o | 69.84% | 82.49% | 50% | 19% |
| Claude Haiku 3.5 | 74.60% | 84.66% | 63% | 0% |
| Grok-Beta | 76.19% | 87.01% | 88% | 25% |
| Claude Sonnet 3.5 | 80.80% | 88.14% | 63% | 38% |
| Gemini 1.5 Pro | - | - | 50% | 56% |
| o1-preview | - | - | 88% | 63% |
| GPT-4 | - | - | 63% | 19% |
| Models | BIG-bench (Base) | BIG-bench (Auto-CoT) | MMLU (Base) | MMLU (Auto-CoT) | Popular (Base) | Popular (Auto-CoT) | Putnam (Base) | Putnam (Auto-CoT) |
|---|---|---|---|---|---|---|---|---|
| Mistral 7B | 57.14% | - | 62.71% | - | 38% | - | 13% | - |
| Llama 3 8B | 59.52% | - | 66.10% | - | 50% | - | 13% | - |
| Mixtral 8x7B | 58.73% | - | 71.19% | - | 50% | - | 0% | - |
| Gemma 2 9B | 69.05% | - | 75.71% | - | 63% | - | 0% | - |
| Llama 3.1 70B | 67.46% | 78.57% | 82.49% | 91.53% | 50% | 75.00% | 0% | 25.00% |
| GPT-4o | 69.84% | - | 82.49% | - | 50% | - | 19% | - |
| Claude Haiku 3.5 | 74.60% | 76.98% | 84.66% | 84.66% | 63% | 50.00% | 0% | 18.75% |
| Grok-Beta | 76.19% | 79.37% | 87.01% | 92.09% | 88% | 75.00% | 25% | 31.25% |
| Claude Sonnet 3.5 | 80.80% | 80.16% | 88.14% | 90.40% | 63% | 75.00% | 38% | 68.75% |
| Gemini 1.5 Pro | - | - | - | - | 50% | - | 56% | - |
| o1-preview | - | - | - | - | 88% | - | 63% | - |
| GPT-4 | - | - | - | - | 63% | - | 19% | - |
Note: very small dataset; fewer than 400 questions were used.
Relative change from Base to Auto-CoT:

| Models | BIG-bench (Base -> CoT) | MMLU (Base -> CoT) | Popular (Base -> CoT) | Putnam (Base -> CoT) |
|---|---|---|---|---|
| Llama 3.1 70B | 16.47% | 10.96% | 50% | 25% |
| Claude Haiku 3.5 | 3.19% | 0.00% | -21% | 19% |
| Grok-Beta | 4.17% | 5.84% | -15% | 25% |
| Claude Sonnet 3.5 | -0.79% | 2.56% | 19% | 81% |
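Most of the deltas above are relative (not absolute) changes; as a sanity check, a small helper (the function name is mine) reproduces them from the Base and Auto-CoT scores. The Putnam rows with a 0% base cannot be relative, and those entries instead match the absolute gains (0% -> 25.00% listed as 25%).

```python
def relative_delta(base_pct, cot_pct):
    """Relative change of the Auto-CoT score over the Base score,
    expressed in percent and rounded to two decimals."""
    return round((cot_pct - base_pct) / base_pct * 100, 2)

# Llama 3.1 70B, BIG-bench: 67.46% -> 78.57%
print(relative_delta(67.46, 78.57))  # 16.47, matching the table
```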
Notes:
- If the difficulty level is low, a model tends to overanalyze and can get stuck in endless reasoning loops.
- A larger model has more inherent capability and therefore performs better when forced to reason for longer (especially on math problems).
- It is best to gauge the difficulty of a problem upfront.
- A model can produce a wrong answer on one run and a correct answer on another, which suggests that employing a CoT-SC (self-consistency) or Decoding CoT system could prove fruitful.
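The last point can be sketched as a minimal self-consistency (CoT-SC) loop: sample several reasoning paths and majority-vote over their final answers. Here `sample_answer` is a hypothetical stand-in for one sampled model completion, not an API from any particular library:

```python
from collections import Counter

def self_consistency(sample_answer, n=10):
    """Minimal CoT-SC sketch: draw n independent reasoning samples
    (sample_answer wraps one sampled completion and returns its final
    answer) and majority-vote over the answers."""
    answers = [sample_answer() for _ in range(n)]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / n  # winning answer and agreement ratio
```

The agreement ratio doubles as a cheap confidence signal: low agreement can flag questions worth escalating to a stronger model or a longer reasoning budget.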
| Llama 3 8B | BIG-bench | MMLU | Popular | Putnam | Total |
|---|---|---|---|---|---|
| Base | 59.52% | 66.10% | 50% | 13% | 61.10% |
| Greedy Decoding | 65.87% | 74.86% | 75.00% | 12.50% | 69.91% |
| Entropy Encoding | 66.67% | 76.55% | 75.00% | 18.75% | 71.00% |
| Decoding CoT | 73.02% | 78.25% | 75.00% | 18.75% | 74.56% |
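Decoding CoT selects among alternative decoded paths by the model's own answer confidence. One common formulation scores each path by the average probability margin between the top-1 and top-2 tokens over its answer span; the following is a sketch of that idea on toy logits, not the exact pipeline used for the numbers above:

```python
import math

def softmax(logits):
    """Numerically stable softmax over one token's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def path_confidence(answer_token_logits):
    """Average margin between the top-1 and top-2 token probabilities
    across the answer tokens of one decoded path."""
    margins = []
    for logits in answer_token_logits:
        probs = sorted(softmax(logits), reverse=True)
        margins.append(probs[0] - probs[1])
    return sum(margins) / len(margins)

def pick_path(paths):
    """Choose the path whose answer the model is most confident about.
    `paths` maps an answer string to its answer-token logits."""
    return max(paths, key=lambda answer: path_confidence(paths[answer]))
```

A path whose answer tokens dominate their alternatives (a wide top-1 vs top-2 gap) wins over a path the model is hedging on, which is the intuition behind preferring confidently decoded chains of thought.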