Row | Task | Type | Model status | Random | Avg human | Best human | GPT 3-shot | PaLM 3-shot | GPT >= avg human? | PaLM >= avg human? | PaLM >= 90% of avg human? | GPT >= best human? | PaLM >= best human? | PaLM > GPT? | Best/avg human | PaLM/GPT | PaLM/avg human | PaLM/best human | GPT/avg human | GPT/best human | GPT/random | PaLM/random | Avg human / random | Best human / random | Commentary | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | abstract_narrative_understanding | multiple choice | Models at avg human | 20 | 30 | 60 | 40 | 55 | TRUE | TRUE | TRUE | FALSE | FALSE | TRUE | 200% | 138% | 183% | 92% | 133% | 67% | 200% | 275% | 150% | 300% | ||
3 | anachronisms | true / false | PaLM at avg human | 50 | 60 | 90 | 53 | 65 | FALSE | TRUE | TRUE | FALSE | FALSE | TRUE | 150% | 123% | 108% | 72% | 88% | 59% | 106% | 130% | 120% | 180% | ||
4 | analogical_similarity | multiple choice | PaLM at avg human | 15 | 40 | 100 | 20 | 40 | FALSE | TRUE | TRUE | FALSE | FALSE | TRUE | 250% | 200% | 100% | 40% | 50% | 20% | 133% | 267% | 267% | 667% | ||
5 | analytic_entailment | true / false | PaLM almost at avg human | 50 | 80 | 100 | 63 | 75 | FALSE | FALSE | TRUE | FALSE | FALSE | TRUE | 125% | 119% | 94% | 75% | 79% | 63% | 126% | 150% | 160% | 200% | ||
6 | ascii_word_recognition | exact str match | Models subhuman | 0 | 85 | 100 | 5 | 15 | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE | 118% | 300% | 18% | 15% | 6% | 5% | ||||||
7 | authorship_verification | true / false | Models at avg human | 50 | 50 | 90 | 50 | 50 | TRUE | TRUE | TRUE | FALSE | FALSE | FALSE | 180% | 100% | 100% | 56% | 100% | 56% | 100% | 100% | 100% | 180% | Seems like a very hard task for both humans and machines | |
8 | auto_categorization | bleu | Models superhuman | 0 | 1.5 | 8 | 10 | 9 | TRUE | TRUE | TRUE | TRUE | TRUE | FALSE | 533% | 90% | 600% | 113% | 667% | 125% | ||||||
9 | auto_debugging | exact str match | PaLM at avg human | 0 | 15 | 50 | 0 | 45 | FALSE | TRUE | TRUE | FALSE | FALSE | TRUE | 333% | | 300% | 90% | 0% | 0% | ||||||
10 | bridging_anaphora_resolution_barqa | exact str match | PaLM superhuman | 0 | 15 | 30 | 0 | 33 | FALSE | TRUE | TRUE | FALSE | TRUE | TRUE | 200% | | 220% | 110% | 0% | 0% | ||||||
11 | causal_judgment | true / false | PaLM almost at avg human | 50 | 70 | 100 | 55 | 65 | FALSE | FALSE | TRUE | FALSE | FALSE | TRUE | 143% | 118% | 93% | 65% | 79% | 55% | 110% | 130% | 140% | 200% | ||
12 | cause_and_effect | true / false | PaLM almost at avg human | 50 | 88 | 100 | 70 | 85 | FALSE | FALSE | TRUE | FALSE | FALSE | TRUE | 114% | 121% | 97% | 85% | 80% | 70% | 140% | 170% | 176% | 200% | ||
13 | checkmate_in_one | exact str match | Models subhuman | 0 | 8 | 70 | 0 | 2 | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE | 875% | | 25% | 3% | 0% | 0% | ||||||
14 | chess_state_tracking | exact str match | Models at avg human | 0 | 20 | 88 | 43 | 45 | TRUE | TRUE | TRUE | FALSE | FALSE | TRUE | 440% | 105% | 225% | 51% | 215% | 49% | ||||||
15 | chinese_remainder_theorem | exact str match | Models cannot do task | 0 | 25 | 100 | 0 | 0 | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | 400% | | 0% | 0% | 0% | 0% | ||||||
16 | cifar10_classification | multiple choice | Models cannot do task | 10 | 25 | 85 | 2 | 0 | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | 340% | 0% | 0% | 0% | 8% | 2% | 20% | 0% | 250% | 850% | Seems like a huge stretch to expect language models to parse base64 images | |
17 | code_line_description | multiple choice | PaLM at avg human | 25 | 60 | 100 | 30 | 80 | FALSE | TRUE | TRUE | FALSE | FALSE | TRUE | 167% | 267% | 133% | 80% | 50% | 30% | 120% | 320% | 240% | 400% | ||
18 | codenames | bleu | PaLM at avg human | 0 | 18 | 85 | 15 | 30 | FALSE | TRUE | TRUE | FALSE | FALSE | TRUE | 472% | 200% | 167% | 35% | 83% | 18% | ||||||
19 | color | multiple choice | Models subhuman | 10 | 60 | 100 | 10 | 30 | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE | 167% | 300% | 50% | 30% | 17% | 10% | 100% | 300% | 600% | 1000% | Kinda surprised models are this bad at this task | |
20 | common_morpheme | multiple choice | PaLM superhuman | 25 | 35 | 70 | 55 | 80 | TRUE | TRUE | TRUE | FALSE | TRUE | TRUE | 200% | 145% | 229% | 114% | 157% | 79% | 220% | 320% | 140% | 280% | ||
21 | conceptual_combinations | multiple choice | PaLM almost at avg human | 25 | 85 | 100 | 35 | 80 | FALSE | FALSE | TRUE | FALSE | FALSE | TRUE | 118% | 229% | 94% | 80% | 41% | 35% | 140% | 320% | 340% | 400% | ||
22 | conlang_translation | rougeLsum | PaLM superhuman | 0 | 22 | 55 | 52 | 65 | TRUE | TRUE | TRUE | FALSE | TRUE | TRUE | 250% | 125% | 295% | 118% | 236% | 95% | ||||||
23 | crash_blossom | multiple choice | Models at avg human | 25 | 42 | 68 | 45 | 55 | TRUE | TRUE | TRUE | FALSE | FALSE | TRUE | 162% | 122% | 131% | 81% | 107% | 66% | 180% | 220% | 168% | 272% | ||
24 | crass_ai | multiple choice | PaLM at avg human | 25 | 85 | 100 | 35 | 90 | FALSE | TRUE | TRUE | FALSE | FALSE | TRUE | 118% | 257% | 106% | 90% | 41% | 35% | 140% | 360% | 340% | 400% | ||
25 | cryobiology_spanish | true / false | PaLM at avg human | 50 | 70 | 100 | 68 | 85 | FALSE | TRUE | TRUE | FALSE | FALSE | TRUE | 143% | 125% | 121% | 85% | 97% | 68% | 136% | 170% | 140% | 200% | ||
26 | cryptonite | multiple choice | Models subhuman | 0 | 25 | 85 | 0 | 10 | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE | 340% | | 40% | 12% | 0% | 0% | ||||||
27 | cs_algorithms | true / false | PaLM almost at avg human | 10 | 48 | 90 | 35 | 45 | FALSE | FALSE | TRUE | FALSE | FALSE | TRUE | 188% | 129% | 94% | 50% | 73% | 39% | 350% | 450% | 480% | 900% | ||
28 | cycled_letters | exact str match | Models cannot do task | 0 | 25 | 100 | 0 | 0 | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | 400% | | 0% | 0% | 0% | 0% | | | | | Tokenization may be an issue here? Surprised average human is so low | |
29 | dark_humor_detection | true / false | Models subhuman | 50 | 82 | 100 | 55 | 65 | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE | 122% | 118% | 79% | 65% | 67% | 55% | 110% | 130% | 164% | 200% | ||
30 | date_understanding | multiple choice | PaLM almost at avg human | 15 | 75 | 100 | 55 | 72 | FALSE | FALSE | TRUE | FALSE | FALSE | TRUE | 133% | 131% | 96% | 72% | 73% | 55% | 367% | 480% | 500% | 667% | ||
31 | disambiguation_qa | multiple choice | Models subhuman | 33 | 65 | 95 | 43 | 52 | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE | 146% | 121% | 80% | 55% | 66% | 45% | 130% | 158% | 197% | 288% | ||
32 | discourse_marker_prediction | multiple choice | Models subhuman | 10 | 33 | 80 | 13 | 15 | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE | 242% | 115% | 45% | 19% | 39% | 16% | 130% | 150% | 330% | 800% | ||
33 | disfl_qa | exact str match | PaLM at avg human | 0 | 22 | 50 | 5 | 25 | FALSE | TRUE | TRUE | FALSE | FALSE | TRUE | 227% | 500% | 114% | 50% | 23% | 10% | ||||||
34 | dyck_languages | multiple choice | Models subhuman | 1 | 45 | 100 | 15 | 30 | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE | 222% | 200% | 67% | 30% | 33% | 15% | 1500% | 3000% | 4500% | 10000% | ||
35 | elementary_math_qa | multiple choice | Models subhuman | 20 | 60 | 100 | 25 | 38 | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE | 167% | 152% | 63% | 38% | 42% | 25% | 125% | 190% | 300% | 500% | ||
36 | emoji_movie | multiple choice | PaLM at avg human | 20 | 95 | 100 | 20 | 95 | FALSE | TRUE | TRUE | FALSE | FALSE | TRUE | 105% | 475% | 100% | 95% | 21% | 20% | 100% | 475% | 475% | 500% | ||
37 | emojis_emotion_prediction | multiple choice | Models at avg human | 20 | 48 | 65 | 48 | 55 | TRUE | TRUE | TRUE | FALSE | FALSE | TRUE | 135% | 115% | 115% | 85% | 100% | 74% | 240% | 275% | 240% | 325% | ||
38 | empirical_judgments | multiple choice | Models at avg human | 33 | 48 | 80 | 48 | 55 | TRUE | TRUE | TRUE | FALSE | FALSE | TRUE | 167% | 115% | 115% | 69% | 100% | 60% | 145% | 167% | 145% | 242% | ||
39 | english_proverbs | multiple choice | PaLM at avg human | 25 | 65 | 100 | 45 | 90 | FALSE | TRUE | TRUE | FALSE | FALSE | TRUE | 154% | 200% | 138% | 90% | 69% | 45% | 180% | 360% | 260% | 400% | ||
40 | english_russian_proverbs | multiple choice | PaLM at avg human | 25 | 42 | 90 | 35 | 70 | FALSE | TRUE | TRUE | FALSE | FALSE | TRUE | 214% | 200% | 167% | 78% | 83% | 39% | 140% | 280% | 168% | 360% | ||
41 | entailed_polarity | true / false | PaLM at avg human | 50 | 85 | 100 | 75 | 95 | FALSE | TRUE | TRUE | FALSE | FALSE | TRUE | 118% | 127% | 112% | 95% | 88% | 75% | 150% | 190% | 170% | 200% | ||
42 | entailed_polarity_hindi | true / false | PaLM at avg human | 50 | 71 | 100 | 60 | 75 | FALSE | TRUE | TRUE | FALSE | FALSE | TRUE | 141% | 125% | 106% | 75% | 85% | 60% | 120% | 150% | 142% | 200% | ||
43 | epistemic_reasoning | true / false | Models at avg human | 50 | 53 | 100 | 63 | 58 | TRUE | TRUE | TRUE | FALSE | FALSE | FALSE | 189% | 92% | 109% | 58% | 119% | 63% | 126% | 116% | 106% | 200% | The average human is surprisingly bad at this | |
44 | evaluating_information_essentiality | multiple choice | Models subhuman | 20 | 38 | 70 | 25 | 22 | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | 184% | 88% | 58% | 31% | 66% | 36% | 125% | 110% | 190% | 350% | ||
45 | fact_checker | true / false | PaLM at avg human | 50 | 73 | 88 | 65 | 83 | FALSE | TRUE | TRUE | FALSE | FALSE | TRUE | 121% | 128% | 114% | 94% | 89% | 74% | 130% | 166% | 146% | 176% | ||
46 | fantasy_reasoning | true / false | PaLM at avg human | 50 | 68 | 100 | 65 | 72 | FALSE | TRUE | TRUE | FALSE | FALSE | TRUE | 147% | 111% | 106% | 72% | 96% | 65% | 130% | 144% | 136% | 200% | The average human is surprisingly bad at this | |
47 | figure_of_speech_detection | multiple choice | PaLM at avg human | 10 | 40 | 85 | 25 | 65 | FALSE | TRUE | TRUE | FALSE | FALSE | TRUE | 213% | 260% | 163% | 76% | 63% | 29% | 250% | 650% | 400% | 850% | ||
48 | formal_fallacies_syllogisms_negation | true / false | PaLM almost at avg human | 50 | 54 | 80 | 52 | 52 | FALSE | FALSE | TRUE | FALSE | FALSE | FALSE | 148% | 100% | 96% | 65% | 96% | 65% | 104% | 104% | 108% | 160% | ||
49 | gem | rougeLsum | Models superhuman | 0 | 22 | 30 | 35 | 38 | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | 136% | 109% | 173% | 127% | 159% | 117% | ||||||
50 | general_knowledge | multiple choice | PaLM at avg human | 15 | 85 | 100 | 70 | 90 | FALSE | TRUE | TRUE | FALSE | FALSE | TRUE | 118% | 129% | 106% | 90% | 82% | 70% | 467% | 600% | 567% | 667% | ||
51 | geometric_shapes | multiple choice | Models subhuman | 10 | 55 | 100 | 10 | 35 | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE | 182% | 350% | 64% | 35% | 18% | 10% | 100% | 350% | 550% | 1000% | Kinda surprised average humans are so bad at this (though some parts do seem hard) and surprised PaLM can do this at all! | |
52 | goal_step_wikihow | multiple choice | PaLM at avg human | 20 | 60 | 95 | 42 | 80 | FALSE | TRUE | TRUE | FALSE | FALSE | TRUE | 158% | 190% | 133% | 84% | 70% | 44% | 210% | 400% | 300% | 475% | ||
53 | gre_reading_comprehension | multiple choice | Models at avg human | 27 | 38 | 80 | 42 | 65 | TRUE | TRUE | TRUE | FALSE | FALSE | TRUE | 211% | 155% | 171% | 81% | 111% | 53% | 156% | 241% | 141% | 296% | ||
54 | hhh_alignment | true / false | Models subhuman | 50 | 75 | ? | 45 | 50 | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE | | 111% | 67% | | 60% | | | | | | Seems kinda alarming models aren't better at this? | |
55 | hindu_knowledge | multiple choice | PaLM at avg human | 25 | 63 | 100 | 55 | 95 | FALSE | TRUE | TRUE | FALSE | FALSE | TRUE | 159% | 173% | 151% | 95% | 87% | 55% | 220% | 380% | 252% | 400% | ||
56 | human_organs_senses | multiple choice | PaLM almost at avg human | 25 | 90 | 100 | 62 | 82 | FALSE | FALSE | TRUE | FALSE | FALSE | TRUE | 111% | 132% | 91% | 82% | 69% | 62% | 248% | 328% | 360% | 400% | ||
57 | hyperbaton | true / false | Models subhuman | 60 | 75 | 100 | 52 | 65 | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE | 133% | 125% | 87% | 65% | 69% | 52% | 87% | 108% | 125% | 167% | ||
58 | identify_math_theorems | multiple choice | Models at avg human | 25 | 35 | 60 | 35 | 55 | TRUE | TRUE | TRUE | FALSE | FALSE | TRUE | 171% | 157% | 157% | 92% | 100% | 58% | 140% | 220% | 140% | 240% | ||
59 | identify_odd_metaphor | multiple choice | PaLM at avg human | 25 | 70 | 100 | 25 | 80 | FALSE | TRUE | TRUE | FALSE | FALSE | TRUE | 143% | 320% | 114% | 80% | 36% | 25% | 100% | 320% | 280% | 400% | ||
60 | implicatures | true / false | PaLM at avg human | 50 | 82 | 100 | 60 | 92 | FALSE | TRUE | TRUE | FALSE | FALSE | TRUE | 122% | 153% | 112% | 92% | 73% | 60% | 120% | 184% | 164% | 200% | ||
61 | implicit_relations | multiple choice | PaLM at avg human | 5 | 35 | 85 | 25 | 45 | FALSE | TRUE | TRUE | FALSE | FALSE | TRUE | 243% | 180% | 129% | 53% | 71% | 29% | 500% | 900% | 700% | 1700% | ||
62 | intent_recognition | multiple choice | PaLM at avg human | 10 | 85 | 100 | 83 | 90 | FALSE | TRUE | TRUE | FALSE | FALSE | TRUE | 118% | 108% | 106% | 90% | 98% | 83% | 830% | 900% | 850% | 1000% | ||
63 | international_phonetic_alphabet_nli | multiple choice | PaLM at avg human | 33 | 42 | 100 | 40 | 60 | FALSE | TRUE | TRUE | FALSE | FALSE | TRUE | 238% | 150% | 143% | 60% | 95% | 40% | 121% | 182% | 127% | 303% | ||
64 | international_phonetic_alphabet_transliterate | bleu | PaLM at avg human | 0 | 30 | 65 | 25 | 55 | FALSE | TRUE | TRUE | FALSE | FALSE | TRUE | 217% | 220% | 183% | 85% | 83% | 38% | ||||||
65 | intersect_geometry | multiple choice | Models at avg human | 2.5 | 6.5 | 20 | 13.5 | 15 | TRUE | TRUE | TRUE | FALSE | FALSE | TRUE | 308% | 111% | 231% | 75% | 208% | 68% | 540% | 600% | 260% | 800% | I'm surprised humans are so bad and that models can do this at all | |
66 | irony_identification | true / false | PaLM almost at avg human | 50 | 72 | 88 | 55 | 65 | FALSE | FALSE | TRUE | FALSE | FALSE | TRUE | 122% | 118% | 90% | 74% | 76% | 63% | 110% | 130% | 144% | 176% | ||
67 | kanji_ascii | exact str match | Models cannot do task | 0 | 1 | 20 | 0 | 0 | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | 2000% | | 0% | 0% | 0% | 0% | ||||||
68 | kannada | multiple choice | PaLM at avg human | 25 | 42 | 80 | 25 | 50 | FALSE | TRUE | TRUE | FALSE | FALSE | TRUE | 190% | 200% | 119% | 63% | 60% | 31% | 100% | 200% | 168% | 320% | ||
69 | key_value_maps | multiple choice | PaLM at avg human | 50 | 55 | 88 | 50 | 65 | FALSE | TRUE | TRUE | FALSE | FALSE | TRUE | 160% | 130% | 118% | 74% | 91% | 57% | 100% | 130% | 110% | 176% | ||
70 | language_games | bleu | PaLM at avg human | 0 | 15 | 30 | 4 | 25 | FALSE | TRUE | TRUE | FALSE | FALSE | TRUE | 200% | 625% | 167% | 83% | 27% | 13% | ||||||
71 | language_identification | multiple choice | PaLM at avg human | 8 | 18 | 55 | 12 | 38 | FALSE | TRUE | TRUE | FALSE | FALSE | TRUE | 306% | 317% | 211% | 69% | 67% | 22% | 150% | 475% | 225% | 688% | ||
72 | linguistic_mappings | exact str match | PaLM superhuman | 0 | 42 | 68 | 0 | 70 | FALSE | TRUE | TRUE | FALSE | TRUE | TRUE | 162% | | 167% | 103% | 0% | 0% | ||||||
73 | logic_grid_puzzle | multiple choice | PaLM at avg human | 33 | 40 | 100 | 35 | 45 | FALSE | TRUE | TRUE | FALSE | FALSE | TRUE | 250% | 129% | 113% | 45% | 88% | 35% | 106% | 136% | 121% | 303% | ||
74 | logical_args | multiple choice | PaLM at avg human | 20 | 52 | 88 | 32 | 85 | FALSE | TRUE | TRUE | FALSE | FALSE | TRUE | 169% | 266% | 163% | 97% | 62% | 36% | 160% | 425% | 260% | 440% | ||
75 | logical_deduction | multiple choice | Models subhuman | 33 | 40 | 88 | 33 | 35 | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE | 220% | 106% | 88% | 40% | 83% | 38% | 100% | 106% | 121% | 267% | ||
76 | logical_fallacy_detection | true / false | PaLM at avg human | 50 | 63 | 90 | 58 | 75 | FALSE | TRUE | TRUE | FALSE | FALSE | TRUE | 143% | 129% | 119% | 83% | 92% | 64% | 116% | 150% | 126% | 180% | ||
77 | logical_sequence | multiple choice | PaLM at avg human | 25 | 85 | 100 | 40 | 90 | FALSE | TRUE | TRUE | FALSE | FALSE | TRUE | 118% | 225% | 106% | 90% | 47% | 40% | 160% | 360% | 340% | 400% | ||
78 | mathematical_induction | true / false | PaLM almost at avg human | 50 | 61 | 90 | 55 | 58 | FALSE | FALSE | TRUE | FALSE | FALSE | TRUE | 148% | 105% | 95% | 64% | 90% | 61% | 110% | 116% | 122% | 180% | ||
79 | matrixshapes | exact str match | PaLM at avg human | 0 | 3 | 60 | 0 | 35 | FALSE | TRUE | TRUE | FALSE | FALSE | TRUE | 2000% | | 1167% | 58% | 0% | 0% | | | | | PaLM does surprisingly well at this | |
80 | metaphor_boolean | true / false | PaLM at avg human | 50 | 88 | 100 | 62 | 92 | FALSE | TRUE | TRUE | FALSE | FALSE | TRUE | 114% | 148% | 105% | 92% | 70% | 62% | 124% | 184% | 176% | 200% | ||
81 | metaphor_understanding | multiple choice | PaLM at avg human | 25 | 63 | 100 | 50 | 80 | FALSE | TRUE | TRUE | FALSE | FALSE | TRUE | 159% | 160% | 127% | 80% | 79% | 50% | 200% | 320% | 252% | 400% | ||
82 | minute_mysteries_qa | rougeLsum | PaLM at avg human | 0 | 2 | 15.5 | 0 | 8.1 | FALSE | TRUE | TRUE | FALSE | FALSE | TRUE | 775% | | 405% | 52% | 0% | 0% | ||||||
83 | misconceptions | true / false | PaLM at avg human | 50 | 63 | 90 | 60 | 81 | FALSE | TRUE | TRUE | FALSE | FALSE | TRUE | 143% | 135% | 129% | 90% | 95% | 67% | 120% | 162% | 126% | 180% | ||
84 | misconceptions_russian | true / false | Models subhuman | 50 | 65 | 100 | 40 | 52 | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE | 154% | 130% | 80% | 52% | 62% | 40% | 80% | 104% | 130% | 200% | ||
85 | mnist_ascii | multiple choice | Models cannot do task | 10 | 85 | 100 | 5 | 7 | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE | 118% | 140% | 8% | 7% | 6% | 5% | 50% | 70% | 850% | 1000% | LLMs still can't see images | |
86 | modified_arithmetic | exact str match | PaLM at avg human | 0 | 58 | 100 | 30 | 70 | FALSE | TRUE | TRUE | FALSE | FALSE | TRUE | 172% | 233% | 121% | 70% | 52% | 30% | | | | | Humans are surprisingly bad at this | |
87 | moral_permissibility | true / false | PaLM almost at avg human | 50 | 65 | 90 | 50 | 62 | FALSE | FALSE | TRUE | FALSE | FALSE | TRUE | 138% | 124% | 95% | 69% | 77% | 56% | 100% | 124% | 130% | 180% | Should I be worried humans are so bad at this? | |
88 | movie_dialog_same_or_different | true / false | PaLM almost at avg human | 50 | 68 | 100 | 52 | 62 | FALSE | FALSE | TRUE | FALSE | FALSE | TRUE | 147% | 119% | 91% | 62% | 76% | 52% | 104% | 124% | 136% | 200% | ||
89 | movie_recommendation | multiple choice | Models subhuman | 25 | 61 | 90 | 35 | 38 | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE | 148% | 109% | 62% | 42% | 57% | 39% | 140% | 152% | 244% | 360% | ||
90 | natural_instructions | rougeLsum | Models superhuman | 0 | 20 | 32 | 45 | 55 | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | 160% | 122% | 275% | 172% | 225% | 141% | ||||||
91 | navigate | true / false | Models subhuman | 50 | 85 | 100 | 50 | 55 | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE | 118% | 110% | 65% | 55% | 59% | 50% | 100% | 110% | 170% | 200% | Surprised models are so bad at this. Also surprised some humans are not good at this. | |
92 | nonsense_words_grammar | multiple choice | PaLM almost at avg human | 20 | 70 | 100 | 65 | 68 | FALSE | FALSE | TRUE | FALSE | FALSE | TRUE | 143% | 105% | 97% | 68% | 93% | 65% | 325% | 340% | 350% | 500% | ||
93 | novel_concepts | multiple choice | PaLM almost at avg human | 20 | 65 | 100 | 55 | 63 | FALSE | FALSE | TRUE | FALSE | FALSE | TRUE | 154% | 115% | 97% | 63% | 85% | 55% | 275% | 315% | 325% | 500% | ||
94 | object_counting | exact str match | Models subhuman | 0 | 85 | 100 | 0 | 41 | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE | 118% | | 48% | 41% | 0% | 0% | ||||||
95 | odd_one_out | multiple choice | PaLM almost at avg human | 20 | 80 | 100 | 30 | 75 | FALSE | FALSE | TRUE | FALSE | FALSE | TRUE | 125% | 250% | 94% | 75% | 38% | 30% | 150% | 375% | 400% | 500% | ||
96 | operators | exact str match | PaLM at avg human | 0 | 45 | 85 | 30 | 60 | FALSE | TRUE | TRUE | FALSE | FALSE | TRUE | 189% | 200% | 133% | 71% | 67% | 35% | ||||||
97 | parsinlu_reading_comprehension | exact str match | PaLM superhuman | 0 | 3 | 30 | 0 | 45 | FALSE | TRUE | TRUE | FALSE | TRUE | TRUE | 1000% | | 1500% | 150% | 0% | 0% | ||||||
98 | penguins_in_a_table | exact str match | Models subhuman | 0 | 70 | 85 | 30 | 50 | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE | 121% | 167% | 71% | 59% | 43% | 35% | ||||||
99 | periodic_elements | exact str match | Models at avg human | 0 | 8 | 75 | 18 | 63 | TRUE | TRUE | TRUE | FALSE | FALSE | TRUE | 938% | 350% | 788% | 84% | 225% | 24% | ||||||
100 | phrase_relatedness | multiple choice | PaLM at avg human | 25 | 75 | 100 | 60 | 90 | FALSE | TRUE | TRUE | FALSE | FALSE | TRUE | 133% | 150% | 120% | 90% | 80% | 60% | 240% | 360% | 300% | 400% |