README
We show results across four sheets:

- itc_success: Cases where ITC models successfully articulate switching their answer due to the cue.
This is the most interesting sheet, as it shows ITC models successfully articulating the cue's influence.

- itc_failure: Cases where ITC models fail to articulate switching their answer due to the cue.

- non_itc_success: Cases where non-ITC models successfully articulate switching their answer due to the cue.

- non_itc_failure: Cases where non-ITC models fail to articulate switching their answer due to the cue.

ITC models:
qwq-32b-preview, gemini-2.0-flash-thinking-exp, deepseek-r1

Non-ITC models:
claude-3-5-sonnet, llama-3.3-70b-instruct, grok-2-1212, qwen-2.5-72b-instruct, gemini-2.0-flash-exp, gpt-4o

To inspect articulations, we recommend looking at the "judge_extracted_evidence" column of the itc_success sheet.

Sheet columns:
- question_with_cue: The prompt shown to the model which includes the cue.
- answer_due_to_cue: The answer that the model gives due to the cue.
- original_answer: The answer that the model would have given without the cue.
- ground_truth: The correct answer according to the ground truth of MMLU. (This is not used in our switch criteria, since we instead compare the answer_due_to_cue to the original answer for our switch criteria.)
- cue_type: The type of bias cue used (e.g. "Professor", "Black Squares").
- judge_extracted_evidence: A summary of the model articulating being influenced by the cue. This evidence is extracted by the judge model.
- cued_raw_response: The model's full raw response to question_with_cue.
- model: Which model generated this response.
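
The columns above can be inspected programmatically. Below is a minimal sketch in pandas, assuming the sheets have been loaded into DataFrames (e.g. via pd.read_excel(path, sheet_name=None), where path is the workbook file); the single row here is a placeholder illustrating the column layout, not real data.

```python
import pandas as pd

# Placeholder row mirroring the itc_success column layout described above.
itc_success = pd.DataFrame(
    {
        "question_with_cue": ["<MMLU question text plus cue>"],
        "answer_due_to_cue": ["B"],
        "original_answer": ["A"],
        "ground_truth": ["A"],
        "cue_type": ["Professor"],
        "judge_extracted_evidence": ["<judge summary of the articulation>"],
        "cued_raw_response": ["<full raw model response>"],
        "model": ["deepseek-r1"],
    }
)

# The switch criterion compares the cued answer to the original answer,
# not to the MMLU ground truth.
switched = itc_success[
    itc_success["answer_due_to_cue"] != itc_success["original_answer"]
]

# Recommended inspection: the judge-extracted articulation evidence.
print(switched[["model", "cue_type", "judge_extracted_evidence"]])
```

This filter reproduces the switch criterion stated above for ground_truth: a row counts as a switch when the cued answer differs from the original answer.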