1
Exploring the Curious Case of Code Prompts
Li Zhang*, Liam Dugan*, Hainiu Xu*, Chris Callison-Burch
*equal contribution
What is Prompting?
2
Write me an essay about the Great Gatsby.
Title: The Great Gatsby: A Tale of Illusion, Love, and American Dream
Introduction:
"The Great Gatsby," written by F. Scott Fitzgerald, is a timeless novel that delves into the intriguing world of the Roaring Twenties. Set in the lavish and fast-paced society of Long Island, New York, Fitzgerald masterfully paints a picture of glamour, decadence, and the pursuit of the American Dream. Through the eyes of the enigmatic Jay Gatsby, the novel
<...>
Large Language Model
Your prompt depends on your task
3
Few-shot [1] (+ obscure tasks):
Does sentence A entail sentence B?
A: Jill is at ACL 2023
B: Jill is in Toronto
Label: Entailment
<…>
A: Jack wrote a paper
B: Jack is at ACL
Label:

Chain of Thought [2] (+ reasoning tasks):
A: Jill is at ACL 2023
B: Jill is in Toronto
Does sentence A entail sentence B? Let's think step by step

Self-Refining [3] (+ generation tasks):
A: Jill is at ACL 2023
B: Jill is in Toronto
Does sentence A entail sentence B? Yes
Is the answer correct? If not, how can it be improved?
[1] Language Models are Few-Shot Learners (Brown et al. 2020)
[2] Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (Wei et al. 2022)
[3] Self-Refine: Iterative Refinement with Self-Feedback (Madaan et al. 2023)
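A minimal sketch (our own illustration, not code from the cited papers) of how each of these prompt styles could be assembled as a plain string for the entailment example above:

# Illustrative prompt construction; the wording is taken from the example above.
few_shot_prompt = (
    "Does sentence A entail sentence B?\n"
    "A: Jill is at ACL 2023\n"
    "B: Jill is in Toronto\n"
    "Label: Entailment\n"
    "<...more labeled examples...>\n"
    "A: Jack wrote a paper\n"
    "B: Jack is at ACL\n"
    "Label:"
)

chain_of_thought_prompt = (
    "A: Jill is at ACL 2023\n"
    "B: Jill is in Toronto\n"
    "Does sentence A entail sentence B? Let's think step by step"
)

def self_refine_prompt(question: str, first_answer: str) -> str:
    # Feed the model's first answer back and ask it to critique itself.
    return (
        f"{question} {first_answer}\n"
        "Is the answer correct? If not, how can it be improved?"
    )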
A type of task that is hard for most prompts…
4
Text → Structured Representation (graph, triples, etc.)
Structured Prediction
5
digraph G {
    begin -> find_sheet_music;
    find_sheet_music -> sit_at_bench;
    find_sheet_music -> set_up_sheet;
    sit_at_bench -> warm_up_on_organ;
    set_up_sheet -> warm_up_on_organ;
    warm_up_on_organ -> play_organ;
}
Doesn’t look like Natural Language!
Idea: Convert to code and prompt a Code LM
6
class Node:
    def __init__(self):
        self.children = []

class Tree:
    goal = "play organ"
    def __init__(self):
        begin = Node()
        find_sheet_music = Node()
        sit_at_bench = Node()
        set_up_sheet = Node()
        warm_up_on_organ = Node()
        play_organ = Node()
        begin.children = [find_sheet_music]
        find_sheet_music.children = [sit_at_bench, set_up_sheet]
        sit_at_bench.children = [warm_up_on_organ]
        set_up_sheet.children = [warm_up_on_organ]
        warm_up_on_organ.children = [play_organ]
digraph G {
    begin -> find_sheet_music;
    find_sheet_music -> sit_at_bench;
    find_sheet_music -> set_up_sheet;
    sit_at_bench -> warm_up_on_organ;
    set_up_sheet -> warm_up_on_organ;
    warm_up_on_organ -> play_organ;
}
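A minimal sketch of this conversion, assuming the graph is given as an edge list (our own illustration; the function name and structure are hypothetical, not the authors' pipeline):

def edges_to_code_prompt(goal, edges):
    # Render a (goal, edge list) script graph in the Python-class style above.
    nodes = []
    for src, dst in edges:
        for n in (src, dst):
            if n not in nodes:
                nodes.append(n)
    children = {n: [] for n in nodes}
    for src, dst in edges:
        children[src].append(dst)
    lines = [
        "class Node:",
        "    def __init__(self):",
        "        self.children = []",
        "",
        "class Tree:",
        f'    goal = "{goal}"',
        "    def __init__(self):",
    ]
    lines += [f"        {n} = Node()" for n in nodes]
    lines += [
        f"        {n}.children = [{', '.join(children[n])}]"
        for n in nodes if children[n]
    ]
    return "\n".join(lines)

# The "play organ" graph from the previous slide:
edges = [
    ("begin", "find_sheet_music"),
    ("find_sheet_music", "sit_at_bench"),
    ("find_sheet_music", "set_up_sheet"),
    ("sit_at_bench", "warm_up_on_organ"),
    ("set_up_sheet", "warm_up_on_organ"),
    ("warm_up_on_organ", "play_organ"),
]
print(edges_to_code_prompt("play organ", edges))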
Our code prompt now looks something like this
7
Large Code-Trained Language Model
<...few-shot examples...>
"""
The following is a script
for the goal: play an organ
"""
class Node:
    def __init__(self):
        self.children = []

class Tree:
    goal = "play organ"
    def __init__(self):
        begin = Node()
        find_sheet_music = Node()
        sit_at_bench = Node()
        set_up_sheet = Node()
        warm_up_on_organ = Node()
        play_organ = Node()
        begin.children = [find_sheet_music]
        find_sheet_music.children = [sit_at_bench, set_up_sheet]
        sit_at_bench.children = [warm_up_on_organ]
        set_up_sheet.children = [warm_up_on_organ]
        warm_up_on_organ.children = [play_organ]
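To score a structured prediction like this, the generated __init__ body has to be mapped back onto graph edges. A hypothetical post-processing helper (our own sketch, not the authors' evaluation code):

import re

def completion_to_edges(completion: str):
    # Recover (parent, child) edges from lines like
    # "find_sheet_music.children = [sit_at_bench, set_up_sheet]".
    edges = []
    for line in completion.splitlines():
        match = re.match(r"\s*(\w+)\.children\s*=\s*\[(.*)\]", line)
        if match:
            parent, body = match.groups()
            for child in (c.strip() for c in body.split(",")):
                if child:
                    edges.append((parent, child))
    return edges

generated = """
begin.children = [find_sheet_music]
find_sheet_music.children = [sit_at_bench, set_up_sheet]
"""
print(completion_to_edges(generated))
# [('begin', 'find_sheet_music'), ('find_sheet_music', 'sit_at_bench'),
#  ('find_sheet_music', 'set_up_sheet')]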
Code prompting works extremely well
8
[1] Language Models of Code are Few-Shot Commonsense Learners (Madaan et al. 2022)
Follow-up work did this for other structured tasks
9
Event Argument Extraction
(Wang et al. 2022)
Knowledge Graph Construction
(Bi et al. 2023)
Short Story Understanding (Dong et al. 2023)
Structured Causal Inference
(Zhang et al. 2023)
10
Are code prompts better for structured reasoning tasks only?
… or are they better in general?
Let's investigate!
11
How should we compare code prompts to text?
12
1. Select a diverse set of tasks (Sentiment Analysis, Question Answering, Common-Sense Reasoning, …)
2. Select a diverse set of models for comparison
3. Select a code prompt and text prompt to compare for each task
Step 1: Select a diverse set of tasks
13
Summarization: XSUM, CNN / Daily Mail
Question Answering: HotPotQA, SQuADv2
Sentiment Analysis: IMDb, Yelp
Common-Sense Reasoning: wikiHow Goal-Step, HellaSwag, wikiHow Temporal, WinoGrande, OpenPI 2.0
Natural Language Inference: ANLI
Step 2: Select a diverse set of models
14
Model               Code Pre-training?   Instruction Fine-Tuning?
davinci             ❌                    ❌
code-davinci-002    ✅                    ❌
text-davinci-002    ✅                    ✅
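A minimal sketch of how these engines were queried at the time, via the legacy OpenAI completions endpoint (openai-python < 1.0); the exact inference code is our assumption, not shown in the slides, and these model names have since been deprecated:

import openai  # legacy client, openai-python < 1.0

def complete(model: str, prompt: str, max_tokens: int = 64) -> str:
    # Greedy decoding so text vs. code prompts are compared deterministically.
    response = openai.Completion.create(
        model=model,  # "davinci", "code-davinci-002", or "text-davinci-002"
        prompt=prompt,
        max_tokens=max_tokens,
        temperature=0,
    )
    return response["choices"][0]["text"]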
Step 3: Select our code and text prompts per task
15
[1] PromptSource: An Integrated Development Environment and Repository for Natural Language Prompts (Bach et al. 2022)
1. Take an existing text prompt for the task from PromptSource [1]
2. Rewrite the text prompt to be a code prompt (illustrated below)
3. Add in-context examples to both prompts for inference
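As an illustration of steps 1 and 2, a hypothetical text/code prompt pair for a sentiment task (our own example; the prompts actually used in the study may differ):

review = "I loved every minute of this movie."

# Text prompt (PromptSource-style)
text_prompt = (
    f"Review: {review}\n"
    "Is this review positive or negative?\n"
    "Answer:"
)

# Code prompt: the same task phrased as a small Python program to complete
code_prompt = (
    f'review = "{review}"\n'
    "# Is this review positive or negative?\n"
    "sentiment ="
)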
What is the best way to write a code prompt?
16
We wrote and tested four code prompts per task
17
Are code prompts better?
… sort of? … it’s complicated
18
19
𝚫 text → code (change in score when switching from the text prompt to the code prompt)

Task               Metric     davinci   code-002   text-002
HellaSwag          Accuracy   -0.014    -0.046     +0.046
wikiHow Goal-Step  Accuracy   -0.045    -0.026     -0.004
wikiHow Temporal   Accuracy   +0.037    +0.105     +0.073
Yelp               Pearson ⍴  -0.017    -0.017     -0.015
IMDb               Accuracy   +0.063    +0.006     +0.012
WinoGrande         Accuracy   -0.013    +0.109     +0.098
ANLI               Accuracy   +0.027    -0.011     +0.053
HotPotQA           Macro-F1   -         -0.021     -0.140
SQuAD              Macro-F1   -0.016    -0.025     -0.014
OpenPI             ROUGE-F1   -         -0.970     -4.300
CNN / Daily Mail   ROUGE-2    -0.150    -0.070     -0.080
XSUM               ROUGE-2    -2.550    -3.580     -1.220
QA and Summarization are worse with code prompts
20
𝚫 text → code

Task               Metric     davinci   code-002   text-002
HellaSwag          Accuracy   -0.014    -0.046     +0.046
wikiHow Goal-Step  Accuracy   -0.045    -0.026     -0.004
wikiHow Temporal   Accuracy   +0.037    +0.105     +0.073
Yelp               Pearson ⍴  -0.017    -0.017     -0.015
IMDb               Accuracy   +0.063    +0.006     +0.012
WinoGrande         Accuracy   -0.013    +0.109     +0.098
ANLI               Accuracy   +0.027    -0.011     +0.053
21
(classification results table repeated from the previous slide)
Observation 1: Code prompts are better more often than they are worse on classification tasks
22
(classification results table repeated from the previous slide)
Observation 2: Tasks that are better with code tend to be better with code regardless of the model
23
(classification results table repeated from the previous slide)
Observation 3: Instruction fine-tuning allows models to better utilize code prompts – even if it’s done on text instructions
24
(classification results table repeated from the previous slide)
Observation 4: Given the right task and model, code prompts can be *way* better than text prompts
Takeaways
25
What did we learn?
26
1. Code prompts are better more often than they are worse on classification tasks
2. Code prompts seem to benefit from instruction fine-tuning
3. Improvements seem to have more to do with the task type than with the model
4. Given the right task, we see substantial improvements with code prompting (10%+)
There are tons of things we don’t fully understand
27
How do you even write one of these things?
Us during this project
What is the right mix of code and text in pre-training and instruction-tuning data? How should we choose tasks, formulate prompts, etc.?
Insights on code prompts → insights on reasoning
28
Why do code models reason so effectively? What latent structures do LMs use? How do we define structure?
29
Thank you!
Hainiu Xu
{zharry, seacow, ccb}@seas.upenn.edu
My awesome co-authors!
Paper Link
Enjoyed the talk? *Curious* about code prompts? We’d love to collaborate!
Email: ldugan@seas.upenn.edu
Website: http://liamdugan.com/
Twitter: @LiamDugan_