1 of 29


Exploring the Curious Case of Code Prompts

Li Zhang*, Liam Dugan*, Hainiu Xu*, Chris Callison-Burch

*equal contribution

2 of 29

What is Prompting?


Write me an essay about the Great Gatsby.

Title: The Great Gatsby: A Tale of Illusion, Love, and American Dream

Introduction:

"The Great Gatsby," written by F. Scott Fitzgerald, is a timeless novel that delves into the intriguing world of the Roaring Twenties. Set in the lavish and fast-paced society of Long Island, New York, Fitzgerald masterfully paints a picture of glamour, decadence, and the pursuit of the American Dream. Through the eyes of the enigmatic Jay Gatsby, the novel

<...>

Large Language Model

3 of 29

Your prompt depends on your task

Few-shot [1]

Does sentence A entail sentence B?

A: Jill is at ACL 2023
B: Jill is in Toronto
Label: Entailment

<…>

A: Jack wrote a paper
B: Jack is at ACL
Label:

Chain of Thought [2]

A: Jill is at ACL 2023
B: Jill is in Toronto
Does sentence A entail sentence B? Let's think step by step.

Self-Refining [3]

A: Jill is at ACL 2023
B: Jill is in Toronto
Does sentence A entail sentence B? Yes
Is the answer correct? If not, how can it be improved?

+ obscure tasks

+ reasoning tasks

+ generation tasks

[1] Language Models are Few-Shot Learners (Brown et al. 2020)

[2] Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (Wei et al. 2022)

[3] Self-Refine: Iterative Refinement with Self-Feedback (Madaan et al. 2023)
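To make the three styles concrete, here is a minimal sketch of how each prompt might be assembled as a string. The wording mirrors the entailment example on this slide, not the exact prompts from the cited papers.

# Illustrative templates for the three prompting styles above.
# Wording follows the slide's entailment example, not the cited papers' exact prompts.

def few_shot_prompt(examples, premise, hypothesis):
    """Labeled demonstrations followed by the unlabeled query pair."""
    shots = "\n\n".join(f"A: {a}\nB: {b}\nLabel: {lab}" for a, b, lab in examples)
    return (f"Does sentence A entail sentence B?\n\n{shots}\n\n"
            f"A: {premise}\nB: {hypothesis}\nLabel:")

def chain_of_thought_prompt(premise, hypothesis):
    """Ask the model to reason step by step before answering."""
    return (f"A: {premise}\nB: {hypothesis}\n"
            "Does sentence A entail sentence B? Let's think step by step.")

def self_refine_prompt(premise, hypothesis, previous_answer):
    """Show the model its own answer and ask it to critique and improve it."""
    return (f"A: {premise}\nB: {hypothesis}\n"
            f"Does sentence A entail sentence B? {previous_answer}\n"
            "Is the answer correct? If not, how can it be improved?")

print(few_shot_prompt([("Jill is at ACL 2023", "Jill is in Toronto", "Entailment")],
                      "Jack wrote a paper", "Jack is at ACL"))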

4 of 29

A type of task that is hard for most prompts…

Structured Prediction: Text → Structured Representation (graph, triples, etc.)

5 of 29

digraph G {
    begin -> find_sheet_music;
    find_sheet_music -> sit_at_bench;
    find_sheet_music -> set_up_sheet;
    sit_at_bench -> warm_up_on_organ;
    set_up_sheet -> warm_up_on_organ;
    warm_up_on_organ -> play_organ;
}

Doesn’t look like Natural Language!

6 of 29

Idea: Convert to code and prompt a Code LM

digraph G {
    begin -> find_sheet_music;
    find_sheet_music -> sit_at_bench;
    find_sheet_music -> set_up_sheet;
    sit_at_bench -> warm_up_on_organ;
    set_up_sheet -> warm_up_on_organ;
    warm_up_on_organ -> play_organ;
}

becomes:

class Node:
    def __init__(self):
        self.children = []

class Tree:
    goal = "play organ"

    def __init__(self):
        begin = Node()
        find_sheet_music = Node()
        sit_at_bench = Node()
        set_up_sheet = Node()
        warm_up_on_organ = Node()
        play_organ = Node()

        begin.children = [find_sheet_music]
        find_sheet_music.children = [sit_at_bench, set_up_sheet]
        sit_at_bench.children = [warm_up_on_organ]
        set_up_sheet.children = [warm_up_on_organ]
        warm_up_on_organ.children = [play_organ]
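A minimal sketch of how this conversion could be automated, assuming the graph is already available as a list of (parent, child) edges; the helper name edges_to_code is hypothetical, not from the paper.

# Hypothetical helper: render a list of DOT-style edges as the
# Node/Tree code representation shown above.
def edges_to_code(goal, edges):
    """edges: list of (parent, child) node-name pairs from the digraph."""
    nodes, children = [], {}
    for parent, child in edges:
        for name in (parent, child):
            if name not in nodes:
                nodes.append(name)
        children.setdefault(parent, []).append(child)

    lines = ["class Node:",
             "    def __init__(self):",
             "        self.children = []",
             "",
             "class Tree:",
             f'    goal = "{goal}"',
             "    def __init__(self):"]
    lines += [f"        {name} = Node()" for name in nodes]
    lines += [f"        {p}.children = [{', '.join(cs)}]" for p, cs in children.items()]
    return "\n".join(lines)

edges = [("begin", "find_sheet_music"),
         ("find_sheet_music", "sit_at_bench"),
         ("find_sheet_music", "set_up_sheet"),
         ("sit_at_bench", "warm_up_on_organ"),
         ("set_up_sheet", "warm_up_on_organ"),
         ("warm_up_on_organ", "play_organ")]
print(edges_to_code("play organ", edges))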

7 of 29

Our code prompt now looks something like this

Large Code-Trained Language Model

Prompt (few-shot examples + goal docstring + class skeleton):

<...few-shot examples...>

"""
The following is a script
for the goal: play an organ
"""
class Node:
    def __init__(self):
        self.children = []

class Tree:
    goal = "play organ"

    def __init__(self):

Model completion:

        begin = Node()
        find_sheet_music = Node()
        sit_at_bench = Node()
        set_up_sheet = Node()
        warm_up_on_organ = Node()
        play_organ = Node()

        begin.children = [find_sheet_music]
        find_sheet_music.children = [sit_at_bench, set_up_sheet]
        sit_at_bench.children = [warm_up_on_organ]
        set_up_sheet.children = [warm_up_on_organ]
        warm_up_on_organ.children = [play_organ]
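Putting it together, a hedged sketch of how such a prompt might be sent to a code LM through the legacy OpenAI Completions API. code-davinci-002 has since been retired, and the client details here are an assumption, not the paper's exact setup.

# Sketch: send the code prompt above to a code LM via the legacy
# OpenAI Completions API (openai-python < 1.0). Details are illustrative.
import openai

prompt = (
    "<...few-shot examples...>\n\n"
    '"""\n'
    "The following is a script\n"
    "for the goal: play an organ\n"
    '"""\n'
    "class Node:\n"
    "    def __init__(self):\n"
    "        self.children = []\n"
    "\n"
    "class Tree:\n"
    '    goal = "play organ"\n'
    "    def __init__(self):\n"
)

response = openai.Completion.create(
    model="code-davinci-002",
    prompt=prompt,
    max_tokens=256,
    temperature=0.0,   # greedy decoding for structured output
    stop=['"""'],      # stop before the model starts another example
)
print(response["choices"][0]["text"])  # expected: the Tree body (Node()s and .children lists)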

8 of 29

Code prompting works extremely well


[1] “Language Models of Code are Few-Shot Commonsense Learners” (Madaan et al. 2022)

9 of 29

Follow-up work did this for other structured tasks

Event Argument Extraction (Wang et al. 2022)

Knowledge Graph Construction (Bi et al. 2023)

Short Story Understanding (Dong et al. 2023)

Structured Causal Inference (Zhang et al. 2023)

10 of 29

Are code prompts better for structured reasoning tasks only…

… or are they better in general?

11 of 29

Let’s investigate!

12 of 29

How should we compare code prompts to text?

1. Select a diverse set of benchmark tasks (Sentiment Analysis, Question Answering, Common-Sense Reasoning, and more)

2. Select a diverse set of models for comparison

3. Select a code prompt and a text prompt to compare for each task

13 of 29

Step 1: Select a diverse set of tasks

Summarization: XSUM, CNN / Daily Mail

Question Answering: HotPotQA, SQuADv2

Sentiment Analysis: IMDb, Yelp

Common-Sense Reasoning: wikiHow Goal-Step, HellaSwag, wikiHow Temporal, WinoGrande, OpenPI 2.0

Natural Language Inference: ANLI

14 of 29

Step 2: Select a diverse set of models

Model            | Code Pre-training? | Instruction Fine-Tuning?
davinci          | no                 | no
code-davinci-002 | yes                | no
text-davinci-002 | yes                | yes

15 of 29

Step 3: Select our code and text prompts per task

1. Select one text prompt per task from PromptSource [1]

2. Rewrite the text prompt to be a code prompt (a sketch follows below)

3. Add in-context examples (Example 1, Example 2, …, Example n) to both prompts for inference

[1] PromptSource: An Integrated Development Environment and Repository for Natural Language Prompts (Bach et al. 2022)
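To illustrate step 2, here is roughly what a text prompt and its code rewrite could look like for a sentiment task. The actual prompts used in the paper differ; treat this purely as a sketch of the rewriting idea.

# Illustrative only: one text prompt and a code-style rewrite for
# IMDb-style sentiment classification. Not the paper's exact prompts.
review = "A wonderful film with a terrific cast."

text_prompt = (
    f"Review: {review}\n"
    "Is this review positive or negative?\n"
    "Answer:"
)

code_prompt = (
    "# Decide whether a movie review is positive or negative.\n"
    f'review = "{review}"\n'
    'sentiment_options = ["positive", "negative"]\n'
    "# the model completes the assignment below\n"
    "sentiment ="
)

print(text_prompt)
print(code_prompt)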

16 of 29

What is the best way to write a code prompt?


17 of 29

We wrote and tested four code prompts per task


18 of 29

Are code prompts better?

… sort of? … it’s complicated


19 of 29

𝚫 text → code (change in score when moving from the text prompt to the code prompt), per task and model:

Task              | Metric    | davinci | code-002 | text-002
HellaSwag         | Accuracy  | -0.014  | -0.046   | +0.046
wikiHow Goal-Step | Accuracy  | -0.045  | -0.026   | -0.004
wikiHow Temporal  | Accuracy  | +0.037  | +0.105   | +0.073
Yelp              | Pearson ⍴ | -0.017  | -0.017   | -0.015
IMDb              | Accuracy  | +0.063  | +0.006   | +0.012
WinoGrande        | Accuracy  | -0.013  | +0.109   | +0.098
ANLI              | Accuracy  | +0.027  | -0.011   | +0.053
HotPotQA          | Macro-F1  | -       | -0.021   | -0.140
SQuAD             | Macro-F1  | -0.016  | -0.025   | -0.014
OpenPI            | ROUGE-F1  | -       | -0.970   | -4.300
CNN / Daily Mail  | ROUGE-2   | -0.150  | -0.070   | -0.080
XSUM              | ROUGE-2   | -2.550  | -3.580   | -1.220

QA and Summarization are worse with code prompts
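Each cell above is simply the code-prompt score minus the text-prompt score on the same task, model, and metric. A tiny worked example, with made-up raw scores:

# 𝚫 text → code = score(code prompt) - score(text prompt),
# per task, model, and metric. Raw scores below are made up for illustration.
def delta(code_score: float, text_score: float) -> float:
    return round(code_score - text_score, 3)

print(delta(code_score=0.712, text_score=0.726))  # -0.014 -> code prompt slightly worse
print(delta(code_score=0.781, text_score=0.718))  # +0.063 -> code prompt better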

20 of 29

𝚫 text → code, restricted to the classification and NLI tasks:

Task              | Metric    | davinci | code-002 | text-002
HellaSwag         | Accuracy  | -0.014  | -0.046   | +0.046
wikiHow Goal-Step | Accuracy  | -0.045  | -0.026   | -0.004
wikiHow Temporal  | Accuracy  | +0.037  | +0.105   | +0.073
Yelp              | Pearson ⍴ | -0.017  | -0.017   | -0.015
IMDb              | Accuracy  | +0.063  | +0.006   | +0.012
WinoGrande        | Accuracy  | -0.013  | +0.109   | +0.098
ANLI              | Accuracy  | +0.027  | -0.011   | +0.053

21 of 29

(𝚫 text → code table for the classification and NLI tasks repeated; see above.)

Observation 1: On classification tasks, code prompts help more often than they hurt

22 of 29

(𝚫 text → code table for the classification and NLI tasks repeated; see above.)

Observation 2: Tasks that benefit from code prompts tend to do so regardless of the model

23 of 29

(𝚫 text → code table for the classification and NLI tasks repeated; see above.)

Observation 3: Instruction fine-tuning helps models make better use of code prompts, even when the fine-tuning is done on text instructions

24 of 29

(𝚫 text → code table for the classification and NLI tasks repeated; see above.)

Observation 4: Given the right task and model, code prompts can be *way* better than text prompts

25 of 29

Takeaways


26 of 29

What did we learn?

1. Code prompts are applicable to much more than just structured reasoning tasks

2. Code prompts seem to benefit from instruction fine-tuning

3. Improvements seem to have more to do with the task type than with the model

4. Given the right task, we see substantial improvements (10%+) with code prompting

27 of 29

There are tons of things we don’t fully understand


“How do you even write one of these things?” (us, during this project)

What is the right code/text pre-training and instruction data mix, task selection, prompt formulation, etc.?

28 of 29

Insights on code prompts → insights on reasoning


Why do code models reason so effectively? What latent structures do LMs use? How do we define structure?

29 of 29

Thank you!

My awesome co-authors: Li Zhang, Hainiu Xu, Chris Callison-Burch
{zharry, seacow, ccb}@seas.upenn.edu

Paper Link

Enjoyed the talk? *Curious* about code prompts? We’d love to collaborate!

Email: ldugan@seas.upenn.edu

Website: http://liamdugan.com/

Twitter: @LiamDugan_