1 of 30

Exploring the Verifiability of Code Generated by GitHub Copilot

Dakota Wong, Austin Kothig, and Patrick Lam

University of Waterloo

HATRA 2022

December 7, 2022

Auckland, NZ

https://patricklam.ca

2 of 30

How We Got Here: GitHub Copilot

3 of 30

What Copilot Does

You say…� def binary_search(arr:list, target: int) -> int:

… and Copilot suggests:� low = 0� high = len(arr) - 1� while low <= high:� mid = (low + high) // 2� if arr[mid] == target:� return mid� elif arr[mid] < target:� low = mid + 1� else:� high = mid - 1� return -1

4 of 30

Our Question:

Is it possible to formally verify Copilot-generated code?

Frame Lake, NWT

5 of 30

Previous Work: Nguyen and Nadi, MSR 2022

Q: Does Copilot-generated code pass tests?

Approach: evaluate with LeetCode questions

Best case (Java): solves 19/33 problems

Comments: generated code isn’t clean�

6 of 30

Previous Work: Nguyen and Nadi, MSR 2022

Q: Does Copilot-generated code pass tests?

Approach: evaluate with LeetCode questions

�This work: is the generated code verifiable?

7 of 30

Previous Work: Sobania, Briesch, Rothlauf, GECCO 2022

Q: Copilot vs genetic programming,� on program synthesis benchmarks

Results:

Copilot generates “correct” solution to 24/29 problems;� GP to 25/29;

needs labeled data, � often generates bloated & hard-to-understand code.

8 of 30

Other Previous Work

Dakhel et al, 2022: Copilot vs humans,

Copilot with humans

can be used for faster early-stage prototyping

� Vaithilingam et al, 2022: Copilot user study

devs prefer Copilot to blank slate

� Pearce et al, 2022: Copilot generates vulnerabilities

Siddiq et al, 2022: Copilot generates code smells

9 of 30

Our Methodology

St.-Jean-sur-Richelieu, QC

grad student by Vectorportal.com under CC/BY/SA 4.0

Problem description

Python solutions

Dafny code

requires, ensures, invariants

✔ or ✘

10 of 30

About Our Inputs

6 problems, 1 of which from Nguyen & Nadi
Provided: full function signature

(descriptive function name, parameter name(s), parameter type(s), output type)

�Another example:

def primeFactors(val: int) -> list:

11 of 30

Copilot Prompts

def binary_search(arr: list, target: int) -> int:

// LeetCode; types on omitted subsequent lines:

def twoSum(self, nums, target):

def largest_sum(nums: list) -> int:

def sortArray(arr: list) -> list:

def primeFactors(val: int) -> list:

def maxHeapify(arr: list) -> list:

12 of 30

Additional Verification Inputs

requires, ensures clauses
loop invariants
plus extra requires, ensures clauses needed during verification
plus helper functions & methods

14 of 30

Successful Verification 1: Binary Search

def binary_search(arr: list, target: int) -> int:

We provided:

requires Sorted(arr)

requires Distinct(arr)

ensures r.v != 1 ==> found

ensures r.v == 1 ==> not_found(arr, target)

(these predicates too)

15 of 30

Successful Verification 1: Loop Invariants

while low <= high:� mid = (low + high) // 2� if arr[mid] == target:� return mid� elif arr[mid] < target:� low = mid + 1� else:� high = mid - 1

Using domain knowledge, we wrote:

and Dafny successfully verifies the implementation. ✔

16 of 30

Successful Verification 2: Two Sums

This problem also appeared in Nguyen and Nadi, from LeetCode.

def twoSum(self, nums, target):� """� :type nums: List[int]� :type target: int� :rtype: List[int]� """

17 of 30

Specifying Two Sums

LeetCode provides constraint that valid answer always exists, i.e.:

18 of 30

Verifying Two Sums

5 lines of generated Python -> 8 lines of Dafny
we wrote 7 invariants (3 in outer loop, 4 in inner loop),
upon which verification succeeded ✔

19 of 30

Not shown: Largest Sum ✔ & Sort Array ✔

We also verified ✔ these two benchmarks, providing specifications and invariants for them.

Details omitted, but available in replication package.

20 of 30

Not Verified: Prime Factors ✘

Given this input:

def primeFactors(val: int) -> list:

Copilot generates a reasonable-looking implementation.

21 of 30

Copilot Prime Factors Python Implementation

def primeFactors(val: int) -> list:� factors = []� while val % 2 == 0:� factors.append(2)� val = val // 2� for i in range(3, int(val ** 0.5) + 1, 2):� while val % i == 0:� factors.append(i)� val = val // i� if val > 2:� factors.append(val)� return factors

22 of 30

Prime Factors: Dafny Conversion Issues

We converted Python for loops to Dafny while loops.
We found a verified implementation of integer square root.
We needed a dynamic array; we used seq<int> and initialized with [1].

23 of 30

Prime Factors: Dafny Verification Issues

Issue: Dafny timeouts.

Probably the implementation is correct.�Maybe our invariants are correct?�We tried lemmas which assumed false,� plus specific ensures clauses before and� after each append.�Still encountered timeouts.

Kastraki, GR

24 of 30

Not Verified: Max Heapify ✘

Problem: turn an array into a max-heap.� def maxHeapify(arr: list) -> list:

We wrote formal specs:

25 of 30

Max Heapify Generated Implementation

def maxHeapify(arr: list) -> list:� for i in range(len(arr)):� left = 2 * i + 1� right = 2 * i + 2� if left < len(arr) and arr[left] > arr[i]:� largest = left� else:� largest = i� if right < len(arr) and arr[right] > arr[largest]:� largest = right� if largest != i:� arr[i], arr[largest] = arr[largest], arr[i]� maxHeapify(arr)� return arr

we provided

this line

Recursion!

26 of 30

Max Heapify Implementation Notes

“sifts down” rather than “sifting up”� maxHeapify(arr) [and not a subset]
the implementation looks O(n²) not O(n log n)
does not obviously reduce to sub-problems
we tested the implementation on random inputs

seems to work
trace-tree seems to go quite deep

27 of 30

Max Heapify Verification Barriers ✘

Tried to show array partially heapified. ✘

Is the implementation correct?

¯\_(ツ)_/¯

But we couldn’t even prove termination. ✘

(Are we the problem?)

Tafat and Marché verified max-heapify; it was difficult.

Whitehorse, YT

28 of 30

Discussion

Nguyen and Nadi showed that some Copilot code is incorrect.

now we know that some Copilot code is hard to verify too.
probably code designed for verification is easier to verify.

We’re not verification experts. But invariants are never easy.

Our examples were bog standard (so is LeetCode!). What about esoteric code?

in particular, API usages like reading a CSV?
we also dodged pointers & memory management

Cape Saunders, NZ

29 of 30

Summary

Generated 6 code samples with Copilot.

Came up with specifications and invariants.

Managed to verify 4 of 6 implementations.

tl;dr: Copilot generates code quickly, � but beware generated code quality!

30 of 30

O’Neill Bay, NZ

artifact: https://zenodo.org/record/7040924

bonus: https://patricklam.ca/nz-places

1 of 30

2 of 30

3 of 30

4 of 30

5 of 30

6 of 30

7 of 30

8 of 30

9 of 30

10 of 30

11 of 30

12 of 30

13 of 30

14 of 30

15 of 30

16 of 30

17 of 30

18 of 30

19 of 30

20 of 30

21 of 30

22 of 30

23 of 30

24 of 30

25 of 30

26 of 30

27 of 30

28 of 30

29 of 30

30 of 30