1 of 30

Exploring the Verifiability of Code Generated by GitHub Copilot

Dakota Wong, Austin Kothig, and Patrick Lam

University of Waterloo

HATRA 2022

December 7, 2022

Auckland, NZ

https://patricklam.ca

2 of 30

How We Got Here: GitHub Copilot

3 of 30

What Copilot Does

You say…� def binary_search(arr:list, target: int) -> int:

… and Copilot suggests:� low = 0� high = len(arr) - 1� while low <= high:� mid = (low + high) // 2� if arr[mid] == target:� return mid� elif arr[mid] < target:� low = mid + 1� else:� high = mid - 1� return -1

4 of 30

Our Question:

Is it possible to formally verify Copilot-generated code?

Frame Lake, NWT

5 of 30

Previous Work: Nguyen and Nadi, MSR 2022

Q: Does Copilot-generated code pass tests?

Approach: evaluate with LeetCode questions

Best case (Java): solves 19/33 problems

Comments: generated code isn’t clean�

6 of 30

Previous Work: Nguyen and Nadi, MSR 2022

Q: Does Copilot-generated code pass tests?

Approach: evaluate with LeetCode questions

�This work: is the generated code verifiable?

7 of 30

Previous Work: Sobania, Briesch, Rothlauf, GECCO 2022

Q: Copilot vs genetic programming,� on program synthesis benchmarks

Results:

Copilot generates “correct” solution to 24/29 problems;� GP to 25/29;

needs labeled data, � often generates bloated & hard-to-understand code.

??

8 of 30

Other Previous Work

Dakhel et al, 2022: Copilot vs humans,

Copilot with humans

  • can be used for faster early-stage prototyping

� Vaithilingam et al, 2022: Copilot user study

  • devs prefer Copilot to blank slate

� Pearce et al, 2022: Copilot generates vulnerabilities

Siddiq et al, 2022: Copilot generates code smells

9 of 30

Our Methodology

St.-Jean-sur-Richelieu, QC

grad student by Vectorportal.com under CC/BY/SA 4.0

Problem description

Python solutions

Dafny code

  • requires, ensures, invariants

✔ or ✘

10 of 30

About Our Inputs

  • 6 problems, 1 of which from Nguyen & Nadi
  • Provided: full function signature

(descriptive function name, parameter name(s), parameter type(s), output type)

�Another example:

def primeFactors(val: int) -> list:

11 of 30

Copilot Prompts

$

def binary_search(arr: list, target: int) -> int:

// LeetCode; types on omitted subsequent lines:

def twoSum(self, nums, target):

def largest_sum(nums: list) -> int:

def sortArray(arr: list) -> list:

def primeFactors(val: int) -> list:

def maxHeapify(arr: list) -> list:

12 of 30

Additional Verification Inputs

  • requires, ensures clauses
  • loop invariants
  • plus extra requires, ensures clauses needed during verification
  • plus helper functions & methods

13 of 30

Results

14 of 30

Successful Verification 1: Binary Search

def binary_search(arr: list, target: int) -> int:

We provided:

requires Sorted(arr)

requires Distinct(arr)

ensures r.v != 1 ==> found

ensures r.v == 1 ==> not_found(arr, target)

(these predicates too)

15 of 30

Successful Verification 1: Loop Invariants

while low <= high:� mid = (low + high) // 2� if arr[mid] == target:� return mid� elif arr[mid] < target:� low = mid + 1� else:� high = mid - 1

Using domain knowledge, we wrote:

and Dafny successfully verifies the implementation.

16 of 30

Successful Verification 2: Two Sums

This problem also appeared in Nguyen and Nadi, from LeetCode.

def twoSum(self, nums, target):� """� :type nums: List[int]� :type target: int� :rtype: List[int]� """

17 of 30

Specifying Two Sums

LeetCode provides constraint that valid answer always exists, i.e.:

18 of 30

Verifying Two Sums

  • 5 lines of generated Python -> 8 lines of Dafny
  • we wrote 7 invariants (3 in outer loop, 4 in inner loop),
  • upon which verification succeeded

19 of 30

Not shown: Largest Sum ✔ & Sort Array ✔

We also verified ✔ these two benchmarks, providing specifications and invariants for them.

Details omitted, but available in replication package.

20 of 30

Not Verified: Prime Factors ✘

Given this input:

def primeFactors(val: int) -> list:

Copilot generates a reasonable-looking implementation.

21 of 30

Copilot Prime Factors Python Implementation

def primeFactors(val: int) -> list:� factors = []� while val % 2 == 0:� factors.append(2)� val = val // 2� for i in range(3, int(val ** 0.5) + 1, 2):� while val % i == 0:� factors.append(i)� val = val // i� if val > 2:� factors.append(val)� return factors

22 of 30

Prime Factors: Dafny Conversion Issues

  1. We converted Python for loops to Dafny while loops.
  2. We found a verified implementation of integer square root.
  3. We needed a dynamic array; we used seq<int> and initialized with [1].

23 of 30

Prime Factors: Dafny Verification Issues

Issue: Dafny timeouts.

Probably the implementation is correct.�Maybe our invariants are correct?�We tried lemmas which assumed false,� plus specific ensures clauses before and� after each append.�Still encountered timeouts.

Kastraki, GR

24 of 30

Not Verified: Max Heapify ✘

Problem: turn an array into a max-heap.� def maxHeapify(arr: list) -> list:

We wrote formal specs:

25 of 30

Max Heapify Generated Implementation

def maxHeapify(arr: list) -> list:� for i in range(len(arr)):� left = 2 * i + 1� right = 2 * i + 2� if left < len(arr) and arr[left] > arr[i]:� largest = left� else:� largest = i� if right < len(arr) and arr[right] > arr[largest]:� largest = right� if largest != i:� arr[i], arr[largest] = arr[largest], arr[i]� maxHeapify(arr)� return arr

we provided

this line

Recursion!

26 of 30

Max Heapify Implementation Notes

  • “sifts down” rather than “sifting up”� maxHeapify(arr) [and not a subset]
  • the implementation looks O(n2) not O(n log n)
  • does not obviously reduce to sub-problems
  • we tested the implementation on random inputs
    • seems to work
    • trace-tree seems to go quite deep

27 of 30

Max Heapify Verification Barriers ✘

Tried to show array partially heapified. ✘

Is the implementation correct?

¯\_(ツ)_/¯

But we couldn’t even prove termination. ✘

(Are we the problem?)

Tafat and Marché verified max-heapify; it was difficult.

Whitehorse, YT

28 of 30

Discussion

Nguyen and Nadi showed that some Copilot code is incorrect.

  • now we know that some Copilot code is hard to verify too.
  • probably code designed for verification is easier to verify.

We’re not verification experts. But invariants are never easy.

Our examples were bog standard (so is LeetCode!). What about esoteric code?

  • in particular, API usages like reading a CSV?
  • we also dodged pointers & memory management

Cape Saunders, NZ

29 of 30

Summary

Σ

Generated 6 code samples with Copilot.

Came up with specifications and invariants.

Managed to verify 4 of 6 implementations.

tl;dr: Copilot generates code quickly, � but beware generated code quality!

30 of 30

O’Neill Bay, NZ

bonus: https://patricklam.ca/nz-places