Exploring the Verifiability of Code Generated by GitHub Copilot
Dakota Wong, Austin Kothig, and Patrick Lam
University of Waterloo
HATRA 2022
December 7, 2022
Auckland, NZ
https://patricklam.ca
How We Got Here: GitHub Copilot
What Copilot Does
You say…� def binary_search(arr:list, target: int) -> int:
… and Copilot suggests:� low = 0� high = len(arr) - 1� while low <= high:� mid = (low + high) // 2� if arr[mid] == target:� return mid� elif arr[mid] < target:� low = mid + 1� else:� high = mid - 1� return -1
Our Question:
Is it possible to formally verify Copilot-generated code?
Frame Lake, NWT
Previous Work: Nguyen and Nadi, MSR 2022
Q: Does Copilot-generated code pass tests?
Approach: evaluate with LeetCode questions
Best case (Java): solves 19/33 problems
Comments: generated code isn’t clean�
Previous Work: Nguyen and Nadi, MSR 2022
Q: Does Copilot-generated code pass tests?
Approach: evaluate with LeetCode questions
�This work: is the generated code verifiable?
Previous Work: Sobania, Briesch, Rothlauf, GECCO 2022
Q: Copilot vs genetic programming,� on program synthesis benchmarks
Results:
Copilot generates “correct” solution to 24/29 problems;� GP to 25/29;
needs labeled data, � often generates bloated & hard-to-understand code.
??
Other Previous Work
Dakhel et al, 2022: Copilot vs humans,
Copilot with humans
� Vaithilingam et al, 2022: Copilot user study
� Pearce et al, 2022: Copilot generates vulnerabilities
Siddiq et al, 2022: Copilot generates code smells
Our Methodology
St.-Jean-sur-Richelieu, QC
grad student by Vectorportal.com under CC/BY/SA 4.0
Problem description
Python solutions
Dafny code
✔ or ✘
About Our Inputs
(descriptive function name, parameter name(s), parameter type(s), output type)
�Another example:
def primeFactors(val: int) -> list:
Copilot Prompts
$
def binary_search(arr: list, target: int) -> int:
// LeetCode; types on omitted subsequent lines:
def twoSum(self, nums, target):
def largest_sum(nums: list) -> int:
def sortArray(arr: list) -> list:
def primeFactors(val: int) -> list:
def maxHeapify(arr: list) -> list:
Additional Verification Inputs
Results
Successful Verification 1: Binary Search
def binary_search(arr: list, target: int) -> int:
We provided:
requires Sorted(arr)
requires Distinct(arr)
ensures r.v != 1 ==> found
ensures r.v == 1 ==> not_found(arr, target)
(these predicates too)
Successful Verification 1: Loop Invariants
while low <= high:� mid = (low + high) // 2� if arr[mid] == target:� return mid� elif arr[mid] < target:� low = mid + 1� else:� high = mid - 1
Using domain knowledge, we wrote:
and Dafny successfully verifies the implementation. ✔
Successful Verification 2: Two Sums
This problem also appeared in Nguyen and Nadi, from LeetCode.
def twoSum(self, nums, target):� """� :type nums: List[int]� :type target: int� :rtype: List[int]� """
Specifying Two Sums
LeetCode provides constraint that valid answer always exists, i.e.:
Verifying Two Sums
Not shown: Largest Sum ✔ & Sort Array ✔
We also verified ✔ these two benchmarks, providing specifications and invariants for them.
Details omitted, but available in replication package.
Not Verified: Prime Factors ✘
Given this input:
def primeFactors(val: int) -> list:
Copilot generates a reasonable-looking implementation.
Copilot Prime Factors Python Implementation
def primeFactors(val: int) -> list:� factors = []� while val % 2 == 0:� factors.append(2)� val = val // 2� for i in range(3, int(val ** 0.5) + 1, 2):� while val % i == 0:� factors.append(i)� val = val // i� if val > 2:� factors.append(val)� return factors
Prime Factors: Dafny Conversion Issues
Prime Factors: Dafny Verification Issues
Issue: Dafny timeouts.
Probably the implementation is correct.�Maybe our invariants are correct?�We tried lemmas which assumed false,� plus specific ensures clauses before and� after each append.�Still encountered timeouts.
Kastraki, GR
Not Verified: Max Heapify ✘
Problem: turn an array into a max-heap.� def maxHeapify(arr: list) -> list:
We wrote formal specs:
Max Heapify Generated Implementation
def maxHeapify(arr: list) -> list:� for i in range(len(arr)):� left = 2 * i + 1� right = 2 * i + 2� if left < len(arr) and arr[left] > arr[i]:� largest = left� else:� largest = i� if right < len(arr) and arr[right] > arr[largest]:� largest = right� if largest != i:� arr[i], arr[largest] = arr[largest], arr[i]� maxHeapify(arr)� return arr
we provided
this line
Recursion!
Max Heapify Implementation Notes
Max Heapify Verification Barriers ✘
Tried to show array partially heapified. ✘
Is the implementation correct?
¯\_(ツ)_/¯
But we couldn’t even prove termination. ✘
(Are we the problem?)
Tafat and Marché verified max-heapify; it was difficult.
Whitehorse, YT
Discussion
Nguyen and Nadi showed that some Copilot code is incorrect.
We’re not verification experts. But invariants are never easy.
Our examples were bog standard (so is LeetCode!). What about esoteric code?
Cape Saunders, NZ
Summary
Σ
Generated 6 code samples with Copilot.
Came up with specifications and invariants.
Managed to verify 4 of 6 implementations.
tl;dr: Copilot generates code quickly, � but beware generated code quality!
O’Neill Bay, NZ
artifact: https://zenodo.org/record/7040924
bonus: https://patricklam.ca/nz-places