1 of 40

LeanStep: a dataset and environment for (interactive) neural theorem proving

Jason Rute

2 of 40

Jesse Michael Han (Univ. of Pittsburgh)

Yuhuai Tony Wu (Univ. of Toronto)

Edward Ayers (Univ. of Cambridge)

Stanislas Polu (OpenAI)

With thanks to the N2Formal team at Google AI

3 of 40

Solving the game of mathematics with machine learning

4 of 40

Machine learning guided formal theorem proving

1. Get mathematical training data

2. Train a model

3. Build theorem proving testing environment

4. Execute proof search

[Illustration of steps 3 and 4: the model is queried on the goal
p q : Prop, h : p ∧ q ⊢ q ∧ p
and predicts the tactic
cases h with hp hq,
which produces the goal
p q : Prop, hp : p, hq : q ⊢ q ∧ p;
the search continues until "no goals".]

5 of 40

Datasets, environments, and ML provers for ITP

HOL4

  • TacticToe (solves 66.4% of HOL4 theorems)

HOL-Light

  • HOLStep
  • HOList/DeepHOL (69.1%)

Metamath

  • Holophrasm (14.3%)
  • MetaGen-IL (21.2%)
  • GPT-f (56.2%)

Isabelle/HOL

  • IsarStep

Coq

  • GamePad
  • CoqGym/ASTactic (12.2%)
  • ProverBot9001 (19.4%)
  • Tactician (39.3%)

Mizar and First Order Logic

  • Miz40
  • RLCoP (50.0%)
  • (Works combining ML and FOL provers)

Lean 3

  • Nothing Yet

Percent of test theorems solved. Numbers cannot always be compared across solvers, even within the same prover.

6 of 40

State of machine learning for theorem proving

Several ideas explored so far

  • Tactic selection
  • Lemma selection
  • Full proof step text synthesis
  • A few neural network architectures and other ML algorithms
  • Simple reinforcement learning

Lots more to do

  • Catch up with the last two years of AI research in
    • self-supervised learning
    • reinforcement learning
  • Better benchmarks
  • Lemma creation
  • Definition creation
  • Program synthesis
  • Use arXiv proofs
  • Combine with traditional AI (SMT solvers)
  • New ideas

It shows a lot of promise!

  • Good ML systems solve 20%–60% of theorems
  • Performance similar to (SMT-based) hammers
  • ML starting to be incorporated into theorem provers
  • AI labs (Google AI, OpenAI, Facebook AI, DeepMind, IBM Watson) showing interest in this area (and in Lean!)

7 of 40

Extracting Proof Data from Lean

LeanStep datasets

8 of 40

Proof modes

Tactic proof

    lemma and_swap : p ∧ q → q ∧ p :=
    begin
      intro h,
      cases h with hp hq,
      constructor, exact hq, exact hp
    end

Term proof

    lemma and_swap : p ∧ q ↔ q ∧ p :=
    iff.intro
      (assume h : p ∧ q, show q ∧ p, from and.intro (and.right h) (and.left h))
      (assume h : q ∧ p, show p ∧ q, from and.intro (and.right h) (and.left h))

9 of 40

LeanStep: Tactic proofs

10 of 40

Available information

Tactic proof

    lemma and_swap : p ∧ q → q ∧ p :=
    begin
      intro h,
      cases h with hp hq,
      constructor, exact hq, exact hp
    end

Data to extract

  • Tactic command and position
  • Hypotheses and goals
  • Tactic name
  • Tactic arguments
  • Full abstract syntax tree of the proof
  • Declaration name
  • Hidden tactic state information:
    • Open namespaces
    • Environment
    • Metavariables
    • Other hidden information

p q : Prop, h : p ∧ q ⊢ q ∧ p

tactic.interactive.cases (none, ``(h)) [`hp, `hq]

11 of 40

Getting start position and state

Tactic proof

    lemma and_swap : p ∧ q → q ∧ p :=
    begin
      intro h,
      cases h with hp hq,
      constructor, exact hq, exact hp
    end

Modifying istep

  • Modify istep in Lean core library to trace
    • start position (line and column)
    • declaration name
    • goals
    • tactic state information
  • Compile mathlib library and capture output
  • Process output

Used a Python script to modify the Lean code (a sketch of the injected tracing follows below)

p q : Prop, h : p ∧ q ⊢ q ∧ p
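
A minimal sketch of the injected tracing, in Lean 3 metaprogramming. The names trace_tactic_state and istep_traced are illustrative, not the actual patch, and the real istep signature varies across Lean versions:

    -- Sketch: trace position, declaration name, and goals before a tactic runs.
    meta def trace_tactic_state (line col : ℕ) : tactic unit :=
    do decl ← tactic.decl_name,        -- name of the declaration being elaborated
       st   ← tactic.read,             -- the full tactic state
       tactic.trace format!"POS {line}:{col} DECL {decl}",
       tactic.trace (to_fmt st)        -- pretty-printed hypotheses and goals

    -- Sketch: a wrapped istep that traces before delegating to the original.
    meta def istep_traced {α : Type} (line0 col0 line col : ℕ) (t : tactic α) : tactic unit :=
    trace_tactic_state line col >> tactic.istep line0 col0 line col t

Compiling mathlib with istep routed through such a wrapper emits one record per interactive tactic invocation, which the post-processing step then collects.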

12 of 40

Getting tactic arguments

Tactic proof

    lemma and_swap : p ∧ q → q ∧ p :=
    begin
      intro h,
      cases h with hp hq,
      constructor, exact hq, exact hp
    end

Modifying the parser state

  • Modify the argument parsers of all interactive Lean tactics (in Lean core and mathlib) to trace
    • argument start and end position
    • argument as reflected in Lean
  • Compile Lean library and capture output
  • Process output

Used a Python script to find and modify all interactive tactics (a sketch of the wrapper follows below)

tactic.interactive.cases (none, ``(h)) [`hp, `hq]
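
A minimal sketch of the parser-side tracing, assuming Lean 3 metaprogramming. The wrapper traced is illustrative; the real change patches each interactive tactic's parser annotations:

    -- Sketch: wrap an argument parser so the argument's source span and its
    -- reflected Lean syntax are traced before the value is handed back.
    meta def traced {α : Type} [has_reflect α] (p : lean.parser α) : lean.parser α :=
    do p₀ ← lean.parser.cur_pos,          -- argument start position
       a  ← p,                            -- run the original parser
       p₁ ← lean.parser.cur_pos,          -- argument end position
       let e : expr := reflect a,         -- the argument, reflected (standard coercion)
       lean.parser.of_tactic (tactic.trace
         format!"ARG {p₀.line}:{p₀.column}-{p₁.line}:{p₁.column} {e}"),
       return a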

13 of 40

LeanStep tactic proof dataset

Tactic proof

    lemma and_swap : p ∧ q → q ∧ p :=
    begin
      intro h,
      cases h with hp hq,
      constructor, exact hq, exact hp
    end

A machine learning dataset

  • Input text: Pretty printed hypotheses and goal

  • Output text: Tactic command

  • Good dataset for Transformer models like GPT-f

p q : Prop, h : p ∧ q ⊢ q ∧ p

cases h with hp hq

14 of 40

LeanStep tactic proof dataset

  • 138,720 goal-tactic pairs
  • 18,992 theorems with tactic proofs (out of 42,079 theorems in total)
  • Please keep adding to mathlib!

15 of 40

Highlights

  • Tracing code is written in Lean
  • Automatically inject this code into Lean
  • Easy to maintain and update to newer versions of Lean
  • Can record anything in tactic state:
    • All declarations in environment
    • All meta-variables and their details
  • Easy to record expressions via
    • pretty printed string
    • pp.all strings
    • raw fmt
    • s-expressions
    • json
    • vector embeddings (not yet)
  • Captures every interactive tactic application without issue
  • Dataset and code will be made public when ready
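
For instance, one expression can be rendered in several of the formats above from inside the tactic monad. A minimal sketch (render is an illustrative name; the s-expression and JSON encoders in the actual extraction code are custom and omitted here):

    -- Sketch: print one expression as a pretty-printed string, as a raw
    -- format tree, and in fully elaborated pp.all form.
    meta def render (e : expr) : tactic unit :=
    do f ← tactic.pp e,
       tactic.trace f,                            -- pretty-printed string
       tactic.trace e.to_raw_fmt,                 -- raw fmt tree
       o ← tactic.get_options,
       tactic.set_options (o.set_bool `pp.all tt),
       tactic.pp e >>= tactic.trace,              -- pp.all string
       tactic.set_options o                       -- restore the options

    run_cmd render `(1 + 1)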

16 of 40

This dataset has other uses besides ML

  • Cleaning up mathlib
    • Find and remove tactics which are never executed
    • e.g. the trailing refl in by simp; refl when simp already closes the goal
  • Learning basic statistics about mathlib
    • Most common tactics
    • Longest tactic proof
  • Porting mathlib to Lean 4
    • Dataset contains abstract syntax trees of every tactic proof in mathlib

17 of 40

LeanStep: Term proofs

18 of 40

Lean stores term proofs for all theorems

#print of_iff_true

    theorem of_iff_true : ∀ {a : Prop}, (a ↔ true) → a :=
    λ {a : Prop} (h : a ↔ true), iff.mp (iff.symm h) trivial
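
The same data is available programmatically, so extraction does not need to parse #print output. A minimal sketch (show_proof is an illustrative name):

    -- Sketch: read a theorem's statement and stored proof term from the environment.
    meta def show_proof (n : name) : tactic unit :=
    do d ← tactic.get_decl n,
       tactic.trace d.type,      -- the statement
       tactic.trace d.value      -- the stored proof term

    run_cmd show_proof `of_iff_true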

19 of 40

Generate datasets by adding holes to proof term

Proof term with hole:        _
Hypotheses and goal:         ⊢ ∀ {a : Prop}, (a ↔ true) → a
Term that fulfills the goal: λ {a : Prop} (h : a ↔ true), iff.mp (iff.symm h) trivial

Proof term with hole:        λ {a : Prop}, _
Hypotheses and goal:         a : Prop ⊢ (a ↔ true) → a
Term that fulfills the goal: λ (h : a ↔ true), iff.mp (iff.symm h) trivial

Proof term with hole:        λ {a : Prop} (h : a ↔ true), _
Hypotheses and goal:         a : Prop, h : a ↔ true ⊢ a
Term that fulfills the goal: iff.mp (iff.symm h) trivial

Proof term with hole:        λ {a : Prop} (h : a ↔ true), iff.mp _ trivial
Hypotheses and goal:         a : Prop, h : a ↔ true ⊢ true ↔ a
Term that fulfills the goal: iff.symm h

Proof term with hole:        λ {a : Prop} (h : a ↔ true), iff.mp (iff.symm _) trivial
Hypotheses and goal:         a : Prop, h : a ↔ true ⊢ a ↔ true
Term that fulfills the goal: h

Proof term with hole:        λ {a : Prop} (h : a ↔ true), iff.mp (iff.symm h) _
Hypotheses and goal:         a : Prop, h : a ↔ true ⊢ true
Term that fulfills the goal: trivial
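
A minimal sketch of the traversal behind this table. Real extraction must carry the local context through binders; this version only descends through applications, where the subterms stay closed:

    -- Sketch: collect (subterm, type) pairs; the type is the goal that a
    -- hole `_` at that position must fulfill.
    meta def holes : expr → tactic (list (expr × expr))
    | e@(expr.app f a) := do
        t    ← tactic.infer_type e,
        rest ← (++) <$> holes f <*> holes a,
        return ((e, t) :: rest)
    | e := do t ← tactic.infer_type e, return [(e, t)]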

20 of 40

Possible prediction tasks with this dataset

Why perform auxiliary tasks?

  • They provide simpler tasks on which to test new methods
  • They provide more data for training your model (data augmentation, transfer learning, self-supervised learning)
  • They are subtasks of fulfilling the main theorem-proving task

The tasks, with examples drawn from the of_iff_true proof above:

  • Predict the masked term ("skip-tree task") from a partial proof:
    λ {a : Prop} (h : a ↔ true), iff.mp _ trivial ⟶ iff.symm h
  • Predict the masked proof term from hypotheses and goal:
    a : Prop, h : a ↔ true ⊢ true ↔ a ⟶ iff.symm h
  • Predict whether propositional hypotheses are used in the proof of the goal, e.g. for
    a : Prop, h : a ↔ true ⊢ true
  • Predict the main function of the proof, or its type (cf. GPT-f):
    a : Prop, h : a ↔ true ⊢ true ↔ a ⟶ iff.symm : ∀ {a b : Prop}, (a ↔ b) → (b ↔ a)

21 of 40

A theorem proving AI environment

22 of 40

Current proof search and evaluation

1. Train a model on the LeanStep tactic proof dataset

2. Incorporate it into a Lean testing environment (implemented with Lean metaprogramming)

3. Run greedy search (à la tidy) on test theorems

[Illustration, as on slide 4: the model predicts cases h with hp hq for
p q : Prop, h : p ∧ q ⊢ q ∧ p,
yielding p q : Prop, hp : p, hq : q ⊢ q ∧ p and eventually "no goals".]

23 of 40

tidy proof search

⊢ ∀ n, 2 * n + 1 ≥ 1

[Search-tree diagram: tidy tries its fixed tactic list in order on the goal above: refl (failed), exact dec_trivial (failed), assumption (failed), intros, apply auto_param, .... The intros branch succeeds, yielding n : ℕ ⊢ 2 * n + 1 ≥ 1, where the same list is tried again until the goal is closed ("no goals"). A sketch of this loop follows.]
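
A minimal sketch of the tidy-style loop. The tactic list here is abbreviated; tidy's actual list is longer and configurable:

    -- Sketch: repeatedly run the first tactic from a fixed list that
    -- succeeds, until no tactic applies or no goals remain.
    meta def tidy_like : tactic unit :=
    tactic.repeat $ tactic.first
      [ tactic.reflexivity,
        `[exact dec_trivial],
        tactic.assumption,
        tactic.intros >> tactic.skip ]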

24 of 40

Model-guided greedy proof search

a b : ℕ, h : a.succ < b ⊢ a ≤ b ∧ ¬b ≤ a

[Search diagram: for the goal above, the model proposes tactic candidates such as split, cases le_total a b, and exact ⟨le_of_lt h, not_le_of_lt h⟩; failed candidates are discarded.]

  • Query N tactic commands from the model
  • Follow the first successful tactic (see the sketch below)
  • Repeat the greedy search, up to a fixed depth D
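
A minimal sketch of this greedy loop. Here suggest stands in for the model query and is hypothetical; in the real environment the model returns tactic strings that are parsed and run:

    -- Sketch: at each step ask the model for candidates, follow the first
    -- one that succeeds, and recurse up to a fixed depth D.
    meta def greedy_search (suggest : tactic (list (tactic unit))) : ℕ → tactic unit
    | 0       := tactic.done
    | (d + 1) := tactic.done <|> (do
        cands ← suggest,        -- query N tactic candidates from the model
        tactic.first cands,     -- follow the first successful tactic
        greedy_search d)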

25 of 40

Preliminary Results

26 of 40

Evaluation Strategy

Training:

  • Train model to predict tactic from goals
  • Train on 80% of the LeanStep tactic proof steps

Testing:

  • Run greedy proof search on 3,279 test theorems
  • No tactic proof steps of test theorems were included in the training split
  • Compare against tidy as a baseline


27 of 40

Fairseq transformer (encoder-decoder) model

Setting

  • Uses the public library fairseq
  • Model size: 50M parameters
  • 10 candidates generated by beam search with beam width 10

Results:

  • Proof step test accuracy: 13.8%
  • The fairseq transformer proved 386 of 3,279 test theorems (11.8%)
  • The tidy baseline proved 362 of 3,279 (11.0%)
  • Average successful proof length: 3.36 (fairseq) vs 4.91 (tidy)
  • The two prove different theorems: fairseq <|> tidy (their union) proved 559 theorems (17.0%)

https://github.com/pytorch/fairseq


28 of 40

Fairseq transformer (encoder-decoder) model

29 of 40

Scale up: Large (3B) GPT-f transformer

Setting

  • Model size: 3B parameters
  • Pre-trained on WebMath (as in GPT-f)
  • 64 candidates sampled at temperature 1.2

Results:

  • Proof step test accuracy: 20.1%
  • GPT-f proved 811 of 3,279 test theorems (24.7%)
  • Average successful proof length: 3.36 (fairseq) vs 4.91 (tidy) vs 3.80 (GPT-f)
  • Evaluation is stochastic due to sampling: over 8 evaluation runs it proved 1,179 theorems in total (35.9%)


30 of 40

Scale up: Large (3B) GPT-f transformer

31 of 40

Scale even more: Larger (12B) GPT-f transformer

Setting

  • Model size: 12B parameters
  • Pre-trained on WebMath (as in GPT-f)
  • 64 candidates sampled at temperature 1.2

Results:

  • Proof step test accuracy: 23.4%
  • GPT-f proved 833 of 3,279 test theorems (25.7%)
  • Average successful proof length: 4.43


32 of 40

Scale even more: Larger (12B) GPT-f transformer

33 of 40

Summary

Stats \ Model         | tidy  | Fairseq Transformer | Large GPT-f | Larger GPT-f | Large GPT-f over 8 runs
Success rate          | 11.0% | 11.8%               | 24.8%       | 25.7%        | 35.9%
Proof step accuracy   | n/a   | 13.8%               | 20.1%       | 23.4%        | 20.1%
Model size            | n/a   | 50M                 | 3B          | 12B          | 3B
Average proof length  | 4.91  | 3.36                | 3.80        | 4.43         | 4.75

34 of 40

Interactive Neural Theorem Proving

35 of 40

Interactive neural theorem proving

36 of 40

Demo Video

37 of 40

Thank You!

38 of 40

Appendix

39 of 40

The elephant in the room

  • Mathlib will take a long time to port to Lean 4, but the Lean 3 tactic data might help in the porting.
  • The term proof approach should work in Lean 4 (but with new extraction code).
  • The tactic proof recording won't work as written.
  • We now have a stake in the ground.
  • We have been told that it will be possible to extract this sort of data in Lean 4 in the future.
  • Lean 4 should make it much easier to inspect and record the tactic state since it is not hidden behind a C++ wall.


40 of 40

Why Lean?

Popular and newsworthy

Extensive and growing mathlib library

Great tools and customization

IMO Grand Challenge (In Lean 4)

Easy to learn and use

Active user base and supportive community
