LeanStep: a dataset and environment for (interactive) neural theorem proving
Jason Rute
Jesse Michael Han
Univ. of Pittsburgh
Yuhuai Tony Wu
Univ. of Toronto
Edward Ayers
Univ. of Cambridge
Stanislas Polu
OpenAI
With thanks to the �N2Formal team at�Google AI
Solving the game of mathematics
with machine learning
Machine learning guided formal theorem proving
1. Get mathematical training data
2. Train a model
3. Build theorem proving testing environment
4. Execute proof search
p q : Prop, �h : p ∧ q �⊢ q ∧ p
cases h with hp hq
p q : Prop, �h : p ∧ q �⊢ q ∧ p
cases h with hp hq
p q : Prop, �h : p ∧ q �⊢ q ∧ p
cases h with hp hq
p q : Prop, �hp : p, �hq : q �⊢ q ∧ p
⊢
⊢
⊢
no goals
Datasets, environments, and ML provers for ITP
HOL4
HOL-Light
Metamath
Isabelle/HOL
Coq
Mizar and First Order Logic
Lean 3
† Percent of test theorems solved. Numbers cannot always
be compared across solvers, even within the same prover.
State of machine learning for theorem proving
Explored a couple of ideas
Lot more to do
It shows a lot of promise!
Extracting Proof Data from Lean
LeanStep datasets
Proof modes
Tactic proof
lemma and_swap : p ∧ q → q ∧ p := �begin � intro h, � cases h with hp hq, � constructor, exact hq, exact hp �end
Term proof
lemma and_swap : p ∧ q → q ∧ p := �iff.intro � (assume h : p ∧ q, � show q ∧ p, � from and.intro (and.right h) (and.left h))
(assume h : q ∧ p, � show p ∧ q, � from and.intro (and.right h) (and.left h))
LeanStep :Tactic proofs
Available information
Tactic proof
lemma and_swap : p ∧ q → q ∧ p := �begin � intro h, � cases h with hp hq , � constructor, exact hq, exact hp �end
Data to extract
p q : Prop, �h : p ∧ q �⊢ q ∧ p
tactic.interactive.cases (none, ``(h)) [`hp, `hq]
Getting start position and state
Tactic proof
lemma and_swap : p ∧ q → q ∧ p := �begin � intro h, � cases h with hp hq, � constructor, exact hq, exact hp �end
Modifying istep
† Used Python script to modify Lean code
p q : Prop, �h : p ∧ q �⊢ q ∧ p
Getting tactic arguments
Tactic proof
lemma and_swap : p ∧ q → q ∧ p := �begin � intro h, � cases h with hp hq , � constructor, exact hq, exact hp �end
Modifying the parser state
† Used Python script to find/modify all interactive tactics
tactic.interactive.cases (none, ``(h)) [`hp, `hq]
LeanStep tactic proof dataset
Tactic proof
lemma and_swap : p ∧ q → q ∧ p := �begin � intro h, � cases h with hp hq , � constructor, exact hq, exact hp �end
A machine learning dataset
p q : Prop, �h : p ∧ q �⊢ q ∧ p
cases h with hp hq
LeanStep tactic proof dataset
Highlights
This dataset has other uses besides ML
LeanStep: Term proofs
Lean stores term proofs for all theorems
#print of_iff_true
theorem of_iff_true : ∀ {a : Prop}, (a ↔ true) → a := �λ {a : Prop} (h : a ↔ true), iff.mp (iff.symm h) trivial
Generate datasets by adding holes to proof term
Proof term with hole | Hypotheses and goal | Term that fulfills the goal |
_ | �⊢ ∀ {a : Prop}, (a ↔ true) → a | λ {a : Prop} (h : a ↔ true), �iff.mp (iff.symm h) trivial |
λ {a : Prop}, _ | a : Prop �⊢ (a ↔ true) → a | λ (h : a ↔ true), �iff.mp (iff.symm h) trivial |
λ {a : Prop} (h : a ↔ true),�_ | a : Prop, h : a ↔ true �⊢ a | iff.mp (iff.symm h) trivial |
λ {a : Prop} (h : a ↔ true), �iff.mp _ trivial | a : Prop, h : a ↔ true �⊢ true ↔ a | iff.symm h |
λ {a : Prop} (h : a ↔ true), �iff.mp (iff.symm _) trivial | a : Prop, h : a ↔ true �⊢ a ↔ true | h |
λ {a : Prop} (h : a ↔ true), �iff.mp (iff.symm h) _ | a : Prop, h : a ↔ true �⊢ true | trivial |
Possible prediction tasks with this dataset
Why perform auxiliary tasks?
a : Prop, �h : a ↔ true �⊢ true ↔ a
a : Prop, �h : a ↔ true �⊢ true
a : Prop, �h : a ↔ true �⊢ true ↔ a
iff.symm : ∀ {a b : Prop}, (a ↔ b) → (b ↔ a)
a : Prop, �h : a ↔ true �⊢ true ↔ a
λ {a : Prop} (h : a ↔ true), �iff.mp _ trivial
iff.symm h
iff.symm h
A theorem proving AI environment
Current proof search and evaluation
Train a model on LeanStep tactic proof dataset
Incorporate into Lean testing environment
(implemented with Lean metaprogramming)
Run greedy search (à la tidy) on test theorems
p q : Prop, �h : p ∧ q �⊢ q ∧ p
cases h with hp hq
p q : Prop, �h : p ∧ q �⊢ q ∧ p
cases h with hp hq
p q : Prop, �hp : p, �hq : q �⊢ q ∧ p
⊢
⊢
⊢
no goals
tidy proof search
⊢ ∀ n, 2 * n + 1 ≥ 1
refl
exact dec_trivial
failed
failed
assumption
failed
intros
apply auto_param
n: ℕ �⊢ 2 * n + 1 ≥ 1
refl
exact dec_trivial
failed
assumption
no goals
...
...
Model-guided greedy proof search
failed
split
cases le_total a b
exact ⟨le_of_lt h, not_le_of_lt h⟩
a b: ℕ�h: a.succ < b �⊢ a ≤ b ∧ ¬b ≤ a
⊢
...
query N tactic commands from model
follow first successful tactic
repeat greedy search
up to a fixed depth D
⊢
⊢
Preliminary�Results
Evaluation Strategy
Training:
Testing:
p q : Prop, �h : p ∧ q �⊢ q ∧ p
cases h with hp hq
⊢
⊢
⊢
no goals
Fairseq transformer (encoder-decoder) model
Setting
Results:
†https://github.com/pytorch/fairseq
p q : Prop, �h : p ∧ q �⊢ q ∧ p
cases h with hp hq
Fairseq transformer (encoder-decoder) model
Scale up: Large (3B) GPT-f transformer
Setting
Results:
p q : Prop, �h : p ∧ q �⊢ q ∧ p
cases h with hp hq
Scale up: Large (3B) GPT-f transformer
Scale even more: Larger (12B) GPT-f transformer
Setting
Results:
p q : Prop, �h : p ∧ q �⊢ q ∧ p
cases h with hp hq
Scale even more: Larger (12B) GPT-f transformer
Summary
Stats \ Model | tidy | Fairseq�Transformer | Large GPT-f | Larger GPT-f | Large GPT-f over 8 runs |
Success rate | 11.0% | 11.8% | 24.8% | 25.7% | 35.9% |
Proof Step accuracy |
| 13.8% | 20.1% | 23.4% | 20.1% |
Model Size |
| 50M | 3B | 12B | 3B |
Average Proof Length | 4.91 | 3.36 | 3.80 | 4.43 | 4.75 |
Interactive Neural Theorem Proving
Interactive neural theorem proving
Demo Video
Thank You!
Appendix
The elephant in the room
Lean 4
Why Lean?
Popular and newsworthy
Extensive and Growing MathLib Library
Great tools and customization
IMO Grand Challenge (In Lean 4)
Easy to learn and use
Active user base and supportive community
meta