1 of 40

LeanStep: a dataset and environment for (interactive) neural theorem proving

Jason Rute

2 of 40

Jesse Michael Han (Univ. of Pittsburgh)

Yuhuai Tony Wu (Univ. of Toronto)

Edward Ayers (Univ. of Cambridge)

Stanislas Polu (OpenAI)

With thanks to the N2Formal team at Google AI

3 of 40

Solving the game of mathematics with machine learning

4 of 40

Machine learning guided formal theorem proving

1. Get mathematical training data

2. Train a model

3. Build theorem proving testing environment

4. Execute proof search

[Illustration of steps 3 and 4: the model is queried on the goal
p q : Prop, h : p ∧ q ⊢ q ∧ p
and predicts the tactic
cases h with hp hq,
which produces the goal
p q : Prop, hp : p, hq : q ⊢ q ∧ p;
the search continues until "no goals".]

5 of 40

Datasets, environments, and ML provers for ITP

HOL4

  • TacticToe (solves 66.4% of HOL4 theorems)

HOL-Light

  • HOLStep
  • HOList/DeepHOL (69.1%)

Metamath

  • Holophrasm (14.3%)
  • MetaGen-IL (21.2%)
  • GPT-f (56.2%)

Isabelle/HOL

  • IsarStep

Coq

  • GamePad
  • CoqGym/ASTactic (12.2%)
  • ProverBot9001 (19.4%)
  • Tactician (39.3%)

Mizar and First Order Logic

  • Miz40
  • RLCoP (50.0%)
  • (Works combining ML and FOL provers)

Lean 3

  • Nothing Yet

Percent of test theorems solved. Numbers cannot always be compared across solvers, even within the same prover.

6 of 40

State of machine learning for theorem proving

Several ideas explored so far

  • Tactic selection
  • Lemma selection
  • Full proof step text synthesis
  • A few neural network architectures and other ML algorithms
  • Simple reinforcement learning

Lots more to do

  • Catch up with the last two years of AI research in
    • self-supervised learning
    • reinforcement learning
  • Better benchmarks
  • Lemma creation
  • Definition creation
  • Program synthesis
  • Use arXiv proofs
  • Combine with traditional AI (SMT solvers)
  • New ideas

It shows a lot of promise!

  • Good ML systems solve 20%–60% of theorems
  • Performance similar to (SMT-based) hammers
  • ML starting to be incorporated into theorem provers
  • AI labs (Google AI, OpenAI, Facebook AI, DeepMind, IBM Watson) showing interest in this area (and in Lean!)

7 of 40

Extracting Proof Data from Lean

LeanStep datasets

8 of 40

Proof modes

Tactic proof

    lemma and_swap : p ∧ q → q ∧ p :=
    begin
      intro h,
      cases h with hp hq,
      constructor, exact hq, exact hp
    end

Term proof

    lemma and_swap : p ∧ q ↔ q ∧ p :=
    iff.intro
      (assume h : p ∧ q, show q ∧ p, from and.intro (and.right h) (and.left h))
      (assume h : q ∧ p, show p ∧ q, from and.intro (and.right h) (and.left h))

9 of 40

LeanStep: Tactic proofs

10 of 40

Available information

Tactic proof

    lemma and_swap : p ∧ q → q ∧ p :=
    begin
      intro h,
      cases h with hp hq,
      constructor, exact hq, exact hp
    end

Data to extract

  • Tactic command and position
  • Hypotheses and goals
  • Tactic name
  • Tactic arguments
  • Full abstract syntax tree of the proof
  • Declaration name
  • Hidden tactic state information:
    • Open namespaces
    • Environment
    • Metavariables
    • Other hidden information

p q : Prop, h : p ∧ q ⊢ q ∧ p

tactic.interactive.cases (none, ``(h)) [`hp, `hq]

11 of 40

Getting start position and state

Tactic proof

    lemma and_swap : p ∧ q → q ∧ p :=
    begin
      intro h,
      cases h with hp hq,
      constructor, exact hq, exact hp
    end

Modifying istep

  • Modify istep in Lean core library to trace
    • start position (line and column)
    • declaration name
    • goals
    • tactic state information
  • Compile mathlib library and capture output
  • Process output

Used a Python script to modify the Lean code (a sketch of the injected tracing follows below)

p q : Prop, h : p ∧ q ⊢ q ∧ p
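
A minimal sketch of the injected tracing, in Lean 3 metaprogramming. The names trace_tactic_state and istep_traced are illustrative, not the actual patch, and the real istep signature varies across Lean versions:

    -- Sketch: trace position, declaration name, and goals before a tactic runs.
    meta def trace_tactic_state (line col : ℕ) : tactic unit :=
    do decl ← tactic.decl_name,        -- name of the declaration being elaborated
       st   ← tactic.read,             -- the full tactic state
       tactic.trace format!"POS {line}:{col} DECL {decl}",
       tactic.trace (to_fmt st)        -- pretty-printed hypotheses and goals

    -- Sketch: a wrapped istep that traces before delegating to the original.
    meta def istep_traced {α : Type} (line0 col0 line col : ℕ) (t : tactic α) : tactic unit :=
    trace_tactic_state line col >> tactic.istep line0 col0 line col t

Compiling mathlib with istep routed through such a wrapper emits one record per interactive tactic invocation, which the post-processing step then collects.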

12 of 40

Getting tactic arguments

Tactic proof

    lemma and_swap : p ∧ q → q ∧ p :=
    begin
      intro h,
      cases h with hp hq,
      constructor, exact hq, exact hp
    end

Modifying the parser state

  • Modify the argument parsers of all interactive Lean tactics (in Lean core and mathlib) to trace
    • argument start and end position
    • argument as reflected in Lean
  • Compile Lean library and capture output
  • Process output

Used a Python script to find and modify all interactive tactics (a sketch of the wrapper follows below)

tactic.interactive.cases (none, ``(h)) [`hp, `hq]
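
A minimal sketch of the parser-side tracing, assuming Lean 3 metaprogramming. The wrapper traced is illustrative; the real change patches each interactive tactic's parser annotations:

    -- Sketch: wrap an argument parser so the argument's source span and its
    -- reflected Lean syntax are traced before the value is handed back.
    meta def traced {α : Type} [has_reflect α] (p : lean.parser α) : lean.parser α :=
    do p₀ ← lean.parser.cur_pos,          -- argument start position
       a  ← p,                            -- run the original parser
       p₁ ← lean.parser.cur_pos,          -- argument end position
       let e : expr := reflect a,         -- the argument, reflected (standard coercion)
       lean.parser.of_tactic (tactic.trace
         format!"ARG {p₀.line}:{p₀.column}-{p₁.line}:{p₁.column} {e}"),
       return a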

13 of 40

LeanStep tactic proof dataset

Tactic proof

    lemma and_swap : p ∧ q → q ∧ p :=
    begin
      intro h,
      cases h with hp hq,
      constructor, exact hq, exact hp
    end

A machine learning dataset

  • Input text: Pretty printed hypotheses and goal

  • Output text: Tactic command

  • Good dataset for Transformer models like GPT-f

p q : Prop, h : p ∧ q ⊢ q ∧ p

cases h with hp hq

14 of 40

LeanStep tactic proof dataset

  • 138,720 goal-tactic pairs
  • 18,992 theorems with tactic proofs (out of 42,079 theorems in total)
  • Please keep adding to mathlib!

15 of 40

Highlights

  • Tracing code is written in Lean
  • Automatically inject this code into Lean
  • Easy to maintain and update to newer versions of Lean
  • Can record anything in tactic state:
    • All declarations in environment
    • All meta-variables and their details
  • Easy to record expressions via
    • pretty printed string
    • pp.all strings
    • raw fmt
    • s-expressions
    • json
    • vector embeddings (not yet)
  • Captures every interactive tactic application without issue
  • Dataset and code will be made public when ready
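
For instance, one expression can be rendered in several of the formats above from inside the tactic monad. A minimal sketch (render is an illustrative name; the s-expression and JSON encoders in the actual extraction code are custom and omitted here):

    -- Sketch: print one expression as a pretty-printed string, as a raw
    -- format tree, and in fully elaborated pp.all form.
    meta def render (e : expr) : tactic unit :=
    do f ← tactic.pp e,
       tactic.trace f,                            -- pretty-printed string
       tactic.trace e.to_raw_fmt,                 -- raw fmt tree
       o ← tactic.get_options,
       tactic.set_options (o.set_bool `pp.all tt),
       tactic.pp e >>= tactic.trace,              -- pp.all string
       tactic.set_options o                       -- restore the options

    run_cmd render `(1 + 1)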

16 of 40

This dataset has other uses besides ML

  • Cleaning up mathlib
    • Find and remove tactics which are never executed
    • e.g. the trailing refl in by simp; refl when simp already closes the goal
  • Learning basic statistics about mathlib
    • Most common tactics
    • Longest tactic proof
  • Porting mathlib to Lean 4
    • Dataset contains abstract syntax trees of every tactic proof in mathlib

17 of 40

LeanStep: Term proofs

18 of 40

Lean stores term proofs for all theorems

#print of_iff_true

    theorem of_iff_true : ∀ {a : Prop}, (a ↔ true) → a :=
    λ {a : Prop} (h : a ↔ true), iff.mp (iff.symm h) trivial
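
The same data is available programmatically, so extraction does not need to parse #print output. A minimal sketch (show_proof is an illustrative name):

    -- Sketch: read a theorem's statement and stored proof term from the environment.
    meta def show_proof (n : name) : tactic unit :=
    do d ← tactic.get_decl n,
       tactic.trace d.type,      -- the statement
       tactic.trace d.value      -- the stored proof term

    run_cmd show_proof `of_iff_true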

19 of 40

Generate datasets by adding holes to proof term

Proof term with hole:        _
Hypotheses and goal:         ⊢ ∀ {a : Prop}, (a ↔ true) → a
Term that fulfills the goal: λ {a : Prop} (h : a ↔ true), iff.mp (iff.symm h) trivial

Proof term with hole:        λ {a : Prop}, _
Hypotheses and goal:         a : Prop ⊢ (a ↔ true) → a
Term that fulfills the goal: λ (h : a ↔ true), iff.mp (iff.symm h) trivial

Proof term with hole:        λ {a : Prop} (h : a ↔ true), _
Hypotheses and goal:         a : Prop, h : a ↔ true ⊢ a
Term that fulfills the goal: iff.mp (iff.symm h) trivial

Proof term with hole:        λ {a : Prop} (h : a ↔ true), iff.mp _ trivial
Hypotheses and goal:         a : Prop, h : a ↔ true ⊢ true ↔ a
Term that fulfills the goal: iff.symm h

Proof term with hole:        λ {a : Prop} (h : a ↔ true), iff.mp (iff.symm _) trivial
Hypotheses and goal:         a : Prop, h : a ↔ true ⊢ a ↔ true
Term that fulfills the goal: h

Proof term with hole:        λ {a : Prop} (h : a ↔ true), iff.mp (iff.symm h) _
Hypotheses and goal:         a : Prop, h : a ↔ true ⊢ true
Term that fulfills the goal: trivial
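
A minimal sketch of the traversal behind this table. Real extraction must carry the local context through binders; this version only descends through applications, where the subterms stay closed:

    -- Sketch: collect (subterm, type) pairs; the type is the goal that a
    -- hole `_` at that position must fulfill.
    meta def holes : expr → tactic (list (expr × expr))
    | e@(expr.app f a) := do
        t    ← tactic.infer_type e,
        rest ← (++) <$> holes f <*> holes a,
        return ((e, t) :: rest)
    | e := do t ← tactic.infer_type e, return [(e, t)]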

20 of 40

Possible prediction tasks with this dataset

Why perform auxiliary tasks?

  • They provide simpler tasks on which to test new methods
  • They provide more data for training your model (data augmentation, transfer learning, self-supervised learning)
  • They are subtasks of fulfilling the main theorem-proving task

The tasks, with examples drawn from the of_iff_true proof above:

  • Predict the masked term ("skip-tree task") from a partial proof:
    λ {a : Prop} (h : a ↔ true), iff.mp _ trivial ⟶ iff.symm h
  • Predict the masked proof term from hypotheses and goal:
    a : Prop, h : a ↔ true ⊢ true ↔ a ⟶ iff.symm h
  • Predict whether propositional hypotheses are used in the proof of the goal, e.g. for
    a : Prop, h : a ↔ true ⊢ true
  • Predict the main function of the proof, or its type (cf. GPT-f):
    a : Prop, h : a ↔ true ⊢ true ↔ a ⟶ iff.symm : ∀ {a b : Prop}, (a ↔ b) → (b ↔ a)

21 of 40

A theorem proving AI environment

22 of 40

Current proof search and evaluation

1. Train a model on the LeanStep tactic proof dataset

2. Incorporate it into a Lean testing environment (implemented with Lean metaprogramming)

3. Run greedy search (à la tidy) on test theorems

[Illustration, as on slide 4: the model predicts cases h with hp hq for
p q : Prop, h : p ∧ q ⊢ q ∧ p,
yielding p q : Prop, hp : p, hq : q ⊢ q ∧ p and eventually "no goals".]

23 of 40

tidy proof search

⊢ ∀ n, 2 * n + 1 ≥ 1

[Search-tree diagram: tidy tries its fixed tactic list in order on the goal above: refl (failed), exact dec_trivial (failed), assumption (failed), intros, apply auto_param, .... The intros branch succeeds, yielding n : ℕ ⊢ 2 * n + 1 ≥ 1, where the same list is tried again until the goal is closed ("no goals"). A sketch of this loop follows.]
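
A minimal sketch of the tidy-style loop. The tactic list here is abbreviated; tidy's actual list is longer and configurable:

    -- Sketch: repeatedly run the first tactic from a fixed list that
    -- succeeds, until no tactic applies or no goals remain.
    meta def tidy_like : tactic unit :=
    tactic.repeat $ tactic.first
      [ tactic.reflexivity,
        `[exact dec_trivial],
        tactic.assumption,
        tactic.intros >> tactic.skip ]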

24 of 40

Model-guided greedy proof search

a b : ℕ, h : a.succ < b ⊢ a ≤ b ∧ ¬b ≤ a

[Search diagram: for the goal above, the model proposes tactic candidates such as split, cases le_total a b, and exact ⟨le_of_lt h, not_le_of_lt h⟩; failed candidates are discarded.]

  • Query N tactic commands from the model
  • Follow the first successful tactic (see the sketch below)
  • Repeat the greedy search, up to a fixed depth D
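
A minimal sketch of this greedy loop. Here suggest stands in for the model query and is hypothetical; in the real environment the model returns tactic strings that are parsed and run:

    -- Sketch: at each step ask the model for candidates, follow the first
    -- one that succeeds, and recurse up to a fixed depth D.
    meta def greedy_search (suggest : tactic (list (tactic unit))) : ℕ → tactic unit
    | 0       := tactic.done
    | (d + 1) := tactic.done <|> (do
        cands ← suggest,        -- query N tactic candidates from the model
        tactic.first cands,     -- follow the first successful tactic
        greedy_search d)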

25 of 40

Preliminary Results

26 of 40

Evaluation Strategy

Training:

  • Train model to predict tactic from goals
  • Train on 80% of the LeanStep tactic proof steps

Testing:

  • Run greedy proof search on 3,279 test theorems
  • No tactic proof steps of test theorems were included in the training split
  • Compare against tidy as a baseline


27 of 40

Fairseq transformer (encoder-decoder) model

Setting

  • Uses the public library fairseq
  • Model size: 50M parameters
  • 10 candidates generated by beam search with beam width 10

Results:

  • Proof step test accuracy: 13.8%
  • The fairseq transformer proved 386 of 3,279 test theorems (11.8%)
  • The tidy baseline proved 362 of 3,279 (11.0%)
  • Average successful proof length: 3.36 (fairseq) vs 4.91 (tidy)
  • The two prove different theorems: fairseq <|> tidy (their union) proved 559 theorems (17.0%)

https://github.com/pytorch/fairseq


28 of 40

Fairseq transformer (encoder-decoder) model

29 of 40

Scale up: Large (3B) GPT-f transformer

Setting

  • Model size: 3B parameters
  • Pre-trained on WebMath (as in GPT-f)
  • 64 candidates sampled at temperature 1.2

Results:

  • Proof step test accuracy: 20.1%
  • GPT-f proved 811 of 3,279 test theorems (24.7%)
  • Average successful proof length: 3.36 (fairseq) vs 4.91 (tidy) vs 3.80 (GPT-f)
  • Evaluation is stochastic due to sampling: over 8 evaluation runs it proved 1,179 theorems in total (35.9%)


30 of 40

Scale up: Large (3B) GPT-f transformer

31 of 40

Scale even more: Larger (12B) GPT-f transformer

Setting

  • Model size: 12B parameters
  • Pre-trained on WebMath (as in GPT-f)
  • 64 candidates sampled at temperature 1.2

Results:

  • Proof step test accuracy: 23.4%
  • GPT-f proved 833 of 3,279 test theorems (25.7%)
  • Average successful proof length: 4.43


32 of 40

Scale even more: Larger (12B) GPT-f transformer

33 of 40

Summary

Stats \ Model         | tidy  | Fairseq Transformer | Large GPT-f | Larger GPT-f | Large GPT-f over 8 runs
Success rate          | 11.0% | 11.8%               | 24.8%       | 25.7%        | 35.9%
Proof step accuracy   | n/a   | 13.8%               | 20.1%       | 23.4%        | 20.1%
Model size            | n/a   | 50M                 | 3B          | 12B          | 3B
Average proof length  | 4.91  | 3.36                | 3.80        | 4.43         | 4.75

34 of 40

Interactive Neural Theorem Proving

35 of 40

Interactive neural theorem proving

36 of 40

Demo Video

37 of 40

Thank You!

38 of 40

Appendix

39 of 40

The elephant in the room

  • Mathlib will take a long time to port to Lean 4, but the Lean 3 tactic data might help in the porting.
  • The term proof approach should work in Lean 4 (but with new extraction code).
  • The tactic proof recording won't work as written.
  • We now have a stake in the ground.
  • We have been told that it will be possible to extract this sort of data in Lean 4 in the future.
  • Lean 4 should make it much easier to inspect and record the tactic state since it is not hidden behind a C++ wall.


40 of 40

Why Lean?

Popular and newsworthy

Extensive and growing mathlib library

Great tools and customization

IMO Grand Challenge (In Lean 4)

Easy to learn and use

Active user base and supportive community
