InternLM2-Step-Prover is a 7B language model trained on Lean-GitHub and multiple synthesized datasets. It achieves state-of-the-art performance on the MiniF2F, ProofNet, and Putnam math benchmarks, demonstrating formal theorem-proving ability across multiple domains.
## Dialogue Example
### Input Example
```
DECL MyNat.mul_pow
GOAL a b n : ℕ
⊢ (a * b) ^ n = a ^ n * b ^ n
```
### Output Example
```
PROOFSTEP induction n with t Ht
```
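For context, the suggested tactic opens a standard induction on `n`. A complete proof in this style might look like the sketch below; it is stated over Mathlib's `ℕ` rather than the `MyNat` of the original declaration, and the case-level tactics are illustrative assumptions, not model output.

```lean
import Mathlib

-- Hedged sketch: one way the model's suggested `induction` step could be
-- completed in Mathlib-style Lean 4. The case tactics are illustrative
-- assumptions, not output from InternLM2-Step-Prover.
example (a b n : ℕ) : (a * b) ^ n = a ^ n * b ^ n := by
  induction n with
  | zero => simp                       -- (a*b)^0 = 1 = a^0 * b^0
  | succ t Ht =>                       -- Ht is the induction hypothesis
    rw [pow_succ, pow_succ, pow_succ, Ht]
    ring                               -- close by semiring normalization
```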
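To reproduce this kind of dialogue programmatically, one could wrap the tactic state in the same `DECL`/`GOAL` layout and sample a `PROOFSTEP` completion. The sketch below uses Hugging Face `transformers`; the model ID and generation settings are assumptions for illustration, not documented defaults.

```python
# Hedged sketch: querying the prover with the DECL/GOAL -> PROOFSTEP format
# shown above. The model ID and generation parameters are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "internlm/internlm2-step-prover"  # assumed Hugging Face model ID
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

# Build the prompt: declaration name, then the current tactic state.
prompt = (
    "DECL MyNat.mul_pow\n"
    "GOAL a b n : ℕ\n"
    "⊢ (a * b) ^ n = a ^ n * b ^ n\n"
    "PROOFSTEP "
)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)

# Decode only the newly generated tokens (the suggested tactic).
step = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(step)  # e.g. "induction n with t Ht"
```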
## Performance

### MiniF2F
| Method | Model size | Pass | miniF2F-valid | miniF2F-test |
|--------|------------|------|---------------|--------------|
| **Whole-Proof Generation Methods** | | | | |
| GPT-4-turbo 0409 | - | 64 | 25.4% | 23.0% |
| DeepSeekMath-Base | 7B | 128 | 25.4% | 27.5% |
| DeepSeek-Prover | 7B | 1 | - | 30.0% |
| | | 64 | - | 46.3% |
| | | 128 | - | 46.3% |
| | | 8192 | - | 48.8% |
| | | 65536 | - | 50.0% |
| | | cumulative | 60.2% | 52.0% |
| TheoremLlama | - | cumulative | 36.5% | 33.6% |
| **Tree Search Methods** | | | | |
| COPRA (GPT-3.5) | - | 1 | - | 9.0% |
| COPRA (GPT-4) | - | 1 | - | 26.6% |
| DSP (Isabelle) | 540B | 100 | 42.6% | 38.9% |
| Proof Artifact Co-Training | 837M | 1 | 23.9% | 24.6% |
| | | 8 | 29.3% | 29.2% |
| ReProver | 229M | 1 | - | 25.0% |
| Llemma | 7B | 1 | 26.2% | 26.2% |
| Llemma | 34B | 1 | 27.9% | 25.8% |
| Curriculum Learning | 837M | 1 | 33.6% | 29.6% |
| | | 8 | 41.2% | 34.5% |
| | | 64 | 47.3% | 36.6% |
| Hypertree Proof Search | 600M | cumulative | 58.6% | - |
| | | 64 | - | 41.0% |
| Lean-STaR | 7B | 64 | - | 46.3% |
| InternLM2-Math | 7B | 1 | 29.9% | 30.3% |
| InternLM2-Math-Plus | 7B | 1 | - | 43.4% |
| InternLM2-Step-Prover | 7B | 1 | 59.8% | 48.8% |
| InternLM2-Step-Prover | 7B | 64 | 63.9% | 54.5% |
### ProofNet & Putnam
| Method | Model size | Pass | Result |
|--------|------------|------|--------|
| **ProofNet benchmark** | | | |
| ReProver | 229M | 1 | 13.8% |
| InternLM2-Step-Prover | 7B | 1 | 18.1% |
| **Putnam benchmark** | | | |
| GPT-4 | - | 10 | 1/640 |
| COPRA (GPT-4) | - | 10 | 1/640 |
| DSP (Isabelle) | 540B | 10 | 4/640 |
| ReProver | 229M | 1 | 0/640 |
| InternLM2-Step-Prover | 7B | 1 | 5/640 |
## Citation and Tech Report
```bibtex
@misc{wu2024leangithubcompilinggithublean,
  title={LEAN-GitHub: Compiling GitHub LEAN repositories for a versatile LEAN prover},
  author={Zijian Wu and Jiayu Wang and Dahua Lin and Kai Chen},
  year={2024},
  eprint={2407.17227},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2407.17227},
}
```