191 lines
6.4 KiB
Markdown
191 lines
6.4 KiB
Markdown
|
|
---
|
|||
|
|
license: apache-2.0
|
|||
|
|
datasets:
|
|||
|
|
- uw-math-ai/APRIL
|
|||
|
|
tags:
|
|||
|
|
- lean4
|
|||
|
|
base_model:
|
|||
|
|
- Goedel-LM/Goedel-Prover-V2-8B
|
|||
|
|
pipeline_tag: text-generation
|
|||
|
|
library_name: transformers
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
# APRIL-Goedel-8B: Lean Proof Repair (Repair Only)
|
|||
|
|
|
|||
|
|
This model is a LoRA finetune of [Goedel-Prover-V2-8B](https://huggingface.co/Goedel-LM/Goedel-Prover-V2-8B) on the [APRIL](https://huggingface.co/datasets/uw-math-ai/APRIL) dataset for **Lean 4 proof repair without explanation supervision**. Given an erroneous Lean proof and compiler feedback, the model directly produces a corrected proof. This variant maximizes single-shot repair accuracy by training exclusively on the repair objective.
|
|||
|
|
|
|||
|
|
## Model Details
|
|||
|
|
|
|||
|
|
- **Base model:** Goedel-Prover-V2-8B
|
|||
|
|
- **Method:** Supervised finetuning with LoRA (rank 32, α = 64)
|
|||
|
|
- **Training data:** APRIL — 260K paired erroneous/correct Lean proofs with compiler diagnostics (explanations excluded from supervision)
|
|||
|
|
- **Task:** Proof repair only
|
|||
|
|
- **Lean version:** 4.22.0-rc4
|
|||
|
|
|
|||
|
|
## Results
|
|||
|
|
|
|||
|
|
Single-shot proof repair accuracy (pass@1) on the APRIL test set (1,835 examples), evaluated by Lean compilation:
|
|||
|
|
|
|||
|
|
| Error Type | This Model (w/o exp) | With Explanations | Base Goedel-8B | Goedel-32B |
|
|||
|
|
|---|---|---|---|---|
|
|||
|
|
| **Full** | **36.7%** | 34.6% | 15.5% | 26.8% |
|
|||
|
|
| Tactic | **48.5%** | 41.7% | 19.6% | 34.2% |
|
|||
|
|
| Line | **25.5%** | 18.5% | 20.0% | 28.5% |
|
|||
|
|
| Theorem | **37.5%** | 36.8% | 12.7% | 23.0% |
|
|||
|
|
| Multi-Line | **24.3%** | 20.8% | 19.4% | 32.6% |
|
|||
|
|
|
|||
|
|
Training exclusively for repair (without explanation supervision) yields the highest pass@1 accuracy, gaining ~2% over the joint variant. However, this model does not produce human-interpretable diagnostics. See the [with-explanation variant](https://huggingface.co/uw-math-ai/gAPRIL-w-exp) for the trade-off discussion.
|
|||
|
|
|
|||
|
|
## Model & Dataset Download
|
|||
|
|
|
|||
|
|
| Resource | Description | Link |
|
|||
|
|
|---|---|---|
|
|||
|
|
| **APRIL Dataset** | 260K Lean proof-repair tuples with compiler diagnostics and explanations | [uw-math-ai/APRIL](https://huggingface.co/datasets/uw-math-ai/APRIL) |
|
|||
|
|
| **gAPRIL-w-exp** | Goedel-8B finetuned on APRIL with joint explanation supervision | [uw-math-ai/gAPRIL-w-exp](https://huggingface.co/uw-math-ai/gAPRIL-w-exp) |
|
|||
|
|
| **gAPRIL-wo-exp** | Goedel-8B finetuned on APRIL for repair only (no explanations) | [uw-math-ai/gAPRIL-wo-exp](https://huggingface.co/uw-math-ai/gAPRIL-wo-exp) |
|
|||
|
|
|
|||
|
|
## Usage
|
|||
|
|
|
|||
|
|
The model expects a chat-formatted prompt with the erroneous proof, goal state, error line, and compiler error message. The assistant response contains the corrected proof in a `lean` code block.
|
|||
|
|
|
|||
|
|
**System:** `You are diagnosing a single failing proof`
|
|||
|
|
|
|||
|
|
**User:**
|
|||
|
|
```
|
|||
|
|
Explain the error, suggest a fix, and provide the corrected proof based on the context:
|
|||
|
|
|
|||
|
|
Incorrect Proof: <erroneous proof>
|
|||
|
|
State: <goal state before error from InfoView>
|
|||
|
|
Line at error: <error-occurred line of code>
|
|||
|
|
Lean error: <error messages from InfoView>
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**Assistant** (model output):
|
|||
|
|
```
|
|||
|
|
Explanation: <explanation of error cause>
|
|||
|
|
Fix: <code manipulation fix suggestion>
|
|||
|
|
Corrected Proof: <corrected proof>
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
### Example Inference
|
|||
|
|
|
|||
|
|
````python
|
|||
|
|
import re
|
|||
|
|
|
|||
|
|
from transformers import AutoModelForCausalLM, AutoTokenizer
|
|||
|
|
import torch
|
|||
|
|
|
|||
|
|
torch.manual_seed(42)
|
|||
|
|
|
|||
|
|
|
|||
|
|
def extract_proof_from_text(output):
|
|||
|
|
lean_codes = re.findall(r"```lean\s*(.*?)\s*```", output, re.DOTALL)
|
|||
|
|
if not lean_codes or len(lean_codes) == 0:
|
|||
|
|
lean_codes = re.findall(r"```lean4\s*(.*?)\s*```", output, re.DOTALL)
|
|||
|
|
words = ["by", ":="]
|
|||
|
|
|
|||
|
|
for i in range(len(lean_codes)):
|
|||
|
|
lean_code = lean_codes[-i - 1]
|
|||
|
|
if all(word in lean_code for word in words):
|
|||
|
|
return lean_code
|
|||
|
|
return None
|
|||
|
|
|
|||
|
|
|
|||
|
|
model_id = "uw-math-ai/gAPRIL-wo-exp"
|
|||
|
|
tokenizer = AutoTokenizer.from_pretrained(model_id)
|
|||
|
|
model = AutoModelForCausalLM.from_pretrained(
|
|||
|
|
model_id, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True
|
|||
|
|
)
|
|||
|
|
|
|||
|
|
system_prompt = (
|
|||
|
|
"You are a Lean 4 programmer diagnosing a single failing proof. "
|
|||
|
|
"Assume you only see the incorrect proof text, the infoview state"
|
|||
|
|
" near the failure, and Lean's error message."
|
|||
|
|
)
|
|||
|
|
|
|||
|
|
# Context information for the incorrect proof
|
|||
|
|
incorrect_proof = """
|
|||
|
|
theorem lean_problem : IsLeast {x : ℕ | x > 0 ∧ (7 * x) % 100 = 29} 47 := by
|
|||
|
|
constructor
|
|||
|
|
· constructor
|
|||
|
|
· norm_num
|
|||
|
|
· norm_num
|
|||
|
|
· intro x ⟨hx_pos, hx_cong⟩
|
|||
|
|
by_contra h
|
|||
|
|
push_neg at h
|
|||
|
|
obtain ⟨h_le, h_ne⟩ := lt_iff_le_and_ne.mp h
|
|||
|
|
have h_lt := h_le
|
|||
|
|
revert x hx_pos hx_cong h_lt
|
|||
|
|
refine' Nat.le_induction _ _ 47 _ <;> intros x hx_lt hx_pos hx_cong
|
|||
|
|
· rfl
|
|||
|
|
· have : x < 47 := by omega
|
|||
|
|
interval_cases x
|
|||
|
|
all_goals try { norm_num at hx_cong; norm_num }
|
|||
|
|
""".strip()
|
|||
|
|
|
|||
|
|
infoview_state = (
|
|||
|
|
"case right.intro.refine'_4 ⊢ ∀ (n : ℕ), sorry ≤ n → "
|
|||
|
|
"(∀ ⦃x : ℕ⦄, x > 0 → 7 * x % 100 = 29 → x < n → x ≤ n → x ≠ n → x ≤ n → False) → "
|
|||
|
|
"∀ ⦃x : ℕ⦄, x > 0 → 7 * x % 100 = 29 → x < n + 1 → x ≤ n + 1 → x ≠ n + 1 → x ≤ n + 1 → False"
|
|||
|
|
)
|
|||
|
|
|
|||
|
|
line_at_error = "refine' Nat.le_induction _ _ 47 _ <;> intros x hx_lt hx_pos hx_cong"
|
|||
|
|
|
|||
|
|
error_message = (
|
|||
|
|
"tactic 'introN' failed, insufficient number of binders\n"
|
|||
|
|
"case right.intro.refine'_1\n⊢ ℕ"
|
|||
|
|
)
|
|||
|
|
|
|||
|
|
user_prompt = f"""
|
|||
|
|
**Instruction:** Provide the full corrected Lean 4 theorem/proof in a single ```lean``` code block.
|
|||
|
|
|
|||
|
|
**Context:**
|
|||
|
|
|
|||
|
|
Incorrect proof:
|
|||
|
|
```lean
|
|||
|
|
{incorrect_proof}
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
Infoview state:
|
|||
|
|
{infoview_state}
|
|||
|
|
|
|||
|
|
Line at error:
|
|||
|
|
{line_at_error}
|
|||
|
|
|
|||
|
|
Lean error:
|
|||
|
|
{error_message}
|
|||
|
|
""".strip()
|
|||
|
|
|
|||
|
|
|
|||
|
|
chat = [
|
|||
|
|
{"role": "system", "content": system_prompt},
|
|||
|
|
{"role": "user", "content": user_prompt},
|
|||
|
|
]
|
|||
|
|
|
|||
|
|
inputs = tokenizer.apply_chat_template(
|
|||
|
|
chat, tokenize=True, add_generation_prompt=True, return_tensors="pt"
|
|||
|
|
).to(model.device)
|
|||
|
|
|
|||
|
|
outputs = model.generate(inputs, max_new_tokens=8192)
|
|||
|
|
decoded_outputs = tokenizer.batch_decode(outputs)
|
|||
|
|
proof = extract_proof_from_text(decoded_outputs[0])
|
|||
|
|
print(proof)
|
|||
|
|
````
|
|||
|
|
|
|||
|
|
## Citation
|
|||
|
|
|
|||
|
|
```bibtex
|
|||
|
|
@article{wang2026repairlean,
|
|||
|
|
title = {Learning to Repair Lean Proofs from Compiler Feedback},
|
|||
|
|
author = {Wang, Evan and Chess, Simon and Lee, Daniel and Ge, Siyuan and Mallavarapu, Ajit and Ilin, Vasily},
|
|||
|
|
journal= {arXiv preprint arXiv:2602.02990},
|
|||
|
|
year = {2026},
|
|||
|
|
doi = {10.48550/arXiv.2602.02990},
|
|||
|
|
url = {https://arxiv.org/abs/2602.02990}
|
|||
|
|
}
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
## Acknowledgements
|
|||
|
|
|
|||
|
|
Work by the Math AI Lab, University of Washington. Supported by UW eScience School, UW IT (AWS credits), UW Department of Applied Mathematics (GPU access), and Nebius (LLM cloud credits).
|