243 lines
8.8 KiB
Markdown
243 lines
8.8 KiB
Markdown
|
|
---
|
|||
|
|
license: apache-2.0
|
|||
|
|
base_model: Qwen/Qwen3-4B-Thinking-2507
|
|||
|
|
tags:
|
|||
|
|
- math
|
|||
|
|
- reasoning
|
|||
|
|
- olympiad
|
|||
|
|
- proof-generation
|
|||
|
|
- sft
|
|||
|
|
- distillation
|
|||
|
|
library_name: transformers
|
|||
|
|
pipeline_tag: text-generation
|
|||
|
|
datasets:
|
|||
|
|
- lm-provers/FineProofs-SFT
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
# QED-Nano SFT
|
|||
|
|
|
|||
|
|

|
|||
|
|
|
|||
|
|
|
|||
|
|
## Table of Contents
|
|||
|
|
|
|||
|
|
1. [Model Summary](#model-summary)
|
|||
|
|
2. [How to use](#how-to-use)
|
|||
|
|
3. [Evaluation](#evaluation)
|
|||
|
|
4. [Training](#training)
|
|||
|
|
5. [Limitations](#limitations)
|
|||
|
|
6. [License](#license)
|
|||
|
|
|
|||
|
|
## Model Summary
|
|||
|
|
|
|||
|
|
**QED-Nano SFT** is a 4B parameter mathematical reasoning model fine-tuned on olympiad-level proof problems. This is the supervised fine-tuning (SFT) checkpoint trained via knowledge distillation from [deepseek-ai/DeepSeek-Math-V2](https://huggingface.co/deepseek-ai/DeepSeek-Math-V2) (685B parameters). The model generates mathematical proofs with explicit chain-of-thought reasoning enclosed in `<think>` tags.
|
|||
|
|
|
|||
|
|
This model serves as the foundation for the full QED-Nano pipeline, which includes subsequent reinforcement learning and reasoning cache extensions. For the complete trained model, see [lm-provers/QED-Nano](https://huggingface.co/lm-provers/QED-Nano).
|
|||
|
|
|
|||
|
|
### Key features
|
|||
|
|
- Reasoning model optimized for **theorem proving**
|
|||
|
|
- **Supervised fine-tuning checkpoint**: Foundation for the full QED-Nano model (SFT + RL + Reasoning Cache)
|
|||
|
|
- **Distilled from DeepSeek-Math-V2 (685B)**: Knowledge distillation to 4B parameters
|
|||
|
|
- **Fully open**: open weights + full training details including public data mixture and training configs
|
|||
|
|
|
|||
|
|
For more details refer to our blog post: https://huggingface.co/spaces/lm-provers/qed-nano-blogpost
|
|||
|
|
|
|||
|
|
## How to use
|
|||
|
|
|
|||
|
|
```python
|
|||
|
|
from transformers import AutoModelForCausalLM, AutoTokenizer
|
|||
|
|
|
|||
|
|
model_name = "lm-provers/QED-Nano-SFT"
|
|||
|
|
device = "cuda" # for GPU usage or "cpu" for CPU usage
|
|||
|
|
|
|||
|
|
# load the tokenizer and the model
|
|||
|
|
tokenizer = AutoTokenizer.from_pretrained(model_name)
|
|||
|
|
model = AutoModelForCausalLM.from_pretrained(
|
|||
|
|
model_name,
|
|||
|
|
).to(device)
|
|||
|
|
|
|||
|
|
# prepare the model input
|
|||
|
|
prompt = "Generate a rigorous proof to the following question: is \sqrt{2} rational or irrational?"
|
|||
|
|
messages_think = [
|
|||
|
|
{"role": "user", "content": prompt}
|
|||
|
|
]
|
|||
|
|
|
|||
|
|
text = tokenizer.apply_chat_template(
|
|||
|
|
messages_think,
|
|||
|
|
tokenize=False,
|
|||
|
|
add_generation_prompt=True,
|
|||
|
|
)
|
|||
|
|
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
|
|||
|
|
|
|||
|
|
# Generate the output
|
|||
|
|
generated_ids = model.generate(**model_inputs, max_new_tokens=32768)
|
|||
|
|
|
|||
|
|
# Get and decode the output
|
|||
|
|
output_ids = generated_ids[0][len(model_inputs.input_ids[0]) :]
|
|||
|
|
print(tokenizer.decode(output_ids, skip_special_tokens=True))
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
>[!TIP]
|
|||
|
|
> We recommend setting `temperature=0.6` and `top_p=0.95` in the sampling parameters.
|
|||
|
|
|
|||
|
|
### vLLM and SGLang
|
|||
|
|
|
|||
|
|
You can use vLLM and SGLang to deploy the model in an API compatible with OpenAI format.
|
|||
|
|
|
|||
|
|
#### SGLang
|
|||
|
|
|
|||
|
|
```bash
|
|||
|
|
python -m sglang.launch_server --model-path lm-provers/QED-Nano-SFT
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
#### vLLM
|
|||
|
|
|
|||
|
|
```bash
|
|||
|
|
vllm serve lm-provers/QED-Nano-SFT
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
## Evaluation
|
|||
|
|
|
|||
|
|
In this section, we report the evaluation results of QED-Nano on IMO-ProofBench, ProofBench, and IMO-AnswerBench. All evaluations except those on IMO-AnswerBench are reported as avg@3 unless stated otherwise.
|
|||
|
|
|
|||
|
|
|
|||
|
|
| Model | IMO-ProofBench | ProofBench | IMO-AnswerBench |
|
|||
|
|
|:---|:---:|:---:|:---:|
|
|||
|
|
| Qwen3-4B-Thinking-2507 | 20.4 (2.6) | 19.5 (0.9) | 55.8 |
|
|||
|
|
| **QED-Nano-SFT** | **39.5 (2.9)** | **33.3 (0.5)** | **57.5** |
|
|||
|
|
| **QED-Nano** | **40.0 (0.6)** | **44.9 (3.4)** | **67.5** |
|
|||
|
|
| **QED-Nano (Agent)** | **54.0 (3.7)** | **54.4 (2.4)** | **-** |
|
|||
|
|
| Qwen3-30B-A3B-Thinking-2507 | 27.6 (1.0) | 26.1 (2.4) | 67.0 |
|
|||
|
|
| Qwen3-235B-A22B-Thinking-2507 | 34.1 (0.7) | 33.7 (1.1) | 70.5 |
|
|||
|
|
| Nomos-1 | 40.3 (3.5) | 28.3 (3.9) | 49.0 |
|
|||
|
|
| GPT-OSS-20B | 38.3 (1.2) | 38.4 (3.9) | 61.5 |
|
|||
|
|
| GPT-OSS-120B | 43.1 (3.2) | 47.5 (1.7) | 70.5 |
|
|||
|
|
| DeepSeek-Math-V2 | 57.9 (2.0) | 60.6 (0.1) | 75.8 |
|
|||
|
|
| Gemini 3 Pro | 58.7 (2.9) | 66.7 (3.1) | 83.2 |
|
|||
|
|
|
|||
|
|
**Note**: This is the SFT-only checkpoint. Performance improves significantly with RL training and reasoning cache. See the full [lm-provers/QED-Nano](https://huggingface.co/lm-provers/QED-Nano) model for complete results.
|
|||
|
|
|
|||
|
|
## Training
|
|||
|
|
|
|||
|
|
### Model
|
|||
|
|
|
|||
|
|
- **Architecture:** Transformer decoder
|
|||
|
|
- **Precision:** bfloat16
|
|||
|
|
- **Base Model**: [Qwen/Qwen3-4B-Thinking-2507](https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507)
|
|||
|
|
- **Parameters**: 4 billion
|
|||
|
|
- **Training Data**: [lm-provers/FineProofs-SFT](https://huggingface.co/datasets/lm-provers/FineProofs-SFT) (4,300 unique problems)
|
|||
|
|
- **Teacher Model**: [deepseek-ai/DeepSeek-Math-V2](https://huggingface.co/deepseek-ai/DeepSeek-Math-V2) (685B)
|
|||
|
|
|
|||
|
|
### Training Hyperparameters
|
|||
|
|
|
|||
|
|
- **Optimization Steps**: 620
|
|||
|
|
- **Global Batch Size**: 32
|
|||
|
|
- **Learning Rate Schedule**: Cosine with peak at 3×10⁻⁵
|
|||
|
|
- **Max Sequence Length**: 45,056 tokens
|
|||
|
|
- **Training Configuration**: Unique problems only (4.3k samples outperformed 7.7k with duplicates)
|
|||
|
|
|
|||
|
|
### Software & hardware
|
|||
|
|
|
|||
|
|
- **GPUs:** 1 node with 8 H100
|
|||
|
|
- **Training time:** 5 hours
|
|||
|
|
- **Training Framework:** [TRL (Transformer Reinforcement Learning)](https://github.com/huggingface/trl)
|
|||
|
|
- **Evaluation framework:** [CMU-AIRe/QED-Nano](https://github.com/CMU-AIRe/QED-Nano)
|
|||
|
|
|
|||
|
|
### Training script
|
|||
|
|
|
|||
|
|
You can reproduce this training using the TRL library:
|
|||
|
|
|
|||
|
|
```python
|
|||
|
|
from datasets import load_dataset
|
|||
|
|
from transformers import AutoModelForCausalLM, AutoTokenizer
|
|||
|
|
from trl import SFTConfig, SFTTrainer
|
|||
|
|
|
|||
|
|
# Load model and tokenizer
|
|||
|
|
model = AutoModelForCausalLM.from_pretrained(
|
|||
|
|
"Qwen/Qwen3-4B-Thinking-2507",
|
|||
|
|
torch_dtype="auto",
|
|||
|
|
attn_implementation="kernels-community/vllm-flash-attn3",
|
|||
|
|
)
|
|||
|
|
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Thinking-2507")
|
|||
|
|
|
|||
|
|
# Load dataset
|
|||
|
|
dataset = load_dataset("lm-provers/Olympiads-SFT", split="train")
|
|||
|
|
|
|||
|
|
# Training configuration
|
|||
|
|
training_args = SFTConfig(
|
|||
|
|
output_dir="./qed-nano-sft",
|
|||
|
|
max_steps=620,
|
|||
|
|
per_device_train_batch_size=2,
|
|||
|
|
gradient_accumulation_steps=2,
|
|||
|
|
learning_rate=3.0e-5,
|
|||
|
|
lr_scheduler_type="cosine",
|
|||
|
|
warmup_ratio=0.03,
|
|||
|
|
max_length=45056,
|
|||
|
|
bf16=True,
|
|||
|
|
gradient_checkpointing=True,
|
|||
|
|
logging_steps=1,
|
|||
|
|
save_steps=62,
|
|||
|
|
push_to_hub=True,
|
|||
|
|
)
|
|||
|
|
|
|||
|
|
# Train
|
|||
|
|
trainer = SFTTrainer(model=model, tokenizer=tokenizer, args=training_args, train_dataset=dataset)
|
|||
|
|
trainer.train()
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
## Limitations
|
|||
|
|
|
|||
|
|
QED-Nano SFT is a supervised fine-tuning checkpoint with several important limitations:
|
|||
|
|
|
|||
|
|
### Critical Limitations
|
|||
|
|
|
|||
|
|
1. **Length Explosion**: This SFT-only model suffers from uncontrolled reasoning length growth. It frequently generates responses exceeding 100k tokens, often meandering or repeating steps without converging to a solution. This is a known limitation of pure supervised learning on long-form reasoning tasks.
|
|||
|
|
|
|||
|
|
2. **Correctness Issues**: While the model produces fluent mathematical text, it may generate incorrect proofs. The SFT stage does not directly optimize for correctness—only for imitating the teacher's reasoning style.
|
|||
|
|
|
|||
|
|
3. **No Self-Correction**: Unlike the full QED-Nano model with RL training, this checkpoint lacks the ability to verify and refine its solutions.
|
|||
|
|
|
|||
|
|
### Other Limitations
|
|||
|
|
|
|||
|
|
- **Domain-specific**: Designed specifically for theorem proving. Using as a general assistant will likely produce nonsense outside of this domain.
|
|||
|
|
- **Language**: English only
|
|||
|
|
- **Text-Only**: Cannot handle problems with diagrams or visual elements
|
|||
|
|
- **Teacher Errors**: May have inherited some errors from the teacher model (DeepSeek-Math-V2)
|
|||
|
|
|
|||
|
|
### Recommended Mitigations
|
|||
|
|
|
|||
|
|
- Use the full [lm-provers/QED-Nano](https://huggingface.co/lm-provers/QED-Nano) model for production use
|
|||
|
|
- Apply length penalties or early stopping to control generation length
|
|||
|
|
- Combine with external verification tools or judges
|
|||
|
|
- Consider RL fine-tuning to address correctness and length issues
|
|||
|
|
|
|||
|
|
These models should be used as assistive tools rather than definitive sources of information. Users should always verify important information and critically evaluate any generated content.
|
|||
|
|
|
|||
|
|
## License
|
|||
|
|
[Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
|
|||
|
|
|
|||
|
|
## Citation
|
|||
|
|
|
|||
|
|
If you use this model in your research, please cite:
|
|||
|
|
|
|||
|
|
```bibtex
|
|||
|
|
@model{qed_nano_sft_2026,
|
|||
|
|
title={QED-Nano SFT: Distilling Olympiad Mathematical Reasoning to 4B Parameters},
|
|||
|
|
author={Edward Beeching and Jasper Dekoninck and Jia Li and Yuxiao Qu and Amrith Setlur and Ian Wu and Aviral Kumar and Lewis Tunstall},
|
|||
|
|
year={2026},
|
|||
|
|
publisher={Hugging Face},
|
|||
|
|
howpublished={\url{https://huggingface.co/lm-provers/QED-Nano-SFT}}
|
|||
|
|
}
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
Please also cite the training dataset:
|
|||
|
|
|
|||
|
|
```bibtex
|
|||
|
|
@dataset{fineproofs_sft_dataset_2026,
|
|||
|
|
title={FineProofs SFT Dataset: Mathematical Olympiad Problems with Chain-of-Thought Reasoning},
|
|||
|
|
author={Edward Beeching and Jasper Dekoninck and Jia Li and Yuxiao Qu and Amrith Setlur and Ian Wu and Aviral Kumar and Lewis Tunstall},
|
|||
|
|
year={2026},
|
|||
|
|
publisher={Hugging Face},
|
|||
|
|
howpublished={\url{https://huggingface.co/datasets/lm-provers/FineProofs-SFT}}
|
|||
|
|
}
|
|||
|
|
```
|