243 lines
8.8 KiB
Markdown
243 lines
8.8 KiB
Markdown
---
|
||
license: apache-2.0
|
||
base_model: Qwen/Qwen3-4B-Thinking-2507
|
||
tags:
|
||
- math
|
||
- reasoning
|
||
- olympiad
|
||
- proof-generation
|
||
- sft
|
||
- distillation
|
||
library_name: transformers
|
||
pipeline_tag: text-generation
|
||
datasets:
|
||
- lm-provers/FineProofs-SFT
|
||
---
|
||
|
||
# QED-Nano SFT
|
||
|
||

|
||
|
||
|
||
## Table of Contents
|
||
|
||
1. [Model Summary](#model-summary)
|
||
2. [How to use](#how-to-use)
|
||
3. [Evaluation](#evaluation)
|
||
4. [Training](#training)
|
||
5. [Limitations](#limitations)
|
||
6. [License](#license)
|
||
|
||
## Model Summary
|
||
|
||
**QED-Nano SFT** is a 4B parameter mathematical reasoning model fine-tuned on olympiad-level proof problems. This is the supervised fine-tuning (SFT) checkpoint trained via knowledge distillation from [deepseek-ai/DeepSeek-Math-V2](https://huggingface.co/deepseek-ai/DeepSeek-Math-V2) (685B parameters). The model generates mathematical proofs with explicit chain-of-thought reasoning enclosed in `<think>` tags.
|
||
|
||
This model serves as the foundation for the full QED-Nano pipeline, which includes subsequent reinforcement learning and reasoning cache extensions. For the complete trained model, see [lm-provers/QED-Nano](https://huggingface.co/lm-provers/QED-Nano).
|
||
|
||
### Key features
|
||
- Reasoning model optimized for **theorem proving**
|
||
- **Supervised fine-tuning checkpoint**: Foundation for the full QED-Nano model (SFT + RL + Reasoning Cache)
|
||
- **Distilled from DeepSeek-Math-V2 (685B)**: Knowledge distillation to 4B parameters
|
||
- **Fully open**: open weights + full training details including public data mixture and training configs
|
||
|
||
For more details refer to our blog post: https://huggingface.co/spaces/lm-provers/qed-nano-blogpost
|
||
|
||
## How to use
|
||
|
||
```python
|
||
from transformers import AutoModelForCausalLM, AutoTokenizer
|
||
|
||
model_name = "lm-provers/QED-Nano-SFT"
|
||
device = "cuda" # for GPU usage or "cpu" for CPU usage
|
||
|
||
# load the tokenizer and the model
|
||
tokenizer = AutoTokenizer.from_pretrained(model_name)
|
||
model = AutoModelForCausalLM.from_pretrained(
|
||
model_name,
|
||
).to(device)
|
||
|
||
# prepare the model input
|
||
prompt = "Generate a rigorous proof to the following question: is \sqrt{2} rational or irrational?"
|
||
messages_think = [
|
||
{"role": "user", "content": prompt}
|
||
]
|
||
|
||
text = tokenizer.apply_chat_template(
|
||
messages_think,
|
||
tokenize=False,
|
||
add_generation_prompt=True,
|
||
)
|
||
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
|
||
|
||
# Generate the output
|
||
generated_ids = model.generate(**model_inputs, max_new_tokens=32768)
|
||
|
||
# Get and decode the output
|
||
output_ids = generated_ids[0][len(model_inputs.input_ids[0]) :]
|
||
print(tokenizer.decode(output_ids, skip_special_tokens=True))
|
||
```
|
||
|
||
>[!TIP]
|
||
> We recommend setting `temperature=0.6` and `top_p=0.95` in the sampling parameters.
|
||
|
||
### vLLM and SGLang
|
||
|
||
You can use vLLM and SGLang to deploy the model in an API compatible with OpenAI format.
|
||
|
||
#### SGLang
|
||
|
||
```bash
|
||
python -m sglang.launch_server --model-path lm-provers/QED-Nano-SFT
|
||
```
|
||
|
||
#### vLLM
|
||
|
||
```bash
|
||
vllm serve lm-provers/QED-Nano-SFT
|
||
```
|
||
|
||
## Evaluation
|
||
|
||
In this section, we report the evaluation results of QED-Nano on IMO-ProofBench, ProofBench, and IMO-AnswerBench. All evaluations except those on IMO-AnswerBench are reported as avg@3 unless stated otherwise.
|
||
|
||
|
||
| Model | IMO-ProofBench | ProofBench | IMO-AnswerBench |
|
||
|:---|:---:|:---:|:---:|
|
||
| Qwen3-4B-Thinking-2507 | 20.4 (2.6) | 19.5 (0.9) | 55.8 |
|
||
| **QED-Nano-SFT** | **39.5 (2.9)** | **33.3 (0.5)** | **57.5** |
|
||
| **QED-Nano** | **40.0 (0.6)** | **44.9 (3.4)** | **67.5** |
|
||
| **QED-Nano (Agent)** | **54.0 (3.7)** | **54.4 (2.4)** | **-** |
|
||
| Qwen3-30B-A3B-Thinking-2507 | 27.6 (1.0) | 26.1 (2.4) | 67.0 |
|
||
| Qwen3-235B-A22B-Thinking-2507 | 34.1 (0.7) | 33.7 (1.1) | 70.5 |
|
||
| Nomos-1 | 40.3 (3.5) | 28.3 (3.9) | 49.0 |
|
||
| GPT-OSS-20B | 38.3 (1.2) | 38.4 (3.9) | 61.5 |
|
||
| GPT-OSS-120B | 43.1 (3.2) | 47.5 (1.7) | 70.5 |
|
||
| DeepSeek-Math-V2 | 57.9 (2.0) | 60.6 (0.1) | 75.8 |
|
||
| Gemini 3 Pro | 58.7 (2.9) | 66.7 (3.1) | 83.2 |
|
||
|
||
**Note**: This is the SFT-only checkpoint. Performance improves significantly with RL training and reasoning cache. See the full [lm-provers/QED-Nano](https://huggingface.co/lm-provers/QED-Nano) model for complete results.
|
||
|
||
## Training
|
||
|
||
### Model
|
||
|
||
- **Architecture:** Transformer decoder
|
||
- **Precision:** bfloat16
|
||
- **Base Model**: [Qwen/Qwen3-4B-Thinking-2507](https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507)
|
||
- **Parameters**: 4 billion
|
||
- **Training Data**: [lm-provers/FineProofs-SFT](https://huggingface.co/datasets/lm-provers/FineProofs-SFT) (4,300 unique problems)
|
||
- **Teacher Model**: [deepseek-ai/DeepSeek-Math-V2](https://huggingface.co/deepseek-ai/DeepSeek-Math-V2) (685B)
|
||
|
||
### Training Hyperparameters
|
||
|
||
- **Optimization Steps**: 620
|
||
- **Global Batch Size**: 32
|
||
- **Learning Rate Schedule**: Cosine with peak at 3×10⁻⁵
|
||
- **Max Sequence Length**: 45,056 tokens
|
||
- **Training Configuration**: Unique problems only (4.3k samples outperformed 7.7k with duplicates)
|
||
|
||
### Software & hardware
|
||
|
||
- **GPUs:** 1 node with 8 H100
|
||
- **Training time:** 5 hours
|
||
- **Training Framework:** [TRL (Transformer Reinforcement Learning)](https://github.com/huggingface/trl)
|
||
- **Evaluation framework:** [CMU-AIRe/QED-Nano](https://github.com/CMU-AIRe/QED-Nano)
|
||
|
||
### Training script
|
||
|
||
You can reproduce this training using the TRL library:
|
||
|
||
```python
|
||
from datasets import load_dataset
|
||
from transformers import AutoModelForCausalLM, AutoTokenizer
|
||
from trl import SFTConfig, SFTTrainer
|
||
|
||
# Load model and tokenizer
|
||
model = AutoModelForCausalLM.from_pretrained(
|
||
"Qwen/Qwen3-4B-Thinking-2507",
|
||
torch_dtype="auto",
|
||
attn_implementation="kernels-community/vllm-flash-attn3",
|
||
)
|
||
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Thinking-2507")
|
||
|
||
# Load dataset
|
||
dataset = load_dataset("lm-provers/Olympiads-SFT", split="train")
|
||
|
||
# Training configuration
|
||
training_args = SFTConfig(
|
||
output_dir="./qed-nano-sft",
|
||
max_steps=620,
|
||
per_device_train_batch_size=2,
|
||
gradient_accumulation_steps=2,
|
||
learning_rate=3.0e-5,
|
||
lr_scheduler_type="cosine",
|
||
warmup_ratio=0.03,
|
||
max_length=45056,
|
||
bf16=True,
|
||
gradient_checkpointing=True,
|
||
logging_steps=1,
|
||
save_steps=62,
|
||
push_to_hub=True,
|
||
)
|
||
|
||
# Train
|
||
trainer = SFTTrainer(model=model, tokenizer=tokenizer, args=training_args, train_dataset=dataset)
|
||
trainer.train()
|
||
```
|
||
|
||
## Limitations
|
||
|
||
QED-Nano SFT is a supervised fine-tuning checkpoint with several important limitations:
|
||
|
||
### Critical Limitations
|
||
|
||
1. **Length Explosion**: This SFT-only model suffers from uncontrolled reasoning length growth. It frequently generates responses exceeding 100k tokens, often meandering or repeating steps without converging to a solution. This is a known limitation of pure supervised learning on long-form reasoning tasks.
|
||
|
||
2. **Correctness Issues**: While the model produces fluent mathematical text, it may generate incorrect proofs. The SFT stage does not directly optimize for correctness—only for imitating the teacher's reasoning style.
|
||
|
||
3. **No Self-Correction**: Unlike the full QED-Nano model with RL training, this checkpoint lacks the ability to verify and refine its solutions.
|
||
|
||
### Other Limitations
|
||
|
||
- **Domain-specific**: Designed specifically for theorem proving. Using as a general assistant will likely produce nonsense outside of this domain.
|
||
- **Language**: English only
|
||
- **Text-Only**: Cannot handle problems with diagrams or visual elements
|
||
- **Teacher Errors**: May have inherited some errors from the teacher model (DeepSeek-Math-V2)
|
||
|
||
### Recommended Mitigations
|
||
|
||
- Use the full [lm-provers/QED-Nano](https://huggingface.co/lm-provers/QED-Nano) model for production use
|
||
- Apply length penalties or early stopping to control generation length
|
||
- Combine with external verification tools or judges
|
||
- Consider RL fine-tuning to address correctness and length issues
|
||
|
||
These models should be used as assistive tools rather than definitive sources of information. Users should always verify important information and critically evaluate any generated content.
|
||
|
||
## License
|
||
[Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
|
||
|
||
## Citation
|
||
|
||
If you use this model in your research, please cite:
|
||
|
||
```bibtex
|
||
@model{qed_nano_sft_2026,
|
||
title={QED-Nano SFT: Distilling Olympiad Mathematical Reasoning to 4B Parameters},
|
||
author={Edward Beeching and Jasper Dekoninck and Jia Li and Yuxiao Qu and Amrith Setlur and Ian Wu and Aviral Kumar and Lewis Tunstall},
|
||
year={2026},
|
||
publisher={Hugging Face},
|
||
howpublished={\url{https://huggingface.co/lm-provers/QED-Nano-SFT}}
|
||
}
|
||
```
|
||
|
||
Please also cite the training dataset:
|
||
|
||
```bibtex
|
||
@dataset{fineproofs_sft_dataset_2026,
|
||
title={FineProofs SFT Dataset: Mathematical Olympiad Problems with Chain-of-Thought Reasoning},
|
||
author={Edward Beeching and Jasper Dekoninck and Jia Li and Yuxiao Qu and Amrith Setlur and Ian Wu and Aviral Kumar and Lewis Tunstall},
|
||
year={2026},
|
||
publisher={Hugging Face},
|
||
howpublished={\url{https://huggingface.co/datasets/lm-provers/FineProofs-SFT}}
|
||
}
|
||
``` |