QED-Nano-SFT/README.md

---
license: apache-2.0
base_model: Qwen/Qwen3-4B-Thinking-2507
tags:
- math
- reasoning
- olympiad
- proof-generation
- sft
- distillation
library_name: transformers
pipeline_tag: text-generation
datasets:
- lm-provers/FineProofs-SFT
---

# QED-Nano SFT

![logo.png](https://huggingface.co/lm-provers/QED-Nano-SFT/resolve/main/logo.png)


##  Table of Contents

1. [Model Summary](#model-summary)
2. [How to use](#how-to-use)
3. [Evaluation](#evaluation)
4. [Training](#training)
5. [Limitations](#limitations)
6. [License](#license)

## Model Summary

**QED-Nano SFT** is a 4B parameter mathematical reasoning model fine-tuned on olympiad-level proof problems. This is the supervised fine-tuning (SFT) checkpoint trained via knowledge distillation from [deepseek-ai/DeepSeek-Math-V2](https://huggingface.co/deepseek-ai/DeepSeek-Math-V2) (685B parameters). The model generates mathematical proofs with explicit chain-of-thought reasoning enclosed in `<think>` tags.

This model serves as the foundation for the full QED-Nano pipeline, which includes subsequent reinforcement learning and reasoning cache extensions. For the complete trained model, see [lm-provers/QED-Nano](https://huggingface.co/lm-provers/QED-Nano).

### Key features
- Reasoning model optimized for **theorem proving**
- **Supervised fine-tuning checkpoint**: Foundation for the full QED-Nano model (SFT + RL + Reasoning Cache)
- **Distilled from DeepSeek-Math-V2 (685B)**: Knowledge distillation to 4B parameters
- **Fully open**: open weights + full training details including public data mixture and training configs

For more details refer to our blog post: https://huggingface.co/spaces/lm-provers/qed-nano-blogpost

## How to use

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "lm-provers/QED-Nano-SFT"
device = "cuda"  # for GPU usage or "cpu" for CPU usage

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
).to(device)

# prepare the model input
prompt = "Generate a rigorous proof to the following question: is \sqrt{2} rational or irrational?"
messages_think = [
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
    messages_think,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate the output
generated_ids = model.generate(**model_inputs, max_new_tokens=32768)

# Get and decode the output
output_ids = generated_ids[0][len(model_inputs.input_ids[0]) :]
print(tokenizer.decode(output_ids, skip_special_tokens=True))
```

>[!TIP]
> We recommend setting `temperature=0.6` and `top_p=0.95` in the sampling parameters.

### vLLM and SGLang

You can use vLLM and SGLang to deploy the model in an API compatible with OpenAI format.

#### SGLang

```bash
python -m sglang.launch_server --model-path lm-provers/QED-Nano-SFT
```

#### vLLM

```bash
vllm serve lm-provers/QED-Nano-SFT
```

## Evaluation

In this section, we report the evaluation results of QED-Nano on IMO-ProofBench, ProofBench, and IMO-AnswerBench. All evaluations except those on IMO-AnswerBench are reported as avg@3 unless stated otherwise.


| Model | IMO-ProofBench | ProofBench | IMO-AnswerBench |
|:---|:---:|:---:|:---:|
| Qwen3-4B-Thinking-2507 | 20.4 (2.6) | 19.5 (0.9) | 55.8 |
| **QED-Nano-SFT** | **39.5 (2.9)** | **33.3 (0.5)** | **57.5** |
| **QED-Nano** | **40.0 (0.6)** | **44.9 (3.4)** | **67.5** |
| **QED-Nano (Agent)** | **54.0 (3.7)** | **54.4 (2.4)** | **-** |
| Qwen3-30B-A3B-Thinking-2507 | 27.6 (1.0) | 26.1 (2.4) | 67.0 |
| Qwen3-235B-A22B-Thinking-2507 | 34.1 (0.7) | 33.7 (1.1) | 70.5 |
| Nomos-1 | 40.3 (3.5) | 28.3 (3.9) | 49.0 |
| GPT-OSS-20B | 38.3 (1.2) | 38.4 (3.9) | 61.5 |
| GPT-OSS-120B | 43.1 (3.2) | 47.5 (1.7) | 70.5 |
| DeepSeek-Math-V2 | 57.9 (2.0) | 60.6 (0.1) | 75.8 |
| Gemini 3 Pro | 58.7 (2.9) | 66.7 (3.1) | 83.2 |

**Note**: This is the SFT-only checkpoint. Performance improves significantly with RL training and reasoning cache. See the full [lm-provers/QED-Nano](https://huggingface.co/lm-provers/QED-Nano) model for complete results.

## Training

### Model

- **Architecture:** Transformer decoder
- **Precision:** bfloat16
- **Base Model**: [Qwen/Qwen3-4B-Thinking-2507](https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507)
- **Parameters**: 4 billion
- **Training Data**: [lm-provers/FineProofs-SFT](https://huggingface.co/datasets/lm-provers/FineProofs-SFT) (4,300 unique problems)
- **Teacher Model**: [deepseek-ai/DeepSeek-Math-V2](https://huggingface.co/deepseek-ai/DeepSeek-Math-V2) (685B)

### Training Hyperparameters

- **Optimization Steps**: 620
- **Global Batch Size**: 32
- **Learning Rate Schedule**: Cosine with peak at 3×10⁻⁵
- **Max Sequence Length**: 45,056 tokens
- **Training Configuration**: Unique problems only (4.3k samples outperformed 7.7k with duplicates)

### Software & hardware

- **GPUs:** 1 node with 8 H100
- **Training time:** 5 hours
- **Training Framework:** [TRL (Transformer Reinforcement Learning)](https://github.com/huggingface/trl)
- **Evaluation framework:** [CMU-AIRe/QED-Nano](https://github.com/CMU-AIRe/QED-Nano)

### Training script

You can reproduce this training using the TRL library:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-4B-Thinking-2507",
    torch_dtype="auto",
    attn_implementation="kernels-community/vllm-flash-attn3",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Thinking-2507")

# Load dataset
dataset = load_dataset("lm-provers/Olympiads-SFT", split="train")

# Training configuration
training_args = SFTConfig(
    output_dir="./qed-nano-sft",
    max_steps=620,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    learning_rate=3.0e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    max_length=45056,
    bf16=True,
    gradient_checkpointing=True,
    logging_steps=1,
    save_steps=62,
    push_to_hub=True,
)

# Train
trainer = SFTTrainer(model=model, tokenizer=tokenizer, args=training_args, train_dataset=dataset)
trainer.train()
```

## Limitations

QED-Nano SFT is a supervised fine-tuning checkpoint with several important limitations:

### Critical Limitations

1. **Length Explosion**: This SFT-only model suffers from uncontrolled reasoning length growth. It frequently generates responses exceeding 100k tokens, often meandering or repeating steps without converging to a solution. This is a known limitation of pure supervised learning on long-form reasoning tasks.

2. **Correctness Issues**: While the model produces fluent mathematical text, it may generate incorrect proofs. The SFT stage does not directly optimize for correctness—only for imitating the teacher's reasoning style.

3. **No Self-Correction**: Unlike the full QED-Nano model with RL training, this checkpoint lacks the ability to verify and refine its solutions.

### Other Limitations

- **Domain-specific**: Designed specifically for theorem proving. Using as a general assistant will likely produce nonsense outside of this domain.
- **Language**: English only
- **Text-Only**: Cannot handle problems with diagrams or visual elements
- **Teacher Errors**: May have inherited some errors from the teacher model (DeepSeek-Math-V2)

### Recommended Mitigations

- Use the full [lm-provers/QED-Nano](https://huggingface.co/lm-provers/QED-Nano) model for production use
- Apply length penalties or early stopping to control generation length
- Combine with external verification tools or judges
- Consider RL fine-tuning to address correctness and length issues

These models should be used as assistive tools rather than definitive sources of information. Users should always verify important information and critically evaluate any generated content.

## License
[Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)

## Citation

If you use this model in your research, please cite:

```bibtex
@model{qed_nano_sft_2026,
  title={QED-Nano SFT: Distilling Olympiad Mathematical Reasoning to 4B Parameters},
  author={Edward Beeching and Jasper Dekoninck and Jia Li and Yuxiao Qu and Amrith Setlur and Ian Wu and Aviral Kumar and Lewis Tunstall},
  year={2026},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/lm-provers/QED-Nano-SFT}}
}
```

Please also cite the training dataset:

```bibtex
@dataset{fineproofs_sft_dataset_2026,
  title={FineProofs SFT Dataset: Mathematical Olympiad Problems with Chain-of-Thought Reasoning},
  author={Edward Beeching and Jasper Dekoninck and Jia Li and Yuxiao Qu and Amrith Setlur and Ian Wu and Aviral Kumar and Lewis Tunstall},
  year={2026},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/datasets/lm-provers/FineProofs-SFT}}
}
```