140 lines
4.3 KiB
Markdown
140 lines
4.3 KiB
Markdown
---
|
|
license: other
|
|
language:
|
|
- en
|
|
- zh
|
|
pipeline_tag: text-generation
|
|
tags:
|
|
- qwen
|
|
- qwen3
|
|
- math
|
|
- sft
|
|
- supervised-fine-tuning
|
|
- full-finetuning
|
|
- reasoning
|
|
base_model: Qwen/Qwen3-1.7B-Base
|
|
---
|
|
|
|
<h1 align="center">Qwen3-1.7B-SFT</h1>
|
|
|
|
<div align="center" style="line-height: 1;">
|
|
<a href="https://arxiv.org/abs/2604.13016" style="margin: 2px;">
|
|
<img alt="Paper" src="https://img.shields.io/badge/paper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
|
|
</a>
|
|
<a href="https://github.com/thunlp/OPD" style="margin: 2px;">
|
|
<img alt="Github" src="https://img.shields.io/badge/OPD-000000?style=for-the-badge&logo=github&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
|
|
</a>
|
|
<a href="https://huggingface.co/papers/2604.13016" style="margin: 2px;">
|
|
<img alt="HF Papers" src="https://img.shields.io/badge/HF--Paper-%23FFD14D?style=for-the-badge&logo=huggingface&logoColor=black" style="display: inline-block; vertical-align: middle;"/>
|
|
</a>
|
|
<a href="https://x.com/HBX_hbx/status/2044464414829777354" style="margin: 2px;">
|
|
<img alt="Twitter" src="https://img.shields.io/badge/Twitter-%23000000.svg?style=for-the-badge&logo=x&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
|
|
</a>
|
|
</div>
|
|
|
|
<br>
|
|
|
|
Qwen3-1.7B-SFT is a supervised fine-tuned model based on **Qwen3-1.7B-Base**, trained on the **DeepMath-4B** dataset for mathematical reasoning and problem-solving.
|
|
|
|
This model is associated with the paper:
|
|
**Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe**
|
|
Paper link: https://arxiv.org/abs/2604.13016
|
|
|
|
## Model Description
|
|
|
|
This model is obtained by full-parameter supervised fine-tuning (SFT) from `Qwen3-1.7B-Base`.
|
|
The training is designed to improve the model's performance on math-focused instruction-following and reasoning tasks.
|
|
This model is intended in **Section 5.1. Off-Policy Distillation from Teacher Rollouts as Cold Start**.
|
|
|
|
### Key characteristics
|
|
|
|
- **Base model**: Qwen3-1.7B-Base
|
|
- **Training stage**: Supervised Fine-Tuning (SFT)
|
|
- **Finetuning type**: Full finetuning
|
|
- **Primary domain**: Mathematical reasoning
|
|
- **Thinking mode**: Disabled during training (`enable_thinking: false`)
|
|
- **Context length**: 20480 tokens
|
|
|
|
## Training Details
|
|
|
|
### Training configuration
|
|
|
|
- **Framework**: LLaMA-Factory
|
|
- **Stage**: `sft`
|
|
- **Finetuning type**: `full`
|
|
- **DeepSpeed config**: `ds_z2_config.json`
|
|
- **Kernel optimization**: `enable_liger_kernel: True`
|
|
- **Precision**: `bf16`
|
|
- **Gradient checkpointing**: enabled
|
|
- **Learning rate**: `1e-5`
|
|
- **Scheduler**: cosine
|
|
- **Warmup ratio**: `0.1`
|
|
- **Number of epochs**: `2.0`
|
|
- **Per-device train batch size**: `16`
|
|
- **Gradient accumulation steps**: `1`
|
|
- **Validation split**: `0.01`
|
|
- **Evaluation strategy**: every `100` steps
|
|
- **Save strategy**: every `100` steps
|
|
|
|
### Dataset
|
|
|
|
- **Training dataset**: `deep_math_4b`
|
|
|
|
## Training Hyperparameters
|
|
|
|
For reproducibility, the core configuration is summarized below:
|
|
|
|
```
|
|
model_name_or_path: ../model/Qwen3-1.7B-Base
|
|
trust_remote_code: true
|
|
|
|
stage: sft
|
|
do_train: true
|
|
finetuning_type: full
|
|
deepspeed: examples/deepspeed/ds_z2_config.json
|
|
enable_liger_kernel: true
|
|
|
|
dataset: deep_math_4b
|
|
template: qwen3
|
|
enable_thinking: false
|
|
cutoff_len: 20480
|
|
preprocessing_num_workers: 64
|
|
dataloader_num_workers: 32
|
|
|
|
output_dir: ../model/Qwen3-1.7B-Base-SFT-DeepMath-4B
|
|
logging_steps: 5
|
|
save_steps: 100
|
|
plot_loss: true
|
|
overwrite_output_dir: true
|
|
save_only_model: true
|
|
report_to: swanlab
|
|
|
|
per_device_train_batch_size: 16
|
|
gradient_accumulation_steps: 1
|
|
gradient_checkpointing: true
|
|
learning_rate: 1.0e-5
|
|
num_train_epochs: 2.0
|
|
lr_scheduler_type: cosine
|
|
warmup_ratio: 0.1
|
|
bf16: true
|
|
ddp_timeout: 180000000
|
|
|
|
val_size: 0.01
|
|
per_device_eval_batch_size: 1
|
|
eval_strategy: steps
|
|
eval_steps: 100
|
|
```
|
|
|
|
## Citation
|
|
|
|
If you use this model, please consider citing the related paper:
|
|
```
|
|
@article{li2026rethinking,
|
|
title={Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe},
|
|
author={Li, Yaxuan and Zuo, Yuxin and He, Bingxiang and Zhang, Jinqian and Xiao, Chaojun and Qian, Cheng and Yu, Tianyu and Gao, Huan-ang and Yang, Wenkai and Liu, Zhiyuan and Ding, Ning},
|
|
journal={arXiv preprint arXiv:2604.13016},
|
|
year={2026}
|
|
}
|
|
```
|
|
Paper: https://arxiv.org/abs/2604.13016
|