---
license: other
language:
- en
- zh
pipeline_tag: text-generation
tags:
- qwen
- qwen3
- math
- sft
- supervised-fine-tuning
- full-finetuning
- reasoning
base_model: Qwen/Qwen3-1.7B-Base
---

# Qwen3-1.7B-SFT


Qwen3-1.7B-SFT is a supervised fine-tuned model based on **Qwen3-1.7B-Base**, trained on the **DeepMath-4B** dataset for mathematical reasoning and problem-solving.

This model is associated with the paper:

**Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe**

Paper link: https://arxiv.org/abs/2604.13016

## Model Description

This model is obtained by full-parameter supervised fine-tuning (SFT) of `Qwen3-1.7B-Base`. The training is designed to improve the model's performance on math-focused instruction-following and reasoning tasks.

This model is intended for use in **Section 5.1. Off-Policy Distillation from Teacher Rollouts as Cold Start** of the paper.

### Key characteristics

- **Base model**: Qwen3-1.7B-Base
- **Training stage**: Supervised Fine-Tuning (SFT)
- **Finetuning type**: Full finetuning
- **Primary domain**: Mathematical reasoning
- **Thinking mode**: Disabled during training (`enable_thinking: false`)
- **Context length**: 20480 tokens

## Training Details

### Training configuration

- **Framework**: LLaMA-Factory
- **Stage**: `sft`
- **Finetuning type**: `full`
- **DeepSpeed config**: `ds_z2_config.json`
- **Kernel optimization**: `enable_liger_kernel: true`
- **Precision**: `bf16`
- **Gradient checkpointing**: enabled
- **Learning rate**: `1e-5`
- **Scheduler**: cosine
- **Warmup ratio**: `0.1`
- **Number of epochs**: `2.0`
- **Per-device train batch size**: `16`
- **Gradient accumulation steps**: `1`
- **Validation split**: `0.01`
- **Evaluation strategy**: every `100` steps
- **Save strategy**: every `100` steps

### Dataset

- **Training dataset**: `deep_math_4b`

## Training Hyperparameters

For reproducibility, the core configuration is summarized below:

```
model_name_or_path: ../model/Qwen3-1.7B-Base
trust_remote_code: true
stage: sft
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z2_config.json
enable_liger_kernel: true
dataset: deep_math_4b
template: qwen3
enable_thinking: false
cutoff_len: 20480
preprocessing_num_workers: 64
dataloader_num_workers: 32
output_dir: ../model/Qwen3-1.7B-Base-SFT-DeepMath-4B
logging_steps: 5
save_steps: 100
plot_loss: true
overwrite_output_dir: true
save_only_model: true
report_to: swanlab
per_device_train_batch_size: 16
gradient_accumulation_steps: 1
gradient_checkpointing: true
learning_rate: 1.0e-5
num_train_epochs: 2.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
val_size: 0.01
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 100
```

## Citation

If you use this model, please consider citing the related paper:

```
@article{li2026rethinking,
  title={Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe},
  author={Li, Yaxuan and Zuo, Yuxin and He, Bingxiang and Zhang, Jinqian and Xiao, Chaojun and Qian, Cheng and Yu, Tianyu and Gao, Huan-ang and Yang, Wenkai and Liu, Zhiyuan and Ding, Ning},
  journal={arXiv preprint arXiv:2604.13016},
  year={2026}
}
```

Paper: https://arxiv.org/abs/2604.13016
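
## Appendix: learning-rate schedule sketch

The configuration above combines `learning_rate: 1.0e-5`, `lr_scheduler_type: cosine`, and `warmup_ratio: 0.1`. The sketch below illustrates what such a schedule typically looks like: a linear warmup from zero to the peak rate, followed by a half-cosine decay to zero. It is a plain-Python approximation, not the code used for training. The total step count and GPU count are not stated in this card, so `total_steps` and `num_gpus` are hypothetical inputs, and the rounding of the warmup length (`math.ceil`) is an assumption.

```python
import math

PEAK_LR = 1.0e-5          # learning_rate from the config
WARMUP_RATIO = 0.1        # warmup_ratio from the config
PER_DEVICE_BATCH = 16     # per_device_train_batch_size from the config
GRAD_ACCUM_STEPS = 1      # gradient_accumulation_steps from the config


def lr_at_step(step: int, total_steps: int,
               peak_lr: float = PEAK_LR,
               warmup_ratio: float = WARMUP_RATIO) -> float:
    """Approximate learning rate at an optimizer step.

    Linear warmup to peak_lr, then half-cosine decay to zero.
    The ceil-based warmup length is an assumption of this sketch.
    """
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to peak_lr.
        return peak_lr * step / max(1, warmup_steps)
    # Fraction of the decay phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


def effective_batch_size(num_gpus: int) -> int:
    """Sequences per optimizer step; num_gpus is a hypothetical input."""
    return PER_DEVICE_BATCH * GRAD_ACCUM_STEPS * num_gpus


if __name__ == "__main__":
    total = 1000  # hypothetical total number of optimizer steps
    for s in (0, 50, 100, 550, 1000):
        print(f"step {s:4d}: lr = {lr_at_step(s, total):.2e}")
    print("effective batch size on 8 GPUs:", effective_batch_size(8))
```

With a hypothetical 1000-step run, the rate climbs over roughly the first 100 steps, peaks near `1e-5`, and decays toward zero by the final step. The effective batch size scales linearly with the number of devices because gradient accumulation is set to `1`.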