Files
Qwen3-1.7B-SFT/README.md
ModelHub XC 0a5beb0149 初始化项目,由ModelHub XC社区提供模型
Model: lllyx/Qwen3-1.7B-SFT
Source: Original Platform
2026-05-10 05:35:48 +08:00

4.3 KiB

license, language, pipeline_tag, tags, base_model
license language pipeline_tag tags base_model
other
en
zh
text-generation
qwen
qwen3
math
sft
supervised-fine-tuning
full-finetuning
reasoning
Qwen/Qwen3-1.7B-Base

Qwen3-1.7B-SFT


Qwen3-1.7B-SFT is a supervised fine-tuned model based on Qwen3-1.7B-Base, trained on the DeepMath-4B dataset for mathematical reasoning and problem-solving.

This model is associated with the paper:
Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe
Paper link: https://arxiv.org/abs/2604.13016

Model Description

This model is obtained by full-parameter supervised fine-tuning (SFT) from Qwen3-1.7B-Base.
The training is designed to improve the model's performance on math-focused instruction-following and reasoning tasks. This model is intended in Section 5.1. Off-Policy Distillation from Teacher Rollouts as Cold Start.

Key characteristics

  • Base model: Qwen3-1.7B-Base
  • Training stage: Supervised Fine-Tuning (SFT)
  • Finetuning type: Full finetuning
  • Primary domain: Mathematical reasoning
  • Thinking mode: Disabled during training (enable_thinking: false)
  • Context length: 20480 tokens

Training Details

Training configuration

  • Framework: LLaMA-Factory
  • Stage: sft
  • Finetuning type: full
  • DeepSpeed config: ds_z2_config.json
  • Kernel optimization: enable_liger_kernel: True
  • Precision: bf16
  • Gradient checkpointing: enabled
  • Learning rate: 1e-5
  • Scheduler: cosine
  • Warmup ratio: 0.1
  • Number of epochs: 2.0
  • Per-device train batch size: 16
  • Gradient accumulation steps: 1
  • Validation split: 0.01
  • Evaluation strategy: every 100 steps
  • Save strategy: every 100 steps

Dataset

  • Training dataset: deep_math_4b

Training Hyperparameters

For reproducibility, the core configuration is summarized below:

model_name_or_path: ../model/Qwen3-1.7B-Base
trust_remote_code: true

stage: sft
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z2_config.json
enable_liger_kernel: true

dataset: deep_math_4b
template: qwen3
enable_thinking: false
cutoff_len: 20480
preprocessing_num_workers: 64
dataloader_num_workers: 32

output_dir: ../model/Qwen3-1.7B-Base-SFT-DeepMath-4B
logging_steps: 5
save_steps: 100
plot_loss: true
overwrite_output_dir: true
save_only_model: true
report_to: swanlab

per_device_train_batch_size: 16
gradient_accumulation_steps: 1
gradient_checkpointing: true
learning_rate: 1.0e-5
num_train_epochs: 2.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

val_size: 0.01
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 100

Citation

If you use this model, please consider citing the related paper:

@article{li2026rethinking,
  title={Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe},
  author={Li, Yaxuan and Zuo, Yuxin and He, Bingxiang and Zhang, Jinqian and Xiao, Chaojun and Qian, Cheng and Yu, Tianyu and Gao, Huan-ang and Yang, Wenkai and Liu, Zhiyuan and Ding, Ning},
  journal={arXiv preprint arXiv:2604.13016},
  year={2026}
}

Paper: https://arxiv.org/abs/2604.13016