Initialize project; model provided by the ModelHub XC community

Model: lllyx/Qwen3-1.7B-SFT Source: Original Platform
2026-05-10 05:35:48 +08:00
commit 0a5beb0149
14 changed files with 157092 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,139 @@
+---
+license: other
+language:
+- en
+- zh
+pipeline_tag: text-generation
+tags:
+- qwen
+- qwen3
+- math
+- sft
+- supervised-fine-tuning
+- full-finetuning
+- reasoning
+base_model: Qwen/Qwen3-1.7B-Base
+---
+
+<h1 align="center">Qwen3-1.7B-SFT</h1>
+
+<div align="center" style="line-height: 1;">
+  <a href="https://arxiv.org/abs/2604.13016" style="margin: 2px;">
+    <img alt="Paper" src="https://img.shields.io/badge/paper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
+  </a>
+  <a href="https://github.com/thunlp/OPD" style="margin: 2px;">
+    <img alt="Github" src="https://img.shields.io/badge/OPD-000000?style=for-the-badge&logo=github&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
+  </a>
+  <a href="https://huggingface.co/papers/2604.13016" style="margin: 2px;">
+    <img alt="HF Papers" src="https://img.shields.io/badge/HF--Paper-%23FFD14D?style=for-the-badge&logo=huggingface&logoColor=black" style="display: inline-block; vertical-align: middle;"/>
+  </a>
+  <a href="https://x.com/HBX_hbx/status/2044464414829777354" style="margin: 2px;">
+    <img alt="Twitter" src="https://img.shields.io/badge/Twitter-%23000000.svg?style=for-the-badge&logo=x&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
+  </a>
+</div>
+
+<br>
+
+Qwen3-1.7B-SFT is a supervised fine-tuned model based on **Qwen3-1.7B-Base**, trained on the **DeepMath-4B** dataset for mathematical reasoning and problem-solving. 
+
+This model is associated with the paper:  
+**Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe**  
+Paper link: https://arxiv.org/abs/2604.13016
+
+## Model Description
+
+This model is obtained by full-parameter supervised fine-tuning (SFT) from `Qwen3-1.7B-Base`.  
+The training is designed to improve the model's performance on math-focused instruction-following and reasoning tasks.
+This model is used in **Section 5.1. Off-Policy Distillation from Teacher Rollouts as Cold Start** of the paper.
+
+### Key characteristics
+
+- **Base model**: Qwen3-1.7B-Base
+- **Training stage**: Supervised Fine-Tuning (SFT)
+- **Finetuning type**: Full finetuning
+- **Primary domain**: Mathematical reasoning
+- **Thinking mode**: Disabled during training (`enable_thinking: false`)
+- **Context length**: 20480 tokens (the training `cutoff_len`)
+
+## Training Details
+
+### Training configuration
+
+- **Framework**: LLaMA-Factory
+- **Stage**: `sft`
+- **Finetuning type**: `full`
+- **DeepSpeed config**: `ds_z2_config.json`
+- **Kernel optimization**: `enable_liger_kernel: true`
+- **Precision**: `bf16`
+- **Gradient checkpointing**: enabled
+- **Learning rate**: `1e-5`
+- **Scheduler**: cosine
+- **Warmup ratio**: `0.1`
+- **Number of epochs**: `2.0`
+- **Per-device train batch size**: `16`
+- **Gradient accumulation steps**: `1`
+- **Validation split**: `0.01`
+- **Evaluation strategy**: every `100` steps
+- **Save strategy**: every `100` steps
+
+### Dataset
+
+- **Training dataset**: `deep_math_4b`
+
+## Training Hyperparameters
+
+For reproducibility, the core configuration is summarized below:
+
+```yaml
+model_name_or_path: ../model/Qwen3-1.7B-Base
+trust_remote_code: true
+
+stage: sft
+do_train: true
+finetuning_type: full
+deepspeed: examples/deepspeed/ds_z2_config.json
+enable_liger_kernel: true
+
+dataset: deep_math_4b
+template: qwen3
+enable_thinking: false
+cutoff_len: 20480
+preprocessing_num_workers: 64
+dataloader_num_workers: 32
+
+output_dir: ../model/Qwen3-1.7B-Base-SFT-DeepMath-4B
+logging_steps: 5
+save_steps: 100
+plot_loss: true
+overwrite_output_dir: true
+save_only_model: true
+report_to: swanlab
+
+per_device_train_batch_size: 16
+gradient_accumulation_steps: 1
+gradient_checkpointing: true
+learning_rate: 1.0e-5
+num_train_epochs: 2.0
+lr_scheduler_type: cosine
+warmup_ratio: 0.1
+bf16: true
+ddp_timeout: 180000000
+
+val_size: 0.01
+per_device_eval_batch_size: 1
+eval_strategy: steps
+eval_steps: 100
+```
+
+## Citation
+
+If you use this model, please consider citing the related paper:
+
+```bibtex
+@article{li2026rethinking,
+  title={Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe},
+  author={Li, Yaxuan and Zuo, Yuxin and He, Bingxiang and Zhang, Jinqian and Xiao, Chaojun and Qian, Cheng and Yu, Tianyu and Gao, Huan-ang and Yang, Wenkai and Liu, Zhiyuan and Ding, Ning},
+  journal={arXiv preprint arXiv:2604.13016},
+  year={2026}
+}
+```
+Paper: https://arxiv.org/abs/2604.13016
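
The scheduler settings listed in the training configuration (`learning_rate: 1.0e-5`, `lr_scheduler_type: cosine`, `warmup_ratio: 0.1`) and the batch settings (`per_device_train_batch_size: 16`, `gradient_accumulation_steps: 1`) can be illustrated with a small sketch. It assumes the common linear-warmup-then-cosine-decay form; whether the actual run followed exactly this curve is an assumption. The total step count and the device count are not stated in the README, so the values used here (`1000` steps, `8` devices) are hypothetical, as are the helper names `cosine_lr` and `global_batch_size`.

```python
import math


def cosine_lr(step, total_steps, base_lr=1.0e-5, warmup_ratio=0.1):
    """Learning rate at `step`: linear warmup, then cosine decay to zero.

    base_lr and warmup_ratio default to the values in the README config;
    total_steps is NOT given in the README and must be supplied.
    """
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr over the warmup steps.
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


def global_batch_size(num_devices, per_device=16, grad_accum=1):
    """Sequences per optimizer step across all devices.

    per_device and grad_accum default to the README config values;
    num_devices is not stated in the README.
    """
    return per_device * grad_accum * num_devices


if __name__ == "__main__":
    total = 1000  # hypothetical total number of optimizer steps
    for s in (0, 50, 100, 550, 1000):
        print(f"step {s:4d}: lr = {cosine_lr(s, total):.3e}")
    # 8 devices is a hypothetical count used only for illustration.
    print("global batch size:", global_batch_size(num_devices=8))
```

With a warmup ratio of `0.1`, the peak learning rate is reached after the first tenth of training, which matters when comparing checkpoints saved every `100` steps early in a run.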