Initialize project; model provided by the ModelHub XC community
Model: lihaoxin2020/qwen3-4B-refiner-sft-rl-balanced-resume-step100 Source: Original Platform
README.md (new file, 77 lines)
---
library_name: transformers
base_model: lihaoxin2020/qwen3-4B-instruct-refiner-sft
model_type: qwen3
license: apache-2.0
tags:
- refiner
- grpo
- rl
- qwen3
---

# qwen3-4B-refiner-sft-rl-balanced-step100

This model is a **GRPO-trained** checkpoint (step 100 of a resumed run) of the Qwen3-4B refiner, fine-tuned with reinforcement learning on the refiner RL dataset.

- **Experiment name:** `sft-5ep-offline-balanced-rl_5e-6-answer_only_resume`

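The checkpoint can be loaded with the standard `transformers` API. A minimal sketch (the repo id is taken from this card; the helper function is illustrative, and actually loading the weights requires network access):

```python
MODEL_ID = "lihaoxin2020/qwen3-4B-refiner-sft-rl-balanced-resume-step100"

def load_refiner(model_id: str = MODEL_ID):
    """Return (tokenizer, model); requires `transformers` and network access."""
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )
    return tokenizer, model
```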
## Training Details

- **Base model:** `lihaoxin2020/qwen3-4B-instruct-refiner-sft`
- **Training method:** GRPO (Group Relative Policy Optimization) with DeepSpeed Stage 3
- **Refiner mode:** `answer_only`
- **Training script:** `open_instruct/grpo_fast_refiner_sft.py`

### Hyperparameters

| Parameter | Value |
|---|---|
| **Learning rate** | 5e-6 |
| **LR scheduler** | constant |
| **Beta (KL penalty)** | 0.001 |
| **KL estimator** | kl3 |
| **Advantage normalization** | standard |
| **Samples per prompt (rollout)** | 8 |
| **Unique prompts per rollout** | 32 |
| **Mini batches** | 1 |
| **Epochs per batch** | 1 |
| **Per-device train batch size** | 1 |
| **Temperature** | 1.0 |
| **Seed** | 42 |
| **Async mode** | true |
| **Adam offload** | true |
| **vLLM sync backend** | nccl |

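With 8 samples per prompt and "standard" advantage normalization, GRPO scores each rollout relative to its own group: reward minus the group mean, divided by the group standard deviation. A small sketch of that estimator (illustrative, not the training code; the reward values are made up):

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages with standard (mean/std) normalization."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# One group of 8 rollouts for the same prompt (samples_per_prompt = 8)
adv = group_advantages([10.0, 0.0, 10.0, 0.0, 10.0, 10.0, 0.0, 0.0])
```

Above-average samples in a group get positive advantages and below-average ones negative, so the policy gradient pushes toward the better completions for each prompt without a learned value function.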
### Sequence Lengths

| Parameter | Value |
|---|---|
| Max token length | 8192 |
| Max prompt token length | 6144 |
| Response length | 1024 |
| Pack length | 8192 |

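These budgets are mutually consistent: a maximum-length prompt plus a full response fits within both the max token length and the pack length. A quick check:

```python
# Sequence-length budget from the table above
MAX_TOKEN_LEN = 8192
MAX_PROMPT_LEN = 6144
RESPONSE_LEN = 1024
PACK_LEN = 8192

sample_budget = MAX_PROMPT_LEN + RESPONSE_LEN  # worst-case tokens per sample
assert sample_budget <= MAX_TOKEN_LEN
assert sample_budget <= PACK_LEN
```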
### Reward Configuration

| Parameter | Value |
|---|---|
| Verification reward | 10.0 |
| Non-stop penalty | false |
| Gate judge score with format bonus | false |
| Apply paper citation reward | true |
| Paper citation weight | 0.5 |

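One way the rewards above could combine: a verified answer earns the verification reward (10.0), and a paper-citation score in [0, 1] is added with weight 0.5. The exact combination rule lives in the training script; this additive form is an assumption for illustration only.

```python
VERIFICATION_REWARD = 10.0  # from the table above
CITATION_WEIGHT = 0.5       # paper citation weight

def total_reward(verified: bool, citation_score: float) -> float:
    """Hypothetical additive combination of the configured reward terms."""
    reward = VERIFICATION_REWARD if verified else 0.0
    reward += CITATION_WEIGHT * citation_score  # citation reward is enabled
    return reward
```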
### Dataset

- **Training:** `lihaoxin2020/refiner_rl` (split: train)
- **Evaluation:** `lihaoxin2020/refiner_rl` (16 samples, split: test)

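Both splits come from the same dataset repo and can be pulled with the `datasets` library (a sketch; actually fetching the data requires network access):

```python
def load_refiner_rl(split: str = "train"):
    """Load a split of the RL dataset above; requires `datasets` and network access."""
    from datasets import load_dataset

    return load_dataset("lihaoxin2020/refiner_rl", split=split)
```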
### Infrastructure

- **DeepSpeed stage:** 3
- **Learners per node:** 1
- **vLLM engines:** 1
- **vLLM tensor parallel size:** 1
- **vLLM GPU memory utilization:** 0.90
- **Judge model:** `Qwen/Qwen3.5-35B-A3B` (via vLLM)
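The rollout-engine settings above map onto vLLM's offline `LLM` constructor. A hedged sketch (the parameter names are vLLM's; that the engine serves this checkpoint is an assumption, and building it requires vLLM and a GPU):

```python
def make_rollout_engine():
    """Build a single vLLM engine mirroring the settings above."""
    from vllm import LLM

    return LLM(
        model="lihaoxin2020/qwen3-4B-refiner-sft-rl-balanced-resume-step100",
        tensor_parallel_size=1,        # vLLM tensor parallel size
        gpu_memory_utilization=0.90,   # vLLM GPU memory utilization
    )
```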